Apple Intelligence models were trained in a ‘responsible’ manner, according to the company

Apple has released a technical paper outlining the development of the models behind Apple Intelligence, a suite of generative AI features set to be integrated into iOS, macOS, and iPadOS in the coming months. The paper addresses ethical concerns about the training data, asserting that no private user data was used. Instead, Apple says it relied on licensed data, curated public datasets, and information gathered by its web crawler, Applebot.

The emphasis on responsible sourcing follows a controversy in which Proof News reported that Apple had used a dataset called The Pile, which includes subtitles from YouTube videos, to train models without the creators’ consent. Apple later stated that those models would not be used in its products.

According to the paper, the Apple Foundation Models (AFM) were trained on open-source code from GitHub, licensed data from publishers, and publicly accessible online data. Apple reportedly made deals with major publishers like NBC and Condé Nast to use their news archives for training, and it filtered the open-source code to include only repositories with permissive licenses such as MIT, ISC, or Apache.
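To make the license filter concrete, here is a minimal Python sketch of an allowlist check. It is an illustration only, not Apple's pipeline: the repository records, the `keep_permissive` helper, and the use of SPDX-style identifiers such as `Apache-2.0` are all assumptions made for the example.

```python
# Hypothetical allowlist: SPDX-style identifiers for the licenses named in the article.
PERMISSIVE_LICENSES = {"MIT", "ISC", "Apache-2.0"}


def keep_permissive(repos):
    """Return only the repositories whose declared license is on the allowlist."""
    return [repo for repo in repos if repo.get("license") in PERMISSIVE_LICENSES]


# Made-up repository metadata, purely for illustration.
repos = [
    {"name": "example/alpha", "license": "MIT"},
    {"name": "example/beta", "license": "GPL-3.0-only"},
    {"name": "example/gamma", "license": "Apache-2.0"},
    {"name": "example/delta", "license": None},  # no declared license
]

kept = keep_permissive(repos)
kept_names = [repo["name"] for repo in kept]
print(kept_names)
```

In this sketch, repositories with a copyleft license or no declared license are dropped, and only the MIT and Apache entries remain.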

The training dataset for the AFM models comprised around 6.3 trillion tokens, less than half of the 15 trillion tokens Meta used for its Llama 3.1 model. Apple also used human feedback and synthetic data to refine these models, aiming to reduce harmful outputs.

Apple’s approach to sourcing data, especially from publicly available web content, highlights ongoing debates about the ethics and legality of such practices. While some companies claim that scraping public data is protected under fair use, this remains a contentious issue subject to increasing legal scrutiny. Apple allows webmasters to block its crawler, but individuals who publish through third-party platforms cannot opt out themselves and are left exposed if those platforms do not. The future of generative AI models and their training methods may ultimately be determined by the courts, as Apple aims to present itself as an ethical participant in this evolving landscape.
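Crawler opt-outs of this kind are typically expressed in a site's robots.txt file. The sketch below uses Python's standard `urllib.robotparser` to show how a rule aimed at the `Applebot` user agent would be interpreted. The robots.txt content and the example.com URLs are invented for illustration, and the snippet shows only how such rules are read, not how Applebot itself behaves.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block Applebot from /private/, allow everything else.
ROBOTS_TXT = """\
User-agent: Applebot
Disallow: /private/

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# The Applebot rule applies only to paths under /private/.
applebot_private = parser.can_fetch("Applebot", "https://example.com/private/page")
applebot_public = parser.can_fetch("Applebot", "https://example.com/public/page")
# Other crawlers fall through to the wildcard entry.
other_private = parser.can_fetch("OtherBot", "https://example.com/private/page")

print(applebot_private, applebot_public, other_private)
```

The rule lives on the site, not with the author: someone posting on a platform they do not control has no robots.txt of their own to edit, which is the gap described above.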