New Study Suggests OpenAI Models Memorized Copyrighted Material
Concerns Around OpenAI's Use of Copyrighted Content
A recent study has raised fresh concerns about how OpenAI trains its AI models. The study suggests that OpenAI may have used copyrighted material without proper authorization, a practice that has already drawn legal challenges from authors, developers, and other rights holders.
OpenAI is currently facing multiple lawsuits alleging that the company used copyrighted works, including books and code, to develop its models. While OpenAI asserts that its actions fall within the realm of fair use, critics argue that existing U.S. copyright laws do not explicitly permit the use of copyrighted content for training AI models.
Research on Detecting Memorization of Training Data
A collaborative effort by research teams from the University of Washington, Stanford, and the University of Copenhagen has introduced a novel method to identify instances where AI models have "memorized" specific segments of their training data, potentially infringing on copyright laws.
AI models are typically designed to generate new content based on learned patterns rather than replicate existing data verbatim. However, there have been cases where models recreate entire copyrighted segments, such as film scenes or published articles.
The recent study focused on detecting "high-surprisal" words: words that a language model rates as statistically unlikely given their surrounding context, making them hard to guess without having seen the original text. By removing these words from excerpts of fiction and New York Times articles and prompting OpenAI models (including GPT-4 and GPT-3.5) to predict the missing words, the researchers could assess whether the models had memorized the original content: a correct guess of a hard-to-predict word suggests the model has seen that exact passage before.
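The masking-and-prediction idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the researchers' actual code: the per-word probabilities would in practice come from a reference language model, and `predict_masked`-style model calls are assumed rather than shown. Surprisal here is the standard information-theoretic quantity, the negative log probability of a word in context.

```python
import math

def surprisal(prob: float) -> float:
    """Surprisal of a word: negative log2 of its probability in context."""
    return -math.log2(prob)

def mask_high_surprisal(words, probs, threshold=8.0):
    """Replace words whose surprisal exceeds the threshold with [MASK].

    `probs` are hypothetical per-word probabilities from a reference model.
    Returns the masked word list and a dict of {position: hidden word}.
    """
    masked, targets = [], {}
    for i, (word, p) in enumerate(zip(words, probs)):
        if surprisal(p) > threshold:
            masked.append("[MASK]")
            targets[i] = word
        else:
            masked.append(word)
    return masked, targets

def memorization_score(targets, guesses):
    """Fraction of masked high-surprisal words the model guessed exactly.

    A high score on a passage is evidence the model memorized it, since
    high-surprisal words are hard to recover from context alone.
    """
    if not targets:
        return 0.0
    hits = sum(
        1 for i, word in targets.items()
        if guesses.get(i, "").lower() == word.lower()
    )
    return hits / len(targets)

# Example: "obsidian" is the only word rare enough in context to be masked.
words = ["the", "ship", "glided", "past", "the", "obsidian", "breakwater"]
probs = [0.5, 0.1, 0.02, 0.3, 0.5, 0.0005, 0.01]
masked, targets = mask_high_surprisal(words, probs)
# targets == {5: "obsidian"}; a model's guesses would then be scored:
score = memorization_score(targets, {5: "obsidian"})
```

In the study's framing, scores near 1.0 on excerpts the model could not plausibly infer from context would point toward memorization of the training text rather than generalization.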
Findings from the study indicated that GPT-4 exhibited signs of memorizing portions of copyrighted fiction books, particularly from a dataset known as BookMIA, which contains samples of e-books. Additionally, traces of memorization were observed in New York Times articles, albeit less frequently.
Importance of Transparency in AI Development
Abhilasha Ravichander, a Ph.D. student at the University of Washington and a study co-author, emphasized the significance of transparency in AI model development. She stressed the need for tools that enable thorough examination and auditing of these models to ensure their reliability.
Ravichander stated, "To build trustworthy language models, we need tools that let us examine and audit them scientifically. Our research offers one such tool, but the broader issue is a lack of transparency around what data these models are trained on."
Although OpenAI has entered into licensing agreements and offers opt-out options for rights holders whose work appears in training datasets, the company continues to push for more lenient rules on using copyrighted content to train AI models, and has lobbied governments to recognize AI training as fair use.