10 Shocking Revelations About OpenAI's Copyright Violations

OpenAI Researcher Highlights How OpenAI Violated Copyright Law

The New York Times reports that Suchir Balaji has left OpenAI after four years as an artificial intelligence researcher at the company. He was instrumental in helping OpenAI hoover up enormous amounts of data, scraping the web for the knowledge used to build its large language models (LLMs).

Balaji told The New York Times that while working for OpenAI, he did not consider whether the company had a legal right to build its products by scraping data from other sources. He assumed that any data published and freely available on the internet was up for grabs, whether it was copyrighted or not. So pirate sites that archive copyrighted books, paywalled news sites, and even Reddit posts were fair game for the massive data machine.

Realization of Copyright Violation

Balaji says that in 2022 he began thinking harder about the company's approach to data collection. He concluded that the way OpenAI gathered data violated copyright law and that technology like ChatGPT was damaging the internet as a whole. In August 2024, Balaji left the company because he believed OpenAI would do society more harm than good.

Sharing Concerns

Earlier this week, Balaji published an essay on his website detailing his concerns about the future of OpenAI. He believes that the way AI companies gather data does not qualify as the ‘fair use’ that companies like OpenAI and Anthropic claim, and he says regulating AI is the only way out of this mess.

“While generative models rarely produce outputs that are substantially similar to any of their training inputs, the process of training a generative model involves making copies of copyrighted data,” Balaji writes. “If these copies are unauthorized, this could potentially be considered copyright infringement, depending on whether or not the specific use of the model qualifies as ‘fair use.’”

“Because ‘fair use’ is determined on a case-by-case basis, no broad statement can be made about when generative AI qualifies for fair use.” Balaji points to traffic drops at major sites like Stack Overflow as a sign of how generative AI could damage the internet: new users now put their questions to generative AI models rather than to the human help resource those models were trained on.

While OpenAI has struck licensing agreements with several newspapers, it still faces lawsuits from authors who say they did not consent to an LLM being trained on their copyrighted works.