The Legal Battle: OpenAI vs. Data Owners

OpenAI trains ChatGPT on publicly available data, including books and articles from the internet. Now the owners of that data want to be paid for their work.

Synthetic data: A safer, smarter solution for training AI?

Training data is essential to building the AI models that are taking over the tech world. Leading technology companies like Google, Meta, OpenAI, Anthropic, and Microsoft are all looking for new data sources. At one point, Meta even considered buying Simon & Schuster, one of the largest publishing houses in the world.

Legal Battles Over Copyrighted Data

Part of the problem is that publishers increasingly accuse these companies of hoovering up copyrighted data, and they want to be paid for their work. Meta and OpenAI have argued in comments to the U.S. Copyright Office that publishing copyrighted material online makes it “publicly available,” and that using it therefore falls under fair use.

However, OpenAI will still have to make this argument in court, because the company is facing lawsuits from several groups over its use of copyrighted material.

The Center for Investigative Reporting (CIR), a nonprofit news organization that merged with Mother Jones and Reveal earlier this year, filed suit against OpenAI and Microsoft in federal court last week. The suit accuses OpenAI of being “built on the exploitation of copyrighted works of creators around the world, including CIR.”

CIR’s lawyers accused OpenAI and Microsoft of using copyrighted material from Mother Jones to train their GPT and Copilot AI models.

Authors' Allegations

In a separate class-action lawsuit filed by the Authors Guild, two authors claimed that the company used material from their books to train ChatGPT. The New York Times filed a similar lawsuit against the company in December 2023.

In May, court documents in the Authors Guild lawsuit revealed that OpenAI had deleted two massive datasets used to train GPT-3. The Guild’s lawyers said the two datasets likely contained “more than 100,000 published books.”

The two employees responsible for compiling the data no longer work for OpenAI, court documents say.

Exploring Solutions

OpenAI has begun signing licensing agreements with news organizations to ensure fair use of their work. But the volume of content needed to keep these bots learning continuously will require far more than a handful of licensing agreements. One solution is synthetic data, which is generated artificially by machine learning algorithms rather than collected from the real world.

OpenAI has considered synthetic data as an option for training its models but expressed concerns about producing high-quality data.

CEO Sam Altman said at a technology conference in May 2023, “As long as you can cross the synthetic data event horizon, where the model is smart enough to create good synthetic data, you’ll be fine.” The company has also explored a process in which AI models work together: one AI system produces data while another evaluates it.
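The produce-and-evaluate process described above can be illustrated with a minimal Python sketch. This is not OpenAI's method, whose details the article does not give. Every name here (`generate_candidates`, `score_sample`, `build_synthetic_dataset`) is hypothetical, and the "generator" and "evaluator" are toy stand-ins for what would in practice be two separate AI models.

```python
import random
from typing import Callable, List

# Hypothetical generator: a toy stand-in for a model that writes synthetic samples.
# Some outputs are deliberately incomplete so the evaluator has something to reject.
def generate_candidates(rng: random.Random, n: int) -> List[str]:
    subjects = ["The model", "A dataset", "This method"]
    verbs = ["improves", "reduces", "measures"]
    objects = ["accuracy", "training cost", "bias", ""]  # "" yields a broken sentence
    return [
        f"{rng.choice(subjects)} {rng.choice(verbs)} {rng.choice(objects)}".strip()
        for _ in range(n)
    ]

# Hypothetical evaluator: a toy stand-in for a second model that grades quality.
# Here a sample scores 1.0 only if it has at least four words (i.e. it has an object).
def score_sample(text: str) -> float:
    return 1.0 if len(text.split()) >= 4 else 0.0

def build_synthetic_dataset(
    generate: Callable[[random.Random, int], List[str]],
    score: Callable[[str], float],
    target_size: int,
    threshold: float = 0.5,
    batch_size: int = 8,
    max_rounds: int = 100,
    seed: int = 0,
) -> List[str]:
    """Keep generating batches and retain only samples the evaluator accepts."""
    rng = random.Random(seed)
    kept: List[str] = []
    for _ in range(max_rounds):
        if len(kept) >= target_size:
            break
        for sample in generate(rng, batch_size):
            if score(sample) >= threshold and len(kept) < target_size:
                kept.append(sample)
    return kept

if __name__ == "__main__":
    for line in build_synthetic_dataset(generate_candidates, score_sample, target_size=5):
        print(line)
```

The quality concern mentioned above shows up in the `threshold` parameter: the dataset can only be as good as the evaluator's ability to tell strong samples from weak ones.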

OpenAI did not immediately respond to Business Insider’s request for comment.
