10 Secrets Revealed in Meta's Race Against OpenAI

Published on Wed, Jan 15, 2025


A major copyright lawsuit against Meta has revealed a trove of internal communications about the company’s plans to develop its Llama open-source AI models, including discussions about avoiding “media coverage suggesting we have used a dataset we know to be pirated.” The messages, part of a series of exhibits unsealed by a California court, suggest Meta used copyrighted data to train its AI systems and worked to conceal it as it raced to beat rivals like OpenAI and Mistral. Portions of the messages were first revealed last week.

Goal to Build GPT-4

In an October 2023 email to Meta AI researcher Hugo Touvron, Ahmad Al-Dahle, Meta’s vice president of generative AI, wrote that the company’s goal “needs to be GPT4,” referring to the large language model OpenAI announced in March 2023. Meta had “to learn how to build frontier and win this race,” Al-Dahle added. Those plans apparently involved using the book piracy site Library Genesis (LibGen) to train its AI systems.

Use of LibGen

An undated email from Meta director of product Sony Theakanath, sent to VP of AI research Joelle Pineau, weighed whether to use LibGen only internally, to use it for benchmarks cited in a blog post, or to create a model trained on the site. In the email, Theakanath writes that “GenAI has been approved to use LibGen for Llama3... with a number of agreed upon mitigations” after escalating it to “MZ” — presumably Meta CEO Mark Zuckerberg.

As noted in the email, Theakanath believed “Libgen is essential to meet SOTA [state-of-the-art] numbers,” adding “it is known that OpenAI and Mistral are using the library for their models (through word of mouth).” Mistral and OpenAI haven’t stated whether they use LibGen. (The Verge reached out to both for more information.)

Legal Disputes

The court documents stem from a class action lawsuit that author Richard Kadrey, comedian Sarah Silverman, and others filed against Meta, accusing it of using illegally obtained copyrighted content to train its AI models in violation of intellectual property laws. Meta, like other AI companies, has argued that using copyrighted material in training data should constitute legal fair use.

Challenges in Data Acquisition

Last June, The New York Times reported on the frantic race inside Meta after ChatGPT’s debut, revealing the company had hit a wall: it had used up almost every available English book, article, and poem it could find online. Desperate for more data, executives reportedly discussed buying Simon & Schuster outright and considered hiring contractors in Africa to summarize books without permission.


Data Scarcity

It’s been reported that frontier labs like OpenAI and Anthropic have hit a data wall, meaning they don’t have enough new data to train their large language models. Many AI leaders have denied this. OpenAI CEO Sam Altman said plainly: “There is no wall.”

This scarcity has pushed labs toward unusual new ways of sourcing unique data. Bloomberg reported that frontier labs like OpenAI and Google have been paying digital content creators between $1 and $4 per minute for their unused video footage through a third party in order to train LLMs (both companies have competing AI video generation products).


With companies like Meta and OpenAI hoping to grow their AI systems as fast as possible, things are bound to get a bit messy. Though a judge partially dismissed Kadrey and Silverman’s class action lawsuit last year, the evidence outlined here could strengthen parts of their case as it moves forward in court.