The Secrets Behind Meta's AI Chatbot Training Revealed

Meta's AI Chatbot Says It Was Trained on Millions of YouTube Transcriptions

The Meta AI chatbot is more willing to share what data it was trained on than Meta is. Meta, formerly known as Facebook, first released Meta AI last year amid its big push into the generative AI space to keep up with a wave of public interest prompted in late 2022 by the release of OpenAI's ChatGPT. It expanded Meta AI in April as a chat and image generator function across all its apps, including Instagram and WhatsApp. Still, it hasn't disclosed much about how its chatbot was trained.

Training Data

When Business Insider asked Meta AI a series of questions about what data it was trained on and how Meta obtained such data, the chatbot gave some interesting answers. Meta AI told Business Insider that it was trained on large datasets of transcriptions from YouTube videos. In addition, it said Meta has its own web scraper bot called "MSAE," an acronym for Meta Scraping and Extraction, which it said scrapes large amounts of data from the web to train AI models. Meta had not disclosed the existence of this scraper before. YouTube's terms of service prohibit the use of bots and scrapers to collect its data, and the use of such data without its permission, something OpenAI has recently come under scrutiny for purportedly doing.

Meta's Response

BI provided Meta with the prompts it used. A Meta spokesman did not deny any of Meta AI's answers about its scraper or training data. Instead, the spokesman suggested that Meta AI could be incorrect. "As with all generative AI systems, the models could return inaccurate or inappropriate outputs," the spokesman said. "We'll continue to improve these features as they evolve, and more people share their feedback." The spokesman also noted, "Like others in the industry, we use web crawlers," without naming the MSAE bot that Meta AI cited.

Additional Training Data

"Generative AI models take a large amount of data to effectively train, so a combination of sources are used for training, including information that's publicly available online and annotated data," the spokesperson added. Meta AI initially said its training data included a third-party dataset of 3.7 million transcribed YouTube videos. The chatbot specified that it "did not use its web scraper bot to scrape YouTube videos directly." In responding to further queries about its YouTube training data, Meta AI said its training data included another, larger dataset of transcriptions from 6 million YouTube videos also compiled by a third party.

It added that its training data includes two additional sets of YouTube transcriptions or subtitles, one with 1.5 million videos and another with 2.5 million videos, as well as a dataset of transcriptions from 2,500 TED Talks posted to YouTube. These datasets were all compiled by third parties, according to Meta AI. Meta's chatbot said it "takes steps to avoid collecting copyrighted data."

Web Scraping and Training

Results for several queries cited sources like NBC News, CNN, and The Financial Times. Meta AI often did not include sources for its responses, unless specifically asked to do so. Meta is currently considering new paid deals with media publishers to gain access to more AI training data, as BI reported, which could improve Meta AI's results.

Content scraped from the web is stored in massive datasets that are fed into large language models (LLMs) and often regurgitated by generative AI tools like ChatGPT. Several ongoing lawsuits concern owned and copyrighted content being freely absorbed by the world's biggest tech companies. The US Copyright Office is expected to release new guidance on acceptable use for AI companies later this year.

Axel Springer, Business Insider's parent company, has a global deal to allow OpenAI to train its models on its media brands' reporting. On February 28, Axel Springer joined 31 other media groups in filing a $2.3 billion suit against Google in Dutch court, alleging losses suffered due to the company's advertising practices.