Unveiling Meta's Secret Web Crawler: The AI Data Collector

Published On Wed Aug 21 2024

Reports: A new web crawler launched by Meta last month is quietly ...

Earlier this year, Zuckerberg boasted on an earnings call that his company's social platforms had amassed a data set for AI training that was even "greater than the Common Crawl", an entity that has scraped roughly 3 billion web pages each month since 2011.

Meta has quietly unleashed a new web crawler to scour the Internet and collect data en masse to feed its AI models. The crawler, named the Meta External Agent, was launched last month, according to three firms that track web scrapers and bots across the web. The automated bot essentially copies, or "scrapes", all the data that is publicly displayed on websites — for example, the text in news articles or the conversations in online discussion groups.


Unveiling the Meta External Agent

A representative of Dark Visitors, which offers a tool for website owners to automatically block all known scraper bots, said Meta External Agent is analogous to OpenAI’s GPTBot, which scrapes the web for AI training data. Two other entities involved in tracking web scrapers confirmed the bot’s existence and its use for gathering AI training data.

Meta, the parent company of Facebook, Instagram, and WhatsApp, updated a corporate website for developers with a tab disclosing the existence of the new scraper in late July, according to a version history found using the Internet Archive. Beyond updating that page, Meta has not publicly announced the new crawler.

A Meta spokesman said the company has had a crawler under a different name “for years,” although this crawler – dubbed Facebook External Hit – “has been used for different purposes over time, like sharing link previews.”

Controversy Surrounding Web Data Scraping for AI Models

Scraping web data to train AI models is a controversial practice that has led to numerous lawsuits by artists, writers, and others, who say AI companies used their content and intellectual property without their consent. Some AI companies like OpenAI and Perplexity have struck deals in recent months that pay content providers for access to their data. Fortune was among several news providers that announced a revenue-sharing deal with Perplexity in July.


While close to 25% of the world’s most popular websites now block GPTBot, only 2% are blocking Meta’s new bot, data from Dark Visitors shows.

The Role of Robots.txt and Data Collection

For a website to attempt to block a web scraper, it must deploy robots.txt — a plain-text file placed at the root of the site — listing rules that signal to a scraper bot that it should ignore that site's content. Compliance is voluntary: the file is a request, not an enforcement mechanism, and a bot can simply ignore it.
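As a concrete illustration, a site that wanted to opt out of both crawlers mentioned in this article could serve a robots.txt along these lines (a minimal sketch — the `meta-externalagent` and `GPTBot` user-agent tokens are the names the two companies list in their developer documentation, and honoring the rules is voluntary on the crawler's part):

```
# robots.txt — a plain-text file served at https://example.com/robots.txt
# (example.com is a placeholder domain)

# Ask Meta's AI-training crawler to skip the entire site
User-agent: meta-externalagent
Disallow: /

# OpenAI's equivalent crawler, for comparison
User-agent: GPTBot
Disallow: /

# All other bots may crawl everything
User-agent: *
Disallow:
```

An empty `Disallow:` line permits crawling, while `Disallow: /` requests that the named bot stay away from every path on the site.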

Such scrapers are used to pull mass amounts of data and written text from the web, to be used as training data for generative AI models, also referred to as large language models or LLMs, and related tools. Meta’s Llama is one of the largest LLMs available, and it powers things like Meta AI, an AI chat bot that now appears on various Meta platforms.


Implications for Meta's AI Development

The existence of the new crawler suggests, however, that Meta's vast trove of data may no longer be enough as the company continues to work on updating Llama and expanding Meta AI. LLMs typically need fresh, high-quality training data to keep improving. Meta is on track to spend up to US$40 billion this year, mostly on AI infrastructure and related costs.

Source: Fortune.com/The New York Times