Stack Overflow Implements New Strategy, Charging AI Giants for Training Data
Developing large-scale AI projects such as ChatGPT and DALL-E costs hundreds of millions of dollars, and it’s about to get even more expensive. Traditionally, companies including OpenAI and Google have paid nothing for much of their training data, simply scraping it from the web. However, Stack Overflow, the popular internet forum for computer programming help, has announced plans to begin charging large AI developers as soon as the middle of this year for access to the 50 million questions and answers on its service. The platform has more than 20 million registered users.
Using large language models (LLMs) to generate programming code is considered one of the biggest opportunities in the technology sector, with Microsoft charging as much as $19 a month per person for its code generator GitHub Copilot. As a result, community platforms such as Stack Overflow and Reddit believe they should be compensated for the contributions their content makes to the development of high-quality LLMs.
Stack Overflow CEO Prashanth Chandrasekar describes the potential additional revenue as vital to ensuring that Stack Overflow can keep attracting users and maintain high-quality information. He argues that this will also help future chatbots, which need “to be trained on something that's progressing knowledge forward. They need new knowledge to be created.”
Often, data sets used in AI development are built through unofficial means, such as dispatching software that scrapes content from websites. While this is typically considered legal in the US, copyright issues and websites’ terms of use against the practice have left it in dispute. A few websites, such as Reddit and Stack Overflow, have been more inviting. They offer downloadable “data dumps” or real-time data portals, known as APIs, that help software access their content. In Stack Overflow’s case, LLM developers are getting their hands on data through a mix of dumps, APIs, and scraping, Chandrasekar says, all of which today can be done for free.
However, Chandrasekar says that LLM developers are violating Stack Overflow’s terms of service. While users own the content they post on Stack Overflow, it all falls under a Creative Commons license that requires anyone later using the data to mention where it came from. When AI companies sell their models to customers, they “are unable to attribute each and every one of the community members whose questions and answers were used to train the model, thereby breaching the Creative Commons license,” Chandrasekar says.
Stack Overflow’s decision to seek compensation from companies tapping its data is part of a broader generative AI strategy. It comes as the News/Media Alliance, a US trade group of publishers including Condé Nast, has released principles calling on generative AI developers to negotiate any use of publishers’ data for training and other purposes and to respect publishers’ right to fair compensation.
With expectations that ChatGPT-style bots and other products built on LLMs will reap huge profits, other companies with stocks of content needed to train machine learning algorithms are also looking to be paid. Some news publishers have been wary of how Microsoft’s new Bing chatbot handles their content. So far, however, only a few public deals over access to training data have been announced, such as photo bank Shutterstock agreeing to provide data for a fee. Stack Overflow and Reddit, for their part, plan to keep their data available for free to some people and companies. Chandrasekar says Stack Overflow only wants remuneration from companies developing LLMs for big, commercial purposes. “When people start charging for products that are built on community-built sites like ours, that's where it's not fair use,” he says.