10 Legal Questions Arising from Access to AI Training Data

Published On Sat Jun 21 2025

Bad Vibes: Access to AI Training Data Sparks Legal Questions

As “vibe coding” goes mainstream, AI companies are rushing to build the biggest and most authoritative tech knowledge bases to train the next generation of AI copilots. But how will AI companies obtain these curated troves of valuable tech data? Recent moves by Stack Overflow and Reddit show how it might play out.

The Rise of Vibe Coding

Vibe coding, or telling a coding copilot what you want and then sitting back while the AI generates code for you, is all the rage today. Searches for “vibe coding” are up 6,700% over the past 12 months, and even renowned technologists like Databricks CEO Ali Ghodsi rely on it.

“You’d even hear Ali himself tell you these days, ‘Look, I just mostly ask [Databricks] Assistant for what I need,’” said Databricks VP of Marketing Joel Minnick. “If the first attempt at the code doesn’t work, I just kind of give it the error code and tell it ‘try again,’ and it tries again, and now it’s right.”

Access to Tech Data

The combination of huge swaths of sample code and the incredible learning power of large language models (LLMs) gives coding copilots their capabilities. What’s more, when questions arise over some technical topic, the Web’s vast array of discussion boards provides ample fodder for copilots to get even the small details correct.

Legal Battles Over Data Access

Reddit filed a lawsuit against Anthropic, accusing the AI company of scraping its website for content to train its AI models, in violation of its data policy. Reddit claims that Anthropic has accessed its platform more than 100,000 times since July 2024 to collect user-generated content for AI training, which it says also breaches Reddit’s terms of service.

Another popular source for technical content is Stack Overflow, which has a vast knowledge base focused on technical topics. Stack Overflow recently signed a deal with Snowflake to make its user-generated data available via the Snowflake Marketplace.

Enforcing Data Usage Policies

Stack Overflow has taken steps to prevent its data from being scraped for AI purposes and to authenticate that users are human. It also has a strict policy against allowing AI-generated answers on the site, emphasizing the importance of human curation.

The message to AI model builders and users is clear: If high-quality, human-sourced data is important to your endeavor, then you should be willing to pay the provider a fair sum while ensuring that user privacy is maintained at all times.

Unauthorized access to valuable data will not be tolerated, as seen in the legal battles between tech companies and platforms. As the landscape of AI training data evolves, companies are being forced to reevaluate their data usage policies to protect user privacy and ensure fair and legal access to valuable information.