What happens when AI, like ChatGPT, is trained on its own data?
In large language model collapse, there are generally three sources of error: the model itself, the way the model is trained, and the data (or lack thereof) that the model is trained on.
Large language models like DeepSeek-R1 or OpenAI's ChatGPT are kind of like the predictive text feature on your phone, on steroids. To "learn" how to write, these models are trained on millions of examples of human-written text.
In the past, this training usually involved having the models read huge swaths of the Internet. But nowadays, thanks in part to these large language models themselves, a lot of the content on the Internet is written by generative AI. That means AI models trained now may consume their own synthetic content, and suffer the consequences.
View the AI-generated images mentioned in this episode.
Have another topic in artificial intelligence you want us to cover? Let us know by emailing shortwave@npr.org!
Listen to Short Wave on Spotify and Apple Podcasts.
Listen to every episode of Short Wave sponsor-free and support our work at NPR by signing up for Short Wave+ at plus.npr.org/shortwave.
This episode was produced by Hannah Chinn. It was edited by our showrunner, Rebecca Ramirez. The audio engineer was Jimmy Keeley.