Researchers Combat AI Hallucinations in Math | KQED
One of the biggest problems with using AI in education is that the technology hallucinates. That’s the word the artificial intelligence community uses to describe how its newest large language models make up stuff that doesn’t exist or isn’t true. Math is a particular land of make-believe for AI chatbots. Several months ago, I tested Khan Academy’s chatbot, which is powered by ChatGPT. The bot, called Khanmigo, told me I had answered a basic high school Algebra 2 problem involving negative exponents wrong. I knew my answer was right. After typing in the same correct answer three times, Khanmigo finally agreed with me. It was frustrating.
Errors matter. Kids could memorize incorrect solutions that are hard to unlearn, or become more confused about a topic. I also worry about teachers using ChatGPT and other generative AI models to write quizzes or lesson plans. At least a teacher has the opportunity to vet what AI spits out before giving or teaching it to students. It’s riskier when you’re asking students to learn directly from AI.
Combating AI Hallucinations
Computer scientists are attempting to combat these errors in a process they call “mitigating AI hallucinations.” Two researchers from the University of California, Berkeley, recently documented how they successfully reduced ChatGPT’s instructional errors to near zero in algebra. They were not as successful with statistics, where their techniques still left errors 13% of the time. Their paper was published in May 2024 in the peer-reviewed journal PLOS One.
In the experiment, Zachary Pardos, a computer scientist at the Berkeley School of Education, and one of his students, Shreya Bhandari, first asked ChatGPT to show how it would solve an algebra or statistics problem. They discovered that ChatGPT was “naturally verbose” and they did not have to prompt the large language model to explain its steps. But all those words didn’t help with accuracy. On average, ChatGPT’s methods and answers were wrong a third of the time. In other words, ChatGPT would earn a grade of a D if it were a student.
Improving Accuracy
Current AI models are bad at math because they’re built to predict the most probable next word, not to follow rules. Math calculations are all about rules. It’s ironic because earlier versions of AI were able to follow rules, but unable to write or summarize. Now we have the opposite.
The Berkeley researchers took advantage of the fact that ChatGPT, like humans, is erratic. They asked ChatGPT to answer the same math problem 10 times in a row. I was surprised that a machine might answer the same question differently, but that is what these large language models do. Often the step-by-step process and the answer were the same, but the exact wording differed. Sometimes the methods were bizarre and the results were dead wrong.
Researchers grouped similar answers together. When they assessed the accuracy of the most common answer among the 10 solutions, ChatGPT was astonishingly good. For basic high-school algebra, AI’s error rate fell from 25% to zero. For intermediate algebra, the error rate fell from 47% to 2%. For college algebra, it fell from 27% to 2%.
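For readers who want to see the voting step concretely, here is a minimal Python sketch of picking the most common answer out of 10 responses. It is an illustration only, not the researchers’ code: the function names, the sample responses and the rule for deciding that two answers are “similar” (treating equal numbers such as 0.5 and 1/2 as the same answer) are all assumptions made for this example.

```python
from collections import Counter
from fractions import Fraction


def normalize(answer: str) -> str:
    """Group equivalent final answers, e.g. '0.5' and '1/2' (assumed rule)."""
    text = answer.strip().lower()
    try:
        # Numeric answers are reduced to a canonical fraction string.
        return str(Fraction(text))
    except (ValueError, ZeroDivisionError):
        # Non-numeric answers are compared as cleaned-up text.
        return text


def most_common_answer(answers: list[str]) -> tuple[str, int]:
    """Return the most frequent normalized answer and its vote count."""
    counts = Counter(normalize(a) for a in answers)
    return counts.most_common(1)[0]


# Ten hypothetical responses to "Simplify 2^-1" (invented for illustration).
samples = ["1/2", "0.5", "1/2", "-2", "1/2", "0.5", "2", "1/2", "0.5", "1/2"]
winner, votes = most_common_answer(samples)
print(f"Most common answer: {winner} ({votes} of {len(samples)} responses)")
```

In this made-up run, two of the 10 responses are wrong, but they disagree with each other, so the correct answer still wins the vote. That is the intuition behind the approach: wrong answers tend to scatter, while right answers tend to repeat.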
However, when the scientists applied this method, which they call “self-consistency,” to statistics, it did not work as well. ChatGPT’s error rate fell from 29% to 13%, but still more than one out of 10 answers was wrong. I think that’s too many errors for students who are learning math.
Impact on Learning
The big question, of course, is whether ChatGPT’s solutions help students learn math better than traditional teaching. In a second part of this study, researchers recruited 274 adults online to solve math problems and randomly assigned a third of them to see these ChatGPT solutions as a “hint” if they needed one. On a short test afterwards, these adults improved by 17%, compared with learning gains of less than 12% for the adults who could see a different group of hints written by undergraduate math tutors. Those who weren’t offered any hints scored about the same on a post-test as they did on a pre-test.