ChatGPT’s new reasoning models are hallucinating more often than their predecessors
ChatGPT’s new reasoning models hallucinate more often than their predecessors did, raising concerns about their reliability. The root cause is still under investigation, but one notable observation is that the very feature that defines these models, breaking a task into intermediate reasoning steps, has negatively impacted the reliability of OpenAI’s large language models (LLMs).
Effects on Benchmark Performance
On benchmarks such as PersonQA and SimpleQA, OpenAI’s latest reasoning models (o3 and o4-mini) have exhibited hallucination rates ranging from 33% to 79%. The same trend can be expected to affect reasoning models from other major players such as Google and DeepSeek.
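As a rough illustration of what those percentages measure, here is a minimal sketch of how a hallucination rate is computed on a QA benchmark: each model answer is graded against a reference, and the rate is the fraction of answers containing fabricated claims. The QAResult type and the hard-coded grading below are illustrative assumptions, not OpenAI’s actual evaluation harness.

```python
from dataclasses import dataclass

@dataclass
class QAResult:
    question: str
    answer: str            # the model's answer
    is_hallucinated: bool  # graded against a reference answer

def hallucination_rate(results: list[QAResult]) -> float:
    """Fraction of answers that contain a fabricated claim."""
    if not results:
        return 0.0
    return sum(r.is_hallucinated for r in results) / len(results)

# Illustrative data only: 33 fabricated answers out of 100 -> 33%,
# the low end of the range reported for o3 and o4-mini.
sample = [QAResult(f"q{i}", "answer", i < 33) for i in range(100)]
print(f"hallucination rate: {hallucination_rate(sample):.0%}")  # 33%
```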

One possible explanation for this phenomenon is a cascading failure scenario: each interaction with the LLM introduces minor inaccuracies, and repeated interactions compound them, so small errors accumulate into noticeable discrepancies in the model’s outputs as the process unfolds.
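A back-of-the-envelope calculation shows why this matters: if each reasoning step is correct with probability p and errors are independent, an n-step chain is error-free only with probability p^n. The sketch below is a simplification under those assumptions; real reasoning steps are neither independent nor uniformly accurate, but the direction of the effect holds.

```python
def chain_accuracy(step_accuracy: float, steps: int) -> float:
    """Probability that an n-step reasoning chain contains no errors,
    assuming each step is independently correct with the same probability."""
    return step_accuracy ** steps

# Even a 95%-accurate step erodes quickly over long chains:
for n in (1, 5, 10, 20):
    print(f"{n:>2} steps -> {chain_accuracy(0.95, n):.0%} of chains error-free")
# 1 step -> 95%, 5 steps -> 77%, 10 steps -> 60%, 20 steps -> 36%
```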
Impact on Product Development
These high failure rates pose a significant obstacle to the practical use of reasoning models in consumer-facing products, delaying the integration of LLMs into such products with no clear timeline for resolving the reliability issues.
Comparing SLMs and LLMs
While LLMs boast broader knowledge coverage, small language models (SLMs) excel in depth, delivering specialized capabilities within limited domains. LLMs are designed for generalization, but their current unreliability undermines their usability in practical AI applications.

Anthropic’s recent findings highlight how difficult it is to trust LLM-generated explanations and outputs. This lack of transparency and reliability leaves LLMs unready for widespread deployment.
The Reliability Issue
The inherent unexplainability and unreliability of LLMs make them ill-suited for mainstream use today. However, it is essential to acknowledge the technology that does work, particularly the effectiveness of SLMs, rather than dismissing the entire field because of these challenges.

Conclusion
Despite the setbacks faced by reasoning LLMs, ongoing research and advances in AI technology remain promising. Understanding the limitations of today’s models and exploring alternative approaches, such as SLMs, will pave the way for more reliable and efficient generative models.