ChatGPT’s new reasoning models are hallucinating more often than their predecessors
ChatGPT’s new reasoning models hallucinate more often than their predecessors did, raising concerns about their reliability. The root cause is still under investigation, but one notable observation is that the very feature that defines these models, breaking a task into intermediate reasoning steps, has negatively impacted the reliability of OpenAI’s large language models (LLMs).
Effects on Benchmark Performance
On benchmarks such as PersonQA and SimpleQA, OpenAI’s latest reasoning models (o3 and o4-mini) have exhibited hallucination rates ranging from 33% to 79%. The same trend can be expected to affect reasoning models from other major players such as Google and DeepSeek.
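As a rough illustration of what those percentages measure, here is a minimal sketch of how a hallucination rate is computed on a QA benchmark: each model answer is graded against a reference, and the rate is the fraction of answers containing fabricated claims. The QAResult type and the hard-coded grading below are illustrative assumptions, not OpenAI’s actual evaluation harness.

```python
from dataclasses import dataclass

@dataclass
class QAResult:
    question: str
    answer: str            # the model's answer
    is_hallucinated: bool  # graded against a reference answer

def hallucination_rate(results: list[QAResult]) -> float:
    """Fraction of answers that contain a fabricated claim."""
    if not results:
        return 0.0
    return sum(r.is_hallucinated for r in results) / len(results)

# Illustrative data only: 33 fabricated answers out of 100 -> 33%,
# the low end of the range reported for o3 and o4-mini.
sample = [QAResult(f"q{i}", "answer", i < 33) for i in range(100)]
print(f"hallucination rate: {hallucination_rate(sample):.0%}")  # 33%
```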

One possible explanation for this phenomenon is a cascading failure scenario: each interaction with the LLM introduces minor inaccuracies, and repeated interactions compound them, so small errors accumulate into noticeable discrepancies in the model’s outputs as the process unfolds.
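A back-of-the-envelope calculation shows why this matters: if each reasoning step is correct with probability p and errors are independent, an n-step chain is error-free only with probability p^n. The sketch below is a simplification under those assumptions; real reasoning steps are neither independent nor uniformly accurate, but the direction of the effect holds.

```python
def chain_accuracy(step_accuracy: float, steps: int) -> float:
    """Probability that an n-step reasoning chain contains no errors,
    assuming each step is independently correct with the same probability."""
    return step_accuracy ** steps

# Even a 95%-accurate step erodes quickly over long chains:
for n in (1, 5, 10, 20):
    print(f"{n:>2} steps -> {chain_accuracy(0.95, n):.0%} of chains error-free")
# 1 step -> 95%, 5 steps -> 77%, 10 steps -> 60%, 20 steps -> 36%
```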
Impact on Product Development
These high failure rates pose a significant obstacle to the practical use of reasoning models in consumer-facing products, delaying the integration of LLMs into such products with no clear timeline for resolving the reliability issues.
Comparing SLMs and LLMs
While LLMs boast broader knowledge coverage, small language models (SLMs) excel in depth, delivering specialized capabilities within limited domains. LLMs are designed for generalization, but their current unreliability undermines their usability in practical AI applications.

Anthropic’s recent findings highlight how difficult it is to trust LLM-generated explanations and outputs. This lack of transparency and reliability leaves LLMs unready for widespread deployment.
The Reliability Issue
The inherent unexplainability and unreliability of LLMs make them ill-suited for mainstream use today. However, it is essential to acknowledge the technology that does work, particularly the effectiveness of SLMs, rather than dismissing the entire field because of these challenges.

Conclusion
Despite the setbacks faced by reasoning LLMs, ongoing research and advances in AI technology remain promising. Understanding the limitations of today’s models and exploring alternative approaches, such as SLMs, will pave the way for more reliable and efficient generative models.