Breaking Down DeepSeek R1, o1 Pro, and Grok 3 Performance

Published on July 2, 2025

A comparative analysis of DeepSeek R1, DeepSeek-R1-Lite, o1 Pro, and Grok 3

The ability of large language models (LLMs) to accurately answer medical board-style questions reflects their potential to benefit medical education and real-time clinical decision-making. With the recent advent of reasoning models, the latest LLMs excel at complex problems on benchmark math and science tests. This study assessed the performance of four first-generation reasoning models (DeepSeek's R1 and R1-Lite, OpenAI's o1 Pro, and xAI's Grok 3) on 493 ophthalmology questions sourced from the StatPearls and EyeQuiz question banks.

o1 Pro achieved the highest overall accuracy (83.4%), significantly outperforming DeepSeek R1 (72.5%), DeepSeek-R1-Lite (76.5%), and Grok 3 (69.2%) (p < 0.001 for all pairwise comparisons). o1 Pro also demonstrated superior performance on questions from eight of nine ophthalmologic subfields, on questions of second- and third-order cognitive complexity, and on image-based questions. DeepSeek-R1-Lite ranked second despite its relatively small memory requirements, while Grok 3 had the lowest overall accuracy.
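This summary does not name the statistical test behind the p < 0.001 figures, and since all four models answered the same 493 questions, the study may well have used a paired analysis. Purely as an illustration, the sketch below (Python, assuming scipy is installed) runs unpaired two-proportion chi-squared tests on correct/incorrect counts reconstructed from the reported accuracies; the counts are rounded, so the resulting p-values are approximate.

```python
# Illustrative sketch only: reconstructs pairwise comparisons from the
# reported accuracies. The study's actual test is not specified here.
from scipy.stats import chi2_contingency

N = 493  # ophthalmology questions answered by each model
accuracy = {
    "o1 Pro": 0.834,
    "DeepSeek-R1-Lite": 0.765,
    "DeepSeek R1": 0.725,
    "Grok 3": 0.692,
}

# Approximate correct-answer counts (rounded from accuracy * N).
correct = {model: round(acc * N) for model, acc in accuracy.items()}

for model, n_correct in correct.items():
    if model == "o1 Pro":
        continue
    # 2x2 contingency table: rows = models, columns = (correct, incorrect).
    table = [
        [correct["o1 Pro"], N - correct["o1 Pro"]],
        [n_correct, N - n_correct],
    ]
    chi2, p, _, _ = chi2_contingency(table)
    print(f"o1 Pro vs {model}: chi2 = {chi2:.1f}, p = {p:.1e}")
```

A paired test such as McNemar's, applied to per-question outcomes between two models, would have more statistical power than this unpaired reconstruction and is a plausible source of the uniformly small p-values reported.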

Generative artificial intelligence (AI) is a category of AI that learns patterns from large datasets and uses those patterns to produce new content, including text, images, music, and video. LLMs are a subset of generative AI and form the foundation of advanced chatbot tools such as OpenAI's ChatGPT. They have an emerging role in medicine, assisting with medical education and clinical decision-making in areas such as board exam preparation and differential diagnosis generation.

Assessing the accuracy and reliability of LLMs in complex scientific reasoning is crucial for their safe and effective integration into both medical education and clinical practice. In the United States, the field of ophthalmology relies on multiple-choice standardized exams administered by the American Board of Ophthalmology to assure the public that practitioners have the medical knowledge, clinical judgment, and professionalism required to provide high-quality patient care.

Recent research has demonstrated that newer generations of LLMs perform better than earlier models on medical board-style questions. Models designed for complex reasoning, including OpenAI's o1 Pro and Grok 3, have typically demanded significant computational resources and costs. However, a novel model from DeepSeek achieved high benchmark performance despite substantially lower training costs and computational demands.

This study evaluated four reasoning models on ophthalmology board-style questions: DeepSeek R1, DeepSeek-R1-Lite (15 billion parameters), o1 Pro, and Grok 3. We hypothesized that DeepSeek R1 would perform comparably to o1 Pro and Grok 3 while outperforming DeepSeek-R1-Lite, given R1's larger architecture and established benchmark performance.
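To make this kind of evaluation concrete, below is a minimal sketch of a multiple-choice grading harness. The `ask_model` function and the answer-letter regex are hypothetical stand-ins for illustration, not the study's actual pipeline; in practice each model is called through its own API or local runtime.

```python
# Minimal sketch of a board-style multiple-choice evaluation loop.
# `ask_model` is a hypothetical stand-in for a model API client;
# this is NOT the study's actual harness.
import re
from dataclasses import dataclass

@dataclass
class Question:
    stem: str
    choices: dict[str, str]  # e.g. {"A": "...", "B": "...", ...}
    answer: str              # correct choice letter, e.g. "C"
    subfield: str            # e.g. "glaucoma", "retina"

def ask_model(prompt: str) -> str:
    """Hypothetical: send the prompt to a model endpoint, return its reply."""
    raise NotImplementedError

def grade(questions: list[Question]) -> float:
    """Return the fraction of questions the model answers correctly."""
    n_correct = 0
    for q in questions:
        options = "\n".join(f"{letter}. {text}" for letter, text in q.choices.items())
        reply = ask_model(
            f"{q.stem}\n{options}\n"
            "Answer with the single letter of the best choice."
        )
        match = re.search(r"\b([A-E])\b", reply)  # first standalone choice letter
        if match and match.group(1) == q.answer:
            n_correct += 1
    return n_correct / len(questions)
```

In practice, reply formats differ across models, so answer extraction usually needs model-specific parsing, and decoding should be pinned (for example, low temperature) so runs are reproducible; grouping results by `q.subfield` yields the per-subfield breakdown reported above.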

Clarifying the performance of these models could provide insight into the feasibility of using reasoning models, including cost-efficient and lower-memory models, in medical education and clinical decision-making.
