Unveiling AI's Struggles with PhD-Level History Tests

Published on Tue, Jan 21, 2025

Can ChatGPT Pass a PhD-Level History Test?

Recent research shows that even the most advanced artificial intelligence models, such as GPT-4 Turbo, still struggle on a PhD-level history test. According to a study conducted by complexity scientist Peter Turchin and computer scientist Maria del Rio-Chanona, these AI models achieved a balanced accuracy of only 46% when tested on historical knowledge.
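Balanced accuracy differs from plain accuracy: it averages the per-class recall, so each answer category counts equally even when the test set is skewed toward some categories. The sketch below illustrates the metric with a small, entirely hypothetical set of multiple-choice labels (A–D); these are not the study's data.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall: every class contributes equally,
    regardless of how many questions fall into it."""
    correct = defaultdict(int)   # right answers per true class
    total = defaultdict(int)     # questions per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Hypothetical illustration: per-class recalls are A=0.5, B=0.5, C=1.0, D=0.5
y_true = ["A", "A", "B", "B", "C", "C", "D", "D"]
y_pred = ["A", "B", "B", "C", "C", "C", "A", "D"]
print(balanced_accuracy(y_true, y_pred))  # 0.625
```

On a four-option test, random guessing yields a balanced accuracy of about 25%, which is why a score of 46%, while above chance, is considered weak for expert-level questions.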

Challenges Faced by AI Models in Historical Knowledge

Turchin, known for his work on the Seshat Global History Databank, collaborated with an international team to assess how well AI models such as GPT-4, Llama, and Gemini handle historical data. The study found that while these large language models (LLMs) excel in certain areas, their performance falls off markedly on historical knowledge about regions outside North America and Western Europe.

The researchers presented their findings at the NeurIPS conference, showcasing the limitations of AI in grasping complex historical narratives and making informed judgments about past societies.


Insights from the Study

The assessment, based on questions similar to those found in the Seshat Global History Databank, highlighted the struggles of AI models in handling expert-level history inquiries. While they performed better on ancient history questions, their accuracy dropped when faced with more recent historical events.

Furthermore, the study exposed disparities in model performance across different geographic regions, with potential biases in the training data affecting their understanding of certain historical narratives. Models also faced challenges in addressing topics like discrimination and social mobility, indicating the need for improvement in handling nuanced historical concepts.

Implications for Historians and AI Developers

Despite their impressive capabilities, LLMs like GPT-4 Turbo still lack the depth of understanding required for advanced historical analysis. While they excel in basic factual knowledge, they fall short when it comes to complex historical interpretations at a PhD level. This research serves as a valuable benchmark for both historians and AI developers, providing insights into the strengths and limitations of AI chatbots in historical research.


The study's authors, including researchers from Complexity Science Hub, the University of Oxford, and the Alan Turing Institute, are dedicated to enhancing the dataset and refining the benchmark to address regional biases and improve the models' handling of intricate historical information.

The full study, titled “Large Language Models’ Expert-level Global History Knowledge Benchmark (HiST-LLM)” and presented at the NeurIPS conference, offers a comprehensive analysis of the current state of AI models in the realm of historical knowledge.