When ChatGPT Reattempts UPSC: GPT-4's Performance
In February, the AI chatbot ChatGPT's attempt to clear UPSC Prelims, one of the toughest exams in the world, became a source of amusement for aspirants when it failed to answer 46 out of 100 questions correctly. Since then, OpenAI has released GPT-4, its most advanced Large Language Model (LLM) to date. We recently repeated the experiment and put the same 100 questions to GPT-4. This time, it answered 86 of them correctly, a marked improvement over its predecessor.
Prelims consists of two papers, General Studies Paper-I and General Studies Paper-II, but we considered only Paper-I in both attempts. The previous year's cut-off for Paper-I was 87.54 marks. GPT-4 scored 162.76 marks, comfortably above that cut-off, which indicates that ChatGPT Plus (powered by GPT-4) could clear UPSC Prelims. In the first attempt, ChatGPT had answered only 54 of the 100 questions correctly.
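For reference, GS Paper-I awards 2 marks for each correct answer and deducts one-third of those marks (0.66) for each wrong one. The sketch below reproduces the scores above under that commonly cited marking scheme, assuming all 100 questions were attempted; the constants come from the standard UPSC scheme, not from the experiment itself.

```python
# Minimal sketch of the UPSC Prelims GS Paper-I marking scheme
# (assumed: 2 marks per correct answer, 0.66 deducted per wrong answer,
# i.e. one-third of the question's marks; all questions attempted).
MARKS_PER_CORRECT = 2.0
PENALTY_PER_WRONG = 0.66

def prelims_score(correct: int, total: int = 100) -> float:
    """Return the GS Paper-I score for a given number of correct answers."""
    wrong = total - correct
    return round(correct * MARKS_PER_CORRECT - wrong * PENALTY_PER_WRONG, 2)

print(prelims_score(86))  # GPT-4: 162.76, above the 87.54 cut-off
print(prelims_score(54))  # ChatGPT (GPT-3.5): 77.64, below the cut-off
```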
OpenAI did not disclose GPT-4's architecture, model size, hardware, or training method in the technical paper. However, it did reveal that GPT-4 was tested on a diverse set of benchmarks, including simulated exams originally designed for humans. According to the technical paper, GPT-4 outperforms ChatGPT (GPT-3.5) on most of the exams tested.
One of the major reasons for ChatGPT's poor performance was its tendency to hallucinate. In contrast, GPT-4 is more creative and less likely to make up facts, and the latter played an important role in its improved performance on the UPSC questions. GPT-4 still hallucinated at times, but to a lesser degree than its predecessor.
Another observation was that both models answered history-related questions incorrectly, even though that is an area where they would be expected to perform well. Further, ChatGPT had previously failed an exam designed for sixth-graders, and GPT-4 could not clear the English literature exam, indicating that these models still have limitations.
It is also worth noting that GPT-4's answers are sensitive to how a question is phrased: rephrasing the same question could turn a wrong answer into a correct one, and vice versa. In this experiment, however, only the bot's initial response to each question was counted.
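The experiment itself was presumably run through the ChatGPT Plus interface, but purely as a hypothetical illustration, this kind of prompt sensitivity could be checked by sending two wordings of the same question through the OpenAI API and comparing the first replies; the question text below is placeholder only.

```python
# Hypothetical sketch only: compare GPT-4's first answers to two
# phrasings of the same multiple-choice question.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def first_answer(question: str) -> str:
    """Return the model's initial answer to a single question."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": question}],
        temperature=0,  # reduce run-to-run variation
    )
    return response.choices[0].message.content

original = "Which of the following statements is correct? (a) ... (b) ... (c) ... (d) ..."
rephrased = "Consider the options below and pick the correct one: (a) ... (b) ... (c) ... (d) ..."

print(first_answer(original))
print(first_answer(rephrased))  # the two answers may differ
```

Setting the temperature to 0 makes each run more repeatable, though it does not remove sensitivity to wording.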
Finally, this was just a fun experiment, and no firm conclusions should be drawn from these results. Generative AI can analyse enormous volumes of data and patterns, but it cannot take the place of human originality, creativity, and common sense. Human oversight therefore remains essential in its development and deployment.