How GPT-4 is revolutionising neurosurgery board exams

Published On Fri May 12 2023

GPT-4 outperforms GPT-3.5 and Google Bard in neurosurgery

A recent study tested the performance of three general-purpose Large Language Models (LLMs) - ChatGPT (GPT-3.5), GPT-4, and Google Bard - on higher-order questions representative of the American Board of Neurological Surgery (ABNS) oral board examination. The accuracy of the models was compared across varying question characteristics.

Previously, ChatGPT was tested on a 500-question module imitating the neurosurgery written board exams, scoring 73.4%. Its successor, GPT-4, became available for public use on March 14, 2023, and similarly attained passing scores on more than 25 standardized exams. Studies have documented performance improvements of more than 20% for GPT-4 on the United States Medical Licensing Exam (USMLE).

In the present study, GPT-4 was assessed on a 149-question module imitating the neurosurgery oral board exam. The model scored 82.6%, outperforming ChatGPT's 62.4%. GPT-4 also demonstrated markedly better performance than ChatGPT in the Spine subspecialty (90.5% vs. 64.3%).

Google Bard, another artificial intelligence (AI)-based chatbot, has real-time web crawling capabilities and can offer more contextually relevant information when generating responses for standardized exams in medicine, business, and law. In this study, however, the model generated correct responses to only 44.2% (66/149) of questions, generated incorrect responses to 45% (67/149), and declined to answer 10.7% (16/149). GPT-4 outperformed Google Bard in all categories and showed improved performance in the question categories where ChatGPT had lower accuracy.

The study findings underscore the need for neurosurgeons to stay informed about emerging LLMs and their varying performance levels for potential clinical applications. Moreover, building trust in LLM systems will require continued rigorous validation of their performance on increasingly higher-order and open-ended scenarios, to ensure the safe and effective integration of these models into clinical decision-making processes.