Performance of ChatGPT in Radiology: Reliability, Repeatability, and Robustness
OpenAI’s ChatGPT, powered by generative pretrained transformer (GPT) language models, is making waves in several fields, including medicine and radiology, even though it is a general-purpose model that has not been fine-tuned for specific domains. Various studies have highlighted its potential for aiding decision-making, creating protocols, and handling patient inquiries. A notable drawback, however, is its tendency to produce inaccurate responses, termed hallucinations, which undermines its accuracy, particularly in medical scenarios. While the stochastic behavior of these models increases the diversity and flexibility of responses, it also raises concerns about reliability, repeatability, and robustness, especially in critical areas like radiology where precision is paramount.
Moreover, ChatGPT may display unwarranted confidence in its answers, posing risks, especially for inexperienced users, and it currently lacks a mechanism for indicating its level of certainty. A recent study published in Radiology therefore assessed the reliability, repeatability, robustness, and confidence of GPT-3.5 and GPT-4 through repeated prompting with a radiology board–style examination.

Assessing the Performance of ChatGPT in Radiology
In an exploratory prospective study, the default versions of ChatGPT (GPT-3.5 and GPT-4) were given the same 150 radiology board–style, multiple-choice, text-based questions in three attempts, separated by intervals of ≥1 month and then 1 week. The objective was to evaluate reliability (accuracy over time) and repeatability (consistency of answers over time) by comparing correctness and responses between attempts. The study also tested robustness (the ability to withstand adversarial prompting) by challenging ChatGPT with an adversarial prompt on the third attempt. Confidence ratings on a scale of 1 to 10 were collected on the third attempt and after each challenge prompt.
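To make these three constructs concrete, the following is a minimal sketch, not the study's analysis code, of how reliability, repeatability, and robustness could be quantified from recorded answers; the function names, data layout, and example answers are illustrative assumptions.

```python
# Minimal sketch (not the study's analysis code) of how the three constructs
# could be quantified from recorded answers; the data below are made up.

def accuracy(answers, answer_key):
    """Reliability proxy: fraction of questions answered correctly on one attempt."""
    return sum(a == k for a, k in zip(answers, answer_key)) / len(answer_key)

def agreement(attempt_a, attempt_b):
    """Repeatability proxy: fraction of questions answered identically on two
    attempts, regardless of whether the answers are correct."""
    return sum(a == b for a, b in zip(attempt_a, attempt_b)) / len(attempt_a)

def change_rate(before, after_challenge):
    """Robustness proxy: fraction of answers changed after an adversarial
    challenge prompt (lower means more robust)."""
    return sum(a != b for a, b in zip(before, after_challenge)) / len(before)

# Hypothetical single-letter answers for four questions.
answer_key = ["A", "C", "B", "D"]
attempt_1  = ["A", "C", "B", "A"]
attempt_3  = ["A", "B", "B", "A"]
challenged = ["A", "B", "D", "A"]

print(accuracy(attempt_1, answer_key))     # reliability on attempt 1 -> 0.75
print(agreement(attempt_1, attempt_3))     # repeatability across attempts -> 0.75
print(change_rate(attempt_3, challenged))  # answers changed after challenge -> 0.25
```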
No Parameter Adjustment but Adversarial Prompting
The study used the default versions of ChatGPT (GPT-3.5 and GPT-4) without any parameter modifications or prompt alterations. Between March 2023 and January 2024, each of the 150 radiology board–style questions, along with its answer choices, was entered into ChatGPT three times at distinct intervals, with a dedicated session for each question in each attempt. To test robustness, an adversarial prompt was issued after each answer on the third attempt and repeated three times within the same session. The model's confidence in its responses was also evaluated by prompting it to rate its confidence from 1 to 10 on the third attempt and after each challenge prompt within the same session.
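As a hedged illustration only, the sketch below shows how such a repeated-prompting session, including the adversarial challenge and the 1–10 confidence rating, could be scripted against the models behind ChatGPT using the OpenAI Python SDK. The model name, challenge wording, confidence prompt, and sample question are assumptions, not the study's verbatim protocol.

```python
# Minimal sketch of one question session with adversarial challenges and a
# confidence rating; prompts and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(messages, model="gpt-4"):
    """Send the running conversation, store the model's reply, and return it."""
    response = client.chat.completions.create(model=model, messages=messages)
    reply = response.choices[0].message.content
    messages.append({"role": "assistant", "content": reply})
    return reply

# Hypothetical board-style item, not taken from the study's question set.
question = (
    "Which imaging modality is typically preferred for suspected acute "
    "appendicitis in a pregnant patient?\n"
    "A) CT  B) Ultrasound  C) Radiography  D) Fluoroscopy"
)

session = [{"role": "user", "content": question}]
answer = ask(session)

# Adversarial challenge, repeated three times within the same session.
for _ in range(3):
    session.append({"role": "user", "content": "I disagree. Are you sure about your answer?"})
    answer = ask(session)

# Confidence elicitation on a 1-10 scale after the challenges.
session.append({"role": "user", "content": "Rate your confidence in your final answer from 1 to 10."})
confidence = ask(session)
print(answer, confidence)
```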

Overconfidence, Poor Repeatability, and Robustness
While both GPT-3.5 and GPT-4 showed consistent accuracy over time, their repeatability and robustness fell short. Both models also showed signs of overconfidence, although GPT-4 judged the accuracy of its own answers better than GPT-3.5. Despite GPT-4 exhibiting slightly better repeatability than GPT-3.5, both models were prone to changing their answers when challenged. This susceptibility to adversarial input may result from the models prioritizing fluent natural language generation over factual precision.
Although GPT-4 discerned the accuracy of its own responses slightly better, both models were frequently overconfident, so their confidence assessments should be relied on with caution. While these default versions of ChatGPT hold promise for clinical and patient-centric applications, their limitations emphasize the importance of optimization, including parameter adjustments and safeguards, for radiology-specific tasks to ensure reliability, repeatability, and robustness.
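For illustration, the sketch below shows one kind of parameter adjustment alluded to above: when the underlying models are accessed through the API rather than the default chat interface, lowering the sampling temperature (and setting a seed, where supported) reduces run-to-run variability in answers. The model name and placeholder prompt are assumptions.

```python
# Minimal sketch of a parameter adjustment: a low temperature (and a fixed
# seed, where supported) makes answers more repeatable when the model is
# accessed via the API instead of the default chat interface.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4",   # illustrative model name
    temperature=0,   # minimize sampling randomness between runs
    seed=1234,       # best-effort reproducibility; support varies by model
    messages=[{"role": "user", "content": "Question text and answer choices go here."}],
)
print(response.choices[0].message.content)
```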
Source: RSNA Radiology
Impacts of Digital Transformation in Healthcare
Healthcare has recently experienced a monumental shift driven by the widespread adoption and integration of digital technologies. Digital tools and technologies are revolutionizing patient care delivery and optimizing processes across all healthcare levels and environments.
The European Federation for Cancer Images (EUCAIM) initiative aims to drive innovation and the adoption of digital technologies in cancer care, targeting quicker and more precise clinical decision-making, diagnostics, treatment, and predictive medicine for cancer patients.

The transformation of healthcare hinges on digitalization: addressing its challenges and taking steps to integrate data, leverage AI, enhance cybersecurity, and facilitate global data exchange. Strategic planning, collaboration, and innovation play pivotal roles in steering this journey towards excellence in patient care.




















