Can ChatGPT Outsmart Google in Medical Advice Delivery?

Published on Wed Mar 12 2025

Can AI chatbots like ChatGPT give better medical answers than Google? A new study shows they can — but only if you ask them the right way.

How reliable are search engines and artificial intelligence (AI) chatbots when it comes to answering health-related questions? In a recent study published in npj Digital Medicine, Spanish researchers investigated the performance of four major search engines and seven large language models (LLMs), including ChatGPT and GPT-4, in answering 150 medical questions. The findings revealed interesting patterns in accuracy, prompt sensitivity, and retrieval-augmented model effectiveness.

Some of the biggest failures by AI chatbots involved confidently giving answers that went against medical consensus, making these mistakes particularly dangerous in health settings.

The internet has now become a primary source of health information, with millions relying on search engines to find medical advice. However, search engines often return results that may be incomplete, misleading, or inaccurate.

Performance of Search Engines and LLMs

LLMs have emerged as alternatives to conventional search engines and are capable of generating coherent answers from vast training data. However, while recent studies have examined LLM performance in specialized medical domains, most evaluations have focused on a single model. Additionally, there is little research comparing LLMs with traditional search engines in health-related contexts, and few studies explore how LLM performance changes under different prompting strategies or when combined with retrieved evidence.


The accuracy of search engines and LLMs also depends on factors such as input phrasing, retrieval bias, and model reasoning capabilities. Moreover, despite their promise, LLMs sometimes generate misinformation, raising concerns about their reliability.
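
To see why phrasing matters, here is a minimal sketch, in Python, of how the same health question can be framed under different prompt templates. The templates are illustrative assumptions, not the prompts used in the study:

```python
# Illustrative prompt variants for one health question; small wording
# changes like these can flip a model's verdict, which is why the study
# evaluated models under several prompting conditions.

QUESTION = "Can vitamin C prevent the common cold?"

PROMPTS = {
    "plain":  QUESTION,
    "yes_no": f"Answer yes or no: {QUESTION}",
    "expert": f"You are a medical professional. Answer yes or no: {QUESTION}",
}

for name, prompt in PROMPTS.items():
    print(f"[{name}] {prompt}")
```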

Study Findings

The present study aimed to assess the accuracy and performance of search engines and LLMs by evaluating their effectiveness in answering health-related questions and the impact of retrieval-augmented approaches.

The researchers tested four major search engines — Yahoo!, Bing, Google, and DuckDuckGo — and seven LLMs, including GPT-4, ChatGPT, Llama3, MedLlama3, and Flan-T5. Among these, GPT-4, ChatGPT, Llama3, and MedLlama3 generally performed best, while Flan-T5 underperformed.


Search engines often returned top results that did not answer the question directly, but when they did, those answers were usually correct. This points to a precision problem (many results never address the question) rather than an accuracy problem (the answers that do appear are generally reliable).
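
One way to make this distinction concrete is to compute the two quantities separately. The sketch below assumes each retrieved result has been labeled with the answer it supports (or None if it gives no direct answer) and the medical consensus for the question; the field names are hypothetical, not taken from the paper:

```python
# Precision of coverage vs accuracy of answers, on labeled results.

def answer_rate(results):
    """Fraction of top results that directly answer the question."""
    return sum(r["answer"] is not None for r in results) / len(results)

def answer_accuracy(results):
    """Fraction of answering results whose answer matches consensus."""
    answering = [r for r in results if r["answer"] is not None]
    if not answering:
        return 0.0
    return sum(r["answer"] == r["consensus"] for r in answering) / len(answering)

results = [
    {"answer": None,  "consensus": "no"},  # page gives no direct answer
    {"answer": "no",  "consensus": "no"},  # correct answer
    {"answer": "yes", "consensus": "no"},  # incorrect answer
]
print(answer_rate(results))      # ~0.67: only two of three answer at all
print(answer_accuracy(results))  # 0.5: of those, half agree with consensus
```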

Comparison and Analysis

For search engines, the top 20 ranked results were analyzed. Interestingly, the study found that 'lazy' users, who settle for the first answer they encounter, achieved accuracy similar to 'diligent' users, who read further down the ranking, and in some cases even performed better. This suggests that top-ranked search engine results may often suffice, though it raises concerns when incorrect information ranks highly.
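
A plausible reading of these two user models, continuing the hypothetical labeling above, is sketched below. This is an illustration of the idea, not the study's published evaluation code:

```python
# 'Lazy' stops at the first answering result; 'diligent' reads the
# top-k results and takes a majority vote.

def lazy_answer(results):
    """Accept the first result that gives a direct answer."""
    return next((r["answer"] for r in results if r["answer"] is not None), None)

def diligent_answer(results, k=20):
    """Read the top-k results and take the majority answer."""
    votes = [r["answer"] for r in results[:k] if r["answer"] is not None]
    return max(set(votes), key=votes.count) if votes else None

def accuracy(questions, strategy):
    """Fraction of questions answered in line with medical consensus."""
    return sum(strategy(q["results"]) == q["consensus"] for q in questions) / len(questions)
```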

LLMs were tested under different prompting conditions, and the study explored retrieval-augmented generation, where LLMs were fed search engine results before generating responses.
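
As a rough illustration of the retrieval-augmented setup, the sketch below prepends search snippets to the question before it is sent to a model. The prompt template and function name are assumptions, not the study's actual pipeline:

```python
# Minimal retrieval-augmented sketch: search-engine snippets become
# evidence in the prompt, constraining what the model answers from.

def build_rag_prompt(question, snippets, max_snippets=3):
    """Combine retrieved evidence with the question in a single prompt."""
    evidence = "\n".join(f"- {s}" for s in snippets[:max_snippets])
    return (
        "Using only the evidence below, answer the health question "
        "with yes or no.\n\n"
        f"Evidence:\n{evidence}\n\n"
        f"Question: {question}\nAnswer:"
    )

# Usage: in practice the snippets would come from a search engine's top results.
prompt = build_rag_prompt(
    "Does zinc shorten the duration of the common cold?",
    ["Several trials suggest zinc lozenges may modestly shorten colds.",
     "Evidence is mixed and effective doses vary across studies."],
)
print(prompt)
```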

Conclusion

In summary, the study highlighted the strengths and weaknesses of both search engines and LLMs in answering health-related questions. While LLMs generally outperformed search engines, their accuracy was highly dependent on input prompts and retrieval augmentation.


Moreover, while combining both technologies appears promising, ensuring the reliability of retrieved information remains a challenge. The researchers emphasized that smaller LLMs, when supported with high-quality search evidence, can perform on par with much larger models—raising questions about the future of health information retrieval.