Keywords: Artificial Intelligence, Primary Care, General Practitioner, Examination, Benchmark
Background:
Recent advancements in artificial intelligence (AI) have shown significant promise in healthcare. OpenAI's ChatGPT, specifically version GPT-4, has demonstrated competence in medical multiple-choice assessments. However, its performance on complex, free-text cases, particularly in primary care, remains largely unexplored. The Swedish family medicine specialist examination provides a unique opportunity to compare GPT-4's responses against those of real doctors in managing multifaceted clinical scenarios.
Research questions:
How does GPT-4’s performance compare with that of randomly selected and top-tier doctors when providing comprehensive free-text responses to primary care cases?
What are the specific strengths and weaknesses of GPT-4 in diagnosing, recommending treatments, and addressing psychosocial complexities?
Can GPT-4’s responses inform the future development of AI in clinical decision support?
Method:
This observational comparative study evaluated 48 cases from the Swedish family medicine specialist examination (2017–2022). The cases, consisting of long-form clinical scenarios, were assessed by three groups:
Group A: Randomly selected doctor responses
Group B: Top-tier doctor responses
Group C: GPT-4-generated responses
Each response was scored using a structured evaluation guide adapted from the official scoring criteria. Blinded reviewers assessed the responses, awarding scores on a 10-point scale. Statistical analyses included paired t-tests to compare mean scores and intraclass correlation coefficients (ICC) to evaluate scoring reliability.
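A minimal sketch of the two analyses named above (paired t-tests between groups and ICC for rater agreement), assuming Python with SciPy and pingouin; all scores, column names, and the two-rater setup are simulated placeholders, not the study data.

import numpy as np
import pandas as pd
from scipy.stats import ttest_rel
import pingouin as pg

rng = np.random.default_rng(0)
n_cases = 48  # one row per examination case

# Simulated per-case scores on a 10-point scale (illustrative only).
scores = pd.DataFrame({
    "group_a": rng.normal(6.0, 1.5, n_cases).clip(0, 10),  # randomly selected doctors
    "group_b": rng.normal(7.2, 1.2, n_cases).clip(0, 10),  # top-tier doctors
    "group_c": rng.normal(4.5, 1.5, n_cases).clip(0, 10),  # GPT-4
})

# Paired t-tests: each doctor group compared with GPT-4 on the same cases.
t_a, p_a = ttest_rel(scores["group_a"], scores["group_c"])
t_b, p_b = ttest_rel(scores["group_b"], scores["group_c"])
print(f"A vs C: t={t_a:.2f}, p={p_a:.4f}; B vs C: t={t_b:.2f}, p={p_b:.4f}")

# Intraclass correlation for scoring reliability: long format with one
# score per (case, rater) pair; two hypothetical raters are simulated here.
ratings = pd.DataFrame({
    "case": np.repeat(np.arange(n_cases), 2),
    "rater": np.tile(["rater1", "rater2"], n_cases),
    "score": rng.normal(6.0, 1.5, n_cases * 2).clip(0, 10),
})
icc = pg.intraclass_corr(data=ratings, targets="case", raters="rater", ratings="score")
print(icc[["Type", "ICC", "CI95%"]])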
Results:
The mean scores for Groups A, B, and C were 6.0, 7.2, and 4.5, respectively. Responses from randomly selected doctors scored 1.6 points higher than GPT-4’s (p<0.001), and responses from top-tier doctors outperformed GPT-4’s by 2.7 points (p<0.001). GPT-4’s responses were notably less comprehensive in differential diagnosis, treatment recommendations, and the handling of social factors. However, GPT-4 demonstrated potential in structuring responses and providing general medical knowledge.
Conclusions:
GPT-4 underperformed compared with both randomly selected and top-tier doctors in managing complex primary care cases. Although its outputs reveal significant limitations in medical accuracy and contextual understanding, GPT-4 shows some promise as a supplementary tool for clinical decision support.
Points for discussion:
Implications for clinical practice: The study highlights the current limitations of GPT-4 in primary care, emphasizing the need for human oversight in AI-driven medical decision-making.
Potential for improvement: With targeted training and prompt engineering, GPT-4’s performance may improve in areas such as differential diagnosis and psychosocial assessment.
Future research directions: Further studies should explore the integration of AI models like GPT-4 into clinical workflows, focusing on their role in enhancing efficiency while ensuring patient safety.