Performance of ChatGPT, GPT-4, and Google Bard on a Neurosurgery Oral Boards Preparation Question Bank

  • Rohaid Ali
    Department of Neurosurgery, The Warren Alpert Medical School of Brown University, Providence, Rhode Island, USA
  • Oliver Y. Tang
    Department of Neurosurgery, The Warren Alpert Medical School of Brown University, Providence, Rhode Island, USA
  • Ian D. Connolly
    Department of Neurosurgery, Massachusetts General Hospital, Boston, Massachusetts, USA
  • Jared S. Fridley
    Department of Neurosurgery, The Warren Alpert Medical School of Brown University, Providence, Rhode Island, USA
  • John H. Shin
    Department of Neurosurgery, Massachusetts General Hospital, Boston, Massachusetts, USA
  • Patricia L. Zadnik Sullivan
    Department of Neurosurgery, The Warren Alpert Medical School of Brown University, Providence, Rhode Island, USA
  • Deus Cielo
    Department of Neurosurgery, The Warren Alpert Medical School of Brown University, Providence, Rhode Island, USA
  • Adetokunbo A. Oyelese
    Department of Neurosurgery, The Warren Alpert Medical School of Brown University, Providence, Rhode Island, USA
  • Curtis E. Doberstein
    Department of Neurosurgery, The Warren Alpert Medical School of Brown University, Providence, Rhode Island, USA
  • Albert E. Telfeian
    Department of Neurosurgery, The Warren Alpert Medical School of Brown University, Providence, Rhode Island, USA
  • Ziya L. Gokaslan
    Department of Neurosurgery, The Warren Alpert Medical School of Brown University, Providence, Rhode Island, USA
  • Wael F. Asaad
    Department of Neurosurgery, The Warren Alpert Medical School of Brown University, Providence, Rhode Island, USA

Abstract

BACKGROUND AND OBJECTIVES: General large language models (LLMs), such as ChatGPT (GPT-3.5), have demonstrated the capability to pass multiple-choice medical board examinations. However, the comparative accuracy of different LLMs and LLM performance on assessments of predominantly higher-order management questions are poorly understood. We aimed to assess the performance of 3 LLMs (GPT-3.5, GPT-4, and Google Bard) on a question bank designed specifically for neurosurgery oral boards examination preparation.

METHODS: The 149-question Self-Assessment Neurosurgery Examination Indications Examination was used to query LLM accuracy. Questions were inputted in a single best answer, multiple-choice format. χ², Fisher exact, and univariable logistic regression tests assessed differences in performance by question characteristics.

RESULTS: On a question bank with predominantly higher-order questions (85.2%), ChatGPT (GPT-3.5) and GPT-4 answered 62.4% (95% CI: 54.1%-70.1%) and 82.6% (95% CI: 75.2%-88.1%) of questions correctly, respectively. By contrast, Bard scored 44.2% (66/149, 95% CI: 36.2%-52.6%). GPT-3.5 and GPT-4 demonstrated significantly higher scores than Bard (both P < .01), and GPT-4 outperformed GPT-3.5 (P = .023). Among 6 subspecialties, GPT-4 had significantly higher accuracy in the Spine category relative to GPT-3.5 and in 4 categories relative to Bard (all P < .01). Incorporation of higher-order problem solving was associated with lower question accuracy for GPT-3.5 (odds ratio [OR] = 0.80, P = .042) and Bard (OR = 0.76, P = .014), but not GPT-4 (OR = 0.86, P = .085). GPT-4's performance on imaging-related questions surpassed GPT-3.5's (68.6% vs 47.1%, P = .044) and was comparable with Bard's (68.6% vs 66.7%, P = 1.000). However, GPT-4 demonstrated significantly lower rates of "hallucination" on imaging-related questions than both GPT-3.5 (2.3% vs 57.1%, P < .001) and Bard (2.3% vs 27.3%, P = .002). Lack of a question text description predicted significantly higher odds of hallucination for GPT-3.5 (OR = 1.45, P = .012) and Bard (OR = 2.09, P < .001).

CONCLUSION: On a question bank of predominantly higher-order management case scenarios for neurosurgery oral boards preparation, GPT-4 achieved a score of 82.6%, outperforming ChatGPT (GPT-3.5) and Google Bard.
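As a rough illustration of the pairwise model comparisons reported in METHODS and RESULTS, the minimal Python sketch below runs χ² and Fisher exact tests on 2×2 correct/incorrect tables for the 149-question bank. It is not the authors' analysis code: the correct-answer counts for GPT-3.5 (93/149) and GPT-4 (123/149) are back-calculated from the reported percentages and are assumptions; only Bard's 66/149 is stated explicitly in the abstract.

```python
# Sketch only: pairwise accuracy comparison on the 149-question SANS Indications
# Examination, mirroring the chi-square / Fisher exact tests named in METHODS.
from scipy.stats import chi2_contingency, fisher_exact

N = 149
# Correct-answer counts: GPT-3.5 and GPT-4 values are assumptions back-calculated
# from the reported 62.4% and 82.6%; Bard's 66/149 is given in the abstract.
counts = {"GPT-3.5": 93, "GPT-4": 123, "Bard": 66}

def compare(model_a: str, model_b: str) -> None:
    """Test a 2x2 correct/incorrect contingency table for two models."""
    table = [
        [counts[model_a], N - counts[model_a]],
        [counts[model_b], N - counts[model_b]],
    ]
    chi2, p_chi2, _, _ = chi2_contingency(table)
    _, p_fisher = fisher_exact(table)
    print(f"{model_a} vs {model_b}: chi2 p={p_chi2:.3f}, Fisher exact p={p_fisher:.3f}")

compare("GPT-4", "GPT-3.5")   # abstract reports P = .023
compare("GPT-4", "Bard")      # abstract reports P < .01
compare("GPT-3.5", "Bard")    # abstract reports P < .01
```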

Journal

  • Neurosurgery, 93 (5), 1090-1098, 2023-06-12
    Ovid Technologies (Wolters Kluwer Health)
