Performance of ChatGPT, GPT-4, and Google Bard on a Neurosurgery Oral Boards Preparation Question Bank

  • Rohaid Ali
    Department of Neurosurgery, The Warren Alpert Medical School of Brown University, Providence, Rhode Island, USA
  • Oliver Y. Tang
    Department of Neurosurgery, The Warren Alpert Medical School of Brown University, Providence, Rhode Island, USA
  • Ian D. Connolly
    Department of Neurosurgery, Massachusetts General Hospital, Boston, Massachusetts, USA
  • Jared S. Fridley
    Department of Neurosurgery, The Warren Alpert Medical School of Brown University, Providence, Rhode Island, USA
  • John H. Shin
    Department of Neurosurgery, Massachusetts General Hospital, Boston, Massachusetts, USA
  • Patricia L. Zadnik Sullivan
    Department of Neurosurgery, The Warren Alpert Medical School of Brown University, Providence, Rhode Island, USA
  • Deus Cielo
    Department of Neurosurgery, The Warren Alpert Medical School of Brown University, Providence, Rhode Island, USA
  • Adetokunbo A. Oyelese
    Department of Neurosurgery, The Warren Alpert Medical School of Brown University, Providence, Rhode Island, USA
  • Curtis E. Doberstein
    Department of Neurosurgery, The Warren Alpert Medical School of Brown University, Providence, Rhode Island, USA
  • Albert E. Telfeian
    Department of Neurosurgery, The Warren Alpert Medical School of Brown University, Providence, Rhode Island, USA
  • Ziya L. Gokaslan
    Department of Neurosurgery, The Warren Alpert Medical School of Brown University, Providence, Rhode Island, USA
  • Wael F. Asaad
    Department of Neurosurgery, The Warren Alpert Medical School of Brown University, Providence, Rhode Island, USA

Abstract

BACKGROUND AND OBJECTIVES: General large language models (LLMs), such as ChatGPT (GPT-3.5), have demonstrated the capability to pass multiple-choice medical board examinations. However, the comparative accuracy of different LLMs and LLM performance on assessments of predominantly higher-order management questions are poorly understood. We aimed to assess the performance of 3 LLMs (GPT-3.5, GPT-4, and Google Bard) on a question bank designed specifically for neurosurgery oral boards examination preparation.

METHODS: The 149-question Self-Assessment Neurosurgery Examination Indications Examination was used to query LLM accuracy. Questions were inputted in a single best answer, multiple-choice format. χ², Fisher exact, and univariable logistic regression tests assessed differences in performance by question characteristics.

RESULTS: On a question bank with predominantly higher-order questions (85.2%), ChatGPT (GPT-3.5) and GPT-4 answered 62.4% (95% CI: 54.1%-70.1%) and 82.6% (95% CI: 75.2%-88.1%) of questions correctly, respectively. By contrast, Bard scored 44.2% (66/149, 95% CI: 36.2%-52.6%). GPT-3.5 and GPT-4 demonstrated significantly higher scores than Bard (both P < .01), and GPT-4 outperformed GPT-3.5 (P = .023). Among 6 subspecialties, GPT-4 had significantly higher accuracy in the Spine category relative to GPT-3.5 and in 4 categories relative to Bard (all P < .01). Incorporation of higher-order problem solving was associated with lower question accuracy for GPT-3.5 (odds ratio [OR] = 0.80, P = .042) and Bard (OR = 0.76, P = .014), but not GPT-4 (OR = 0.86, P = .085). GPT-4's performance on imaging-related questions surpassed GPT-3.5's (68.6% vs 47.1%, P = .044) and was comparable with Bard's (68.6% vs 66.7%, P = 1.000). However, GPT-4 demonstrated significantly lower rates of "hallucination" on imaging-related questions than both GPT-3.5 (2.3% vs 57.1%, P < .001) and Bard (2.3% vs 27.3%, P = .002). Lack of a question text description predicted significantly higher odds of hallucination for GPT-3.5 (OR = 1.45, P = .012) and Bard (OR = 2.09, P < .001).

CONCLUSION: On a question bank of predominantly higher-order management case scenarios for neurosurgery oral boards preparation, GPT-4 achieved a score of 82.6%, outperforming ChatGPT (GPT-3.5) and Google Bard.
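As a rough illustration of the pairwise model comparisons reported in METHODS and RESULTS, the minimal Python sketch below runs χ² and Fisher exact tests on 2×2 correct/incorrect tables for the 149-question bank. It is not the authors' analysis code: the correct-answer counts for GPT-3.5 (93/149) and GPT-4 (123/149) are back-calculated from the reported percentages and are assumptions; only Bard's 66/149 is stated explicitly in the abstract.

```python
# Sketch only: pairwise accuracy comparison on the 149-question SANS Indications
# Examination, mirroring the chi-square / Fisher exact tests named in METHODS.
from scipy.stats import chi2_contingency, fisher_exact

N = 149
# Correct-answer counts: GPT-3.5 and GPT-4 values are assumptions back-calculated
# from the reported 62.4% and 82.6%; Bard's 66/149 is given in the abstract.
counts = {"GPT-3.5": 93, "GPT-4": 123, "Bard": 66}

def compare(model_a: str, model_b: str) -> None:
    """Test a 2x2 correct/incorrect contingency table for two models."""
    table = [
        [counts[model_a], N - counts[model_a]],
        [counts[model_b], N - counts[model_b]],
    ]
    chi2, p_chi2, _, _ = chi2_contingency(table)
    _, p_fisher = fisher_exact(table)
    print(f"{model_a} vs {model_b}: chi2 p={p_chi2:.3f}, Fisher exact p={p_fisher:.3f}")

compare("GPT-4", "GPT-3.5")   # abstract reports P = .023
compare("GPT-4", "Bard")      # abstract reports P < .01
compare("GPT-3.5", "Bard")    # abstract reports P < .01
```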

Journal

  • Neurosurgery, 93 (5), 1090-1098, 2023-06-12
    Ovid Technologies (Wolters Kluwer Health)
