Diagnostic performance of generative artificial intelligences for a series of complex case reports
- Takanobu Hirosawa, Dokkyo Medical University
- Yukinori Harada, Dokkyo Medical University
- Kazuya Mizuta, Dokkyo Medical University
- Tetsu Sakamoto, Dokkyo Medical University
- Kazuki Tokumasu, Okayama University
- Taro Shimizu, Dokkyo Medical University
Bibliographic information
- Publication date: 2024-07-21
- Resource type: journal article
- Rights information:
  - https://creativecommons.org/licenses/by-nc/4.0/
  - https://journals.sagepub.com/page/policies/text-and-data-mining-license
- DOI: 10.1177/20552076241265215
- Publisher: SAGE Publications
Description
Background: Diagnostic performance of generative artificial intelligences (AIs) using large language models (LLMs) across comprehensive medical specialties is still unknown.

Objective: We aimed to evaluate the diagnostic performance of generative AIs using LLMs in complex case series across comprehensive medical fields.

Methods: We analyzed published case reports from the American Journal of Case Reports from January 2022 to March 2023. We excluded pediatric cases and those primarily focused on management. We utilized three generative AIs to generate the top 10 differential-diagnosis (DDx) lists from case descriptions: the fourth-generation chat generative pre-trained transformer (ChatGPT-4), Google Gemini (previously Bard), and LLM Meta AI 2 (LLaMA2) chatbot. Two independent physicians assessed the inclusion of the final diagnosis in the lists generated by the AIs.

Results: Out of 557 consecutive case reports, 392 were included. The inclusion rates of the final diagnosis within top 10 DDx lists were 86.7% (340/392) for ChatGPT-4, 68.6% (269/392) for Google Gemini, and 54.6% (214/392) for LLaMA2 chatbot. The top diagnoses matched the final diagnoses in 54.6% (214/392) for ChatGPT-4, 31.4% (123/392) for Google Gemini, and 23.0% (90/392) for LLaMA2 chatbot. ChatGPT-4 showed higher diagnostic accuracy than Google Gemini (P < 0.001) and LLaMA2 chatbot (P < 0.001). Additionally, Google Gemini outperformed LLaMA2 chatbot within the top 10 DDx lists (P < 0.001) and as the top diagnosis (P = 0.010).

Conclusions: This study demonstrated the diagnostic performance of generative AIs including ChatGPT-4, Google Gemini, and LLaMA2 chatbot. ChatGPT-4 exhibited higher diagnostic accuracy than the other platforms. These findings suggest the importance of understanding the differences in diagnostic performance among generative AIs, especially in complex case series across comprehensive medical fields, like general medicine.
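The percentages reported in the Results can be rechecked from the raw counts. The following is a minimal Python sketch and is not part of the original record: the names `REPORTED_COUNTS`, `rate_percent`, and `summarize` are illustrative assumptions, and the code only reproduces the percentage arithmetic, not the statistical tests behind the reported P values.

```python
# Counts copied from the abstract; 392 case reports were included.
INCLUDED_CASES = 392

# model -> (final diagnosis within top 10 list, final diagnosis ranked first)
# Hypothetical container name; the values are those stated in the abstract.
REPORTED_COUNTS = {
    "ChatGPT-4": (340, 214),
    "Google Gemini": (269, 123),
    "LLaMA2 chatbot": (214, 90),
}


def rate_percent(hits: int, total: int) -> float:
    """Return hits/total as a percentage rounded to one decimal place."""
    return round(100 * hits / total, 1)


def summarize(counts: dict, total: int) -> dict:
    """Map each model to (top-10 inclusion rate, top-1 match rate) in percent."""
    return {
        model: (rate_percent(top10, total), rate_percent(top1, total))
        for model, (top10, top1) in counts.items()
    }


if __name__ == "__main__":
    for model, (top10, top1) in summarize(REPORTED_COUNTS, INCLUDED_CASES).items():
        print(f"{model}: top 10 = {top10}%, top 1 = {top1}%")
```

Running the script prints one line per model, which can be compared against the figures quoted in the abstract.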
Published in
- DIGITAL HEALTH
DIGITAL HEALTH 10 2024-07-21
SAGE Publications
Details
- CRID: 1360025430639525504
- ISSN: 20552076
- Material type: journal article
- Data source type:
  - Crossref
  - KAKEN
  - OpenAIRE

