Abstract
Purpose
To assess the accuracy, reliability, and readability of publicly available large language models in answering fundamental questions on hepatocellular carcinoma diagnosis and management.
Methods
Twenty questions on liver cancer diagnosis and management were asked in triplicate to ChatGPT-3.5 (OpenAI), Gemini (Google), and Bing (Microsoft). Responses were assessed by six fellowship-trained physicians from three academic liver transplant centers who actively diagnose and/or treat liver cancer. Each response was categorized as accurate (score 1; all information is true and relevant), inadequate (score 0; all information is true but does not fully answer the question or includes irrelevant information), or inaccurate (score −1; any information is false). Means with standard deviations were recorded. A question was considered answered accurately if the mean score across all of its responses was > 0, and reliably if the mean score of each of the three responses to that question was > 0. Readability was quantified using the Flesch Reading Ease Score and Flesch-Kincaid Grade Level. Readability and accuracy across the 60 responses were compared using one-way ANOVAs with Tukey's multiple comparison tests.
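To make the scoring and statistical approach concrete, the following is a minimal Python sketch. The Flesch Reading Ease and Flesch-Kincaid Grade Level formulas are the standard published equations; everything else (function names such as classify_question, the placeholder reviewer scores, and the simulated readability values) is hypothetical and does not reproduce the study's data.

```python
# Minimal sketch of the scoring and readability analysis described in the Methods.
# Placeholder values only; function names and simulated data are illustrative.
import numpy as np
from scipy.stats import f_oneway
from statsmodels.stats.multicomp import pairwise_tukeyhsd

def fres(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease Score; higher values are easier to read
    (scores near 30 correspond to college-graduate-level text)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def fkgl(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level; approximate U.S. school grade needed to read the text."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def classify_question(scores: np.ndarray) -> dict:
    """scores: 3 responses x 6 reviewers, each rating in {-1, 0, 1}.
    A question is 'accurate' if the mean over all 18 ratings is > 0,
    and 'reliable' if, in addition, each of the three responses has a mean rating > 0."""
    response_means = scores.mean(axis=1)
    overall_mean = scores.mean()
    accurate = overall_mean > 0
    reliable = accurate and bool((response_means > 0).all())
    return {"mean": overall_mean, "sd": scores.std(ddof=1),
            "accurate": accurate, "reliable": reliable}

# Illustrative readability comparison across 60 responses (20 questions x 3 runs)
# per chatbot: one-way ANOVA followed by Tukey's multiple comparison test.
rng = np.random.default_rng(0)
chatgpt = rng.normal(29, 5, 60)  # placeholder FRES values, not study data
gemini = rng.normal(30, 5, 60)
bing = rng.normal(40, 5, 60)

f_stat, p_value = f_oneway(chatgpt, gemini, bing)
all_scores = np.concatenate([chatgpt, gemini, bing])
groups = ["ChatGPT"] * 60 + ["Gemini"] * 60 + ["Bing"] * 60
print(pairwise_tukeyhsd(all_scores, groups, alpha=0.05))
```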
Results
Of the twenty questions, ChatGPT answered nine (45%), Gemini answered 12 (60%), and Bing answered six (30%) accurately; however, only six (30%), eight (40%), and three (15%), respectively, were answered both accurately and reliably. There were no significant differences in accuracy among the chatbots. ChatGPT responses were the least readable (mean Flesch Reading Ease Score 29; college-graduate level), followed by Gemini (30; college) and Bing (40; college; p < 0.001).
Conclusion
Large language models provide complex responses to basic questions on hepatocellular carcinoma diagnosis and management that are seldom accurate, reliable, or readable.
Abbreviations
- FRES: Flesch Reading Ease Score
- FKGL: Flesch-Kincaid Grade Level
- HCC: Hepatocellular carcinoma
- LLM: Large language model
- LI-RADS: Liver Imaging Reporting and Data System
Funding
Aya Kamaya receives book royalties from Elsevier and grant support from Canon Medical Systems. Justin R Tse receives grant support from GE Healthcare and Bayer Healthcare, and is a consultant for Intuitive Surgical, Inc., AbSolutions Med, Inc., and Ascelia Pharma AB. No funding was received to assist with the preparation of this manuscript.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Cao, J.J., Kwon, D.H., Ghaziani, T.T. et al. Large language models’ responses to liver cancer surveillance, diagnosis, and management questions: accuracy, reliability, readability. Abdom Radiol 49, 4286–4294 (2024). https://doi.org/10.1007/s00261-024-04501-7