Large language models’ responses to liver cancer surveillance, diagnosis, and management questions: accuracy, reliability, readability

  • Hepatobiliary

Abstract

Purpose

To assess the accuracy, reliability, and readability of publicly available large language models in answering fundamental questions on hepatocellular carcinoma diagnosis and management.

Methods

Twenty questions on liver cancer diagnosis and management were each asked in triplicate to ChatGPT-3.5 (OpenAI), Gemini (Google), and Bing (Microsoft). Responses were assessed by six fellowship-trained physicians from three academic liver transplant centers who actively diagnose and/or treat liver cancer. Each response was categorized as accurate (score 1; all information is true and relevant), inadequate (score 0; all information is true but does not fully answer the question or provides irrelevant information), or inaccurate (score −1; any information is false). Means with standard deviations were recorded. A question was considered answered accurately if the mean score across its responses was > 0, and reliably if the mean score was > 0 for each of its three responses. Readability was quantified using the Flesch Reading Ease Score (FRES) and Flesch-Kincaid Grade Level (FKGL). Readability and accuracy across 60 responses were compared using one-way ANOVAs with Tukey’s multiple comparison tests.
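
For readers who want to see how these criteria could be computed, a minimal Python sketch follows (illustrative only, not the authors’ code). It applies the published Flesch formulas with a crude vowel-group syllable heuristic (the study itself used an online Flesch-Kincaid calculator), derives a per-question accuracy/reliability verdict from a 3 × 6 grid of response-by-reviewer ratings, and compares groups with SciPy’s one-way ANOVA and Tukey’s HSD; the function names and the example variables in the comments are hypothetical.

import re
import numpy as np
from scipy.stats import f_oneway, tukey_hsd

def flesch_scores(text):
    """Return (Flesch Reading Ease Score, Flesch-Kincaid Grade Level)."""
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(1, len(words))
    # Rough syllable count: groups of consecutive vowels in each word.
    syllables = sum(max(1, len(re.findall(r"[aeiouy]+", w.lower()))) for w in words)
    fres = 206.835 - 1.015 * (n_words / sentences) - 84.6 * (syllables / n_words)
    fkgl = 0.39 * (n_words / sentences) + 11.8 * (syllables / n_words) - 15.59
    return fres, fkgl

def question_verdict(scores):
    """scores: 3 x 6 array (responses x reviewers), entries in {-1, 0, 1}.
    Accurate: overall mean > 0. Reliable: mean > 0 for every response."""
    scores = np.asarray(scores, dtype=float)
    return scores.mean() > 0, bool((scores.mean(axis=1) > 0).all())

# Group comparison, e.g. with fres_chatgpt, fres_gemini, fres_bing each a
# list of readability scores (hypothetical variable names):
# print(f_oneway(fres_chatgpt, fres_gemini, fres_bing))   # one-way ANOVA
# print(tukey_hsd(fres_chatgpt, fres_gemini, fres_bing))  # Tukey's test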

Results

Of the 20 questions, ChatGPT answered 9 (45%), Gemini 12 (60%), and Bing 6 (30%) accurately; however, only 6 (30%), 8 (40%), and 3 (15%), respectively, were answered both accurately and reliably. There were no significant differences in accuracy among the chatbots. ChatGPT responses were the least readable (mean Flesch Reading Ease Score 29; college graduate reading level), followed by Gemini (30; college) and Bing (40; college; p < 0.001).

Conclusion

Large language models provide complex responses to basic questions on hepatocellular carcinoma diagnosis and management that are seldom accurate, reliable, or readable.

Abbreviations

FRES: Flesch Reading Ease Score
FKGL: Flesch-Kincaid Grade Level
HCC: Hepatocellular carcinoma
LLM: Large language model
LI-RADS: Liver Imaging Reporting and Data System

Funding

Aya Kamaya receives book royalties from Elsevier and grant support from Canon Medical Systems. Justin R. Tse receives grant support from GE Healthcare and Bayer Healthcare and is a consultant for Intuitive Surgical, Inc., AbSolutions Med, Inc., and Ascelia Pharma AB. No funding was received to assist with the preparation of this manuscript.

Author information

Corresponding author

Correspondence to Justin R. Tse.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary Material 1

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Cao, J.J., Kwon, D.H., Ghaziani, T.T. et al. Large language models’ responses to liver cancer surveillance, diagnosis, and management questions: accuracy, reliability, readability. Abdom Radiol 49, 4286–4294 (2024). https://doi.org/10.1007/s00261-024-04501-7
