Abstract
Clinical decision making in gastroenterology and hepatology has become increasingly complex and challenging for physicians. This growing complexity can be addressed by computational tools that support clinical decisions. Although numerous clinical decision support systems (CDSS) have emerged, they have struggled with real-world performance and generalizability, resulting in limited clinical adoption. Generative artificial intelligence (AI), particularly large language models (LLMs), is introducing new possibilities for CDSS by offering more flexible and adaptable support that better reflects complex clinical scenarios. LLMs can process unstructured text, including patient data and medical guidelines, and integrate diverse information sources with high accuracy, especially when combined with retrieval-augmented generation. Thus, LLMs can provide dynamic, context-specific support by generating personalized treatment recommendations, identifying potential complications from the patient history, and enabling natural language interactions with health-care providers. However, important challenges persist, particularly regarding biases, hallucinations, interoperability barriers, and adequate training of health-care providers. We examine the parallel evolution of the growing complexity of clinical management in gastroenterology and hepatology and the technical developments that led to current generative AI models. We discuss how these advances are converging to create effective CDSS, providing a conceptual basis for the further development and clinical adoption of these systems.
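To make the retrieval-augmented generation (RAG) pattern mentioned above concrete, the following minimal Python sketch shows the general idea of grounding an LLM prompt in retrieved guideline text. Everything in it is hypothetical and for illustration only: the example guideline passages, the keyword-overlap retrieval and the omitted LLM call are placeholders, and a deployed CDSS would use embedding-based retrieval over validated guideline corpora together with a privacy-preserving, regulator-compliant model.

```python
# Minimal, illustrative sketch of retrieval-augmented generation (RAG) for
# guideline-grounded clinical question answering. All passages, names and the
# retrieval strategy are hypothetical placeholders, not the authors' system.

from dataclasses import dataclass


@dataclass
class GuidelinePassage:
    source: str  # e.g. society guideline and section (placeholder text)
    text: str    # passage quoted verbatim from the guideline corpus


# Toy "knowledge base" of guideline passages (illustrative placeholders only).
KNOWLEDGE_BASE = [
    GuidelinePassage(
        "Example society guideline, HCC surveillance section",
        "Patients with cirrhosis should undergo surveillance for hepatocellular "
        "carcinoma with ultrasound every 6 months.",
    ),
    GuidelinePassage(
        "Example society guideline, post-polypectomy intervals",
        "After removal of 1-2 small tubular adenomas, repeat colonoscopy in "
        "7-10 years is suggested.",
    ),
]


def retrieve(question: str, k: int = 1) -> list[GuidelinePassage]:
    """Rank passages by naive keyword overlap; real systems use embeddings."""
    q_tokens = set(question.lower().split())
    scored = sorted(
        KNOWLEDGE_BASE,
        key=lambda p: len(q_tokens & set(p.text.lower().split())),
        reverse=True,
    )
    return scored[:k]


def build_prompt(question: str, passages: list[GuidelinePassage]) -> str:
    """Assemble a prompt that grounds the answer in the retrieved excerpts."""
    context = "\n".join(f"[{p.source}] {p.text}" for p in passages)
    return (
        "Answer the clinical question using ONLY the guideline excerpts below, "
        "and cite the source of every statement.\n\n"
        f"Guideline excerpts:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


if __name__ == "__main__":
    question = "What surveillance is recommended for a patient with cirrhosis?"
    prompt = build_prompt(question, retrieve(question))
    print(prompt)  # the prompt would then be sent to an LLM (call omitted here)
```

The essential design choice is that the model is instructed to answer only from the retrieved guideline excerpts and to cite them, which is the mechanism by which retrieval-augmented generation can reduce, although not eliminate, hallucinations.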
Acknowledgements
J.N.K. discloses support for the research and publication of this work from the EU's Horizon Europe Research and Innovation programme (GENIAL, 101096312) and the European Research Council (ERC; NADIR, grant number 101114631). J.C. discloses support for the research for this work from the Mildred-Scheel-Postdoktorandenprogramm of the German Cancer Aid (grant number 70115730).
Author information
Contributions
All the authors contributed equally to all aspects of the article.
Ethics declarations
Competing interests
J.N.K. declares consulting services for Bioptimus, Owkin, DoMore Diagnostics, Panakeia, AstraZeneca, Mindpeak and MultiplexDx; holds shares in StratifAI and Synagen; has received a research grant from GSK; and has received honoraria from AstraZeneca, Bayer, Daiichi Sankyo, Eisai, Janssen, Merck, MSD, BMS, Roche, Pfizer and Fresenius. I.C.W. has received honoraria from AstraZeneca. All other authors declare no competing interests.
Peer review
Peer review information
Nature Reviews Gastroenterology & Hepatology thanks Dennis Shung, who co-reviewed with Sunny Chung; Arsela Prelaj; and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.
Additional information
Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Related links
FDA-approved AI-enabled medical devices: https://www.fda.gov/medical-devices/software-medical-device-samd/artificial-intelligence-enabled-medical-devices
STAT AI Tracker: https://apps.statnews.com/ai-tracker/public/index.html
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wiest, I.C., Bhat, M., Clusmann, J. et al. Large language models for clinical decision support in gastroenterology and hepatology. Nat Rev Gastroenterol Hepatol (2025). https://doi.org/10.1038/s41575-025-01108-1
DOI: https://doi.org/10.1038/s41575-025-01108-1