Abstract
Artificial intelligence is increasingly used to answer questions in medical contexts. This study aimed to evaluate the performance, reliability, and precision of ChatGPT-4.0 in responding to multiple-choice questions (MCQs) previously administered to medical students. We conducted an observational, cross-sectional study that analyzed the chatbot’s accuracy, its association with specific knowledge areas and Bloom’s taxonomy levels, the influence of the items’ psychometric properties, and the effect of images on the results. Across the eight examinations analyzed, the chatbot’s performance ranged from 46.7% to 90.0% on the first attempt, 47.5% to 90.0% on the second, and 28.3% to 89.2% on the third. The concordance rate between attempts varied from 56.2% to 62.0%, with Cohen’s kappa coefficients ranging from 0.071 to 0.217. On the second and third attempts, basic science had the highest scores (90.0% and 93.3%, respectively), whereas surgery (55.8%) and pediatrics (43.4%) had the lowest. In summary, the chatbot performed worse than its human counterparts on these medical examinations and showed low reliability and precision in answering medical questions.
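For readers unfamiliar with the agreement statistics reported above, the short Python sketch below illustrates how a concordance rate and Cohen’s kappa can be computed between two repeated attempts on the same set of MCQs. It is a minimal illustration only, not the authors’ analysis code, and the answer data it uses are invented.

```python
# Minimal sketch (not the study's analysis pipeline): quantifying agreement
# between two hypothetical chatbot attempts on the same MCQ items using a raw
# concordance rate and Cohen's kappa.
from collections import Counter

# Hypothetical answer letters chosen by the chatbot on two attempts.
attempt_1 = ["A", "C", "B", "D", "A", "B", "C", "D", "A", "B"]
attempt_2 = ["A", "C", "D", "D", "B", "B", "C", "A", "A", "C"]

n = len(attempt_1)

# Observed agreement: proportion of items answered identically on both attempts.
p_o = sum(a == b for a, b in zip(attempt_1, attempt_2)) / n

# Chance agreement expected from each attempt's marginal answer frequencies.
freq_1, freq_2 = Counter(attempt_1), Counter(attempt_2)
p_e = sum((freq_1[c] / n) * (freq_2[c] / n) for c in set(freq_1) | set(freq_2))

# Cohen's kappa corrects the observed agreement for chance agreement.
kappa = (p_o - p_e) / (1 - p_e)

print(f"Concordance rate: {p_o:.1%}")   # 60.0% with the invented data above
print(f"Cohen's kappa:    {kappa:.3f}")
```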

Data Availability
All data generated or analyzed during this study are included in this published article.
Author information
Contributions
RF conceptualized and designed the study, was involved in data acquisition, and wrote the manuscript. JSR, LSR, and AMB contributed to the first draft of the manuscript. PHTF was involved in data acquisition, statistical analysis of the study data, and writing and revision of the manuscript.
Ethics declarations
Competing Interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Ferretti, R., Rizzi, J.S., Requena, L.S. et al. Factors Associated with Lower Performance of Artificial Intelligence on Answering Undergraduate Medical Education Multiple-Choice Questions. Med Sci Educ (2025). https://doi.org/10.1007/s40670-025-02426-4
Accepted:
Published:
DOI: https://doi.org/10.1007/s40670-025-02426-4