Fig. 3: Recall of “None of the above” of models on the MetaMedQA benchmark, including the 95% confidence interval. | Nature Communications

From: Large Language Models lack essential metacognition for reliable medical reasoning

Representative statistical significance was determined using a one-way ANOVA with a Tukey correction for multiple comparisons and is indicated by asterisks above the brackets (* p < 0.05 and **** p < 0.0001; Meerkat 7b vs Qwen2 7b, p = 0.012). Results are presented as mean values ± 95% CI (n = 115). Models of the same family share the same color. Source data are provided as a Source Data file.
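As an illustration of the "mean ± 95% CI (n = 115)" summary, the following minimal sketch computes a recall value and a confidence interval half-width from per-question outcomes. The caption does not state how the interval was computed, so the normal approximation used here is an assumption. The function name `mean_with_ci95` and the hit count are hypothetical and are not taken from the Source Data file. The ANOVA and Tukey comparisons are not reproduced.

```python
import math


def mean_with_ci95(outcomes):
    """Return (mean, half_width) of a 95% CI via the normal approximation.

    Assumption: the caption does not specify the CI method; this uses
    mean +/- 1.96 * standard error, one common choice.
    """
    n = len(outcomes)
    mean = sum(outcomes) / n
    # Sample variance with Bessel's correction (divide by n - 1).
    variance = sum((x - mean) ** 2 for x in outcomes) / (n - 1)
    standard_error = math.sqrt(variance / n)
    return mean, 1.96 * standard_error


# Hypothetical per-question outcomes for one model:
# 1 = correctly answered "None of the above", 0 = missed.
# The split 46 / 69 is made up; only n = 115 comes from the caption.
outcomes = [1] * 46 + [0] * 69

recall, half_width = mean_with_ci95(outcomes)
print(f"recall = {recall:.3f} ± {half_width:.3f} (n = {len(outcomes)})")
```

With binary outcomes the mean equals the recall itself, so the interval width depends only on the recall and on n.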