Are You Asking GPT-4 Medical Questions Properly? - Prompt Engineering in Consistency and Reliability with Evidence-Based Guidelines for ChatGPT-4: A Pilot Study

https://doi.org/10.21203/rs.3.rs-3336823/v1

This work is licensed under a CC BY 4.0 License


Background

GPT-4 is a newly developed large language model that has been preliminarily applied in the medical field. However, the relevant theoretical knowledge from computer science has not been effectively transferred to its application in the medical field.

Objective

To explore the application of prompt engineering in GPT-4 and to examine the reliability of GPT-4.

Methods

Different styles of prompts were designed and used to ask GPT-4 questions about its agreement with the American Academy of Orthopaedic Surgeons (AAOS) evidence-based guidelines on osteoarthritis (OA). Each question was asked 5 times. We compared consistency with the guidelines across different evidence levels for the different prompts and assessed the reliability of each prompt by asking the same question 5 times.

Results

The ROT prompt performed notably well for strong recommendations, with a total consistency of 77.5%, and showed steadier performance than the other prompts at the other levels of evidence. The reliability of GPT-4 was not stable across prompts (Fleiss kappa ranged from 0.334 to 0.525, and Kendall's coefficient of concordance ranged from 0.701 to 0.814).

Conclusions

The application of prompt engineering could improve the performance of GPT-4 in medicine. The reliability of GPT-4 in answering medical questions is not clear, and further research is necessary.

Large language models (LLMs) have shown good capabilities in various natural language processing (NLP) tasks, such as summarization, translation, code synthesis, and even logical reasoning.1-3 ChatGPT is one of the most representative LLMs, and the latest version, GPT-4, is considered to have a larger training volume and stronger processing ability.1-3 It has been involved in medical studies in areas such as case diagnosis, medical examinations, and guideline consistency assessment.4-7

However, the current performance of GPT-4 in the medical field is not perfect. In the diagnosis of complex cases, GPT-4's diagnosis was consistent with the final diagnosis in 39% of the cases, and it showed an average consistency of 60% with the guidelines for digestive system diseases.4,6 Further research and exploration on how to optimize its performance in the medical field are necessary.1,4,6 Self-consistency has always been a crucial parameter for assessing the performance of LLMs.8 However, the reliability of GPT-4 in providing professional medical knowledge has not been investigated, and the answers provided by LLMs may not always be stable.

Prompt engineering is a new discipline that focuses on the development and optimization of prompts, thereby helping users apply LLMs to various scenarios and research fields. In computer science, LLMs can produce ideal and stable answers through prompt engineering, and adopting different prompts affects the performance of LLMs, as has been shown for mathematical problems.8-11 Recently developed prompt designs include chain of thought (COT) and tree of thoughts (TOT).10,11 With the proposal of the COT and TOT theories in the computer science LLM field, corresponding prompts have been developed and have exhibited improved performance on mathematical problems.10,11 To illustrate these designs, we applied COT- and TOT-style prompts to a simple math problem (Fig. 1).10-12

However, no previous studies have explored differences in the performance of these existing prompts on medical questions, nor have any studies examined whether prompts need to be developed specifically for medical questions. In summary, the application of LLMs in medicine is currently thriving, but most current research focuses more on the results of using LLMs than on how to use LLMs better. Testing the reliability of LLMs in answering medical questions, using different prompts, and even developing prompts specifically for medical questions could fundamentally change the application of LLMs in medicine and future research.

Therefore, it is important to investigate whether and how prompt engineering can improve the performance of LLMs in answering medical questions. The 2019 Global Burden of Disease study identified osteoarthritis (OA) as one of the most prevalent and debilitating diseases.13 As OA is a common disease with a large patient population, patients and doctors may seek relevant professional knowledge online, including from GPT-4. Therefore, investigating GPT-4's performance on OA-related questions serves as an appropriate example of how answer quality can be improved through prompt engineering.

This study aims to develop prompts tailored to professional medical questions and to test the performance of GPT-4 with different prompt techniques in answering OA-related medical questions. It also aims to explore the reliability of GPT-4 with different prompt techniques.

Disease selection and evidence-based CPG selection

GPT-4 was used on July 20, 2023, to test consistency with the American Academy of Orthopaedic Surgeons (AAOS) evidence-based clinical practice guideline (CPG) for OA. With more than 39,000 members, the AAOS is the world's largest medical association of musculoskeletal specialists,14 and the OA guideline provided by the AAOS is backed by detailed evidence and review reports.15 The OA guideline has a detailed evidence assessment system based on research evidence and covers various management recommendations, including drug treatment for OA, physical therapy, and patient education. It is an authoritative and comprehensive guide; more detailed information can be found in the complete version of the OA guideline.16

Prompt design

Based on the current application of prompt engineering in computer science and the task of this study, four types of prompts were applied based on the COT and TOT theories: IO, 0-COT, P-COT (performed COT) and ROT (reflection of thoughts). These prompts were used to test the agreement of GPT-4's answers with the AAOS guidelines and to assess the reliability of the answers across repeated requests. GPT-4 was tasked with generating one concise rating as the final output and providing a valid explanation. In P-COT, the listed steps were designed based on the evaluation process of evidence-based medicine and on the steps GPT-4 generated in the 170 responses to 0-COT. ROT simulates the way multiple doctors discuss and reach a conclusion in medical practice: it simulates three experts discussing a medical problem and adjusting their previous thoughts during the discussion until a consensus is reached. For the content and illustration of the four prompts, please refer to Fig. 2 and Supplementary File 1.
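
Purely to illustrate how the four prompt styles differ in structure, a minimal sketch of hypothetical Python templates is given below; the wording is an illustrative assumption, not the verbatim prompts of the study, which are given in Fig. 2 and Supplementary File 1.

    # Hypothetical templates illustrating the four prompt styles; the study's
    # verbatim prompts are in Fig. 2 and Supplementary File 1.
    ADVICE = ("Lateral wedge insoles are not recommended for patients "
              "with knee osteoarthritis.")

    PROMPTS = {
        # IO: plain input-output, no intermediate reasoning requested.
        "IO": ("Rate your agreement with the following advice as strong, moderate, "
               "limited, or consensus, and explain briefly: {advice}"),
        # 0-COT: zero-shot chain of thought.
        "0-COT": ("Rate your agreement with the following advice as strong, moderate, "
                  "limited, or consensus. Let's think step by step. Advice: {advice}"),
        # P-COT: chain of thought with predefined evidence-based-medicine steps.
        "P-COT": ("Rate your agreement with the following advice as strong, moderate, "
                  "limited, or consensus. Follow these steps: (1) identify the clinical "
                  "question, (2) recall the relevant evidence, (3) appraise its quality, "
                  "(4) weigh benefits and harms, (5) give the final rating. Advice: {advice}"),
        # ROT: three simulated experts discuss, revise earlier thoughts, and stop
        # only when they reach a consensus rating.
        "ROT": ("Imagine three orthopaedic experts rating agreement with the advice below "
                "(strong, moderate, limited, or consensus). Each expert shares one step of "
                "reasoning at a time, reviews the others' reasoning, revises any step found "
                "to be wrong, and the discussion continues until a consensus rating is "
                "reached. Advice: {advice}"),
    }

    print(PROMPTS["ROT"].format(advice=ADVICE))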

Data collection and data process

Each item from the AAOS guideline was reformatted as a question for GPT-4, and each response gave a level of recommendation with an explanation. The answers provided by GPT-4 were compared to the AAOS guideline, and the offset between each level provided by GPT-4 and the corresponding AAOS level was recorded, as shown in Fig. 3. The answer levels given by GPT-4 for the different prompts are summarized in Supplementary File 2. We extracted 34 items of advice (Supplementary File 3) from the evidence-based OA CPG provided by the AAOS. Each piece of advice was asked about 5 times, and each question was asked in a separate dialogue box to avoid the influence of context on the answers. In total, each prompt was asked 170 times, and the four prompts were asked a total of 680 times. The answer to each question was recorded, and answers that did not follow the instructions of the prompt were considered invalid data. For the answers to the four prompts, please refer to Supplementary Files 4-7.
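
As a minimal sketch of how the level gap can be computed from the recorded ratings, the snippet below uses an assumed ordinal coding (strong = 3 down to consensus = 0); the coding is an illustration, not a reproduction of the scoring sheet.

    # Assumed ordinal coding of the AAOS recommendation levels.
    LEVELS = {"consensus": 0, "limited": 1, "moderate": 2, "strong": 3}

    def level_gap(gpt4_rating: str, aaos_rating: str) -> int:
        """Absolute difference between two ordinal recommendation levels."""
        return abs(LEVELS[gpt4_rating.lower()] - LEVELS[aaos_rating.lower()])

    # Example: a 'moderate' answer to a 'strong' AAOS recommendation gives a gap of 1.
    print(level_gap("moderate", "strong"))  # -> 1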

Statistical analysis

Statistical analysis was conducted using SPSS 23.0 (IBM, New York, NY, USA) and GraphPad Prism version 8 (GraphPad Software, CA, USA). The statistical analysis was divided into two sections: the assessment of consistency with the guidelines and the assessment of reliability.

In the statistical analysis of consistency with the guidelines, the absolute difference between the level of each GPT-4 answer and the corresponding level of the AAOS recommendation can be 0, 1, 2, or 3, so the collected data are ordinal. To compare average consistency with the guidelines among the different prompts, the Kruskal-Wallis test was used.17 In this test, the rank orders corresponding to difference levels 0, 1, 2, and 3 are 4, 3, 2, and 1, respectively; in a comparison between two groups, the group with the higher mean rank and a p value less than 0.05 can statistically be considered the group with higher average consistency with the guideline. For multiple comparisons, corrections were made using Dunn's test.18 To further compare absolute consistency, we grouped the data into a category with a level difference of 0 and another with a level difference not equal to 0 and then conducted chi-square tests.19 Bonferroni correction was used for multiple comparisons.20
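
As an illustration of this consistency analysis, the minimal sketch below (Python with scipy and toy data, not the SPSS/Prism workflow actually used) runs the omnibus Kruskal-Wallis test on the level gaps and the pairwise 2 x 2 chi-square tests with a Bonferroni-adjusted significance level; Dunn's post hoc comparisons are omitted for brevity.

    from itertools import combinations
    from scipy.stats import kruskal, chi2_contingency

    # Level gaps (0-3) per prompt; toy data for illustration only.
    gaps = {
        "IO":    [0, 1, 1, 2, 0],
        "0-COT": [0, 0, 1, 1, 0],
        "P-COT": [0, 0, 0, 1, 0],
        "ROT":   [0, 0, 0, 0, 1],
    }

    # Omnibus Kruskal-Wallis test across the four prompts (the rank direction
    # chosen for the gaps does not affect the test statistic).
    h_stat, p_value = kruskal(*gaps.values())
    print(f"Kruskal-Wallis H = {h_stat:.3f}, p = {p_value:.4f}")

    # Absolute consistency: 2 x 2 chi-square tests (gap == 0 vs. gap != 0) for
    # each pair of prompts, with a Bonferroni-adjusted significance level.
    pairs = list(combinations(gaps, 2))
    alpha_bonf = 0.05 / len(pairs)  # 0.05 / 6, i.e. roughly 0.008 as in the paper
    for a, b in pairs:
        table = [[sum(g == 0 for g in gaps[p]), sum(g != 0 for g in gaps[p])]
                 for p in (a, b)]
        chi2, p, _, _ = chi2_contingency(table, correction=False)
        print(f"{a} vs. {b}: chi2 = {chi2:.3f}, p = {p:.3f} (alpha = {alpha_bonf:.3f})")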

The statistical analysis of reliability included calculating Kendall's coefficient of concordance and Fleiss kappa.21-23 Generally, a coefficient below 0.2 indicates poor reliability; 0.2 to 0.4, fair reliability; 0.4 to 0.6, moderate reliability; 0.6 to 0.8, strong reliability; and 0.8 to 1.0, almost perfect reliability. Kendall's coefficient of concordance considers the order of the data, whereas Fleiss kappa only considers whether the data are assigned to the same classification. Invalid data were treated according to missing data procedures, using methods such as imputation and listwise deletion, in the statistical analysis.24
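
For readers who wish to reproduce the two reliability metrics, the minimal sketch below (Python with statsmodels and scipy on toy data; tie corrections are omitted, and this is not the software actually used) computes Fleiss kappa from the item-by-repetition ratings and Kendall's coefficient of concordance from its rank-based definition.

    import numpy as np
    from scipy.stats import rankdata
    from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

    # Rows = guideline items, columns = the 5 repeated answers, coded 0-3
    # (toy data for illustration only).
    ratings = np.array([
        [3, 3, 2, 3, 3],
        [1, 1, 1, 0, 1],
        [2, 2, 3, 2, 2],
        [0, 0, 0, 0, 1],
    ])

    # Fleiss kappa: agreement on the categorical level, ignoring order.
    counts, _ = aggregate_raters(ratings)  # item x category counts
    print("Fleiss kappa:", fleiss_kappa(counts))

    # Kendall's W: rank-based concordance across the 5 repetitions
    # (tie correction omitted for brevity).
    def kendalls_w(data: np.ndarray) -> float:
        n_items, n_raters = data.shape
        ranks = np.column_stack([rankdata(data[:, j]) for j in range(n_raters)])
        rank_sums = ranks.sum(axis=1)
        s = ((rank_sums - rank_sums.mean()) ** 2).sum()
        return 12 * s / (n_raters ** 2 * (n_items ** 3 - n_items))

    print("Kendall's W:", kendalls_w(ratings))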

Consistency with the guideline

The distribution of the difference values between GPT-4 and the AAOS at the different levels of recommendation is shown in Fig. 4. The multiple comparisons of average consistency are shown in Tables 1-3, and the multiple comparisons of absolute consistency are shown in Tables 4-6.

Table 1. Multiple comparisons of average consistency at the strong level (Kruskal-Wallis test; Kruskal-Wallis statistic: 25.65; p < 0.0001*)

Multiple comparisons (A vs. B) | Mean rank A | Mean rank B | Z | p | Adjusted p
IO (n = 39) vs. 0-COT (n = 39) | 54.65 | 75.42 | 2.353 | 0.0186* | 0.1117
IO vs. P-COT (n = 40) | 54.65 | 90.71 | 4.111 | < 0.0001* | 0.0002*
IO vs. ROT (n = 39) | 54.65 | 94.91 | 4.561 | < 0.0001* | < 0.0001*
0-COT vs. P-COT | 75.42 | 90.71 | 1.743 | 0.0813 | 0.4879
0-COT vs. ROT | 75.42 | 94.91 | 2.208 | 0.0273* | 0.1636
P-COT vs. ROT | 90.71 | 94.91 | 0.4786 | 0.6322 | > 0.9999
* p < 0.05, α = 0.05

 
Table 2. Multiple comparisons of average consistency at the moderate level (Kruskal-Wallis test; Kruskal-Wallis statistic: 1.24; p = 0.744)

Multiple comparisons (A vs. B) | Mean rank A | Mean rank B | Z | p | Adjusted p
IO (n = 40) vs. 0-COT (n = 40) | 82.81 | 76.85 | 0.6927 | 0.4885 | > 0.9999
IO vs. P-COT (n = 40) | 82.81 | 77.54 | 0.6128 | 0.5400 | > 0.9999
IO vs. ROT (n = 40) | 82.81 | 84.80 | 0.2309 | 0.8174 | > 0.9999
0-COT vs. P-COT | 76.85 | 77.54 | 0.07987 | 0.9363 | > 0.9999
0-COT vs. ROT | 76.85 | 84.80 | 0.9235 | 0.3557 | > 0.9999
P-COT vs. ROT | 77.54 | 84.80 | 0.8437 | 0.3988 | > 0.9999
* p < 0.05, α = 0.05

 
Table 3. Multiple comparisons of average consistency at the limited level (Kruskal-Wallis test; Kruskal-Wallis statistic: 24.49; p < 0.0001*)

Multiple comparisons (A vs. B) | Mean rank A | Mean rank B | Z | p | Adjusted p
IO (n = 78) vs. 0-COT (n = 76) | 179.0 | 161.7 | 1.472 | 0.1411 | 0.8467
IO vs. P-COT (n = 80) | 179.0 | 124.5 | 4.703 | < 0.0001* | < 0.0001*
IO vs. ROT (n = 80) | 179.0 | 165.7 | 1.149 | 0.2508 | > 0.9999
0-COT vs. P-COT | 161.7 | 124.5 | 3.191 | 0.0014* | 0.0085*
0-COT vs. ROT | 161.7 | 165.7 | 0.3399 | 0.7339 | > 0.9999
P-COT vs. ROT | 124.5 | 165.7 | 3.577 | 0.0003* | 0.0021*
* p < 0.05, α = 0.05

 
Table 4. Multiple comparisons of absolute consistency at the strong level.

Comparison (A vs. B) | Level gap ≠ 0 (A; B) | Level gap = 0 (A; B) | χ2 | p
0-COT vs. IO | 19; 28 | 21 (52.5%); 12 (30%) | 4.178 | 0.041*
0-COT vs. P-COT | 19; 10 | 21 (52.5%); 30 (75%) | 4.381 | 0.036*
0-COT vs. ROT | 19; 9 | 21 (52.5%); 31 (77.5%) | 5.495 | 0.019*
IO vs. P-COT | 28; 10 | 12 (30%); 30 (75%) | 16.241 | 0.000**
IO vs. ROT | 28; 9 | 12 (30%); 31 (77.5%) | 18.152 | 0.000**
P-COT vs. ROT | 10; 9 | 30 (75%); 31 (77.5%) | 0.069 | 0.793
* p < 0.05; ** p < 0.008; α = 0.05; significance level after Bonferroni correction: α = 0.008

 
Table 5. Multiple comparisons of absolute consistency at the moderate level.

Comparison (A vs. B) | Level gap ≠ 0 (A; B) | Level gap = 0 (A; B) | χ2 | p
0-COT vs. IO | 28; 25 | 12 (30%); 15 (37.5%) | 0.503 | 0.478
0-COT vs. P-COT | 28; 27 | 12 (30%); 13 (32.5%) | 0.058 | 0.809
0-COT vs. ROT | 28; 24 | 12 (30%); 16 (40%) | 0.879 | 0.348
IO vs. P-COT | 25; 27 | 15 (37.5%); 13 (32.5%) | 0.220 | 0.639
IO vs. ROT | 25; 24 | 15 (37.5%); 16 (40%) | 0.053 | 0.818
P-COT vs. ROT | 27; 24 | 13 (32.5%); 16 (40%) | 0.487 | 0.485
* p < 0.05; ** p < 0.008; α = 0.05; significance level after Bonferroni correction: α = 0.008

 
Table 6. Multiple comparisons of absolute consistency at the limited level.

Comparison (A vs. B) | Level gap ≠ 0 (A; B) | Level gap = 0 (A; B) | χ2 | p
0-COT vs. IO | 24; 14 | 56 (70%); 66 (82.5%) | 3.451 | 0.063
0-COT vs. P-COT | 24; 40 | 56 (70%); 40 (50%) | 6.667 | 0.010*
0-COT vs. ROT | 24; 20 | 56 (70%); 60 (75%) | 0.502 | 0.479
IO vs. P-COT | 14; 40 | 66 (82.5%); 40 (50%) | 18.896 | 0.000**
IO vs. ROT | 14; 20 | 66 (82.5%); 60 (75%) | 1.345 | 0.246
P-COT vs. ROT | 40; 20 | 40 (50%); 60 (75%) | 10.667 | 0.001**
* p < 0.05; ** p < 0.008; α = 0.05; significance level after Bonferroni correction: α = 0.008

Strong level

Eight pieces of advice were rated as strong by the AAOS guideline, yielding 40 responses for each prompt. In the multiple comparisons of average consistency, the adjusted p values indicate that the average consistency of ROT and P-COT was significantly higher than that of IO (Table 1). In the multiple comparisons of absolute consistency (Table 4), after Bonferroni correction, the rate of a 0 level gap with ROT (77.5%) and P-COT (75%) was significantly higher than that with IO (30%).

Moderate level

Eight pieces of advice were rated as moderate, yielding 40 responses for each prompt. In the multiple comparisons of average consistency and absolute consistency (Tables 2 and 5), there was no significant difference between groups.

Limited level

Sixteen pieces of advice had a limited recommendation rating, yielding 80 responses for each prompt. In the multiple comparisons of average consistency, the adjusted p values indicate that the average consistency of P-COT was significantly lower than that of the other groups (Table 3). In the multiple comparisons of absolute consistency (Table 6), after Bonferroni correction, the rate of a 0 level gap with P-COT (50%) was significantly lower than that with ROT (75%) and IO (82.5%).

Consensus level

Two pieces of advice had a recommendation level of consensus. Considering the small sample size, no statistical test was conducted; the distribution of responses is shown in Fig. 4(d).

The reliability

Kendall's coefficient of concordance and Fleiss kappa of each prompt are summarized in Table 7. When the order of data is considered, the results of Kendall's coefficient of concordance indicate that the reliability of each prompt is relatively strong (0.701 to 0.814). When only considering whether GPT-4 output the same level of recommendation in each response, the results of Fleiss kappa indicate that the reliability of each prompt is fair to moderate (0.334 to 0.525).

Table 7. Reliability of GPT-4.

Prompt | Kendall's coefficient of concordance | Fleiss kappa (95% CI)
IO | 0.814 | 0.525 (0.523-0.527)
0-COT | 0.767 | 0.450 (0.448-0.452)
P-COT | 0.701 | 0.334 (0.332-0.337)
ROT | 0.772 | 0.467 (0.465-0.470)

Invalid data and corresponding processing measures

A total of 14 invalid data points appeared, falling into two categories.

Category A: the final rating was not provided (twice in IO, 5 times in 0-COT).

Category B: the rating was not an integer (twice in IO, 4 times in 0-COT, and once in ROT).

All invalid data were processed according to missing data procedures.24

In the multiple comparisons of average consistency, all invalid data were treated as missing data. In the multiple comparisons of absolute consistency, all invalid data were grouped into the "level gap is not 0" category. In the calculation of Fleiss kappa, invalid data of category A were treated as an independent classification, and invalid data of category B were treated as separate classifications based on the values generated by GPT-4. In the calculation of Kendall's coefficient of concordance, 4 invalid data points of category A occurred for the same item, so listwise deletion was adopted for that item; the other invalid data of category A were handled by mean imputation. Invalid data of category B were entered as the values generated by GPT-4.
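
As a minimal sketch of the two missing-data strategies mentioned above (an illustration with pandas and an assumed data layout, not the exact procedure applied in SPSS), invalid answers can be marked as missing, items with too few valid answers dropped listwise, and the remaining gaps filled by mean imputation.

    import numpy as np
    import pandas as pd

    # Rows = guideline items, columns = the 5 repeated answers; NaN marks an
    # invalid answer (toy data for illustration only).
    df = pd.DataFrame(
        [[3, 3, np.nan, 3, 3],
         [np.nan, np.nan, np.nan, np.nan, 2],   # 4 of 5 invalid -> listwise deletion
         [2, 2, 3, 2, 2]],
        columns=[f"reply_{i}" for i in range(1, 6)],
    )

    # Listwise deletion: drop items with fewer than 2 valid answers (assumed rule
    # mirroring the item in which 4 of the 5 answers were invalid).
    complete = df.dropna(thresh=2)

    # Mean imputation: fill the remaining gaps with the item's own mean rating.
    imputed = complete.apply(lambda row: row.fillna(row.mean()), axis=1)
    print(imputed)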

Considering that the proportion of invalid data (14/680) was small and appropriate processing measures were adopted, the resulting statistical error is acceptable.

Principal Findings

The results of this study suggested that prompt engineering may enhance the accuracy of GPT-4 in answering medical questions. Additionally, GPT-4 does not always provide the same answer to the same medical questions. The novel ROT prompt outperformed other prompts in providing professional OA knowledge consistent with clinical guidelines.

Consistency with the guideline

The performance of the different prompts varied. According to the statistical results, ROT performed the most evenly and prominently: at the 'strong' level it was superior to IO, and it was not significantly inferior to the other prompts at the other levels. In contrast, although P-COT's answers at the 'strong' level were better than IO's, its performance at the 'limited' level was significantly worse than that of the other prompts.

Development of ROT for medical-related questions

We developed the ROT prompt to mimic a medical consultation involving multiple doctors in a problem-based setting. It may utilize GPT-4's computational power to a greater extent and enhance the robustness of the answers. As a direct indication, the total word count of Supplementary Files 4-7 increased substantially across prompts (45,547 words for IO, 76,592 for 0-COT, 94,853 for P-COT, and 162,329 for ROT). Additionally, the ROT prompt asked GPT-4 to return to previous thoughts and examine whether they were appropriate, which may improve the reliability and robustness of the answers and reduce AI hallucinations.

Furthermore, the design of ROT can minimize the occurrence of egregiously incorrect answers from GPT-4. For instance, regarding the 'strong'-level suggestion "Lateral wedge insoles are not recommended for patients with knee osteoarthritis," ROT provided four 'strong' answers and one 'moderate' answer in five responses. In the 'moderate' response (Supplementary File 8), two "experts" initially answered "limited" and one answered "moderate"; after "discussion", all "experts" agreed on a 'moderate' recommendation. The final reason was that even though there was high-quality evidence to support the advice, there might still be a slight potential benefit for some individuals. Notably, the reasons given by the two "experts" who answered "limited" seem more in line with the statement "Lateral wedge insoles are recommended for patients with knee osteoarthritis." This implies that these two "experts" did not fully understand the medical advice, as "Expert C" mentioned in step five: "Observes that the results are somewhat mixed, but there's a general agreement that the benefits, if any, from lateral wedge insoles are limited." However, after the "discussion", the final revised recommendation and reason were deemed acceptable. Referring to the application of TOT in the 24-point game,11 prompts designed in the style of TOT, like the ROT used in this study, can offer more possibilities at every step of the task, aiming to induce GPT-4 to generate more accurate answers.

The statistical results for suggestions of other intensities are mixed. One reason might be that these medical recommendations themselves are controversial. GPT's answers can provide various perspectives. For instance, a recommendation from the AAOS that was rated as "moderate" states: “Arthroscopic partial meniscectomy can be used for the treatment of meniscal tears in patients with concomitant mild to moderate osteoarthritis who have failed physical therapy or other nonsurgical treatments.” There has been an editorial25 suggesting that the supporting evidence for this recommendation is limited, and the phrasing of the recommendation itself is problematic. GPT-4, in its answers across different prompts, tends to hover between limited and moderate, with reasons related to efficacy, quality of evidence, and other related factors.

Reliability

If we only consider whether the answers generated by GPT-4 are the same each time, the Fleiss kappa values for each prompt show fair to moderate reliability. This could be related to GPT-4's parameter settings: the 'temperature' setting allows GPT-4 to produce flexible text outputs. Moreover, medical questions are not like mathematical problems; the process of logical reasoning is versatile and mutable. Unlike the 24-point game, which can be definitively solved in just three steps, there is no completely determined process for the evidence-based evaluation of a piece of medical advice, as reflected in the variability of the steps generated in 0-COT. Therefore, the answers from GPT-4 tend to be more flexible and diverse.

However, when considering the order of the data, Kendall's coefficient of concordance for each prompt shows that GPT-4's answers have a high degree of internal consistency. Combined with the previous analysis of consistency with the guideline, this implies that for recommendations rated as "strong" by the guideline, GPT-4 might rate them as "moderate", but it is unlikely to rate them as "consensus". In conclusion, even though GPT-4 might not provide the same answer every time, asking with an appropriate prompt can yield relatively accurate responses.

Two previous studies6,7 briefly described reliability. Yoshiyasu et al.7 only reproduced inaccurate responses, and Walker et al.6 reported that the internal concordance of the provided information was complete (100%). In this study, reliability was investigated by asking GPT-4 the same question five times and analysing the results. Our results suggest that GPT-4 does not always provide consistent answers to the same medical question.

How to better use GPT-4?

The results of this study suggest that the prompt technique and the number of enquiries significantly influence the quality and reliability of the answers provided by GPT-4 to professional medical questions. ROT may be recommended for medical questions, especially in-depth ones. It is also recommended to ask GPT-4 the same question several times to obtain more comprehensive answers; users can keep asking the same question until GPT-4 no longer provides new information. A template of the ROT prompt is presented and can be modified to answer medical questions in various scenarios (Fig. 5).

Advantages and limitations

This is a pilot study exploring the application of prompt engineering to medical questions and the reliability of GPT-4's answers to medical questions. Our data collection did not rely on subjective human scoring and therefore objectively reflects the consistency and reliability of GPT-4's responses. However, the study was designed around the expected answers from the guidelines and lacks prospective validation. This study was performed with GPT-4's web interface, so it was not feasible to modify the temperature of GPT-4 to create more robust answers, which can be done by programmers who have acquired an application programming interface (API) key for the LLM. However, this study was designed to mimic GPT-4's performance in routine medical practice and to provide users such as doctors and patients with information on how to improve the quality of their answers; therefore, all experiments were performed in the web interface.
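
For future API-based work, a minimal sketch of how the temperature could be fixed and a question repeated is shown below; it assumes the OpenAI Python SDK (v1.x) interface with placeholder model and prompt values, and it was not part of the present study, which used only the web interface.

    from openai import OpenAI

    client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

    def ask(prompt: str, n_repeats: int = 5, temperature: float = 0.0) -> list[str]:
        """Ask the same question several times with a fixed temperature."""
        answers = []
        for _ in range(n_repeats):
            resp = client.chat.completions.create(
                model="gpt-4",  # placeholder model name
                messages=[{"role": "user", "content": prompt}],
                temperature=temperature,  # lower values give more deterministic output
            )
            answers.append(resp.choices[0].message.content)
        return answers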

Implications for Future Research

Currently, prompt engineering is advancing rapidly in the computer science field, but there is no specific medical research related to it. Prompt engineering can help doctors learn how to ask better questions so that GPT-4 can give better answers. Additionally, the reliability of GPT-4 in answering medical questions is worth exploring. In future studies, our team will aim to identify appropriate prompts for medical questions and try calling the API to change parameters such as the 'temperature' to improve reliability. We will further explore the application of prompt engineering and the reliability of answers in areas such as diagnosis and medical examinations. Meanwhile, it is suggested that future related studies examine prompt engineering and question repetition and formulate relevant guidelines.

Appropriate prompts may improve the accuracy of answers to professional medical questions. Additionally, it is recommended to ask the same question several times to obtain more comprehensive answers, especially for in-depth medical questions. Some of the recommendations provided by the AAOS may still be controversial, but the recommendations rated as 'strong' are rarely disputed and can be considered the gold standard; accordingly, the findings of this study show significant differences in the performance of each prompt in the strong group.

Competing interests: The authors declare no competing interests.

Contributions:

Li Wang and Xi Chen are the main designers and executors of the study and manuscript and have accessed and verified the data.

Jian Li was responsible for proposing revisions to the manuscript and for the decision to submit it.

XiangWen Deng and HaoWen provided consultation on computer science-related knowledge.

MingKe You and WeiZhi Liu participated in the drafting of the manuscript.

Funding information:

This work was funded by the Key Research and Development Fund of the Sichuan Science and Technology Planning Department (grant number 2022YFS0372, received by Jian Li) and the Youth Research Fund of the Sichuan Science and Technology Planning Department (grant number 23NSFSC4894, received by Xi Chen).

  1. Lee, P., Bubeck, S. & Petro, J. Benefits, Limits, and Risks of GPT-4 as an AI Chatbot for Medicine. N Engl J Med 388, 1233-1239, doi:10.1056/NEJMsr2214184 (2023).
  2. Waisberg, E. et al. GPT-4: a new era of artificial intelligence in medicine. Irish journal of medical science, doi:10.1007/s11845-023-03377-8 (2023).
  3. Scanlon, M., Breitinger, F., Hargreaves, C., Hilgert, J.-N. & Sheppard, J. (arXiv, 2023).
  4. Kanjee, Z., Crowe, B. & Rodman, A. Accuracy of a Generative Artificial Intelligence Model in a Complex Diagnostic Challenge. JAMA 330, 78-80, doi:10.1001/jama.2023.8288 (2023).
  5. Cai, L. Z. et al. Performance of Generative Large Language Models on Ophthalmology Board Style Questions. American journal of ophthalmology, doi:10.1016/j.ajo.2023.05.024 (2023).
  6. Walker, H. L. et al. Reliability of Medical Information Provided by GPT: Assessment Against Clinical Guidelines and Patient Information Quality Instrument. J Med Internet Res 25, e47479, doi:10.2196/47479 (2023).
  7. Yoshiyasu, Y. et al. GPT-4 accuracy and completeness against International Consensus Statement on Allergy and Rhinology: Rhinosinusitis. Int Forum Allergy Rhinol, doi:10.1002/alr.23201 (2023).
  8. Wang, X. et al. Self-Consistency Improves Chain of Thought Reasoning in Language Models. (2023).
  9. Strobelt, H. et al. Interactive and Visual Prompt Engineering for Ad-hoc Task Adaptation with Large Language Models. (2022).
  10. Wei, J. et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. (2023).
  11. Yao, S. et al. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. (2023).
  12. Hulbert, D. Using Tree-of-Thought Prompting to boost GPT's reasoning, <https://github.com/dave1010/tree-of-thought-prompting> (2023).
  13. 2019 Global Burden of Disease (GBD) study, <https://vizhub.healthdata.org/gbd-results/> (2019).
  14. Newsroom. AAOS Updates Clinical Practice Guideline for Osteoarthritis of the Knee, <https://www.aaos.org/aaos-home/newsroom/press-releases/aaos-updates-clinical-practice-guideline-for-osteoarthritis-of-the-knee/> (2021).
  15. KNEE, O. O. T. Clinical Practice Guideline on Management of Osteoarthritis of the Knee (3rd Edition), <https://www.aaos.org/quality/quality-programs/lower-extremity-programs/osteoarthritis-of-the-knee/> (2021).
  16. The American Academy of Orthopaedic Surgeons Board of Directors. Management of Osteoarthritis of the Knee (Non-Arthroplasty), <https://www.aaos.org/globalassets/quality-and-practice-resources/osteoarthritis-of-the-knee/oak3cpg.pdf> (2019).
  17. Theodorsson-Norheim, E. Kruskal-Wallis test: BASIC computer program to perform nonparametric one-way analysis of variance and multiple comparisons on ranks of several independent samples. Comput Methods Programs Biomed 23, 57-62, doi:10.1016/0169-2607(86)90081-7 (1986).
  18. Elliott, A. C. & Hynan, L. S. A SAS® macro implementation of a multiple comparison post hoc test for a Kruskal–Wallis analysis. Comput Methods Programs Biomed 102, 75-80, doi:10.1016/j.cmpb.2010.11.002 (2011).
  19. Goldstein, M., Wolf, E. & Dillon, W. On a test of independence for contingency tables. Communications in Statistics - Theory and Methods 5, 159-169, doi:10.1080/03610927808827340 (1976).
  20. Armstrong, R. A. When to use the Bonferroni correction. Ophthalmic Physiol Opt 34, 502-508, doi:10.1111/opo.12131 (2014).
  21. Sun, M. et al. Cognitive Impairment in Men with Prostate Cancer Treated with Androgen Deprivation Therapy: A Systematic Review and Meta-Analysis. J Urol 199, 1417-1425, doi:10.1016/j.juro.2017.11.136 (2018).
  22. Schoonjans, F., Zalata, A., Depuydt, C. E. & Comhaire, F. H. MedCalc: a new computer program for medical statistics. Comput Methods Programs Biomed 48, 257-262, doi:10.1016/0169-2607(95)01703-8 (1995).
  23. Fleiss, J. L. & Cohen, J. The Equivalence of Weighted Kappa and the Intraclass Correlation Coefficient as Measures of Reliability. Educational and Psychological Measurement 33, 613-619, doi:10.1177/001316447303300309 (1973).
  24. Pigott, T. D. A Review of Methods for Missing Data. Educational Research and Evaluation 7, 353-383, doi:10.1076/edre.7.4.353.8937 (2001).
  25. Leopold, S. S. Editorial: The New AAOS Guidelines on Knee Arthroscopy for Degenerative Meniscus Tears are a Step in the Wrong Direction. Clin Orthop Relat Res 480, 1-3, doi:10.1097/CORR.0000000000002068 (2022).
