  • Systematic Review
  • Open access

A comparative study of screening performance between abstrackr and GPT models: Systematic review and contextual analysis

Abstract

Background

Systematic reviews (SRs) and rapid reviews (RRs) are critical methodologies for synthesizing existing research evidence. However, the growing volume of literature has made the process of screening studies one of the most challenging steps in conducting systematic reviews.

Methods

This systematic review aimed to compare the performance of Abstrackr and GPT models (including GPT-3.5 and GPT-4) in literature screening for systematic reviews. We identified relevant studies through comprehensive searches in PubMed, Cochrane Library, and Web of Science, focusing on those that provided key performance metrics such as recall, precision, specificity, and F1 score.

Results

GPT models demonstrated superior performance compared to Abstrackr in precision (0.51 vs. 0.21), specificity (0.84 vs. 0.71), and F1 score (0.52 vs. 0.31), reflecting a higher overall efficiency and better balance in screening. This makes GPT models particularly effective in reducing false positives during fine-screening tasks.

Conclusion

Abstrackr and GPT models each offer distinct advantages in literature screening. Abstrackr is more suitable for the initial screening phases, whereas GPT models excel in fine-screening tasks. To optimize the efficiency and accuracy of systematic reviews, future screening tools could integrate the strengths of both models, potentially leading to the development of hybrid systems tailored to different stages of the screening process.


Background

Systematic reviews (SRs) are widely regarded as cornerstone methods in evidence-based medicine, public health, and numerous other disciplines for synthesizing existing research evidence. The primary goal of these reviews is to systematically collect, evaluate, and summarize current research findings, thereby providing reliable evidence to support clinical decision-making and inform policy development [1, 2]. However, the increasing volume of published literature has significantly heightened the complexity of conducting these reviews. Among the various stages of the systematic review process, study screening—especially the initial identification of relevant studies—has become one of the most resource-intensive and error-prone tasks [3].

Literature screening is typically divided into three distinct phases: title screening, title/abstract screening, and full-text screening. Traditionally, this process is carried out by two independent reviewers who manually assess each study for relevance, ensuring the inclusion of pertinent research. While this method is effective, it becomes exceedingly time-consuming and prone to human error when confronted with large volumes of studies [4]. Reviewers, even those with extensive experience, often face fatigue due to the sheer workload, leading to potential inconsistencies and omissions in study inclusion. Consequently, improving screening efficiency and alleviating the manual burden have emerged as central challenges in the field [5].

In response to these challenges, recent advancements in artificial intelligence (AI) and machine learning (ML) technologies have opened new avenues for automating and streamlining the literature screening process [6]. Several AI-driven tools, such as Rayyan, Abstrackr, and Colandr, have been developed to assist in systematic review screening. These tools leverage techniques such as active learning and natural language processing (NLP) to predict the relevance of studies based on initial training datasets labeled by human reviewers. By prioritizing studies most likely to be relevant, these tools aim to significantly reduce the workload for human reviewers, thus enhancing screening efficiency [3, 7].

For example, Gates et al. demonstrated the effectiveness of Abstrackr in four systematic review projects, achieving up to an 88% reduction in workload while maintaining a sensitivity of over 75% across all projects [3]. Similarly, Salles dos Reis et al. conducted a comparative study of three tools (Rayyan, Abstrackr, and Colandr) and found that all tools achieved specificities near 99%, effectively reducing the time spent on irrelevant literature. Among these, Abstrackr employs a Support Vector Machine (SVM)-based classifier integrated with an active learning and dual supervision framework [8]. Following automated deduplication, it vectorizes titles and abstract texts using TF-IDF to convert them into numerical features, and it dynamically optimizes its decision boundary through dual supervision and active learning. Abstrackr stands out for its accessibility, broad applicability, and high efficiency, making it one of the most widely used tools in the literature screening process [6]. Critically, deduplication is essential for Abstrackr: duplicate records may drastically alter the decision boundary during automated screening because identical numerical features are counted repeatedly, leading to significant false positives or false negatives in the screening outcomes. However, research has shown that Abstrackr tends to exhibit a higher false-negative rate when handling tasks with complex screening criteria or large datasets, and its predictive accuracy is also contingent on the size and quality of the training set [9].
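To make this mechanism concrete, the following is a minimal, illustrative sketch of a TF-IDF + SVM screening loop with uncertainty-based active learning, in the spirit of Abstrackr-style tools. It is not Abstrackr's actual implementation; the scikit-learn pipeline, thresholds, and the hypothetical ask_reviewer() labeling step are assumptions made for illustration only.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Illustrative corpus: deduplicated title + abstract strings and a sparse set of human labels.
records = ["Title A. Abstract text ...", "Title B. Abstract text ...", "Title C. Abstract text ..."]
labels = {0: 1}  # record index -> 1 (include) / 0 (exclude); grows as reviewers label more items

vectorizer = TfidfVectorizer(stop_words="english", max_features=50_000)
X = vectorizer.fit_transform(records)  # titles/abstracts -> TF-IDF feature vectors

for _ in range(10):  # active-learning rounds
    labeled_idx = list(labels)
    if len(set(labels.values())) < 2:
        break  # need at least one include and one exclude to train a classifier
    clf = LinearSVC().fit(X[labeled_idx], [labels[i] for i in labeled_idx])

    unlabeled_idx = [i for i in range(len(records)) if i not in labels]
    if not unlabeled_idx:
        break
    # Uncertainty sampling: query the record closest to the SVM decision boundary.
    margins = np.abs(clf.decision_function(X[unlabeled_idx]))
    query = unlabeled_idx[int(np.argmin(margins))]
    labels[query] = ask_reviewer(records[query])  # hypothetical human-labeling step
```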

At the same time, GPT models, particularly GPT-3.5 and GPT-4, have made significant strides in natural language processing in recent years. Unlike traditional machine learning models, GPT models are pre-trained on vast amounts of text data, endowing them with an advanced understanding of language and exceptional capabilities for generating and interpreting text [5]. These models’ ability to understand complex contexts and generate human-like text enables them to perform not only title and abstract screening, but also to adapt to complex inclusion criteria with high precision and sensitivity [7]. However, the substantial computational resources required for both training and inference pose a challenge to their widespread adoption in systematic review processes [3].

Despite the promising potential of both Abstrackr and GPT models, there is a notable gap in the literature regarding direct comparative studies of their performance in literature screening for systematic reviews. Most existing studies focus on evaluating the effectiveness of individual tools, with few directly comparing Abstrackr and GPT models using identical datasets and tasks [6]. To address this gap, this study aims to systematically compare the screening performance of Abstrackr and GPT models, analyzing their strengths and limitations and exploring their suitability for different phases of literature screening in systematic reviews.

Methods

This study employed a systematic review approach to evaluate the efficiency of Abstrackr and GPT models (including GPT-3.5, GPT-4, and their variants) in automating the literature screening process for systematic reviews. The focus was on comparing their performance across different types of screening tasks. We systematically extracted data from relevant studies, assessed performance metrics, and analyzed the differences between the two tools. To ensure scientific rigor, the study followed the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines for both literature screening and data extraction.

Search strategy and inclusion criteria

A comprehensive search was conducted in the following databases: PubMed, Cochrane Library, and Web of Science. The literature search employed the following key terms: ‘systematic review’, ‘meta-analysis’, ‘screening’, ‘title’, ‘abstract’, ‘machine learning’, and specific tools such as ‘Abstrackr’ and ‘GPT models’ (e.g., ‘GPT-3.5’, ‘GPT-4’). The search covered literature from January 2015 to June 2025 to capture the most recent research advancements. Additionally, relevant reviews and conference proceedings were manually searched to identify studies that might not have been indexed in the databases (Fig. 1).
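As an illustration only (the exact database-specific strings are not reproduced here), the key terms above could be combined into a Boolean query of the form (“systematic review” OR “meta-analysis”) AND (screening OR title OR abstract) AND (“machine learning” OR Abstrackr OR “GPT-3.5” OR “GPT-4”), adapted to each database’s syntax and field tags.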

Fig. 1 PRISMA flow diagram

Inclusion criteria

  • Studies using Abstrackr or GPT models (including GPT-3.5, GPT-3.5 Turbo, GPT-4, and GPT-4 Turbo) for literature screening.

  • Studies in which the GPT models were accessed via the API.

  • Studies that provided or allowed the inference of confusion matrices.

  • Studies that involved various types of systematic review tasks, including but not limited to drug interventions, public health research, and clinical treatments.

Exclusion criteria

  • Studies that did not provide recall, precision, F1-score, specificity, or accuracy, and for which these metrics could not be inferred from confusion matrix values.

  • Studies that involved automated screening tools other than Abstrackr or GPT models.

  • Narrative reviews, opinion pieces, or studies with inaccessible full texts.

Two independent reviewers were tasked with screening the literature and extracting relevant data. They recorded study characteristics, task domains, performance metrics, dataset features, and tool versions using standardized data extraction forms to ensure consistency and minimize biases.

Data analysis

The performance of Abstrackr and GPT models was evaluated using common metrics: recall, precision, specificity, F1 score, accuracy, and error rate [10]. Heterogeneity among studies was assessed using the I² statistic, which quantifies the degree of variation in results across studies [11]. Fixed-effects models were employed when heterogeneity was low (I² < 50%), while random-effects models were applied in cases of high heterogeneity (I² ≥ 50%) [12].
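For readers less familiar with these metrics, the sketch below shows how they can be derived from confusion-matrix counts (true/false positives and negatives), which is also how metrics were inferred for studies that reported only confusion matrices. The function name and the example counts are illustrative, not taken from any included study.

```python
def screening_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Derive the screening metrics used in this review from confusion-matrix counts."""
    recall = tp / (tp + fn)            # sensitivity: share of relevant studies identified
    precision = tp / (tp + fp)         # share of "include" decisions that were correct
    specificity = tn / (tn + fp)       # share of irrelevant studies correctly excluded
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "recall": recall,
        "precision": precision,
        "specificity": specificity,
        "f1": f1,
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
    }

# Example with made-up counts (not data from any included study):
print(screening_metrics(tp=80, fp=120, fn=20, tn=780))
```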

To compare the performance of Abstrackr and GPT, weighted averages were calculated across studies. Additionally, subgroup analyses were performed to examine performance differences between various versions of GPT models (e.g., GPT-3.5 vs. GPT-4), as well as between different types of systematic review tasks [13, 14].
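For reference, these are the standard pooling and heterogeneity formulas that such an analysis typically rests on; the paper does not state its exact weighting scheme, so the weights w_i below (e.g., study size or inverse variance) are an assumption.

```latex
\bar{m} \;=\; \frac{\sum_{i=1}^{k} w_i\, m_i}{\sum_{i=1}^{k} w_i},
\qquad
Q \;=\; \sum_{i=1}^{k} w_i \left( m_i - \bar{m} \right)^{2},
\qquad
I^{2} \;=\; \max\!\left( 0,\; \frac{Q - (k-1)}{Q} \right)
```

where m_i is the metric value reported by study i and k is the number of studies; I² ≥ 0.50 triggers the random-effects model described above.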

All statistical analyses were conducted using R or Stata software, and the results were visualized using radar charts; the robustness of the findings and potential biases in the included studies were also assessed [15, 16].
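As an illustration of the radar-chart visualization, the following minimal matplotlib sketch plots the weighted averages reported in the Results section; the plotting code itself is an assumption, not the authors' script.

```python
import numpy as np
import matplotlib.pyplot as plt

metrics = ["Recall", "Precision", "Specificity", "F1", "Accuracy"]
abstrackr = [0.740, 0.119, 0.780, 0.206, 0.779]  # weighted averages from the Results section
gpt = [0.791, 0.465, 0.973, 0.591, 0.908]

angles = np.linspace(0, 2 * np.pi, len(metrics), endpoint=False).tolist()
angles += angles[:1]  # close the polygon

fig, ax = plt.subplots(subplot_kw={"projection": "polar"})
for label, values in [("Abstrackr", abstrackr), ("GPT", gpt)]:
    vals = values + values[:1]
    ax.plot(angles, vals, label=label)
    ax.fill(angles, vals, alpha=0.15)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(metrics)
ax.set_ylim(0, 1)
ax.legend(loc="upper right")
plt.show()
```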

Results

Following systematic review protocols, 13 studies were ultimately included (Fig. 1). We detail the automated screening models utilized in these studies and present their confusion matrices (Table 1). A comprehensive comparison of screening performance metrics between Abstrackr and GPT-based tools is provided, with radar charts visually demonstrating key efficiency differences (Fig. 2).

Fig. 2 Performance comparison between Abstrackr and GPT

Table 1 Matrix of included studies

Recall

The weighted average recall for Abstrackr was 0.740 (95% CI: 0.725, 0.755), whereas for GPT, it was 0.791 (95% CI: 0.781, 0.801). GPT demonstrated superior performance in identifying relevant literature compared to Abstrackr (p < 0.001).

Precision

The weighted average precision for Abstrackr was 0.119 (95% CI: 0.114, 0.124), whereas for GPT, it was 0.465 (95% CI: 0.456, 0.474). Abstrackr exhibited a large number of false positives when marking study relevance, while GPT achieved significantly higher accuracy in identifying relevant studies (p < 0.001).

Specificity

The weighted average specificity for Abstrackr was 0.780 (95% CI: 0.776, 0.784), whereas for GPT, it was 0.973 (95% CI: 0.968, 0.977). Both models performed well in terms of specificity, but GPT demonstrated superior performance, suggesting that it was better at accurately excluding irrelevant studies and reducing the risk of misclassifying irrelevant studies as relevant (p < 0.001).

F1 Score

The weighted average F1 score for Abstrackr was 0.206 (95% CI: 0.199, 0.213), whereas for GPT, it was 0.591 (95% CI: 0.581, 0.601).

The F1 score reflects the balance between precision and recall. Abstrackr struggled to achieve a satisfactory balance between these two metrics, whereas GPT demonstrated strong performance in literature screening tasks, maintaining high precision while ensuring reasonable recall. Overall, GPT exhibited higher screening efficiency than Abstrackr (p < 0.001).

Accuracy

The weighted average accuracy for Abstrackr was 0.779 (95% CI: 0.775, 0.783), whereas for GPT, it was 0.908 (95% CI: 0.905, 0.911). GPT models showed higher accuracy than Abstrackr (p < 0.001).

Error Rate

The weighted average error rate for Abstrackr was 0.221 (95% CI: 0.217, 0.225), whereas for GPT, it was 0.092 (95% CI: 0.089, 0.095) (p < 0.001).

Heterogeneity analysis

Abstrackr model

The heterogeneity analysis for the Abstrackr model showed low I² values across all performance metrics:

Recall (I² = 0.10), Precision (I² = 0.10), Specificity (I² = 0.22), F1 Score (I² = 0.03), Accuracy (I² = 0.13), and Error Rate (I² = 0.08).

This indicates high consistency between studies.

GPT model

The heterogeneity analysis for the GPT model also showed low I² values for most metrics:

Recall (I² = 0.07), Precision (I² = 0.20), F1 Score (I² = 0.07), Accuracy (I² = 0.00), and Error Rate (I² = 0.00).

However, specificity (I² = 0.47) demonstrated higher heterogeneity.

To address this, we conducted a subgroup analysis of GPT models for specificity.

Subgroup analysis of GPT model specificity

The subgroup analysis revealed that GPT-4 series models (including GPT-4 and GPT-4 Turbo) showed significantly higher specificity compared to GPT-3.5 series models (including GPT-3.5 and GPT-3.5 Turbo).

A Mann-Whitney U test indicated a significant difference in the distribution of specificity between the two groups (U = 17.00, p = 0.01). Hence, it can be concluded that the GPT-4 series models outperformed the GPT-3.5 series models in terms of specificity (Fig. 3).
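A minimal sketch of how such a comparison can be run with SciPy is shown below; the per-study specificity values are placeholders for illustration, not the extracted data behind the reported U = 17.00.

```python
from scipy.stats import mannwhitneyu

# Placeholder per-study specificity values (illustrative only).
gpt4_specificity = [0.97, 0.98, 0.96, 0.99, 0.95]
gpt35_specificity = [0.90, 0.88, 0.93, 0.91]

u_stat, p_value = mannwhitneyu(gpt4_specificity, gpt35_specificity, alternative="two-sided")
print(f"U = {u_stat:.2f}, p = {p_value:.3f}")
```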

Fig. 3 Performance comparison between GPT-3.5 and GPT-4

Discussion

This study is the first systematic review comparing the performance of Abstrackr and GPT models in literature screening for systematic reviews. The results show that GPT outperforms Abstrackr in several key metrics. GPT models also demonstrated significantly higher recall (0.791 vs. 0.740, p < 0.001), indicating their ability to identify a larger proportion of relevant studies and suggesting that GPT may be suitable for both the initial and the precision screening phases. Specifically, the weighted average precision of GPT was 0.465 (95% CI: 0.456, 0.474), significantly higher than Abstrackr’s precision of 0.119 (95% CI: 0.114, 0.124). This suggests that Abstrackr has a higher rate of false positives when marking relevant studies, while GPT achieves significantly better accuracy in screening relevant articles [29, 30].

Abstrackr exhibited a higher error rate (0.221, 95% CI: 0.217, 0.225) than GPT (0.092, 95% CI: 0.089, 0.095), while GPT demonstrated superior performance in F1 score (0.591, 95% CI: 0.581, 0.601 vs. 0.206, 95% CI: 0.199, 0.213), specificity (0.973, 95% CI: 0.968, 0.977 vs. 0.780, 95% CI: 0.776, 0.784), and accuracy (0.908, 95% CI: 0.905, 0.911 vs. 0.779, 95% CI: 0.775, 0.783). These results highlight GPT’s higher overall efficiency and balanced performance in literature screening. GPT models not only maintain high sensitivity in complex screening tasks but also effectively reduce errors and workload [31, 32], providing a more reliable tool for literature screening [33, 34].

Recall reflects a tool’s ability to capture relevant studies, which is critical in systematic reviews to ensure that all relevant literature is included [35]. Previous studies have shown that GPT-4’s recall typically ranges between 0.80 and 0.87. Even in complex tasks such as systematic reviews of behavioral interventions, GPT-4 has achieved a recall rate of 0.875, demonstrating strong adaptability to challenging tasks [36]. By contrast, Abstrackr’s recall performance has been reported to be more variable. Some studies have shown that Abstrackr’s recall decreases significantly, dropping as low as 0.60, when faced with large datasets or complex inclusion criteria [15]. This instability affects the comprehensiveness of systematic reviews and the reliability of their conclusions. Additionally, Abstrackr typically prioritizes high recall by introducing more false positives (FP) to ensure that all relevant studies are covered [37]. GPT models, on the other hand, tend to balance recall and precision to reduce false positives. Therefore, the dual strengths of GPT models in recall and precision enable them to handle both preliminary screening and detailed screening tasks effectively.

High precision, on the other hand, can effectively reduce the burden of manual screening, especially when dealing with large datasets [38]. Previous studies have shown that Abstrackr performs poorly in terms of precision, particularly in large-scale tasks where precision can drop as low as 0.06 [29]. This means that many studies marked as relevant by Abstrackr are actually irrelevant, significantly increasing the screening workload [39]. In contrast, GPT-4’s precision typically ranges between 0.80 and 0.83, significantly reducing the number of false positives and alleviating the burden of manual screening. This finding is consistent with our study, which suggests that GPT models are superior to Abstrackr in large-scale tasks [13].

The F1 score, which is the harmonic mean of precision and recall, directly reflects a tool’s overall performance. Our study found that the weighted average F1 score of Abstrackr was 0.206 (95% CI: 0.199, 0.213), whereas GPT achieved a weighted average F1 score of 0.591 (95% CI: 0.581, 0.601). GPT-4 was better able to balance these two metrics, which we attribute to its strong language understanding capabilities based on deep learning, allowing it to more accurately identify the relevance of studies [40].

In terms of specificity, GPT models outperformed Abstrackr, particularly the GPT-4 series, indicating their strong capability to exclude irrelevant studies. High specificity means the tool can accurately exclude most irrelevant studies, which is especially critical in tasks where irrelevant studies constitute the majority [41]. Therefore, GPT models are more suitable for the fine-screening phase of systematic reviews, where a high specificity is required to reduce the screening workload. GPT models significantly outperformed Abstrackr in both specificity and recall, overcoming the inherent trade-off dilemma faced by traditional tools.

Although the accuracy difference between Abstrackr and GPT-4 was small, accuracy is not the most representative metric for evaluating literature screening tools in systematic reviews. This is because systematic reviews often deal with a majority of irrelevant studies, and accuracy can be heavily influenced by the proportion of irrelevant studies. Metrics such as recall, precision, and F1 score are more reflective of a tool’s actual performance in complex tasks [39].

GPT demonstrates comprehensively superior performance over Abstrackr in the target tasks. In terms of time efficiency, Abstrackr reduces preliminary screening time by 30–50%, while GPT further cuts the average per-article screening time from 8 minutes to 1.2 minutes, achieving a 70% reduction in detailed screening time [5]. Dennstädt et al. compared the full-process costs of LLMs versus traditional tools [42], reporting that screening 10,000 articles via the GPT-4 API costs approximately $200, whereas Abstrackr, although an open-source, free tool, incurs additional human verification costs [43]. Economic analyses confirm that GPT models are more cost-effective for large-scale systematic reviews, but teams must balance API expenses against labor savings, and the choice should align with the literature volume.

Critically, with the rise of LLMs, standardized prompt usage significantly impacts model efficacy. Studies utilizing GPT for automated screening consistently employ structured prompt engineering frameworks. Core principles of standardized GPT prompting include the following [44], with a schematic example after the list:

  • Structured Task Instructions: Explicitly define AI roles, output formats, and content boundaries to ensure academic compliance.

  • Modular Conditional Design: Prioritize “ToDo” over “NotToDo” directives to avoid ambiguous exclusions.

  • Layered Constraints: Implement step-wise restrictions for complex tasks to mitigate hallucination risks.

  • Iterative Refinement Mechanisms: Enhance rigor through multi-round human revisions (“Draft → Logic Validation → Terminology Enhancement”).
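To make these principles concrete, here is a minimal sketch of an API-based title/abstract screening call following the structured-prompt pattern above. The model name, criteria text, and output handling are illustrative assumptions, not the prompts used in the included studies.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are a systematic-review screener. "                    # explicit role definition
    "Decide whether the record meets ALL inclusion criteria. "  # "ToDo"-style directive
    "Answer with exactly one word: INCLUDE or EXCLUDE."         # constrained output format
)

def screen_record(title: str, abstract: str, criteria: str) -> str:
    """Return 'INCLUDE' or 'EXCLUDE' for one title/abstract record (illustrative sketch)."""
    response = client.chat.completions.create(
        model="gpt-4",   # or gpt-3.5-turbo / gpt-4-turbo, as in the included studies
        temperature=0,   # deterministic screening decisions
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": f"Inclusion criteria:\n{criteria}\n\n"
                                        f"Title: {title}\nAbstract: {abstract}"},
        ],
    )
    return response.choices[0].message.content.strip().upper()
```

In practice, the constrained one-word output keeps decisions machine-parseable, and disagreements with a human reviewer can then feed the iterative refinement loop described above.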

Beyond Abstrackr and GPT, the literature screening tool ecosystem exhibits distinct stratification:

  • Premium tools (e.g., DistillerSR, EPPI-Reviewer, Covidence, SWIFT Active Screener) support all nine essential paid features and are well suited to large systematic reviews, although subscription costs may deter small teams.

  • Free, efficient tools such as Rayyan maintain high recall (0.93) while reducing labor by 40%.

  • Adapted solutions (e.g., the EndNote-Bramer and Excel-VonVille methods) enable basic screening but suffer from high error rates (28%) and are suitable only for reviews of fewer than 100 articles.

Tool selection must balance budget, literature volume, and functional requirements.

Limitations

To determine the optimal AI tool for literature screening in terms of accuracy and efficiency compared to human reviewers, comparative study designs would be preferred. However, the current studies, while reporting workload reduction and time savings, lack critical information, such as:

  • The training time required for researchers when first using the AI tools.

  • The time needed to prepare training datasets for the AI tools.

  • The time required to summarize the results of the training dataset and the final screening results (covering the title and abstract screening phase as well as the full-text screening phase).

  • The need for researchers, along with human reviewers, to manually review all titles and abstracts with inconsistent judgments during screening to ensure that no relevant studies are missed.

Additionally, the lack of direct comparisons between the two tools on identical tasks, with consistent literature volume and task complexity, limits the generalizability of the findings. Since the accuracy and specificity of AI tools have not yet reached 100%, human assistance is still required. In practice, the nature of the research question, the research team’s familiarity with relevant terminology, and the research topic itself are critical factors in deciding whether to use AI tools for literature screening.

Finally, only PubMed, the Cochrane Library, and Web of Science (WOS) were searched in the current work; future studies should incorporate Embase and other databases into the search strategy to enhance the comprehensiveness of the research.

Conclusion and future directions

This study is the first to systematically compare the performance of Abstrackr and GPT models in literature screening for systematic reviews. The results demonstrate that the two tools have distinct advantages and applicability in different tasks. Abstrackr ensures comprehensiveness with a high recall rate (0.82), making it suitable for initial screening tasks requiring high coverage. In contrast, GPT models excel in precision (0.51), specificity (0.84), and F1 score (0.52), making them better suited for fine-screening tasks aimed at reducing false positives. The differences in error rate and accuracy between the two tools were minimal, but each has its strengths.

These findings provide valuable guidance for research teams in choosing literature screening tools in different contexts. They also highlight the need for further optimization of AI tools to balance the trade-offs between recall and specificity, paving the way for future research and development.

Data availability

All data generated or analyzed during this study are included in this published article.

References

  1. Affengruber L, Van Der Maten MM, Spiero I, Nussbaumer-Streit B, Mahmić-Kaknjo M, Ellen ME, et al. An exploration of available methods and tools to improve the efficiency of systematic review production: a scoping review. BMC Med Res Methodol. 2024 Sep 18;24(1):210.

  2. Borah R, Brown AW, Capers PL, Kaiser KA. Analysis of the time and workers needed to conduct systematic reviews of medical interventions using data from the PROSPERO registry. BMJ Open. 2017 Feb 27;7(2):e012545.

  3. Cierco Jimenez R, Lee T, Rosillo N, Cordova R, Cree IA, Gonzalez A, et al. Machine learning computational tools to assist the performance of systematic reviews: a mapping review. BMC Med Res Methodol. 2022 Dec 16;22(1):322.

  4. Norman CR, Leeflang MMG, Porcher R, Névéol A. Measuring the impact of screening automation on meta-analyses of diagnostic test accuracy. Syst Rev. 2019 Oct 28;8(1):243.

  5. Issaiy M, Ghanaati H, Kolahi S, Shakiba M, Jalali AH, Zarei D, et al. Methodological insights into ChatGPT’s screening performance in systematic reviews. BMC Med Res Methodol. 2024 Mar 27;24(1):78.

  6. Van Der Mierden S. Software tools for literature screening in systematic reviews in biomedical research. ALTEX [Internet]. 2019 [cited 2024 Oct 21]. Available from: https://www.altex.org/index.php/altex/article/view/1257.

  7. Masoumi S, Amirkhani H, Sadeghian N, Shahraz S. Natural language processing (NLP) to facilitate abstract review in medical research: the application of BioBERT to exploring the 20-year use of NLP in medical research. Syst Rev. 2024 Apr 15;13(1):107.

  8. Wallace BC, Small K, Brodley CE, Lau J, Trikalinos TA. Deploying an interactive machine learning system in an evidence-based practice center: abstrackr. In: Proceedings of the 2nd ACM SIGHIT International Health Informatics Symposium [Internet]. Miami, Florida, USA: ACM; 2012 [cited 2025 Jul 22]. p. 819–24. Available from: https://doi.org/10.1145/2110363.2110464.

  9. Qin X, Liu J, Wang Y, Liu Y, Deng K, Ma Y, et al. Natural language processing was effective in assisting rapid title and abstract screening when updating systematic reviews. J Clin Epidemiol. 2021 May;133:121–29.


  10. Clark J, Glasziou P, Del Mar C, Bannach-Brown A, Stehlik P, Scott AM. A full systematic review was completed in 2 weeks using automation tools: a case study. J Clin Epidemiol. 2020;121:81–90.


  11. Lerner I, Créquit P, Ravaud P, Atal I. Automatic screening using word embeddings achieved high sensitivity and workload reduction for updating living network meta-analyses. J Clin Epidemiol. 2019;108:86–94.


  12. Bashir R, Surian D, Dunn AG. The risk of conclusion change in systematic review updates can be estimated by learning from a database of published examples. J Clin Epidemiol. 2019;110:42–49.


  13. Dennstädt F, Zink J, Putora PM, Hastings J, Cihoric N. Title and abstract screening for literature reviews using large language models: an exploratory study in the biomedical domain. Syst Rev. 2024 Jun 15;13(1):158.

  14. Frunza O, Inkpen D, Matwin S, Klement W, O’Blenis P. Exploiting the systematic review protocol for classification of medical abstracts. Artif Intell Med. 2011 Jan;51(1):17–25.


  15. Tóth B, Berek L, Gulácsi L, Péntek M, Zrubka Z. Automation of systematic reviews of biomedical literature: a scoping review of studies indexed in PubMed. Syst Rev. 2024 Jul 8;13(1):174.

  16. Van Dijk SHB, Brusse-Keizer MGJ, Bucsán CC, Van Der Palen J, Doggen CJM, Lenferink A. Artificial intelligence in systematic reviews: promising when appropriately used. BMJ Open. 2023 Jul;13(7):e072254.


  17. Carey N, Harte M, Mc Cullagh L. A text-mining tool generated title-abstract screening workload savings: performance evaluation versus single-human screening. J Clin Epidemiol. 2022;149:53–59.


  18. Giummarra MJ, Lau G, Gabbe BJ. Evaluation of text mining to reduce screening workload for injury-focused systematic reviews. Inj Prev. 2020 Feb;26(1):55–60.


  19. Rathbone J, Hoffmann T, Glasziou P. Faster title and abstract screening? Evaluating Abstrackr, a semi-automated online screening program for systematic reviewers. Syst Rev. 2015 Jun 15;4:80.

  20. Gates A, Gates M, Sebastianski M, Guitard S, Elliott SA, Hartling L. The semi-automation of title and abstract screening: a retrospective exploration of ways to leverage Abstrackr’s relevance predictions in systematic and rapid reviews. BMC Med Res Methodol. 2020 Jun 3;20(1):139.

  21. Gates A, Johnson C, Hartling L. Technology-assisted title and abstract screening for systematic reviews: a retrospective evaluation of the Abstrackr machine learning tool. Syst Rev. 2018 Mar 12;7(1):45.

  22. Dos Reis AHS, de Oliveira ALM, Fritsch C, Zouch J, Ferreira P, Polese JC. Usefulness of machine learning softwares to screen titles of systematic reviews: a methodological study. Syst Rev. 2023 Apr 15;12(1):68.

  23. Gates A, Guitard S, Pillay J, Elliott SA, Dyson MP, Newton AS, et al. Performance and usability of machine learning for screening in systematic reviews: a comparative evaluation of three tools. Syst Rev. 2019 Dec;8(1):278.


  24. Chelli M, Descamps J, Lavoué V, Trojani C, Azar M, Deckert M, et al. Hallucination rates and reference accuracy of ChatGPT and Bard for systematic reviews: comparative analysis. J Med Internet Res. 2024 May 22;26:e53164.

  25. Matsui K, Utsumi T, Aoki Y, Maruki T, Takeshima M, Takaesu Y. Human-comparable sensitivity of large language models in identifying eligible studies through title and abstract screening: 3-layer strategy using GPT-3.5 and GPT-4 for systematic reviews. J Med Internet Res. 2024 Aug 16;26:e52758.

  26. Li M, Sun J, Tan X. Evaluating the effectiveness of large language models in abstract screening: a comparative analysis. Syst Rev. 2024 Aug 21;13(1):219.

  27. Dai ZY, Wang FQ, Shen C, Ji YL, Li ZY, Wang Y, et al. Accuracy of large language models for literature screening in thoracic surgery: diagnostic study. J Med Internet Res. 2025 Mar 11;27:e67488.

  28. Chibwe K, Mantilla-Calderon D, Ling F. Evaluating GPT models for automated literature screening in wastewater-based epidemiology. ACS Environ Au. 2025 Jan 15;5(1):61–68.

  29. Kebede MM, Le Cornet C, Fortner RT. In-depth evaluation of machine learning methods for semi-automating article screening in a systematic review of mechanistic literature. Res Synth Methods. 2023 Mar;14(2):156–72.


  30. Ouzzani M, Hammady H, Fedorowicz Z, Elmagarmid A. Rayyan—a web and mobile app for systematic reviews. Syst Rev. 2016 Dec;5(1):210.


  31. Bekhuis T, Tseytlin E, Mitchell KJ, Demner-Fushman D. Feature engineering and a proposed decision-support system for systematic reviewers of medical evidence. PLoS One. 2014;9(1):e86277.


  32. Adam GP, Pappas D, Papageorgiou H, Evangelou E, Trikalinos TA. A novel tool that allows interactive screening of pubmed citations showed promise for the semi-automation of identification of biomedical literature. J Clin Epidemiol. 2022;150:63–71.


  33. Bannach-Brown A, Przybyła P, Thomas J, Rice ASC, Ananiadou S, Liao J, et al. Machine learning algorithms for systematic review: reducing workload in a preclinical review of animal studies and reducing human screening error. Syst Rev. 2019 Jan 15;8(1):23.

  34. Landschaft A, Antweiler D, Mackay S, Kugler S, Rüping S, Wrobel S, et al. Implementation and evaluation of an additional GPT-4-based reviewer in PRISMA-based medical systematic literature reviews. Int J Med Inform. 2024 Sep;189:105531.


  35. Wang Z, Nayfeh T, Tetzlaff J, O’Blenis P, Murad MH. Error rates of human reviewers during abstract screening in systematic reviews. PLoS One. 2020;15(1):e0227742.


  36. Reichenpfader D, Müller H, Denecke K. Large language model-based information extraction from free-text radiology reports: a scoping review protocol. BMJ Open. 2023 Dec 9;13(12):e076865.

  37. Brockmeier AJ, Ju M, Przybyła P, Ananiadou S. Improving reference prioritisation with PICO recognition. BMC Med Inform Decis Mak. 2019 Dec 5;19(1):256.

  38. Fatima A, Shafique MA, Alam K, Fadlalla Ahmed TK, Mustafa MS. ChatGPT in medicine: a cross-disciplinary systematic review of ChatGPT’s (artificial intelligence) role in research, clinical practice, education, and patient interaction. Medicine (Baltimore). 2024 Aug 9;103(32):e39250.

  39. Khurana D, Koli A, Khatter K, Singh S. Natural language processing: state of the art, current trends and challenges. Multimed Tools Appl. 2023 Jan;82(3):3713–44.


  40. Zimmerman J, Soler RE, Lavinder J, Murphy S, Atkins C, Hulbert L, et al. Iterative guided machine learning-assisted systematic literature reviews: a diabetes case study. Syst Rev. 2021 Apr 2;10(1):97.

  41. Pham B, Jovanovic J, Bagheri E, Antony J, Ashoor H, Nguyen TT, et al. Text mining to support abstract screening for knowledge syntheses: a semi-automated workflow. Syst Rev. 2021 Dec;10(1):156.


  42. Dennstädt F, Zink J, Putora PM, Hastings J, Cihoric N. Title and abstract screening for literature reviews using large language models: an exploratory study in the biomedical domain. Syst Rev. 2024 Jun 15;13(1):158.

  43. Guo E, Gupta M, Deng J, Park YJ, Paget M, Naugler C. Automated paper screening for clinical reviews using large language models: data analysis study. J Med Internet Res. 2024 Jan 12;26:e48996.

  44. Wang L, Chen X, Deng X, et al. Npj Digit Med. 2024;7(1).


Acknowledgements

We thank the National Natural Science Foundation of China and the R&D Program of Beijing Municipal Education Commission for their financial support. We also acknowledge the contributions of the Chinese Institutes for Medical Research, Beijing, for providing essential resources.

Funding

This work was supported by the National Natural Science Foundation of China (No. 62276173) and R&D Program of Beijing Municipal Education Commission (No. KZ202210025034) and Chinese Institutes for Medical Research, Beijing (Grant No. CX24PY15).

Author information


Contributions

SX: Analyzed and interpreted data related to Abstrackr and GPT. WC: Was the primary contributor in writing the manuscript. XL: Led the literature search and selection process. All authors have read and approved the final manuscript.

Corresponding author

Correspondence to Xiang-long Meng.

Ethics declarations

Ethics approval

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.


About this article


Cite this article

Xu, S., Zhao, Z., Liu, X. et al. A comparative study of screening performance between abstrackr and GPT models: Systematic review and contextual analysis. BMC Med Inform Decis Mak 25, 293 (2025). https://doi.org/10.1186/s12911-025-03138-w
