Abstract
Background
Recruitment for cohorts involving complex liver diseases, such as hepatocellular carcinoma and liver cirrhosis, often requires interpreting semantically complex criteria. Traditional manual screening methods are time-consuming and prone to errors. While AI-powered pre-screening offers potential solutions, challenges remain regarding accuracy, efficiency, and data privacy.
Methods
We developed a novel patient pre-screening pipeline that leverages clinical expertise to guide the precise, safe, and efficient application of large language models. The pipeline breaks down complex criteria into a series of composite questions and then employs two strategies to perform semantic question-answering against electronic health records: (1) Pathway A, an Anthropomorphized Experts’ Chain of Thought strategy; and (2) Pathway B, a Preset Stance Agent Collaboration strategy, which is particularly suited to complex clinical reasoning scenarios. The pipeline was evaluated at both the question and criterion levels on key metrics, including precision, recall, time consumption, and counterfactual inference.
Results
Our pipeline achieved a notable balance of high precision (e.g., 0.921 at the criterion level) and good overall recall (e.g., ~0.82 at the criterion level), alongside high efficiency (0.44 s per task). Pathway B excelled in high-precision complex reasoning (while exhibiting a specific recall profile conducive to accuracy), whereas Pathway A was particularly effective for tasks requiring both robust precision and recall (e.g., direct data extraction), often with faster processing times. Both pathways achieved comparable overall precision while offering different strengths in the precision-recall trade-off. The pipeline showed promising precision-focused results in hepatocellular carcinoma (0.878) and cirrhosis trials (0.843).
Conclusions
This data-secure and time-efficient pipeline shows high precision and achieves good recall in hepatopathy trials, providing a promising solution for streamlining clinical trial workflows. Its efficiency, adaptability, and balanced performance profile make it suitable for improving patient recruitment, and its capability to function in resource-constrained environments further enhances its utility in clinical settings.
Background
In hepatology, the complexity of chronic diseases like cirrhosis, liver failure, and hepatocellular carcinoma demands meticulous patient selection for clinical research [1]. Clinicians must navigate extensive electronic health records (EHRs) to identify patients with specific diagnoses or treatments [2], such as esophageal varices or those undergoing Transjugular Intrahepatic Portosystemic Shunt (TIPS). The challenge deepens when broader categories like autoimmune diseases or prior beta-blocker therapy are involved, requiring comprehensive evaluation [3]. Additionally, determining whether liver cancer is primary or secondary, identifying the causes of liver failure or cirrhosis, or recognizing symptoms like jaundice adds further layers of complexity, necessitating detailed analysis and clinical insight. These tasks expose the limits of current methods. While traditional manual screening and automated approaches have attempted to address these challenges, they fall short in key areas [4].
Bidirectional Encoder Representations from Transformers (BERT)-based tools have been used for subject matching across various diseases while struggling with generalizability [4,5,6,7]. Large language models (LLMs), with their superior contextual semantic understanding, offer a promising avenue to overcome these limitations [8, 9]. Domain-specific LLMs, fine-tuned on medical corpora or using knowledge graph embeddings, perform well in clinical text summarization [10, 11], note generation [12, 13], and medical question answering [14]. However, EHR security policies commonly necessitate local LLM deployment, precluding the use of non-open-source models like Chat Generative Pre-Trained Transformer (ChatGPT), Gemini, or Claude. Additionally, we believe the primary challenge lies in the need for strategic support in complex reasoning rather than in domain-specific knowledge [15]. Thus, both domain-specific and general LLMs start from a similar point. Moreover, domain-specific LLMs’ extensive IT resource demands often exceed institutional capacities. Therefore, this study focuses on locally deployed general open-source LLMs for practical reasons.
Open-source LLMs often face criticism in professional reasoning tasks for non-factual hallucination and imbalanced attention focus [16]. To address these issues, effective strategies have been engineered across various domains, including (1) Chain of Thought (CoT) prompting; (2) Agent-Collaboration (Agent-Collab) frameworks.
CoT prompting and extensions
CoT prompting is a powerful method that breaks complex tasks down into a series of intermediate steps [17]. Traditional “Let’s think step by step” approaches with few-shot learning cannot meet our complex reasoning needs. A typical alternative uses prompt engineering to guide the model toward a thought process consistent with a benchmark individual. In this study, we introduce “Anthropomorphized Experts’ CoT,” inspired by AI biomimicry. This method emulates human expert reasoning, derived from interviews and summarized into concise steps; the principle is to break complex thought processes down into simple, manageable steps.
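As a concrete illustration of this strategy, the sketch below assembles an expert-derived CoT prompt. The persona and reasoning steps are paraphrased examples for a generic clinical expert, not the study’s actual templates (those appear in Supplemental Materials S2-S5).

```python
# Illustrative Anthropomorphized Experts' CoT prompt; the persona and
# steps below are examples, not the study's actual templates.

COT_TEMPLATE = """You are a {role}.
Follow these steps, summarized from interviews with a real {role}:
{steps}

Admission note:
{note}

Question: {question}
Answer with exactly one of: yes / no / uncertain, then quote the
sentence(s) from the note that support your answer."""

EXPERT_STEPS = (
    "1. Locate the note sections most likely to contain the answer.\n"
    "2. Extract candidate phrases, allowing synonyms and paraphrases,\n"
    "   not just literal keyword matches.\n"
    "3. Check each candidate against the question before answering."
)

def build_cot_prompt(note: str, question: str) -> str:
    """Fill the expert-CoT template for one note/question pair."""
    return COT_TEMPLATE.format(role="hepatology clinical expert",
                               steps=EXPERT_STEPS, note=note, question=question)
```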
Agent-collab frameworks
Leveraging LLMs in agent mode, where independent agents collaborate, converse, and debate within an Agent-Collab framework, has attracted sustained interest. Compared with a consultative question-and-answer model, an Agent-Collab architecture with debating agents has been reported to perform better across the evaluated dimensions [18]. The Agent-Collab framework typically requires (1) defining agent roles and (2) setting the number of dialogue rounds. Roles can be specified or predefined. In this study, we employed a Counterfactual Debating with Preset Stances method, which has shown high consistency in existing research, with two dialogue rounds reported to yield the best results [19]. The Agent-Collab framework can simulate a ‘mock trial’ to identify and rectify potential errors or biases in interpretation, ensuring a more accurate and comprehensive assessment of patient eligibility. Methods such as self-critique and reflection are also employed to confront the inherent biases of LLMs, compelling them to consider alternative perspectives and generate more accurate outcomes.
To address domain knowledge gaps in locally deployed models, we use powerful non-open-source LLMs to enhance non-sensitive inclusion/exclusion criteria. Building on these advancements, our pipeline transforms complex logical issues into manageable semantic tasks, creating a robust framework for patient pre-screening from admission notes.
Methods
Data preparation
The EHR data used in this study, specifically narrative admission notes, typically included sections such as the chief complaint, history of present illness, and past medical history (a sample note is provided in Supplemental Material S1). These data were obtained from the First Affiliated Hospital of Guangxi University of Chinese Medicine (FAHGUCM). The dataset comprised over 16,000 anonymized admission notes from the Hepatology Department collected over the past 10 years. For detailed analysis, we randomly selected 4,000 records; this sample was representative of the overall dataset in terms of the distribution of hepatopathy and associated clinical presentations.
We manually reviewed the inclusion and exclusion criteria of six real-world clinical trials on hepatopathy, and carefully selected 58 criteria assessable using information typically documented in admission notes. These criteria primarily focused on general health, medical history, and prior treatments, excluding those requiring specific laboratory results or specialized examinations not routinely captured in admission notes (details in Supplemental Table S2). These trials addressed various liver conditions, including chronic liver failure (ChiCTR2100044187 [20]), cirrhosis (NCT04353193 [21], NCT04850534 [22], NCT03911037 [23], NCT01311167 [24]), and hepatocellular carcinoma (NCT04021056 [25]).
We deployed four LLMs within an intranet security environment: Qwen1.5-7B-Chat-GPTQ-Int4 [26] (QWEN1.5), Baichuan2-13B-Chat [27] (BAICHUAN), ChatGLM3-6B [28] (GLM) and Qwen2-7B-Instruct-GPTQ-Int4 [26] (QWEN2). The experiments were conducted on Lenovo ST650V2 servers equipped with four NVIDIA GeForce RTX 3090 Turbo GPUs.
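As an illustration of the local deployment, the sketch below loads one of the quantized checkpoints with Hugging Face transformers. The paper does not describe its serving stack, so this loading route and the `ask` helper are assumptions, not the study’s code.

```python
# Hypothetical local-deployment sketch; assumes the standard Hugging Face
# transformers route for GPTQ checkpoints (the GPTQ runtime dependencies
# must be installed). Not the study's actual serving code.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen1.5-7B-Chat-GPTQ-Int4"  # served entirely inside the intranet

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def ask(prompt: str, max_new_tokens: int = 256) -> str:
    """Run one semantic question-answering call against the local model."""
    inputs = tokenizer.apply_chat_template(
        [{"role": "user", "content": prompt}],
        add_generation_prompt=True, return_tensors="pt",
    ).to(model.device)
    output = model.generate(inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)
```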
Pipeline design
The core of our approach is a multi-strategy pipeline designed to systematically process narrative admission notes, extracting relevant clinical information and applying logical reasoning. The architecture is detailed in Fig. 1.
Fig. 1 Comparison of two pathways for criteria identification. (A) shows the complete pre-screening process, including criteria conversion and question-based assessment. (B) and (D) depict Pathway A: Anthropomorphized Experts’ Chain of Thought (CoT), which models the opinions of three real-world operators to create a CoT that aligns with inclusion and exclusion criteria. (C) and (E) illustrate Pathway B: Preset Stance Agent-Collab, where agents’ stances are preset, followed by a voting and arbitration process to determine final answers
Criteria conversion
We employed an external-CoT strategy, leveraging advanced non-open-source LLMs (OpenAI ChatGPT-4o, Google Gemini Advanced, and Anthropic Claude 3.5 Sonnet) to decompose complex eligibility criteria into simpler, more manageable questions. Each LLM generated question sets from a standardized prompt template (see Supplemental Materials S2-S5), which were subsequently integrated and refined using ChatGPT-4o to produce a comprehensive, non-redundant, and conflict- and ambiguity-free final question set.
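The sketch below illustrates this conversion step. The prompt wording is illustrative (the actual templates are in Supplemental Materials S2-S5), and `ask_external_llm` is a placeholder for a call to one of the non-open-source models.

```python
# Sketch of the criteria-conversion step; prompt text and helper names
# are assumptions for illustration only.

DECOMPOSE_PROMPT = """Decompose the following clinical trial eligibility
criterion into the smallest set of yes/no questions that can each be
answered from a narrative admission note alone. Avoid redundant or
overlapping questions.

Criterion: {criterion}
Questions (one per line):"""

def decompose(criterion: str, ask_external_llm) -> list[str]:
    """Turn one complex criterion into simple, answerable questions."""
    reply = ask_external_llm(DECOMPOSE_PROMPT.format(criterion=criterion))
    # Strip list markers such as "1. " or "- " from each returned line.
    return [line.lstrip("0123456789.)- ").strip()
            for line in reply.splitlines() if line.strip()]
```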
Question-based assessment
We employed two distinct pathways to address the decomposed questions.
Pathway A: anthropomorphized experts’ CoT
We conducted interviews with key individuals to summarize their thought processes, which were then used to develop the corresponding CoT. Each LLM instance, prompted to act in one of the independent roles illustrated in Fig. 1(D), analyzed EHR content and provided assessments for the corresponding questions: (1) Clinical Research Coordinator (CRC): focused on identifying relevant EHR domains using empirical knowledge, followed by targeted data extraction, employing nuanced terminology comprehension beyond basic keyword matching; (2) Junior Doctor (JD): conducted comprehensive EHR chart review, extracted pertinent information, and generated inferential responses to clinical issues; (3) Information Engineer (IE): prioritized extracting key terminology from questions, performed global matching within the EHR text, and analyzed semantic negations in identified regions. Additionally, we experimented with a majority vote approach, aggregating responses from the three professional perspectives: “yes” votes formed one category, while “no” and “uncertain” votes were combined into another.
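A minimal sketch of this vote aggregation, grounded in the stated pooling rule (“no” and “uncertain” form one category); the function name is illustrative.

```python
from collections import Counter

def majority_vote(role_answers: list[str]) -> str:
    """Aggregate CRC/JD/IE answers per Pathway A's majority vote:
    'yes' forms one category; 'no' and 'uncertain' are pooled as 'no'."""
    pooled = ["yes" if a == "yes" else "no" for a in role_answers]
    return Counter(pooled).most_common(1)[0][0]

# With three voters the pooled vote can never tie:
# majority_vote(["yes", "uncertain", "yes"])  -> "yes"
# majority_vote(["yes", "no", "uncertain"])   -> "no"
```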
Pathway B: preset stance agent-collab
We proposed an Agent-Collab framework simulating a mock trial, wherein three agents with distinct roles and stances independently evaluate questions. To enhance efficiency, the evaluation process is limited to a maximum of two rounds. (1) Agent A (Positive Assumption) - “Proponent”: Begins with a positive stance (e.g., assuming the patient meets the criteria specified in the question) and conducts a thorough analysis of the admission note to support this assumption, providing a conclusion with supporting evidence; (2) Agent B (Negative Assumption) - “Opponent”: Adopts a negative stance (e.g., assuming the patient does not meet the criteria in the question) and performs an independent analysis of the same admission note, offering a conclusion with supporting reasoning and evidence; (3) Agent C (Adjudicator) - “Judge”: Acts as the adjudicator, reviewing the conclusions and reasoning from Agents A and B. If both agents agree, Agent C delivers the consensus conclusion. In the case of disagreement, Agent C identifies inconsistencies and prompts a second, more detailed evaluation round, ultimately providing a final judgement on the question.
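The control flow of this mock trial can be sketched as follows. `run_agent` is a placeholder for one prompted LLM call; its prompts are not shown, and the `Verdict` container is an assumed structure for illustration.

```python
from dataclasses import dataclass

@dataclass
class Verdict:
    conclusion: str  # "yes" / "no" / "uncertain"
    evidence: str    # sentences quoted from the admission note

def mock_trial(note: str, question: str, run_agent) -> str:
    """Two-round preset-stance debate with adjudication, per Pathway B.

    run_agent(role, note, question, **context) is a placeholder for one
    prompted LLM call returning a Verdict.
    """
    pro = run_agent("proponent", note, question)  # preset positive stance
    con = run_agent("opponent", note, question)   # preset negative stance
    if pro.conclusion == con.conclusion:
        return pro.conclusion                     # consensus: judge confirms it
    # Disagreement: the judge flags inconsistencies and triggers a second,
    # more detailed round in which each agent sees the other's evidence.
    pro2 = run_agent("proponent", note, question, rebuttal=con.evidence)
    con2 = run_agent("opponent", note, question, rebuttal=pro.evidence)
    judge = run_agent("judge", note, question, arguments=(pro2, con2))
    return judge.conclusion
```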
Evaluation metrics
To establish a robust gold standard for all performance evaluations at the question level, clinical experts manually reviewed and annotated the converted question sets from 4,000 admission notes. Specifically, three board-certified junior clinicians—each with three years of standardized residency training in hepatology—independently annotated the dataset. The annotation process was conducted prior to pipeline execution. All annotators were blinded to the outputs of the pipeline and to each other’s annotations. For each item, we adopted a majority voting strategy to determine the final ground truth label. Against this gold standard, the pipeline’s performance was assessed at both the question and criterion levels using several key metrics. For determining the effectiveness of patient identification in the context of clinical trial pre-screening, precision (the proportion of identified patients who were truly eligible) and recall (the proportion of truly eligible patients who were successfully identified) were considered critical indicators. Accordingly, F1 score and accuracy were also calculated based on these gold standard annotations. At the criterion level, we aggregated answers to questions using predefined logical rules (detailed in the Supplemental Material S6). We conducted an advanced evaluation of the pipeline, including a counterfactual inference analysis [29], by manually reviewing outputs that deviated from the gold standard. Additionally, processing time was measured to verify the pipeline’s scalability and efficiency for large-scale pre-screenings.
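For clarity, the gold-standard majority vote and the question-level metrics can be written compactly as below; this is a minimal sketch treating “yes” as the positive class, not the study’s evaluation code.

```python
from collections import Counter

def gold_label(annotations: list[str]) -> str:
    """Majority vote over the three independent expert annotations."""
    return Counter(annotations).most_common(1)[0][0]

def precision_recall_f1(pred: list[str], gold: list[str]) -> tuple[float, float, float]:
    """Question-level metrics, treating 'yes' as the positive class."""
    tp = sum(p == "yes" and g == "yes" for p, g in zip(pred, gold))
    fp = sum(p == "yes" and g != "yes" for p, g in zip(pred, gold))
    fn = sum(p != "yes" and g == "yes" for p, g in zip(pred, gold))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```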
We categorized the questions into relevant clinical domains and further classified them based on reasoning complexity into two types: Classification and Direct Match. Specifically, Direct Match tasks typically involved identifying explicitly stated keywords or entities from the text, whereas Classification tasks required inferring an answer based on broader contextual information or synthesizing multiple pieces of textual evidence. This approach allowed us to analyze performance across different clinical domains and task types, providing deeper insight into the pipeline’s strengths and limitations.
Ethical approval and consent to participate
The study was approved by the Ethics Committee of FAHGUCM (Ref. No. 2020-046-02). The requirement for informed consent was waived by the Ethics Committee due to the observational nature of the study, and all data were de-identified and anonymized.
Results
We automatically generated 87 questions from 58 original criteria, simplifying complex reasoning into straightforward semantic tasks (mapping logic in Supplemental Table S1). The connection between these criteria and specific trials is detailed in Supplemental Table S2. Various prompt templates were used throughout the pipeline, as outlined in Supplemental Materials S2-S5.
Question level assessment
At the question level, both pathways demonstrated robust pre-screening performance (Table 1). Pathway A (majority vote) achieved a precision of 0.877 and recall of 0.822. Pathway B yielded a slightly higher precision (0.892) but lower recall (0.793). While Pathway B’s overall precision gain was modest and its general recall lower, its advantages in specific complex reasoning tasks—particularly in reliability (see Counterfactual Inference section) and achieving a targeted precision-recall balance in those contexts—are noteworthy (task-specific performance detailed below; corresponding recall data in Supplemental Table S3). Notably, Pathway A’s JD role achieved the highest individual recall (0.884, Table 1), consistent with its design for comprehensive data retrieval.
Both pathways demonstrated high annotation consistency, exceeding 97% across all roles and strategies (Fig. 2). Additionally, as shown in Fig. 3, a substantial proportion of responses achieved precision above 0.85, with 63.2% for Pathway B and 71.3% for Pathway A’s majority vote.
Fig. 2 Heatmap of annotation consistency across admission records. Rows correspond to questions and columns to admission records. Each cell indicates whether the answer is consistent (white) or inconsistent (gray) based on expert annotations. Panels (A) through (E) show results for different pathways: (A) Pathway A-Role-CRC, (B) Pathway A-Role-JD, (C) Pathway A-Role-IE, (D) Pathway A-majority vote, and (E) Pathway B
Fig. 3 Precision across questions for varying pathways. This figure shows question-level precision for each strategy: (A) Pathway A-Role-CRC, (B) Pathway A-Role-JD, (C) Pathway A-Role-IE, (D) Pathway A-majority vote, and (E) Pathway B. Questions are ordered by increasing precision along the X-axis, with precision values on the Y-axis. Vertical dashed lines indicate questions with a precision of 0.85
Performance across clinical categories and task complexities is detailed in Table 2 (precision) and Supplemental Table S3 (recall). Pathway A (CRC role) showed high precision in direct match tasks (Table 2), maintaining a strong precision-recall balance (Supplemental Table S3).
In complex reasoning tasks such as ‘Symptom and Event’ and ‘Etiology and Pathology’ (classification), Pathway B exhibited higher precision (0.847 and 0.921 respectively, Table 2) but lower recall (0.775 for ‘Symptom and Event’ and 0.722 for ‘Etiology and Pathology (classification)’, Supplemental Table S3) compared to Pathway A’s JD role, which achieved specific recalls of 0.901 and 0.922 in these respective tasks (Supplemental Table S3).
This highlights a key precision-recall trade-off for complex scenarios: Pathway A (e.g., JD) may be preferred for maximizing patient identification (higher recall), while Pathway B offers higher precision and reliability (see Counterfactual Inference) for identified positive cases in ambiguous contexts.
Criterion level performance
When individual question responses were aggregated to the criterion level (Table 3), Pathway A (majority vote) achieved a precision of 0.922 and a recall of 0.819, while Pathway B demonstrated comparable precision (0.920) but a lower recall (0.785).
These results indicate that while both pathways are highly precise at the criterion level, Pathway A’s majority vote strategy was more effective in capturing a larger proportion of criteria-meeting patients due to its higher recall. The F1 scores (Pathway A: 0.835; Pathway B: 0.820) also reflect this balance. The choice between pathways at the criterion level therefore depends on the relative importance placed on maximizing patient capture (favoring Pathway A) versus maximizing precision for positive identifications; both pathways are strong on precision, but Pathway B’s strengths in handling complex constituent questions, discussed at the question level, may matter for criteria that involve such questions.
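To make the aggregation step concrete, the sketch below shows one way question-level answers could be combined into a criterion-level decision. The ALL/ANY rule encoding is a hypothetical format; the study’s actual logical rules are given in Supplemental Material S6.

```python
def criterion_met(rule: dict, answers: dict[str, str]) -> bool:
    """Combine question-level answers into one criterion-level decision.

    `rule` uses an assumed encoding, e.g. {"op": "ALL", "questions": ["Q12", "Q13"]};
    the study's actual rules are defined in Supplemental Material S6.
    """
    hits = [answers.get(q) == "yes" for q in rule["questions"]]
    return all(hits) if rule["op"] == "ALL" else any(hits)

# Example: a criterion decomposed into two questions that must both hold.
# criterion_met({"op": "ALL", "questions": ["Q12", "Q13"]},
#               {"Q12": "yes", "Q13": "no"})  # -> False
```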
Counterfactual inference
We evaluated counterfactual inference rates across the pathways. Pathway B had the lowest rate at 0.25%, indicating strong consistency. In Pathway A, the rates varied among roles, with CRC at 0.82%, IE at 0.93%, and JD at 1.78%. The majority vote approach effectively reduced Pathway A’s overall counterfactual inference rate to 0.77%, demonstrating the benefit of collective decision-making in minimizing errors.
Pathway B exhibited the lowest counterfactual inference rate (0.25%), a significant advantage for trustworthiness. This ensures that patients identified via Pathway B (often chosen for its precision in specific complex tasks, despite its focused recall in those tasks as per Supplemental Table S3) are more reliably accurate.
Time consumption
To evaluate efficiency, we measured the time spent on question assessment using an optimal concurrency level of 3 to balance speed and resource use. Pathway A showed significantly faster processing times, averaging 0.588 s per question for CRC, 0.395 s for JD, and 0.340 s for IE. In contrast, Pathway B took an average of 2.570 s per question (Fig. 4).
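The measurement setup can be sketched as below, at the concurrency level of 3 reported above; `answer_question` is a placeholder for one Pathway A or B call, and the timing approach is illustrative, not the study’s code.

```python
# Throughput sketch at concurrency level 3; helper names are assumptions.
import time
from concurrent.futures import ThreadPoolExecutor

def timed_batch(tasks, answer_question, workers: int = 3):
    """Process (note, question) pairs concurrently; return answers and mean s/task."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        answers = list(pool.map(lambda t: answer_question(*t), tasks))
    mean_seconds = (time.perf_counter() - start) / len(tasks)
    return answers, mean_seconds
```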
Supplemental Fig. 1 illustrates the variability in processing times for both pathways across different questions. Pathway B exhibited a wider range due to the possibility of a second evaluation round. Overall, our pipeline demonstrated efficient and manageable processing times. With further optimization, Pathway A could potentially assess a single patient’s full eligibility in approximately 10 s.
Additional research
We also conducted a preliminary evaluation of other locally deployed LLMs (QWEN1.5, BAICHUAN, and GLM) on a subset of the data. QWEN1.5 generally outperformed the other models. BAICHUAN showed a tendency for over-inference, while GLM faced challenges with constraint adherence.
Furthermore, we compared the performance of QWEN1.5 and QWEN2 when deployed locally. Interestingly, QWEN1.5 outperformed QWEN2, despite the latter’s enhancements in reasoning capabilities. This discrepancy might be attributed to QWEN2’s conservative approach, leading to a higher rate of false negatives, or the lack of prompt adaptation for QWEN2 in our specific setting.
Discussion
Principle findings
In this study, the pipeline combined anthropomorphized experts’ CoT with a preset stance Agent-Collab strategy and a locally deployed LLM, effectively addressing the challenges of clinical trial pre-screening in hepatology by achieving a notable balance of high precision in critical areas and acceptable recall, alongside improved efficiency. Notably, the entire pipeline operated on a single consumer-grade server within a fully air-gapped, internet-isolated environment.
We therefore propose a hybrid approach for clinical application, based on the pathways’ precision-recall profiles (a minimal routing sketch follows below): (1) for non-inferential recognition tasks involving explicit diagnoses and interventions (i.e., Direct Match tasks), Pathway A with majority vote is recommended, offering a strong balance of high precision (typically > 0.85, Table 2) and high recall (typically > 0.80, Supplemental Table S3) for these tasks; (2) for tasks requiring complex reasoning (i.e., Classification tasks), Pathway B is more suitable, providing excellent precision (generally 0.85–0.92, Table 2) and thus high reliability of positive identifications, alongside a moderate recall profile (around 0.65–0.78, Supplemental Table S3) that reflects a clear precision-recall trade-off for these demanding tasks.
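A minimal dispatch sketch of this hybrid recommendation; the task-type labels follow the paper’s two categories, and `pathway_a_vote`/`pathway_b_trial` are placeholders for the two pathways.

```python
# Hypothetical routing for the hybrid deployment recommended above.
def pre_screen(note: str, question: str, task_type: str,
               pathway_a_vote, pathway_b_trial) -> str:
    if task_type == "direct_match":         # explicit diagnoses/interventions
        return pathway_a_vote(note, question)
    return pathway_b_trial(note, question)  # complex Classification tasks
```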
Effective hepatopathy pre-screening in secure environments
The pipeline demonstrated notable overall performance at the criterion level, achieved within a secure environment with limited resources. For instance, Pathway A (majority vote) achieved an F1-score of 0.835, supported by its precision of 0.922 and recall of 0.819 (Table 3). Additionally, we assessed trial-level precision (aggregation rules detailed in Supplemental Material S7), achieving an average of 0.72, considering the limitations of admission notes in determining patient eligibility. Despite these constraints, the pipeline achieved higher precision in hepatocellular carcinoma (0.878) and cirrhosis trials (0.843), indicating its potential utility in hepatopathy trials. Compared with similar studies in Table 4, our pipeline showed competitive or superior performance in both precision and recall, even against studies that used closed-source LLMs. Notably, compared with the included traditional non-LLM baselines, our LLM-driven pathways indicated clear advantages in key metrics such as precision (e.g., Pathway A: 0.922 vs. 0.540 rule-based; Pathway B: 0.920 vs. 0.737 hybrid) and recall (Pathway B: 0.785 vs. 0.560 hybrid). While acknowledging potential variations across studies, these findings underscore the promise of our LLM strategies for applications valuing secure, local deployment with open-source models in resource-limited clinical settings.
For tasks such as identifying hepatocellular carcinoma, cirrhosis, or specific interventions like TIPS (largely Direct Match tasks), the CoT-based approach (e.g., Pathway A with majority vote) delivered high precision (typically around 0.92, which compares favorably with reported manual annotation performance [30]) and also demonstrated good recall (typically > 0.85 for these tasks, Supplemental Table S3), contributing to both accuracy and comprehensive capture for these straightforward recognition tasks. This was achieved with minimal processing time (typically 0.44 s per task, versus traditional manual annotation, which takes over 10 s per task, representing a roughly 95% reduction [34]). For more complex tasks, such as determining the etiology of liver failure, assessing whether liver cancer is primary, identifying systemic diseases, or verifying HBV antiviral treatment (largely Classification tasks), the Agent-Collab framework (Pathway B) achieved high precision (generally in the 0.85–0.92 range, Table 2). It is important to note that this high precision in complex scenarios was accompanied by a moderate recall profile (around 0.65–0.78 for Pathway B in these tasks, Supplemental Table S3), illustrating a clear precision-recall trade-off in which this pathway prioritizes the accuracy of positive identifications in nuanced scenarios. By leveraging externally defined logic, derived from advanced LLMs or clinical experts, this dual-pathway approach supported the method’s efficiency and adaptability, demonstrating its capability to handle a wide range of clinical scenarios in hepatopathy research by permitting an informed choice based on the desired precision-recall emphasis.
Clinical expertise guides LLM applications to address complex challenges
Researchers often view new technologies and standard clinical practices separately. In this study, we integrated proven clinical methods with advanced technologies to effectively address complex problems, yielding promising results in terms of both precision and recall, as well as their balance.
Pathway A’s anthropomorphized experts guided the LLM from general semantic understanding to targeted entity recognition, a critical feature for enhancing the performance of weaker open-source LLMs. Meanwhile, Pathway B’s preset stance Agent-Collab strategy effectively managed complex and semantically disordered texts, leading to highly precise outputs in such scenarios (as shown in Table 2), though this was often balanced with a moderate recall profile that prioritized the certainty of identified cases (Supplemental Table S3). By combining advanced LLM capabilities with clinical expertise, both strategies showed potential for improving the overall effectiveness (considering both precision and recall) of complex clinical assessments (Fig. 5).
Pathway B demonstrated notable precision in complex reasoning and nuanced categorization tasks (Table 2), which, while ensuring high-certainty identification, was associated with a moderate recall profile (Supplemental Table S3). In contrast, Pathway A (particularly its CRC and IE roles) excelled in direct recognition (Direct Match) tasks, achieving a strong balance of high precision and good recall (Table 2 and Supplemental Table S3, respectively). This fundamental difference in their precision-recall characteristics highlights their complementary strengths: Pathway B is particularly suited for complex scenarios where the accuracy of positive identifications is paramount, supported by its lower counterfactual inference rate. Conversely, Pathway A provides an efficient solution with robust overall performance (e.g., overall majority-vote precision comparable to Pathway B, Tables 1 and 3, but better recall for Direct Match tasks) and lower resource consumption, making it a practical choice when broad capture and efficiency, especially for simpler tasks, are prioritized.
Exploratory insights on pipeline adaptability in traditional Chinese medicine
Notably, our dataset comes from FAHGUCM in China, a regionally influential hospital in hepatopathy (approximately 50,000 annual outpatient visits and around 5,000 inpatient admissions). The dataset integrates both Traditional Chinese Medicine (TCM) and Western medicine, and some natural language expressions in the admission notes reflect a TCM influence. Although these expressions were not part of the main question-answering evaluation, we experimented with incorporating TCM phrasing in our tests. For example, for a criterion such as “whether the patient has insomnia”, the corpus contained “患者寐差” (poor sleep quality) or “卧不安寐” (lies down but cannot sleep) rather than the more commonly seen “患者存在失眠情况” (the patient has insomnia). Recognition performance was satisfactory, suggesting that the pipeline can adapt to these diverse medical terminologies; however, we did not conduct systematic, rigorous testing on these variations.
Limitations
Firstly, the information within admission notes is limited, focusing on key sections such as chief complaints, current disease history, and past medical history. Our gold standard was also derived from these notes. While this ensures a fair evaluation of the pipeline against expert interpretation of these specific documents, the inherent scope of admission notes means that the reported performance metrics reflect fidelity to this data source rather than an absolute measure against more comprehensive patient records. To address this, we selected criteria from multiple clinical trials related to hepatopathy and manually filtered them to ensure relevance to our dataset. Secondly, the data in this study were sourced from a single, regionally influential hepatology center. While covering multiple liver diseases, this may limit the generalizability of our findings, and further external validation is needed to ensure applicability in other clinical settings. Furthermore, some criteria have a limited number of positive instances, making recall sensitive to misclassifications; this should be considered when interpreting those specific recall figures. Additionally, while this study evaluated two distinct strategic pathways, a detailed component-wise analysis or ablation study of the more complex Pathway B, which could further elucidate the specific contributions of its individual elements (e.g., agent roles or debate mechanisms), was not performed as part of this initial investigation.
Despite these limitations, our approach provides valuable insights into the use of limited EHR data for clinical trial matching. By focusing on likely available data points, we enhance the real-world applicability of our analysis. This focused methodology also facilitates scalability to other settings with similar data limitations. Future work should aim to enhance data collection in EHR systems, improve model robustness for incomplete data, and explore more granular analyses of complex model architectures to optimize their design. Expanding this approach to other medical domains also remains a key future direction.
Conclusions
We have developed an effective patient pre-screening strategy, showing high precision and achieving good recall in hepatopathy trials. Designed to operate efficiently in environments with limited IT resources and stringent security requirements, our pipeline combines different approaches to achieve precision, recall and efficiency across various clinical tasks. This cost-effective and adaptable solution has the potential to enhance the pre-screening process and improve patient recruitment in clinical trials by aiming for a comprehensive and reliable initial screen.
Data availability
The data analyzed in this study are not publicly available due to privacy or ethical restrictions but can be obtained from the corresponding author upon reasonable request.
References
Aithal GP, Palaniyappan N, China L, et al. Guidelines on the management of ascites in cirrhosis. Gut. 2021;70(1):9–29. https://doi.org/10.1136/gutjnl-2020-321790. [published Online First: 2020/10/18].
Hersh WR, Weiner MG, Embi PJ, et al. Caveats for the use of operational electronic health record data in comparative effectiveness research. Med Care. 2013;51(8 Suppl 3):S30–7. https://doi.org/10.1097/MLR.0b013e31829b1dbd. [published Online First: 2013/06/19].
Escudie JB, Rance B, Malamut G, et al. A novel data-driven workflow combining literature and electronic health records to estimate comorbidities burden for a specific disease: a case study on autoimmune comorbidities in patients with Celiac disease. BMC Med Inf Decis Mak. 2017;17(1):140. https://doi.org/10.1186/s12911-017-0537-y. [published Online First: 2017/10/01].
Ni Y, Wright J, Perentesis J, et al. Increasing the efficiency of trial-patient matching: automated clinical trial eligibility pre-screening for pediatric oncology patients. BMC Med Inf Decis Mak. 2015;15:28. https://doi.org/10.1186/s12911-015-0149-3. [published Online First: 2015/04/17].
Beck JT, Rammage M, Jackson GP, et al. Artificial intelligence tool for optimizing eligibility screening for clinical trials in a large community cancer center. JCO Clin Cancer Inform. 2020;4:50–9. https://doi.org/10.1200/CCI.19.00079. [published Online First: 2020/01/25].
Hassanzadeh H, Karimi S, Nguyen A. Matching patients to clinical trials using semantically enriched document representation. J Biomed Inf. 2020;105:103406. https://doi.org/10.1016/j.jbi.2020.103406. [published Online First: 2020/03/15].
Tissot HC, Shah AD, Brealey D, et al. Natural language processing for mimicking clinical trial recruitment in critical care: a semi-automated simulation based on the LeoPARDS trial. IEEE J Biomed Health Inform. 2020;24(10):2950–59. https://doi.org/10.1109/JBHI.2020.2977925. [published Online First: 2020/03/10].
Lee P, Bubeck S, Petro J. Benefits, limits, and risks of GPT-4 as an AI chatbot for medicine. N Engl J Med. 2023;388(13):1233–39.
Lee P, Goldberg C, Kohane I. The AI revolution in medicine: GPT-4 and beyond: Pearson, 2023.
Van Veen D, Van Uden C, Blankemeier L, et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat Med. 2024;30(4):1134–42. https://doi.org/10.1038/s41591-024-02855-5. [published Online First: 2024/02/28].
Devarakonda MV, Mohanty S, Sunkishala RR, Mallampalli N, Liu X. Clinical trial recommendations using semantics-based inductive inference and knowledge graph embeddings. J Biomed Inform. 2024;154:104627.
Van Veen D, Van Uden C, Blankemeier L et al. Clinical text summarization: Adapting large language models can outperform human experts. 2023.
Yuan D, Rastogi E, Naik G et al. A Continued Pretrained LLM Approach for Automatic Medical Note Generation. 2024.
Singhal K, Tu T, Gottweis J et al. Towards Expert-Level Medical Question Answering with Large Language Models. 2023. https://ui.adsabs.harvard.edu/abs/2023arXiv230509617S (accessed May 01, 2023).
Thai DN, Ardulov V, Mena JU et al. ACR: A Benchmark for Automatic Cohort Retrieval. 2024.
Zhu Y, Yuan H, Wang S et al. Large language models for information retrieval: A survey. 2023.
Wei J, Wang X, Schuurmans D, et al. Chain-of-thought prompting elicits reasoning in large language models. 2023. https://doi.org/10.48550/arXiv.2201.11903.
Kenton Z, Siegel NY, Kramár J et al. On scalable oversight with weak LLMs judging strong LLMs. 2024.
Fang Y, Li M, Wang W, Lin H, Feng F. Counterfactual debating with preset stances for hallucination elimination of LLMs. 2024.
ChiCTR2100044187. https://www.chictr.org.cn/hvshowproject.html?id=97371&v=1.6
NCT04353193. https://clinicaltrials.gov/study/NCT04353193
NCT04850534. https://clinicaltrials.gov/study/NCT04850534
NCT03911037. https://clinicaltrials.gov/study/NCT03911037
NCT01311167. https://clinicaltrials.gov/study/NCT01311167
NCT04021056. https://clinicaltrials.gov/study/NCT04021056
Bai J, Bai S, Chu Y et al. Qwen Technical Report. 2023. https://ui.adsabs.harvard.edu/abs/2023arXiv230916609B (accessed September 01, 2023).
Yang A, Xiao B, Wang B et al. Baichuan 2: Open Large-scale Language Models. 2023. https://ui.adsabs.harvard.edu/abs/2023arXiv230910305Y (accessed September 01, 2023).
GLM T, Zeng A, Xu B et al. ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools. 2024.
Ji Z, Lee N, Frieske R, et al. Survey of hallucination in natural language generation. ACM Comput Surv. 2023;55(12):1–38.
Jin Q, Wang Z, Floudas CS et al. Matching patients to clinical trials with large language models. 2023.
Nievas M, Basu A, Wang Y, Singh H. Distilling large language models for matching patients to clinical trials. J Am Med Inform Assoc. 2024:ocae073.
Kusa W, Mendoza ÓE, Knoth P, Pasi G, Hanbury A. Effective matching of patients to clinical trials using entity extraction and neural re-ranking. J Biomed Inform. 2023;144:104444.
Ghosh S, Abushukair HM, Ganesan A, Pan C, Naqash AR, Lu K. Harnessing explainable artificial intelligence for patient-to-clinical-trial matching: a proof-of-concept pilot study using phase I oncology trials. PLoS ONE. 2024;19(10):e0311510.
Wei Q, Franklin A, Cohen T, Xu H. Clinical text annotation - what factors are associated with the cost of time? AMIA Annu Symp Proc. 2018;2018:1552–60. published Online First: 2019/03/01.
Acknowledgements
We sincerely thank the China National GeneBank for their technical support. We also appreciate the support from RUIYI’s Clinical Multi-omics Data Research Workstation for our research.
Funding
This study was supported by the Second Batch of National TCM Clinical Research Base Construction Units (No. 131 [2018], issued by the National Administration of Traditional Chinese Medicine).
Author information
Contributions
X.G.: Conceptualization, Methodology, Writing - Original Draft, Formal Analysis. H.L.: Methodology, Formal Analysis, Software, Writing - Review & Editing. X.W.: Data Curation, Validation, Visualization, Supervision. L.L.: Investigation, Project Administration, Writing - Review & Editing. Y.X.: Conceptualization, Funding Acquisition, Writing - Review & Editing, Supervision. L.W.: Methodology, Project Administration, Writing - Review & Editing, Supervision.
Ethics declarations
Ethics approval
The study was approved by the First Affiliated Hospital of Guangxi University of Chinese Medicine (approved Ref. No. 2020-046-02).
Patient consent statement
The requirement for informed consent was waived by the Ethics Committee of the First Affiliated Hospital of Guangxi University of Chinese Medicine, due to the retrospective nature of the study, and all clinical data were de-identified and anonymized.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Gui, X., Lv, H., Wang, X. et al. Enhancing hepatopathy clinical trial efficiency: a secure, large language model-powered pre-screening pipeline. BioData Mining 18, 42 (2025). https://doi.org/10.1186/s13040-025-00458-5