Main

Circular RNAs (circRNAs) are a unique class of noncoding RNA molecules characterized by their covalent circular structure1. Recent studies have revealed their important roles in various cellular processes, including the regulation of signaling pathways, sequestration of miRNA and RNA-binding proteins, initiation of gene transcription, inhibition of mRNA translation and promotion of protein degradation1,2,3,4,5,6,7,8,9,10. Advances in RNA-sequencing (RNA-seq) technologies have enabled the rapid accumulation of large RNA-seq datasets, providing unprecedented opportunities for circRNA research. For instance, Chinnaiyan et al. generated sequencing data from over 2,000 human cancer samples, identifying circRNAs with potential as cancer biomarkers11. These expansive datasets also serve as crucial resources for training large-scale models to predict circRNAs and their regulation mechanisms. In our previous work, we used RNA-seq data from 394 tissue and cell line samples to train a deep learning model capable of predicting circRNA differential splicing from single-cell and spatial transcriptomics data, as well as other low-depth datasets12. Given the diverse roles and complex regulation networks of circRNAs, the continued expansion of RNA-seq data holds immense potential for uncovering novel biogenesis and degradation mechanisms, predicting unknown functions and optimizing sequence designs for therapeutic applications.

Current circRNA detection methods can be generally classified into alignment-based and pseudo-reference-based approaches13,14 (Supplementary Table 1). Alignment-based approaches identify circRNA-specific alignment features distinct from linear RNAs or background noise15. For example, we developed CIRI16 and CIRI2 (ref. 17), a tool series renowned for reliable de novo circRNA detection18 and superior sensitivity in recent large-scale benchmarking13. In contrast, pseudo-reference-based approaches rely on predefined back-splice junction (BSJ) libraries constructed from annotated exon combinations. While effective in reducing false positives, this strategy is limited to well-annotated genomes and cannot detect novel circRNAs with unannotated splice sites13,14. Despite a decade of development, current tools still suffer from poor scalability due to exponentially increasing runtime and memory demands on large RNA-seq datasets, and accurate quantification of circRNAs remains challenging due to their low abundance relative to mRNAs19 and strong batch effects across different RNA-seq cohorts.

To address these challenges, we introduce CIRI3, a tool for large-scale circRNA detection and characterization. Building on its predecessor, CIRI3 is optimized for rapid circRNA detection from multisample alignment results, offering improved quantification accuracy, runtime efficiency and memory usage. Key innovations in CIRI3 include robust identification of intronic self-ligated circRNAs and targeted quantification for user-defined circRNA lists. CIRI3 supports alignments from both BWA20 and STAR21, yielding consistent results across aligners (Extended Data Fig. 1a–c). CIRI3 outputs both BSJ and forward-splice junction (FSJ) reads, enabling comprehensive statistical analyses, including differential expression, differential splicing, and differential circularization (Methods).

The CIRI3 workflow consists of two primary alignment-scanning modules—high-confidence BSJ discovery and FSJ/BSJ read recovery (Fig. 1a). To optimize efficiency, CIRI3 adopts a dynamic multithreaded task partitioning approach during the two scanning phases, mitigating runtime inefficiencies caused by variability in single-thread performance (Methods and Fig. 1a, step 1). In the first scan, CIRI3 identifies potential paired chiastic clipping signals in read alignment (Fig. 1a, step 2), a highly sensitive proxy for BSJ16,17. To ensure reliability, CIRI3 refines and filters paired chiastic clipping signals by requiring perfectly matched splicing signals flanking the putative BSJs, including canonical GT−AG dinucleotides or noncanonical splicing motifs. When a GTF/GFF annotation is provided, CIRI3 extracts exon and intron boundaries to further enhance splicing signal filtration. These steps yield a set of high-confidence BSJs.

Fig. 1: Overview of the CIRI3 pipeline.
figure 1

a, Step 1: input files include SAM/BAM alignments and a reference (ref.) genome (required), with annotation (including exon coordinates) and circRNA list files optional. Step 2: high-confidence (high-conf.) BSJs are identified by correcting and filtering chiastic clipping signals using splicing signals (for example, GT–AG). Step 3: FSJ and candidate BSJ reads are obtained using blocking search, multiseed matching with MLE and Smith–Waterman alignment. Step 4: CIRI3 generates detailed circRNA annotations and expression profiles for BSJ and FSJ reads. Step 5: CIRI3 provides three types of statistical analysis. b, Sensitivity and precision of five circRNA detection tools on the Hs68 cell line, with corresponding F1 scores indicated. c, The bar plot shows the absolute (Abs) Pearson correlation values between log2(BSJ read counts) of software-predicted circRNAs and the Cq values of experimentally validated circRNAs from RT−qPCR experiments in the SW480 cell line. The numbers above each bar represent the counts of experimentally validated circRNAs detected by each tool. d, Runtime of six tools on datasets with 200, 800, 2,400 and 4,800 million reads. All tools were run using 10 threads, except for KNIFE, which used the default thread setting. The asterisk indicates that the entire circRNA analysis process could not be completed within 14 days on a 256 GB memory server. e, The memory required (random-access memory (RAM)) by the tools for the same data as in d.

Source data.

In the second scan, CIRI3 employs a blocking search approach to recover missed BSJ reads and identify FSJ reads associated with each high-confidence BSJ (Fig. 1a, step 3). The reference genome is segmented into small blocks and each high-confidence BSJ is indexed into a hash table using the blocks corresponding to its splice sites as key values. This enables targeted searches for junction reads within the associated blocks and their neighbors, substantially improving search efficiency. For reads supporting only one splice site of a BSJ in raw alignments, CIRI3 re-evaluates their splice junction positions and orientations through pseudo- or re-alignment with genome sequences. Building on the multiseed pseudo-alignment and maximum likelihood estimation (MLE) adopted in CIRI2, CIRI3 incorporates Smith−Waterman local sequence alignment as an additional criterion to improve classification accuracy. By applying count or ratio thresholds to the BSJs identified in the first scan, CIRI3 generates detailed annotations and expression profiles of circRNAs across multiple samples (Fig. 1a, step 4). Finally, CIRI3 facilitates downstream differential expression analyses using its integrated statistical algorithms (Fig. 1a, step 5).

To evaluate circRNA detection performance, we compared CIRI3 with five widely used tools (that is, find_circ8, KNIFE22, CIRCexplorer3 (ref. 23), DCC24 and CIRI2) using RNA-seq data from Hs68 cell line samples treated with or without RNase R25. Detected circRNAs were classified as enriched (putative positives), depleted (false positives) or unaffected (Methods). Compared with find_circ, KNIFE, CIRCexplorer3 and DCC, CIRI3 demonstrated a higher putative positive and lower false positive rate (Extended Data Fig. 1d–h). While CIRI2 showed substantial overlap with CIRI3, 54 out of 109 circRNAs uniquely detected by CIRI3 were putative positives. Among all these tools, CIRI3 achieved the highest sensitivity and precision (F1 score of 0.74) (Fig. 1b). Notably, CIRI3 detected the most putative positives among circRNAs unique to each tool.

Moreover, CIRI3 can detect intronic self-ligated circRNAs that were undetectable by other short-read tools26 (Extended Data Fig. 2). Applied to liver RNA-seq samples from five species27, CIRI3 identified 59 such events, where all 16 detected in opossum were enriched after RNase R treatment. The RNase R validated intronic self-ligated circRNAs were predominantly ranged from 300 to 800 bp, and 90% originated from protein-coding genes. In a prostate cancer cohort (n = 181), CIRI3 detected 2,286 intronic self-ligated circRNAs derived from introns that were significantly shorter than those not involved in circRNA formation (Wilcoxon rank-sum test, P < 2.2 × 10⁻¹⁶), suggesting that shorter introns are more prone to back-splicing and circRNA biogenesis. Together, these results highlight the scalability and efficiency of CIRI3 in identifying diverse circRNA subtypes across different species and disease contexts.

To benchmark the quantification accuracy of CIRI3 against other tools, we analyzed simulated paired-end RNA-seq datasets with 20−100× coverage, calculating the Pearson correlation coefficient (PCC) and root mean squared error (r.m.s.e.) between the BSJ read counts identified by each tool and the simulated BSJ read counts17 (Extended Data Fig. 3). CIRI3 consistently achieved PCC values above 0.983, with a mean of 0.990, outperforming all others across coverage levels. This improvement over CIRI2 (mean PCC of 0.954) can be attributed to the integration of the Smith−Waterman alignment, which recovers BSJ reads missed by CIRI2. Furthermore, CIRI3 accurately quantified FSJ reads and junction ratios, achieving mean PCC values of 0.977 and 0.980, respectively. In terms of r.m.s.e., CIRI3 consistently exhibited the lowest errors across all coverage levels, further confirming its superior quantification accuracy.

Next, we further evaluated CIRI3’s performance on real RNA-seq data from three cell lines (SW480, NCI-H23 and HLF), with 1,479 circRNAs quantified using RT−qPCR13. Among the six tools evaluated, CIRI3 and CIRI2 were the most sensitive, detecting 1,172 and 1,174 validated circRNAs, respectively. CIRI3 also demonstrated the highest quantification accuracy for the SW480 and NCI-H23 datasets, with PCC values of −0.701 and −0.728, respectively, when comparing log-transformed BSJ read counts to Cq (quantification cycle) values (Fig. 1c and Extended Data Fig. 4). For the HLF cell line, all tools exhibited lower accuracy, but CIRI3, CIRCexplorer3 and DCC showed comparable correlations ranging from −0.656 to −0.653, outperforming other methods. We also benchmarked computational efficiency (Extended Data Fig. 5a,b and Supplementary Table 1). CIRI3 processed the 295-million-read SW480 dataset in just 0.25 h, while other tools were 8−149 times slower, requiring 2.0−37.1 h with 25 threads. Memory usage was also a notable challenge for most tools. CIRCexplorer3, find_circ, DCC, CIRI2 and KNIFE required 27.7, 34.9, 50.8, 139.2 and 205.1 GB of memory, respectively, substantially exceeding the modest 12.2 GB required by CIRI3.

In large-scale data analysis, a common strategy to reduce computational resource requirements is to process datasets individually before combining results during downstream integration. To assess the impact of this separate-detection mode on circRNA analysis, we divided the SW480 dataset into three subsets and compared the results obtained from processing the entire dataset (joint-detection mode) versus combining results from the subsets. We found that all tools showed compromised performance in circRNA detection and quantification when using the separate-detection mode. Taking find_circ and DCC as examples (Extended Data Fig. 5c–f), the separate-detection mode reduced memory usage by 49.3% and 22.6%, respectively, but detected 22,719 and 8,312 fewer circRNAs, missing 53 and 11 out of 294 and 292 RT−qPCR validated circRNAs, respectively. In addition, quantification accuracy declined, with the absolute PCC dropping from 0.647 and 0.655 to 0.528 and 0.592, respectively. When focusing on low-abundance circRNAs, the correlation between read counts from the separate-detection mode and the joint-detection mode was only moderate, with PCC values of 0.682 and 0.789 for find_circ and DCC, respectively. Similarly, reprocessing the Hs68 cell line samples using this strategy resulted in a substantial reduction in enriched circRNAs detected by find_circ. To further evaluate the impact on cohort-level circRNA analysis, we analyzed data from 181 prostate cancer samples using CIRI3 for a comparison between the joint-detection mode and the separate-detection mode. The results showed that the separate-detection mode identified 24,885 fewer circRNAs, corresponding to a 16.2% decrease (Extended Data Fig. 5g). Furthermore, the separate-detection mode resulted in lower average BSJ read counts (47.5 versus 38.6) and fewer identified samples (14.7 versus 11.3) per circRNA. In particular, the joint-detection mode identified 38,578 highly prevalent circRNAs expressed in at least 15 samples, which is 38.7% more than the separate-detection mode. These findings underscore the limitations of the separate-detection mode and highlight the importance of computational efficiency, which enables simultaneous processing of large or multiple datasets for comprehensive circRNA analysis.

To further evaluate the computational efficiency, we tested all tools on RNA-seq datasets with 200 million, 800 million, 2,400 million and 4,800 million reads. CIRI3 was the only tool capable of processing the terabyte-sized 4,800 million-read datasets in 24 h (Fig. 1d). While all tools could handle 200 million-read dataset, KNIFE failed to process the 800 million-read dataset, and DCC was the only tool besides CIRI2 and CIRI3 that could process the 2,400 million-read dataset within 2 weeks using less than 256 GB of memory (with ten threads). Compared with CIRI2, CIRI3 demonstrated much faster processing speeds and lower memory usage. Notably, while memory usage increased with dataset size for all tools, CIRI3 showed the smallest increase, ranging from 12 GB to 27.5 GB (Fig. 1e). As an ultimate stress test, we attempted to process a collection of 296 deeply sequenced samples from RNAAtlas28, totaling 39.2 trillion reads. CIRI3 was the only tool capable of completing this task, processing 21 TB of SAM files in 105.31 h with a peak memory usage of 45.85 GB.

To systematically evaluate the robustness of circRNA detection tools, we performed subsampling analyses using RNA-seq data from two cell lines (Methods and Extended Data Fig. 6a–f). In Hs68, the proportion of enriched circRNAs detected by each tool remained stable, while the absolute number of enriched circRNAs increased markedly with sequencing depth. Notably, CIRI3 consistently identified the highest number of enriched circRNAs across all subsampling levels. Further validation was conducted using RNA-seq data from the SW480 cell line, which includes 416 RT−PCR-validated circRNAs. The results demonstrated a steady improvement in detection performance for all tools as sequencing depth increased. Among them, CIRI3 exhibited the best performance at all subsampling levels, identifying the largest number of validated circRNAs. Taken together, these findings indicate that although all tools demonstrate good robustness, CIRI3 shows a clear advantage in detection sensitivity.

Inspired by the Percent Spliced In metric widely used in splicing analysis, the BSJ ratio was designed to quantify the relative abundance and differential splicing of circRNAs, offering insights into their biogenesis regulation16,17,24,29. However, batch effects in RNA-seq data have been reported to impact expression quantification based on read counts30. To evaluate this for BSJ ratios calculated by CIRI3, we analyzed RNA-seq data from 62 samples across four tissues (Supplementary Table 2)27,31,32,33,34,35,36, and performed principal component analysis on both BSJ read counts and junction ratios (Extended Data Fig. 7a). While BSJ read counts failed to distinguish samples by tissue type, circRNA junction ratios clearly clustered samples according to their tissue of origin, with minimal batch effects. For example, liver tissue samples showed unique circRNA abundance profiles based on BSJ ratios but showed similar expression patterns to spleen and testis samples when using BSJ read counts.

We next investigated whether BSJ ratios could serve as biomarkers in clinical research. Using RNA-seq data from four studies, we analyzed 44 pairs of tumor and adjacent normal tissue samples from patients with hepatocellular carcinoma (HCC)31,37,38. Applying the statistical tests in CIRI3, we identified 102 differentially expressed circRNAs based on read counts and 563 differentially spliced circRNAs based on junction ratios. Only 18 circRNAs overlapped between the two analyses, and the host genes of these two circRNA groups showed significant enrichment in distinct biologically relevant pathways, suggesting that BSJ ratios capture a different set of circRNAs with potential clinical relevance (Extended Data Fig. 6g,h). While differentially spliced circRNAs exhibited clear BSJ ratio patterns distinguishing tumor and normal samples (Extended Data Fig. 7b,c), no such patterns were observed in the counts per million (CPM) values of differentially expressed circRNAs. Notably, several experimentally validated HCC-associated circRNAs, including hsa-MET-0001 (ref. 39), hsa-SMARCA5_0005 (ref. 40) and hsa-ZKSCAN1_0001 (ref. 41) (circAtlas ID), showed significant changes in junction ratios (Extended Data Fig. 7d).

To compare the utility of junction ratios and read counts to identify biomarkers, we trained support vector machine models using BSJ ratios and CPM values of 30 representative circRNAs from three studies to classify normal and tumor samples across all four studies. BSJ ratio models consistently outperformed CPM-based models, achieving higher accuracy in the test datasets (mean value 0.924 versus 0.661) (Extended Data Fig. 7e,f), The smaller performance gap between training and test data for BSJ ratio models further underscores their low batch effect and generalization ability in biomarker studies.

To systematically investigate the expression pattern and diagnostic potential of circRNAs in cancer, we collected 2,535 total RNA-seq data from human cancerous and normal tissue samples, covering 30 cancer types (Methods) (Fig. 2a). We identified 470,641 circRNAs across all samples, with an average of 8,245 detected in each sample. Colorectal cancer (CRC), triple-negative breast cancer (TNBC) and glioblastoma multiforme (GBM) samples exhibited the highest numbers, indicating abundant circRNA expression in these cancer types (Fig. 2b). On the basis of this dataset, we constructed CIRIonco (CIRI Oncology, https://ngdc.cncb.ac.cn/cirionco), a comprehensive circRNA database. A comparison with existing circRNA databases11,42,43 revealed an overlap of 294,692 circRNAs (62.6%) (Extended Data Fig. 8a). A higher proportion of BSJs recorded exclusively in CIRIonco were located in intronic regions, suggesting the high sensitivity of CIRI3 in identifying these previously underrepresented circRNAs due to its annotation-independent design (Extended Data Fig. 8b).

Fig. 2: Cohort analysis of human cancer tissue data by CIRI3.
figure 2

a, An overview of the CIRIonco database dataset: 2,535 total RNA samples from the GEO database were analyzed, encompassing 22 tissue types and 30 cancer types. circRNAs were identified using the BWA and CIRI3 pipeline to construct a tumor-specific circRNAs database. b, A stacked bar plot showing the number of circRNAs detected across different cancer types, with colors indicating different circRNA types. The filled squares represent sample number for each disease. CRC, colorectal cancer; TNBC, triple-negative breast cancer; GBM, glioblastoma multiforme; HGSC, high-grade serous carcinoma; ALL, acute lymphoblastic leukemia; AML, acute myeloid leukemia; PRCA, prostate cancer; ASCC, anal squamous cell carcinoma; UCEC, uterine corpus endometrial carcinoma; HNSCC, head and neck squamous cell carcinoma; L-BRCA, luminal breast cancer; GC, gastric cancer; ESCC, esophageal squamous cell carcinoma; CC, cervical cancer; RCC, renal cell carcinoma; THCA, thyroid carcinoma; NSCLC, non-small cell lung cancer; SARC, sarcoma; LGG, low-grade glioma; HB, hepatoblastoma; CSCC, cutaneous squamous cell carcinoma; SKCM, skin cutaneous melanoma; PDAC, pancreatic ductal adenocarcinoma; BLCA, bladder urothelial carcinoma; THC, thymic carcinoid; VS, vestibular schwannoma; ACC, adrenocortical carcinoma; APA, aldosterone-producing adenoma; CPA, cortisol-producing adenoma; THYM, thymoma. c, A schematic of the pretrained model, which leverages differentially spliced circRNAs to classify cancer versus normal samples and the transfer learning model applied to colon tissue. d, Left: validation and test AUROC of the pretrained model on non-colon samples. Right: AUROC on colon samples for generalization, small-scale training and transfer learning. ACC, accuracy. e, A schematic of the hierarchical tree based on system, tissue and disease levels, containing candidate markers for classification. f, Left: the network plot illustrates overlaps in circRNAs markers across hierarchical levels. The node size and pie chart represent marker count and prediction precision. The circle on the edge shows the overlap marker number with different background colors (gray, disease level; brown, tissue level; blue, system level). Right: the bar plot shows the predict precision at the tissue level.

Source data.

Next, we used differentially spliced circRNAs between cancer and normal samples as input features to train a five-layer fully connected deep neural network (pretrained model) for sample classification (Fig. 2c). This pretrained model performed well on both the validation and test datasets, achieving overall accuracies and areas under the receiver operating characteristic curve (AUROCs) over 88% and 0.91, respectively (Fig. 2d). A generalization test on colon tissue samples not included in the training set achieved an accuracy of 88% and an AUROC of 0.94, indicating strong performance on unseen tissue types. We further assessed the model’s transferability using 172 colon tissue samples from small cohorts in multiple studies, with one study used as the test set and the others used for training in each evaluation. While the newly trained model yielded a test accuracy of only 77.78%, transfer learning by freezing the first two layers of the pretrained model and fine-tuning the remaining layers achieved an accuracy of 88.89% and an area under the curve of 0.98, demonstrating the pretrained model’s superior performance in small-sample data.

Building upon this framework, we further used circRNAs as biomarkers to stratify cancer samples at the system, tissue and disease levels (Fig. 2e and Extended Data Fig. 8c). We constructed a system, tissue and disease stratification tree, and used differential spliced circRNAs as candidate markers at each hierarchical level. These markers were then used to train LightGBM classifiers to predict the system, tissue or disease origin of each sample. Our results revealed substantial overlap and connection among marker circRNAs across different systems, tissues and diseases, and the consequent biomarker network highlighted the complexity and diversity of circRNA regulation (Fig. 2f). LightGBM classifiers achieved high classification performance, with mean precision values of 0.959 at both the system and tissue levels, and 0.974 at the disease level, further demonstrating the strong potential of BSJ ratio-based circRNAs as robust biomarkers. The CIRIonco database provides an extensive and scalable resource for cancer-related circRNA research and functional exploration, laying a solid foundation for their application in cancer subtyping and precision diagnostics.

Over the past decade, our understanding of circRNAs has greatly improved, in part due to advancements in circRNA detection methodology that have facilitated the discovery of their biogenesis and functions44,45,46,47,48. In this study, we presented CIRI3, which addresses several critical challenges in circRNA detection. The scalable design of CIRI3 makes it highly efficient at processing cohort-scale data, and also capable of discovering underexplored circRNAs lacking canonical GT−AG splice sites, such as intronic self-ligated circRNAs. CIRI3 also facilitates diverse downstream analysis by providing accurate identification and quantification of circRNAs. For example, while CIRI3 was not developed to directly detect the internal structure or full-length isoform of circRNAs, its precise BSJ position output can enhance the detection of these features by leveraging either overlapped paired-end data (for example, CIRI-full33) or long-read (for example, CIRI-long26) data.

It should be noted that the performance of circRNA detection methods is affected by library preparation strategies. RNase R treatment is a circRNA-specific enrichment strategy, which largely improves the sensitivity of detection. However, samples treated by RNase R only constitute a relatively small proportion of existing RNA-seq data and the efficiency of enrichment varies across samples and protocols, making RNase R treatment unsuitable for quantitative analysis14. Nevertheless, CIRI3 enhances the accuracy of BSJ ratio measurement, a valuable metric for filtering candidate circRNAs based on relative changes between RNase R-treated and untreated total RNA-seq data17,24. In contrast, total RNA-seq data preserves the features of both circRNAs and linear RNAs, representing the most commonly used data for circRNA analysis. The BSJ ratio calculated from total RNA-seq data reflects the natural proportion of circRNAs compared with other RNAs, and our study further demonstrated its high reliability and low variability across different datasets, supporting its utility in circRNA biomarker identification. However, CIRI3 was not specifically designed for correcting batch effects arising from variations in RNA integrity values33 or circRNA sequencing protocols14. Therefore, careful experimental design and rigorous quality control of circRNA libraries remain essential to minimize technical batch effects.

Methods

CIRI3 input and output

CIRI3 requires alignment files (SAM/BAM) generated by mapping FASTQ files to a reference genome (FASTA) using BWA or STAR (Supplementary File 1), as well as the same reference genome (FASTA) as inputs. Optionally, users can provide a gene annotation file (GTF) and a list of circRNAs of interest. This pipeline outputs detailed annotations of circRNAs and their expression profiles for downstream analysis.

Partition of alignment files

CIRI3 employs dynamic multithreaded task partitioning to optimize computational resource allocation. Alignment files are segmented into chunks, with the number of chunks adjusted based on input size and the preset thread number.

If the input data are less than 200 GB or if dividing the input size by the number of threads results in chunks less than 20 GB, the number of chunks is set to be the number of threads, with each thread processing one chunk. Otherwise, the data are split into 20 GB chunks. As scanning proceeds, CIRI3 assigns available threads to unscanned chunks, preventing overruns for single thread and improving overall efficiency.

High-confidence BSJ discovery

CIRI3 begins by scanning SAM alignments to identify reads exhibiting paired chiastic clipping patterns, characterized by chiastic soft/hard clipping in their CIGAR strings (for example, xS/HyMzS/H paired with (x + y)S/HzM and/or xM(y + z)S/H), which indicate candidate back-splice junctions. Putative junctions are further validated by identifying canonical (GT−AG) or noncanonical splicing signals and, if provided, exon−intron boundaries from the annotation file. Reads passing these filters are classified as supporting ‘high-confidence BSJs’. Among these, reads with mapping scores above a threshold of 10 for both segments are designated as ‘high-confidence BSJ reads’.

FSJ/BSJ reads recovery

In the second scan, CIRI3 utilizes a blocking search approach. The reference genome is segmented into non-overlapping blocks, with each block’s length defined as: block_size = max_read_length – 2 × ε, where max_read_length is the maximum read length and ε is a tolerance parameter. This parameter provides a margin of error for the alignment positions of reads supporting BSJ sites, thereby improving detection sensitivity. High-confidence BSJs are indexed into these intervals based on their coordinates, and targeted searches are conducted within the associated block and neighboring blocks to recover junction-related reads. Reads supporting only one splice site of a high-confidence BSJ are re-evaluated using MLE based on multiple seed matching. If necessary, the Smith−Waterman algorithm is applied for precise alignment and classification.

circRNA annotation and quantification

CIRI3 quantifies circRNAs using the identified junctions. For BSJs sharing start or end sites, the Smith−Waterman algorithm assesses sequence similarity to determine if they represent distinct junctions. High-confidence BSJ reads from similar junctions are merged toward the junction with higher counts. Finally, CIRI3 outputs detailed circRNA annotations and expression profiles based on supporting BSJ and FSJ reads.

Identification of the intronic self-ligated circRNAs

CIRI3 identifies intronic self-ligated circRNAs during the first scan. The coordinates of these BSJ reads are corrected using intron boundaries from the annotation file. The Smith−Waterman algorithm aligns the 5 bases from both ends of these reads against the reference genome, filtering out reads with >1 bp insertion, deletion or mismatch.

Differential expression analysis

CIRI3 provides three levels of differential expression analysis: (1) differential expression and (2) differential splicing of circRNAs and (3) differential circularization (relative abundances of circRNAs from the same gene).

For differential expression analysis of single circRNA, CIRI3 employs algorithms from CIRIquant29. For datasets without biological replicates, differential expression scores are calculated using a generalized fold change approach. For datasets with replicates, CIRI3 mitigates systematic batch effects before performing differential expression analysis. First, gene expression profile is obtained using featureCounts (version 2.0.2)49. Next, using Trimmed Mean of M-value normalization in edgeR package (version 4.2.2)50, normalization factors based on gene expression are calculated and applied to normalize circRNA expression profile. Differential circRNA expression across conditions is then performed using the quasi-likelihood ratio test in edgeR. For differential splicing and circularization, CIRI3 employs algorithms from rMATS (version 4.1.2)51.

Benchmarking circRNA detection

Using the Hs68 cell line dataset, circRNA enrichment is determined by comparing read counts between RNase R-treated and untreated datasets of CIRI3 and five commonly used detection tools: CIRI2, CIRCexplorer3, DCC, and find_circ and KNIFE. The parameters used for each tool are provided in Supplementary File 1. A circRNA is labeled as enriched if its BSJ read count in treated samples is at least twice that in untreated samples, depleted if its count in treated samples is less than in untreated samples and unaffected otherwise, and accordingly, enriched circRNAs in RNase R-treated samples were considered putative positives, while depleted ones were treated as false positives. For each run, the start and end times were recorded to monitor the total runtime, while RAM usage was assessed by executing the ‘qstat -f Job_ID’ command at 10 s intervals.

Separate-detection mode and joint-detection mode

We divided the FASTQ files of the SW480 cell line into three equal subsets. Each subset was processed individually, and results from the three subsets were merged by summing the corresponding BSJ read counts (separate-detection mode). In the joint-detection mode, CIRI3 treats the three subsets as a single unified dataset and performs joint analysis in a single run, whereas other tools process the entire SW480 dataset directly without supporting joint analysis across subsets.

Enrichment analysis

We performed Kyoto Encyclopedia of Genes and Genomes and Gene Ontology enrichment analyses on the host genes of differentially expressed circRNAs and differentially spliced circRNAs using the R package clusterProfiler.

Robustness evaluation

To evaluate the robustness of circRNA detection tools, RNA-seq data from the Hs68 and SW480 cell lines were used for subsampling. Reads were randomly subsampled from the untreated dataset to generate subsets at 20%, 40%, 60% and 80% of the original sequencing depth. The detection tools were applied to each subset and the number of enriched, unaffected and depleted circRNAs was determined as described above. Each subsampling was repeated three times to calculate the mean and standard deviation.

Construction of a deep learning model for disease classification based on circRNA features

We constructed a five-layer fully connected deep neural network to classify tumor and normal samples based on circRNA BSJ ratios. All samples except those from colon tissue were randomly split into training, validation and test sets in a 3:1:1 ratio. All colon-derived samples were held out as an independent test set to evaluate the model’s generalizability to unseen tissue types.

The input features consisted of differentially spliced circRNAs that were selected according to the following criteria: (1) expressed in at least 10% of normal samples, with an average BSJ ratio in normal samples at least twofold higher than in tumor samples, (2) expressed in at least 10% of tumor samples but in less than 10% of normal samples or (3) expressed in at least 10% of disease samples with an average BSJ ratio at least 1.5-fold higher than in the corresponding normal samples. A total of 9,631 circRNAs were selected as input features. The model was trained on the training set and evaluated on the validation set, the test set and the independent colon dataset.

During transfer learning, the parameters of the first two layers of the neural network were fixed and only the last three layers were fine-tuned to improve the model’s adaptability to new datasets. Specifically, colon samples from multiple studies were divided so that one study served as the test set and the rest served as the training set.

Construction of LightGBM classifiers based on differentially spliced circRNAs

We used circRNAs as biomarkers to construct a multilevel classification system for stratifying cancer samples according to their system, tissue and disease origin. At each hierarchical level (that is, system, tissue and disease), we performed differential analysis to identify circRNAs with high specificity, which were used as candidate features for subsequent classification. Specifically, a one-versus-rest differential expression strategy was applied at each level: for each system (or tissue or disease), we compared all samples belonging to the target category against samples from all other categories at the same level and identified circRNAs that were significantly differentially spliced in each comparison. The union of all differentially spliced circRNAs was defined as the final feature set for each category. Using these features, we trained three separate LightGBM classification models for predicting system, tissue and disease labels, respectively. The input of the model consisted of the BSJ ratio values for the selected circRNAs and the output was the predicted category label (system, tissue or disease). The data were split into training and test sets, with 80% used for training and 20% for testing.

circRNA database construction (CIRIonco)

We systematically searched and collected RNA-seq datasets from the GEO database using keywords such as ‘total RNA’ and ‘ribo-zero’, focusing on studies that employed total RNA library preparation strategies. circRNAs were identified from these datasets using the BWA aligner and the CIRI3 algorithm. Differentially spliced circRNAs were subsequently used to train a deep neural network and to construct a hierarchical stratification tree. The resulting circRNA resource was organized into CIRIonco (https://ngdc.cncb.ac.cn/cirionco), an online database system implemented using the Django framework. CIRIonco provides a global overview and visualization of circRNA profiles and supports precise querying of biomarkers that distinguish different systems, tissues and diseases. In addition, it presents detailed information on the BSJ junction ratios of each circRNA across disease and normal samples in different tissues. Metadata and accession numbers are provided in Supplementary Table 3. The database is freely accessible and no login is required.

Simulated data

Simulated RNA-seq datasets were generated using CIRI simulator17. The inputs are human reference genome and gene annotations from GENCODE Release 43 (GRCh38.p13). To enable a fair comparison across methods, read length was fixed at 100 bp, while insert sizes were drawn from a mixture of two normal distributions (N(320, 70) and N(550, 70)). Sequencing coverage for both linear transcripts and circRNAs was varied across 20, 40, 60, 80 and 100 in separate datasets. All other parameters were kept at their default settings.

Real data

Publicly available RNA-seq datasets were downloaded from the SRA database (SRR444975, SRR445016, GSE162152 and GSE138734). These datasets contain RNase R-treated and untreated libraries. Data of Hs68 cell line (SRR444975 and SRR445016) were used to assess accuracy and sensitivity of typical circRNAs identification by each software. The accuracy of intronic self-ligated circRNAs identification by CIRI3 was evaluated using the GSE162152 dataset, which encompassed testis samples from five different species, including human, mouse, rat, rhesus and opossum. The reference genome for human (GRCh38.p13) and mouse (GRCm38.p6) were downloaded from GENCODE database. The reference genomes for rat (Rnor_6.0) and rhesus (Mmul_10) were downloaded from Ensembl. The reference genome for opossum (mMonDom1.pri) was downloaded from the NCBI Genome database. Assessment of circRNA quantification used the ribosomal R-treated SW480 cell line (SRR17235468), NCI-H23 cell line (SRR17235469) and HLF cell line (SRR17235470) RNA-seq data.

In addition, several RNA-seq datasets of human brain, testis, liver and spleen tissues were used for batch effect analysis (for accession numbers, see Supplementary Table 2) and RNA-seq datasets of tumor and adjacent normal liver samples from 44 HCC patients were used for DE analysis (GSE128274, GSE169289, GSE216613 and GSE77276). Further datasets from multiple cancer types and adjacent normal tissues were used for training the deep neural network and constructing the stratification tree (for accession numbers, see Supplementary Table 3).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.