NCBI Orthologs: Public Resource and Scalable Method for Computing High-Precision Orthologs Across Eukaryotic Genomes


Abstract

Orthologs are fundamental for enabling comparative genomics analyses that further our understanding of eukaryotic biology. The unprecedented increase in the availability of high-quality eukaryotic genomes necessitates scalable and accurate methods for orthology inference. The National Center for Biotechnology Information (NCBI) developed “NCBI Orthologs”, a resource and a computational pipeline designed to meet this challenge within the NCBI RefSeq framework. This system integrates protein similarity, nucleotide alignment, and microsynteny to achieve high-precision ortholog assignments across diverse eukaryotes. The pipeline leverages high-quality RefSeq annotations and processes genomes individually, ensuring scalability. Resulting ortholog data, organized into gene-level anchored sets, enables propagation of functional annotation information and facilitates comparative genomics. Critically, these data are integrated into the NCBI Gene resource, providing users with access from various entry points. The NCBI Datasets resource provides an intuitive interface to explore orthologous relationships on the web and allows bulk data download via the web, command-line tools, and an API. We detail the methodology, including anchor species selection and the decision tree used to arrive at high-confidence one-to-one orthology relationships. NCBI Orthologs is a valuable resource for facilitating functional annotation efforts and enhancing our understanding of eukaryotic gene evolution.

Introduction

Community initiatives, coupled with advanced sequencing technologies and assembly algorithms, have led to a rapid expansion of high-quality, complete genomes across the eukaryotic tree of life (Rhie et al. 2021; Darwin Tree of Life Project 2022). The National Institutes of Health (NIH) Comparative Genomics Resource (CGR) project, spearheaded by the National Center for Biotechnology Information (NCBI) at the National Library of Medicine (NLM), has developed an NCBI genomics toolkit comprised of high-quality data and high-performance tools to maximize the utility and impact of these data across many fields of biology (Bornstein et al. 2023). A central component of this toolkit is uniformly high-quality annotation of eukaryotic genomes generated by the Eukaryotic Genome Annotation Pipeline (EGAP) and a network of orthologous genes that connect annotated genes from diverse organisms.

Gene function information, as determined through experimental and genetic analyses, is predominantly available for a limited set of established model organisms (Alliance of Genome Resources Consortium 2024) with significant community interest (Takeda and Shimada 2010; Klingler and Bucher 2022; Kocher and Kingwell 2024). For the vast majority of non-model species, gene functions are initially inferred based on their similarity to genes in model organisms. For example, information on gene functions and nomenclature provided by the research community for a model organism can be propagated to related non-model species within a clade based on orthologs (Gabaldon and Koonin 2013). This approach serves as a foundational starting point for exploring gene functions in non-model species.

Orthologs are genes that originated from a shared ancestor through a speciation event (Fitch 2000). Genomes from a pair of closely related species can be assumed to contain a set of one-to-one corresponding orthologs that share similar functions (Gabaldon and Koonin 2013). Orthologs that are conserved across a clade can serve as a resource to evaluate the completeness of the annotated gene set in a genome assembly (Manni et al. 2021; Nevers et al. 2025; Prieto-Banos et al. 2025). Orthologs organized as clustered groups and gene trees are essential for tracing the evolutionary trajectories of gene functions and their regulatory mechanisms. Through comparative analyses and phylogenetic profiling, researchers can gain insights into the evolutionary dynamics of gene functions, elucidating how they have adapted and diversified over time (Emms and Kelly 2019; Thomas et al. 2022; Majidian et al. 2025). This understanding not only informs evolutionary biology but also functional genomics and the study of related genes across species.

The accurate inference of orthologous relationships is a significant challenge. Various biological processes, including gene gain, loss, and duplication, introduce substantial complexities in the underlying data (Force et al. 1999). Furthermore, current computational methods for orthology inference, which largely rely on sequence comparisons, often struggle to scale effectively with the increasing volume of high-quality genomic data. While resources like Ensembl Compara (Herrero et al. 2016), OrthoMCL-DB (Fischer et al. 2011), OrthoDB (Tegenfeldt et al. 2025), PANTHER (Thomas et al. 2022), and OMA (Altenhoff et al. 2024a, b) provide valuable orthology data, many are relatively static, lagging behind the continuous updates to genome assemblies and gene model annotations driven by new evidence and expert curation. Moreover, existing repositories can lack the comprehensive integration with other relevant resources needed for thorough comparative analyses.

Most of the current methods of inferring orthology rely solely on protein sequences (Langschied et al. 2024), neglecting valuable evolutionary signals present at the genomic level, such as microsynteny of neighboring genes (Zhao and Schranz 2019; Lovell et al. 2022) and adjacent genome sequences (Kirilenko et al. 2023). This is often due to the lack of standardized genome annotation consistently available across all species under consideration. The issue is further exacerbated by the computational demands of incorporating such data, particularly as the genome assemblies and annotations are frequently updated. For over two decades, the RefSeq project at NCBI has provided evidence-based, iteratively refined genome annotations supported by expert curation for a wide range of eukaryotes (Goldfarb et al. 2025). The scope of RefSeq continues to expand to represent additional clades, including both model and non-model species. The RefSeq data, encompassing genomic, transcript, and protein sequences, along with structural annotations anchored by stable, unique identifiers, are disseminated across multiple NCBI resources. RefSeq’s comprehensive data spanning a broad taxonomic range establishes an ideal foundation for a repository of orthologous gene sets.

In this paper, we describe an ortholog calculation method that utilizes RefSeq gene model annotations to establish precise orthologous relationships. Our method first compares all protein-coding gene models between two RefSeq eukaryote genomes to identify homologous gene pairs. In a subsequent step, we compute similarities in exonic nucleotide sequences and combine them with microsynteny information to identify the best possible one-to-one ortholog pair. This multilevel approach, which integrates sequence similarity at the protein and nucleotide levels with microsynteny, improves the resolution of ortholog calls, particularly among closely related paralogs. With the rapid expansion of eukaryote genomes (Rhie et al. 2021; Darwin Tree of Life Project 2022; Goldfarb et al. 2025), our approach provides a scalable solution to propagate information, such as gene nomenclature, from model species to all RefSeq eukaryote genomes with high precision and organizes RefSeq genes in ortholog sets. Orthologs calculated for RefSeq genomes also play key roles in propagating functional and structural annotation. Comparative analyses based on orthologs can further improve gene models through curation efforts, which, in turn, improves ortholog inference.

Methods

Metrics for Genome-Based Ortholog Calculation

To compute orthologs between a pair of genomes, designated as “query” and “subject”, the NCBI Orthologs pipeline utilized the following sets of input data for each genome: complete genome sequences, annotation information for all protein-coding genes, and the corresponding protein sequences (Fig. 1a). We considered a set of homologous gene pairs where the protein similarity scores (see below) were within 20% of the best score for either the query or subject gene. Our algorithm evaluated each candidate homologous gene pair between the query and subject genomes in the context of competing pairs. Each homolog gene pair (X, Y) was compared to all homolog pairs including either of the X or Y loci. This contextual evaluation was crucial, as true orthologs typically outperformed paralogous relationships across multiple metrics simultaneously. At every stage, we imposed specific thresholds on the different metrics computed by the pipeline. These thresholds were empirically determined based on our experience in analyzing data of this nature during the development of EGAP and further refined as needed by reviewing data from multiple ortholog runs.
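The candidate selection step can be illustrated with a minimal sketch. It assumes that "within 20% of the best score" means a score of at least 80% of the best score observed for the query gene or for the subject gene; the function name, the input layout, and the gene labels are hypothetical and are not taken from the pipeline.

```python
from collections import defaultdict


def candidate_pairs(scores, tolerance=0.20):
    """Keep homolog pairs scoring within `tolerance` of the best score.

    scores: dict mapping (query_gene, subject_gene) -> protein similarity.
    A pair is kept if it is close to the best score of its query gene
    OR close to the best score of its subject gene (assumed reading).
    """
    best_for_query = defaultdict(float)
    best_for_subject = defaultdict(float)
    for (q, s), score in scores.items():
        best_for_query[q] = max(best_for_query[q], score)
        best_for_subject[s] = max(best_for_subject[s], score)

    kept = set()
    for (q, s), score in scores.items():
        near_query_best = score >= (1 - tolerance) * best_for_query[q]
        near_subject_best = score >= (1 - tolerance) * best_for_subject[s]
        if near_query_best or near_subject_best:
            kept.add((q, s))
    return kept


# Hypothetical scores: qA-s3 is far from the best score of both qA and s3.
example_scores = {
    ("qA", "s1"): 0.90,
    ("qA", "s2"): 0.75,
    ("qA", "s3"): 0.40,
    ("qB", "s3"): 0.60,
}
print(sorted(candidate_pairs(example_scores)))
```

Pairs retained this way form the set of competing pairs against which each candidate (X, Y) is later evaluated.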

For each candidate pair, we extracted and evaluated the following metrics (Fig. 1b–d, Supplementary Fig. 1):

(1) Protein sequence similarity: Protein sequences from the query and subject genomes were compared in an all-versus-all manner using DIAMOND (‘--very-sensitive’ mode, reporting BLASTP-like alignment scores) (Buchfink et al. 2021). An earlier version of the pipeline used BLASTP (‘-evalue 0.001 -word_size 6 -threshold 21’) (Altschul et al. 1997), but we have since switched to DIAMOND due to its superior performance while producing nearly identical results. Moving forward, the pipeline will use DIAMOND, though some public data were generated with the BLASTP pipeline. For each pair of query and subject genes, we selected the protein isoform pair exhibiting the highest protein similarity score. Raw alignment scores produced by BLASTP and DIAMOND do not account for variations in protein length, which makes it difficult to compare scores directly between different protein pairs: a high score might simply reflect longer proteins rather than genuinely higher similarity. To overcome this, we adapted the traditional Jaccard index to suit the continuous nature of alignment scores. Our modified Jaccard index, calculated as “BLASTP_alignment_score / [sum_of_BLASTP_self_alignment_scores - BLASTP_alignment_score]”, normalizes the alignment score against the potential maximum similarity for that pair, allowing comparison across different combinations of protein isoforms for a given pair of query and subject genes.
(2) Nucleotide-level conservation: For a given homologous gene pair represented by the best-scoring alignment between one protein isoform from each gene, we extracted sequences of all annotated exons, including untranslated regions, and concatenated them. These concatenated exonic sequences were then extended to include an additional 2 kb of flanking exonic sequence from adjacent genes on both the 5ʹ and 3ʹ ends. The resulting sequences were then aligned to each other using discontiguous-megablast (McGinnis and Madden 2004). Similar to the protein sequence comparison, a modified Jaccard index score was computed using the formula: “aligned_length / [sum_of_sequence_lengths_compared - aligned_length]”.
(3) Microsynteny conservation: Microsynteny was assessed by scoring the number of homologous gene pairs within a 20-locus window, encompassing at most 10 adjacent loci on either side of the gene pair under consideration.
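The three metrics can be sketched as follows. The two Jaccard functions follow the formulas quoted above. The microsynteny function is only one possible reading of the window description: it counts homologous pairs whose members both lie within 10 loci of the candidate genes. All function names, argument layouts, and gene labels are illustrative assumptions.

```python
def protein_jaccard(pair_score, self_score_query, self_score_subject):
    """Modified Jaccard index on alignment scores (formula from the text)."""
    return pair_score / (self_score_query + self_score_subject - pair_score)


def nucleotide_jaccard(aligned_length, length_query, length_subject):
    """Modified Jaccard index on aligned nucleotide length."""
    return aligned_length / (length_query + length_subject - aligned_length)


def neighbours(gene_order, gene, flank=10):
    """Loci within `flank` positions on either side of `gene`."""
    i = gene_order.index(gene)
    return set(gene_order[max(0, i - flank):i] + gene_order[i + 1:i + 1 + flank])


def microsynteny_score(query_gene, subject_gene, query_order, subject_order,
                       homologs, flank=10):
    """Count homologous pairs located in both flanking windows (assumed reading)."""
    q_window = neighbours(query_order, query_gene, flank)
    s_window = neighbours(subject_order, subject_gene, flank)
    return sum(1 for q, s in homologs if q in q_window and s in s_window)


# Hypothetical example: two identical proteins give an index of 1.0.
print(protein_jaccard(600, 600, 600))
print(nucleotide_jaccard(3000, 4000, 5000))

query_order = ["q1", "q2", "q3", "q4", "q5"]
subject_order = ["s1", "s2", "s3", "s4", "s5"]
homologs = {("q1", "s1"), ("q2", "s2"), ("q3", "s3"), ("q5", "s9")}
print(microsynteny_score("q3", "s3", query_order, subject_order, homologs))
```

Because the index divides by the combined self-scores minus the shared score, it is bounded by 1 for identical sequences and falls as either sequence gains unaligned material, which is what makes scores comparable across isoform pairs of different lengths.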

Selecting Orthologs Among Competing Homolog Pairs

Orthologs were identified by examining all homologous gene pairs involving the query and subject genes using an algorithm (detailed below and represented by pseudocode in Supplementary Fig. 1a) that relies on the metrics computed in the preceding section.

When the microsynteny score was non-zero for the candidate pair, the algorithm called the pair orthologs if any of these conditions were met:

The candidate pair had a non-zero microsynteny score and a protein similarity score greater than or equal to competing homolog pairs that had no microsynteny support.
The candidate pair had a microsynteny score of at least 2, while competing pairs had no microsynteny support.
The candidate pair’s microsynteny score exceeded that of competing pairs, its nucleotide alignment score was greater than or equal to any of the competing pairs, and its protein similarity exceeded all other competing homolog pairs by at least 5% and scored the highest for either the query or subject.
The candidate pair’s microsynteny score exceeded that of competing pairs by at least 2, and its nucleotide alignment score was greater than or equal to any of the competing homolog pairs.
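The four conditions above can be written as a single predicate. This is a sketch of one reading of the rules, not the pipeline's code (the authors' pseudocode is in Supplementary Fig. 1a). It assumes that "by at least 5%" is a relative margin, that conditions 1 and 2 apply only when no competing pair has microsynteny support, and that `competitors` holds every homolog pair sharing either locus with the candidate. Under that last assumption, exceeding all competitors in protein similarity already implies the highest score for the query or subject. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairMetrics:
    synteny: int        # microsynteny score
    protein: float      # modified Jaccard index on protein alignment scores
    nucleotide: float   # modified Jaccard index on nucleotide alignment


def is_ortholog_with_synteny(cand, competitors, margin=0.05):
    """Apply the four 'non-zero microsynteny' conditions (assumed reading)."""
    if cand.synteny == 0:
        return False
    no_competing_synteny = all(c.synteny == 0 for c in competitors)
    nucleotide_ok = all(cand.nucleotide >= c.nucleotide for c in competitors)

    rule1 = no_competing_synteny and all(
        cand.protein >= c.protein for c in competitors)
    rule2 = no_competing_synteny and cand.synteny >= 2
    rule3 = (all(cand.synteny > c.synteny for c in competitors)
             and nucleotide_ok
             and all(cand.protein >= (1 + margin) * c.protein
                     for c in competitors))
    rule4 = (all(cand.synteny >= c.synteny + 2 for c in competitors)
             and nucleotide_ok)
    return rule1 or rule2 or rule3 or rule4


# Hypothetical case: strong synteny outweighs a slightly better-scoring
# competitor that has no syntenic support (condition 2).
candidate = PairMetrics(synteny=6, protein=0.80, nucleotide=0.70)
rival = PairMetrics(synteny=0, protein=0.85, nucleotide=0.75)
print(is_ortholog_with_synteny(candidate, [rival]))
```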

For candidate pairs with a microsynteny score of zero, stricter criteria were applied, requiring all of the following:

No competing pair had microsynteny support.
The protein alignment covered more than 50% of the longer protein and more than 90% of the shorter protein.
Both protein similarity and nucleotide sequence alignment scores exceeded those of competing pairs by at least 5%.
The protein similarity score was the highest for either the query or subject.
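The stricter path for candidates without microsynteny support can be sketched in the same way. As before, this is an assumed reading: the 5% margin is treated as relative, coverage values are fractions between 0 and 1, and `competitors` is taken to contain every homolog pair sharing a locus with the candidate, so the "highest for either the query or subject" condition follows from the margin test. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    synteny: int
    protein: float
    nucleotide: float
    coverage_longer: float = 0.0    # fraction of the longer protein aligned
    coverage_shorter: float = 0.0   # fraction of the shorter protein aligned


def is_ortholog_without_synteny(cand, competitors, margin=0.05):
    """All criteria must hold for a candidate whose microsynteny score is zero."""
    if cand.synteny != 0:
        return False
    if any(c.synteny > 0 for c in competitors):
        return False
    if not (cand.coverage_longer > 0.50 and cand.coverage_shorter > 0.90):
        return False
    return all(
        cand.protein >= (1 + margin) * c.protein
        and cand.nucleotide >= (1 + margin) * c.nucleotide
        for c in competitors
    )


# Hypothetical case: well-covered alignment that clearly beats its only rival.
candidate = Candidate(0, 0.90, 0.80, coverage_longer=0.60, coverage_shorter=0.95)
rival = Candidate(0, 0.80, 0.70)
print(is_ortholog_without_synteny(candidate, [rival]))
```

Requiring every criterion at once is what makes this branch conservative: a single syntenic competitor, or a near-tie in either score, leaves the genes without an ortholog call.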

This dual-strategy approach enabled the algorithm to identify orthologs with high precision even in complex genomic contexts with multiple paralogs. The microsynteny component was particularly valuable for resolving cases where sequence-based metrics alone would be inconclusive, while the stringent criteria for cases without microsynteny ensured that only well-supported orthology assignments were accepted. See Supplementary Fig. 1b–d for examples of how orthologs were determined from the protein similarity, nucleotide sequence alignment, and microsynteny scores.

Evaluation of NCBI Orthologs

To assess the performance of the NCBI Orthologs pipeline, we utilized the Quest for Orthologs (QfO) Orthology Benchmarking Service (https://orthology.benchmarkservice.org) (Nevers et al. 2022; Altenhoff et al. 2024a, b). The orthology research community established common QfO reference proteome datasets to facilitate comparative evaluations of different methodologies. Given that the “2020_04” dataset allows for the broadest comparison with existing methods, we selected this dataset as our benchmark. The QfO datasets consist of reference proteomes sourced from UniProt, which inherently lack the genome annotation information required by the NCBI Orthologs pipeline. Recognizing that a significant portion of these proteomes are derived from Ensembl annotations, we retrieved corresponding annotation data from Ensembl (Release 100) and integrated it with the proteomes. We then calculated orthologs for twelve vertebrates, using human, mouse and zebrafish as the anchor species, and for three arthropods using fruit fly as an anchor (Supplementary Table 1). To ensure compatibility, we mapped Ensembl gene identifiers to UniProtKB identifiers used in the QfO datasets. Genes that were orthologous to the same gene in the anchor genome were transitively considered as orthologs and added to the input submitted for QfO tests. We used the JSON reports of the QfO evaluation results to recreate plots in Supplementary Fig. 2. For public methods with multiple entries, we showed the best-performing entry as the representative (e.g. “sonicparanoid-mostsensitive” for SonicParanoid entries). For the SwissTree challenge, we examined the raw results to identify false positive (FP) calls (see the note in Supplementary Data 1).

Results

Precise Calculation of Orthologs Anchored to Genes and Genomes at NCBI

Defining the taxonomic scope and the purpose for inferring orthologs is a crucial first step in developing an orthologs pipeline (Gabaldon and Koonin 2013). One of the important objectives of the RefSeq eukaryotic genome annotation process is to provide informative and consistent gene nomenclature. Using the network of orthologs, the RefSeq team aimed to leverage gene names from well-studied model organisms. For instance, organizations such as the HUGO Gene Nomenclature Committee (HGNC) (Seal et al. 2023) and FlyBase (Ozturk-Colak et al. 2024) have invested considerable effort in assigning informative gene names with input from the scientific community. The approach adopted by RefSeq imposes the following constraints on the data model: (1) the selection of an appropriate anchor species that has well-named genes, (2) a strict expectation of one-to-one orthologous relationships to ensure unambiguous gene naming, and (3) a stringent threshold for ortholog assignment to minimize the risk of mispairing paralogs between species—a strategy adopted to favor high-confidence calls, potentially at the expense of identifying all possible orthologs.

Gene duplication events frequently create paralogs within genomes, complicating the identification of true orthologous relationships between species. When a query genome contains N paralogs and a subject genome contains M paralogs, potentially N × M homologous pairs exist, though typically only a subset represents true one-to-one orthologs. To identify these orthologous pairs with high precision, we developed a multi-faceted algorithm that integrates multiple lines of evidence. A schematic diagram of our process to identify orthologs between two genomes is depicted in Fig. 1 and described in the Methods section. It is important to note that the unambiguous identification of a one-to-one ortholog pair is not always feasible, for example, due to recent gene duplication events. In such scenarios, the algorithm does not identify orthologs for any of the involved genes (Supplementary Fig. 1d).

Evaluation of NCBI Orthologs

We evaluated the precision of our approach using the QfO challenge (Nevers et al. 2022). Considering our scope, we tested 1-to-1 best orthologs predicted between mouse and rat, human and other vertebrates, zebrafish and other fishes, and fruit fly and other arthropods included in the 2020 reference datasets (Supplementary Table 1). Prior to testing, the reference proteome datasets were mapped to genome annotations as detailed in the Methods section. We identified 193,307 ortholog pairs from direct comparison of the fifteen genome pairs. Genes that were orthologous to the same anchor gene were transitively inferred as orthologs to each other, adding 484,330 additional ortholog pairs to the input. In total, 677,637 ortholog pairs from thirteen vertebrate and four arthropod genomes were submitted for the evaluation using the QfO Orthology Benchmarking service.
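The transitive expansion described above can be illustrated with a short sketch: genes that share an anchor ortholog are paired with each other. The function name and gene labels are hypothetical; only the reported pair counts come from the text.

```python
from collections import defaultdict
from itertools import combinations


def transitive_pairs(direct_pairs):
    """Infer ortholog pairs between genes that share the same anchor gene.

    direct_pairs: iterable of (anchor_gene, non_anchor_gene) tuples.
    """
    members = defaultdict(set)
    for anchor_gene, gene in direct_pairs:
        members[anchor_gene].add(gene)

    inferred = set()
    for genes in members.values():
        inferred.update(combinations(sorted(genes), 2))
    return inferred


# Hypothetical direct calls against a human anchor.
direct = [
    ("HUMAN_G1", "MOUSE_G1"),
    ("HUMAN_G1", "RAT_G1"),
    ("HUMAN_G1", "DOG_G1"),
    ("HUMAN_G2", "MOUSE_G2"),
]
print(sorted(transitive_pairs(direct)))

# The totals reported in the text are consistent: direct + transitive pairs.
print(193_307 + 484_330)  # 677637
```

An anchor gene with k direct orthologs contributes k(k-1)/2 inferred pairs, which is why the transitive pairs outnumber the direct ones in the submitted set.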

QfO benchmark results showed lower recall for the NCBI Orthologs method than for other methods. We attribute this to the following: (1) our pipeline strictly returns only one-to-one ortholog pairs and avoids any ambiguous calls, (2) we calculated orthologs for all species versus a select set of anchor species (e.g., human, zebrafish, and fruit fly) followed by transitive inference between non-anchor species, and (3) we calculated orthologs for vertebrates and arthropods only. On the other hand, the precision scores for NCBI Orthologs were the best among all methods in the Gene Ontology (GO) and Enzyme Classification (EC) challenges offered by QfO (Supplementary Fig. 2a, b). This relatively high precision persisted even when the results of other methods were filtered to include only the same species pairs evaluated by our pipeline (Supplementary Fig. 2c, d). For the SwissTree challenge, where the precision was inferred as Positive Predictive Value (PPV) based on test results from multiple gene trees, the absence of data for gene trees outside of our defined taxonomic scope led to a low PPV. Despite this, an analysis of raw results revealed 537 true positive calls without any false positive calls across 15 gene trees that included taxa for which we submitted our predictions (Supplementary Data 1). Similarly, NCBI Orthologs made only one false positive call, caused by a chimeric gene model, alongside 21,730 true positives in the VGNC challenge (Supplementary Fig. 2e and Supplementary Data 1). We were unable to run the Generalized Species Tree Discordance test, likely due to the limited scope of species pairs included in our submitted predictions.

NCBI Orthologs for RefSeq Metazoan Genomes

While our pipeline is theoretically capable of computing orthologs between any two RefSeq genomes, we prioritized orthologs shared with a model organism anchor, selected based on the taxonomic relevance, to facilitate the primary purpose of propagating gene names. Given the extensive curation and community support provided by the HGNC (Seal et al. 2023) and the broader research community, the human genome (GRCh38, NCBI Homo sapiens Annotation Release GCF_000001405.40-RS_2024_08) was an obvious choice as an initial anchor. We designated it as the primary anchor for all vertebrate lineages.

The proportion of human orthologs among protein-coding genes in the query genome was affected by the distance from the human anchor as well as the gene content and genome duplication histories of the clade (Fig. 2a and b). Among primates, our pipeline identified between 15,196 and 17,869 orthologous gene pairs with human, covering between 75.5% and 85.6% of total protein-coding genes in each query genome. As we extended our analysis to more distant taxa within vertebrates, the absolute number of ortholog calls decreased (Fig. 2a). Despite the decrease in absolute counts, the proportion of protein-coding genes with identified orthologs remained substantial across various vertebrate clades (Fig. 2b). We reported orthologs for an average of 79.4% (± 4.5%) of protein-coding genes in mammals and birds. The average proportion of orthologs was reduced to 69.9% (± 5.3%) in reptiles, largely due to their genomes containing more protein-coding genes compared to those of birds with similar evolutionary distances from human (Supplementary Data 2). For amphibians, the proportion of orthologs among protein-coding genes was reduced further (60.7% ± 4.9%), mirroring the increasing distance from the human anchor (Fig. 2b and Supplementary Data 2).

The direct inference of orthologs between human and fish genomes presented a significant challenge. Due to the extensive history of gene duplications and ploidy events within fish lineages, both the absolute number and proportion of orthologs identified with human drastically decreased. For instance, the ray-finned fish clade includes a substantial number of RefSeq genomes (216 as of March 2025). The large distance from the human anchor (Fig. 3a) and ancient genome duplications (Taylor et al. 2003) resulted in low ortholog coverage among fishes (Fig. 2a and b). To mitigate this issue, we added the zebrafish reference genome (GRCz11, NCBI Danio rerio Annotation Release 106) as a transitive anchor to identify orthologs for all RefSeq genomes for ray-finned fish as well as sharks and relatives (Fig. 3b). When the zebrafish ortholog of a fish gene was also identified as an ortholog of a human gene, we considered the fish gene to be transitively an ortholog of the human gene (Fig. 3c, blue bar graph). Adding the zebrafish transitive anchor also identified orthologs for fish genes that lacked a human ortholog, substantially increasing the ortholog coverage among fish genomes (Fig. 3c, gray bar graph).

We observed a similar pattern among orthologs shared by arthropod RefSeq genomes and the fruit fly (Drosophila melanogaster) reference genome (FlyBase Release 6.54) (Fig. 2c and d). For the arthropod clade, which covered a broader space in the RefSeq Metazoa tree than the Chordata and vertebrate clade (Fig. 3a), we used fruit fly as the model anchor (Fig. 3c). As RefSeq arthropod genomes expanded, clades that included a significant number of species that are evolutionarily distant from fruit fly emerged. These included the insect orders Hymenoptera, Coleoptera, and Lepidoptera. Similar to zebrafish for the fish clades, we used Apis mellifera (honeybee), Tribolium castaneum (red flour beetle), and Bombyx mori (silkworm) as transitive anchors for their respective orders to identify clade-specific orthologs (Fig. 3c, gray bar graph) in addition to orthologs shared with the fruit fly anchor (Fig. 3c, blue bar graph). Adding a clade-specific transitive anchor resulted in improved ortholog detection supported by stronger microsynteny signals (Supplementary Table 2). The numbers of orthologs shared with the model anchors or clade-specific transitive anchors for all vertebrate and arthropod RefSeq genomes as of March 2025 are summarized in Supplementary Data 3.

The ortholog inference process is tightly integrated into EGAP (Goldfarb et al. 2025). Upon examination of ortholog counts and microsynteny scores across multiple clades using different anchors (Fig. 3 and Supplementary Table 2), we identified one vertebrate (zebrafish) and three arthropods (honeybee, silkworm, and red flour beetle) to serve as transitive anchors in addition to human and fruit fly, which serve as the primary anchors. EGAP automatically chooses an appropriate anchor based on the taxonomy of the genome being annotated. When a transitive anchor was chosen, orthologs between the query genome and the primary anchor were inferred transitively based on the orthologs between the primary anchor and the transitive anchor. While the introduction of transitive anchors was necessary for effectively propagating names among genes that are specific to a clade, we note that orthology, unlike homology, is not necessarily transitive (Fitch 2000; Altenhoff et al. 2019). We extensively tested and confirmed that the transitively inferred query-to-primary anchor orthologs agreed with those calculated directly between the query and the primary anchor, with few exceptions (< 10 per query genome), most of which involved highly duplicated gene families or sub-optimal query or transitive anchor gene models requiring manual curation.
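Taxonomy-based anchor selection can be pictured as a lookup along the lineage of the query genome, taking the most specific clade that has an anchor. The sketch below is an illustration only: the clade-to-anchor table is assembled from the anchors named in this paper (with Actinopterygii and Chondrichthyes standing for ray-finned fish and for sharks and relatives), while the function, the lineage representation, and the example lineages are assumptions rather than EGAP internals.

```python
# Clade -> anchor species, as described in the text (layout is hypothetical).
ANCHOR_BY_CLADE = {
    "Vertebrata": "Homo sapiens",
    "Actinopterygii": "Danio rerio",
    "Chondrichthyes": "Danio rerio",
    "Arthropoda": "Drosophila melanogaster",
    "Hymenoptera": "Apis mellifera",
    "Coleoptera": "Tribolium castaneum",
    "Lepidoptera": "Bombyx mori",
}


def choose_anchor(lineage):
    """Return the anchor for the most specific clade in `lineage`.

    lineage: list of clade names ordered from root to leaf.
    Returns None when no clade in the lineage has an anchor.
    """
    for clade in reversed(lineage):
        if clade in ANCHOR_BY_CLADE:
            return ANCHOR_BY_CLADE[clade]
    return None


# A bumblebee lineage resolves to honeybee rather than fruit fly because
# Hymenoptera is more specific than Arthropoda.
print(choose_anchor(["Metazoa", "Arthropoda", "Insecta", "Hymenoptera", "Bombus"]))
print(choose_anchor(["Metazoa", "Chordata", "Vertebrata", "Mammalia"]))
```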

Ortholog and Nomenclature Assignment

The primary output of the ortholog computation pipeline is a comprehensive table which enumerates all query-subject protein pairs that pass the initial all-versus-all protein alignment step, providing an exhaustive set of metrics including protein and nucleotide alignment statistics and the count of neighbors exhibiting microsynteny (Supplementary Fig. 1b–d). Notably, the table also includes a column indicating whether each protein pair has been identified as orthologous.

Internally, these tables are loaded into an SQL database and used for gene naming and reporting purposes. All gene pairs identified as orthologs are consolidated into ortholog sets. Each ortholog set is represented by the anchor GeneID and consists of GeneIDs for all genes identified as orthologs to that anchor gene. NCBI GeneIDs are stable, numerical identifiers that are trackable across different annotation versions even when the underlying sequence data or associated metadata (such as gene names, symbols, aliases, and descriptions) undergo changes (Goldfarb et al. 2025).

In cases involving transitive orthology, orthologs are computed solely between the query genome and its most specific anchor species. For example, when bumblebee is the query genome being annotated, honeybee is automatically selected as the anchor species. The output data are then loaded into the database, where transitive orthologs to fruit fly are inferred for each bumblebee gene that has a corresponding honeybee ortholog. Bumblebee genes for which fruit fly orthologs are reported are subsequently added to the fruit fly ortholog sets, whereas the remaining bumblebee genes for which only honeybee orthologs are identified, are added to the honeybee ortholog sets.
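The bumblebee example can be sketched as a two-step lookup: each query gene is placed in the set of the primary anchor gene when its transitive-anchor ortholog has one, and otherwise in the set of the transitive-anchor gene itself. The function name and gene labels are hypothetical, and the actual database logic may differ.

```python
from collections import defaultdict


def assign_to_sets(query_to_transitive, transitive_to_primary):
    """Group query genes into anchored ortholog sets.

    query_to_transitive: query gene -> transitive anchor gene (e.g. honeybee).
    transitive_to_primary: transitive anchor gene -> primary anchor gene
        (e.g. fruit fly), for those genes that have one.
    Returns a dict mapping the set's anchor gene -> set of member genes.
    """
    ortholog_sets = defaultdict(set)
    for query_gene, transitive_gene in query_to_transitive.items():
        # Fall back to the transitive anchor gene when no primary ortholog exists.
        anchor = transitive_to_primary.get(transitive_gene, transitive_gene)
        ortholog_sets[anchor].add(query_gene)
    return dict(ortholog_sets)


bumblebee_to_honeybee = {"bombus_g1": "apis_g1", "bombus_g2": "apis_g2"}
honeybee_to_fly = {"apis_g1": "dmel_g1"}
print(assign_to_sets(bumblebee_to_honeybee, honeybee_to_fly))
# bombus_g1 joins the fruit fly set dmel_g1; bombus_g2 joins the honeybee set apis_g2.
```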

In a final step, gene names are propagated from the ortholog anchor to all members of each ortholog set. This step is crucial, as it enables the assignment of informative gene names for many species, significantly enhancing the utility of RefSeq genome annotations. For instance, among the 33 Drosophila species annotated prior to 2023, the average number of protein-coding genes was 14,105. Our pipeline identified an average of 11,743 ortholog pairs per species. Initially, nearly all protein-coding genes were assigned placeholder symbols of the format LOC followed by the unique NCBI GeneID (for example, LOC108648959). However, leveraging orthology for gene symbol propagation significantly increased the number of genes with informative symbols to an average of 5,774 per species, with a maximum of 6,135. This effort, conducted in multiple phases with thorough expert curation at each step to ensure quality, has demonstrated the power of orthology in assigning meaningful nomenclature. Building on this success and increasing confidence in the assigned names, we have expanded this name propagation strategy to other insect species, resulting in over 1.38 million genes across nearly 350 insect species now possessing informative gene symbols.

Absent the orthologs pipeline, most genes in non-model organisms would lack informative names, as these organisms typically are not covered by dedicated nomenclature assignment groups. Nomenclature updates in anchor species are automatically propagated to all members of the ortholog set, and we have established mechanisms to prevent automatic updates for specific genes when deemed necessary. All gene names, as well as their historical versions which are included as aliases, are accessible in the NCBI Gene resource and across other NCBI products, such as Genome Data Viewer (Rangwala et al. 2021) and Comparative Genome Viewer (Rangwala et al. 2024), enabling users to browse and search using informative gene names.
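Name propagation with an opt-out for specific genes can be sketched as below. The set layout, the `locked` argument, and the member GeneIDs (900001, 900002) are hypothetical; only the anchor GeneID 31194 with symbol "pn" and the LOC placeholder format are taken from the text. The sketch also assumes that anchors still carrying a placeholder symbol are skipped.

```python
import re

# Placeholder symbols have the form LOC followed by the numeric GeneID.
PLACEHOLDER = re.compile(r"^LOC\d+$")


def propagate_symbols(ortholog_sets, anchor_symbols, current_symbols,
                      locked=frozenset()):
    """Copy each anchor's symbol to the members of its ortholog set.

    Genes listed in `locked` keep their current symbol, mimicking a
    mechanism that blocks automatic updates for specific genes.
    """
    updated = dict(current_symbols)
    for anchor_id, members in ortholog_sets.items():
        symbol = anchor_symbols.get(anchor_id)
        if symbol is None or PLACEHOLDER.match(symbol):
            continue  # nothing informative to propagate
        for gene_id in members:
            if gene_id not in locked:
                updated[gene_id] = symbol
    return updated


sets_by_anchor = {31194: {900001, 900002}}
anchor_symbols = {31194: "pn"}
current = {900001: "LOC900001", 900002: "curated-name"}
print(propagate_symbols(sets_by_anchor, anchor_symbols, current, locked={900002}))
```

Rerunning the same function after an anchor symbol changes would update every unlocked member, which corresponds to the automatic propagation of nomenclature updates described above.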

In addition to nomenclature propagation, ortholog data are useful in identifying structural annotation issues. For instance, we systematically query ortholog sets to detect cases where orthologs for one or only a few species are missing. Such absences can stem from various reasons, including species-specific gene loss, gene duplication events leading to our pipeline’s failure to identify a single best ortholog, or an underlying issue in structural annotation. The RefSeq team of expert curators examines these flagged cases and takes appropriate corrective actions. One example demonstrating this workflow is the “prune” gene (Supplementary Fig. 3). After observing that the honeybee prune gene was unexpectedly missing from the fruit fly ortholog set (with anchor GeneID 31194), we investigated further. The full results table produced by the pipeline showed an alignment between honeybee gene LOC725150 (GeneID 725150) and the fruit fly gene “pn” (GeneID 31194). Upon reviewing additional data such as RNA-seq coverage, it became clear that the annotation of LOC725150 was indeed a chimera of two adjacent genes. To rectify this, we suppressed the existing transcript (XM_006568615.3) which encoded the chimeric protein (XP_006568678.1), created NM_001434528.1 to represent LOC725150, and created a new “Pn” gene (GeneID 138447619) with transcript NM_001434529.1 as depicted in Supplementary Fig. 3.
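A query of this kind can be sketched as a scan for ortholog sets that are present in most species but absent from one or a few. The function name, the `max_missing` threshold, and the example data are illustrative assumptions. The internal SQL queries are not described in the text.

```python
def flag_sparse_absences(ortholog_sets, gene_species, all_species, max_missing=2):
    """Report ortholog sets missing members from only a few species.

    ortholog_sets: anchor gene -> set of member genes.
    gene_species: member gene -> species name.
    Returns anchor gene -> set of species lacking an ortholog, restricted
    to sets with between 1 and `max_missing` absent species.
    """
    flagged = {}
    for anchor, members in ortholog_sets.items():
        present = {gene_species[gene] for gene in members}
        missing = set(all_species) - present
        if 0 < len(missing) <= max_missing:
            flagged[anchor] = missing
    return flagged


species = ["sp1", "sp2", "sp3", "sp4"]
gene_species = {"a1": "sp1", "a2": "sp2", "a3": "sp3", "a4": "sp4",
                "b1": "sp1", "b2": "sp2", "b4": "sp4", "c1": "sp1"}
sets_by_anchor = {
    "anchorA": {"a1", "a2", "a3", "a4"},  # complete
    "anchorB": {"b1", "b2", "b4"},        # sp3 missing -> worth a look
    "anchorC": {"c1"},                    # mostly absent -> not flagged
}
print(flag_sparse_absences(sets_by_anchor, gene_species, species, max_missing=1))
```

A flagged set is only a lead for curators: the absence may reflect genuine gene loss, an ambiguous duplication, or an annotation problem such as the chimeric model in the prune example.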

NCBI Orthologs for RefSeq Protozoan Human Pathogens

NCBI employs the Eukaryotic Annotation Propagation Pipeline to propagate structural annotations submitted by users to RefSeq genomes of selected protozoa, fungi and a small number of other eukaryotic organisms of significant interest to the research community (Goldfarb et al. 2025). Protozoa, within NCBI, are informally defined as eukaryotic organisms excluding metazoa, viridiplantae, and fungi. At the time of this study, the RefSeq collection comprised 129 protozoan species, including several medically significant organisms implicated in human diseases such as malaria, African sleeping sickness, and Chagas disease. To facilitate comparative genomics analyses of these organisms, we employed our pipeline to identify orthologous gene pairs among them.

For the initial set of computations, we selected species from the genera Plasmodium, Trypanosoma, Leishmania, and Toxoplasma. These genera collectively represented nearly half of the protozoan species within the RefSeq collection at the time, are all medically relevant, and have substantial support from the research community, exemplified by resources such as VEuPathDB (Alvarez-Jarreta et al. 2024). Direct ortholog inference using a single anchor species proved impractical due to the large taxonomic distance between these genera and the limited number of ortholog pairs that could be identified. Consequently, we selected Plasmodium falciparum 3D7 (Fig. 4a), Trypanosoma brucei brucei TREU927, Toxoplasma gondii ME49, and Leishmania major strain Friedlin as individual anchor species for ortholog computation. The complete list of species analyzed and the corresponding number of identified ortholog pairs are presented in Supplementary Data 4. Our process identified an average of 5975 (N = 8), 7311 (N = 12), 4572 (N = 17), and 3266 (N = 8) ortholog pairs for the anchors T. brucei, L. major, P. falciparum, and T. gondii, respectively.

For Plasmodium and Leishmania, our pipeline identified orthologs for over 94% of the protein-coding genes in their respective anchor species. This high proportion is likely because we calculated orthologs for species within the same genus or subfamily as the anchors. Indeed, we were able to identify orthologs for 64.6% of P. falciparum and 69.6% of L. major protein-coding genes across all 17 and 12 species, respectively. On the other hand, only 26.3% of T. brucei protein-coding genes had orthologs across all 8 Trypanosoma species. Among the Trypanosoma genomes analyzed, the T. cruzi genome (GCF_000209065.1) notably featured an unusually large size (~90 Mbp) and protein-coding gene count (19,607) compared to the others (mean ~23.8 Mbp and 9,768 genes). We hypothesize that the T. cruzi genome represents an unresolved diploid or a hybrid, leading to a larger proportion of highly similar paralogs (El-Sayed et al. 2005; Weatherly et al. 2009). Our algorithm, which avoids making ortholog calls when an unambiguous pair cannot be identified (as exemplified in Supplementary Fig. 1d), consequently identified fewer orthologs for T. cruzi. Excluding T. cruzi, the proportion of anchor genes with orthologs in all Trypanosoma species increased to 54.9%. Among the protozoan parasites tested, Toxoplasma gondii served as the anchor for the most diverse group of species. T. gondii shared the highest number of orthologs with its two closest species, Neospora caninum and Besnoitia besnoiti (Supplementary Data 4).
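The "orthologs across all species" proportions above are intersections over per-species ortholog calls, so a single species with few calls can lower the shared fraction sharply. The following Python sketch shows that arithmetic and a leave-one-out check on toy data. The gene and species labels are invented and the function names are assumptions. It does not reproduce the values reported in the text.

```python
from typing import Dict, Set


def fraction_shared(anchor_genes: Set[str],
                    orthologs_by_species: Dict[str, Set[str]]) -> float:
    """Fraction of anchor genes that have an ortholog in every query species."""
    if not anchor_genes:
        raise ValueError("anchor_genes must not be empty")
    shared = set(anchor_genes)
    for anchor_genes_with_ortholog in orthologs_by_species.values():
        shared &= anchor_genes_with_ortholog
    return len(shared) / len(anchor_genes)


def leave_one_out(anchor_genes: Set[str],
                  orthologs_by_species: Dict[str, Set[str]]) -> Dict[str, float]:
    """Shared fraction recomputed with each species excluded in turn."""
    return {
        excluded: fraction_shared(
            anchor_genes,
            {sp: genes for sp, genes in orthologs_by_species.items() if sp != excluded},
        )
        for excluded in orthologs_by_species
    }


# Toy data: ten anchor genes and three query species, one of which has few calls.
anchor = {f"g{i}" for i in range(1, 11)}
calls = {
    "species_A": {f"g{i}" for i in range(1, 10)},  # orthologs for g1..g9
    "species_B": {f"g{i}" for i in range(1, 9)},   # orthologs for g1..g8
    "species_C": {f"g{i}" for i in range(1, 4)},   # orthologs for g1..g3 only
}

all_species = fraction_shared(anchor, calls)
without_each = leave_one_out(anchor, calls)
print(all_species, without_each)
```

Here the shared fraction rises from 3 of 10 genes to 8 of 10 once species_C is excluded, the same kind of shift the text reports when T. cruzi is left out.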

NCBI Orthologs for Selected RefSeq Fungi

Similar to protozoa, the RefSeq collection of Fungi relies on high-quality genomes with user-submitted gene model annotations available in GenBank. The genomes of organisms with significant research interest are selected based on criteria described in (Alvarez-Jarreta et al. 2024; Goldfarb et al. 2025) and processed through the Eukaryotic Annotation Propagation Pipeline to generate RefSeq genome assemblies with gene model annotations in a standardized format. While ortholog computation has not yet been integrated into this pipeline, the value of orthologous relationships in enhancing gene understanding is undeniable. The availability of orthologous relationships to genes from multiple distinct genomes significantly increases the likelihood of assigning informative protein names.

Consequently, we computed orthologous relationships for 370 fungal species, using the following taxa as comparison anchors within their respective taxonomic classes or orders: Aspergillus fumigatus strain Af293 assembly GCF_000002655.1 (Eurotiomycetes, 159 taxa), Saccharomyces cerevisiae strain S288C assembly GCF_000146045.2 (Saccharomycetes, 31 taxa), Candida albicans strain SC5314 assembly GCF_000182965.3 (Serinales, 35 taxa), Fusarium oxysporum strain Fo47 assembly GCF_013085055.1 (Sordariomycetes, 133 taxa), and Rhizopus microsporus strain ATCC 52813 assembly GCF_002708625.1 (Mucorales, 12 taxa). These anchor species were selected due to their association with extensive research over decades, their status as the taxa with the most downloaded assembly counts, and their association with widely studied strains. Our analysis identified an average of 6556 (N = 159), 4208 (N = 31), 4683 (N = 35), 7236 (N = 133), and 5983 (N = 12) ortholog pairs for the anchors A. fumigatus, S. cerevisiae, C. albicans, F. oxysporum, and R. microsporus, respectively (Supplementary Data 5).

Aspergillus fumigatus is a thermotolerant species and the primary causal agent of invasive aspergillosis (Beer et al. 2018). This species belongs to a genus which includes multiple toxin producing species which can infect a variety of eukaryotes (animal and plant) but also species important to the food, biotechnology and drug industries (Navale et al. 2021). As expected, the highest percentage of ortholog assignments for A. fumigatus was in the Aspergillus clade (97%, 55 taxa) and 7% less in the sister clade Penicillium (52 taxa) (Fig. 4). The highest number of orthologs (7441–8672) and average ortholog neighbor count (9.6–14.5) were assigned to species in the same subgenus, Fumigati (Supplementary Data 5).

Saccharomyces cerevisiae, commonly known as brewer’s yeast, has been used in baking, brewing and winemaking since ancient times. This yeast is an important model organism for molecular and cell biology research with a highly curated reference assembly and consistent genetic nomenclature (Wong et al. 2023). The highest percentage (92%) of ortholog assignments from S. cerevisiae was to species in its own genus (Saccharomyces clade, 4 taxa) and dropped to 57% in the most distant and early diverging clade (Ascoideales, 2 taxa) (Supplementary Fig. 4a). The yeasts S. cerevisiae and C. albicans are common members of the healthy human mycobiome (Nash et al. 2017). However, C. albicans belongs to a different class of yeasts (Pichiomycetes) and is a member of the order Serinales (in which the CTG codon codes for serine instead of leucine), which contains several human pathogens. C. albicans has the potential to cause candidiasis under certain conditions such as an immunocompromised state, dysbiosis or damage to the muco-intestinal barrier (Talapko et al. 2021). This taxonomic order also includes Candidozyma auris, a species which rapidly emerged as a serious threat in hospital-acquired infections and quickly gained multidrug resistance across the globe (Lockhart et al. 2017). In total, 96% of C. albicans proteins were assigned orthologs in species of the Candida (sensu stricto) clade, whereas the percentage dropped to 78% for the Candidozyma clade (Supplementary Fig. 4b), which contains C. auris and related taxa that were previously known as Candida species but were recently reclassified (Liu et al. 2024).

Fusarium oxysporum is a soil-borne fungus, mostly known as an economically important plant pathogen of various crops. However, F. oxysporum and other species in the genus are also opportunistic human pathogens, producing toxins and capable of localized or life-threatening disseminated infections depending on the status of an individual’s immune system (Al-Hatmi et al. 2016). The percentage of ortholog assignments of F. oxysporum proteins was 97% in the Fusarium clade containing 20 taxa (Supplementary Fig. 4c). The average orthologous neighbor count ranged from 5.5 to 12.3 for taxa in the same order (Hypocreales) but, as microsynteny deteriorated, dropped to 2.3–4.3 for taxa in other orders and more distant clades (Supplementary Data 5).

Rhizopus microsporus is a plant pathogen that causes rot in various crops and spoilage of fresh and manufactured food, yet it is also used in the production of fermented foods (Nout and Kiers 2005; Napo et al. 2025). In addition, R. microsporus is a healthcare-associated opportunistic pathogen that mainly affects predisposed individuals (Bowers et al. 2020). Interestingly, a bacterial endosymbiont has been found to help the fungus evade clearance by human macrophages (Itabangi et al. 2022). The percentages of anchor proteins with assigned orthologs in other species and clades in the Mucorales were low compared with those observed for taxonomic groups in the Ascomycota. Percentages ranged from 75% in the Mucoraceae family clade (not the family to which R. microsporus belongs) to 48% in Phycomyces blakesleeanus, with low average ortholog neighbor counts (1–2.7) across taxa in the order (Supplementary Fig. 4d). Genetic diversity, low sampling, and the lack of high-quality annotated assemblies likely contributed to this outcome. Currently, no annotated chromosome-level or complete assembly is publicly available, and the anchor assembly will likely be updated or switched in the future as data quality improves.

Data Access

NCBI Ortholog data, including sequence and metadata, can be accessed through NCBI Datasets (O’Leary et al. 2024). This resource provides both a web interface and programmatic tools to facilitate intuitive and user-friendly downloading of ortholog data.

Web Access: To explore ortholog data via the NCBI website, users can begin by visiting the NCBI homepage www.ncbi.nlm.nih.gov and entering a species name along with a gene symbol or description in the search bar (e.g. Homo sapiens ACE2). If the queried gene is part of an NCBI Ortholog set, its gene-specific knowledge panel will include an “Orthologs” button, which links directly to the corresponding ortholog page. This page allows users to browse ortholog sets by taxonomy, download associated sequences and metadata, and optionally align selected sequences using the COBALT multiple sequence alignment tool (Papadopoulos and Agarwala 2007). Additionally, the page provides links to view the gene’s genomic context using the NCBI Genome Data Viewer (Rangwala et al. 2021).

Programmatic Access: NCBI Ortholog data can be accessed programmatically using the NCBI Datasets command-line tools https://www.ncbi.nlm.nih.gov/datasets/docs/v2/command-line-tools/download-and-install/. The datasets tool enables users to view metadata for genes included in the ortholog set or download a gene data package containing sequence and metadata for one or more ortholog sets. Data retrieval can be specified using a gene symbol and taxon name, GeneID, or accession number. The --ortholog flag allows users to download all available orthologs (--ortholog all) or a filtered subset based on taxonomy (e.g., --ortholog mammals).

The resulting gene data package includes both sequence data and metadata, delivered in a ZIP archive. Users can customize the download to include specific sequence types such as FASTA files for genes, transcripts, proteins, untranslated regions (UTRs), and coding sequences (CDSs). Metadata and annotations for genes, transcripts, and proteins are provided as data reports in JSON Lines (.jsonl) format. The dataformat tool, which complements datasets, can be used to generate metadata tables (.tsv) from these data reports for easier analysis. When downloading multiple ortholog sets, each gene entry in the data_report.jsonl file includes a “gene_groups” section. This section specifies the ortholog methodology (e.g., NCBI Orthologs) and the identifier for the ortholog set, which corresponds to the GeneID of the anchor organism. This information can be used to organize and categorize ortholog sets.

For detailed guidance on retrieving ortholog data using the Datasets command-line tools, consult the NCBI Datasets how-to guide, available at: https://www.ncbi.nlm.nih.gov/datasets/docs/v2/how-tos/genes/download-ortholog-data-package/. Additionally, a comprehensive file containing all gene ortholog pairs is available via FTP at: https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene_orthologs.gz (see Fig. 5).
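As an illustration of how the “gene_groups” section can be used to organize a multi-set download, the Python sketch below groups gene records by ortholog set identifier. The exact key names in data_report.jsonl are an assumption here: the sketch accepts both `geneGroups`/`geneId` and `gene_groups`/`gene_id` spellings, and the actual report schema should be checked against the NCBI Datasets documentation. The sample records are toy values, not real GeneIDs.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_ortholog_set(jsonl_lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map ortholog set ID (the anchor GeneID) to the GeneIDs of its members."""
    sets_by_anchor: Dict[str, List[str]] = defaultdict(list)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        # Key names are assumptions; both spellings are tried.
        gene_id = record.get("geneId") or record.get("gene_id")
        groups = record.get("geneGroups") or record.get("gene_groups") or []
        for group in groups:
            sets_by_anchor[str(group["id"])].append(str(gene_id))
    return dict(sets_by_anchor)


# Toy records standing in for lines of data_report.jsonl (all values invented).
sample_lines = [
    '{"geneId": "1000", "symbol": "GENE1", "geneGroups": [{"id": "1000", "method": "NCBI Ortholog"}]}',
    '{"geneId": "2000", "symbol": "Gene1", "geneGroups": [{"id": "1000", "method": "NCBI Ortholog"}]}',
    '{"gene_id": "3000", "symbol": "geneX", "gene_groups": [{"id": "5000", "method": "NCBI Ortholog"}]}',
    '{"geneId": "4000", "symbol": "orphan"}',
]

sets_by_anchor = group_by_ortholog_set(sample_lines)
print(sets_by_anchor)
```

For a real package, the same function could be applied to the lines of the data report file after extracting the ZIP archive, for example by passing an open file handle.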

Discussion

Here, we describe the methodology employed by NCBI for the computation and reporting of orthologous relationships, a fundamental process seamlessly integrated into every EGAP annotation run to facilitate accurate and consistent gene nomenclature assignment. As genome assemblies and evidence data improve, and NCBI continues to refine EGAP, gene model annotations are routinely updated to incorporate these advancements.

The development of NCBI’s ortholog pipeline and its associated data model has been primarily driven by the need to propagate gene names. This specific application has shaped the adoption of several core principles that underpin our approach. Firstly, we adhere to strict one-to-one ortholog pairings to ensure unambiguous gene name propagation. Secondly, the ability to scale the process to many hundreds of genomes is essential. Given the high volume of eukaryotic genomes annotated by NCBI annually, the pipeline is engineered to process individual genomes efficiently, obviating the computationally intensive need to recompute orthologous groupings across the entire dataset with the addition of new genomes and annotations. Thirdly, orthologs are anchored at the gene level, utilizing stable, unique identifiers that are independent of specific genome assembly versions. This decoupling is essential for accommodating frequent annotation updates, driven by both improvements to the EGAP pipeline and the availability of new genomic or transcriptomic data, thereby ensuring data consistency and minimizing disruptions to user workflows. Finally, the one-to-one nature of our ortholog calls facilitated the adoption of a simplified “ortholog sets” data model, where all genes within a set are considered orthologous to each other. The assignment of the anchor species’ GeneID as the identifier for each ortholog set further streamlines data management and eliminates the necessity for generating distinct unique identifiers for each ortholog group.
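The “ortholog sets” data model described above can be sketched as a mapping keyed by the anchor GeneID, in which each query genome contributes at most one gene per set and genomes are added or refreshed one at a time. The class and method names below are hypothetical and the GeneIDs are toy values. This is an illustration of the data model, not NCBI's implementation.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


class OrthologSets:
    """Ortholog sets keyed by anchor GeneID; one gene per species per set."""

    def __init__(self) -> None:
        # anchor GeneID -> {species name: query GeneID}
        self._sets: Dict[int, Dict[str, int]] = defaultdict(dict)

    def add_genome(self, species: str, pairs: Iterable[Tuple[int, int]]) -> None:
        """Add or refresh one genome from (anchor GeneID, query GeneID) pairs.

        Only this species' entries change; other genomes are not recomputed.
        """
        pairs = list(pairs)
        anchors = [anchor_id for anchor_id, _ in pairs]
        queries = [query_id for _, query_id in pairs]
        if len(set(anchors)) != len(anchors) or len(set(queries)) != len(queries):
            raise ValueError("ortholog pairs must be one-to-one")
        for members in self._sets.values():
            members.pop(species, None)  # drop calls from a previous annotation
        for anchor_id, query_id in pairs:
            self._sets[anchor_id][species] = query_id

    def members(self, anchor_id: int) -> Dict[str, int]:
        """Return a copy of the set identified by the anchor GeneID."""
        return dict(self._sets.get(anchor_id, {}))


# Toy usage with invented GeneIDs.
sets = OrthologSets()
sets.add_genome("species_1", [(10, 101), (20, 201)])
sets.add_genome("species_2", [(10, 102)])
before_update = sets.members(10)

# A re-annotation of species_1 replaces only that species' entries.
sets.add_genome("species_1", [(10, 111)])
after_update = sets.members(10)
print(before_update, after_update, sets.members(20))
```

Because the anchor GeneID doubles as the set identifier, no separate group identifier has to be minted, and all members of a set can be treated as orthologous to each other.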

These guiding principles not only effectively serve the immediate needs for gene name propagation within NCBI but also present a contrasting approach to the methodologies employed by other similar resources. For instance, graph-based methods, which consider all-by-all relationships, often necessitate a complete recalculation of the entire ortholog dataset whenever new genomes are incorporated. While strategies like constructing a “core” phylogenetic tree and subsequently attaching genes from newly sequenced genomes can mitigate some of these computational burdens, they still represent a significant undertaking. The NCBI Orthologs pipeline distinguishes itself by integrating protein sequence similarities with comprehensive gene annotation information, including the context of neighboring exonic regions and microsynteny, to achieve a high degree of accuracy in ortholog inference. The reliance on complete annotation information, encompassing precise genomic locations of genes, transcripts, and exons, can be a considerable challenge for many individual research groups. However, EGAP’s capacity to generate consistent, high-quality annotations across a multitude of eukaryotic genomes, coupled with a standardized data model for representing genomic annotation, uniquely positions NCBI to perform such analyses and produce ortholog outputs of substantial value to the scientific community.

The approach adopted by NCBI Orthologs has inherent limitations. As described in the Results section, our algorithm is designed to conservatively exclude ambiguous ortholog calls, meaning if a single, best pair cannot be unambiguously identified, no ortholog call is made for the involved genes. We recognize that this approach does not represent paralogous relationships or other complex homology scenarios resulting from intricate evolutionary events. Consequently, a given ortholog set might lack species representation not only due to gene loss but also when certain gene duplication events lead to ambiguous ortholog calls. Conversely, if a one-to-one call is made despite a gene duplication event, nomenclature from the anchor species is propagated to only that single orthologous paralog in the query species. The remaining paralog(s) will then receive generic names, potentially obscuring the existence of additional related genes from users. This can fail to meet the needs of some users who are accustomed to one-to-many and many-to-many orthologous relationships. As we further develop this pipeline, we will carefully consider feedback from users and aim to meet their diverse needs. One potential approach is to make all results of the pipeline, including all homologous gene pairs examined along with all computed scores, publicly available to facilitate user-driven specialized analyses such as inferring one-to-many and many-to-many orthologous relationships.
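The conservative behavior described above, making no call when a single best pair cannot be unambiguously identified, can be illustrated with a simplified Python sketch. The single numeric score and the `min_margin` threshold are assumptions made for illustration. The actual pipeline uses a decision tree that combines protein similarity, nucleotide alignment, and microsynteny, which this sketch does not attempt to reproduce.

```python
from typing import Dict, Optional


def call_ortholog(candidates: Dict[str, float],
                  min_margin: float = 0.1) -> Optional[str]:
    """Return the best candidate gene, or None when the call is ambiguous.

    `candidates` maps query gene names to a hypothetical combined score.
    """
    if not candidates:
        return None
    ranked = sorted(candidates.items(), key=lambda item: item[1], reverse=True)
    if len(ranked) == 1:
        return ranked[0][0]
    (best_gene, best_score), (_, second_score) = ranked[0], ranked[1]
    # Near-identical paralogs score almost equally, so no call is made.
    if best_score - second_score < min_margin:
        return None
    return best_gene


clear_call = call_ortholog({"geneA": 0.95, "geneB": 0.60})
ambiguous_call = call_ortholog({"geneA": 0.95, "geneB": 0.93})
print(clear_call, ambiguous_call)
```

In the ambiguous case neither paralog receives the anchor's name, which is how a gene duplication can leave a species absent from an ortholog set.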

While nomenclature propagation is a primary motivation for developing the NCBI orthologs pipeline, applying gene names across species can still be challenging. For instance, some genes may be named with organism or genome specific information that is not broadly applicable to other species such as phenotypes, tissues, or genomic locations, and gene clusters prone to lineage-specific changes in copy number such as histones and olfactory receptor genes are generally not amenable to orthology-based naming. Similarly, the Drosophila melanogaster nomenclature, which utilizes symbols of the format “CG” followed by a number for many genes, is not particularly descriptive, and propagating these to all arthropod orthologs would be of little benefit to users.

Recent additions of insect anchor species (Apis mellifera, Bombyx mori, and Tribolium castaneum) demonstrate a commitment to improving within-clade orthology representation. The selection of these anchors was based on careful analysis of RefSeq trees (Goldfarb et al. 2025) and ortholog computation results. We continue to monitor the RefSeq collection to identify additional transitive anchors to incorporate more clade-specific orthologs. For example, as of March 2025, RefSeq included non-insect arthropods such as 26 crustaceans and 21 arachnids. Our initial assessments indicated that the current sampling of genomes within these clades was insufficient to identify an anchor capable of adding a substantial number of clade-specific orthologs. For the time being, we report orthologs and propagate gene names using the model organism fruit fly as the anchor. While this approach limits the number of identified orthologs to a few thousand (Fig. 2c, d), these data still provide valuable resources for exploring gene functions, conservation, and evolution among non-model genomes.

While we continue to add anchors, this may still not serve the needs of all users, particularly those who want to compute orthologs between a specific pair of genomes outside our established anchor system. To address this, we intend to develop a standalone pipeline that users can run independently for more specialized needs. Ultimately, the NCBI Orthologs resource will continue to expand in parallel with the growth of genome data in RefSeq and its taxonomic scope.

References

Al-Hatmi AM, Hagen F, Menken SB, Meis JF, de Hoog GS (2016) Global molecular epidemiology and genetic diversity of Fusarium, a significant emerging group of human opportunists from 1958 to 2015. Emerg Microbes Infect 5:e124
Alliance of Genome Resources Consortium (2024) Updates to the alliance of genome resources central infrastructure. Genetics 227:iyae049
Altenhoff AM, Glover NM, Dessimoz C (2019) Inferring orthology and paralogy. Methods Mol Biol 1910:149–175
Altenhoff A, Nevers Y, Tran V, Jyothi D, Martin M, Cosentino S, Majidian S, Marcet-Houben M, Fuentes-Palacios D, Persson E et al (2024a) New developments for the quest for orthologs benchmark service. NAR Genomics Bioinform 6:lqae167
Altenhoff AM, Warwick Vesztrocy A, Bernard C, Train CM, Nicheperovich A, Prieto Banos S, Julca I, Moi D, Nevers Y, Majidian S et al (2024b) OMA orthology in 2024: improved prokaryote coverage, ancestral and extant GO enrichment, a revamped synteny viewer and more in the OMA ecosystem. Nucleic Acids Res 52:D513–D521
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389–3402
Alvarez-Jarreta J, Amos B, Aurrecoechea C, Bah S, Barba M, Barreto A, Basenko EY, Belnap R, Blevins A, Bohme U et al (2024) VEuPathDB: the eukaryotic pathogen, vector and host bioinformatics resource center in 2023. Nucleic Acids Res 52:D808–D816
Beer KD, Farnon EC, Jain S, Jamerson C, Lineberger S, Miller J, Berkow EL, Lockhart SR, Chiller T, Jackson BR (2018) Multidrug-resistant Aspergillus fumigatus carrying mutations linked to environmental fungicide exposure - three states, 2010–2017. MMWR Morb Mortal Wkly Rep 67:1064–1067
Bornstein K, Gryan G, Chang ES, Marchler-Bauer A, Schneider VA (2023) The NIH comparative genomics resource: addressing the promises and challenges of comparative genomics on human health. BMC Genomics 24:575
Bowers JR, Monroy-Nieto J, Gade L, Travis J, Refojo N, Abrantes R, Santander J, French C, Dignani MC, Hevia AI et al (2020) Rhizopus microsporus infections associated with surgical procedures, Argentina, 2006–2014. Emerg Infect Dis 26:937–944
Buchfink B, Reuter K, Drost HG (2021) Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat Methods 18:366–368
Darwin Tree of Life Project Consortium (2022) Sequence locally, think globally: The Darwin Tree of Life Project. Proc Natl Acad Sci USA 119:e2115642118
El-Sayed NM, Myler PJ, Bartholomeu DC, Nilsson D, Aggarwal G, Tran AN, Ghedin E, Worthey EA, Delcher AL, Blandin G et al (2005) The genome sequence of Trypanosoma cruzi, etiologic agent of Chagas disease. Science 309:409–415
Emms DM, Kelly S (2019) OrthoFinder: phylogenetic orthology inference for comparative genomics. Genome Biol 20:238
Fischer S, Brunk BP, Chen F, Gao X, Harb OS, Iodice JB, Shanmugam D, Roos DS, Stoeckert CJ Jr. (2011) Using OrthoMCL to assign proteins to OrthoMCL-DB groups or to cluster proteomes into new ortholog groups. Curr Protoc Bioinformatics. https://doi.org/10.1002/0471250953.bi0612s35
Fitch WM (2000) Homology: a personal view on some of the problems. Trends Genet 16:227–231
Force A, Lynch M, Pickett FB, Amores A, Yan YL, Postlethwait J (1999) Preservation of duplicate genes by complementary, degenerative mutations. Genetics 151:1531–1545
Gabaldon T, Koonin EV (2013) Functional and evolutionary implications of gene orthology. Nat Rev Genet 14:360–366
Goldfarb T, Kodali VK, Pujar S, Brover V, Robbertse B, Farrell CM, Oh DH, Astashyn A, Ermolaeva O, Haddad D et al (2025) NCBI RefSeq: reference sequence standards through 25 years of curation and annotation. Nucleic Acids Res 53:D243–D257
Herrero J, Muffato M, Beal K, Fitzgerald S, Gordon L, Pignatelli M, Vilella AJ, Searle SM, Amode R, Brent S et al (2016) Ensembl comparative genomics resources. Database (Oxford) 2016:bav096
Itabangi H, Sephton-Clark PCS, Tamayo DP, Zhou X, Starling GP, Mahamoud Z, Insua I, Probert M, Correia J, Moynihan PJ et al (2022) A bacterial endosymbiont of the fungus Rhizopus microsporus drives phagocyte evasion and opportunistic virulence. Curr Biol 32(1115–1130):e1116
Kirilenko BM, Munegowda C, Osipova E, Jebb D, Sharma V, Blumer M, Morales AE, Ahmed AW, Kontopoulos DG, Hilgers L et al (2023) Integrating gene annotation with orthology inference at scale. Science 380:eabn3107
Klingler M, Bucher G (2022) The red flour beetle T. castaneum: elaborate genetic toolkit and unbiased large scale RNAi screening to study insect biology and evolution. EvoDevo 13:14
Kocher S, Kingwell C (2024) The molecular substrates of insect eusociality. Annu Rev Genet 58:273–295
Langschied F, Bordin N, Cosentino S, Fuentes-Palacios D, Glover N, Hiller M, Hu Y, Huerta-Cepas J, Coelho LP, Iwasaki W et al (2024) Quest for orthologs in the era of biodiversity genomics. Genome Biol Evol. https://doi.org/10.1093/gbe/evae224
Liu F, Hu ZD, Zhao XM, Zhao WN, Feng ZX, Yurkov A, Alwasel S, Boekhout T, Bensch K, Hui FL et al (2024) Phylogenomic analysis of the Candida auris-Candida haemuli clade and related taxa in the Metschnikowiaceae, and proposal of thirteen new genera, fifty-five new combinations and nine new species. Persoonia 52:22–43
Lockhart SR, Etienne KA, Vallabhaneni S, Farooqi J, Chowdhary A, Govender NP, Colombo AL, Calvo B, Cuomo CA, Desjardins CA et al (2017) Simultaneous emergence of multidrug-resistant Candida auris on 3 continents confirmed by whole-genome sequencing and epidemiological analyses. Clin Infect Dis 64:134–140
Lovell JT, Sreedasyam A, Schranz ME, Wilson M, Carlson JW, Harkess A, Emms D, Goodstein DM, Schmutz J (2022) GENESPACE tracks regions of interest and gene copy number variation across multiple genomes. Elife. https://doi.org/10.7554/eLife.78526
Majidian S, Nevers Y, Yazdizadeh Kharrazi A, Warwick Vesztrocy A, Pascarelli S, Moi D, Glover N, Altenhoff AM, Dessimoz C (2025) Orthology inference at scale with FastOMA. Nat Methods 22:269–272
Manni M, Berkeley MR, Seppey M, Simao FA, Zdobnov EM (2021) BUSCO update: novel and streamlined workflows along with broader and deeper phylogenetic coverage for scoring of eukaryotic, prokaryotic, and viral genomes. Mol Biol Evol 38:4647–4654
McGinnis S, Madden TL (2004) BLAST: at the core of a powerful and diverse set of sequence analysis tools. Nucleic Acids Res 32:W20-25
Napo M, Kock A, Alayande KA, Sulyok M, Ezekiel CN, Uehling J, Pawlowska TE, Adeleke RA (2025) Tomato rot by Rhizopus microsporus alters native fungal community composition and secondary metabolite production. Front Microbiol 16:1508519
Nash AK, Auchtung TA, Wong MC, Smith DP, Gesell JR, Ross MC, Stewart CJ, Metcalf GA, Muzny DM, Gibbs RA et al (2017) The gut mycobiome of the Human Microbiome Project healthy cohort. Microbiome 5:153
Navale V, Vamkudoth KR, Ajmera S, Dhuri V (2021) Aspergillus derived mycotoxins in food and the environment: prevalence, detection, and toxicity. Toxicol Rep 8:1008–1030
Nevers Y, Jones TEM, Jyothi D, Yates B, Ferret M, Portell-Silva L, Codo L, Cosentino S, Marcet-Houben M, Vlasova A et al (2022) The quest for orthologs orthology benchmark service in 2022. Nucleic Acids Res 50:W623–W632
Nevers Y, Warwick Vesztrocy A, Rossier V, Train CM, Altenhoff A, Dessimoz C, Glover NM (2025) Quality assessment of gene repertoire annotations with OMArk. Nat Biotechnol 43:124–133
Nout MJ, Kiers JL (2005) Tempe fermentation, innovation and functionality: update into the third millenium. J Appl Microbiol 98:789–805
O’Leary NA, Cox E, Holmes JB, Anderson WR, Falk R, Hem V, Tsuchiya MTN, Schuler GD, Zhang X, Torcivia J et al (2024) Exploring and retrieving sequence and metadata for species across the tree of life with NCBI datasets. Sci Data 11:732
Ozturk-Colak A, Marygold SJ, Antonazzo G, Attrill H, Goutte-Gattat D, Jenkins VK, Matthews BB, Millburn G, Dos Santos G, Tabone CJ (2024) FlyBase: updates to the Drosophila genes and genomes database. Genetics. https://doi.org/10.1093/genetics/iyad211
Papadopoulos JS, Agarwala R (2007) COBALT: constraint-based alignment tool for multiple protein sequences. Bioinformatics 23:1073–1079
Prieto-Banos S, Nevers Y, Altenhoff A, Warwick Vesztrocy A, Dessimoz C, Glover NM (2025) Annotation matters: the effect of structural gene annotation on orthology inference. Bioinformatics. https://doi.org/10.1093/bioinformatics/btaf365
Rangwala SH, Kuznetsov A, Ananiev V, Asztalos A, Borodin E, Evgeniev V, Joukov V, Lotov V, Pannu R, Rudnev D et al (2021) Accessing NCBI data using the NCBI Sequence Viewer and Genome Data Viewer (GDV). Genome Res 31:159–169
Rangwala SH, Rudnev DV, Ananiev VV, Oh DH, Asztalos A, Benica B, Borodin EA, Bouk N, Evgeniev VI, Kodali VK et al (2024) The NCBI comparative genome viewer (CGV) is an interactive visualization tool for the analysis of whole-genome eukaryotic alignments. PLoS Biol 22:e3002405
Rhie A, McCarthy SA, Fedrigo O, Damas J, Formenti G, Koren S, Uliano-Silva M, Chow W, Fungtammasan A, Kim J et al (2021) Towards complete and error-free genome assemblies of all vertebrate species. Nature 592:737–746
Seal RL, Braschi B, Gray K, Jones TEM, Tweedie S, Haim-Vilmovsky L, Bruford EA (2023) Genenames.org: the HGNC resources in 2023. Nucleic Acids Res 51:D1003–D1009
Takeda H, Shimada A (2010) The art of medaka genetics and genomics: what makes them so unique? Annu Rev Genet 44:217–241
Talapko J, Juzbasic M, Matijevic T, Pustijanac E, Bekic S, Kotris I, Skrlec I (2021) Candida albicans—the virulence factors and clinical manifestations of infection. J Fungi (Basel) 7:79
Taylor JS, Braasch I, Frickey T, Meyer A, Van de Peer Y (2003) Genome duplication, a trait shared by 22000 species of ray-finned fish. Genome Res 13:382–390
Tegenfeldt F, Kuznetsov D, Manni M, Berkeley M, Zdobnov EM, Kriventseva EV (2025) OrthoDB and BUSCO update: annotation of orthologs with wider sampling of genomes. Nucleic Acids Res 53:D516–D522
Thomas PD, Ebert D, Muruganujan A, Mushayahama T, Albou LP, Mi H (2022) PANTHER: making genome-scale phylogenetics accessible to all. Protein Sci 31:8–22
Weatherly DB, Boehlke C, Tarleton RL (2009) Chromosome level assembly of the hybrid Trypanosoma cruzi genome. BMC Genomics 10:255
Wong ED, Miyasato SR, Aleksander S, Karra K, Nash RS, Skrzypek MS, Weng S, Engel SR, Cherry JM (2023) Saccharomyces genome database update: server architecture, pan-genome nomenclature, and external resources. Genetics. https://doi.org/10.1093/genetics/iyac191
Zhao T, Schranz ME (2019) Network-based microsynteny analysis identifies major differences and genomic outliers in mammalian and angiosperm genomes. Proc Natl Acad Sci U S A 116:2165–2174


Acknowledgements

The authors would like to thank Vyacheslav Brover for sharing the RefSeq phylogenetic trees; Anna Glodek, Emily Davis, Anne Ketter and Ekatarina Sukharnikov for product and project management support to the RefSeq and Datasets teams; and members of the RefSeq and EGAP curation team for data review and expert curation.

Funding

Open access funding provided by the National Institutes of Health. This work was supported by the National Center for Biotechnology Information of the National Library of Medicine (NLM), National Institutes of Health (NIH). The contributions of the NIH author(s) are considered Works of the United States Government. The findings and conclusions presented in this paper are those of the author(s) and do not necessarily reflect the views of the NIH or the U.S. Department of Health and Human Services.

Author information

Authors and Affiliations

National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
Dong-Ha Oh, Alexander Astashyn, Barbara Robbertse, Nuala A. O’Leary, W. Ray Anderson, Laurie Breen, Eric Cox, Olga Ermolaeva, Robert Falk, Vichet Hem, J. Bradley Holmes, Patrick Masterson, Kelly M. McGarvey, Eyal Mozes, John P. Torcivia, Mirian T. N. Tsuchiya, Craig Wallin, Françoise Thibaud-Nissen, Terence D. Murphy & Vamsi K. Kodali

Corresponding author

Correspondence to Vamsi K. Kodali.

Ethics declarations

Conflict of interest

The authors declare that there are no conflicts of interest.

Additional information

Handling Editor: Natasha Glover.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (XLSX 414 KB)

Supplementary file2 (PDF 1039 KB)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article

Oh, DH., Astashyn, A., Robbertse, B. et al. NCBI Orthologs: Public Resource and Scalable Method for Computing High-Precision Orthologs Across Eukaryotic Genomes. J Mol Evol (2025). https://doi.org/10.1007/s00239-025-10268-2


Received: 02 April 2025
Accepted: 05 September 2025
Published: 25 September 2025
DOI: https://doi.org/10.1007/s00239-025-10268-2
