Background

Microorganisms are the most abundant and widely distributed life forms on Earth, playing a crucial role in biogeochemical cycles and maintaining ecological balance [1, 2]. They inhabit diverse environments, from air, soil, and water to extreme habitats like deep-sea hydrothermal vents and glaciers [3, 4]. The importance of microorganisms has spurred extensive research, especially as high-throughput sequencing technology has advanced and sequencing costs have fallen. Genome-resolved metagenomic analyses, such as those for Earth’s [5], glacier [6], aquatic [7], and human [8] microbiome genome catalogs, have been conducted using two main methodologies: amplicon sequencing of marker genes (e.g., 16S rRNA) and shotgun metagenomics [9]. While 16S rRNA sequencing provides general insights into microbiota, it lacks the resolution to distinguish closely related taxa and is unable to accurately identify viruses [10]. Therefore, shotgun metagenomics, which sequences genomic DNA without targeting specific genes, has become the primary tool for studying microorganisms [11].

Genomic analyses are yielding unprecedented insights into microbial evolution and diversity and are elucidating the extent and complexity of the genetic variation in both hosts and pathogens that underlies diseases [12]. However, more than 99% of prokaryotes in the environment cannot be cultured in the laboratory with traditional methods. In parallel to culturing, de novo assembly of shotgun metagenomic reads and binning into metagenome-assembled genomes (MAGs), a culture-independent and reference-free approach, is considered a useful strategy for efficiently discovering microbial diversity that is recalcitrant to current laboratory culturing approaches [13,14,15,16]. MAGs have massively expanded the tree of life across different environmental niches, enabling the discovery of unknown species and the exploration of microbial source tracking [17, 18]. In the last few years, thousands of MAGs have been reported [5, 19,20,21,22]. Building and mining MAG sequences are becoming central and common tasks in inferring the functional capabilities of bacteria, as they provide genome-level insights into the functional potential of individual microbial entities. However, MAG sequences vary in quality and may contain omissions and contamination due to the inherent complexities of metagenomic data and the challenges associated with assembly processes [13, 23]. Consequently, recovering high-quality MAGs (more than 90% completeness and less than 5% contamination; hereafter referred to as HMAGs) based on the “minimum information about a metagenome-assembled genome” (MIMAG) standard [24] from shotgun metagenomic sequence data is a crucial step for downstream analysis. Moreover, metagenomic assembly and binning are both time-consuming and resource-intensive. Therefore, there is an urgent need for an “all-in-one” database that contains high-quality MAG data from a variety of environments and host-associated microbiota and keeps pace with rapidly growing metagenomic data. High-quality MAG reference databases are required to confidently investigate the structure and function of complex microbial communities in natural or engineered ecosystems. To our knowledge, however, no comprehensive database currently provides permanent storage and public access for high-quality MAG data based on representative metagenomic studies.

To address these limitations and facilitate the reusability and accessibility of MAG data, we established MAGdb, a curated database that focuses on high-quality assembled microbiome sequences. Overall, we collected 13,702 paired-end sequencing runs from shotgun metagenomic sequencing across 74 papers, spanning clinical, environmental, and animal research areas. The main features of MAGdb include (i) manually curated publication information for each collected run, with all metadata in a consistent format; (ii) consistent taxonomic assignments of HMAGs and precomputed genome information; and (iii) easily accessible, categorized HMAGs with complete traceability to the source raw data. MAGdb enables researchers to quickly acquire MAG sequences for the microbiota of interest and provides sequence downloads for exploring the composition and roles of the microbiome in different areas.

Results

Design of MAGdb database

The construction scheme of the MAGdb database is illustrated in Fig. 1. In brief, the raw metagenomic data, MAGs (if provided), and the metadata were collected and manually curated, with each representative research paper treated as one unit (see the “Construction and content” section). Subsequently, for publications that did not provide MAGs, we recovered MAGs with a combined metagenomic assembly and binning pipeline (Fig. 1). The MAGs were produced by three different binning tools and then integrated and refined with metaWRAP [3] to remove duplicates and improve the quality of the assembled genomes. To enforce strict genome quality control, we selected only those MAGs that met the high-quality standard of > 90% completeness and < 5% contamination for subsequent analyses; we refer to this set as the MAGdb catalog. Finally, all curated data were assembled into the database system, and the web platform was implemented.
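The quality gate described above can be sketched in a few lines of Python. This is a minimal illustration only, not the pipeline code used to build MAGdb: the `Bin` record and its field names are hypothetical, and only the two thresholds (> 90% completeness, < 5% contamination) come from the text.

```python
from dataclasses import dataclass

# Thresholds quoted in the text: > 90% completeness and < 5% contamination.
MIN_COMPLETENESS = 90.0
MAX_CONTAMINATION = 5.0


@dataclass
class Bin:
    # Hypothetical record; the field names are illustrative, not a CheckM schema.
    name: str
    completeness: float   # percent
    contamination: float  # percent


def is_high_quality(b: Bin) -> bool:
    """Return True when a bin passes both high-quality thresholds (strict inequalities)."""
    return b.completeness > MIN_COMPLETENESS and b.contamination < MAX_CONTAMINATION


def select_hmags(bins):
    """Keep only the bins that qualify as HMAGs, preserving input order."""
    return [b for b in bins if is_high_quality(b)]


if __name__ == "__main__":
    demo = [
        Bin("bin.1", 96.8, 1.0),  # passes both thresholds
        Bin("bin.2", 90.0, 0.5),  # fails: completeness is not strictly above 90
        Bin("bin.3", 99.1, 5.2),  # fails: contamination is not below 5
    ]
    print([b.name for b in select_hmags(demo)])
```

The strict inequalities mirror the wording of the thresholds; a bin at exactly 90% completeness is excluded under this reading.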

Fig. 1 Workflow for constructing the global MAG database

Overview of the MAGdb content and statistics

To date, MAGdb has collected 13,702 microbial metagenomic sequence samples from 74 research publications, covering 66 countries across 5 continents, and these samples were classified into clinical, environmental, and animal categories (Fig. 2a). Clinical samples account for the largest proportion (76.2%), followed by environmental samples (12.4%) and animal samples (11.4%) (Fig. 2a). Specifically, the clinical category included 29 publications and 10,439 run accessions, the environmental category included 30 publications and 1703 run accessions, and the animal category included 15 publications and 1560 run accessions (Table 1). These extensive datasets serve as the fundamental resources of the MAGdb database, providing an expansive landscape for exploring MAGs.
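The category shares follow directly from the run-accession counts. The short Python sketch below recomputes them from the three counts reported in the text; the function name is illustrative.

```python
# Run-accession counts per category, as reported in the text.
RUNS = {"clinical": 10439, "environmental": 1703, "animal": 1560}


def category_percentages(counts):
    """Return each category's share of the total, in percent, rounded to one decimal."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


if __name__ == "__main__":
    # The three categories add up to the stated total of 13,702 runs.
    assert sum(RUNS.values()) == 13702
    for name, pct in category_percentages(RUNS).items():
        print(f"{name}: {pct}%")
```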

Fig. 2 Summary of the data statistics in the MAGdb database. a The bar and pie charts depict the sample size distribution across the top ten countries and the distribution of all collected run accessions across the three categories. b Distribution of quality metrics for the HMAGs (n = 99,672); boxes show the interquartile range between the first and third quartiles, and the line inside each box represents the median. c Number of HMAGs at different classification levels (domain, phylum, genus) in bacteria and d in archaea

Table 1 Detailed data statistics of MAGdb

MAGdb now contains a total of 99,672 HMAGs in three categories (Table 1). All HMAGs meet the high-quality level of the MIMAG criteria (completeness > 90%, contamination < 5%), exhibiting a mean completeness of 96.84% (± 2.81%) and a mean contamination rate of 1.02% (± 1.09%), with genome sizes ranging from 0.52 to 12.26 Mb and GC content varying from 22.4% to 75% (Fig. 2b). We further taxonomically annotated these HMAGs using GTDB-Tk with the GTDB database. The MAGdb catalog covered 90 known phyla (82 for bacteria and 8 for archaea), 196 known classes (177 for bacteria and 19 for archaea), 501 known orders (474 for bacteria and 27 for archaea), and 2753 known genera (2687 for bacteria and 66 for archaea). The most frequently occurring genera and corresponding phyla in the bacterial and archaeal domains are shown in Fig. 2c–d.

We also analyzed the correlation between sequencing read counts and both HMAG completeness and the number of recovered HMAGs. The relationship between per-sample sequencing read counts and mean HMAG completeness showed divergent trends across sample types in the clinical (a), environmental (c), and animal (e) categories: while human gut samples and animal-derived samples (e.g., Sus scrofa lung, fecal) exhibit progressive declines in completeness, environmental samples (e.g., soil, water) show either gradual increases or stable plateaus (Additional file 1: Fig. S2 a, c, e). In contrast, the number of recovered HMAGs increases with sequencing read counts in the clinical (b), environmental (d), and animal (f) categories, especially in fecal samples (Additional file 1: Fig. S2 b, d, f). These findings show that sequencing depth influences both MAG completeness and yield, and that optimal sequencing strategies should consider microbial community complexity.

The HMAG catalog covers a highly diverse and complex range of microbial species. In total, we annotated 5381 species and 2753 genera from the 99,672 HMAGs, while 6316 HMAGs remained unclassified at the species level. The top 10 taxa at each classification level (order, family, genus, and species) for each category are depicted in Fig. 3a–c. Taxonomic analysis revealed Escherichia coli as the dominant species in clinical samples. However, most HMAGs derived from environmental and animal specimens remained unclassified at the species level. The large proportion of unclassified HMAGs suggests extensive undiscovered microbial diversity in these ecosystems. We also analyzed the top 10 species of the three categories, as well as their higher taxonomic ranks, as shown in Fig. 3d.
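Counts like these can be derived from rank-prefixed lineage strings. The Python sketch below assumes lineages in the semicolon-separated “d__…;p__…;…;s__…” form, with an empty “s__” field marking a genome without a species assignment; the example lineages are made up for illustration and are not taken from MAGdb.

```python
from collections import Counter

RANK_PREFIXES = {"d": "domain", "p": "phylum", "c": "class", "o": "order",
                 "f": "family", "g": "genus", "s": "species"}


def parse_lineage(lineage: str) -> dict:
    """Split a 'd__...;p__...;...;s__...' string into a rank -> name mapping.

    Ranks with an empty name (for example a bare 's__') map to None.
    """
    ranks = {}
    for field in lineage.split(";"):
        prefix, _, name = field.strip().partition("__")
        if prefix in RANK_PREFIXES:
            ranks[RANK_PREFIXES[prefix]] = name or None
    return ranks


def summarize(lineages):
    """Count distinct species and genera, and genomes lacking a species label."""
    parsed = [parse_lineage(item) for item in lineages]
    species = Counter(p["species"] for p in parsed if p.get("species"))
    genera = {p["genus"] for p in parsed if p.get("genus")}
    unclassified = sum(1 for p in parsed if not p.get("species"))
    return {
        "n_species": len(species),
        "n_genera": len(genera),
        "unclassified_at_species": unclassified,
        "top_species": species.most_common(1),
    }


if __name__ == "__main__":
    # Illustrative lineages only; the last one has no species-level label.
    ecoli = ("d__Bacteria;p__Pseudomonadota;c__Gammaproteobacteria;"
             "o__Enterobacterales;f__Enterobacteriaceae;g__Escherichia;"
             "s__Escherichia coli")
    novel = ("d__Bacteria;p__Bacillota_A;c__Clostridia;o__Oscillospirales;"
             "f__Ruminococcaceae;g__Faecalibacterium;s__")
    print(summarize([ecoli, ecoli, novel]))
```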

Fig. 3 The HMAG catalogue as an expanded genomic resource. Taxonomic distribution of the HMAG dataset at order, family, genus, and species levels in the clinical (a), environmental (b), and animal (c) categories. Only the top ten taxa are shown at each taxonomic level in the three categories. d The Sankey chart shows the lineages of the ten most frequent species in each of the three categories

Database web interface and modules

We designed a user-friendly interface (Fig. 4a) that allows users to effectively browse and query MAGs and related information. In short, MAGdb can be divided into three main modules, namely “Rawdata,” “MAG,” and “Download.” These modules provide detailed and convenient publication links, raw data metadata, MAGs sequences, and sequence information for users to browse, search, and download in various aspects of the microbiome (Fig. 4b, c). By employing these modules, users can easily mine the sequence information of MAGs, gaining valuable insights into microbial diversity, functional potential, and genetic characteristics.

Fig. 4 Features of MAGdb database. a Website design and interface of the MAGdb. b Data browsing interface. c Detailed information interface

The “Rawdata” module provides the list of the publication items in each category, labeled with the journal name and publication date (Fig. 4b, upper panel). Each publication item was provided with the number of collected run accessions, quality control reports, summary of the publication, official journal website link, and metadata. All the above characteristics contain additional links for users to explore detailed information. For example, the “quality control report” link provides multiple visualizations depicting the QC results of sequence data, such as reads length, GC content, adapter content, and duplication rates (Fig. 4c, upper panel). The curated metadata and associated information for each study can be downloaded as Excel (.xlsx) files.

The “MAG” module primarily offers a comprehensive resource for users to browse and explore MAG sequences in each publication item (Fig. 4b, middle panel). Users can click the publication items to access a browsing page containing sequence information of all MAGs generated in the corresponding study. In addition, the “HMAG” link allows users to swiftly navigate to the global summary page that provides the MAG statistical plots from this study, including completeness, contamination, genome size, number of contigs, N50, and taxonomic classification information (Fig. 4c, lower panel). Users can freely download the plot as a portable graphic file as well as the MAGs statistic information.

The “Download” module provides a page for downloading the MAG sequences by publication in each category (Fig. 4b, lower panel). Each publication item links to a compressed file for download, and users can download data from multiple publications at once through batch selection. In addition, the “Help” page offers comprehensive guidance and step-by-step procedures for navigating and using the MAGdb system. The “News” page displays information about versions, timestamps, and changes made in each version.

Phylogenetic and functional characterization of MAGdb

To investigate the evolutionary relationships and functional divergence within MAGdb, we first established a non-redundant genome set through dereplication at 95% ANI. This yielded 7303 non-redundant representative HMAGs, which were subjected to further analyses. To determine the phylogenetic relationships of these representative genomes, a maximum-likelihood phylogenetic tree of the representative HMAGs was generated with FastTree [25] based on 120 bacterial marker genes identified by GTDB-Tk [26] (Fig. 5a). The HMAGs belonging to p__Bacillota_A exhibited the most widespread distribution in the phylogenetic tree, indicating high phylogenetic diversity.

To better elucidate the functional diversity of the HMAGs in MAGdb, we annotated gene functions of the representative HMAGs with the eggNOG database [27], including COGs, KEGG pathways, and level-4 ECs. We found that 94% of genes from MAGdb had a match in at least one of the following resources: COGs (n = 13,598,009 genes across 24 functional categories), ECs (n = 4,311,408 genes matching 3610 enzymes), KEGG (n = 4,913,916 genes from 872 pathways), GOs (n = 1,172,159 genes from 20,792 GOs), and CAZy (n = 194,221 genes from 141 CAZys) (Fig. 5b).

To gain further insights into the relationships between microbial phyla, we constructed co-occurrence networks based on the frequency of each HMAG phylum across the three categories. The co-occurrence network analysis revealed distinct patterns across sample types. In animal-associated samples, we observed strong positive correlations between microbial phyla such as p__Bacillota_C, p__Spirochaetota, and p__Bacteroidota. These phyla exhibited a high degree of co-occurrence, suggesting potential ecological relationships and functional associations within the animal microbiota. In contrast, environmental samples showed strong co-occurrence between p__Latescibacterota, p__Marinisomatota, and p__Bacillota_A, forming multiple network modules that indicate significant interactions among these phyla in the environmental microbiome. These phyla may play a crucial role in microbial communities under specific environmental conditions. Clinical samples, however, exhibited predominantly negative co-occurrence patterns among major phyla, including p__Bacillota_A, p__Firmicutes_A, and p__Actinobacteriota (Fig. 5c). These negative associations may reflect niche competition or variability introduced by factors such as host health status, treatments, or environmental conditions.

Fig. 5 Exploratory analysis of MAGdb. a Phylogenetic tree of non-redundant HMAGs constructed in this study. From the inner to outer circles, the first circle indicates the genome size of the HMAGs, the second circle indicates the phylum to which each HMAG belongs, the third circle shows whether the HMAG could be annotated to a species, and the fourth circle indicates the category of each HMAG’s original sample. b Number of proteins with functional annotations across the five functional categories and their degree of overlap. Vertical bars represent the number of proteins unique to each functional category or shared between the specific functional categories. Horizontal bars in the lower panel indicate the total number of proteins with functional annotation in each functional category. c Co-occurrence networks at the phylum level. Interactions between phyla (nodes) are represented by connecting lines (edges), and each node is colored according to the phylum to which it belongs. The colors of the edges represent positive (light red) or negative (dark blue) interactions

Discussion and future direction

Technological advancements in assembly and binning tools have led to a significant increase in the assembled fraction of the average metagenome, coupled with an exponential increase in the number of MAGs [28]. In this study, we constructed MAGdb, an online database of curated and consistently annotated HMAGs. With 13,702 samples collected from 74 metagenomic research papers, 99,672 HMAGs were obtained in the current version of MAGdb, linked to 5381 species. Notably, 6316 HMAGs (6.3%) have not been annotated to the species level. These likely represent novel taxa, indicating that many new species remain to be discovered. With the continued accumulation of sequencing data and improvements in annotation methods, currently unidentified HMAGs may eventually be classified as novel microbial species, further enriching our understanding of microbial diversity. Comparisons with existing metagenomic databases indicated that MAGdb offers broader coverage and contains more MAGs of higher quality [29, 30], thus enabling more accurate and reliable microbial genome analyses.

We believe that MAGdb will have a positive impact on various aspects of microbiological and metagenomic research. Specifically, MAGdb serves as a comprehensive resource for the identification, characterization, and functional annotation of metagenome-assembled genomes, enhancing our understanding of microbial functions and interactions. For example, mining MAGs to infer the functional capabilities of microorganisms is becoming a central process in many microbiology studies [31], with targets such as CRISPR/Cas loci [32, 33], antimicrobial resistance genes [34], and mobile genetic elements [35]. Moreover, MAG sequences can offer invaluable insights into previously undescribed organisms, often referred to as microbial dark matter, along with a comprehensive understanding of their genetic composition [20, 36, 37]. The ability of MAGs to reveal unknown species becomes crucial in the context of emerging and novel infectious diseases, providing essential clues for the early detection of potential pathogens [18]. MAGs also provide information beyond the reference genome. Unlike reference genomes, MAGs facilitate the identification of unique genetic features, such as specific mutation sites and genetic variations, including key features associated with environmental adaptability and ecological roles. Additionally, subspecies variations, such as strain diversity, mobile gene composition, and copy number variations, have been shown to be associated with host traits and lifestyle habits [38]. The insights derived from MAGs not only extend our understanding beyond the limitations of existing reference genomes but also play a critical role in advancing diverse fields, ranging from environmental microbiology to human health research.

The MAGdb database will continue to be upgraded and improved in the future. MAGdb is currently at release version 1.0 and will be updated annually to keep pace with the exponentially growing number of new metagenome sequences added to public repositories. In addition to continuously collecting new metagenomic data over the next few years, we plan to add new content to MAGdb, including (but not limited to) metavirome data for viral operational taxonomic units (vOTUs), and to deploy analysis modules for mining functional profiles and evolutionary relationships. Additionally, at the time this manuscript was prepared, we observed a gradual increase in third-generation (long-read) sequencing data among metagenomes. We therefore anticipate providing MAGs derived from third-generation sequencing assemblies in future versions. With these advancements, we believe that our database will further improve the reusability and exploration of metagenomic data and help users better understand microbial populations and their interactions with the environments in which they live.

Conclusion

MAGdb provides a comprehensive platform for the global integration and standardization of MAGs, facilitating microbial diversity exploration and cross-study comparisons in downstream analysis. The platform also uses visualization to illustrate the sequence data and MAG patterns within each publication. By providing a centralized and consistent resource, MAGdb enhances the reproducibility and comparability of studies, allowing researchers to explore microbial diversity and function on a large scale. Furthermore, the detailed annotations of each HMAG record, including sequence characteristics and taxonomic affiliation, offer a robust framework for downstream analysis, facilitating the discovery of novel species, functional genes, and their roles in various ecosystems.

Construction and content

Data collection and curation

We collected literature on metagenome studies from the PubMed database using relevant keywords, such as “microbiome AND shotgun metagenomics” and “de-novo assembly AND microbial AND genome,” which yielded a list of more than 2000 publications since 2015. To include only high-quality studies, we first excluded reviews, letters, editorials, and other publications that did not disclose the original sequencing data. We then refined this list by reading the abstracts and results of each publication, keeping only those studies that met the following criteria: (1) Research quality: studies published in peer-reviewed, high-impact journals recognized in the field of microbiome metagenomics; (2) Data accessibility: studies that provided raw data, metadata, or detailed experimental protocols to ensure reproducibility and independent verification; (3) Focus on relevant topics: studies directly related to microbiome metagenomics in clinical (e.g., human gut, skin samples), environmental (e.g., soil, marine samples), or animal (e.g., cow, goat samples) contexts. Only publications that satisfied all of the above criteria were retained. Subsequently, we manually reviewed the experimental sections and extracted only samples confirmed to have been sequenced by shotgun metagenomics. Through multiple rounds of manual curation, the information from selected studies was collected into an Excel sheet with pre-defined fields. We did not include any studies that required additional ethics committee approvals or authorizations for access. As a result, we acquired a total of 74 metagenome-associated studies for subsequent analysis (Additional file 1: Fig. S1).

Paired-end raw metagenome shotgun sequencing runs together with MAGs data (if provided) were downloaded from online repositories (EBI-ENA [39], NCBI-SRA [40], CNGB-CNSA [41] and NGDC-GSA [42]) whose links can be found in the related publications. Meanwhile, for each run or sample, we also collected relevant metadata, including technical metadata, such as the run accession, sequencing platform, number of reads, and base length, as well as biological metadata, such as sample source, sample name, country, and continent. Finally, the meticulously curated metadata of raw metagenome sequencing data were compiled into an Excel spreadsheet with the above pre-defined fields.

Data processing

Our systematic analysis included 74 published metagenomic studies. Among these, only 5 studies provided both single-sample assembled genomes and complete assembly provenance metadata. For these studies containing MAGs, we assessed genome quality using CheckM (v1.2.2) [43], retaining only HMAGs that met the criteria (completeness > 90%, contamination < 5%). For the remaining 69 studies lacking MAGs, we performed de novo analysis of 6,692 metagenomic samples using a standardized bioinformatic pipeline based on the EasyMetagenome framework [44] (https://github.com/YongxinLiu/EasyMetagenome). Before subsequent analysis, FastQC (v0.12.1, http://www.bioinformatics.babraham.ac.uk/projects/fastqc/) was employed to evaluate the overall quality of the raw metagenomic data, and MultiQC (v1.21) [45] was then used to aggregate the FastQC reports across many samples into a single self-contained HTML report.

Next, to obtain the MAG sequences for each publication, we processed and analyzed the paired-end metagenomic data from articles that did not provide MAG sequences [46]. In brief, fastp (v0.23.2) [47] was used for quality filtering and adapter removal of short paired-end reads with the parameters “--dedup -q 20”. For raw metagenomic data obtained from clinical and animal samples, host genome contamination was then removed before assembly by aligning reads against the host reference genome with bowtie2 (v2.5.1) [48] using the parameters “--end-to-end --no-mixed --no-discordant --no-unal --very-sensitive”. Subsequently, the remaining paired-end reads were assembled into contiguous sequences for each sequenced sample separately by MEGAHIT (v1.2.9) [49] with default parameters. Thereafter, the metaWRAP binning module [3], which integrates the three binning tools MaxBin2 (v2.2.7) [50], metaBAT2 (v2.12.1) [51], and CONCOCT (v1.1.0) [52], was used to bin the assemblies with the “--metabat2 --maxbin2 --concoct” options. The default minimum contig length for constructing bins was 1000 bp for MaxBin2 and CONCOCT and 1500 bp for metaBAT2. To reconcile and dereplicate the outputs of the three binners, MAGs were refined with the bin_refinement module of metaWRAP [3], and CheckM (v1.2.2) [43] was used to estimate the completeness and contamination of the bins with the parameters “-x 10 -c 50”, corresponding to a maximum contamination of 10% and a minimum completeness of 50%, respectively. Finally, we kept only the MAGs exhibiting > 90% completeness and < 5% contamination as the HMAG sequences for further analysis. All HMAGs were taxonomically annotated using GTDB-Tk (v2.3.0) [26] (reference database version R214) with the “classify_wf” workflow and default parameters [53], which produced standardized taxonomic labels for user reference. The final results of each study were compiled in a single matrix-like table containing information for all HMAGs generated in this process (Fig. 1).
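The first three steps of this workflow (read filtering, optional host removal, and per-sample assembly) can be written down as command lines. The Python sketch below only builds the argument lists and does not run anything. The quoted quality and alignment options come from the text; everything else is an assumption made for illustration, including the file names, the host index name, the input/output flags, and the use of bowtie2's `--un-conc-gz` option to write out read pairs that do not align concordantly to the host. It is not the MAGdb pipeline itself.

```python
import shlex


def fastp_cmd(r1, r2, out1, out2):
    """Quality filtering and adapter removal with the options quoted in the text."""
    return ["fastp", "-i", r1, "-I", r2, "-o", out1, "-O", out2,
            "--dedup", "-q", "20"]


def bowtie2_host_cmd(index, r1, r2, nonhost_pattern, sam="/dev/null"):
    """Align read pairs to a host index using the alignment options from the text."""
    return ["bowtie2", "-x", index, "-1", r1, "-2", r2,
            "--end-to-end", "--no-mixed", "--no-discordant", "--no-unal",
            "--very-sensitive",
            # Assumption: non-host pairs are captured with --un-conc-gz;
            # bowtie2 replaces '%' in the path with 1 or 2 for each mate.
            "--un-conc-gz", nonhost_pattern,
            "-S", sam]


def megahit_cmd(r1, r2, outdir):
    """Per-sample assembly with default MEGAHIT parameters."""
    return ["megahit", "-1", r1, "-2", r2, "-o", outdir]


def sample_commands(sample, r1, r2, host_index=None):
    """Return the ordered commands for one sample; host removal is optional."""
    clean1, clean2 = f"{sample}.clean_1.fq.gz", f"{sample}.clean_2.fq.gz"
    cmds = [fastp_cmd(r1, r2, clean1, clean2)]
    if host_index is not None:
        pattern = f"{sample}.nohost_%.fq.gz"
        cmds.append(bowtie2_host_cmd(host_index, clean1, clean2, pattern))
        clean1, clean2 = pattern.replace("%", "1"), pattern.replace("%", "2")
    cmds.append(megahit_cmd(clean1, clean2, f"{sample}_megahit"))
    return cmds


if __name__ == "__main__":
    # Hypothetical sample and index names, for illustration only.
    for cmd in sample_commands("S1", "S1_1.fq.gz", "S1_2.fq.gz", host_index="host_idx"):
        print(shlex.join(cmd))
```

Keeping each command as a list of arguments avoids shell-quoting problems if the commands are later passed to `subprocess.run`.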

Exploratory analysis of HMAGs

To systematically explore all recovered HMAGs, we performed phylogenetic diversity and functional potential analyses. To reduce redundancy among the HMAGs and to determine the representative genomes, the HMAGs from all samples and assemblies were dereplicated at 95% ANI using dRep (v3.5.0) [54] with the following options: “--S_algorithm skani --clusterAlg centroid -pa 0.9 -sa 0.95 -nc 0.10 -cm larger --multiround_primary_clustering”. These non-redundant HMAGs were subsequently subjected to comprehensive phylogenetic analysis and functional annotation. The taxonomic annotation of the non-redundant HMAGs was performed using the “classify_wf” module of GTDB-Tk (v2.3.0) [26] against GTDB release R214 with default parameters. The phylogenetic tree of the bacterial non-redundant representative HMAGs was built using FastTree (v2.1.11) [25] from the protein sequence alignments generated by GTDB-Tk, with the parameter setting “-wag -boot 1000” and all other parameters at their default values. Tree visualization and annotation were performed using the R package ggtree (v3.14.0) [55]. The putative protein-coding sequences (CDSs) of representative genomes were predicted using Prodigal (v2.6.3) [56] with the “-p meta” parameter. The predicted CDSs were then dereplicated by cd-hit-est (v4.8.1) [57] with the options “-c 0.95 -aS 0.9”. Subsequently, the representative, non-redundant CDSs were annotated with eggNOG-mapper (v2.1.12) [58], employing DIAMOND search mode against the eggNOG v5.0 database with default parameters [27]. The KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway, Clusters of Orthologous Groups of proteins (COGs), level-4 Enzyme Commission (ECs), Gene Ontology (GOs), and carbohydrate-active enzyme (CAZy) annotations were derived from the eggNOG-mapper results. To construct co-occurrence networks of HMAGs across the three categories, we calculated the frequency of each HMAG phylum in every sample. This frequency matrix served as the input for network construction. Microbial co-occurrence networks at the phylum level were constructed using the ggClusterNet 2 R package (v2.00) [59] with the parameters “N = 0, r = 0.2, p = 0.05, method = spearman”.
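To make the network step concrete, the following Python sketch computes pairwise Spearman correlations between phylum frequency profiles and keeps pairs whose absolute correlation reaches a threshold. It is a simplified stand-in written for illustration, not the ggClusterNet implementation: it assumes the r = 0.2 cutoff applies to the absolute correlation, it omits the p-value filter (p = 0.05) entirely, and the phylum names and frequencies are made up.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def spearman(x, y):
    """Spearman correlation, computed as Pearson correlation of average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def cooccurrence_edges(freq, r_min=0.2):
    """Return (phylum_a, phylum_b, rho, sign) for pairs with |rho| >= r_min."""
    edges = []
    for a, b in combinations(sorted(freq), 2):
        rho = spearman(freq[a], freq[b])
        if abs(rho) >= r_min:
            sign = "positive" if rho > 0 else "negative"
            edges.append((a, b, round(rho, 3), sign))
    return edges


if __name__ == "__main__":
    # Made-up per-sample frequencies for three hypothetical phyla.
    freq = {
        "p__A": [1, 2, 3, 4, 5],
        "p__B": [2, 4, 6, 8, 10],
        "p__C": [5, 4, 3, 2, 1],
    }
    for edge in cooccurrence_edges(freq):
        print(edge)
```

A full analysis would also test each correlation for significance and correct for multiple comparisons before drawing an edge.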

Web implementation

All HMAGs and metadata were stored in a MongoDB database (https://docs.mongodb.com/). The MAGdb web interface (the frontend webpages) was implemented in JavaScript and HTML, with Vue.js (https://vuejs.org/) as the main frontend framework. The backend was implemented mainly in Node.js (https://nodejs.org/). MAGdb is available online without registration and is optimized for Chrome (recommended), Firefox, Microsoft Edge, and macOS Safari.