Abstract
Background
Single-cell RNA sequencing (scRNA-seq) is now essential for cellular-level gene expression studies and deciphering complex gene regulatory mechanisms. Deep learning methods, when combined with scRNA-seq technology, transform gene regulation research into graph link prediction tasks. However, these methods struggle to mitigate the impact of noisy data in gene regulatory networks (GRNs) and address the significant imbalance between positive and negative links.
Results
Consequently, we introduce the AnomalGRN model, focusing on heterogeneity and sparsification to elucidate complex regulatory mechanisms within GRNs. Initially, we consider gene pairs as nodes to construct new networks, thereby converting gene regulation prediction into a node prediction task. Considering the imbalance between positive and negative links in GRNs, we further adapt this issue into a graph anomaly detection (GAD) task, marking the first application of anomaly detection to GRN analysis. Introducing the cosine metric rule enables the AnomalGRN model to differentiate between homogeneity and heterogeneity among nodes in the reconstructed GRNs. The adoption of graph structure sparsification technology reduces noisy data impact and optimizes node representation.
Conclusions
The AnomalGRN model substantially improves GRN inference accuracy across benchmark datasets, marking the first successful application of graph anomaly detection to this task and demonstrating its feasibility for highly imbalanced single-cell GRN data.
Background
Gene regulation, crucial for controlling gene expression, dictates cellular and organismal responses to internal and external signals [1, 2]. Abnormal GRNs indicate dysregulation in gene–gene regulatory relationships, potentially triggering various pathogenic mechanisms [3]. Studying GRNs in depth can uncover disease-related key regulators and identify potential therapeutic targets [4]. This also offers researchers valuable insights for developing drug treatments. Focusing on specific regulatory pathways or gene targets enables the design of targeted drugs that minimize side effects and enhance therapeutic outcomes [5]. Consequently, a comprehensive understanding and accurate inference of GRNs are increasingly vital. Single-cell RNA sequencing, an advanced biotechnology, analyzes gene expression patterns at the single-cell level [6, 7]. Single-cell RNA sequencing precisely identifies and quantifies gene activities at the single-cell level, exposing functional and property differences among cells [8]. The rapid expansion of single-cell gene expression and regulatory data necessitates the development of efficient, accurate computational methods to explore the complex gene–gene interactions.
Machine learning and deep learning technologies have spurred the development of numerous computational models for GRN inference. Huynh-Thu et al. introduced GENIE3, a model leveraging ensemble learning to break down GRN inference into several regression tasks, thereby inferring GRN through the integration of multiple regression models [9]. Shu et al. developed the GRN-Transformer model, employing a block cell approach to improve the predictive accuracy of single-cell GRNs [10]. They established a combined framework of pre-training and supervised learning, integrating single-cell RNA sequencing data with known GRN data from numerous cells to deduce GRNs for specific cell types.
The accuracy of GRN inference may be constrained by relying on a single data source. To address this limitation, Yuan et al. proposed the co-expression convolution neural network model (CNNC) to infer unknown GRNs [11]. CNNC converts expression data, which lacks local information, into an image-like format and applies a convolutional neural network (CNN) for feature extraction. Furthermore, CNNC is highly scalable, allowing the integration of diverse genomic data types to enhance its performance. Building on this, Chen et al. introduced DeepDRIM to integrate additional data sources [12]. DeepDRIM represents joint gene expression distributions of gene pairs as images, incorporating data on target genes, transcription factors, and potential neighbor correlations. Chen et al. employed multiple convolutional layers to integrate these images and compute gene–gene regulation scores.
GRNs depict the relationships between regulatory factors and target genes using graphs or networks. Kishan et al. introduced GNE, a scalable deep learning model based on network topology, known for its reliable predictive capabilities [13]. The GNE model integrates known gene–gene interaction and expression data, extracting node representations via message propagation and updates within the GRNs to predict unknown gene–gene interactions. Building on this, Chen et al. proposed GENELink, framing GRN inference as a link prediction task on a graph. GENELink employs a graph attention network (GAT) to infer potential interactions between transcription factors and target genes within GRNs [14]. GENELink first maps observed single-cell gene expression pairs to a low-dimensional space, then optimizes this space to learn gene node representations for downstream predictions. Subsequently, Mao et al. developed the GNNLink model, incorporating an interactive graph encoder based on GNNs [15]. GNNLink then extracts node features and reconstructs GRNs through adjacency matrix completion.
Current computational models are capable of inferring GRNs, yet they encounter three significant challenges. Firstly, GRNs inherently contain structural noise, manifesting as “fake” links. Secondly, existing GNN-based methods lack an effective mechanism to differentiate between homogeneous and heterogeneous links, particularly during message propagation. Thirdly, the quantity of positive links is significantly lower than that of negative links. To address these challenges, we introduce the AnomalGRN model, designed from the perspectives of network heterogeneity and sparsity, to accurately infer GRNs. By creating new nodes from gene pairs, treating those with positive links as abnormal and those with negative links as normal, we transform the task of GRN inference into predicting abnormal nodes within the reconstructed GRN. Our contributions are summarized as follows:
1. This study transforms gene regulation prediction into a node prediction task.
2. This study pioneers the application of graph node anomaly detection to GRN inference, targeting the significant imbalance between positive and negative links.
3. We employ a cosine similarity measure that accounts for message propagation across both homogeneous and heterogeneous links.
4. We implement a sparsification strategy to eliminate noise and reduce the influence of “fake” links on model accuracy.
Results
Data preparation
This study utilized single-cell RNA sequencing (scRNA-seq) datasets for seven cell types from BEELINE [16]: (1) human embryonic stem cells (hESC), (2) human mature stem cells (hHEP), (3) mouse dendritic cells (mDC), (4) mouse embryonic stem cells (mESC), (5) erythroid mouse hematopoietic stem cells (mHSC-E), (6) granulocyte-monocyte lineage mouse hematopoietic stem cells (mHSC-GM), (7) mouse lymphoid hematopoietic stem cells (mHSC-L). These datasets are linked to three ground-truth networks (GTNs) derived from the STRING database, non-specific ChIP-seq, and cell-type-specific ChIP-seq. Additionally, loss-of-function/gain-of-function (Lofgof) GTNs are available for mESC.
The GTN from the STRING database primarily reflects extensive protein interactions. It is not specific to any cell type and focuses on general and global GRNs. Non-specific ChIP-seq data mainly comprise non-specific protein-DNA binding data resulting from technical reasons or experimental conditions. This type of GTN may contain more background noise and non-target signals, reflecting extensive protein-DNA relationships without cell type or state selectivity. Cell-type-specific ChIP-seq data are collected from specific cell types and provide accurate protein-DNA interaction information within a specific cell environment. This type of GTN is more accurate and relevant, revealing details of gene regulation in specific cell types, including transcription factor binding sites. Comparatively, cell-type-specific ChIP-seq data are of higher quality.
The volume of raw scRNA-seq data is substantial. To process it, we applied established preprocessing methods. Specifically, we removed genes expressed in fewer than 10% of cells, along with genes with low expression levels, from further analysis. We then calculated the variance and p-value of each gene's expression, retaining genes with a Bonferroni-corrected p-value below 0.01, and normalized gene expression levels using a logarithmic transformation. Following these steps, we constructed an N × M matrix, where N is the number of genes and M is the number of cells. Following Pratapa et al.’s method [17], we selected transcription factors (TFs) with high expression levels and the top 500 and 1000 genes, thereby creating 14 distinct datasets, as shown in Table 1.
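The filtering and normalization steps above can be sketched as follows. This is a minimal illustration, not the paper's released code: `preprocess` is a hypothetical helper, the sketch assumes a dense genes × cells count matrix, and the variance/Bonferroni p-value screening is omitted for brevity.

```python
import numpy as np

def preprocess(expr, min_cell_frac=0.10):
    """Filter rarely expressed genes and log-normalize.

    expr: raw genes x cells count matrix (N x M).
    Returns the filtered, log1p-transformed matrix and the
    indices of the genes that were kept.
    """
    n_cells = expr.shape[1]
    # keep genes expressed (count > 0) in at least `min_cell_frac` of cells
    expressed_frac = (expr > 0).sum(axis=1) / n_cells
    keep = expressed_frac >= min_cell_frac
    # logarithmic transformation of the retained expression levels
    return np.log1p(expr[keep]), np.where(keep)[0]
```

The variance and Bonferroni-corrected p-value screening described above would then be applied to the retained genes before selecting the TF + 500 / TF + 1000 subsets.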
Experimental setup
To assess our model's performance, we conducted comprehensive experiments on the aforementioned datasets. Furthermore, we benchmarked our model against several cutting-edge models, namely GNNLink [15], GENELink [14], GNE [13], CNNC [11], DeepDRIM [12], GRN-Transformer [10], and GENIE3 [9]. In our experimental setup, the division ratio for the training, validation, and test sets was uniformly set at 6:2:2, with the positive-to-negative sample ratio maintained at 1:1. The training was capped at 10,000 rounds, incorporating an early stopping mechanism whereby training ceases if no improvement is observed on the validation set for 500 consecutive rounds. The proposed model also involves the number of top-K neighbors (K), the hidden layer dimension, and the number of layers. The default values of these parameters are 10, 32, and 2, respectively. These parameters can be freely set. The primary evaluation metrics employed in the experiment were two well-known classification indicators: AUC and AUPR.
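The training cap and early-stopping rule described above (10,000 rounds maximum, stopping after 500 rounds without validation improvement) can be sketched generically. `train_with_early_stopping` and the `step` callback are hypothetical names for illustration, not part of the released model.

```python
def train_with_early_stopping(step, max_rounds=10_000, patience=500):
    """Run training rounds until `max_rounds`, or until the validation
    score has not improved for `patience` consecutive rounds.

    step: callable taking the round index and returning a validation score.
    Returns the best score and the round at which it was achieved.
    """
    best, best_round = float("-inf"), -1
    for r in range(max_rounds):
        score = step(r)
        if score > best:
            best, best_round = score, r
        elif r - best_round >= patience:
            # no improvement for `patience` rounds: stop early
            break
    return best, best_round
```

With this scheme, a model whose validation score plateaus stops roughly `patience` rounds after its best checkpoint rather than running the full budget.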
AUC represents the area under the ROC curve, plotted with the true positive rate (Recall) on the y-axis and the false positive rate on the x-axis. AUPR represents the area under the PR curve, plotted with Precision on the y-axis and Recall on the x-axis. AUC evaluates the ability to distinguish between positive and negative samples and is a comprehensive performance metric. AUPR focuses on distinguishing positive samples, particularly when the sample sizes are unbalanced, better reflecting the ability to recognize positive samples. The true positive rate (Recall), false positive rate (FPR), and Precision are calculated as follows: \(\text{Recall} = \frac{TP}{TP + FN}\), \(\text{FPR} = \frac{FP}{FP + TN}\), \(\text{Precision} = \frac{TP}{TP + FP}\).
Where TP (TN) denotes the number of positive (negative) samples predicted correctly, and FP (FN) denotes the number of negative (positive) samples incorrectly predicted as positive (negative).
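For concreteness, the three quantities can be computed directly from the confusion-matrix counts; `confusion_metrics` is an illustrative helper, not code from the paper.

```python
def confusion_metrics(y_true, y_pred):
    """Compute Recall (TPR), FPR, and Precision from binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    recall = tp / (tp + fn)      # true positive rate
    fpr = fp / (fp + tn)         # false positive rate
    precision = tp / (tp + fp)
    return recall, fpr, precision
```

Sweeping a decision threshold over the model's scores and tracing (FPR, Recall) yields the ROC curve, while tracing (Recall, Precision) yields the PR curve.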
Performance on benchmark datasets
We conducted experiments across all benchmark datasets to compare model performances, with results presented in Fig. 1. TF + 500 (1000) indicates that the top 500 (1000) genes based on gene expression levels are retained after screening the scRNA-seq data. The screened genes are integrated with the GTN in the database to construct the datasets. The GNN-based models GNNLink and GENELink significantly outperform the other predictive models, demonstrating their superior capability in elucidating topological structures compared to conventional deep learning technologies. Notably, the proposed model surpasses all existing models. Specifically, on all TF + 500 datasets, the proposed model’s AUC and AUPR scores are 14% and 57% higher, respectively, than those of the suboptimal GNNLink model. For all TF + 1000 datasets, the AUC and AUPR values of the proposed model exceed those of the suboptimal GNNLink model by 10% and 58%, respectively.
Performance comparison of all models on all benchmark datasets. The result data for these comparison methods are cited from GNNLink [15] without re-running the models
AUC− denotes the difference between a model’s average AUC on the TF + 1000 datasets and its average AUC on the TF + 500 datasets, while AUPR− denotes the corresponding difference in average AUPR.
We infer that the high performance of all models on the Specific datasets may be due to their higher quality. All models perform poorly on the Non-specific datasets, likely due to more unpredictable noise. Additionally, STRING provides general data with unclear external factors, possibly resulting in lower data quality and, consequently, lower model performance. Notably, the proposed model demonstrates reliable performance across all datasets. Overall, the proposed model employs graph anomaly detection technology, demonstrating the potential for identifying noisy data. Specifically, we attribute this to several factors. First, the proposed model calculates node similarity using cosine similarity, revealing correlation, irrelevance, and heterogeneity between nodes. Second, the model sets a fixed threshold to filter out potential noise links. Third, the model uses a top-K variant to identify the most reliable links. Fourth, during message propagation, the model uses cosine similarity instead of a fixed “1” in the numerator of Eq. 12 to better distinguish heterogeneity from homogeneity.
We calculated the mean and standard deviation of the results for each model on all TF + 1000 datasets, the TF + 500 datasets, and the difference between the results on the two datasets, detailed in Table 2. The AUPR performance of all models on the TF + 1000 and TF + 500 datasets shows slight differences. However, the AUC performance on the TF + 1000 dataset is slightly better than on the TF + 500 dataset, suggesting that increased training data correlates with enhanced model performance.
Recently, Yuan et al. proposed the scMGATGRN model [18], which utilizes multi-view technology to capture local features and high-order neighbor information, while employing an attention mechanism to adjust feature weights. Wang et al. introduced the Transformer-based scGREAT model [19] for GRN inference and demonstrated its effectiveness in identifying unknown gene–gene relationships. Table 3 presents the AUC and AUPR metrics for AnomalGRN alongside the comparison models, scMGATGRN and scGREAT. The proposed AnomalGRN model significantly outperforms scGREAT and scMGATGRN. These results further validate the effectiveness of the proposed AnomalGRN model in GRN inference.
Performance evaluation across positive–negative ratios
In GRNs, known links are significantly outnumbered by unknown links. The substantial imbalance between positive and negative samples presents a significant challenge for accurate GRN inference by the model. We conducted experiments to assess each model's predictive performance across various positive-to-negative ratios on the mESC-500 dataset in the LOF/GOF network. Initially, we established positive-to-negative ratios of 1:5, 1:6, and 1:7, and assessed the performance of the proposed model alongside GNNLink, GENELink, GNE, CNNC, DeepDRIM, GRN-Transformer, and GENIE3. For more pronounced imbalances, we adjusted the positive-to-negative ratios to 1:10, 1:20, 1:30, 1:40, and 1:50, and benchmarked against GNN models like GCN, GAT, and GraphSAGE.
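Constructing a dataset at a given positive-to-negative ratio amounts to downsampling the negative pairs; a minimal sketch with a hypothetical `sample_at_ratio` helper, not the paper's actual data pipeline:

```python
import random

def sample_at_ratio(pos_pairs, neg_pairs, ratio, seed=0):
    """Build a labeled dataset with `ratio` negatives per positive.

    pos_pairs / neg_pairs: lists of (gene_i, gene_j) tuples.
    Negatives are randomly subsampled; positives keep label 1, negatives 0.
    """
    rng = random.Random(seed)
    k = min(len(neg_pairs), ratio * len(pos_pairs))
    sampled_neg = rng.sample(neg_pairs, k)
    data = [(p, 1) for p in pos_pairs] + [(n, 0) for n in sampled_neg]
    rng.shuffle(data)
    return data
```

For instance, a 1:5 setting keeps five sampled negative gene pairs for every known positive pair.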
Figure 2 illustrates that as the positive-to-negative ratio widens, all models exhibit a trend of declining performance. This demonstrates that the imbalance between positive and negative samples is a critical factor limiting model performance. Nevertheless, the proposed model sustains high and stable performance at positive-to-negative ratios of 1:10, 1:20, 1:30, 1:40, and 1:50. This indicates that the model can efficiently identify potential positive links within highly imbalanced samples, and it validates the efficacy of using anomaly detection for GRN inference. This stability may stem from the model’s ability to filter “fake” links and retain potential links via network sparsification, and its capacity to differentiate between homogeneous and heterogeneous links using cosine similarity during message propagation. In summary, the model is less susceptible to interference from “fake” links, effectively retains potential links, and thus accurately identifies positive links in imbalanced datasets.
Results of the models under multiple positive–negative ratios on the Lofgof_mESC_500 dataset. The performance of the proposed model was evaluated against the current cutting-edge model at positive-to-negative ratios of 1:5, 1:6, and 1:7. At positive-to-negative ratios of 1:10, 1:20, 1:30, 1:40, and 1:50, the proposed model’s performance was benchmarked against basic GNN models including GCN, GAT, and GraphSAGE. The result data for these comparison methods are cited from GNNLink [15] without re-running the models
Ablation experiment
The proposed AnomalGRN model primarily includes two core components: “Weight Reconstruction” and “Network Sparsification.” We conducted several experiments on the Lofgof_mESC and Specific_mESC datasets, using AUC and AUPR as evaluation metrics to verify the importance of these core components. In Table 4, “w/o WR” refers to the model without the “Weight Reconstruction” component, where the weights between two nodes are not computed using cosine similarity and take values of either 1 or 0 directly. In the reconstructed GRN, edges are defined according to the steps outlined in the “Task Transformation” section. “w/o GNS” refers to the model excluding the “Network Sparsification” component, where the adjacency matrix remains unchanged after the default number of rounds. “w/o Cosmic” denotes that in GCN message propagation, Sij in Eq. 12 is set to 1. The “w/o All” experiment involves removing the “Weight Reconstruction,” “Network Sparsification,” and “Cosmic” message propagation components from the proposed model, relying solely on traditional GCN to perform the GRN inference task. However, this experiment was conducted on the reconstructed GRN, indicating that it underwent a task transformation process, as detailed in the “Task transformation” section. The results indicate that removing the “Weight Reconstruction,” “Network Sparsification,” and “Cosmic” message propagation components leads to the poorest model performance. Furthermore, the removal of any single component or mechanism results in a significant decline in model performance. These findings demonstrate that the components and mechanisms positively contribute to the model’s performance, validating their rationality and effectiveness. The aforementioned process occurs during the execution of the GCN.
Additionally, when all components or mechanisms are removed, the primary difference between the proposed model and other models lies in the “task transformation” process, as observed in the “w/o All” condition. During this process, the original GRN is reconstructed as the initial input graph: gene–gene pairs are formed into new nodes, and the initial input graph is generated based on the “Weight Reconstruction” and “Network Sparsification” components. However, during the construction of the initial graph, the newly created nodes have no neighbors; as described in Eq. 7, neighbor screening is performed solely based on a threshold. This process generates only the initial graph and does not directly influence the GCN execution process. Therefore, we hypothesize that the “task transformation” process is a critical factor in enhancing model performance. Reconstructing the GRN (as the initial input graph) using the “Weight Reconstruction” and “Network Sparsification” components likely eliminates the impact of “fake” links and identifies potential neighbors, contributing to improved model performance.
Parameter experiment
The proposed model includes several adjustable hyperparameters: the training data ratio, the number of neighbors (K) in the KNN variant, the number of GNN layers, and the output dimension size. After edge weights are constructed, the KNN variant primarily identifies potential neighbors, which are critical for GNN message propagation within the reconstructed GRN. The number of layers and output features in the GNN model significantly influence feature extraction efficiency, consequently impacting future prediction tasks. The cell-type-specific ChIP-seq data are of comparatively higher quality. To minimize external interference, we conducted a series of experiments using the cell-type-specific ChIP-seq ground truth dataset to analyze how parameter variations influence performance.
Specifically, the K value was set to 5, 10, 15, and 20; the GNN output dimension was configured as 8, 16, 32, and 64; and the number of GNN layers was set to 1, 2, 3, and 4. The results are presented in Fig. 3B, C, and D. Meanwhile, other parameters were kept constant to adhere to the univariate principle. Experimental results indicate that the proposed model’s performance remains stable with minor fluctuations as the K value varies. Similarly, the model’s performance is largely unaffected by variations in the number of GNN layers or output feature size. These findings suggest that the model exhibits stable performance and strong adaptability to parameter variations. The proposed model can be rapidly deployed and applied to new GRN data with minimal parameter adjustments, thereby enhancing research efficiency.
Results of parameter experiments. A The performance of the proposed model compared with that of the cutting-edge GNNLink model across various training data proportions. B–D The performance of the proposed model with varying K values (B), GNN output dimensions (C), and numbers of layers (D), respectively
Given the sparsity of known GRNs, the available training data is inherently limited. A series of experiments was conducted to evaluate the model’s predictive performance using 10% to 50% of the available training data, with the state-of-the-art GNNLink model evaluated on the TF + 500 datasets for comparison, as illustrated in Fig. 3A. The experimental results demonstrate that the performance of both models improves progressively as the training data increases. Additionally, the proposed model exhibits stable performance with minimal fluctuations.
Noise-induced attack
To investigate the model’s resilience against noise attacks, we conducted experiments on the reconstructed GRN utilizing Random attack [20] and DICE attack [21] strategies. In a Random attack, artificial noise, including “fake” links, is introduced into the reconstructed GRN. Random edges ranging from 0 to 100% are incorporated into the test set. During a DICE attack, existing links in the reconstructed GRN are eliminated, followed by the addition of an equivalent number of “fake” links. The attack intensity is similarly adjusted from 0 to 100%. These experiments primarily utilized mESC datasets. The comparative analysis involved models such as GCN [22], GAT [23], and GraphSAGE [24], with AUC and AUPR serving as the evaluation metrics.
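The two attack strategies can be sketched as edge-level perturbations of an undirected edge list. `random_attack` and `dice_attack` are illustrative names; this sketch omits the graph-specific details of the cited implementations [20, 21].

```python
import random

def random_attack(edges, nodes, frac, seed=0):
    """Random attack: inject frac * |E| fake links absent from the graph."""
    rng = random.Random(seed)
    edge_set, fakes = set(edges), set()
    while len(fakes) < int(frac * len(edges)):
        u, v = rng.sample(nodes, 2)
        if (u, v) in edge_set or (v, u) in edge_set:
            continue  # already a real link
        if (u, v) in fakes or (v, u) in fakes:
            continue  # already injected
        fakes.add((u, v))
    return list(edge_set | fakes)

def dice_attack(edges, nodes, frac, seed=0):
    """DICE attack: delete frac * |E| real links, then add as many fakes."""
    rng = random.Random(seed)
    n = int(frac * len(edges))
    kept = set(rng.sample(edges, len(edges) - n))
    original = set(edges)
    while len(kept) < len(edges):  # restore the edge count with fakes
        u, v = rng.sample(nodes, 2)
        if (u, v) in original or (v, u) in original:
            continue
        if (u, v) in kept or (v, u) in kept:
            continue
        kept.add((u, v))
    return list(kept)
```

At 100% intensity, a Random attack doubles the edge count with noise, while a DICE attack replaces every real link with a fake one.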
Figure 4A and B show the performance impact of Random and DICE attacks on each model, respectively. Figure 4A reveals a notable decrease in performance for standard GNN models like GCN, GAT, and GraphSAGE as the proportion of “fake” links increases. This decline may stem from the injected noise disrupting the aggregation process within GNNs. In contrast, the proposed model remains largely unaffected. This resilience could be attributed to the proposed model's effectiveness in filtering “fake” links.
Figure 4B illustrates a marked decline in the performance of the GCN, GAT, and GraphSAGE models as the attack intensity increases. This phenomenon is intuitively understandable: removing real links while introducing “fake” links effectively doubles the impact of a Random attack. Nevertheless, the proposed model maintains stable performance with minimal fluctuation. This indicates that the proposed model is capable of effectively filtering out “fake” links while also revealing potential real links. Consequently, the proposed model can adeptly handle Random and DICE attacks, showing a higher tolerance for the inevitable noise encountered during the data collection process.
Case study
This study involved the selection of six genes across seven cell types for analysis. The TCF4 gene, essential for stem cell pluripotency and self-renewal, was selected from human embryonic stem cells. Dysfunctional or abnormally expressed TCF4 can lead to diseases such as schizophrenia, skin conditions, and certain cancers [25]. Additionally, the ARID2 gene was chosen from human mature stem cells. The ARID2 gene, along with other SWI/SNF complex members, significantly influences mature stem cell renewal and differentiation [26]. Selected from mouse embryonic stem cells, the EHMT2 gene is part of the histone methyltransferase family. These enzymes modify chromatin structure and gene programming by adding methylation marks to specific lysine residues on histone proteins [27]. From mouse dendritic cells, the MAFF gene was selected; its protein modulates specific gene expression [28]. The RPA2 gene, selected from granulocyte-monocyte lineage mouse hematopoietic stem cells, potentially ensures cell maturation and function by maintaining DNA replication and repair accuracy [29]. The MEF2C gene was selected from mouse hematopoietic stem cells of both erythroid and lymphoid lineages. It regulates the generation, maturation, and proliferation of erythroid lineage mouse hematopoietic stem cells, stabilizing and optimizing hematopoietic system function [30]. In lymphoid lineage mouse hematopoietic stem cells, the MEF2C gene influences immune system development and function [31].
In the training phase, we removed all associations involving the selected genes from the datasets. During the inference phase, we focus on the selected genes, using the trained model to rebuild the GRN around them. Using the TCF4 gene in human embryonic stem cells as an example, the model accurately predicted both true positive and true negative rates at 100%. Notably, as demonstrated in Fig. 5A, all genes the model associates with TCF4 have been confirmed in the hESC dataset. Furthermore, Fig. 5B–G demonstrates that the model performs similarly in predicting the ARID2, EHMT2, MEF2C, RPA2, and MAFF genes. Thus, we conclude that the AnomalGRN model is indeed capable of inferring unknown GRNs. It is anticipated to aid in uncovering complex gene regulatory mechanisms and disease causation in the future.
The trained AnomalGRN model predicts GRNs centered on genes A TCF4 in human embryonic stem cells, B ARID2 in human mature stem cells, C EHMT2 in mouse embryonic stem cells, D MEF2C in erythroid lineage mouse hematopoietic stem cells, E MAFF in mouse dendritic cells, F RPA2 in granulocyte-monocyte lineage mouse hematopoietic stem cells, and G MEF2C in lymphoid lineage mouse hematopoietic stem cells, respectively
Additionally, we used the LIANA toolkit [32] to load tissue sample data from patients with myocardial infarction [33]. This sample is concentrated in the ischemic area of cardiac tissue and provides only gene expression data, without a corresponding GRN. Then, we obtained known GTNs from previous work [34], which collected many such GTNs. We integrated gene expression data and known GTNs, using gene expression data as node features and gene–gene pairs in known GTNs as links to reconstruct the GRN of the ischemic area in myocardial infarction patients. The integrated dataset was then divided into training, validation, and test sets in a 6:2:2 ratio. After training, the proposed model achieved an AUC of 72.69% and an AUPR of 75.05%. This indicates that the proposed model can potentially predict unknown gene–gene regulatory relationships based on feature data such as gene expression and known GTNs.
Additionally, we discussed applying this method to predict GRN in mouse kidney slices. Initially, we downloaded gene expression data of mouse kidney slices, generated using Visium technology, comprising 1438 cells and 32,285 genes. Based on previous work [35, 36], we acquired the known transcription factor (TF) and gene-target association information, referred to as the GRN. We integrated the gene expression data of mouse kidney slices with the known GRN, using genes as nodes and their expression data as node features. Links were constructed based on the association information of known transcription factors (TFs) and target genes, while irrelevant genes in the GRN were discarded. This process allowed us to reconstruct the GRN of mouse kidney slices. The integrated dataset was then split into training, validation, and test sets in a 6:2:2 ratio. Following training, the proposed model achieved an AUC of 75.13% and an AUPR of 73.37%. These results also demonstrate the proposed model’s scalability across diverse biological contexts and larger datasets.
Discussion
The integration of single-cell RNA sequencing (scRNA-seq) with deep learning has revolutionized GRN inference, enabling the exploration of gene regulation at an unprecedented resolution. This study proposes an innovative approach that frames GRN inference as a graph anomaly detection problem, effectively addressing several inherent challenges in the field. One of the significant challenges is the inevitable noise in scRNA-seq data, which can result in incorrect predictions of gene interactions. By sparsifying the GRN and filtering out false links, the impact of noise is reduced, thereby enhancing the reliability of predictions. Another notable contribution of this study is its approach to addressing the imbalance between positive and negative links in GRNs. Typically, the number of positive links in a network is significantly lower than that of negative links, posing a challenge to model training. By classifying positive links as anomaly nodes and negative links as normal nodes, this study redefines GRN inference as a graph anomaly detection problem. This approach offers a novel solution to the imbalance problem and enhances the accuracy of GRN predictions.
A series of experimental results validate the rationality of the model and its key components. However, several important considerations remain. The reconstruction of new nodes and edges in GRNs necessitates further exploration. Moreover, the quality of data utilized for GRN inference is critical to the accuracy of the results. Integrating multimodal data, including gene epigenetic data, transcription factor binding sites, and protein–protein interactions, can substantially improve the model’s robustness. Future work should focus on integrating advanced multimodal fusion techniques with large language models to enhance the accuracy and comprehensiveness of GRN inference. By incorporating diverse biological data sources, the proposed model can uncover intricate and subtle gene regulatory mechanisms, with potential applications in elucidating cellular processes, disease mechanisms, and therapeutic development.
Conclusions
Single-cell RNA sequencing (scRNA-seq) technology has significantly advanced our in-depth understanding and exploration of gene regulatory mechanisms at the single-cell level. Recent studies integrating scRNA-seq with deep learning technology have succeeded in predicting GRNs. However, these studies face two principal limitations. Firstly, noise is inevitable during the data collection process. Secondly, it is recognized that within GRNs, positive links are significantly outnumbered by negative links. To overcome the first limitation, we sparsify the GRN and filter out a portion of the “fake” links to mitigate the impact of noisy data. To tackle the second limitation, we categorize positive links as abnormal nodes and negative links as normal nodes. Subsequently, we reconstruct the GRN and convert the task of GRN inference into one of anomaly detection. Simultaneously, we employ cosine similarity to assess the similarity between nodes, differentiate between homogeneous and heterogeneous links, and mitigate the effects of the imbalance between positive and negative samples. Adopting this novel perspective, we conducted an in-depth study on the inference problem of GRN, substantially enhancing prediction accuracy. This marks the first application of anomaly detection to GRN inference, achieving considerable success. This further validates the feasibility of employing anomaly detection in unbalanced datasets like single-cell GRN, thereby encouraging broader applications.
The goal of GRN inference is to identify potential gene–gene relationships from numerous gene–gene pairs, thereby uncovering gene regulatory mechanisms. In graph anomaly detection, abnormal nodes are typically far fewer than normal nodes. This study reframes GRN inference as a graph anomaly detection task, which appears to be both reasonable and innovative. Nevertheless, several critical aspects warrant careful consideration. First, effective methods for reconstructing new nodes and edges in the reconstructed GRN remain to be developed. Second, higher-quality data should be integrated. In future work, we aim to employ advanced multimodal fusion and large language model technologies to integrate multimodal data, such as gene epigenetic data, transcription factor binding site information, protein–protein interactions, and known GRNs. This is expected to address the identified challenges, enhance the proposed model, and more effectively uncover underlying gene regulatory mechanisms.
Methods
Model architecture
Previous research typically treats GRN inference as a graph link prediction task. In a novel approach, we reconceptualize GRN inference as classifying nodes within the reconstructed GRNs. Specifically, interacting gene pairs are defined as abnormal nodes, while non-interacting pairs are considered normal nodes. Given the significantly lower number of positive links compared to negative links in GRNs, GRN inference can effectively be transformed into an anomaly detection task. We then introduce the AnomalGRN model, developed from the perspectives of heterogeneity and sparsity to uncover complex regulatory mechanisms within GRNs.
Figure 6 illustrates the AnomalGRN model, which comprises multiple modules. Module A constructs the initial input graph (the reconstructed GRN) by converting gene–gene pairs into new nodes and splicing the scRNA-seq data of each gene–gene pair to create feature vectors for these nodes. The initial input graph is then constructed through the following two steps. First, Module B uses a deep neural network (DNN) to linearly map the node feature vectors and calculates the cosine similarity between the new nodes, as detailed in the “Weight reconstruction” section. Second, based on the cosine similarity between nodes, some neighbors are identified according to a predefined threshold, and an improved top-K algorithm is employed to determine additional neighbors, as explained in the “Network sparsification” section. Notably, a sparse adjacency matrix is also constructed between GNN layers following these steps. Module C details the model’s entire execution process. The reconstructed GRN generated by Module A serves as input to the GNN model, and Eqs. 7 and 9 are applied to produce a sparse matrix. The outputs of multiple GNN layers, after several iterations, are integrated to derive node features, which are then used for label prediction through a neural network. Module D provides an abstract representation of Module C, where t(i) denotes the output of the GCN encoder at the ith layer, t(i−1) is the input to the ith layer, and “ ⊕ ” signifies the concatenation operation. Below, we explain these key modules and technologies in detail.
Task transformation
GRNs typically feature far fewer positive links than negative ones. This scenario aligns with anomaly detection tasks, in which normal nodes generally far outnumber abnormal nodes. Initially, we innovatively aggregate gene pairs into single nodes: gene–gene pairs with links are designated as abnormal nodes, while those without links are considered normal nodes. Subsequently, we reconstruct the GRN using Module B, transforming the GRN inference problem into a graph anomaly detection task. Naturally, the reconstructed GRN often exhibits “fake” links, where normal nodes connect to abnormal ones. Links between two new nodes of the same type are defined as homogeneous links, whereas links between nodes of different types are defined as heterogeneous links.
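To make the transformation concrete, the pair-to-node reconstruction can be sketched in a few lines of numpy. The function name, toy expression matrix, and pair list below are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def reconstruct_grn(expr, pairs, positive_pairs):
    """Convert gene-gene pairs into nodes of a new graph.

    expr           : (n_genes, n_cells) scRNA-seq expression matrix
    pairs          : list of (i, j) gene index pairs to evaluate
    positive_pairs : set of pairs with a known regulatory link
    Returns node features (spliced expression vectors) and labels
    (1 = abnormal node / linked pair, 0 = normal node / unlinked pair).
    """
    feats = np.stack([np.concatenate([expr[i], expr[j]]) for i, j in pairs])
    labels = np.array([1 if (i, j) in positive_pairs else 0 for i, j in pairs])
    return feats, labels

# Toy example: 4 genes x 3 cells, one known regulatory link (0 -> 1).
expr = np.arange(12, dtype=float).reshape(4, 3)
pairs = [(0, 1), (0, 2), (2, 3)]
feats, labels = reconstruct_grn(expr, pairs, {(0, 1)})
```

Each new node's feature vector splices the two genes' expression profiles, so three pairs of 3-cell profiles yield a (3, 6) feature matrix with labels [1, 0, 0].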
Clearly, to accurately and effectively complete the anomaly detection task, two prerequisites are essential. First, the GNN model must simultaneously consider message propagation modes for both homogeneous and heterogeneous links. Second, efforts should be made to minimize the model's susceptibility to “fake” links. To address the first condition, we plan to employ cosine similarity for calculating edge weights, distinguishing between heterogeneous and homogeneous link message propagation. Regarding the second condition, we propose utilizing graph sparsification technology to filter out “unimportant” links, thereby mitigating the effect of “fake” links on model performance.
The original GRN is represented as G = (V, E, X), where V denotes the set of genes, E the set of links between genes, and X the scRNA sequence and other gene attribute data. The reconstructed GRN is denoted as Gr = (V1, E1, X1, S), where V1 is the set of nodes reconstructed from all gene–gene pairs, X1 the spliced feature vectors of these pairs, and E1 the set of reconstructed links. Edge weights S ∈ [−1,1], calculated using cosine similarity, represent the weights of the reconstructed links. From a statistical viewpoint, the GRN inference (graph anomaly detection) problem can be formalized as:
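In symbols, and consistent with the description that follows, the formalization can be sketched as marginalizing the label distribution over sparse subgraphs, as in [37, 38] (a reconstruction for orientation, not the paper's verbatim display):

\[
P(Y \mid G_r) \;=\; \sum_{g \in GS} P(Y \mid g)\, P(g \mid G_r)
\]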
where Y denotes the labels: normal nodes (unconnected gene–gene pairs) are assigned 0, and abnormal nodes (connected gene–gene pairs) are assigned 1. GS denotes a series of sparse graph sets derived from graph Gr. Traversing GS is challenging and time-intensive; thus, following previous work [37, 38], Eq. 2 can be approximated as follows:
where \({{\varvec{K}}}_{\boldsymbol{\varphi }}\) and \({{\varvec{K}}}_{{\varvec{\chi}}}\) are the estimators for \({\varvec{P}}\left({\varvec{Y}}|{\varvec{g}}\right)\) and \({\varvec{P}}\left({\varvec{g}}|{{\varvec{G}}}_{{\varvec{r}}}\right)\), parameterized by \(\boldsymbol{\varphi }\) and \({\varvec{\chi}}\), respectively. \({{\varvec{K}}}_{{\varvec{\chi}}}\) takes \({{\varvec{G}}}_{{\varvec{r}}}\) as input, generates a series of subgraphs (GS) centered around each node as described in Eqs. 7 and 9, and ultimately forms a sparse adjacency matrix A. The \({{\varvec{K}}}_{\boldsymbol{\varphi }}\) function employs the GCN model to output node labels based on the input adjacency matrix A, which comprises all subgraphs (GS). To ensure the differentiability of Eqs. 2 and 3, we generate a differentiable sample \(\boldsymbol{\Delta }{\varvec{g}}\) using the parameterization technique [39]:
where \({\varvec{\Delta}}{\varvec{g}}\sim {{\varvec{K}}}_{{\varvec{\chi}}}({\varvec{g}}|{{\varvec{G}}}_{{\varvec{r}}})\) denotes a sparse graph \({\varvec{\Delta}}{\varvec{g}}=({\varvec{V}},\boldsymbol{ }{\varvec{\Delta}}{\varvec{A}},\boldsymbol{ }{\varvec{X}})\) sampled from \({{\varvec{K}}}_{{\varvec{\chi}}}({\varvec{g}}|{{\varvec{G}}}_{{\varvec{r}}})\).
Weight reconstruction
In the reconstructed GRNs, many abnormal links are heterogeneous. Thus, adequately addressing heterogeneous links can enhance anomaly detection task performance. Similar to previous work [38], we employ cosine similarity to reconstruct the adjacency weight matrix S, enabling consideration of both homogeneous and heterogeneous links in GNN message propagation. In the reconstructed GRN, for nodes i and j with initial features zi and zj, respectively, we input these features into a two-layer multilayer perceptron (MLP) and calculate the cosine similarity between the node outputs as follows:
where \({{\varvec{z}}}_{{\varvec{i}}}^{0}={\varvec{M}}{\varvec{L}}{\varvec{P}}\left({{\varvec{z}}}_{{\varvec{i}}}\right)\) denotes node i's representation as output by the MLP, with zi originating from X1. Sij quantifies the cosine similarity between nodes i and j. The relationship between nodes i and j is categorized according to three criteria: homogeneity, heterogeneity, and neutrality:
The determination of weight S is independent of the edges in the reconstructed GRN. Unlike the graph attention mechanism, this approach uses cosine similarity to compute node weights ranging from −1 to 1, diverging from the non-negative weights used in graph attention. Consequently, GNN models can account for both homogeneous and heterogeneous links during message propagation. Furthermore, as per Eq. 7, the calculation is symmetric.
In the proposed model, this component serves three purposes: initial graph construction, the “Network sparsification” component, and the message propagation mechanism. First, it calculates the similarity between nodes and uses the top-K algorithm to find the K nearest neighbors of each node, constructing the initial graph; during training, the adjacency matrix is updated every fixed number of rounds (250 by default). Second, in the “Network sparsification” component, between the layers of the GNN encoder, it calculates node similarity, selects neighbors by thresholding, uses the top-K algorithm to find the K nearest neighbors, and finally merges the two neighbor sets. Third, in the message propagation mechanism, as shown in Eq. 12, the GCN message propagation rule replaces the numerator “1” with the cosine similarity between nodes, which weakens the influence of irrelevant links.
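A minimal numpy sketch of this weight reconstruction, assuming an untrained two-layer perceptron with random weights in place of the model's trained MLP (shapes and activation are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(Z, W1, W2):
    # Two-layer perceptron with tanh, standing in for the trainable MLP.
    return np.tanh(Z @ W1) @ W2

def cosine_weights(Z, W1, W2):
    """Pairwise cosine similarity S in [-1, 1] between MLP outputs."""
    H = mlp(Z, W1, W2)
    H = H / np.linalg.norm(H, axis=1, keepdims=True)  # row-normalize
    return H @ H.T                                    # cosine similarity matrix

Z = rng.normal(size=(5, 8))      # 5 reconstructed nodes, 8-dim spliced features
W1 = rng.normal(size=(8, 16))
W2 = rng.normal(size=(16, 4))
S = cosine_weights(Z, W1, W2)
```

The resulting S has the properties the text relies on: it is symmetric, its diagonal equals 1, and every entry lies in [−1, 1], so negative weights can mark heterogeneous relationships.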
Network sparsification
Pervasive “fake” links or neighbors in the reconstructed GRNs significantly impair model performance. Furthermore, data are often incomplete owing to limitations in collection technology and conditions. Consequently, we apply sparsification to the reconstructed GRN, with two goals: first, to diminish the impact of “fake” links, and second, to discern potentially significant node neighbors.
As outlined in Sect. 2.2, the absolute value of Sij indicates the influence of node i on node j. Therefore, setting a threshold α enables the filtering of non-essential neighbor nodes:
where \({\varvec{N}}({\varvec{i}})\) denotes the set of neighbors associated with node i. However, nodes with high similarity that are not classified as neighbors would be ignored at this stage. Therefore, we identify potential neighbors utilizing the weight matrix S and an enhanced KNN algorithm. We employ the Gumbel-top-K method, an advanced variant of the KNN algorithm, to calculate the probability value of node pairs and select the K most promising neighbors [39, 40]. Adhering to the underlying data distribution [39], the probability value for any node j not in N(i) is calculated as follows:
Subsequently, logarithmic transformation of the probability value serves as the criterion for selecting the top-K neighbors:
The parameter \({\varvec{\varepsilon}}\) is sampled independently from a uniform distribution on (0, 1) for each node. The function top-K() returns the set of the K largest \({\varvec{\delta}}\) values. We apply this operation iteratively across the entire reconstructed GRN.
Neighbor sets N1 and N2, derived from Eqs. 7 and 9, are merged to create the final neighbor set N’ = N1 ∪ N2. For node i, if j is in N’(i), the corresponding element Aij’ in the sparse adjacency matrix A’ is assigned a value of 1. Based on this, the edge set E2 is derived. At this point, G’ = (V1,E2,X1,S) signifies the completion of the network's sparsification. This is an iterative and reusable process.
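The two-stage neighbor selection, threshold filtering followed by Gumbel-top-K sampling and a final merge, can be sketched as follows. Here the pair probability of Eq. 8 is approximated by normalized |S_ij|, which is an assumption for illustration, as are the function name and toy matrix:

```python
import numpy as np

rng = np.random.default_rng(1)

def sparsify_neighbors(S, alpha=0.5, k=1):
    """Two-stage neighbor selection over the weight matrix S.

    Stage 1: keep neighbors with |S_ij| above threshold alpha (set N1).
    Stage 2: among the rest, draw k extra neighbors via Gumbel-top-K,
             using normalized |S_ij| as a stand-in pair probability.
    Returns the merged neighbor sets N' = N1 ∪ N2 for every node.
    """
    n = S.shape[0]
    neighbors = []
    for i in range(n):
        n1 = {j for j in range(n) if j != i and abs(S[i, j]) > alpha}
        rest = [j for j in range(n) if j != i and j not in n1]
        p = np.abs(S[i, rest]) + 1e-12
        p = p / p.sum()
        eps = rng.uniform(1e-12, 1.0, size=len(rest))   # epsilon ~ U(0, 1)
        delta = np.log(p) - np.log(-np.log(eps))        # Gumbel-perturbed log-prob
        n2 = {rest[j] for j in np.argsort(delta)[-k:]}  # top-K largest delta
        neighbors.append(n1 | n2)                       # merge N1 and N2
    return neighbors

S = np.array([[ 1.0,  0.9,  0.1, -0.2],
              [ 0.9,  1.0,  0.3,  0.0],
              [ 0.1,  0.3,  1.0, -0.8],
              [-0.2,  0.0, -0.8,  1.0]])
N = sparsify_neighbors(S, alpha=0.5, k=1)
```

With this toy S, node 0 keeps node 1 as a threshold neighbor (|0.9| > 0.5) and gains one sampled neighbor from the remainder; the merged sets then define the 0/1 entries of the sparse adjacency matrix A'.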
Numerically, Sij spans a broad range of values, implying a connection between any two nodes; the reconstructed GRN must therefore be processed to preserve sparsity. Methods such as minimizing the ℓ1 norm [41] or the nuclear norm [42] can be used to optimize the weight matrix S, though they have limitations [20, 43]. Drawing inspiration from Gumbel-Softmax, we achieve sparse regularization by constraining the absolute value of S to approach 1 or 0:
where \({{\varvec{\omega}}}_{{\varvec{i}}{\varvec{j}}}={\varvec{m}}{\varvec{a}}{\varvec{x}}(\boldsymbol{\alpha },{\varvec{m}}{\varvec{i}}{\varvec{n}}(\left|{{\varvec{S}}}_{{\varvec{i}}{\varvec{j}}}\right|,1-\boldsymbol{\alpha }))\). Here α should be small, since a large α renders the clamping meaningless: if \(\left|{{\varvec{S}}}_{{\varvec{i}}{\varvec{j}}}\right|\) falls below α, it is replaced by α, mirroring the treatment of fake links below α in Eq. 7. The parameter \({\varvec{\tau}}>0\) regulates the anticipated sparsity level of S; a smaller \({\varvec{\tau}}\) drives more values of S toward 0, yielding a sparser graph \(\boldsymbol{\Delta }{\varvec{g}}\).
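As an illustration of how the clamping and τ interact, the sketch below uses a binary-entropy penalty as a stand-in for the paper's regularizer (the exact displayed form follows [38, 39] and is not reproduced here); the clamping of |S_ij| into [α, 1−α] matches the ω defined above:

```python
import numpy as np

def sparsity_penalty(S, alpha=0.05, tau=1.0):
    """Binary-entropy style regularizer pushing |S_ij| toward 0 or 1.

    omega clamps |S_ij| into [alpha, 1 - alpha] so the logarithms stay
    finite; dividing by tau means a smaller tau penalizes intermediate
    (non-sparse) values more strongly.
    """
    omega = np.maximum(alpha, np.minimum(np.abs(S), 1.0 - alpha))
    entropy = -(omega * np.log(omega) + (1.0 - omega) * np.log(1.0 - omega))
    return entropy.sum() / tau

sharp = np.array([[0.02, 0.98], [0.99, 0.01]])   # weights already near 0 or 1
fuzzy = np.array([[0.50, 0.40], [0.60, 0.50]])   # intermediate weights
```

Weights near 0 or 1 incur a small penalty, intermediate weights a large one, and halving τ doubles the pressure toward sparsity.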
Message propagation
Following the aforementioned procedures, we obtain the sparse graph G’ = (V1, E2, X1, S). Subsequently, we derive node representations using the GNN model and perform node classification to fulfill the anomaly detection task, thereby inferring the GRN. Any GNN model may be selected for this process; in our study, we use the GCN [22] model as an exemplar to demonstrate information propagation in the reconstructed GRN. The representation of node i at the hth layer is denoted as follows:
where j is a member of N’(i), denoting the adjacency of node i as derived from matrix A’. The trainable weight matrix is represented by Wh, and the activation function by \({\varvec{\sigma}}\). The coefficient \({{\varvec{\gamma}}}_{{\varvec{i}}{\varvec{j}}}^{{\varvec{h}}}\) dictates the contribution value of neighboring nodes to i.
The value of \({{\varvec{\gamma}}}_{{\varvec{i}}{\varvec{j}}}^{{\varvec{h}}}\) warrants particular attention. As detailed in Sects. 2.2 and 2.3, reconstructed GRNs inherently contain heterogeneous links. Homogeneous and heterogeneous nodes frequently exhibit distinct attributes. Concurrently, Sij reveals the relationship between nodes i and j, categorizing it as homogeneous, heterogeneous, or unrelated. In GCN's message propagation, both homogeneity and heterogeneity are incorporated:
where di and dj denote the degrees of nodes i and j within the sparse network G’, respectively. Diverging from standard GCN models, Sij replaces “1,” and both positive and negative values are introduced to account for the influence of homogeneous and heterogeneous neighbors concurrently.
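A toy numpy version of this signed propagation rule, with the usual constant numerator replaced by S_ij; the matrices A, S, and W and the ReLU activation are illustrative, not the trained model:

```python
import numpy as np

def signed_gcn_layer(Z, A, S, W):
    """One GCN layer where the usual '1' in the propagation rule is
    replaced by the signed cosine weight S_ij.

    Z : (n, f) node features     A : (n, n) 0/1 sparse adjacency (A')
    S : (n, n) cosine weights    W : (f, f_out) trainable matrix
    """
    deg = A.sum(axis=1)                       # d_i from the sparse graph
    norm = np.sqrt(np.outer(deg, deg)) + 1e-12
    P = (A * S) / norm                        # S_ij / sqrt(d_i * d_j) on kept edges
    return np.maximum(P @ Z @ W, 0.0)         # ReLU activation

A = np.array([[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=float)
S = np.array([[1.0, 0.8, -0.9], [0.8, 1.0, 0.2], [-0.9, 0.2, 1.0]])
Z = np.eye(3)
W = np.eye(3)
H = signed_gcn_layer(Z, A, S, W)
```

Because S can be negative, a heterogeneous neighbor (here node 2, with S_02 = −0.9) contributes with a negative sign to node 0's aggregation, unlike the strictly non-negative weights of a standard GCN.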
Model optimization
The model's optimization objectives primarily encompass the refinement of the weight matrix S and minimizing classification loss. For any given node i, upon obtaining its final representation \({\widehat{{\varvec{z}}}}_{{\varvec{i}}}\), the data undergoes transformation and prediction via a linear layer:
where W denotes a trainable weight matrix and b the bias. \({\widehat{{\varvec{y}}}}_{{\varvec{i}}}\) indicates the predicted category of the node: 1 for an abnormal node or 0 for a normal node. Subsequently, the binary cross-entropy function is employed to compute the classification loss:
where yi denotes the actual label of node i. Thus, by setting the weight coefficient \({\varvec{\lambda}}\) to combine the classification loss with the optimization loss of the weight matrix S, the ultimate optimization objective is established as follows:
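The combined objective can be sketched as the binary cross-entropy classification loss plus a λ-weighted structure term; `bce_loss` and `total_loss` are illustrative helper names, and the structure term is passed in as a precomputed scalar:

```python
import numpy as np

def bce_loss(y_true, y_prob, eps=1e-12):
    """Binary cross-entropy over predicted node probabilities."""
    y_prob = np.clip(y_prob, eps, 1.0 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

def total_loss(y_true, y_prob, sparsity_term, lam=0.1):
    # Final objective: classification loss plus lambda-weighted structure loss.
    return bce_loss(y_true, y_prob) + lam * sparsity_term

y = np.array([1.0, 0.0, 0.0, 1.0])            # 1 = abnormal, 0 = normal
good = np.array([0.9, 0.1, 0.2, 0.8])         # confident, correct predictions
bad = np.array([0.4, 0.6, 0.5, 0.5])          # uncertain / wrong predictions
```

Better-calibrated predictions yield a lower classification loss, and λ trades off classification accuracy against the sparsity of the learned weight matrix S.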
Data availability
All data generated or analysed during this study are included in this published article, its supplementary information files and publicly available repositories. Our code is publicly available in the GitHub repository: https://github.com/ZZCrazy00/AnomalGRN, and is also publicly available in the Zenodo repository: https://zenodo.org/records/14917279. Additionally, the data is publicly available on Zenodo: https://zenodo.org/records/12176604.
References
De Nadal E, Ammerer G, Posas F. Controlling gene expression in response to stress. Nature Reviews Genetics. 2011;12(12):833–45.
López-Maury L, Marguerat S, Bähler J. Tuning gene expression to changing environments: from rapid responses to evolutionary adaptation. Nature Reviews Genetics. 2008;9(8):583–93.
Parikshak NN, Gandal MJ, Geschwind DH. Systems biology and gene networks in neurodevelopmental and neurodegenerative disorders. Nature Reviews Genetics. 2015;16(8):441–58.
Potkin SG, Macciardi F, Guffanti G, Fallon JH, Wang Q, Turner JA, Lakatos A, Miles MF, Lander A, Vawter MP, et al. Identifying gene regulatory networks in schizophrenia. Neuroimage. 2010;53(3):839–47.
Moding EJ, Kastan MB, Kirsch DG. Strategies for optimizing the response of cancer and normal tissues to radiation. Nature Reviews Drug Discovery. 2013;12(7):526–42.
Buettner F, Natarajan KN, Casale FP, Proserpio V, Scialdone A, Theis FJ, Teichmann SA, Marioni JC, Stegle O. Computational analysis of cell-to-cell heterogeneity in single-cell RNA-sequencing data reveals hidden subpopulations of cells. Nature Biotechnology. 2015;33(2):155–60.
Ding J, Adiconis X, Simmons SK, Kowalczyk MS, Hession CC, Marjanovic ND, Hughes TK, Wadsworth MH, Burks T, Nguyen LT, et al. Systematic comparison of single-cell and single-nucleus RNA-sequencing methods. Nature Biotechnology. 2020;38(6):737–46.
Chiu IM, Barrett LB, Williams EK, Strochlic DE, Lee S, Weyer AD, Lou S, Bryman GS, Roberson DP, Ghasemlou N, et al. Transcriptional profiling at whole population and single cell levels reveals somatosensory neuron molecular diversity. eLife. 2014;3:e04660.
Huynh-Thu VA, Irrthum A, Wehenkel L, Geurts P. Inferring regulatory networks from expression data using tree-based methods. PLoS One. 2010;5(9):e12776.
Shu H, Ding F, Zhou J, Xue Y, Zhao D, Zeng J, Ma J. Boosting single-cell gene regulatory network reconstruction via bulk-cell transcriptomic data. Briefings in Bioinformatics. 2022;23(5):bbac389.
Yuan Y, Bar-Joseph Z. Deep learning for inferring gene relationships from single-cell expression data. Proceedings of the National Academy of Sciences. 2019;116(52):27151–8.
Chen J, Cheong C, Lan L, Zhou X, Liu J, Lyu A, Cheung WK, Zhang L. DeepDRIM: a deep neural network to reconstruct cell-type-specific gene regulatory network using single-cell RNA-seq data. Briefings in Bioinformatics. 2021;22(6):bbab325.
Kc K, Li R, Cui F, Yu Q, Haake AR. GNE: a deep learning framework for gene network inference by aggregating biological information. BMC Systems Biology. 2019;13(2):1–14.
Chen G, Liu ZP. Graph attention network for link prediction of gene regulations from single-cell RNA-sequencing data. Bioinformatics. 2022;38(19):4522–9.
Mao G, Pang Z, Zuo K, Wang Q, Pei X, Chen X, Liu J. Predicting gene regulatory links from single-cell RNA-seq data using graph neural networks. Briefings in Bioinformatics. 2023;24(6):bbad414.
Pratapa A, Jalihal AP, Law JN, Bharadwaj A, Murali T. Benchmarking algorithms for gene regulatory network inference from single-cell transcriptomic data. Nat Methods. 2020;17(2):147–54.
Pratapa A, Jalihal AP, Law JN, Bharadwaj A, Murali TM. Benchmarking algorithms for gene regulatory network inference from single-cell transcriptomic data. Nature Methods. 2020;17(2):147–54.
Yuan L, Zhao L, Jiang Y, Shen Z, Zhang Q, Zhang M, Zheng C-H, Huang D-S. scMGATGRN: a multiview graph attention network–based method for inferring gene regulatory networks from single-cell transcriptomic data. Briefings in Bioinformatics. 2024;25(6):bbae526.
Wang Y, Chen X, Zheng Z, Huang L, Xie W, Wang F, Zhang Z, Wong K-C. scGREAT: Transformer-based deep-language model for gene regulatory network inference from single-cell transcriptomics. iScience. 2024;27(4).
Jin W, Ma Y, Liu X, Tang X, Wang S, Tang J. Graph structure learning for robust graph neural networks. In: Proceedings of the 26th ACM SIGKDD international conference on knowledge discovery & data mining. 2020.
Zügner D, Borchert O, Akbarnejad A, Günnemann S. Adversarial attacks on graph neural networks: perturbations and their patterns. ACM Transactions on Knowledge Discovery from Data (TKDD). 2020;14(5):1–31.
Kipf TN, Welling M. Semi-supervised classification with graph convolutional networks. 5th International Conference on Learning Representations, {ICLR} 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings. 2017. https://openreview.net/forum?id=SJU4ayYgl.
Velickovic P, Cucurull G, Casanova A, Romero A, Liò P, Bengio Y. Graph attention networks. 6th International Conference on Learning Representations, ICLR 2018, Conference Track Proceedings. 2018.
Hamilton WL, Ying Z, Leskovec J. Inductive representation learning on large graphs. Advances in Neural Information Processing Systems. 2017;30.
Forrest MP, Hill MJ, Quantock AJ, Martin-Rendon E, Blake DJ. The emerging roles of TCF4 in disease and development. Trends in Molecular Medicine. 2014;20(6):322–31.
Li M, Zhao H, Zhang X, Wood LD, Anders RA, Choti MA, Pawlik TM, Daniel HD, Kannangai R, Offerhaus GJA, et al. Inactivating mutations of the chromatin remodeling gene ARID2 in hepatocellular carcinoma. Nature Genetics. 2011;43(9):828–9.
Auclair G, Borgel J, Sanz LA, Vallet J, Guibert S, Dumas M, Cavelier P, Girardot M, Forné T, Feil R, et al. EHMT2 directs DNA methylation for efficient gene silencing in mouse embryos. Genome Research. 2016;26(2):192–202.
Moon EJ, Mello SS, Li CG, Chi JT, Thakkar K, Kirkland JG, Lagory EL, Lee IJ, Diep AN, Miao Y, et al. The HIF target MAFF promotes tumor invasion and metastasis through IL11 and STAT3 signaling. Nature Communications. 2021;12(1):4308.
Lee DH, Pan Y, Kanner S, Sung P, Borowiec JA, Chowdhury D. A PP4 phosphatase complex dephosphorylates RPA2 to facilitate DNA repair via homologous recombination. Nature Structural & Molecular Biology. 2010;17(3):365–72.
Arnold MA, Kim Y, Czubryt MP, Phan D, McAnally J, Qi X, Shelton JM, Richardson JA, Bassel-Duby R, Olson EN. MEF2C transcription factor controls chondrocyte hypertrophy and bone development. Developmental Cell. 2007;12(3):377–89.
Lin Q, Schwarz J, Bucana C, Olson EN. Control of mouse cardiac morphogenesis and myogenesis by transcription factor MEF2C. Science. 1997;276(5317):1404–7.
Dimitrov D, Schäfer PSL, Farr E, Rodriguez Mier P, Lobentanzer S, Dugourd A, Tanevski J, Ramirez Flores RO, Saez-Rodriguez J. LIANA+: an all-in-one cell-cell communication framework. bioRxiv. 2023. https://doi.org/10.1101/2023.08.19.553863.
Kuppe C, Ramirez Flores RO, Li Z, Hayat S, Levinson RT, Liao X, Hannani MT, Tanevski J, Wünnemann F, Nagai JS. Spatial multi-omic map of human myocardial infarction. Nature. 2022;608(7924):766–77.
Huang X, Song C, Zhang G, Li Y, Zhao Y, Zhang Q, Zhang Y, Fan S, Zhao J, Xie L. scGRN: a comprehensive single-cell gene regulatory network platform of human and mouse. Nucleic Acids Res. 2024;52(D1):D293–303.
Ding Q, Yang W, Xue G, Liu H, Cai Y, Que J, Jin X, Luo M, Pang F, Yang Y. Dimension reduction, cell clustering, and cell–cell communication inference for single-cell transcriptomics with DcjComm. Genome Biol. 2024;25(1):241.
McCalla SG, Fotuhi Siahpirani A, Li J, Pyne S, Stone M, Periyasamy V, Shin J, Roy S. Identifying strengths and weaknesses of methods for computational network inference from single-cell RNA-seq data. G3: Genes, Genomes, Genetics. 2023;13(3):jkad004.
Zheng C, Zong B, Cheng W, Song D, Ni J, Yu W, Chen H, Wang W. Robust graph representation learning via neural sparsification. In: International conference on machine learning. 2020.
Gong Z, Wang G, Sun Y, Liu Q, Ning Y, Xiong H, Peng J. Beyond homophily: robust graph anomaly detection via neural sparsification. In: Proceedings of the thirty-second international joint conference on artificial intelligence. 2023.
Kool W, van Hoof H, Welling M. Estimating gradients for discrete random variables by sampling without replacement. 8th International Conference on Learning Representations, {ICLR} 2020, Addis Ababa, Ethiopia, April 26-30, 2020. https://openreview.net/forum?id=rklEj2EFvB.
Kazi A, Cosmo L, Ahmadi S-A, Navab N, Bronstein MM. Differentiable graph module (DGM) for graph convolutional networks. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2022;45(2):1606–17.
Chen Y, Wu L, Zaki M. Iterative deep graph learning for graph neural networks: better and robust node embeddings. Advances in Neural Information Processing Systems. 2020;33:19314–26.
Koltchinskii V, Lounici K, Tsybakov AB. Nuclear-norm penalization and optimal rates for noisy low-rank matrix completion. Annals of Statistics. 2011;39(5):2302–29.
Zhu Y, Xu W, Zhang J, Du Y, Zhang J, Liu Q, Yang C, Wu S. A survey on graph structure learning: progress and opportunities. arXiv preprint arXiv:2103.03036, 2021. https://doi.org/10.48550/arXiv.2103.03036.
Acknowledgements
We would like to thank the National Natural Science Foundation of China for their support of this work (No. 62450002, 62302339, 62372158). We also appreciate the contributions of all authors involved in this study, whose collaboration made this research possible.
Funding
This work was supported by the National Natural Science Foundation of China (No. 62131004 to QZ, No. 62302339 to LZ, No. 62372158 to XF).
Author information
Authors and Affiliations
Contributions
All authors read and approved the final manuscript. Zhecheng Zhou and Jinhang Wei contributed to the experimental design and initial draft preparation. Mingzhe Liu and Linlin Zhuo provided experimental guidance, while Linlin Zhuo additionally conducted data analysis. Xiangzheng Fu and Quan Zou oversaw manuscript revision and review.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
This study involves computational experiments that are non-invasive and do not directly intervene with any human or animal subjects. Therefore, ethical approval from an institutional review board is not required. The consent of participants and the protection of personal information are not applicable, as the experimental data are sourced from publicly available datasets.
Consent for publication
All authors have provided their consent for publication of this study. There are no identifiable individuals or personal data included in this manuscript, ensuring compliance with publication ethics.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Zhou, Z., Wei, J., Liu, M. et al. AnomalGRN: deciphering single-cell gene regulation network with graph anomaly detection. BMC Biol 23, 73 (2025). https://doi.org/10.1186/s12915-025-02177-z