Introduction

Network theory has emerged as a powerful framework to model different kinds of real-world systems, by representing their units as nodes and the interactions between them as links. Out of the many types of edges that have been considered so far, the signed one, offering the possibility of modelling positive as well as negative interactions, has recently seen its popularity revived1,2,3,4.

Most works on signed configurations have focused on monopartite graphs, i.e., configurations with a unique set of nodes each of which can interact with any other one. The interest towards one-mode signed networks is ascribable to the so-called balance theory (BT), introduced by Heider in5 and later formalised by Cartwright and Harary employing signed graphs6.

BT states that a signed graph is strongly balanced (SB) if all cycles contain an even number of negative edges: from a mesoscopic perspective, this implies that the network can be split in two groups with positive intra-modular and negative inter-modular links; in7, Davis spoke of weakly balanced (WB) graphs, allowing for more than two groups: taken together, the SB and WB concepts define what is called Traditional Balance Theory (TBT)8—a framework that has found applications in contexts as diverse as the biological, ecological, economic and social ones8,9,10,11,12,13,14,15,16,17.

Since, however, real-world networks often deviate from it, Doreian and Mrvar have proposed a generalisation named Relaxed Balance Theory (RBT)8,18 which allows for positive inter-modular and negative intra-modular links as well.

The concept of balance in the bipartite context

The interest towards two-mode signed networks, i.e., configurations with two sets of vertices where (signed) connections can be established only between pairs of nodes belonging to different sets, is, instead, much more recent. One of the earliest attempts at adapting the notion of structural balance to the bipartite framework has been carried out in ref. 19, where the difference between the roles played by ‘subjects’ and ‘objects’ in Heider’s formulation of the BT was highlighted. In refs. 20,21, the authors focused on the balance of bipartite cycles by considering the statistical significance of the shortest ones, known as ‘butterfly motifs’, ‘X-motifs’ or ‘2 × 2 bicliques’22. In23, the authors carried out three different analyses: first, they tested the degree of balance of real-world bipartite signed networks by employing the definition of ‘butterfly motifs’ provided in ref. 21. Second, they inferred the sign of the missing intra-layer edges by connecting any two nodes belonging to the same set with a ‘plus one’ (‘minus one’) if they established either two positive or two negative links (one positive and one negative link) with the same set of vertices on the opposite layer. Finally, they tested the degree of balance of each layer by comparing the empirical percentages of triangles and butterflies with the ones expected under a null model, shuffling the links of the original bipartite network— i.e., implementing the microcanonical version of the Free-topology Signed Random Graph Model introduced in ref. 13: as a result, the patterns turned out to be more balanced than expected. In24, bipartite signed networks were employed to model a group of individuals’ opinions about certain topics: the authors addressed the problem of partitioning both types of entities into two groups while maximising the number of positive intra-modular links and minimising the number of negative inter-modular links.

Projecting bipartite unsigned networks

One of the issues of major interest encountered when modelling bipartite networks is that of inferring the presence of a relationship between nodes belonging to the same layer in case a direct measurement of such a relationship is unfeasible (as for data gathering about friendships on social platforms25). The simplest way of solving the problem is linking any two nodes belonging to the same layer as long as they share at least one neighbour: however, this often results in a very dense network whose topological structure is almost trivial. A different recipe prescribes retaining the information on the number of common neighbours, i.e., to project a bipartite network into a monopartite weighted network25: this prescription, however, causes the nodes with larger degree to have larger strengths, thus masking the genuine, statistical relevance of the induced connections. Moreover, such a prescription lets spurious clusters of nodes emerge (e.g., cliques induced by the presence of—even—a single node connected to all the other vertices on the opposite layer).

Algorithms for retaining only the significant weights have been proposed to face this problem25. Many of them are based on a thresholding procedure, a major drawback of which lies in the arbitrariness of the chosen threshold26,27,28. A different approach prescribes calculating the statistical significance of the projected weights according to a properly defined null model29: the latter, however, encodes relatively little information about the original structure, thus being more suited to analyse natively monopartite networks. A similar-in-spirit approach identifies the backbone of a monopartite weighted projection with its Minimum Spanning Tree and its communities with the trees constituting the Minimum Spanning Forest30,31: the lack of a comparison with a benchmark, however, makes assessing the statistical relevance of the outcome difficult.

All the aforementioned approaches validate a projection a posteriori. A different class of methods focuses on recipes to obtain statistically validated projections by estimating the tendency of any two nodes belonging to the same layer to share a certain number of neighbours: all such approaches define a similarity measure that either ranges between 0 and 132,33 or follows a probability distribution allowing for a p value to be computed34,35,36,37,38. While, in the first case, the application of an arbitrary threshold is still unavoidable, in the second case, prescriptions rooted in traditional statistics can be applied.

The approaches discussed so far lead to unsigned projections of bipartite unsigned networks. In refs. 25,39, instead, the author carried out a two-sided test of hypothesis to decide whether any two members of the 108th U.S. Senate co-sponsored either enough bills for a political alliance (positive link) to be inferred or not enough bills for political antagonism (negative link) to be inferred. The employed null model was named Stochastic Degree Sequence Model, a principled derivation of which was provided in refs. 22,36 where the benchmark was re-named Bipartite Configuration Model.

For a comprehensive review of methods to carry out pattern detection in bipartite networks, see ref. 40.

Projecting bipartite signed networks

The issue of projecting bipartite signed networks has been addressed to a much less extent. A first example is provided by41, where the authors projected a user-item network onto the layer of users: the similarity of any two of them was quantified by calculating the scalar product of the corresponding rows of the biadjacency matrix42 and its statistical significance evaluated by comparing the empirical value with the one expected under a null model defined by randomly swapping two edges having the same sign. Under such a benchmark—which is the microcanonical version of the Free-topology Signed Configuration Model introduced ref. 13—the authors found that both the MovieLens and the Netflix projections were highly balanced—the degree of balance being proxied by the number of triangles having an even number of negative edges. A second example is provided by ref. 43, where the authors obtained signed projections by carrying out a statistical validation of the number of neighbours shared by any two nodes belonging to the same layer via a hypergeometric-binomial mixture distribution.

Hereby, we build upon previous contributions by extending the Exponential Random Graphs framework to include null models suitable for analysing binary undirected bipartite signed networks and employ them to obtain statistically validated signed projections. To address such a problem, we extend the algorithm proposed in36, based upon the idea that any two nodes sharing a significantly large number of neighbours should be linked in the corresponding monopartite projection. More precisely, we propose two variants of it, according to the way ambivalent patterns (i.e. patterns constituted by two nodes and an item, liked by one node and disliked by the other) and missing ties (characterising patterns constituted by two nodes and an item, with just one node expressing an opinion about it) are treated.

The rest of the paper is organised as follows. First, we introduce a quantity to measure the similarity of any two nodes belonging to the same layer. Second, we derive its probability distribution according to each considered benchmark. Third, we consider each pair of nodes and quantify the statistical significance of their similarity. Fourth, we link only the ones surviving a multiple-hypothesis testing procedure. Fifth, we employ our methods to obtain signed projections of several different datasets. Finally, we comment on our results.

Results

With the present contribution, we propose an unsupervised algorithm to obtain statistically validated projections of binary undirected bipartite signed networks, according to which any two nodes sharing a statistically significant number of concordant (discordant) relationships are connected by a positive (negative) edge. Before applying it to a number of real-world configurations, let us, first, illustrate how it works on a toy model.

Projection of synthetic configurations

In order to illustrate how our algorithm for projecting bipartite signed networks works, let us construct a generative model as follows. Let us consider the Bipartite Signed Stochastic Block Model (BiSSBM), induced by the finite scheme

$${b}_{i\alpha } \sim \left(\begin{array}{lll}-1&0&+1\\ {p}_{{g}_{i}{g}_{\alpha }}^{-}&{p}_{{g}_{i}{g}_{\alpha }}^{0}&{p}_{{g}_{i}{g}_{\alpha }}^{+}\end{array}\right),\quad i\in {g}_{i},\,\alpha \in {g}_{\alpha }$$
(1)

with i = 1…N and α = 1…M. Figure 1 provides a graphical representation of three configurations generated via the BiSSBM by considering parameters reading (left panels) N = 80, M = 100,

$${p}^{+}=\left(\begin{array}{ll}0.5&0.08\\ 0.08&0\end{array}\right)\quad \,{\text{and}}\,\quad {p}^{-}=\left(\begin{array}{ll}0&0.08\\ 0.08&0.5\end{array}\right);$$
(2)

(middle panels) N = 100, M = 120,

$${p}^{+}=\left(\begin{array}{ll}0.5&0.4\\ 0.4&0\end{array}\right)\quad \,{\text{and}}\,\quad {p}^{-}=\left(\begin{array}{ll}0&0.4\\ 0.4&0.5\end{array}\right);$$
(3)

(right panels) N = 100, M = 120,

$${p}^{+}=\left(\begin{array}{lll}0.6&0.2&0.1\\ 0&0.6&0.1\\ 0.15&0.1&0\end{array}\right)\,\,{\text{and}}\,\,{p}^{-}=\left(\begin{array}{lll}0&0&0.1\\ 0.2&0&0.1\\ 0.1&0.2&0.6\end{array}\right).$$
(4)
Fig. 1: Graphical representation of three synthetic configurations, generated by considering the values of the parameters reported in the main text.
figure 1

What we obtain confirms that (i) any two nodes within the same bipartite group share a significantly large number of concordant motifs that, in turn, induces a positive connection in the projection; (ii) any two nodes belonging to different bipartite groups may share either a significantly large number of discordant motifs, inducing a negative connection in the projection, or a significantly large number of concordant motifs, inducing a positive connection in the projection.

In the first case (left panels), the nodes belonging to gi = 1 (gi = 2) like the items belonging to gα = 1 (gα = 2) but establish few connections with the items belonging to gα = 2 (gα = 1); in the second case (middle panels), the nodes belonging to gi = 1 (gi = 2) like the items belonging to gα = 1 (gα = 2) but the density of inter-group connections is much larger; in the third case (right panels), the nodes belonging to the first (second) group like the items belonging to the first (second) group, the nodes belonging to the third group dislike the items belonging to the third group and the density of inter-group connections resembles the one characterising the first case. Irrespectively of the exact values of the parameters defining them, however, we expect the first two synthetic configurations to induce projections obeying the statistical variant of the traditional balance theory and the third configuration to induce a projection obeying the statistical variant of the RBT8,18.

Let us, now, project our synthetic configurations by employing the BiSRGM-FT, i.e., the global statistical benchmark induced by the zero-deflated scheme. Following8, one can adopt an ‘agnostic’ attitude and explore the mesoscale organisation of a projection without aligning with any specific conceptual framework: a principled approach to achieve such a goal is that of minimising the Bayesian Information Criterion (BIC) reading

$$\,{\text{BIC}}\,=\kappa \ln V-2\ln {\mathcal{L}};$$
(5)

the first addendum proxies the complexity of a model with the number of its parameters, κ, the second addendum proxies the accuracy of a model with its log-likelihood, \(\ln {\mathcal{L}}\), and V = N(N−1)/2 accounts for the system dimensions. Since we aim at describing a projection mesoscale organisation, a natural choice is that of adopting the Signed Stochastic Block Model (SSBM), defined by the likelihood function

$$\begin{array}{l}{{\mathcal{L}}}_{{\rm{SSBM}}}=\mathop{\prod }\limits_{r=1}^{k}{({p}_{rr}^{+})}^{{L}_{rr}^{+}}{({p}_{rr}^{-})}^{{L}_{rr}^{-}}{(1-{p}_{rr}^{+}-{p}_{rr}^{-})}^{\left(\begin{array}{c}{N}_{r}\\ 2\end{array}\right)-{L}_{rr}}\\ \mathop{\prod }\limits_{r=1}^{k}\mathop{\prod }\limits_{\begin{array}{c}s=1\\ s > r\end{array}}^{k}{({p}_{rs}^{+})}^{{L}_{rs}^{+}}{({p}_{rs}^{-})}^{{L}_{rs}^{-}}{(1-{p}_{rs}^{+}-{p}_{rs}^{-})}^{{N}_{r}{N}_{s}-{L}_{rs}}\end{array}$$
(6)

and a number of parameters κSSBM = k(k + 1), amounting at twice the number of modules, k, into which the projection can be partitioned. Naturally, Nr is the number of nodes constituting block r, \({L}_{rr}^{+}={p}_{rr}^{+}{N}_{r}({N}_{r}-1)/2\) is the number of positive links within block r, \({L}_{rr}^{-}={p}_{rr}^{-}{N}_{r}({N}_{r}-1)/2\) is the number of negative links within block r, \({L}_{rs}^{+}={p}_{rs}^{+}{N}_{r}{N}_{s}\) is the number of positive links between blocks r and s and \({L}_{rs}^{-}={p}_{rs}^{-}{N}_{r}{N}_{s}\) is the number of negative links between blocks r and s.

What we obtain confirms what we expect, i.e., that any two nodes within the same bipartite group share a significantly large number of concordant motifs that, in turn, induces a positive connection in the projection. On the other hand, any two nodes belonging to different bipartite groups may share either a significantly large number of discordant motifs that, in turn, induces a negative connection in the projection or a significantly large number of concordant motifs that, in turn, induces a positive connection in the projection.

Projection of real-world configurations

Let us, now, consider some real-world networks.

U.S. senate and U.S. house of representatives

The first two datasets that we consider, described in ref. 21 and further analysed in refs. 20,23, are the output of the GovTrack.us project44 and collect vote records from the 1st to the 10th Congress of the United States. The nodes on the first layer are either senators or representatives while the nodes on the second layer are bills: a positive (negative) link between a senator/representative and a bill indicates that the senator/representative has voted ‘Yes’ (‘Nay’) for that bill.

FilmTrust

The third dataset that we consider is the output of the FilmTrust project45,46 and collects rating data from an online community whose users assign a score, i.e., 1, 2, 3, 4, to a number of movies. The nodes on the first layer are users while the nodes on the second layer are movies. We obtain a binary signed version of this dataset by assigning a −1 to all the edges whose weight is either 1 or 2 and a + 1 to all the edges whose weight is either 3 or 4.

From a purely empirical perspective, all the aforementioned datasets are characterised by a small link density c = L/(NM): the connectance of U.S. Senate and U.S. House of Representatives amounts to 0.1 while the connectance of FilmTrust amounts to 0.01. The percentages of positive and negative links are, instead, quite different: U.S. Senate and U.S. House of Representatives have 55% of positive links while FilmTrust has 80% of positive links (see also Table I in Appendix E).

naive projection of real-world configurations

The aforementioned configurations have been, first, naively projected according to the recipes , valid within the zero-deflated scheme, and , valid within the zero-inflated scheme.

When considering the unsigned case, the link density of the projection returned by the naive approach is typically large36. As shown in Fig. 2, this is especially true within the zero-inflated scheme–although the naive projections of FilmTrust are both very dense. Let us also notice that the small link density of FilmTrust causes the naive projection obtained within the zero-inflated scheme to be solely populated by positive links. For what concerns the naive projection obtained within the zero-deflated scheme, instead, its large link density signals that the vast majority of pairs of nodes shares at least one V-motif (i.e. the absolute value of the difference between the number of ‘full’ concordant and discordant motifs is at least 1). For what concerns U.S. Senate and U.S. House of Representatives, the naive projections obtained within the zero-deflated scheme are already quite sparse, letting non-trivial mesoscopic patterns emerge—appreciable even by just looking at the adjacency matrices illustrated in the first column of Fig. 3 (see also Table 1).

Fig. 2
figure 2

Pictorial representation of the adjacency matrices of the projections of FilmTrust ((ac), U.S. Senate (df) and U.S. House of Representatives (gi), obtained within the zero-inflated scheme, i.e., the naive ones (a, d, g), the ones induced by the BiSRGM-FT b, e, h) and the ones induced by the BiSCM-FT (c, f, i). Entries equal to −1 are coloured in red, entries equal to 0 are coloured in white, and entries equal to +1 are coloured in blue. The rows and columns of these adjacency matrices are re-ordered on the basis of the mesoscopic structures spotted by minimising BIC (see also Fig. 4).

Fig. 3
figure 3

Pictorial representation of the adjacency matrices of the projections of FilmTrust (ac), U.S. Senate (df) and U.S. House of Representatives (gi), obtained within the zero-deflated scheme, i.e., the naive ones (a, d, g), the ones induced by the BiSRGM-FT (b, e, h) and the ones induced by the BiSCM-FT (c, f, i). Entries equal to −1 are coloured in red, entries equal to 0 are coloured in white, entries equal to +1 are coloured in blue. The rows and columns of these adjacency matrices are re-ordered on the basis of the mesoscopic structures spotted by minimising BIC.

Table 1 From a purely empirical perspective, the datasets considered in the present contribution are characterised by a small link density and different percentages of positive and negative links: the connectance of U.S. Senate and U.S. House of Representatives amounts to 0.1 while 55% of links is positive; the connectance of FilmTrust amounts to 0.01 while 80% of links is positive

BIC minimisation confirms the presence of modules seemingly obeying the statistical variant of the RBT8,18, as a non-negligible number of negative links is found not only between clusters but within clusters as well—the vast majority of links within the second block from the right-bottom angle of the (adjacency matrix of the) zero-deflated naive projection of U.S. Senate is, in fact, negative—and a non-negligible number of positive links is found not only within clusters but between clusters as well—the vast majority of links between the first and the second block from the right-bottom angle of the (adjacency matrix of the) zero-deflated naive projection of U.S. Senate is, in fact, positive.

As a last observation, let us stress that the large link density of the zero-inflated naive projections of U.S. Senate and U.S. House of Representatives does not prevent BIC minimisation from detecting statistically significant mesoscopic structures: its sensitivity to the density of signed links8, in fact, makes it capable of revealing the 6 different blocks constituting the zero-inflated naive projection of U.S. Senate and the 2 different blocks constituting the zero-inflated naive projection of U.S. House of Representatives, both depicted in Fig. 4.

Fig. 4: Pictorial representation of the projections of U.S. Senate (top panels) and U.S. House of Representatives (bottom panels), obtained within the zero-inflated scheme, i.e., the naive ones (left panels), the ones induced by the BiSRGM-FT (middle panels), and the ones induced by the BiSCM-FT (right panels).
figure 4

Entries equal to −1 are coloured in red, entries equal to 0 are coloured in white, and entries equal to +1 are coloured in blue. Minimising BIC lets the naive projection of U.S. Senate (U.S. House of Representatives) to be partitioned into 6 (2) modules, the one induced by the BiSRGM-FT to be partitioned into 6 (3) modules, and the one induced by the BiSCM-FT to be partitioned into 3 (11) modules.

Validated projection of real-world configurations within the zero-deflated scheme

Although summing motifs while retaining their own sign is, in a sense, enough to obtain sparse projections, the validation procedure proposed in this paper (also) aims at enhancing the identification of patterns encoding non-trivial information about the original structure.

For what concerns U.S. Senate, minimising BIC on the projection filtered via the global statistical benchmark named BiSRGM-FT refines the picture provided by minimising BIC on the naive projection, confirming the presence of a larger number of modules—more precisely, 10 as shown in Fig. 4. Interestingly enough, all such modules are characterised by (a vast majority of) positive links, although positive links are found between modules as well. Finally, minimising BIC on the projection filtered via the local statistical benchmark named BiSCM-FT returns a picture lying, somehow, halfway between the naive one and the one filtered via the BiSRGM-FT, as 8 modules are, now, detected; contrarily to what has been revealed with the aid of the global filter, however, applying the local filter returns a picture where negative links are found within modules as well.

The entire process is even more evident when considering U.S. House of Representatives, whose mesoscopic structure gets progressively resolved into an increasing number of increasingly negative modules (i.e., 2, 5, 6).

Overall, thus, our results confirm that the revealed modular structure seems to align better with the statistical variant of the RBT than with the statistical variant of its traditional counterpart (TBT), consistently across the different projections.

Validated projection of real-world configurations within the zero-inflated scheme

The considerations above are confirmed by the zero-inflated projections depicted in Fig. 2. For example, progressively filtering FilmTrust leads from a fully connected network to configurations whose connectance practically halves at each step. More in detail, applying the BiSRGM allows two different blocks to emerge—a result similar in spirit to the one that has been reported in ref. 36, where the global statistical benchmark induces a rough partition of the system under consideration. The filtering induced by the BiSCM is, instead, more severe and qualitatively different: the projection obtained is, in fact, much sparser and populated by a comparable number of positive and negative links.

Interestingly, filtering U.S. Senate with the BiSRGM lets BIC identify the same number of modules characterising the naive projection—they are 6 in both cases—but with a different arrangements of links; this is even more evident when considering the projection induced by the BiSCM, characterised by just three modules. On the contrary, filtering U.S. House of Representatives means rising (even substantially) the number of clusters while cutting half of the edges populating its naive projection.

Discussion

A straightforward approach to project a bipartite network in the unsigned setting is that of comparing the number of neighbours shared by any two nodes i and j, i.e.

$${V}_{ij}=\mathop{\sum }\limits_{\alpha =1}^{M}{b}_{i\alpha }{b}_{j\alpha }={[{\bf{B}}\cdot {{\bf{B}}}^{T}]}_{ij}={{\bf{r}}}_{i}\cdot {({{\bf{r}}}_{j})}^{T},$$
(7)

with the one predicted under a chosen benchmark36. In the signed setting, instead, two complementary approaches can be devised, treating the missing ties in a different way: while the zero-deflated projection scheme ignores them, the zero-inflated projection scheme accounts for them as well. Loosely speaking, the zero-deflated projection scheme returns configurations that are sparser than the configurations returned by the zero-inflated projection scheme although the latter are populated by a larger number of negative links; both schemes, however, enhance the detection of mesoscopic structures, the filtered projections being characterised by a larger number of modules than the naive ones.

Additionally, we have compared the patterns revealed by minimising BIC with the ones revealed by minimising the frustration F, defined as

$$F({\boldsymbol{\sigma }})={L}_{\bullet }^{-}+{L}_{\circ }^{+}$$
(8)

according to the traditional balance theory8, i.e., counting the number of negative links within modules (indicated with a filled dot—see the first addendum) plus the number of positive links between modules (indicated with an empty dot—see the second addendum). The differences are highlighted in Fig. 5, showing how the optimisation of F basically returns an oversimplified picture leading, for instance, to partition the projection of U.S. House of Representatives induced by the BiSRGM within the zero-inflated scheme into 32 modules. The explanation lies in the sensitivity of F solely towards the signed membership: as the signed density is completely ignored, the groups of nodes dominated by negative links are split into singletons—and, by converse, the nodes connected by positive links are grouped together. Another example is provided by the projection of U.S. House of Representatives induced by the BiSRGM within the zero-deflated scheme: such a configuration is, now, partitioned into 2 modules, the signed membership leading the algorithm to disregard the (internal) hierarchical structure of these clusters. Similar results are found when considering the other kinds of projections.

Fig. 5: Pictorial representation of the projections of the U.S. House of Representatives, obtained within the zero-deflated scheme (first and second column) and the zero-inflated scheme (third and fourth column), induced by the BiSRGM-FT.
figure 5

Entries equal to −1 are coloured in red, entries equal to 0 are coloured in white, and entries equal to +1 are coloured in blue. The rows and columns of these adjacency matrices are re-ordered on the basis of the mesoscopic structures spotted by either minimising BIC (top panels) or the frustration F (bottom panels). As the optimisation of F is solely driven by signed membership, it returns a partition defined by i) a smaller number of modules than those individuated by BIC when the majority of validated links is positive and ii) a larger number of modules than those individuated by BIC when the majority of validated links is negative.

A more quantitative comparison can be carried out upon calculating the Wallace, Rand and Jaccard Index, that sum up the coefficients populating the confusion matrix to return three compact measures of similarity between partitions. As illustrated in Appendix F, the three aforementioned indices confirm that the mesoscale structures spotted by the BIC-based and the F-based recipes are more similar whenever a larger number of positive links populates a given projection. If, on the contrary, the latter is defined by a large number of negative links, the F-based recipe outputs many ‘false negatives’, i.e., pairs of nodes that are separated solely because they are found to be connected by a negative link. Such a result sheds further light on the observation made in refs. 13 and 8 about the (potential) ambiguity concerning the variant of the balance theory best supported by the data: if the intuitive definition of modules as ‘densely connected groups of nodes’ is extended to the signed case, then the recipe prescribing to minimise BIC should be preferred. If, on the other hand, one seeks the configuration best aligning with the TBT, then the recipe prescribing to minimise F should be preferred—although not specifically designed to spot (statistically significant) mesoscale structures.

A very last observation concerns the ‘sign prediction’ problem. In47, such an issue is addressed within the framework of the TBT, i.e., under the assumption that signed networks evolve towards balance: within such a context, sign prediction is carried out by combining the edge-based definition of Katz centrality with the minimisation of frustration. In48, instead, an approach based upon the ensemble of random graphs induced by the hypergeometric distribution is employed. Other methods are based upon machine learning techniques23,49,50,51,52,53. Our algorithm can, in a sense, be understood as implementing an unsupervised ‘white box’ method for predicting the sign of a link: interestingly, the prescription based upon the signature of its endpoints—summarisable with the motto ‘a significantly large number of concordant motifs induces a + 1 and a significantly large number of discordant motifs induces a − 1’—formalises the tendency towards balance, advocated by other approaches, within the bipartite context.

Methods

Formalism and basic quantities

A binary undirected bipartite signed network is completely defined by its biadjacency matrix, i.e., a rectangular table B whose dimensions will be indicated with M and N, M being the number of nodes in the top layer (i.e. the number of columns of B) and N being the number of nodes in the bottom layer (i.e. the number of rows of B).

Each edge can be positive, negative or missing: since we will focus on binary networks, each edge will be ‘+1’, ‘−1’ or ‘0’. More formally, for any two nodes i and α, the corresponding entry of the biadjacency matrix will be assumed to read biα = −1, 0, +1.

To ease mathematical manipulations, let us define the three quantities reading

$${b}_{i\alpha }^{-}=[{b}_{i\alpha }=-1],\quad {b}_{i\alpha }^{0}=[{b}_{i\alpha }=0],\quad {b}_{i\alpha }^{+}=[{b}_{i\alpha }=+1],$$
(9)

where we have employed Iverson’s brackets notation13. These new variables are mutually exclusive, satisfy the relationship \({b}_{i\alpha }^{-}+{b}_{i\alpha }^{0}+{b}_{i\alpha }^{+}=1\), i, α and induce the three non-negative matrices B+, B0 and B obeying the relationships B = B+ −B and B = B+ + B.

In the following, it will turn out to be useful to identify a biadjacency matrix with the set of its rows, i.e., \({\bf{B}}\equiv {\{{{\bf{r}}}_{i}\}}_{i = 1}^{N}\), and analogously for B+ and B. Lastly, BT indicates the transpose of the biadjacency matrix B.

Global connectivity and node degrees

The number of positive and negative links, respectively, read

$${L}^{+}=\mathop{\sum }\limits_{i=1}^{N}\mathop{\sum }\limits_{\alpha =1}^{M}{b}_{i\alpha }^{+},\quad {L}^{-}=\mathop{\sum }\limits_{i=1}^{N}\mathop{\sum }\limits_{\alpha =1}^{M}{b}_{i\alpha }^{-};$$
(10)

analogously, the positive and negative degree of node i are defined as

$${k}_{i}^{+}=\mathop{\sum }\limits_{\alpha =1}^{M}{b}_{i\alpha }^{+},\quad {k}_{i}^{-}=\mathop{\sum }\limits_{\alpha =1}^{M}{b}_{i\alpha }^{-}$$
(11)

while the positive and negative degree of node α are defined as

$${h}_{\alpha }^{+}=\mathop{\sum }\limits_{i=1}^{N}{b}_{i\alpha }^{+},\quad {h}_{\alpha }^{-}=\mathop{\sum }\limits_{i=1}^{N}{b}_{i\alpha }^{-}.$$
(12)

Naturally, \({L}^{+}=\mathop{\sum }\nolimits_{i=1}^{N}{k}_{i}^{+}=\mathop{\sum }\nolimits_{\alpha =1}^{M}{h}_{\alpha }^{+}\) and \({L}^{-}=\mathop{\sum }\nolimits_{i=1}^{N}{k}_{i}^{-}=\mathop{\sum }\nolimits_{\alpha =1}^{M}{h}_{\alpha }^{-}\). The advantage of adopting Iverson’s brackets is that of ensuring that each quantity is, now, computed on a matrix with positive entries, i.e., is positive as well.

The role of ambivalent patterns

The approaches to obtain a projection of a bipartite, signed network considered so far rest upon the calculation of any two nodes similarity. Evaluating such a quantity leads to two related problems, i.e., how to treat ambivalent patterns and missing ties.

Ambivalence is introduced in54 and defined as a ‘conjunction of positive and negative relations that are psychologically secondary or derived’; in ref. 55, Cartwright and Harary suggest that ‘attitudes of ambivalence should be unstable, changing to positive or negative attitudes so as to satisfy the criteria of balance’. In the bipartite setting we are considering here, ambivalence emerges in two, different cases: i) whenever nodes i and j establish motifs whose signature is either (+/−) or (−/+); ii) whenever nodes i and j establish the same number of (+/+) and (−/−) motifs. In both cases, devising a recipe to determine the sign of the link in the corresponding monopartite projection is not immediate.

In ref. 56, two different recipes are proposed. The first one is based on matrix multiplication and prescribes to consider the projections induced by positive and negative links separately. The second one rely on a ‘vertex duplication’ mechanism, according to which each node belonging to the layer of interest originates a positive copy, gathering the original positive links, and a negative copy, gathering the original negative links; these connections are treated as unsigned and the network is, then, projected. Finally, the sign of the edges populating the monopartite projection is restored according to a so-called ‘vertex contraction’ rule.

In order to address the problem of the ambivalent patterns, let us define the ‘full’ dyadic motifs reading

$${V}_{ij}^{++}=\mathop{\sum }\limits_{\alpha =1}^{M}{b}_{i\alpha }^{+}{b}_{j\alpha }^{+}={[({{\bf{B}}}^{+})\cdot {({{\bf{B}}}^{+})}^{T}]}_{ij}={{\bf{r}}}_{i}^{+}\cdot {({{\bf{r}}}_{j}^{+})}^{T},$$
(13)

that counts the number of nodes (belonging to the second layer) to which the nodes i and j (belonging to the first layer) are both connected via two positive links, and

$${V}_{ij}^{--}=\mathop{\sum }\limits_{\alpha =1}^{M}{b}_{i\alpha }^{-}{b}_{j\alpha }^{-}={[({{\bf{B}}}^{-})\cdot {({{\bf{B}}}^{-})}^{T}]}_{ij}={{\bf{r}}}_{i}^{-}\cdot {({{\bf{r}}}_{j}^{-})}^{T},$$
(14)

that counts the number of nodes (belonging to the second layer) to which the nodes i and j (belonging to the first layer) are both connected via two negative links. It is quite intuitive to ascribe them to the class of the so-called concordant motifs, i.e., patterns capturing the ‘agreement’ between any two users establishing them: our proposal is, thus, that of connecting any two users establishing a significantly large number of concordant motifs with a + 1.

Analogously, it is quite intuitive to ascribe the expressions

$${V}_{ij}^{+-}=\mathop{\sum }\limits_{\alpha =1}^{M}{b}_{i\alpha }^{+}{b}_{j\alpha }^{-}={[({{\bf{B}}}^{+})\cdot {({{\bf{B}}}^{-})}^{T}]}_{ij}={{\bf{r}}}_{i}^{+}\cdot {({{\bf{r}}}_{j}^{-})}^{T}$$
(15)

and

$${V}_{ij}^{-+}=\mathop{\sum }\limits_{\alpha =1}^{M}{b}_{i\alpha }^{-}{b}_{j\alpha }^{+}={[({{\bf{B}}}^{-})\cdot {({{\bf{B}}}^{+})}^{T}]}_{ij}={{\bf{r}}}_{i}^{-}\cdot {({{\bf{r}}}_{j}^{+})}^{T},$$
(16)

that count the number of nodes (belonging to the second layer) to which the nodes i and j (belonging to the first layer) are connected with a positive and a negative link, to the class of the so-called discordant motifs, i.e., patterns capturing the ‘disagreement’ between any two users establishing them: our proposal is, thus, that of connecting any two users establishing a significantly large number of discordant motifs with a − 1.

The role of missing ties

Let us, now, address the problem of the missing ties. To this aim, let us define the ‘partial’ dyadic motifs reading

$${V}_{ij}^{00}=\mathop{\sum }\limits_{\alpha =1}^{M}{b}_{i\alpha }^{0}{b}_{j\alpha }^{0}={[({{\bf{B}}}^{0})\cdot {({{\bf{B}}}^{0})}^{T}]}_{ij}={{\bf{r}}}_{i}^{0}\cdot {({{\bf{r}}}_{j}^{0})}^{T},$$
(17)

that counts the number of nodes (belonging to the second layer) to which the nodes i and j (belonging to the first layer) are both not connected,

$${V}_{ij}^{0+}=\mathop{\sum }\limits_{\alpha =1}^{M}{b}_{i\alpha }^{0}{b}_{j\alpha }^{+}={[({{\bf{B}}}^{0})\cdot {({{\bf{B}}}^{+})}^{T}]}_{ij}={{\bf{r}}}_{i}^{0}\cdot {({{\bf{r}}}_{j}^{+})}^{T}$$
(18)

and

$${V}_{ij}^{+0}=\mathop{\sum }\limits_{\alpha =1}^{M}{b}_{i\alpha }^{+}{b}_{j\alpha }^{0}={[({{\bf{B}}}^{+})\cdot {({{\bf{B}}}^{0})}^{T}]}_{ij}={{\bf{r}}}_{i}^{+}\cdot {({{\bf{r}}}_{j}^{0})}^{T},$$
(19)

that counts the number of nodes (belonging to the second layer) to which node i (node j) is not connected and node j (node i) is connected with a positive link,

$${V}_{ij}^{0-}=\mathop{\sum }\limits_{\alpha =1}^{M}{b}_{i\alpha }^{0}{b}_{j\alpha }^{-}={[({{\bf{B}}}^{0})\cdot {({{\bf{B}}}^{-})}^{T}]}_{ij}={{\bf{r}}}_{i}^{0}\cdot {({{\bf{r}}}_{j}^{-})}^{T}$$
(20)

and

$${V}_{ij}^{-0}=\mathop{\sum }\limits_{\alpha =1}^{M}{b}_{i\alpha }^{-}{b}_{j\alpha }^{0}={[({{\bf{B}}}^{-})\cdot {({{\bf{B}}}^{0})}^{T}]}_{ij}={{\bf{r}}}_{i}^{-}\cdot {({{\bf{r}}}_{j}^{0})}^{T},$$
(21)

that count the number of nodes (belonging to the second layer) to which node i (node j) is not connected and node j (node i) is connected with a negative link. While we are led to consider \({V}_{ij}^{00}\) as a signature of concordance (both nodes have no reason to connect to any of the nodes belonging to the opposite layer), we are also led to consider \({V}_{ij}^{0+}\), \({V}_{ij}^{+0}\), \({V}_{ij}^{0-}\), \({V}_{ij}^{-0}\) as a signature of discordance (one of the nodes has no reason to connect to any of the nodes belonging to the opposite layer while the other has (This is also logically coherent with the following situation: imagine i likes item α, j has not a connection with it and k dislikes it. Should j be considered as agreeing with both i and k, they also should for transitivity. But this is definitely not the case.)). The rightmost members indicate that the same numbers can be obtained by taking the scalar products of the rows, indexed by i and j, of the biadjacency matrices B+, B0, and B.

In summary, while accounting for \({V}_{ij}^{++}\), \({V}_{ij}^{--}\), \({V}_{ij}^{+-}\) and \({V}_{ij}^{-+}\) will lead to a zero-deflated projection scheme, where signs are solely determined by the ‘full’ dyadic motifs, also considering \({V}_{ij}^{00}\), \({V}_{ij}^{0+}\), \({V}_{ij}^{+0}\), \({V}_{ij}^{0-}\) and \({V}_{ij}^{-0}\) will lead to a zero-inflated projection scheme, where signs are determined by the ‘partial’ dyadic motifs too. For a pictorial illustration of the aforementioned ‘full’ and ‘partial’ dyadic motifs, we refer the reader to Figs. 6 and 7.

Fig. 6: Pictorial representation of a bipartite signed network with blue (positive) and red (negative) edges.
figure 6

The 'full' dyadic motifs, forming the 'bricks' of the zero-deflated projection scheme, are individuated by ij and in (+/−), il and im (−/+), jn (−/−), and im(+/+). Edges iβ, jα, and kδ do not contribute to any motif.

Fig. 7: Pictorial representation of a bipartite signed network with blue (positive), red (negative), and grey (missing) edges.
figure 7

The `full' dyadic motifs include ij and in (+/−), il and im (−/+), jn (−/−), and im (+/+). Additionally, node i participates in 'partial' motifs such as (0/+), (0/0), (+/0), and (0/−).

A scheme for the statistical validation of bipartite signed networks

Schematically, our validation algorithm works as follows:

  1. A.

    Focus on a specific pair of nodes belonging to the layer of interest, say i and j, and measure their similarity (see Section IV F);

  2. B.

    Quantify the statistical significance of the measured similarity, with respect to a properly defined benchmark, by computing the corresponding p value, say pij (see Section IV G);

  3. C.

    Repeat the step above for each pair of nodes;

  4. D.

    Apply a multiple-hypothesis testing procedure and connect the nodes i and j if and only if significantly similar (see Section IV H).

Let us stress that such a scheme is valid for both variants of our projection algorithm.

Step #1. Quantifying the similarity of any two nodes

Let us divide the next three sections in two subsections each: the first one will be devoted to devise a recipe for projecting a bipartite network that ignores the dyadic motifs constituted by, at least, one missing tie (the bipartite topology is considered ‘fixed’ and only the signs of the ‘full’ dyadic motifs are accounted for); the second one will be devoted to devise a recipe for projecting a bipartite network that accounts for the dyadic motifs constituted by, at least, one missing tie as well (the bipartite topology is considered ‘free’ and the signs of both the ‘full’ and the ‘partial’ dyadic motifs are accounted for). To avoid confusion, the quantities defined within the first framework will be underlined.

Zero-deflated projection scheme

The first step of our method prescribes measuring the degree of similarity of nodes i and j. To this aim, let us consider the quantity named signature and defined as:

$$\begin{array}{ll}{{S}_{ij}}=\mathop{\sum }\limits_{{\alpha =1}}^{{V}_{ij}}{b}_{i\alpha }{b}_{j\alpha }\\ \qquad=\mathop{\sum }\limits_{{\alpha =1}}^{{V}_{ij}}[({b}_{i\alpha }^{+}{b}_{j\alpha }^{+}+{b}_{i\alpha }^{-}{b}_{j\alpha }^{-})-({b}_{i\alpha }^{+}{b}_{j\alpha }^{-}+{b}_{i\alpha }^{-}{b}_{j\alpha }^{+})]\\ \qquad=\mathop{\sum }\limits_{{\alpha =1}}^{{V}_{ij}}({C}_{ij\alpha }-{D}_{ij\alpha })\\ \qquad={{C}_{ij}}-{{D}_{ij}}\end{array}$$
(22)

where the sum runs over the ‘full’ V-motifs (The symbol \(\mathop{\sum }\nolimits_{\underline{\alpha =1}}^{{V}_{ij}}(\ldots \,)\) indicates that the sum runs over the connected pairs of nodes and is equivalent to \(\mathop{\sum }\nolimits_{\alpha =1}^{M}| {b}_{i\alpha }{b}_{j\alpha }| (\ldots \,)\).), i.e., \({V}_{ij}\equiv \mathop{\sum }\nolimits_{\alpha =1}^{M}| {b}_{i\alpha }{b}_{j\alpha }|\). In words, the signature is the difference between two quantities, i.e., the concordance of nodes i and j, reading

$$\underline{{C}_{ij}}=\mathop{\sum }\limits_{\underline{\alpha =1}}^{{V}_{ij}}{C}_{ij\alpha }=\mathop{\sum }\limits_{\underline{\alpha =1}}^{{V}_{ij}}({b}_{i\alpha }^{+}{b}_{j\alpha }^{+}+{b}_{i\alpha }^{-}{b}_{j\alpha }^{-})={\underline{{V}_{ij}}}^{++}+{\underline{{V}_{ij}}}^{--}$$
(23)

and counting the number of ‘full’ concordant motifs, and the discordance of nodes i and j, reading

$$\underline{{D}_{ij}}=\mathop{\sum }\limits_{\underline{\alpha =1}}^{{V}_{ij}}{D}_{ij\alpha }=\mathop{\sum }\limits_{\underline{\alpha =1}}^{{V}_{ij}}({b}_{i\alpha }^{+}{b}_{j\alpha }^{-}+{b}_{i\alpha }^{-}{b}_{j\alpha }^{+})={\underline{{V}_{ij}}}^{+-}+{\underline{{V}_{ij}}}^{-+}$$
(24)

and counting the number of ‘full’ discordant motifs. As Fig. 6 shows, the pairs of nodes establishing motifs that are accounted for in the zero-deflated scheme are ij and in, establishing a (+/−) motif; il and im, establishing a (−/+) motif; jn, establishing a (−/−) motif; lm, establishing a (+/+) motif.

A naive way of projecting a bipartite signed network would prescribe to stop here and apply the sign function to the signature, hence connecting nodes i and j with a positive link if \({{S}_{ij}}\, > \,0\), i.e., \({{C}_{ij}} \,> \,{{D}_{ij}}\), and with a negative link if \({{S}_{ij}}\, < \,0\), i.e., \({{C}_{ij}}\, < \,{{D}_{ij}}\). More compactly,

(25)

Zero-inflated projection scheme

Let us, now, measure the degree of similarity of nodes i and j within the zero-inflated projection scheme. To this aim, let us consider the novel definition of signature reading

$${S}_{ij}=\mathop{\sum }\limits_{\alpha =1}^{M}({C}_{ij\alpha }-{D}_{ij\alpha })={C}_{ij}-{D}_{ij}$$
(26)

where the concordance between nodes i and j, now, reads

$${C}_{ij}=\mathop{\sum }\limits_{\alpha =1}^{M}{C}_{ij\alpha }={V}_{ij}^{++}+{V}_{ij}^{--}+{V}_{ij}^{00}$$
(27)

and the discordance between nodes i and j, now, reads

$$\begin{array}{ll}{D}_{ij}=\mathop{\sum }\limits_{\alpha =1}^{M}{D}_{ij\alpha }\\\qquad={V}_{ij}^{+-}+{V}_{ij}^{-+}+{V}_{ij}^{0+}+{V}_{ij}^{+0}+{V}_{ij}^{0-}+{V}_{ij}^{-0}.\end{array}$$
(28)

As Fig. 7 shows, the pairs of nodes establishing ‘full’ motifs are ij and in, establishing a (+/−) motif; il and im, establishing a (−/+) motif; jn, establishing a (−/−) motif; lm, establishing a (+/+) motif. Moreover, node i establishes a (0/+) motif with node j via node α, a (0/0) motif with nodes k, l, m and n via node α, a (+/0) motif with any other node via node β, a (0/−) motif with node k via node δ, etc.

A naive way of projecting a bipartite signed network would prescribe to stop here and apply the sign function to the signature, connecting nodes i and j with a positive link if Sij > 0, i.e., Cij > Dij, and with a negative link if Sij < 0, i.e., Cij < Dij. More compactly,

(29)

Notice that such a naive projection would be denser than the naive one induced by the zero-deflated definition of signature, the only possibility of observing being that of having Cij = Dij.

Step #2. Quantifying the statistical significance of similarity

Zero-deflated projection scheme

The second step of our method prescribes to evaluate the statistical significance of our nodes similarity. To this aim, let us find the probability distribution obeyed by \(\underline{{S}_{ij}}\), after noticing that

$$-{V}_{ij}\le \underline{{S}_{ij}}\le {V}_{ij}$$
(30)

where \(\underline{{S}_{ij}}=-{V}_{ij}\) if \(\underline{{C}_{ij}}=0\) (i.e., if each ‘full’ V-motif is composed by a −1 and a +1) and \({{S}_{ij}}={V}_{ij}\) if \({{D}_{ij}}=0\) (i.e., if each ‘full’ V-motif is composed by either two −1s or two +1s).

Let us, now, keep the formalism as general as possible and treat links as independent non-identically distributed (i.n.i.d.) random variables. This amounts at considering the finite scheme (Such a notation, introduced by Khintchine in Mathematical Foundations of Information Theory57, compactly represents a discrete probability distribution, by listing its support on the first row and the related probability coefficients on the second row.)

$${b}_{i\alpha } \sim \left(\begin{array}{ll}-1&+1\\ 1-{p}_{i\alpha }^{+}&{p}_{i\alpha }^{+}\end{array}\right)$$
(31)

i, α such that biα = 1, further inducing

$$\begin{array}{l}{b}_{i\alpha }{b}_{j\alpha } \sim \left(\begin{array}{ll}-1&+1\\ {p}_{i\alpha }^{+}+{p}_{j\alpha }^{+}-2{p}_{i\alpha }^{+}{p}_{j\alpha }^{+}&1-{p}_{i\alpha }^{+}-{p}_{j\alpha }^{+}+2{p}_{i\alpha }^{+}{p}_{j\alpha }^{+}\end{array}\right)\\ \qquad\quad=\left(\begin{array}{ll}-1&+1\\ 1-{q}_{ij\alpha }^{+}&{q}_{ij\alpha }^{+}\end{array}\right)\end{array}$$
(32)

i, j, α such that biαbjα = 1.

Since \({{S}_{ij}}\) is the sum of i.n.i.d. Bernoulli random variables, the probability distribution obeyed by it is the Poisson-binomial

$$P({{S}_{ij}}=s)=\sum _{{C}_{k}\in {{\mathcal{C}}}_{k}}\left[\prod _{\nu \in {C}_{k}}{q}_{ij\nu }^{+}\prod _{\tau \notin {C}_{k}}(1-{q}_{ij\tau }^{+})\right]$$
(33)

where \({{\mathcal{C}}}_{k}\) is the set of all possible k-tuples of which ν and τ are instances. As we will see in Section IV I, this is precisely the case of a benchmark enforcing linear local constraints.

The formula above simplifies in the case of a benchmark enforcing linear global constraints, the links becoming independent identically distributed (i.i.d.) random variables and the Poisson-binomial above reducing to a binomial.

For more details about the benchmarks, see Sections IV I 1, IV I 1 and Appendix A.

Zero-inflated projection scheme

Let us, now, find the probability distribution obeyed by Sij, after noticing that

$$-M\le {S}_{ij}\le M$$
(34)

where Sij = −M if Cij = 0 (i.e. if each V-motif is composed by a −1 and a +1) and Sij = M if Dij = 0 (i.e. if each V-motif is composed by either two − 1s or two + 1s).

As in the previous subsection, let us treat links link as i.n.i.d. random variables. Within the zero-inflated projection scheme, this amounts at replacing Eq. (31) with

$${b}_{i\alpha } \sim \left(\begin{array}{lll}-1&0&+1\\ {p}_{i\alpha }^{-}&{p}_{i\alpha }^{0}&{p}_{i\alpha }^{+}\end{array}\right)$$
(35)

i, α, further inducing

$${C}_{ij\alpha }-{D}_{ij\alpha } \sim \left(\begin{array}{ll}-1&+1\\ 1-{q}_{ij\alpha }^{+}&{q}_{ij\alpha }^{+}\end{array}\right)$$
(36)

with

$${q}_{ij\alpha }^{+}={q}_{ij\alpha }^{++}+{q}_{ij\alpha }^{--}+{q}_{ij\alpha }^{00},$$
(37)
$$1-{q}_{ij\alpha }^{+}={q}_{ij\alpha }^{+-}+{q}_{ij\alpha }^{-+}+{q}_{ij\alpha }^{0+}+{q}_{ij\alpha }^{+0}+{q}_{ij\alpha }^{0-}+{q}_{ij\alpha }^{-0}$$
(38)

where, for example, the coefficient \({q}_{ij\alpha }^{++}\) induces the finite scheme

$${b}_{i\alpha }^{+}{b}_{j\alpha }^{+} \sim \left(\begin{array}{ll}0&+1\\ 1-{p}_{i\alpha }^{+}{p}_{j\alpha }^{+}&{p}_{i\alpha }^{+}{p}_{j\alpha }^{+}\end{array}\right)=\left(\begin{array}{ll}0&+1\\ 1-{q}_{ij\alpha }^{++}&{q}_{ij\alpha }^{++}\end{array}\right)$$
(39)

and analogously for the others.

As a consequence, Sij becomes a sum of i.n.i.d. Bernoulli random variables obeying the Poisson-binomial reading

$$P({S}_{ij}=s)=\sum _{{C}_{k}\in {{\mathcal{C}}}_{k}}\left[\prod _{\nu \in {C}_{k}}{q}_{ij\nu }^{+}\prod _{\tau \notin {C}_{k}}(1-{q}_{ij\tau }^{+})\right]$$
(40)

where \({{\mathcal{C}}}_{k}\) is the set of all possible k-tuples of which ν and τ are instances. As in the previous subsection, if the benchmark induced by linear local constraints is replaced by the benchmark induced by linear global constraints, the links become i.i.d. random variables and the Poisson-binomial reduces to a binomial.

For more details about the benchmarks, see Sections IV I 2, IV I 2 and Appendix B.

Quantifying the statistical significance of similarity

Once we have calculated the distribution for each pair of nodes belonging to the layer of interest, we have to calculate the statistical significance of the empirical signature: as we have a signed quantity, we need both tails of such a distribution to carry out what is known as two-sided test of hypothesis. In what follows we will focus on \(\underline{{S}_{ij}}\) but the same considerations hold true for Sij.

Basically, we need to answer the two related questions i) is the empirical value of the signature significantly different from the one expected under the chosen benchmark? and ii) if so, is the deviation negative (hence, the signature is significantly smaller than expected) or positive (hence, the signature is significantly larger than expected)?

The first question can be answered upon calculating the two-sided p value reading

$${p}_{ij}=2\cdot \min \left\{F(\underline{{S}_{ij}^{* }}),1-F(\underline{{S}_{ij}^{*}})\right\}$$
(41)

with \({S}_{ij}^{* }\) being the empirical value of the signature and F being the cumulative distribution function, defined as \(F(\underline{{S}_{ij}^{* }})=\sum _{x\le \underline{{S}_{ij}^{* }}}P(\underline{{S}_{ij}}=x)\): such a number evaluates the probability of observing a deviation from the expected value in either directions.

The second question can be answered upon calculating the sign of such a deviation, by determining if either

$$F({{S}_{ij}^{* }})\, < \,1-F({{S}_{ij}^{* }})$$
(42)

or

$$F({{S}_{ij}^{*}})\, > \,1-F({{S}_{ij}^{*}})$$
(43)

holds true. In the first case, \(F({{S}_{ij}^{* }}) \,< \,1/2\), the empirical value of the signature is smaller than the median of the distribution and the deviation is negative; in the second case, \(F({{S}_{ij}^{* }})\, > \,1/2\), the empirical value of the signature is larger than the median of the distribution and the deviation is positive. Upon indicating the threshold individuated by the multiple-hypothesis testing procedure with pth (It represents the largest p value that satisfies the False Discovery Rate rejection criterion and is defined as \({p}_{th}=\hat{i}t/| H|\), with \(\hat{i}\) being the largest integer satisfying the condition \({\text{}p-\text{value}}_{\hat{i}}\le {p}_{th}\), t is the single-test significance level and H is the total number of tested hypotheses (see also Appendix C).), the following cases can be met:

  • The two conditions \(F({{S}_{ij}^{* }})\, < \,1/2\) and pij ≤ pth indicate that nodes i and j have established a number of ‘full’ discordant motifs that is so large to induce a significantly negative signature—and, potentially, a negative link in the projection (aij = −1);

  • The two conditions \(F({{S}_{ij}^{* }})\, > \,1/2\) and pij ≤ pth indicate that nodes i and j have established a number of ‘full’ concordant motifs that is so large to induce a significantly positive signature—and, potentially, a positive link in the projection (aij = +1);

  • The condition pij > pth individuates a value of the signature (i.e. of ‘full’ concordant/discordant motifs) that is compatible with the one predicted by the chosen benchmark: stated otherwise, the empirical number of motifs could have been observed in configurations generated by the benchmark itself—hence, inducing a null link in the projection (aij = 0).

For a graphical representation of our validation procedure, see Fig. 8.

Fig. 8: Probability distribution of the signature () and its Gaussian approximation ().
figure 8

The left panel provides a graphical answer to the question: Is the empirical value of the signature significantly different from the one expected under the chosen benchmark? while the right panel provides a graphical answer to the question is the deviation negative (hence, the signature is significantly smaller than expected) or positive (hence, the signature is significantly larger than expected)?, the red (blue) area corresponding to the region of validation of the negative (positive) links.

Step #3. Validating the projection

The second step of our method returns a symmetric matrix of p values. Individuating the ones associated with the hypotheses to be actually rejected requires a procedure to deal with the comparison of multiple hypotheses at the same time. In very general terms, a threshold has to be set: if the specific p value is smaller than such a threshold, the associated event is interpreted as statistically significant and the corresponding nodes are connected in the projection.

In the present paper, we apply the so-called False Discovery Rate (FDR) procedure58, allowing one to control for the expected number of ‘false discoveries’ (i.e. incorrectly rejected null hypotheses or incorrectly validated links), irrespectively of the independence of the hypotheses tested.

For more details about the FDR procedure, see Appendix C.

Maximum entropy benchmarks for bipartite signed networks

The second step of our method prescribes to quantify the statistical significance of the similarity of any two nodes i and j. To this aim, a statistical benchmark is needed. A natural choice leads to the adoption of models belonging to the class of the so-called Exponential Random Graphs, induced by the constrained maximisation of the Shannon entropy22,59,60,61,62,63

$$S=-\sum _{{\bf{B}}\in {\mathbb{B}}}P({\bf{B}})\ln P({\bf{B}}),$$
(44)

the sum running over a properly defined ensemble of configurations. Within such a framework, the generic bipartite network B is assigned the probability

$$P({\bf{B}})=\frac{{e}^{-H({\boldsymbol{\theta }},{\boldsymbol{C}}({\bf{B}}))}}{Z({\boldsymbol{\theta }})}$$
(45)

whose value is determined by the vector C(B) = {Ci(B)} of topological constraints via the Hamiltonian reading \(H({\boldsymbol{\theta }},{\boldsymbol{C}}({\bf{B}}))=\sum _{i}{\theta }_{i}{C}_{i}({\bf{B}})\). In what follows, we will employ linear constraints: beside allowing our optimisation problem to be analytically solved in these cases—P(B) can be written in a factorised form, i.e., as a product of pair-specific probability coefficients—they individuate the ‘right’ amount of information to be accounted for to obtain a projection: since we are interested in evaluating the statistical significance of co-occurrences, one must discount the tendency of nodes to establish connections with many/few neighbours.

In order to determine the unknown parameters θ, the likelihood-maximisation recipe can be adopted: given an observed biadjacency matrix B*, it translates into solving the system of equations

$$\langle {C}_{i}\rangle ({\boldsymbol{\theta }})=\sum _{{\bf{B}}\in {\mathbb{B}}}P({\bf{B}}){C}_{i}({\bf{B}})={C}_{i}({{\bf{B}}}^{* }),\quad \forall \,i$$
(46)

which prescribes to equate each ensemble average, e.g., 〈Ci〉(θ), to its observed counterpart, i.e., Ci(B*) 22,61,62,63,64.

Zero-deflated projection scheme

The first two benchmarks determine the coefficients of the two variants of the finite scheme defined in Eq. (31). As already stressed, dyadic motifs constituted by, at least, one missing tie are ignored, here: the bipartite topology is, thus, considered ‘fixed’. Upon indicating the total number of connected pairs of nodes with \(L=\mathop{\sum }\nolimits_{i=1}^{N}\mathop{\sum }\nolimits_{\alpha =1}^{M}| {b}_{i\alpha }| ={L}^{+}+{L}^{-}\) and considering that any node pair can be either positively or negatively connected, the ensemble is constituted by \(| {\mathbb{B}}| ={2}^{L}\) possible configurations in both cases.

Fixed-topology Bipartite signed random graph model

The Fixed-topology Bipartite Signed Random Graph Model (BiSRGM-FT) is defined by two global constraints, i.e., the total number of positive and negative links: the corresponding Hamiltonian, thus, reads

$$H({\boldsymbol{\theta }},{\bf{B}})={\beta }^{{\prime} }{L}^{+}({\bf{B}})+{\gamma }^{{\prime} }{L}^{-}({\bf{B}}).$$
(47)

Keeping the topology of the network under analysis fixed while (solely) randomising the edge signs implies that the role of random variables is played by the entries of the biadjacency matrix corresponding to the connected pairs of nodes, i.e., the ones for which biα = 1. Let us, however, notice that the Hamiltonian in Eq. (47) can be re-written as

$$H({\boldsymbol{\theta }},{\bf{B}})={\beta }^{{\prime} }{L}^{+}({\bf{B}})+{\gamma }^{{\prime} }(L-{L}^{+})({\bf{B}})$$
(48)
$$=({\beta }^{{\prime} }-{\gamma }^{{\prime} }){L}^{+}({\bf{B}})+{\gamma }^{{\prime} }L({\bf{B}})$$
(49)

and further simplified into

$$H({\boldsymbol{\theta }},{\bf{B}})=\beta {L}^{+}({\bf{B}})$$
(50)

where the constant term has been dropped, as it does not affect the computation of the probability coefficients, and the only relevant Lagrange multiplier has been re-named65. The generic entry, thus, obeys

$$\begin{array}{rcl}P({b}_{i\alpha }=+1)&=&\displaystyle\frac{{e}^{-\beta }}{1+{e}^{-\beta }}={p}^{+},\\ P({b}_{i\alpha }=-1)&=&\displaystyle\frac{1}{1+{e}^{-\beta }}=1-{p}^{+};\end{array}$$

in words, each entry satisfying biα = 1 obeys a Bernoulli distribution whose probability coefficients are determined by the imposed constraints: each existing link is assigned a ‘plus one’ with probability p+ and a ‘minus one’ with probability p.

To employ the BiSRGM-FT for studying real-world networks, the parameters that define it need to be properly tuned. More specifically, one needs to ensure

$${\left\langle {L}^{+}\right\rangle }_{{\rm{BiSRGM}}\text{-}{\rm{FT}}}={L}^{+}({{\bf{B}}}^{* })$$
(51)

with the symbol B* indicating the empirical network under analysis. The maximisation of the likelihood function \({{\mathcal{L}}}_{{\rm{BiSRGM}}\text{-}{\rm{FT}}}\equiv \ln {P}_{{\rm{BiSRGM}}\text{-}{\rm{FT}}}({{\bf{B}}}^{* })\) with respect to the unknown parameters that define it leads us to find

$${p}^{+}={L}^{+}({{\bf{B}}}^{* })/L({{\bf{B}}}^{* }).$$
(52)

For an alternative, yet equivalent, resolution of the problem, see Appendix D.

Fixed-topology bipartite signed configuration model

The Fixed-topology Bipartite Signed Configuration Model (BiSCM-FT), instead, is defined by local constraints, i.e., the positive and negative degree sequences: the corresponding Hamiltonian, thus, reads

$$\begin{array}{ll}H({\boldsymbol{\theta }},{\bf{B}})=\mathop{\sum }\limits_{i=1}^{N}[{\beta }_{i}^{{\prime} }{k}_{i}^{+}({\bf{B}})+{\gamma }_{i}^{{\prime} }{k}_{i}^{-}({\bf{B}})]\\ \qquad\qquad\quad+\mathop{\sum }\limits_{\alpha =1}^{M}[{\delta }_{\alpha }^{{\prime} }{h}_{\alpha }^{+}({\bf{B}})+{\eta }_{\alpha }^{{\prime} }{h}_{\alpha }^{-}({\bf{B}})];\end{array}$$
(53)

again, the role of random variables is played by the entries of the biadjacency matrix corresponding to the connected pairs of nodes and the Hamiltonian in Eq. (53) can be further simplified into

$$H({\boldsymbol{\theta }},{\bf{B}})=\mathop{\sum }\limits_{i=1}^{N}[{\beta }_{i}^{{\prime} }{k}_{i}^{+}({\bf{B}})+{\gamma }_{i}^{{\prime} }({k}_{i}-{k}_{i}^{+}({\bf{B}}))]$$
(54)
$$+\mathop{\sum }\limits_{\alpha =1}^{M}[{\delta }_{\alpha }^{{\prime} }{h}_{\alpha }^{+}({\bf{B}})+{\eta }_{\alpha }^{{\prime} }({h}_{\alpha }-{h}_{\alpha }^{+}({\bf{B}}))]$$
(55)
$$=\mathop{\sum }\limits_{i=1}^{N}{\beta }_{i}{k}_{i}^{+}({\bf{B}})+\mathop{\sum }\limits_{\alpha =1}^{M}{\delta }_{\alpha }{h}_{\alpha }^{+}({\bf{B}})$$
(56)

with obvious meaning of the symbols65. The generic entry, now, obeys

$$P({b}_{i\alpha }=+1)=\frac{{e}^{-({\beta }_{i}+{\delta }_{\alpha })}}{1+{e}^{-({\beta }_{i}+{\delta }_{\alpha })}}={p}_{i\alpha }^{+},$$
(57)
$$P({b}_{i\alpha }=-1)=\frac{1}{1+{e}^{-({\beta }_{i}+{\delta }_{\alpha })}}=1-{p}_{i\alpha }^{+};$$
(58)

in words, given any two connected nodes i and α, their link is assigned a + 1 with probability \({p}_{i\alpha }^{+}\) and a − 1 with probability \({p}_{i\alpha }^{-}\).

To tune the parameters defining the BiSCM-FT, we can maximise the likelihood function \({{\mathcal{L}}}_{{\rm{BiSCM}}\text{-}{\rm{FT}}}\equiv \ln {P}_{{\rm{BiSCM}}\text{-}{\rm{FT}}}({{\bf{B}}}^{* })\) with respect to the unknown parameters that define it: such a recipe leads us to find

$${\langle {k}_{i}^{+}\rangle }_{{\rm{BiSCM}}\text{-}{\rm{FT}}}={k}_{i}^{+}({{\bf{B}}}^{* }),\quad \forall \,i,$$
(59)
$${\langle {h}_{\alpha }^{+}\rangle }_{{\rm{BiSCM}}\text{-}{\rm{FT}}}={h}_{\alpha }^{+}({{\bf{B}}}^{* }),\quad \forall \,\alpha .$$
(60)

The system above can be solved only numerically, along the guidelines provided in ref. 66.

For an alternative, yet equivalent, resolution of the problem, see Appendices D and E. Here, we have solved it by employing the SIMONA Matlab-coded package, available at this URL.

Zero-inflated projection scheme

The other two benchmarks determine the coefficients of the two variants of the finite scheme defined in Eq. (35). As already stressed, dyadic motifs constituted by, at least, one missing tie are accounted for, here: the bipartite topology is, thus, considered ‘free’. Since the total number of node pairs is N M and any node pair can be positively connected, negatively connected or disconnected, the ensemble is constituted by \(| {\mathbb{B}}| ={3}^{N\cdot M}\) possible configurations in both cases.

Free-topology bipartite signed random graph model

The Free-topology Bipartite Signed Random Graph Model (BiSRGM)13 is induced by the same Hamiltonian inducing its fixed-topology counterpart, i.e., H(θ, B) = βL+(B) + γL(B) but treating the topology as ‘free’ does not allow for any simplification. The generic entry obeys

$$P({b}_{i\alpha }=+1)=\frac{{e}^{-\beta }}{1+{e}^{-\beta }+{e}^{-\gamma }}={p}^{+},$$
(61)
$$P({b}_{i\alpha }=-1)=\frac{{e}^{-\gamma }}{1+{e}^{-\beta }+{e}^{-\gamma }}={p}^{-}$$
(62)

and p0 ≡ 1 − p − p+. In words, biα obeys a generalised Bernoulli distribution whose probability coefficients are determined by the imposed constraints. Each positive link appears with probability p+, each negative link appears with probability p and each missing link has a probability p0.

The maximisation of the likelihood function \({{\mathcal{L}}}_{{\rm{BiSRGM}}}=\ln {P}_{{\rm{BiSRGM}}}({{\bf{B}}}^{* })\) with respect to the unknown parameters that define it leads us to find

$${p}^{+}={L}^{+}({{\bf{B}}}^{* })/(N\cdot M),$$
(63)
$${p}^{-}={L}^{-}({{\bf{B}}}^{* })/(N\cdot M)$$
(64)

and p0 ≡ 1 − p − p+.

For more details, see Appendix D.

Free-topology bipartite signed configuration model

The Free-topology Bipartite Signed Configuration Model (BiSCM)13 is induced by the same Hamiltonian inducing its fixed-topology counterpart, i.e., \(H({\boldsymbol{\theta }},{\bf{B}})=\mathop{\sum }\nolimits_{i=1}^{N}[{\beta }_{i}{k}_{i}^{+}({\bf{B}})+{\gamma }_{i}{k}_{i}^{-}({\bf{B}})]+\mathop{\sum }\nolimits_{\alpha =1}^{M}[{\delta }_{\alpha }{h}_{\alpha }^{+}({\bf{B}})+{\eta }_{\alpha }{h}_{\alpha }^{-}({\bf{B}})]\) but, as for the BiSRGM, it cannot be further simplified. The generic entry, thus, reads

$$P({b}_{i\alpha }=+1)=\frac{{e}^{-({\beta }_{i}+{\delta }_{\alpha })}}{1+{e}^{-({\beta }_{i}+{\delta }_{\alpha })}+{e}^{-({\gamma }_{i}+{\eta }_{\alpha })}}={p}_{i\alpha }^{+},$$
(65)
$$P({b}_{i\alpha }=-1)=\frac{{e}^{-({\gamma }_{i}+{\eta }_{\alpha })}}{1+{e}^{-({\beta }_{i}+{\delta }_{\alpha })}+{e}^{-({\gamma }_{i}+{\eta }_{\alpha })}}={p}_{i\alpha }^{-}$$
(66)

and \({p}_{i\alpha }^{0}\equiv 1-{p}_{i\alpha }^{-}-{p}_{i\alpha }^{+}\). In words, biα obeys a generalised Bernoulli distribution whose probability coefficients are determined by the imposed constraints. Given any two nodes i and α, they are connected by a positive link with probability \({p}_{i\alpha }^{+}\), by a negative link with probability \({p}_{i\alpha }^{-}\) and are disconnected with probability \({p}_{i\alpha }^{0}\).

The maximisation of the likelihood function \({{\mathcal{L}}}_{{\rm{BiSCM}}}=\ln {P}_{{\rm{BiSCM}}}({{\bf{B}}}^{* })\) with respect to the unknown parameters that define it leads us to find

$${k}_{i}^{+}({{\bf{B}}}^{* })=\langle {k}_{i}^{+}\rangle ,\quad \forall \,i,$$
(67)
$${k}_{i}^{-}({{\bf{B}}}^{* })=\langle {k}_{i}^{-}\rangle ,\quad \forall \,i,$$
(68)
$${h}_{\alpha }^{+}({{\bf{B}}}^{* })=\langle {h}_{\alpha }^{+}\rangle ,\quad \forall \,\alpha ,$$
(69)
$${h}_{\alpha }^{-}({{\bf{B}}}^{* })=\langle {h}_{\alpha }^{-}\rangle ,\quad \forall \,\alpha .$$
(70)

The system above can be solved only numerically, along the guidelines provided in ref. 66.

For more details, see Appendices D and E. Here, we have solved it by employing the SIMONA Matlab-coded package, available at this URL.

Let us conclude this section by mentioning that BiSCM can be obtained as a special case of the Bipartite Score Configuration Model presented in ref. 67.