Introduction

Many human diseases are closely tied to evolutionary processes. In particular, cancers, often referred to as a “disease of evolution”1,2, are driven by the continuous diversification of intra-tumor subclonal lineages, which evolve to compete for resources, metastasize, and evade the immune system and therapeutic interventions3,4,5,6,7,8,9. Similarly, intra-host viral evolution, shaped by selective pressures from the host immune system and therapeutic treatment, is a key factor underlying adaptive immunity evasion10,11,12,13, disease severity and progression14,15,16,17, tissue tropism18,19, the establishment and persistence of chronic or prolonged infections20,21, transmissibility22, and latency23,24.

Although the biological mechanisms of viral infections and cancers are different, both are fueled by highly mutable populations of disease-causing agents, whose extreme genomic diversity originates from error-prone replication processes, whether due to the lack of a proofreading mechanism in RNA-dependent RNA polymerase or retroviral reverse transcriptase in RNA viruses25,26,27, or from the genetic instability of tumor cells manifesting itself in somatic mutations, chromosomal gain, loss, translocation, and aneuploidy9,28,29,30. Consequently, the general phylogenetic methodologies applied to these populations exhibit many similarities. From a methodological standpoint, they form a unique segment of phylogenetics and phylodynamics, fostering a mutual exchange of concepts that enhances all areas of application31,32,33.

In this study, we specifically focus on tissue-to-tissue metastatic migration and host-to-host viral transmission - migration behaviors characteristic of cancer and viral populations, respectively, understanding of which is essential for public health and medical research34,35,36,37. These research areas especially benefited from groundbreaking advancements in sequencing technologies. State-of-the-art high-throughput targeted sequencing and single-cell DNA sequencing enable the capture of detailed population snapshots at exceptionally high resolutions, facilitating fine-grained analysis down to the level of individual genotypes37,38,39,40. In particular, this allows us to examine transmissions and migrations on the level of individual events38,41,42,43.

A wide array of methods has been developed specifically for reconstructing viral and bacterial transmissions, reflecting the substantial interest in tracking the spread of infectious diseases. The arsenal of tools available for reconstructing transmission histories is extensive, including but not limited to Outbreaker, Outbreaker 244,45, SeqTrack46, SCOTTI47, SOPHIE48, Phybreak49, Bitrugs50, BadTrIP51, Phyloscanner52, StrainHub53, TransPhylo54,55 (along with its extension TransPhyloMulti56), STraTUS57, TreeFix-TP58, QUENTIN59, VOICE60, HIVTrace61, GHOST62, MicrobeTrace63, SharpTNI64, TiTUS33, TNeT65, AutoNet66, BREATH67 and others55,68,69,70,71,72,73,74,75. These tools have been instrumental in the investigation of outbreaks and monitoring the transmission dynamics of pathogens like HIV, hepatitis C (HCV), SARS, MERS and SARS-CoV-241,76,77,78,79,80. Some methods are specifically designed for outbreak investigations with complete or near-complete sampling (including the method presented in this paper), while others account for unsampled hosts and are better suited for studying transmission dynamics in large-scale epidemiological settings.

Similarly, the development of methods for deducing metastatic spread histories is burgeoning, driven by advancements in single-cell DNA sequencing and CRISPR-based lineage tracing technologies. Currently, the repertoire of tools in this domain includes MACHINA43, FitchCount (as part of the Cassiopeia suite)42,81, PathFinder82, TCC, PCC, and PCCH83. Notwithstanding the relatively shorter list, the impact of these tools is growing, with several studies published recently leveraging these methods to gain insights into the mechanisms of metastatic spread42,84,85,86.

Despite significant progress in the field, the wide variety of existing methods underscores that the challenge of accurately inferring both types of spread remains unresolved. This diversity of approaches indicates both the complexity of the problem and the ongoing efforts to refine and improve upon existing methods.

Phylogenetics and phylodynamics provide the most widely used methodological frameworks for viral transmission and metastatic migration history reconstruction43,56. However, their application in this context is not straightforward. A phylogenetic tree does not directly equate to a migration or transmission tree56, as the nodes in a phylogenetic tree represent divergences of lineages associated with different types of events56,87. While some of these divergences may result from the seeding of new sites, others occur within previously invaded sites. Thus, constructing a migration tree from a phylogenetic tree requires reconstructing the ancestral states of internal nodes-an approach known as ancestral trait inference. In this context, the ancestral trait corresponds to the location of each lineage, and the goal is to determine whether each divergence event occurred within the same site or was caused by migration, leading to the seeding of a new site. Furthermore, it is crucial to effectively leverage the full spectrum of intra-site (within-host or within-tumor) population diversity uncovered by high-throughput sequencing, that often provides a strong signal for migration inference. For example, the presence of paraphyletic relationships between populations-where genomic variants from one population form a clade in the phylogenetic tree, but some members are more closely related to individuals from another population than to their own, suggests recent seeding of one site from another60,88,89,90.

The challenges mentioned above highlight several crucial questions that remain unresolved and warrant further exploration. These questions include:

1) To what extent and under which conditions does the topology of a phylogenetic tree reflect the structure of the underlying migration tree? This question becomes particularly important when the level of paraphyly in the labeled phylogeny is low, a situation not uncommon, as the paraphyletic signal tends to diminish over time and with smaller sample sizes88. A number of studies have looked at this subject, but their conclusions were mixed. Some studies have found that certain migration patterns, such as super-spreading or, more generally, migrations over networks formed by different models like Erdös-Rényi91, preferential attachment92, or Watts-Strogatz models93, lead to quantitatively distinct phylogenetic tree topologies94,95,96. In addition, the spatial structure and dynamics of heterogeneous populations, which are directly related to migration pathways, have been shown to affect the phylogeny structure of both viral and tumor populations97,98,99,100. On the other hand, other studies report that the direct impact of migration patterns on phylogenetic trees ranges from minimal to moderate101,102,103,104. Finally, certain studies have drawn mixed conclusions, indicating that while some migration characteristics are reflected in the phylogeny, others are not105.

2) How can we limit the solution space to balance computational feasibility, accuracy of inference, generalizability, and biological realism? The space of migration trees consistent with given phylogenetic trees is often vast, and its properties are not well understood33,57. A sampling-consensus approach is one method to address solution ambiguity, where feasible migration trees are sampled and summarized in a weighted consensus graph, with weights reflecting posterior probabilities of edges33,42,47,64,65. However, the size of the solution space may restrict the depth of sampling. As a response, it is common practice to narrow down the solution space to a set of plausible migration trees optimizing a specific objective function under evolutionary-based constraints. Employing constrained models also aids in preventing overfitting in the presence of missing data and errors.

Various objectives and constraints have been implemented by existing methods. Limiting the number of migration events, sizes of bottlenecks or numbers of back-migrations33,42,43,46,52,57,65,81,83 is more computationally efficient and scalable due to the utilization of dynamic programming106,107, making such approaches practical in both molecular epidemiology and computational oncology. These can also be formulated as Integer Linear Programming (ILP) problems108 and solved with reasonable efficiency using existing ILP solvers. Models with more complex Bayesian objectives with constraints regularized as priors44,47,49,51,56 offer a richer, biologically nuanced perspective but suffer from scalability issues and usually rely on generic methods like Markov Chain Monte Carlo (MCMC) sampling, which may not yield optimal solutions, in part due to a lack of problem-specific mathematical strategies. Balancing computational efficiency with biological comprehensiveness presents a notable challenge, compounded by the uncertainty of how much constraints and objectives truly limit the solution space33,57.

3) How to incorporate a variety of models for phylogenetic inference of transmissions and migrations into a unified computational framework?

Migration inference models draw on varied biological or epidemiological assumptions. For instance, viral transmission inference often incorporates case-specific temporal data like infectious periods, exposure intervals, symptom onset, diagnosis or sample collection dates to establish the order of infections and eliminate unlikely transmission links33,44,47,49,50,54,69. In some rare cases, contact networks are known and can be used for the same purpose80,109. While effective, such data is often unavailable, non-informative, or sensitive, particularly for endemic and pandemic diseases caused by HIV, Hepatitis C, SARS-CoV-2, or Influenza48,59,75. In situations where case-specific data cannot be used, genomic epidemiology tools resort to broader assumptions, like the expected degree distribution of transmission networks implied by the structure of a susceptible population48,59. Similarly, methods for inferring metastatic spread use constraints defined by so-called migration patterns that reflect realistic cancer migration scenarios, such as monoclonal, polyclonal or multi-source seeding43,83. A commonality across all these methods is the use of structural constraints on feasible transmission trees that are considered as subgraphs of a larger “pattern” graph. It suggests that these diverse approaches can be unified within a single Bayesian inference framework based on such topological constraints and/or priors.

Addressing these challenges requires a comprehensive investigation into the mathematical properties of migration trees consistent with a given phylogeny, a topic that is not yet fully understood57. Our study aims to advance this area by developing a methodology based on powerful techniques from structural graph theory and combinatorics. In addition, we will use this methodology to introduce algorithmic approaches for inferring migration trees.

A number of earlier studies has achieved important progress in this area. Several studies noticed that migration trees consistent with a given phylogeny correspond to partitions of the phylogeny’s node set or to coloring of its branches54,55,57,73. These observations have informed the development of methods to enumerate and sample these trees, assuming a complete bottleneck57 and a known sequence of migrations110. In fact, as we argue in this paper, the relationship between a phylogenetic tree and a migration tree is described by a graph homomorphism - the mathematical concept that generalizes both partitions and colorings.

Graph homomorphism is essentially a mapping between the vertices of two graphs that preserves their structure111. The theory of graph homomorphisms is a well-established area of structural and algebraic graph theory, with deep results and rich methodology. All types of migration trees discussed so far can be described by a graph homomorphism with specific constraints. For example, migration trees consistent with a given phylogeny under the assumptions of complete sampling and complete bottleneck, that has been studied in refs. 57,73 are minors112,113 of the phylogeny or, equivalently, its homomorphic images such that inverse images of all vertices are connected subtrees.

We use mathematical and algorithmic machinery of graph homomorphism theory to delve into the details of viral transmission and metastatic migration inference. Specifically, we formulate several variants of the migration inference problem in mathematical terms and discuss their applicability in different real-world settings, provide necessary and sufficient conditions describing trees consistent with a given phylogeny, and propose a series of algorithms that evaluate the consistency of phylogenetic and migration trees under various evolutionary scenarios through the construction of corresponding homomorphisms.

Based on aforementioned insights, we propose a general framework for migration inference under complete or nearly-complete sampling - i.e., when the major sources of spread are included in the sample. Our method draws candidate migration trees from a chosen prior random tree distribution and identifies a subsample of trees consistent with a given phylogeny. By varying prior tree distributions, this approach expands upon and generalizes several existing models, offering a versatile and computationally efficient strategy applicable to a variety of settings arising in outbreak investigations and metastatic spread studies. Crucially, the proposed framework is computationally fast, enabling biomedical and public health researchers to quickly test different tree priors that represent common migration models. Beyond that, proposed methods can be used for investigating how phylogenies constrain the space of possible migration trees, for inferring an order of known or suspected migration events, and for determining potential events that are definitively ruled out by a phylogeny57. For studying viral transmissions in epidemiological settings with sparse sampling, alternative approaches such as BREATH67 and TransPhyloMulti56 may be more appropriate. It is also important to distinguish the problem addressed in this paper from phylogeography inference that focuses of pathogen spread between geographical locations114, which requires specific modeling approaches that lie outside of the scope of this study.

The paper is organized as follows. In Methods, we first introduce essential graph-theoretical concepts (Subsection 4), then present four problem formulations for inferring migration histories under various structural constraints (Subsection 4). Subsections 4-4 address each problem in turn, detailing their computational complexity and/or solution methods via ILP or dynamic programming.

We validated our methodology using both simulated and real experimental data from studies of viral outbreaks and cancer metastasis, demonstrating its effectiveness across diverse scenarios. Subsections 4-4 present results on synthetic data, where we explored the impact of structural constraints on the relationship between a phylogeny and a migration tree, and benchmarked our method against state-of-the-art tools. In Subsection 4, we applied our approach to Hepatitis C (HCV) outbreak data, accurately inferring transmission links in complex phylogenies with highly intermixed intra-host populations. Finally, Subsection 4 shows its application to cancer data, where it effectively quantified and reduced uncertainty in metastatic migration inference.

Results

Simulated data generation

To generate synthetic data, we used FAVITES115, a tool capable of simulating genomes, phylogenies, and migration networks under various evolutionary scenarios. Although originally designed to simulate viral outbreaks, FAVITES supports general phylogenetic and population genetics models, making it suitable for simulating migrations of heterogeneous populations besides viruses. It should also be noted that, to the best of our knowledge, specialized simulation tools for metastatic spread with capabilities comparable to FAVITES are currently not available.

We simulated the migration of a heterogeneous population over a network of sites formed according to the Barabasi-Albert model92. This assumption can be valid for both viral116,117 and cancer118,119 spread. In the basic model, migrations occur at a constant rate along each network edge (in a viral context, this corresponds to the network-based Susceptible-Infected (SI) transmission model). Within each site, phylogenies evolved under the coalescent model with the exponential growth, that has been previously used to simulate intra-host88 and intra-tumor120 evolution. Genotypes were assumed to evolve under the GTR+Γ substitution model, and were sampled simultaneously at the end of the simulation, with 100 sequences per deme.

Beyond the basic model, we examined the impact of additional simulation settings, including variable migration rates, logistic intra-site population growth (i.e., assuming that the coalescence rate depends on a time-varying population size following a logistic growth model121), as well as smaller and variable sample sizes per deme (see Supplementary Note 5). In all cases, including the basic model, simulations with a low number of invaded sites were discarded, resulting in datasets averaging 257 simulated migration histories per combination of models and constraints, spanning 5 to 30 demes. In what follows, if not specifically stated otherwise, we are discussing the results for the basic model.

In the first series of experiments, we sampled candidate migration trees from 3 distinct prior distributions of tree topologies and evaluated their consistency with simulated exponential growth phylogenies under three different types of constraints. The prior distributions included:

  • T1) Degenerate distribution consisting of the true topology. Even though this scenario is unrealistic for actual migration inference, it serves as a test to determine if migration links can be accurately reconstructed when the topology is known but sites need to be correctly mapped to migration tree vertices. To avoid bias linked to correlations between vertex IDs and migration times, that can be potentially introduced by the simulation methods, we produced multiple samples with randomly permuted vertex IDs.

  • T2) Random scale-free trees produced by the preferential attachment procedure.

  • T3) Uniformly distributed trees of a given size. Sampling was performed by generating random Prufer codes122, integer sequences of length n − 2 that uniquely define n-vertex trees.

The following types of constraints were used:

  • H1) unconstrained homomorphism;

  • H2) convex homomorphism minimizing the number of compactness constraint violations;

  • H3) convex and compact homomorphism.

For each simulated dataset, we produced 9 samples of candidate migration trees corresponding to all combinations of conditions T1–T3 and H1–H3. Each combination of a simulated dataset and model parameters or tools is further referred to as a test instance. The sample sizes ranged from 1000 for the degenerate distribution without constraints, up to 1,000,000 for the uniform distribution with convexity and compactness constraints. This variation in sample size was necessary because stricter constraints require larger samples to ensure that a sufficient number of feasible trees are produced. We assessed whether these sampled trees are consistent with the given phylogenies (i.e., the former are homomorphic images of the latter) under the respective constraints, and extracted subsamples of consistent trees.

Sampled trees were compared with true migration trees produced by FAVITES. Individual trees were compared by measuring recall, defined as the fraction of inferred transmission edges among true transmission edges; precision, the fraction of true transmission edges among inferred transmission edges; and the f-score, i.e., the harmonic mean of precision and recall.

In addition, we summarized each subsample of consistent migration trees using a consensus graph, where each edge is weighted by the proportion of candidate trees that support that edge47,55,123. A solution can be extracted from the consensus graph by discarding edges with support below a predefined threshold. For each graph, we estimated the area under the precision-recall curve, which was calculated by varying the support threshold. We opted for the precision-recall curve instead of the more common ROC curve due to the imbalance between the classes of true and false migration edges.

Simulated data results

We found that the relationship between phylogenetic trees and migration trees is heavily influenced by structural constraints. In the absence of constraints, there is considerable ambiguity in the possible migration histories that align with a given phylogenetic tree topology, even when prior knowledge about the true migration tree topology is available. Notably, almost every sampled scale-free tree proved consistent with the given phylogenies, with a median fraction of consistent trees at μ = 1 (Fig. 1b). It is not entirely unexpected in light of Theorem 1, which suggests that trees with a low diameter – a common feature of scale-free networks – are more likely to be consistent with a given phylogenetic tree. However, even among uniformly sampled trees, a high consistency rate was observed when no constraints are applied (median μ = 0.84, Fig. 1c).

Fig. 1: Relationship between phylogenetic and migration trees.
figure 1

ac Percent of sampled trees (y-axis) that are consistent with given phylogenies across n = 816 test instances. df Area under the precision-recall curve (AUC, y-axis) calculated by varying the support threshold for n = 570 (degenerate prior), n = 815 (scale-free prior) and n = 800 (uniform prior) test instances with non-empty samples of consistent trees. Each boxplot displays the interquartile range (IQR) as the box, the median as the center line, whiskers extending to the furthest points within 1.5 × IQR, and outliers plotted beyond this range. Source data are provided as a Source Data file.

Furthermore, without constraints, individual consistent trees display only marginal agreement with true migration trees. This holds not only for scale-free and uniformly sampled trees, but even for the degenerate distribution, with median f-score within the range 0.33–0.36 for all three tree priors (Supplementary Fig. 3). In other words, even if the topology of a true migration tree is known, numerous labelings of that topology are consistent with the original phylogeny, most of which substantially diverge from the true labeling. Consequently, the true labeling is not immediately distinguishable among the alternatives without additional information. Combining a consistent tree subsample into a consensus graph, however, brings it closer to the true migration tree, with the median AUC values ranging from 0.56 to 0.59 for all three tree priors (see Fig. 1).

The introduction of convexity and compactness constraints significantly reduces the percentage of trees deemed consistent, as can be expected (Fig. 1a–c). This reduction primarily eliminates incidental solutions, enhancing the alignment of the trees that meet these constraints with the true migration trees (Fig. 1d–f). In particular, for the degenerate distribution, introducing constraints effectively filters out ambiguous vertex mappings, nearly always recovering the true mapping, provided that the solutions satisfying the constraints exist.

Prior knowledge of the migration tree structure is also beneficial for inference, with AUCs improving as the tree prior becomes tighter. In particular, under constraints, AUCs for the scale-free prior are significantly higher than those for the uniform prior (p = 4.4 10−6 and p = 1.76 10−16 for convex and both convex and compact cases, respectively, Kruskal–Wallis test). Given that true migration trees are generated by preferential attachment, this suggests that under constraints, the phylogeny, to a certain degree, reflects the properties of the underlying migration tree.

As expected, the number of sequences sampled per site affects the accuracy of inference. Notably, this effect is non-linear: while accuracy declines gradually for larger sample sizes, it drops sharply when the sample size is reduced to one (Supplementary Fig. 3). However, this is explainable, given that the major phylogenetic signals of migration, such as paraphyly, are driven by asymmetries in population structure60,88. These asymmetries persist even with relatively small intra-site samples but disappear completely when only a single genome per site is available. Variable sampling rates per deme and variable migration rates had no significant impact on the inference accuracy (Supplementary Fig. 4).

When entire sites are sampled at random rates of ρ = 0.5 and ρ = 0.75, the algorithm’s accuracy in inferring transmission links between sampled sites also decreases, but to a lesser degree than may be initially anticipated, still exceeding the performance of most other tools under perfect sampling (Supplementary Fig. 5). Notably, accuracy improved with the migration tree size; for instance, at ρ = 0.5, the f-score for trees with more than 15 vertices increased from 0.50 to 0.58. This feature reflects the so-called robust but fragile property of scale-free networks124.

Taken together, these observations indicate that phylogeny topologies do indeed reflect underlying migration tree structures, but the extent of this reflection is influenced by evolutionary constraints. Moreover, the correspondence between phylogenies and migration trees is primarily discernible when analyzed statistically across a large sample of feasible migration trees that are consistent with the phylogeny. A single consistent tree may be arbitrary, and thus relying on a single solution may lead to misleading conclusions.

Furthermore, introducing constraints can enhance inference accuracy even when the ground truth does not satisfy them. This is because constraints reduce the size of the solution space, allowing for more efficient and focused exploration. While the ground truth may not strictly adhere to the constraints, it may be close enough that the sampled constrained solutions remain not too far from it. As a result, averaging over these solutions can yield an accurate tree estimate.

Based on those observations, we have developed a method named SMiTH (Sampling Migration Trees using Homomorphisms) for the constrained inference of migration trees with expected general properties. This method involves sampling candidate migration trees from a designated random tree distribution, identifying convex homomorphisms from the given phylogeny to these sampled trees while minimizing an objective function defined by the number of compactness constraint violations, constructing a consensus graph from trees with top objective values, and ultimately inferring the final migration tree as the minimal spanning tree of this consensus graph.

We benchmarked SMiTH against several existing tools designed to infer migration histories from phylogenetic tree topologies. For the sake of fairness, tools that use dated phylogenies and/or case-specific epidemiological information were not considered. The tools selected for this comparison include Cassiopeia42,81, MACHINA43, Phyloscanner52, STraTUS57, and TNet65. For MACHINA, we ran all four migration models provided by the tool and report the best result, which was achieved using the single-source seeding model. STraTUS generates a sample of migration trees rather than a single tree; thus, similarly to SMiTH, we used the minimum spanning tree of the consensus graph for benchmarking purposes.

The results of the algorithm comparison for the exponential population growth model are shown in Fig. 2. It was found that both variants of SMiTH – with uniform and scale-free tree priors – allow for a statistically significant improvement over other tools (p < 10−9, multiple comparison of f-score distributions by Kruskal–Wallis test). SMiTH is followed by Cassiopeia and STraTUS – two other sampling-based methods, whose accuracies were statistically indiscernible (p = 0.58, Kruskal–Wallis test). Both tools produce convex solutions, albeit using different algorithms. While STraTUS achieves this through sampling, Cassiopeia’s FitchCount module determines the number of unique most parsimonious solutions, which, in our examples, were almost always convex using dynamic programming. These tools were followed by MACHINA that, similarly to our approach, imposes structural constraints on plausible migration trees by considering them as subgraphs of so-called transition patterns. However, MACHINA produces a single solution optimizing particular objectives rather than summarizes a sample of such solutions; this, in light of the observations described above, likely hampered its performance vis-à-vis sampling-based methods. Similar reasoning can be applied to Phyloscanner, that also produces a single most parsimonious solution. In addition, Phyloscanner is specifically designed to make use of paraphyly, that usually provides a strong signal for migration88 when present; consequently, its accuracy can be affected when, as in analyzed test cases, the number of paraphyletic clades is limited. Furthermore, Phyloscanner is specifically designed for scenarios with sparse sampling, which is different from the settings used in this study. In particular, it usually assumes that when the populations sampled from different cities are monophyletic, then they all have a common unobserved source88 - the assumption that is opposite to compactness and that seems not to be always valid. The relation between algorithms’ performances is mostly stable with regard to the number of migration sites (Fig. 2b).

Fig. 2: Methods benchmarking.
figure 2

a Summary statistics of methods performance on all simulated datasets (n = 1900 test instances in total). The boxplot displays the interquartile range (IQR) as the box, the median as the center line, whiskers extending to the furthest points within 1.5 × IQR, and outliers plotted beyond this range. Source data are provided as a Source Data file. b Median f-scores of different methods on simulated datasets with varying numbers of populations. c Median running times of SMiTH for the construction of a constrained homomorphism for a given phylogenetic tree and candidate migration tree as a function of the number of migration sites.

The accuracy of all algorithms decreases for data simulated under the logistic coalescent model (Supplementary Fig. 6), consistent with observations in ref. 48. However, the ranking of algorithms remains largely stable, with both variants of SMiTH maintaining the top positions. Notably, SMiTH experiences a smaller decline in accuracy compared to other methods.

The median running time of our method for construction of a constrained homomorphism for a given phylogenetic tree and candidate migration tree was within 0.7 seconds for all tests and tree priors (Fig. 2c), even though theoretically our algorithms can be exponential in the worst case. It allows us, using straightforward parallelization, to produce and process samples consisting in hundreds of thousands of candidate trees in a reasonable time.

Experimental viral data

Analysis of simulated data highlights the role of structural constraints in migration tree inference, particularly when paraphyly is limited. Interestingly, these constraints prove just as essential in scenarios with a high degree of paraphyly, albeit for different reasons. This is evidenced by the analysis of data of Hepatitis C (HCV) outbreaks, which have been considered in previous studies48,59,60,65,75. The data comprises intra-host HCV populations from several outbreaks investigated by the Centers for Disease Control and Prevention, each population consisting of sequences covering Hypervariable Region 1 (HVR1) of the HCV genome. In each outbreak, a single primary host infected all other hosts, rendering the migration tree in graph-theoretical terms a star.

We analyzed the two largest outbreaks involving 15 and 19 infected hosts. For each outbreak, we used phylogenetic trees that were constructed in a prior study48. All transmissions occurred within a short time frame and, as a result, intra-host populations are highly intermixed (Fig. 3a). This makes paraphylitic signal strong, but oversaturated, thus impeding its use to reconstruct true transmission history.

Fig. 3: The analysis of experimental HCV data.
figure 3

a Phylogenetic trees of HCV variants from two outbreaks. Variants sampled from different hosts are highlighted in different colors. b Graphs formed by edges corresponding to adjacent tree nodes with unique Fitch labels. True edges are highlighted in red. c Consensus networks of the top 1% of sampled trees with respect to the objective (3). Edge thicknesses are proportional to their frequencies, true edges are highlighted in red.

This effect can be demonstrated by examining internal node labels generated by the Fitch algorithm, that serves as a basis for several methods considered in the previous section. In the trees analyzed, 32–34% of internal nodes were assigned a single provisional label during the post-order traversal step of the dynamic programming algorithm, indicating that these labels appear in all most parsimonious solutions (or solutions with the minimal migration number in terms of ref. 43). Many of these nodes are adjacent, suggesting that the transmission links they represent will be identified by any parsimony-based sampling approach similar to those employed by existing tools33,42,65. For one outbreak, these links form a connected graph, whereas in the other, only one host does not integrate into this single connected component (see Fig. 3b). These resulting graphs are relatively dense and include not only true edges but also a significant number of false positives (see Fig. 3b). Consequently, even if all true positive edges are correctly identified using nodes with multiple Fitch labels, the f-scores would not exceed 0.54 and 0.51, respectively. Enhancing the accuracy of this approach requires the filtering out of false positive edges, achievable only through the integration of additional prior information or constraints.

In contrast, sampling unconstrained candidate transmission trees from the scale-free tree distribution produces the results that are significantly closer to true transmission histories (Fig. 3c). For consensus networks derived from these samples, areas under the precision-recall curve are estimated at 0.76 and 0.70. Furthermore, f-scores of solutions obtained as minimal spanning trees of consensus networks produced from the top 1% of sampled trees according to the objective (3) are 0.86 and 0.94. Comparable results – f = 0.79 and f = 0.89 – are obtained if we use the top 1% of sampled trees based on the parsimony score.

Experimental cancer data

We employed SMiTH to analyze the migration history of metastatic ovarian cancer using the data published in ref. 125. The dataset comprised whole-genome and targeted sequencing data from samples collected at various anatomical sites, including the left ovary (LOv), the right ovary (ROv), and several metastases. In the original study, migration histories were inferred using hierarchical clustering trees and a Dollo parsimony model. Subsequent re-analysis using MACHINA43 revealed several additional, more parsimonious migration histories.

We focused on the data from Patients 1, 3, and 7, that included the highest number of anatomical sites (7-8 sites) and that were thoroughly analyzed in ref. 43. We used clone trees shared by the authors of ref. 43. For each patient, we sampled candidate migration trees from a uniform distribution using an unconstrained model. Following the methodology described in ref. 43, we resolved polytomies in clone trees to match the solutions reported there. In instances where the resolution of polytomies was ambiguous, we applied a random resolution, generating a new random resolution for each sampled candidate migration tree.

For Patient 1,125 identified a complex migration history, designating ROv as the primary tumor site. MACHINA was able to find several more parsimonious histories that suggested either LOv or ROv as the primary tumor location, but was not able to distinguish between them. These histories shared parsimony scores (referred to as “migration numbers”43) and/or the same number of migration events (named “co-migration numbers”43), which left the primary tumor’s location ambiguous. In contrast, SMiTH enabled the comparison of migration number distributions for different potential primary tumor sources (Fig. 4) rather than making the decision based on a single most parsimonious solution. Migration numbers associated with LOv and ROv were significantly lower than those of other potential sources (p < 10−65, Mann–Whitney U test). Among these two, the lowest numbers were observed for ROv (p < 10−38, Mann–Whitney U test), indicating a stronger statistical support for ROv as the primary tumor source.

Fig. 4: Distributions of migration numbers for trees with different primary tumor sites across n = 50000 sampled trees.
figure 4

Adnx: adnexa, ApC: appendix, Brn: brain, Bwl: bowel implant, CDSB: cul de sac, ClnE: sigmoid colon deposit, LOv: left ovary, LFTB,LFTC: left fallopian tube, Om: omentum, RFTA: left fallopian tube, ROv: right ovary, RPv: right pelvic mass, RUt: right uterosacral ligamen, SBwl: small bowel. Each boxplot displays the interquartile range (IQR) as the box, the median as the center line, whiskers extending to the furthest points within 1.5 × IQR, and outliers plotted beyond this range. Source data are provided as a Source Data file.

A similar situation was observed for Patient 7. Here,125 suggested the right uterosacral ligament (RUt) as the primary tumor location, while MACHINA identified several alternative migration histories with either the left ovary (LOv) or the right ovary (ROv) as the source, each sharing identical migration and co-migration numbers. Based on these results,43 argued that available data provide no evidence for the assertion that the primary tumor is located in the RUt as opposed to the ovaries. However, SMiTH provided such statistical evidence (Fig. 4), showing that migration trees with RUt as the source generally exhibited lower migration numbers (p < 10−154, Mann–Whitney U test).

Patient 3 presents a different scenario. Here, MACHINA identified several migration histories with LOv or ROv as primary tumor sources. There is also an alternative history with the omentum (Om) as the source, which has a lower migration number. The latter hypothesis appears preferable if judged solely by the single most parsimonious solution. Yet, this conclusion is not reliable due to the highly symmetric distribution of clones from different sites in the clone tree. Many clones form polytomies, offering insufficient data to clearly differentiate between the corresponding sites (e.g., all clones from LOv and LFTC are siblings, rendering these sites indistinguishable; see Supplementary Fig. 7). The apparently lower migration number for Om, compared to other sites, is simply due to its representation by five clones, versus four for several other sites – a difference that could stem from sampling bias given the small number of clones involved.

SMiTH was able to capture and quantify this uncertainty. Specifically, it did not find significant differences in the distribution of migration numbers among potential primary sources, with p-values ranging from 0.09 to 0.96 in pairwise Mann–Whitney U tests and a p = 0.65 in a joint Kruskal–Wallis test (Fig. 4). This provides a statistical support for suggestion that the existing data is insufficient to draw reliable conclusions about the migration history.

In total, these examples illustrate how SMiTH can be used to provide statistical support for hypotheses regarding metastatic spread pathways.

Discussion

This study is dedicated to an in-depth mathematical exploration of the relationships between phylogenies and migration trees of heterogeneous genomic populations, with the focus on host-to-host viral transmissions and cancer metastatic spread. Although it is established that phylogenetic trees impose some restrictions on migration pathways57, the exact nature and extent of these constraints are still not well understood, despite a considerable amount of research dedicated to this problem57,94,95,96,97,98,99,100,101,102,103,104,105. Our approach adds both depth and rigor to this area by utilizing the powerful theoretical and algorithmic framework of the theory of graph homomorphisms. This framework allowed us to derive necessary and sufficient conditions for the consistency of phylogenetic and migration trees, and to develop efficient algorithms for analyzing this consistency through numerical experiments.

Based on our findings, we propose a general and flexible computational framework that can be used to infer viral transmission and cancer migration trees under various assumptions, quantitatively assess competing hypotheses about migration dynamics, investigate the influence of phylogenies on the migration tree space, and to determine whether a potential migration history is definitively contradicted by a phylogeny or set of phylogenies.

Methodologically, our approach balances the advantages of probabilistic and parsimony methods. It incorporates scalability and the use of advanced combinatorial optimization techniques from the latter, along with the biological plausibility of the former that comes from employing appropriate prior random tree distributions.

From a modeling perspective, our approach eliminates the common assumption that transmissions are independent-a limitation of many existing methods that does not always hold in real-life settings, as evidenced by the frequent presence of “superspreading” in viral outbreaks. Furthermore, it is polytomy-agnostic, i.e., unlike other methods, it does not require polytomies to be resolved prior to inference. This is particularly important for cancer trees, where polytomies are prevalent. In addition, it is one of the first methods to propose and implement a Bayesian approach to metastatic migration inference, whereas most existing tools rely on the maximum parsimony principle.

This study aligns well with the context of previous research. Several earlier studies have conceptualized migration inference as a coloring problem55,56,57, employing this framework both to develop efficient inference algorithms and to explore the structure of migration tree space. The methodology introduced in this paper advances these prior approaches by incorporating a more comprehensive mathematical model and applying more sophisticated mathematical techniques. Similarly, the concept of constraining the tree space to random trees from a specified distribution was first introduced in our earlier studies on viral transmissions48,59, while a related concept of restricting the tree space to subgraphs of specific migration patterns has been utilized in computational cancer genomics43,83. This paper not only refines and extends these methodologies but also integrates them into a cohesive modeling and computational framework.

Although this paper focuses on specific applications of the homomorphism-based methods in cancer genomics and molecular epidemiology, these methods are in principle generalizable and applicable for other types of problems. Indeed, the proposed framework can be understood as a variant of the inverse problem126, which involves inferring unknown state transition rules of a dynamical system from observed outputs. This class of problems is fundamental across disciplines, including physics, control theory, and machine learning.

In our setting, the objective is to estimate the underlying state transition model - both its structure and transition probabilities - given a collection of observed system outputs. These observations are assumed to result from trajectories governed by an unknown but structured transition process, represented as a weighted graph or distribution over plausible transition graphs. Within a homomorphism-based framework, this problem can be approached as follows: (a) construct unlabeled system trajectories, for example, using a hierarchical clustering tree or a directed acyclic graph that encodes similarity or evolutionary relationships between observed outputs; (b) sample candidate transition graphs from a prior distribution (e.g., over graph topologies or transition matrices); (c) for each sample, compute graph homomorphisms from the observed trajectory structure into the candidate transition graph; (d) use the induced mappings to estimate the frequencies of transitions between states, thereby approximating the transition probabilities of the underlying model. This approach provides a flexible and general framework for modeling a wide range of diffusion-like processes, including Markov models, control systems, or the spread of information or influence in social networks.

The proposed approach certainly has both advantages and disadvantages. The proposed method does not explicitly account for unsampled sites, unlike recent viral transmission inference tools such as TransPhylo56 and BREATH67. However, inferring unsampled sites usually requires timed or branch-length phylogenies, which are common in viral genomics but less frequently used in cancer studies. Cancer research typically relies on character-based phylogenies127, for which topology-based tools like SMiTH are more appropriate. These methods assume that the metastatic migration network formed by sampled metastatic sites is connected42,43,82. Similar settings are also used in molecular epidemiology, particularly in local outbreak investigations where transmission clusters are well-covered through epidemiological methods80,128 or prior statistical analyses77, as well as in studies of viral transmissibility in clinical129 and laboratory74 settings.

In principle, however, the proposed framework can be extended to accommodate incomplete sampling. This can be done by allowing transition patterns to include more vertices than there are observed states at the leaves of the phylogeny, with the additional vertices representing unobserved states. Homomorphisms would then be allowed to map some nodes to these unobserved states. Such an extension would expand the solution space, likely necessitating the use of additional informative priors.

Furthermore, if the viral transmission or cancer migration tree is scale-free - as has been frequently observed116,117,118,119 - then migration tree inference remains relatively robust to unsampled sites, provided that the sampling rate is uniform. This robustness arises from the scale-free networks property to maintain connectivity despite the random removal of a substantial portion of nodes124. In other words, with high probability, the inference of transmission links between sampled hosts will not be critically affected under uniformly random and sufficiently dense sampling.

Next, it is important to note that this paper focuses on migration trees, whereas in some real-world scenarios, migration graphs may contain cycles-particularly in cases involving multiple infections or multiple metastatic seeding events. In fact, the proposed methodology is capable of producing outputs that are not strictly trees. The tree samples produced by SMiTH can be aggregated to approximate a likely transmission graph, either by sampling edges from the consensus graph based on their probabilities or by extracting the most probable subgraph with desired properties. This aligns with a broader class of randomized algorithms, where multiple simple solutions (here, trees) are combined to approximate a more complex one (here, a graph). A good example is the graph sparsification problem, where a dense graph G is approximated by a sparser graph H on the same vertex set. This can be done by sampling random spanning trees of G, estimating the probability of each edge appearing in the samples, and resampling edges accordingly to construct H130. A similar strategy could be effective in our settings.

An alternative way to handle multiple infections or seeding events is to partition each population into genetically distinct subpopulations, treating each as a separate vertex in the transmission or migration pattern. After tree construction, vertices corresponding to the same deme can be merged to form a graph. This approach has been used in viral studies48; subpopulations can be identified via reference genomes, phylogenetic analysis, or clustering79,131,132,133,134.

We also recognize that uniform sampling from the space of candidate migration trees may not be the most optimal tool for migration dynamics inference, and employing more precise optimization or sampling techniques to navigate the tree space could substantially enhance the method’s accuracy and efficiency. This work lays a foundation for further theoretical and algorithmic development in this direction, equipping researchers with the tools needed to expand the use of graph homomorphism methodologies. On the other hand, uniform sampling may be more suitable for hypothesis testing and comparison. It also should be noted that a key aspect of our methodology is that it involves sampling from the space of subtrees of a transition pattern, rather than from the space of phylogeny node labelings as was done in other studies33,57,65. The former space is considerably smaller in practical scenarios, which often makes the uniform sampling computationally feasible.

Furthermore, it is likely that models tailored to the specific characteristics of underlying populations could potentially yield more accurate insights into the biological processes involved. In particular, the biological mechanisms driving viral and cancer migrations are certainly very different. Nonetheless, employing a unified phylogenetic approach to study highly mutable populations offers several advantages. First, it decouples the initial phylogenetic reconstruction from its biological interpretation, thereby minimizing the risk of overfitting and ensuring that the results are less biased by underlying models56. In addition, such methods are significantly more computationally efficient and scalable compared to parameter-rich models48,56. Finally, more general phylogenetic models offer greater flexibility and versatility, and usually can be readily extended to more specific settings through the integration of suitable priors56. These features make them an excellent foundation for more detailed analyses.

The considerations above suggest several directions for future research. These include extending the proposed framework to accommodate more complex migration histories, scenarios involving sparse sampling, and broader classes of inverse problems in dynamical systems. On the theoretical side, the computational complexity of convex and compact label-distinctive homomorphism problems remains unexplored. In addition, potential connections between our model and graph sparsification problems warrant further investigation.

Methods

As the research was limited to computational methods, ethical approval from institutional boards was not applicable.

Basic definitions

Throughout this paper, we consider a pair of trees: a phylogenetic tree and a migration tree. For clarity, we refer to elements of a phylogeny as nodes, and to elements of a migration tree as vertices, and denote them by Greek and Latin letters, respectively.

The problem of migration tree inference is set up as follows. The input is a phylogenetic tree Ψ = (V(Ψ), E(Ψ)), with the leaf set L(Ψ) representing genomic variants belonging to different subpopulations (or demes), denoted by \({{\mathcal{L}}}\). The tree Ψ can be a standard binary phylogeny or a non-binary mutation tree used in most cancer studies. Each leaf λ L(Ψ) has an assigned site label (or color) \({l}_{\lambda }\in {{\mathcal{L}}}\). The aim is to expand this labeling from the leaves to all nodes in the tree, creating a full labeling \(f:V(\Psi )\to {{\mathcal{L}}}\). In this model, any multi-colored tree edge αβ represents a migration of genomic variants between demes f(α) and f(β). The migration tree T = T(Ψl) and with vertices \(V(T)={{\mathcal{L}}}\), is then formed by contracting the nodes with the same color73.

As mentioned in the introduction, researchers often seek migration trees satisfying particular constraints restricting types of migration or tree topologies. These constraints can be encoded using a transition pattern graph G that describes permissible patterns of migration. Transition pattern can be either a deterministic graph or a random graph characterized by some probability distribution; specific examples are provided in the following subsection. In this model, a migration tree should be isomorphic (i.e., identical up to relabeling of vertices) to a subgraph of the transition pattern. Any corresponding labeling will be called feasible.

The relations between the phylogeny Ψ, the migration tree T and the transition pattern G can be captured using the concept of a graph homomorphism. A homomorphism fΨ → G111 is an adjacency-preserving mapping between vertex sets of these graphs, i.e., f(u)f(v) E(G) if uv E(Ψ). For the sake of mathematical rigor, here and throughout this paper, we assume that a transition pattern is reflexive, i.e., every vertex is adjacent to itself. With this condition in place, any feasible labeling f is a homomorphism from Ψ to G, making the migration inference problem essentially a problem of finding such a homomorphism.

Throughout this paper, we use standard graph theory notations. We denote by Ψα a subtree of Ψ rooted at the node α. For any graph G, G[X] represents the subgraph induced by the subset of vertices X. We use u ~ v to indicate that vertices u and v are adjacent. The set of neighbors of a vertex u in G is denoted as NG(u), degG(u) = NG(u) is the degree of u, NG[u] = NG(u) {u} and NG[X] is the union of the neighborhoods of all vertices in a set X. In addition, the distance between any two vertices u and v in G is denoted by dG(uv), and PG(uv) refers to the corresponding shortest path between them. When the graph is clear from the context, we may omit the subscripts in these notations. In the case of a phylogeny Ψ, all distances and paths are undirected.

In addition, to simplify the notation, we will apply set-theoretical operations (e.g., intersection and union) directly to subgraph of G, with the understanding that the resulting subgraph is induced by the set obtained from applying these operations to the vertex sets of the original subgraphs.

Migration inference under structural constraints

The simplest form of the migration inference challenge appears when the vertices of the transition pattern graph G directly correspond to the labels of the phylogenetic tree leaves, that is, \(V(G)={{\mathcal{L}}}\). This variant is referred to as labeled inference. The transition pattern G can be an unweighted graph or a random graph with specified edge probabilities. Several scenarios exemplify this problem:

  • For viral transmissions, G could reflect a contact network of potential hosts80,109, where an infection spreads through direct interactions within this network.

  • Another viral transmission model assumes that transmission is only possible between hosts with overlapping exposure intervals33,51. In this case, G is an interval graph135, i.e., a graph with vertices representing time intervals and edges connecting vertices whose intervals intersect.

  • For metastatic spread inference, G may represent a circulatory network119,136,137, with vertices representing organs and edges reflecting the prior probabilities of cancer spreading between organs.

Problem 1 (labeled migration inference)

Input:

  • (a1) a phylogenetic tree Ψ with leaf labels \({({l}_{\lambda })}_{\lambda \in L(\Psi )}\) forming the label set \({{\mathcal{L}}}\);

  • (b1) a graph G with \(V(G)={{\mathcal{L}}}\);

Output:

  • (c1) a homomorphism fΨ → G such that f(λ) = lλ for every λ L(Ψ).

In practical settings, however, Problem 1 might be too restrictive or not fully reflective of reality. The main issue is the absence of a known mapping between the leaf labels and the vertices of the transition pattern. That is, the pattern G defines the permissible structure of migration, such as which configurations are allowed-rather than specifying exact migration routes. For example, in the context of viral transmission, we might expect the presence of a superspreader, which would result in a star-like topology. This indicates the permissible structure. However, the identity of the host acting as the superspreader-and consequently, the precise transmission links-are unknown. The following models to this variant of the problem:

  • Viral transmission: The transition pattern G could represent a random graph that describes expected characteristics of transmission trees, like being scale-free or having a particular expected degree distribution. Such models draw on known expected properties of contact networks essential for infection spread116,138,139,140 without requiring actual host contact information, and have been explored in several studies48,59.

  • Metastatic spread inference: In this case, a transition pattern G may describe plausible evolutionary scenarios of cancer migration, such as monoclonal or polyclonal seeding from single or multiple sources43,83. A related idea has been applied in studies analyzing CRISPR-based lineage tracing phylogenies, where G specifies a so-called star homoplasy model141.

  • In general phylogenetics, several widely used models-such as Dollo and Camin-Sokal parsimony-can be reformulated in terms of transition patterns141 and therefore fall naturally within the graph homomorphism framework.

In this version of migration inference problem, is no explicitly given mapping between the set of leaf labels and the vertices of the transition pattern. Instead, a feasible homomorphism should map leaves with distinct labels to distinct vertices of G and vice versa, i.e., for any λ1λ2 L(Ψ) we have f(λ1) ≠ f(λ2) whenever \({l}_{{\lambda }_{1}}\ne {l}_{{\lambda }_{2}}\) (Fig. 5 (a)). We will call a homomorphisms satisfying this requirement a label-distinctive homomorphism.

Fig. 5: Solutions for the Migration History Inference Problem under various constraints.
figure 5

For each scenario, a phylogenetic tree is illustrated in two different layouts on the left and in the middle, with the corresponding migration tree displayed on the right. In the phylogenies, nodes are color-coded by their homomorphic images in the migration tree. The layout in the middle showcases how homomorphism transforms a phylogeny into a migration tree. a Unconstrained solution. Subtrees formed by blue and red nodes are not connected, indicating a violation of the convexity constraint. b Convex solution. Each color-coded subtree is connected, but compactness is violated in the subtree rooted at node 7. c Convex and compact solution.

In this study, we focus specifically on migration trees because: (a) this formulation is biologically meaningful and widely used in previous research; (b) the associated computational problems are algorithmically tractable, as demonstrated below; and (c) the methods developed for migration tree inference can be naturally extended to more general spread patterns (see Discussion). In such a context, the first question to be asked is whether a given subtree T of the transition pattern G is consistent with the phylogeny Ψ, i.e., whether there exist a label-distinctive homomorphism Ψ → T.

Problem 2 (unlabeled migration inference)

Input:

  • (a2) a phylogenetic tree Ψ with leaf labels \({({l}_{\lambda })}_{\lambda \in L(\Psi )}\);

  • (b2) a tree T;

Output:

  • (c2) a label-distinctive homomorphism fΨ → T.

A more restricted version of Problem 2 includes an additional convexity constraint142,143, where tree nodes mapping to the same vertex of G form a connected subtree of Ψ (Fig. 5 (b)):

Problem 3 (convex unlabeled migration inference)

Input: (a2) and (b2)

Output:

  • (c3) a label-distinctive homomorphism fΨ → T such that induced subgraphs Ψ[f−1(v)] are connected for all v V(T).

We will refer to a homomorphism satisfying (c3) as a convex homomorphism. Convex homomorphisms to trees describe migrations with a complete bottleneck; such migration trees were the focus of extensive research in previous studies57,73,110.

To define another type of constraint, consider a labeling l of nodes of the phylogeny Ψ. The labeling l is compact if the label of the most recent common ancestor (MRCA) for any group of leaves matches one of the labels within that group.

The rationale for this is based on the understanding that migrations within any given subtree could follow one of the following scenarios: either (i) migrations involve only the demes represented by the leaves of the subtree, or (ii) migrations include external demes, but lineages from these demes were not sampled due to extinction or incomplete collection. Although both scenarios are feasible, the first is more parsimonious, especially when sampling is dense and migration events span a short period of time. Hence, homomorphisms that align with this compact labeling are referred to as compact.

Problem 4 (compact unlabeled migration inference)

Input: (a2) and (b2)

Output:

  • (c4) a label-distinctive homomorphism fΨ → T such that for any node α V(Ψ) we have f(α) f(L(Ψα)).

Note that convexity and compactness are independent properties; neither implies the other. An example of a convex but non-compact labeling is shown in Fig. 5b, while an example of a compact but non-convex labeling is depicted by Supplementary Fig. 1.

It is often desirable to produce a sample of potential solutions rather than a single solution. Such approach offers statistical backing for potential migration links and logically shifts the migration inference problem to a Bayesian paradigm. Potential solutions can be sampled from a random graph distribution or, if a transition pattern is a deterministic graph, from the uniform distribution of its subgraphs with specified properties (e.g., subtrees). The sampling-based version of the migration inference problem can be formulated as follows:

Problem 5 (unlabeled migration sampling)

Input:

  • (a5) a phylogenetic tree Ψ with leaf labels \({({l}_{\lambda })}_{\lambda \in L(\Psi )}\);

  • (b5) a random graph G (further referred to as a transition pattern) with edge probabilities pE(G) → [0, 1] that define a distribution \({{\mathcal{D}}}\) of subtrees of G.

Output:

  • (c5) a sample S = {T1, …, Tm} from the distribution \({{\mathcal{D}}}\), where each subtree Ti is a homomorphic image of Ψ, and the corresponding label-distinctive homomorphisms fiΨ → Ti are possibly convex and/or compact.

Finally, it is useful to relate the concepts introduced here to standard notions in structural graph theory. In this context, a graph homomorphism is sometimes interpreted as a G-coloring of the tree Ψ111, where each node of Ψ is assigned a “color”-that is, a vertex of T. In this sense, the homomorphism model generalizes coloring frameworks explored in prior transmission inference studies.

A key distinction from most existing homomorphism studies is that we assume the target tree T is reflexive-a less used, but not uncommon, variant111. Another central feature of our model is the label-distinction property, which significantly differentiates it from standard formulations. To the best of our knowledge, this specific variant of the graph homomorphism problem has not been previously studied. It can be seen as a relaxed version of the list homomorphism problem: instead of assigning explicit allowed images to nodes, we only impose constraints on which nodes must or must not share an image.

In the case of a convex homomorphism, the tree T can be derived from Ψ through a sequence of edge contractions, making T a minor of Ψ. In general, a graph G1 is a minor of a graph G2 if it can be obtained by a combination of edge contractions, edge deletions, and vertex deletions144. However, the minors relevant to our problem are subject to stricter constraints: only edge contractions are permitted, and contractions between labeled nodes with different labels are explicitly disallowed. In addition, convex homomorphisms generalize convex colorings145 in the same way as standard homomorphisms generalize proper colorings111. Finally, we are not aware of any prior work that explores concepts similar to compact homomorphisms.

Labeled migration inference

Problem 1 can be efficiently solved in polynomial time using dynamic programming. Briefly, the algorithm first performs a post-order traversal to compute potential image sets for each node, ensuring that each internal node can be mapped while preserving adjacency constraints. If the root has a valid image, a pre-order traversal assigns specific values to construct the homomorphism; otherwise, it confirms that no valid mapping exists (see Supplementary Note 1 for formal description).

Unconstrained unlabeled migration inference

A possible strategy to solve Problem 2 involves identifying a bijection \(g:{{\mathcal{L}}}\to V(T)\) that can be extended to a homomorphism Ψ → T using the algorithm for the labeled migration inference. We will describe such bijections as feasible.

Let Lg(u) = {λ L(Ψ): g(lλ) = u} be the set of leaves in Ψ whose labels map to u. The following theorem establishes a necessary and sufficient condition for the bijection feasibility.

Theorem 1

The bijection \(g:{{\mathcal{L}}}\to V(T)\) is feasible if and only if

$${d}_{T}({u}_{1},{u}_{2})\le {d}_{\Psi }({\lambda }_{1},{\lambda }_{2})$$
(1)

for all u1u2 V(T), λ1 Lg(u1), λ2 Lg(u2).

In other words, Theorem 1 claims that g is feasible if and only if it is a contraction. The necessity of this condition is a known property of graph homomorphisms111, but it is generally not sufficient111. However, for our specific type of the homomorphism problem it indeed suffice. The proof of this and the following theorems can be found in the Supplementary Note 2.

Theorem 1 establishes that the Unlabeled Migration Inference problem is algorithmically equivalent to the problem of finding a corresponding contraction. First of all, this allows us to demonstrate that Problem 2 is \({{\mathcal{NP}}}\)-hard. It can be proved through a reduction from the following problem:

Graph Bandwidth problem.

Given: A graph G.

Find: The minimal integer K (denoted as bw(G)) for which there exists is a bijection fV(G) → {1, …, V(G)} such that

$$| \, f(u)-f(v)| \le K\,{{\mbox{for}}} \, {{\rm{all}}}\,u \sim v.$$
(2)

The Graph Bandwidth problem is \({{\mathcal{NP}}}\)-hard146, and, moreover, it cannot be approximated within any constant factor unless \({{\mathcal{P}}}={{\mathcal{NP}}}\), even when the input graph G is a tree147.

Theorem 2

The Unlabeled Migration Inference problem is \({{\mathcal{NP}}}\)-hard, even when all leaf labels in the phylogeny Ψ are unique.

Although the Unlabeled Migration Inference problem is \({{\mathcal{NP}}}\)-hard, we can use Theorem 1 to approach it using Integer Linear Programming (ILP). We define a feasible bijection \(g:{{\mathcal{L}}}\to V(T)\) using binary variables xiu, where xi,u = 1 if g(i) = u, for each \(i\in {{\mathcal{L}}}\) and u V(T). The ILP formulation to find such a bijection is as follows:

$${\sum}_{i\in {{\mathcal{L}}}}{\sum}_{u\in V(T)}\delta (i){{\rm{deg}}}(u){x}_{iu}\to \max$$
(3)

s.t.

$${\sum}_{u\in V(T)}{x}_{iu}=1,\,\,\,i\in {{\mathcal{L}}};$$
(4)
$${\sum}_{i\in {{\mathcal{L}}}}{x}_{iu}=1,\,\,\,u\in V(T);$$
(5)
$${x}_{iu}+{x}_{jv}\le 1,\,\,\,i,j\in {{\mathcal{L}}},u,v\in V(T),{\Delta }_{ij} < {d}_{T}(u,v);$$
(6)
$${x}_{iu}+{x}_{iv}+{x}_{ju}+{x}_{jv}\le {y}_{ij}+1\,\,\,i,j\in {{\mathcal{L}}},uv\in E(T);$$
(7)
$${\sum}_{i,j\in {{\mathcal{L}}}}{y}_{ij}=n-1.$$
(8)

where \({\Delta }_{ij}={\min }_{{\lambda }_{i}\in {l}^{-1}(i),{\lambda }_{j}\in {l}^{-1}(j)}{d}_{\Psi }({\lambda }_{i},{\lambda }_{j})\).

In this formulation, constraints (4) and (5) ensure that x encodes a bijection, while constraints (6) guarantee that the bijection adheres to the conditions of Theorem 1. The auxiliary variables yij in constraints (7) indicate whether a pair of leaf labels map to adjacent vertices in T, with constraint (8) ensuring the inferred migration network forms a tree.

The coefficient δ(i) of the objective function (3) represents the genetic diversity of the ith subpopulation, measured as allelic entropy averaged over all allelic positions. The objective function defined in this way facilitates the search for a solution by leveraging the relationship between population diversity and population age148,149,150, that suggests that more diverse populations, which are likely older, are also more probable origins of migration52,59,88. Consequently, it is more likely that such populations correspond to high-degree vertices in the tree T, and the objective function (3) prioritize such solutions.

Migration inference under convexity constraints

In this section, a convex label-distinctive homomorhism Ψ → T will be called C-feasible. As noted earlier, in this case T is a minor of Ψ. It is known that the problem of detecting whether a given graph is a minor of another graph is \({{\mathcal{NP}}}\)-hard, even when input graphs are trees151,152. However, since we consider a restricted class of minors, we suggest that in practical settings C-feasible homomorphisms can be efficiently found and enumerated using dynamic programming.

The proposed algorithmic approach is described as follows. Initially, when possible, we simplify the phylogenetic tree Ψ by collapsing paths between leaves sharing identical labels into a single node, a step made feasible by the convexity constraint. As a result, Ψ may become non-binary and contains uniquely labeled leaves and possibly some labeled internal nodes.

The algorithm performs a post-order traversal of the phylogeny Ψ and, for each node α V(Ψ), calculates a set Hα describing possible homomorphisms from the subtree Ψα to subtrees of T. At the root node ρ, the set Hρ thus describes all homomorphisms from Ψ to T. Upon completing the traversal, the algorithm either concludes that no C-feasible homomorphism exists (when \({H}_{\rho }={{\emptyset}}\)), or initiates a pre-order traversal of Ψ. During this second traversal, it reconstructs C-feasible homomorphisms using the information from the sets Hα.

Formally, let Λα be the set of labeled nodes in the subtree Ψα. A subtree T[vX] of T is termed an induced v-subtree if it includes the vertex v, a subset X of v’s neighbors, and all vertices that are connected to v via paths that intersect with X (Fig. 6). Equivalently, T[vX] can be expressed recursively as

$$T[v,X]=\left(\mathop{\bigcup}_{x\in X}T[x,{N}_{T}(x)\setminus \{v\}]\right)\bigcup \{v\}$$
(9)
Fig. 6: Overview of the Dynamic Programming Algorithm for Detecting Convex Label-Distinctive Homomorphisms.
figure 6

A A phylogenetic subtree, Ψα, rooted at node α with two children, β and γ. B Input candidate migration tree T. The goal is to produce convex homomorphisms from Ψα to induced subtrees of T from such homomorphisms for Ψβ and Ψγ. C Convex homomorphisms from Ψβ (top row) and Ψγ (bottom row) to induced subtrees of T. For instance, the top figure depicts a homomorphism to an induced 1-tree T[1, {4, 9, 10, 11}] that consists of the vertex 1, its neighbors 4, 9, 10, 11 and all vertices connected to 1 via paths that intersect these neighbors. Nodes of subtrees are colored by their homomorphic images. Homomorphisms are organized into a bipartite graph \({{\mathcal{G}}}\), where edges connect homomorphism pairs f1f2 and f1f4 that are consistent, and there is no edge between homomorphisms f1 and f3, that are not consistent. D Homomorphisms f5 and f6 obtained by combining consistent homomorphisms f1f2 and f1f4.

For a vertex α V(Ψ), the set Hα consists of triples (vXC) called partial homomorphism tokens or simply tokens. In each token, (i) v V(T), (ii) X NT(v), (iii) C is a subset of vertices of an induced v-subtree T[vX] such that there exists a C-feasible surjective homomorphism fΨα → T[vX] with f(α) = v and f(Λα) = C.

The algorithm is initialized by setting \({H}_{\lambda }=\{(v,{{\emptyset}},\{v\}):v\in V(T)\}\) for all leafs λ. For an internal node α, the set Hα is constructed based on the sets from its children nodes β1, …, βk. The construction utilizes the following lemma (the proof can be found in Supplementary Note 3):

Lemma 1

Let T[vX] be an induced v-subtree. Then there exist a C-feasible surjective homomorphism fΨα → T[vX] with f(α) = v and f(Λα) = C if and only if there exist tokens \(({v}_{1},{X}_{1},{C}_{1})\in {H}_{{\beta }_{1}}\),…,\(({v}_{k},{X}_{k},{C}_{k})\in {H}_{{\beta }_{k}}\) satisfying the following conditions:

  • (a1) v ~ vi or v = vi for all i {1, …, k};

  • (b1) \((V(T[{v}_{i},{X}_{i}])\setminus \{v\})\cap (V(T[{v}_{j},{X}_{j}])\setminus \{v\})={{\emptyset}}\) for all ij {1, …, k}, i ≠ j;

  • (c1) v V(T[viXi]) if and only if vi = v.

  • (d1) v Ci, if α is labeled.

  • (e1) Xi = NT(vi){v}, if vi ≠ v.

  • (f1) X = {v1, …, vk}{v};

For convenience, we also state the following condition, which is logically implied by the others but aids in the forthcoming algorithm’s description:

(g1) C = C1 Ck, if α is unlabeled, and C = C1 Ck {v}, if α is labeled.

To convert Lemma 1 into an algorithm constructing the set of tokens for a node α using the tokens of its children, we first need to identify tokens of children that satisfy conditions (a1)-(g1). This can be achieved through the following steps. For each vertex v V(T), we construct a multipartite graph \({{\mathcal{G}}}={{\mathcal{G}}}(\alpha,v)\) (i.e., a graph partitioned into k independent sets or parts), as described below:

  1. (i)

    The parts A1, …, Ak correspond to children of α.

  2. (ii)

    The vertices of the set Ai are tokens from the set \({\Psi }_{{\beta }_{i}}\) satisfying the conditions (a1),(c1), (d1) and (e1).

  3. (iii)

    Two tokens from the sets Ai and Aj are adjacent whenever they satisfy the condition (b1).

In the constructed graph, sets of partial homomorphism tokens that satisfy Lemma 1 can be identified as k-vertex cliques. We employ the Bron-Kerbosch algorithm153 to generate these cliques. For each identified clique, we use conditions (f1) and (g1) to construct a new token (vXC) for the node α.

In addition, for each token (vXC) Hα, we maintain pointers p(vXC) that link to the children tokens used in its construction. These pointers are used in the subsequent phase of the algorithm, which aims to reconstruct full C-feasible homomorphisms f.

During this phase, the algorithm executes a pre-order traversal of Ψ. As it progresses, it recursively assigns a specific token to each node. When a token t = (vXC) is assigned to node α, the algorithm sets f(α) = v. It then retrieves tokens for t via the pointers p(t) and assigns them to children, β1, …, βk. This ensures that by the end of the traversal, each node in Ψ has been assigned a homomorphic image, completing the construction of the homomorphism f.

The outlined method is formalized as Supplementary Algorithm 2 and illustrated by Fig. 6. Its efficiency can be improved in several ways, with specific details discussed in Supplementary Note 3. It should be noted that, strictly speaking, the described algorithm is not polynomial, since the number of tokens for a node of Ψ theoretically can be exponential. In practical settings, however, the algorithm is extremely fast, and require split seconds to finish.

Migration inference with convexity and compactness constraints

A similar approach to the one outlined in Subsection 4 can be employed to identify homomorphisms that are both convex and compact. However, the dynamic programming algorithm can be further optimized by taking advantage of the specific nature of these constraints.

In what follows, homomorphisms that are both convex and compact will be referred to as CC-feasible. Following a similar methodology to that used in Supplementary Algorithm 2, we begin by contracting the paths in the phylogeny Ψ. We then construct partial homomorphism tokens similar to those used previously, with the exception that subsets C of images of labeled nodes are not required. Thus, the tokens are simplified to pairs (vX), where v V(T) and X NT(v). A pair (vX) is included in Hα if there exists a CC-feasible surjective homomorphism fΨα → T[vX] such that f(α) = v, and it also satisfies the following condition:

$$| {\Lambda }_{\alpha }|=| T[v,X]| .$$
(10)

This condition is necessary for a partial homomorphism to be extendable to a full compact homomomorphism Ψ → T.

The algorithm is initialized by setting \({H}_{\lambda }=\{(v,{{\emptyset}}):v\in V(T)\}\) for leafs λ. For an internal node α V(Ψ), its token set Hα is constructed from the tokens of its children β1, …βk using Lemma 2 (see Supplementary Note 4 for the proof).

Lemma 2

(vX) Hα if and only if one of the following conditions hold:

  1. 1.

    α is not labeled and there exist w X such that \((v,X\setminus \{w\})\in {H}_{{\beta }_{1}}\) and \((w,{N}_{T}(w)\setminus \{v\})\in {H}_{{\beta }_{2}}\).

  2. 2.

    α is labeled, X = k, and there exist a permutation (v1, …, vk) of elements of X such that v ~ v1, …, vk, and \(({v}_{1},{N}_{T}({v}_{1})\setminus \{v\})\in {H}_{{\beta }_{1}},\ldots,({v}_{k},{N}_{T}({v}_{k})\setminus \{v\})\in {H}_{{\beta }_{k}}\).

Lemma 2 can be straightforwardly used to construct tokens of α from tokens of its children, if α is unlabeled. When α is labeled, then finding a permutation (v1, …, vk) required by Lemma 2 is more complicated and can be achieved using the approach described next.

Suppose that \({{\mathcal{S}}}=({S}_{1},\ldots,{S}_{k})\) is a collection of sets. A vector (x1, …, xk) is termed a transversal of \({{\mathcal{S}}}\)154 if xi Si and all xi are distinct. Let now \({S}_{i}=\{u:\left.(u,Y)\right)\in {H}_{{\beta }_{i}}\,{\mbox{and}}\,v \sim u\}\). Then the vector (v1, …, vk) satisfies the condition 2) of Lemma 2 if and only if it is a transversal of \({{\mathcal{S}}}\).

Given this, the set Hα can be obtained by generating all transversals of \({{\mathcal{S}}}\). This can be done by reducing the problem to finding all maximal matchings in a bipartite graph (a graph consisting of two disjoint vertex sets with edges only between them; a matching is a set of its non-overlapping edges, see Supplementary Note 4 for specific details. The full method is described in Supplementary Algorithm 3.

SMiTH: Sampling MIgrations and Transmissions with Homomorphisms

The algorithms discussed can be effectively integrated into an Unlabeled Migration Sampling framework (Problem 5). It allows for the identification of homomorphic images of a given phylogeny Ψ within a collection of candidate migration trees sampled from a given migration pattern represented by a specified tree distribution.

The obtained sample of migration trees can be directly analyzed to estimate the probabilities of specific migration routes or to obtain summary statistics and confidence intervals for derivative evolutionary parameters. It can be also synthesized into a single weighted consensus graph, where each edge is weighted by the number of candidate trees that support it. When the homomorphism reconstruction includes an objective function, the consensus graph is constructed from a subsample comprising the top κ% of trees ranked by their objective values. For applications requiring a specific output tree – such as for benchmarking and comparison with other methods – the tree is determined by calculating the maximum-weight spanning tree of the consensus graph. The entire algorithmic pipeline, named SMiTH (SMiTH: Sampling MIgrations and Transmissions with Homomorphisms), is illustrated in Fig. 7.

Fig. 7: SMiTH: Sampling MIgrations and Transmissions with Homomorphisms.
figure 7

A Input phylogenetic tree. B Distribution of possible migration trees. Parallel rectangles depict a probability density function, with each rectangle’s width proportional to the corresponding probability. In practice, the distribution is represented either by a random graph model or by a stochastic graph generation procedure. C Candidate migration trees sampled from the distribution (B) using the specified random graph models (for instance, by selecting tree edges at random with pre-defined probabilities or by constructing the trees using a randomized graph growth model). D Homomorphisms from the phylogeny to three sampled trees. In the phylogenies, nodes are color-coded by their homomorphic images in a migration tree. The phylogeny layouts in the middle of each subfigure showcases how homomorphism transforms them into sampled trees. E: consensus solution derived from homomorphisms in D. The solution is shown as potential color distributions for the phylogeny’s nodes (left) or as a graph where possible migration edges are weighted according to the number of supporting solutions (right), with the edge thickness indicating weight.

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.