Article
Open access
Published: 28 August 2025

A graph homomorphism approach for unraveling histories of metastatic cancers and viral outbreaks under evolutionary constraints

Kiril Kuzmin¹,
Henri Schmidt²,
Maryam Kafi Kang³,
Sagi Snir⁴,
Benjamin J. Raphael² &
…
Pavel Skums ORCID: orcid.org/0000-0003-4007-5624³

Nature Communications volume 16, Article number: 8027 (2025) Cite this article

4398 Accesses
13 Altmetric
Metrics details

Subjects

Abstract

Viral infections and cancers are driven by evolution of populations of highly mutable genomic variants. A key evolutionary process in these populations is their migration or spread via transmission or metastasis. Understanding this process is crucial for research, clinical practice, and public health, yet tracing spread pathways is challenging. Phylogenetics offers the main methodological framework for this problem, with challenges including determining the conditions when a phylogenetic tree reflects the underlying migration tree structure, and balancing computational efficiency, flexibility, and biological realism. We tackle these challenges using the powerful machinery of graph homomorphisms, a mathematical concept describing how one graph can be mapped onto another while preserving its structure. We focus on metastatic migrations and viral host-to-host transmissions in outbreak settings. We investigate how structural constraints on migration patterns influence the relationship between phylogenetic and migration trees and propose algorithms to evaluate trees consistency under varying conditions. Leveraging our findings, we introduce a framework for inferring transmission/migration trees by sampling potential solutions from a prior random tree distribution and identifying a subsample consistent with a given phylogeny. By varying the prior distribution, this approach generalizes several existing models, offering a versatile strategy applicable in diverse settings.

Migrations of cancer cells through the lens of phylogenetic biogeography

Article Open access 25 August 2021

Equitable machine learning counteracts ancestral bias in precision medicine

Article Open access 10 March 2025

Inference of phylogenetic trees directly from raw sequencing reads using Read2Tree

Article Open access 20 April 2023

Introduction

Many human diseases are closely tied to evolutionary processes. In particular, cancers, often referred to as a “disease of evolution”^1,2, are driven by the continuous diversification of intra-tumor subclonal lineages, which evolve to compete for resources, metastasize, and evade the immune system and therapeutic interventions^{3,4,5,6,7,8,9}. Similarly, intra-host viral evolution, shaped by selective pressures from the host immune system and therapeutic treatment, is a key factor underlying adaptive immunity evasion^10,11,12,13, disease severity and progression^14,15,16,17, tissue tropism^18,19, the establishment and persistence of chronic or prolonged infections^20,21, transmissibility²², and latency^23,24.

Although the biological mechanisms of viral infections and cancers are different, both are fueled by highly mutable populations of disease-causing agents, whose extreme genomic diversity originates from error-prone replication processes, whether due to the lack of a proofreading mechanism in RNA-dependent RNA polymerase or retroviral reverse transcriptase in RNA viruses^25,26,27, or from the genetic instability of tumor cells manifesting itself in somatic mutations, chromosomal gain, loss, translocation, and aneuploidy^9,28,29,30. Consequently, the general phylogenetic methodologies applied to these populations exhibit many similarities. From a methodological standpoint, they form a unique segment of phylogenetics and phylodynamics, fostering a mutual exchange of concepts that enhances all areas of application^31,32,33.

In this study, we specifically focus on tissue-to-tissue metastatic migration and host-to-host viral transmission - migration behaviors characteristic of cancer and viral populations, respectively, understanding of which is essential for public health and medical research^34,35,36,37. These research areas especially benefited from groundbreaking advancements in sequencing technologies. State-of-the-art high-throughput targeted sequencing and single-cell DNA sequencing enable the capture of detailed population snapshots at exceptionally high resolutions, facilitating fine-grained analysis down to the level of individual genotypes^37,38,39,40. In particular, this allows us to examine transmissions and migrations on the level of individual events^38,41,42,43.

A wide array of methods has been developed specifically for reconstructing viral and bacterial transmissions, reflecting the substantial interest in tracking the spread of infectious diseases. The arsenal of tools available for reconstructing transmission histories is extensive, including but not limited to Outbreaker, Outbreaker 2^44,45, SeqTrack⁴⁶, SCOTTI⁴⁷, SOPHIE⁴⁸, Phybreak⁴⁹, Bitrugs⁵⁰, BadTrIP⁵¹, Phyloscanner⁵², StrainHub⁵³, TransPhylo^54,55 (along with its extension TransPhyloMulti⁵⁶), STraTUS⁵⁷, TreeFix-TP⁵⁸, QUENTIN⁵⁹, VOICE⁶⁰, HIVTrace⁶¹, GHOST⁶², MicrobeTrace⁶³, SharpTNI⁶⁴, TiTUS³³, TNeT⁶⁵, AutoNet⁶⁶, BREATH⁶⁷ and others^{55,68,69,70,71,72,73,74,75}. These tools have been instrumental in the investigation of outbreaks and monitoring the transmission dynamics of pathogens like HIV, hepatitis C (HCV), SARS, MERS and SARS-CoV-2^{41,76,77,78,79,80}. Some methods are specifically designed for outbreak investigations with complete or near-complete sampling (including the method presented in this paper), while others account for unsampled hosts and are better suited for studying transmission dynamics in large-scale epidemiological settings.

Similarly, the development of methods for deducing metastatic spread histories is burgeoning, driven by advancements in single-cell DNA sequencing and CRISPR-based lineage tracing technologies. Currently, the repertoire of tools in this domain includes MACHINA⁴³, FitchCount (as part of the Cassiopeia suite)^42,81, PathFinder⁸², TCC, PCC, and PCCH⁸³. Notwithstanding the relatively shorter list, the impact of these tools is growing, with several studies published recently leveraging these methods to gain insights into the mechanisms of metastatic spread^42,84,85,86.

Despite significant progress in the field, the wide variety of existing methods underscores that the challenge of accurately inferring both types of spread remains unresolved. This diversity of approaches indicates both the complexity of the problem and the ongoing efforts to refine and improve upon existing methods.

Phylogenetics and phylodynamics provide the most widely used methodological frameworks for viral transmission and metastatic migration history reconstruction^43,56. However, their application in this context is not straightforward. A phylogenetic tree does not directly equate to a migration or transmission tree⁵⁶, as the nodes in a phylogenetic tree represent divergences of lineages associated with different types of events^56,87. While some of these divergences may result from the seeding of new sites, others occur within previously invaded sites. Thus, constructing a migration tree from a phylogenetic tree requires reconstructing the ancestral states of internal nodes-an approach known as ancestral trait inference. In this context, the ancestral trait corresponds to the location of each lineage, and the goal is to determine whether each divergence event occurred within the same site or was caused by migration, leading to the seeding of a new site. Furthermore, it is crucial to effectively leverage the full spectrum of intra-site (within-host or within-tumor) population diversity uncovered by high-throughput sequencing, that often provides a strong signal for migration inference. For example, the presence of paraphyletic relationships between populations-where genomic variants from one population form a clade in the phylogenetic tree, but some members are more closely related to individuals from another population than to their own, suggests recent seeding of one site from another^60,88,89,90.

The challenges mentioned above highlight several crucial questions that remain unresolved and warrant further exploration. These questions include:

1) To what extent and under which conditions does the topology of a phylogenetic tree reflect the structure of the underlying migration tree? This question becomes particularly important when the level of paraphyly in the labeled phylogeny is low, a situation not uncommon, as the paraphyletic signal tends to diminish over time and with smaller sample sizes⁸⁸. A number of studies have looked at this subject, but their conclusions were mixed. Some studies have found that certain migration patterns, such as super-spreading or, more generally, migrations over networks formed by different models like Erdös-Rényi⁹¹, preferential attachment⁹², or Watts-Strogatz models⁹³, lead to quantitatively distinct phylogenetic tree topologies^94,95,96. In addition, the spatial structure and dynamics of heterogeneous populations, which are directly related to migration pathways, have been shown to affect the phylogeny structure of both viral and tumor populations^97,98,99,100. On the other hand, other studies report that the direct impact of migration patterns on phylogenetic trees ranges from minimal to moderate^{101,102,103,104}. Finally, certain studies have drawn mixed conclusions, indicating that while some migration characteristics are reflected in the phylogeny, others are not¹⁰⁵.

2) How can we limit the solution space to balance computational feasibility, accuracy of inference, generalizability, and biological realism? The space of migration trees consistent with given phylogenetic trees is often vast, and its properties are not well understood^33,57. A sampling-consensus approach is one method to address solution ambiguity, where feasible migration trees are sampled and summarized in a weighted consensus graph, with weights reflecting posterior probabilities of edges^{33,42,47,64,65}. However, the size of the solution space may restrict the depth of sampling. As a response, it is common practice to narrow down the solution space to a set of plausible migration trees optimizing a specific objective function under evolutionary-based constraints. Employing constrained models also aids in preventing overfitting in the presence of missing data and errors.

Various objectives and constraints have been implemented by existing methods. Limiting the number of migration events, sizes of bottlenecks or numbers of back-migrations^{33,42,43,46,52,57,65,81,83} is more computationally efficient and scalable due to the utilization of dynamic programming^106,107, making such approaches practical in both molecular epidemiology and computational oncology. These can also be formulated as Integer Linear Programming (ILP) problems¹⁰⁸ and solved with reasonable efficiency using existing ILP solvers. Models with more complex Bayesian objectives with constraints regularized as priors^{44,47,49,51,56} offer a richer, biologically nuanced perspective but suffer from scalability issues and usually rely on generic methods like Markov Chain Monte Carlo (MCMC) sampling, which may not yield optimal solutions, in part due to a lack of problem-specific mathematical strategies. Balancing computational efficiency with biological comprehensiveness presents a notable challenge, compounded by the uncertainty of how much constraints and objectives truly limit the solution space^33,57.

3) How to incorporate a variety of models for phylogenetic inference of transmissions and migrations into a unified computational framework?

Migration inference models draw on varied biological or epidemiological assumptions. For instance, viral transmission inference often incorporates case-specific temporal data like infectious periods, exposure intervals, symptom onset, diagnosis or sample collection dates to establish the order of infections and eliminate unlikely transmission links^{33,44,47,49,50,54,69}. In some rare cases, contact networks are known and can be used for the same purpose^80,109. While effective, such data is often unavailable, non-informative, or sensitive, particularly for endemic and pandemic diseases caused by HIV, Hepatitis C, SARS-CoV-2, or Influenza^48,59,75. In situations where case-specific data cannot be used, genomic epidemiology tools resort to broader assumptions, like the expected degree distribution of transmission networks implied by the structure of a susceptible population^48,59. Similarly, methods for inferring metastatic spread use constraints defined by so-called migration patterns that reflect realistic cancer migration scenarios, such as monoclonal, polyclonal or multi-source seeding^43,83. A commonality across all these methods is the use of structural constraints on feasible transmission trees that are considered as subgraphs of a larger “pattern” graph. It suggests that these diverse approaches can be unified within a single Bayesian inference framework based on such topological constraints and/or priors.

Addressing these challenges requires a comprehensive investigation into the mathematical properties of migration trees consistent with a given phylogeny, a topic that is not yet fully understood⁵⁷. Our study aims to advance this area by developing a methodology based on powerful techniques from structural graph theory and combinatorics. In addition, we will use this methodology to introduce algorithmic approaches for inferring migration trees.

A number of earlier studies has achieved important progress in this area. Several studies noticed that migration trees consistent with a given phylogeny correspond to partitions of the phylogeny’s node set or to coloring of its branches^54,55,57,73. These observations have informed the development of methods to enumerate and sample these trees, assuming a complete bottleneck⁵⁷ and a known sequence of migrations¹¹⁰. In fact, as we argue in this paper, the relationship between a phylogenetic tree and a migration tree is described by a graph homomorphism - the mathematical concept that generalizes both partitions and colorings.

Graph homomorphism is essentially a mapping between the vertices of two graphs that preserves their structure¹¹¹. The theory of graph homomorphisms is a well-established area of structural and algebraic graph theory, with deep results and rich methodology. All types of migration trees discussed so far can be described by a graph homomorphism with specific constraints. For example, migration trees consistent with a given phylogeny under the assumptions of complete sampling and complete bottleneck, that has been studied in refs. ^57,73 are minors^112,113 of the phylogeny or, equivalently, its homomorphic images such that inverse images of all vertices are connected subtrees.

We use mathematical and algorithmic machinery of graph homomorphism theory to delve into the details of viral transmission and metastatic migration inference. Specifically, we formulate several variants of the migration inference problem in mathematical terms and discuss their applicability in different real-world settings, provide necessary and sufficient conditions describing trees consistent with a given phylogeny, and propose a series of algorithms that evaluate the consistency of phylogenetic and migration trees under various evolutionary scenarios through the construction of corresponding homomorphisms.

Based on aforementioned insights, we propose a general framework for migration inference under complete or nearly-complete sampling - i.e., when the major sources of spread are included in the sample. Our method draws candidate migration trees from a chosen prior random tree distribution and identifies a subsample of trees consistent with a given phylogeny. By varying prior tree distributions, this approach expands upon and generalizes several existing models, offering a versatile and computationally efficient strategy applicable to a variety of settings arising in outbreak investigations and metastatic spread studies. Crucially, the proposed framework is computationally fast, enabling biomedical and public health researchers to quickly test different tree priors that represent common migration models. Beyond that, proposed methods can be used for investigating how phylogenies constrain the space of possible migration trees, for inferring an order of known or suspected migration events, and for determining potential events that are definitively ruled out by a phylogeny⁵⁷. For studying viral transmissions in epidemiological settings with sparse sampling, alternative approaches such as BREATH⁶⁷ and TransPhyloMulti⁵⁶ may be more appropriate. It is also important to distinguish the problem addressed in this paper from phylogeography inference that focuses of pathogen spread between geographical locations¹¹⁴, which requires specific modeling approaches that lie outside of the scope of this study.

The paper is organized as follows. In Methods, we first introduce essential graph-theoretical concepts (Subsection 4), then present four problem formulations for inferring migration histories under various structural constraints (Subsection 4). Subsections 4-4 address each problem in turn, detailing their computational complexity and/or solution methods via ILP or dynamic programming.

We validated our methodology using both simulated and real experimental data from studies of viral outbreaks and cancer metastasis, demonstrating its effectiveness across diverse scenarios. Subsections 4-4 present results on synthetic data, where we explored the impact of structural constraints on the relationship between a phylogeny and a migration tree, and benchmarked our method against state-of-the-art tools. In Subsection 4, we applied our approach to Hepatitis C (HCV) outbreak data, accurately inferring transmission links in complex phylogenies with highly intermixed intra-host populations. Finally, Subsection 4 shows its application to cancer data, where it effectively quantified and reduced uncertainty in metastatic migration inference.

Results

Simulated data generation

To generate synthetic data, we used FAVITES¹¹⁵, a tool capable of simulating genomes, phylogenies, and migration networks under various evolutionary scenarios. Although originally designed to simulate viral outbreaks, FAVITES supports general phylogenetic and population genetics models, making it suitable for simulating migrations of heterogeneous populations besides viruses. It should also be noted that, to the best of our knowledge, specialized simulation tools for metastatic spread with capabilities comparable to FAVITES are currently not available.

We simulated the migration of a heterogeneous population over a network of sites formed according to the Barabasi-Albert model⁹². This assumption can be valid for both viral^116,117 and cancer^118,119 spread. In the basic model, migrations occur at a constant rate along each network edge (in a viral context, this corresponds to the network-based Susceptible-Infected (SI) transmission model). Within each site, phylogenies evolved under the coalescent model with the exponential growth, that has been previously used to simulate intra-host⁸⁸ and intra-tumor¹²⁰ evolution. Genotypes were assumed to evolve under the GTR+Γ substitution model, and were sampled simultaneously at the end of the simulation, with 100 sequences per deme.

Beyond the basic model, we examined the impact of additional simulation settings, including variable migration rates, logistic intra-site population growth (i.e., assuming that the coalescence rate depends on a time-varying population size following a logistic growth model¹²¹), as well as smaller and variable sample sizes per deme (see Supplementary Note 5). In all cases, including the basic model, simulations with a low number of invaded sites were discarded, resulting in datasets averaging 257 simulated migration histories per combination of models and constraints, spanning 5 to 30 demes. In what follows, if not specifically stated otherwise, we are discussing the results for the basic model.

In the first series of experiments, we sampled candidate migration trees from 3 distinct prior distributions of tree topologies and evaluated their consistency with simulated exponential growth phylogenies under three different types of constraints. The prior distributions included:

T1) Degenerate distribution consisting of the true topology. Even though this scenario is unrealistic for actual migration inference, it serves as a test to determine if migration links can be accurately reconstructed when the topology is known but sites need to be correctly mapped to migration tree vertices. To avoid bias linked to correlations between vertex IDs and migration times, that can be potentially introduced by the simulation methods, we produced multiple samples with randomly permuted vertex IDs.
T2) Random scale-free trees produced by the preferential attachment procedure.
T3) Uniformly distributed trees of a given size. Sampling was performed by generating random Prufer codes¹²², integer sequences of length n − 2 that uniquely define n-vertex trees.

The following types of constraints were used:

H1) unconstrained homomorphism;
H2) convex homomorphism minimizing the number of compactness constraint violations;
H3) convex and compact homomorphism.

For each simulated dataset, we produced 9 samples of candidate migration trees corresponding to all combinations of conditions T1–T3 and H1–H3. Each combination of a simulated dataset and model parameters or tools is further referred to as a test instance. The sample sizes ranged from 1000 for the degenerate distribution without constraints, up to 1,000,000 for the uniform distribution with convexity and compactness constraints. This variation in sample size was necessary because stricter constraints require larger samples to ensure that a sufficient number of feasible trees are produced. We assessed whether these sampled trees are consistent with the given phylogenies (i.e., the former are homomorphic images of the latter) under the respective constraints, and extracted subsamples of consistent trees.

Sampled trees were compared with true migration trees produced by FAVITES. Individual trees were compared by measuring recall, defined as the fraction of inferred transmission edges among true transmission edges; precision, the fraction of true transmission edges among inferred transmission edges; and the f-score, i.e., the harmonic mean of precision and recall.

In addition, we summarized each subsample of consistent migration trees using a consensus graph, where each edge is weighted by the proportion of candidate trees that support that edge^47,55,123. A solution can be extracted from the consensus graph by discarding edges with support below a predefined threshold. For each graph, we estimated the area under the precision-recall curve, which was calculated by varying the support threshold. We opted for the precision-recall curve instead of the more common ROC curve due to the imbalance between the classes of true and false migration edges.

Simulated data results

We found that the relationship between phylogenetic trees and migration trees is heavily influenced by structural constraints. In the absence of constraints, there is considerable ambiguity in the possible migration histories that align with a given phylogenetic tree topology, even when prior knowledge about the true migration tree topology is available. Notably, almost every sampled scale-free tree proved consistent with the given phylogenies, with a median fraction of consistent trees at μ = 1 (Fig. 1b). It is not entirely unexpected in light of Theorem 1, which suggests that trees with a low diameter – a common feature of scale-free networks – are more likely to be consistent with a given phylogenetic tree. However, even among uniformly sampled trees, a high consistency rate was observed when no constraints are applied (median μ = 0.84, Fig. 1c).

**Fig. 1: Relationship between phylogenetic and migration trees.**

Furthermore, without constraints, individual consistent trees display only marginal agreement with true migration trees. This holds not only for scale-free and uniformly sampled trees, but even for the degenerate distribution, with median f-score within the range 0.33–0.36 for all three tree priors (Supplementary Fig. 3). In other words, even if the topology of a true migration tree is known, numerous labelings of that topology are consistent with the original phylogeny, most of which substantially diverge from the true labeling. Consequently, the true labeling is not immediately distinguishable among the alternatives without additional information. Combining a consistent tree subsample into a consensus graph, however, brings it closer to the true migration tree, with the median AUC values ranging from 0.56 to 0.59 for all three tree priors (see Fig. 1).

The introduction of convexity and compactness constraints significantly reduces the percentage of trees deemed consistent, as can be expected (Fig. 1a–c). This reduction primarily eliminates incidental solutions, enhancing the alignment of the trees that meet these constraints with the true migration trees (Fig. 1d–f). In particular, for the degenerate distribution, introducing constraints effectively filters out ambiguous vertex mappings, nearly always recovering the true mapping, provided that the solutions satisfying the constraints exist.

Prior knowledge of the migration tree structure is also beneficial for inference, with AUCs improving as the tree prior becomes tighter. In particular, under constraints, AUCs for the scale-free prior are significantly higher than those for the uniform prior (p = 4.4 ⋅ 10⁻⁶ and p = 1.76 ⋅ 10⁻¹⁶ for convex and both convex and compact cases, respectively, Kruskal–Wallis test). Given that true migration trees are generated by preferential attachment, this suggests that under constraints, the phylogeny, to a certain degree, reflects the properties of the underlying migration tree.

As expected, the number of sequences sampled per site affects the accuracy of inference. Notably, this effect is non-linear: while accuracy declines gradually for larger sample sizes, it drops sharply when the sample size is reduced to one (Supplementary Fig. 3). However, this is explainable, given that the major phylogenetic signals of migration, such as paraphyly, are driven by asymmetries in population structure^60,88. These asymmetries persist even with relatively small intra-site samples but disappear completely when only a single genome per site is available. Variable sampling rates per deme and variable migration rates had no significant impact on the inference accuracy (Supplementary Fig. 4).

When entire sites are sampled at random rates of ρ = 0.5 and ρ = 0.75, the algorithm’s accuracy in inferring transmission links between sampled sites also decreases, but to a lesser degree than may be initially anticipated, still exceeding the performance of most other tools under perfect sampling (Supplementary Fig. 5). Notably, accuracy improved with the migration tree size; for instance, at ρ = 0.5, the f-score for trees with more than 15 vertices increased from 0.50 to 0.58. This feature reflects the so-called robust but fragile property of scale-free networks¹²⁴.

Taken together, these observations indicate that phylogeny topologies do indeed reflect underlying migration tree structures, but the extent of this reflection is influenced by evolutionary constraints. Moreover, the correspondence between phylogenies and migration trees is primarily discernible when analyzed statistically across a large sample of feasible migration trees that are consistent with the phylogeny. A single consistent tree may be arbitrary, and thus relying on a single solution may lead to misleading conclusions.

Furthermore, introducing constraints can enhance inference accuracy even when the ground truth does not satisfy them. This is because constraints reduce the size of the solution space, allowing for more efficient and focused exploration. While the ground truth may not strictly adhere to the constraints, it may be close enough that the sampled constrained solutions remain not too far from it. As a result, averaging over these solutions can yield an accurate tree estimate.

Based on those observations, we have developed a method named SMiTH (Sampling Migration Trees using Homomorphisms) for the constrained inference of migration trees with expected general properties. This method involves sampling candidate migration trees from a designated random tree distribution, identifying convex homomorphisms from the given phylogeny to these sampled trees while minimizing an objective function defined by the number of compactness constraint violations, constructing a consensus graph from trees with top objective values, and ultimately inferring the final migration tree as the minimal spanning tree of this consensus graph.

We benchmarked SMiTH against several existing tools designed to infer migration histories from phylogenetic tree topologies. For the sake of fairness, tools that use dated phylogenies and/or case-specific epidemiological information were not considered. The tools selected for this comparison include Cassiopeia^42,81, MACHINA⁴³, Phyloscanner⁵², STraTUS⁵⁷, and TNet⁶⁵. For MACHINA, we ran all four migration models provided by the tool and report the best result, which was achieved using the single-source seeding model. STraTUS generates a sample of migration trees rather than a single tree; thus, similarly to SMiTH, we used the minimum spanning tree of the consensus graph for benchmarking purposes.

The results of the algorithm comparison for the exponential population growth model are shown in Fig. 2. It was found that both variants of SMiTH – with uniform and scale-free tree priors – allow for a statistically significant improvement over other tools (p < 10⁻⁹, multiple comparison of f-score distributions by Kruskal–Wallis test). SMiTH is followed by Cassiopeia and STraTUS – two other sampling-based methods, whose accuracies were statistically indiscernible (p = 0.58, Kruskal–Wallis test). Both tools produce convex solutions, albeit using different algorithms. While STraTUS achieves this through sampling, Cassiopeia’s FitchCount module determines the number of unique most parsimonious solutions, which, in our examples, were almost always convex using dynamic programming. These tools were followed by MACHINA that, similarly to our approach, imposes structural constraints on plausible migration trees by considering them as subgraphs of so-called transition patterns. However, MACHINA produces a single solution optimizing particular objectives rather than summarizes a sample of such solutions; this, in light of the observations described above, likely hampered its performance vis-à-vis sampling-based methods. Similar reasoning can be applied to Phyloscanner, that also produces a single most parsimonious solution. In addition, Phyloscanner is specifically designed to make use of paraphyly, that usually provides a strong signal for migration⁸⁸ when present; consequently, its accuracy can be affected when, as in analyzed test cases, the number of paraphyletic clades is limited. Furthermore, Phyloscanner is specifically designed for scenarios with sparse sampling, which is different from the settings used in this study. In particular, it usually assumes that when the populations sampled from different cities are monophyletic, then they all have a common unobserved source⁸⁸ - the assumption that is opposite to compactness and that seems not to be always valid. The relation between algorithms’ performances is mostly stable with regard to the number of migration sites (Fig. 2b).

The accuracy of all algorithms decreases for data simulated under the logistic coalescent model (Supplementary Fig. 6), consistent with observations in ref. ⁴⁸. However, the ranking of algorithms remains largely stable, with both variants of SMiTH maintaining the top positions. Notably, SMiTH experiences a smaller decline in accuracy compared to other methods.

The median running time of our method for construction of a constrained homomorphism for a given phylogenetic tree and candidate migration tree was within 0.7 seconds for all tests and tree priors (Fig. 2c), even though theoretically our algorithms can be exponential in the worst case. It allows us, using straightforward parallelization, to produce and process samples consisting in hundreds of thousands of candidate trees in a reasonable time.

Experimental viral data

Analysis of simulated data highlights the role of structural constraints in migration tree inference, particularly when paraphyly is limited. Interestingly, these constraints prove just as essential in scenarios with a high degree of paraphyly, albeit for different reasons. This is evidenced by the analysis of data of Hepatitis C (HCV) outbreaks, which have been considered in previous studies^{48,59,60,65,75}. The data comprises intra-host HCV populations from several outbreaks investigated by the Centers for Disease Control and Prevention, each population consisting of sequences covering Hypervariable Region 1 (HVR1) of the HCV genome. In each outbreak, a single primary host infected all other hosts, rendering the migration tree in graph-theoretical terms a star.

We analyzed the two largest outbreaks involving 15 and 19 infected hosts. For each outbreak, we used phylogenetic trees that were constructed in a prior study⁴⁸. All transmissions occurred within a short time frame and, as a result, intra-host populations are highly intermixed (Fig. 3a). This makes paraphylitic signal strong, but oversaturated, thus impeding its use to reconstruct true transmission history.

**Fig. 3: The analysis of experimental HCV data.**

This effect can be demonstrated by examining internal node labels generated by the Fitch algorithm, that serves as a basis for several methods considered in the previous section. In the trees analyzed, 32–34% of internal nodes were assigned a single provisional label during the post-order traversal step of the dynamic programming algorithm, indicating that these labels appear in all most parsimonious solutions (or solutions with the minimal migration number in terms of ref. ⁴³). Many of these nodes are adjacent, suggesting that the transmission links they represent will be identified by any parsimony-based sampling approach similar to those employed by existing tools^33,42,65. For one outbreak, these links form a connected graph, whereas in the other, only one host does not integrate into this single connected component (see Fig. 3b). These resulting graphs are relatively dense and include not only true edges but also a significant number of false positives (see Fig. 3b). Consequently, even if all true positive edges are correctly identified using nodes with multiple Fitch labels, the f-scores would not exceed 0.54 and 0.51, respectively. Enhancing the accuracy of this approach requires the filtering out of false positive edges, achievable only through the integration of additional prior information or constraints.

In contrast, sampling unconstrained candidate transmission trees from the scale-free tree distribution produces the results that are significantly closer to true transmission histories (Fig. 3c). For consensus networks derived from these samples, areas under the precision-recall curve are estimated at 0.76 and 0.70. Furthermore, f-scores of solutions obtained as minimal spanning trees of consensus networks produced from the top 1% of sampled trees according to the objective (3) are 0.86 and 0.94. Comparable results – f = 0.79 and f = 0.89 – are obtained if we use the top 1% of sampled trees based on the parsimony score.

Experimental cancer data

We employed SMiTH to analyze the migration history of metastatic ovarian cancer using the data published in ref. ¹²⁵. The dataset comprised whole-genome and targeted sequencing data from samples collected at various anatomical sites, including the left ovary (LOv), the right ovary (ROv), and several metastases. In the original study, migration histories were inferred using hierarchical clustering trees and a Dollo parsimony model. Subsequent re-analysis using MACHINA⁴³ revealed several additional, more parsimonious migration histories.

We focused on the data from Patients 1, 3, and 7, that included the highest number of anatomical sites (7-8 sites) and that were thoroughly analyzed in ref. ⁴³. We used clone trees shared by the authors of ref. ⁴³. For each patient, we sampled candidate migration trees from a uniform distribution using an unconstrained model. Following the methodology described in ref. ⁴³, we resolved polytomies in clone trees to match the solutions reported there. In instances where the resolution of polytomies was ambiguous, we applied a random resolution, generating a new random resolution for each sampled candidate migration tree.

For Patient 1,¹²⁵ identified a complex migration history, designating ROv as the primary tumor site. MACHINA was able to find several more parsimonious histories that suggested either LOv or ROv as the primary tumor location, but was not able to distinguish between them. These histories shared parsimony scores (referred to as “migration numbers”⁴³) and/or the same number of migration events (named “co-migration numbers”⁴³), which left the primary tumor’s location ambiguous. In contrast, SMiTH enabled the comparison of migration number distributions for different potential primary tumor sources (Fig. 4) rather than making the decision based on a single most parsimonious solution. Migration numbers associated with LOv and ROv were significantly lower than those of other potential sources (p < 10⁻⁶⁵, Mann–Whitney U test). Among these two, the lowest numbers were observed for ROv (p < 10⁻³⁸, Mann–Whitney U test), indicating a stronger statistical support for ROv as the primary tumor source.

**Fig. 4: Distributions of migration numbers for trees with different primary tumor sites across n = 50000 sampled trees.**

A similar situation was observed for Patient 7. Here,¹²⁵ suggested the right uterosacral ligament (RUt) as the primary tumor location, while MACHINA identified several alternative migration histories with either the left ovary (LOv) or the right ovary (ROv) as the source, each sharing identical migration and co-migration numbers. Based on these results,⁴³ argued that available data provide no evidence for the assertion that the primary tumor is located in the RUt as opposed to the ovaries. However, SMiTH provided such statistical evidence (Fig. 4), showing that migration trees with RUt as the source generally exhibited lower migration numbers (p < 10⁻¹⁵⁴, Mann–Whitney U test).

Patient 3 presents a different scenario. Here, MACHINA identified several migration histories with LOv or ROv as primary tumor sources. There is also an alternative history with the omentum (Om) as the source, which has a lower migration number. The latter hypothesis appears preferable if judged solely by the single most parsimonious solution. Yet, this conclusion is not reliable due to the highly symmetric distribution of clones from different sites in the clone tree. Many clones form polytomies, offering insufficient data to clearly differentiate between the corresponding sites (e.g., all clones from LOv and LFTC are siblings, rendering these sites indistinguishable; see Supplementary Fig. 7). The apparently lower migration number for Om, compared to other sites, is simply due to its representation by five clones, versus four for several other sites – a difference that could stem from sampling bias given the small number of clones involved.

SMiTH was able to capture and quantify this uncertainty. Specifically, it did not find significant differences in the distribution of migration numbers among potential primary sources, with p-values ranging from 0.09 to 0.96 in pairwise Mann–Whitney U tests and a p = 0.65 in a joint Kruskal–Wallis test (Fig. 4). This provides a statistical support for suggestion that the existing data is insufficient to draw reliable conclusions about the migration history.

In total, these examples illustrate how SMiTH can be used to provide statistical support for hypotheses regarding metastatic spread pathways.

Discussion

This study is dedicated to an in-depth mathematical exploration of the relationships between phylogenies and migration trees of heterogeneous genomic populations, with the focus on host-to-host viral transmissions and cancer metastatic spread. Although it is established that phylogenetic trees impose some restrictions on migration pathways⁵⁷, the exact nature and extent of these constraints are still not well understood, despite a considerable amount of research dedicated to this problem^{57,94,95,96,97,98,99,100,101,102,103,104,105}. Our approach adds both depth and rigor to this area by utilizing the powerful theoretical and algorithmic framework of the theory of graph homomorphisms. This framework allowed us to derive necessary and sufficient conditions for the consistency of phylogenetic and migration trees, and to develop efficient algorithms for analyzing this consistency through numerical experiments.

Based on our findings, we propose a general and flexible computational framework that can be used to infer viral transmission and cancer migration trees under various assumptions, quantitatively assess competing hypotheses about migration dynamics, investigate the influence of phylogenies on the migration tree space, and to determine whether a potential migration history is definitively contradicted by a phylogeny or set of phylogenies.

Methodologically, our approach balances the advantages of probabilistic and parsimony methods. It incorporates scalability and the use of advanced combinatorial optimization techniques from the latter, along with the biological plausibility of the former that comes from employing appropriate prior random tree distributions.

From a modeling perspective, our approach eliminates the common assumption that transmissions are independent-a limitation of many existing methods that does not always hold in real-life settings, as evidenced by the frequent presence of “superspreading” in viral outbreaks. Furthermore, it is polytomy-agnostic, i.e., unlike other methods, it does not require polytomies to be resolved prior to inference. This is particularly important for cancer trees, where polytomies are prevalent. In addition, it is one of the first methods to propose and implement a Bayesian approach to metastatic migration inference, whereas most existing tools rely on the maximum parsimony principle.

This study aligns well with the context of previous research. Several earlier studies have conceptualized migration inference as a coloring problem^55,56,57, employing this framework both to develop efficient inference algorithms and to explore the structure of migration tree space. The methodology introduced in this paper advances these prior approaches by incorporating a more comprehensive mathematical model and applying more sophisticated mathematical techniques. Similarly, the concept of constraining the tree space to random trees from a specified distribution was first introduced in our earlier studies on viral transmissions^48,59, while a related concept of restricting the tree space to subgraphs of specific migration patterns has been utilized in computational cancer genomics^43,83. This paper not only refines and extends these methodologies but also integrates them into a cohesive modeling and computational framework.

Although this paper focuses on specific applications of the homomorphism-based methods in cancer genomics and molecular epidemiology, these methods are in principle generalizable and applicable for other types of problems. Indeed, the proposed framework can be understood as a variant of the inverse problem¹²⁶, which involves inferring unknown state transition rules of a dynamical system from observed outputs. This class of problems is fundamental across disciplines, including physics, control theory, and machine learning.

In our setting, the objective is to estimate the underlying state transition model - both its structure and transition probabilities - given a collection of observed system outputs. These observations are assumed to result from trajectories governed by an unknown but structured transition process, represented as a weighted graph or distribution over plausible transition graphs. Within a homomorphism-based framework, this problem can be approached as follows: (a) construct unlabeled system trajectories, for example, using a hierarchical clustering tree or a directed acyclic graph that encodes similarity or evolutionary relationships between observed outputs; (b) sample candidate transition graphs from a prior distribution (e.g., over graph topologies or transition matrices); (c) for each sample, compute graph homomorphisms from the observed trajectory structure into the candidate transition graph; (d) use the induced mappings to estimate the frequencies of transitions between states, thereby approximating the transition probabilities of the underlying model. This approach provides a flexible and general framework for modeling a wide range of diffusion-like processes, including Markov models, control systems, or the spread of information or influence in social networks.

The proposed approach certainly has both advantages and disadvantages. The proposed method does not explicitly account for unsampled sites, unlike recent viral transmission inference tools such as TransPhylo⁵⁶ and BREATH⁶⁷. However, inferring unsampled sites usually requires timed or branch-length phylogenies, which are common in viral genomics but less frequently used in cancer studies. Cancer research typically relies on character-based phylogenies¹²⁷, for which topology-based tools like SMiTH are more appropriate. These methods assume that the metastatic migration network formed by sampled metastatic sites is connected^42,43,82. Similar settings are also used in molecular epidemiology, particularly in local outbreak investigations where transmission clusters are well-covered through epidemiological methods^80,128 or prior statistical analyses⁷⁷, as well as in studies of viral transmissibility in clinical¹²⁹ and laboratory⁷⁴ settings.

In principle, however, the proposed framework can be extended to accommodate incomplete sampling. This can be done by allowing transition patterns to include more vertices than there are observed states at the leaves of the phylogeny, with the additional vertices representing unobserved states. Homomorphisms would then be allowed to map some nodes to these unobserved states. Such an extension would expand the solution space, likely necessitating the use of additional informative priors.

Furthermore, if the viral transmission or cancer migration tree is scale-free - as has been frequently observed^{116,117,118,119} - then migration tree inference remains relatively robust to unsampled sites, provided that the sampling rate is uniform. This robustness arises from the scale-free networks property to maintain connectivity despite the random removal of a substantial portion of nodes¹²⁴. In other words, with high probability, the inference of transmission links between sampled hosts will not be critically affected under uniformly random and sufficiently dense sampling.

Next, it is important to note that this paper focuses on migration trees, whereas in some real-world scenarios, migration graphs may contain cycles-particularly in cases involving multiple infections or multiple metastatic seeding events. In fact, the proposed methodology is capable of producing outputs that are not strictly trees. The tree samples produced by SMiTH can be aggregated to approximate a likely transmission graph, either by sampling edges from the consensus graph based on their probabilities or by extracting the most probable subgraph with desired properties. This aligns with a broader class of randomized algorithms, where multiple simple solutions (here, trees) are combined to approximate a more complex one (here, a graph). A good example is the graph sparsification problem, where a dense graph G is approximated by a sparser graph H on the same vertex set. This can be done by sampling random spanning trees of G, estimating the probability of each edge appearing in the samples, and resampling edges accordingly to construct H¹³⁰. A similar strategy could be effective in our settings.

An alternative way to handle multiple infections or seeding events is to partition each population into genetically distinct subpopulations, treating each as a separate vertex in the transmission or migration pattern. After tree construction, vertices corresponding to the same deme can be merged to form a graph. This approach has been used in viral studies⁴⁸; subpopulations can be identified via reference genomes, phylogenetic analysis, or clustering^{79,131,132,133,134}.

We also recognize that uniform sampling from the space of candidate migration trees may not be the most optimal tool for migration dynamics inference, and employing more precise optimization or sampling techniques to navigate the tree space could substantially enhance the method’s accuracy and efficiency. This work lays a foundation for further theoretical and algorithmic development in this direction, equipping researchers with the tools needed to expand the use of graph homomorphism methodologies. On the other hand, uniform sampling may be more suitable for hypothesis testing and comparison. It also should be noted that a key aspect of our methodology is that it involves sampling from the space of subtrees of a transition pattern, rather than from the space of phylogeny node labelings as was done in other studies^33,57,65. The former space is considerably smaller in practical scenarios, which often makes the uniform sampling computationally feasible.

Furthermore, it is likely that models tailored to the specific characteristics of underlying populations could potentially yield more accurate insights into the biological processes involved. In particular, the biological mechanisms driving viral and cancer migrations are certainly very different. Nonetheless, employing a unified phylogenetic approach to study highly mutable populations offers several advantages. First, it decouples the initial phylogenetic reconstruction from its biological interpretation, thereby minimizing the risk of overfitting and ensuring that the results are less biased by underlying models⁵⁶. In addition, such methods are significantly more computationally efficient and scalable compared to parameter-rich models^48,56. Finally, more general phylogenetic models offer greater flexibility and versatility, and usually can be readily extended to more specific settings through the integration of suitable priors⁵⁶. These features make them an excellent foundation for more detailed analyses.

The considerations above suggest several directions for future research. These include extending the proposed framework to accommodate more complex migration histories, scenarios involving sparse sampling, and broader classes of inverse problems in dynamical systems. On the theoretical side, the computational complexity of convex and compact label-distinctive homomorphism problems remains unexplored. In addition, potential connections between our model and graph sparsification problems warrant further investigation.

Methods

As the research was limited to computational methods, ethical approval from institutional boards was not applicable.

Basic definitions

Throughout this paper, we consider a pair of trees: a phylogenetic tree and a migration tree. For clarity, we refer to elements of a phylogeny as nodes, and to elements of a migration tree as vertices, and denote them by Greek and Latin letters, respectively.

The problem of migration tree inference is set up as follows. The input is a phylogenetic tree Ψ = (V(Ψ), E(Ψ)), with the leaf set L(Ψ) representing genomic variants belonging to different subpopulations (or demes), denoted by ${{\mathcal{L}}}$. The tree Ψ can be a standard binary phylogeny or a non-binary mutation tree used in most cancer studies. Each leaf λ ∈ L(Ψ) has an assigned site label (or color) ${l}_{\lambda }\in {{\mathcal{L}}}$. The aim is to expand this labeling from the leaves to all nodes in the tree, creating a full labeling $f:V(\Psi )\to {{\mathcal{L}}}$. In this model, any multi-colored tree edge αβ represents a migration of genomic variants between demes f(α) and f(β). The migration tree T = T(Ψ, l) and with vertices $V(T)={{\mathcal{L}}}$, is then formed by contracting the nodes with the same color⁷³.

As mentioned in the introduction, researchers often seek migration trees satisfying particular constraints restricting types of migration or tree topologies. These constraints can be encoded using a transition pattern graph G that describes permissible patterns of migration. Transition pattern can be either a deterministic graph or a random graph characterized by some probability distribution; specific examples are provided in the following subsection. In this model, a migration tree should be isomorphic (i.e., identical up to relabeling of vertices) to a subgraph of the transition pattern. Any corresponding labeling will be called feasible.

The relations between the phylogeny Ψ, the migration tree T and the transition pattern G can be captured using the concept of a graph homomorphism. A homomorphism f: Ψ → G¹¹¹ is an adjacency-preserving mapping between vertex sets of these graphs, i.e., f(u)f(v) ∈ E(G) if uv ∈ E(Ψ). For the sake of mathematical rigor, here and throughout this paper, we assume that a transition pattern is reflexive, i.e., every vertex is adjacent to itself. With this condition in place, any feasible labeling f is a homomorphism from Ψ to G, making the migration inference problem essentially a problem of finding such a homomorphism.

Throughout this paper, we use standard graph theory notations. We denote by Ψ_α a subtree of Ψ rooted at the node α. For any graph G, G[X] represents the subgraph induced by the subset of vertices X. We use u ~ v to indicate that vertices u and v are adjacent. The set of neighbors of a vertex u in G is denoted as N_G(u), deg_G(u) = ∣N_G(u)∣ is the degree of u, N_G[u] = N_G(u) ∪ {u} and N_G[X] is the union of the neighborhoods of all vertices in a set X. In addition, the distance between any two vertices u and v in G is denoted by d_G(u, v), and P_G(u, v) refers to the corresponding shortest path between them. When the graph is clear from the context, we may omit the subscripts in these notations. In the case of a phylogeny Ψ, all distances and paths are undirected.

In addition, to simplify the notation, we will apply set-theoretical operations (e.g., intersection and union) directly to subgraph of G, with the understanding that the resulting subgraph is induced by the set obtained from applying these operations to the vertex sets of the original subgraphs.

Migration inference under structural constraints

The simplest form of the migration inference challenge appears when the vertices of the transition pattern graph G directly correspond to the labels of the phylogenetic tree leaves, that is, $V(G)={{\mathcal{L}}}$. This variant is referred to as labeled inference. The transition pattern G can be an unweighted graph or a random graph with specified edge probabilities. Several scenarios exemplify this problem:

For viral transmissions, G could reflect a contact network of potential hosts^80,109, where an infection spreads through direct interactions within this network.
Another viral transmission model assumes that transmission is only possible between hosts with overlapping exposure intervals^33,51. In this case, G is an interval graph¹³⁵, i.e., a graph with vertices representing time intervals and edges connecting vertices whose intervals intersect.
For metastatic spread inference, G may represent a circulatory network^119,136,137, with vertices representing organs and edges reflecting the prior probabilities of cancer spreading between organs.

Problem 1 (labeled migration inference)

Input:

(a1) a phylogenetic tree Ψ with leaf labels ${({l}_{\lambda })}_{\lambda \in L(\Psi )}$ forming the label set ${{\mathcal{L}}}$;
(b1) a graph G with $V(G)={{\mathcal{L}}}$;

Output:

(c1) a homomorphism f: Ψ → G such that f(λ) = l_λ for every λ ∈ L(Ψ).

In practical settings, however, Problem 1 might be too restrictive or not fully reflective of reality. The main issue is the absence of a known mapping between the leaf labels and the vertices of the transition pattern. That is, the pattern G defines the permissible structure of migration, such as which configurations are allowed-rather than specifying exact migration routes. For example, in the context of viral transmission, we might expect the presence of a superspreader, which would result in a star-like topology. This indicates the permissible structure. However, the identity of the host acting as the superspreader-and consequently, the precise transmission links-are unknown. The following models to this variant of the problem:

Viral transmission: The transition pattern G could represent a random graph that describes expected characteristics of transmission trees, like being scale-free or having a particular expected degree distribution. Such models draw on known expected properties of contact networks essential for infection spread^{116,138,139,140} without requiring actual host contact information, and have been explored in several studies^48,59.
Metastatic spread inference: In this case, a transition pattern G may describe plausible evolutionary scenarios of cancer migration, such as monoclonal or polyclonal seeding from single or multiple sources^43,83. A related idea has been applied in studies analyzing CRISPR-based lineage tracing phylogenies, where G specifies a so-called star homoplasy model¹⁴¹.
In general phylogenetics, several widely used models-such as Dollo and Camin-Sokal parsimony-can be reformulated in terms of transition patterns¹⁴¹ and therefore fall naturally within the graph homomorphism framework.

In this version of migration inference problem, is no explicitly given mapping between the set of leaf labels and the vertices of the transition pattern. Instead, a feasible homomorphism should map leaves with distinct labels to distinct vertices of G and vice versa, i.e., for any λ₁, λ₂ ∈ L(Ψ) we have f(λ₁) ≠ f(λ₂) whenever ${l}_{{\lambda }_{1}}\ne {l}_{{\lambda }_{2}}$ (Fig. 5 (a)). We will call a homomorphisms satisfying this requirement a label-distinctive homomorphism.

**Fig. 5: Solutions for the Migration History Inference Problem under various constraints.**

In this study, we focus specifically on migration trees because: (a) this formulation is biologically meaningful and widely used in previous research; (b) the associated computational problems are algorithmically tractable, as demonstrated below; and (c) the methods developed for migration tree inference can be naturally extended to more general spread patterns (see Discussion). In such a context, the first question to be asked is whether a given subtree T of the transition pattern G is consistent with the phylogeny Ψ, i.e., whether there exist a label-distinctive homomorphism Ψ → T.

Problem 2 (unlabeled migration inference)

Input:

(a2) a phylogenetic tree Ψ with leaf labels ${({l}_{\lambda })}_{\lambda \in L(\Psi )}$;
(b2) a tree T;

Output:

(c2) a label-distinctive homomorphism f: Ψ → T.

A more restricted version of Problem 2 includes an additional convexity constraint^142,143, where tree nodes mapping to the same vertex of G form a connected subtree of Ψ (Fig. 5 (b)):

Problem 3 (convex unlabeled migration inference)

Input: (a2) and (b2)

Output:

(c3) a label-distinctive homomorphism f: Ψ → T such that induced subgraphs Ψ[f ⁻¹(v)] are connected for all v ∈ V(T).

We will refer to a homomorphism satisfying (c3) as a convex homomorphism. Convex homomorphisms to trees describe migrations with a complete bottleneck; such migration trees were the focus of extensive research in previous studies^57,73,110.

To define another type of constraint, consider a labeling l of nodes of the phylogeny Ψ. The labeling l is compact if the label of the most recent common ancestor (MRCA) for any group of leaves matches one of the labels within that group.

The rationale for this is based on the understanding that migrations within any given subtree could follow one of the following scenarios: either (i) migrations involve only the demes represented by the leaves of the subtree, or (ii) migrations include external demes, but lineages from these demes were not sampled due to extinction or incomplete collection. Although both scenarios are feasible, the first is more parsimonious, especially when sampling is dense and migration events span a short period of time. Hence, homomorphisms that align with this compact labeling are referred to as compact.

Problem 4 (compact unlabeled migration inference)

Input: (a2) and (b2)

Output:

(c4) a label-distinctive homomorphism f: Ψ → T such that for any node α ∈ V(Ψ) we have f(α) ∈ f(L(Ψ_α)).

Note that convexity and compactness are independent properties; neither implies the other. An example of a convex but non-compact labeling is shown in Fig. 5b, while an example of a compact but non-convex labeling is depicted by Supplementary Fig. 1.

It is often desirable to produce a sample of potential solutions rather than a single solution. Such approach offers statistical backing for potential migration links and logically shifts the migration inference problem to a Bayesian paradigm. Potential solutions can be sampled from a random graph distribution or, if a transition pattern is a deterministic graph, from the uniform distribution of its subgraphs with specified properties (e.g., subtrees). The sampling-based version of the migration inference problem can be formulated as follows:

Problem 5 (unlabeled migration sampling)

Input:

(a5) a phylogenetic tree Ψ with leaf labels ${({l}_{\lambda })}_{\lambda \in L(\Psi )}$;
(b5) a random graph G (further referred to as a transition pattern) with edge probabilities p: E(G) → [0, 1] that define a distribution ${{\mathcal{D}}}$ of subtrees of G.

Output:

(c5) a sample S = {T₁, …, T_m} from the distribution ${{\mathcal{D}}}$, where each subtree T_i is a homomorphic image of Ψ, and the corresponding label-distinctive homomorphisms f_i: Ψ → T_i are possibly convex and/or compact.

Finally, it is useful to relate the concepts introduced here to standard notions in structural graph theory. In this context, a graph homomorphism is sometimes interpreted as a G-coloring of the tree Ψ¹¹¹, where each node of Ψ is assigned a “color”-that is, a vertex of T. In this sense, the homomorphism model generalizes coloring frameworks explored in prior transmission inference studies.

A key distinction from most existing homomorphism studies is that we assume the target tree T is reflexive-a less used, but not uncommon, variant¹¹¹. Another central feature of our model is the label-distinction property, which significantly differentiates it from standard formulations. To the best of our knowledge, this specific variant of the graph homomorphism problem has not been previously studied. It can be seen as a relaxed version of the list homomorphism problem: instead of assigning explicit allowed images to nodes, we only impose constraints on which nodes must or must not share an image.

In the case of a convex homomorphism, the tree T can be derived from Ψ through a sequence of edge contractions, making T a minor of Ψ. In general, a graph G₁ is a minor of a graph G₂ if it can be obtained by a combination of edge contractions, edge deletions, and vertex deletions¹⁴⁴. However, the minors relevant to our problem are subject to stricter constraints: only edge contractions are permitted, and contractions between labeled nodes with different labels are explicitly disallowed. In addition, convex homomorphisms generalize convex colorings¹⁴⁵ in the same way as standard homomorphisms generalize proper colorings¹¹¹. Finally, we are not aware of any prior work that explores concepts similar to compact homomorphisms.

Labeled migration inference

Problem 1 can be efficiently solved in polynomial time using dynamic programming. Briefly, the algorithm first performs a post-order traversal to compute potential image sets for each node, ensuring that each internal node can be mapped while preserving adjacency constraints. If the root has a valid image, a pre-order traversal assigns specific values to construct the homomorphism; otherwise, it confirms that no valid mapping exists (see Supplementary Note 1 for formal description).

Unconstrained unlabeled migration inference

A possible strategy to solve Problem 2 involves identifying a bijection $g:{{\mathcal{L}}}\to V(T)$ that can be extended to a homomorphism Ψ → T using the algorithm for the labeled migration inference. We will describe such bijections as feasible.

Let L_g(u) = {λ ∈ L(Ψ): g(l_λ) = u} be the set of leaves in Ψ whose labels map to u. The following theorem establishes a necessary and sufficient condition for the bijection feasibility.

Theorem 1

The bijection $g:{{\mathcal{L}}}\to V(T)$ is feasible if and only if

$${d}_{T}({u}_{1},{u}_{2})\le {d}_{\Psi }({\lambda }_{1},{\lambda }_{2})$$

(1)

for all u₁, u₂ ∈ V(T), λ₁ ∈ L_g(u₁), λ₂ ∈ L_g(u₂).

In other words, Theorem 1 claims that g is feasible if and only if it is a contraction. The necessity of this condition is a known property of graph homomorphisms¹¹¹, but it is generally not sufficient¹¹¹. However, for our specific type of the homomorphism problem it indeed suffice. The proof of this and the following theorems can be found in the Supplementary Note 2.

Theorem 1 establishes that the Unlabeled Migration Inference problem is algorithmically equivalent to the problem of finding a corresponding contraction. First of all, this allows us to demonstrate that Problem 2 is ${{\mathcal{NP}}}$-hard. It can be proved through a reduction from the following problem:

Graph Bandwidth problem.

Given: A graph G.

Find: The minimal integer K (denoted as bw(G)) for which there exists is a bijection f: V(G) → {1, …, ∣V(G)∣} such that

$$| \, f(u)-f(v)| \le K\,{{\mbox{for}}} \, {{\rm{all}}}\,u \sim v.$$

(2)

The Graph Bandwidth problem is ${{\mathcal{NP}}}$-hard¹⁴⁶, and, moreover, it cannot be approximated within any constant factor unless ${{\mathcal{P}}}={{\mathcal{NP}}}$, even when the input graph G is a tree¹⁴⁷.

Theorem 2

The Unlabeled Migration Inference problem is ${{\mathcal{NP}}}$-hard, even when all leaf labels in the phylogeny Ψ are unique.

Although the Unlabeled Migration Inference problem is ${{\mathcal{NP}}}$-hard, we can use Theorem 1 to approach it using Integer Linear Programming (ILP). We define a feasible bijection $g:{{\mathcal{L}}}\to V(T)$ using binary variables x_iu, where x_i_,_u = 1 if g(i) = u, for each $i\in {{\mathcal{L}}}$ and u ∈ V(T). The ILP formulation to find such a bijection is as follows:

$${\sum}_{i\in {{\mathcal{L}}}}{\sum}_{u\in V(T)}\delta (i){{\rm{deg}}}(u){x}_{iu}\to \max$$

(3)

s.t.

$${\sum}_{u\in V(T)}{x}_{iu}=1,\,\,\,i\in {{\mathcal{L}}};$$

(4)

$${\sum}_{i\in {{\mathcal{L}}}}{x}_{iu}=1,\,\,\,u\in V(T);$$

(5)

$${x}_{iu}+{x}_{jv}\le 1,\,\,\,i,j\in {{\mathcal{L}}},u,v\in V(T),{\Delta }_{ij} < {d}_{T}(u,v);$$

(6)

$${x}_{iu}+{x}_{iv}+{x}_{ju}+{x}_{jv}\le {y}_{ij}+1\,\,\,i,j\in {{\mathcal{L}}},uv\in E(T);$$

(7)

$${\sum}_{i,j\in {{\mathcal{L}}}}{y}_{ij}=n-1.$$

(8)

where ${\Delta }_{ij}={\min }_{{\lambda }_{i}\in {l}^{-1}(i),{\lambda }_{j}\in {l}^{-1}(j)}{d}_{\Psi }({\lambda }_{i},{\lambda }_{j})$.

In this formulation, constraints (4) and (5) ensure that x encodes a bijection, while constraints (6) guarantee that the bijection adheres to the conditions of Theorem 1. The auxiliary variables y_ij in constraints (7) indicate whether a pair of leaf labels map to adjacent vertices in T, with constraint (8) ensuring the inferred migration network forms a tree.

The coefficient δ(i) of the objective function (3) represents the genetic diversity of the ith subpopulation, measured as allelic entropy averaged over all allelic positions. The objective function defined in this way facilitates the search for a solution by leveraging the relationship between population diversity and population age^148,149,150, that suggests that more diverse populations, which are likely older, are also more probable origins of migration^52,59,88. Consequently, it is more likely that such populations correspond to high-degree vertices in the tree T, and the objective function (3) prioritize such solutions.

Migration inference under convexity constraints

In this section, a convex label-distinctive homomorhism Ψ → T will be called C-feasible. As noted earlier, in this case T is a minor of Ψ. It is known that the problem of detecting whether a given graph is a minor of another graph is ${{\mathcal{NP}}}$-hard, even when input graphs are trees^151,152. However, since we consider a restricted class of minors, we suggest that in practical settings C-feasible homomorphisms can be efficiently found and enumerated using dynamic programming.

The proposed algorithmic approach is described as follows. Initially, when possible, we simplify the phylogenetic tree Ψ by collapsing paths between leaves sharing identical labels into a single node, a step made feasible by the convexity constraint. As a result, Ψ may become non-binary and contains uniquely labeled leaves and possibly some labeled internal nodes.

The algorithm performs a post-order traversal of the phylogeny Ψ and, for each node α ∈ V(Ψ), calculates a set H_α describing possible homomorphisms from the subtree Ψ_α to subtrees of T. At the root node ρ, the set H_ρ thus describes all homomorphisms from Ψ to T. Upon completing the traversal, the algorithm either concludes that no C-feasible homomorphism exists (when ${H}_{\rho }={{\emptyset}}$), or initiates a pre-order traversal of Ψ. During this second traversal, it reconstructs C-feasible homomorphisms using the information from the sets H_α.

Formally, let Λ_α be the set of labeled nodes in the subtree Ψ_α. A subtree T[v, X] of T is termed an induced v-subtree if it includes the vertex v, a subset X of v’s neighbors, and all vertices that are connected to v via paths that intersect with X (Fig. 6). Equivalently, T[v, X] can be expressed recursively as

$$T[v,X]=\left(\mathop{\bigcup}_{x\in X}T[x,{N}_{T}(x)\setminus \{v\}]\right)\bigcup \{v\}$$

(9)

**Fig. 6: Overview of the Dynamic Programming Algorithm for Detecting Convex Label-Distinctive Homomorphisms.**

For a vertex α ∈ V(Ψ), the set H_α consists of triples (v, X, C) called partial homomorphism tokens or simply tokens. In each token, (i) v ∈ V(T), (ii) X ⊆ N_T(v), (iii) C is a subset of vertices of an induced v-subtree T[v, X] such that there exists a C-feasible surjective homomorphism f: Ψ_α → T[v, X] with f(α) = v and f(Λ_α) = C.

The algorithm is initialized by setting ${H}_{\lambda }=\{(v,{{\emptyset}},\{v\}):v\in V(T)\}$ for all leafs λ. For an internal node α, the set H_α is constructed based on the sets from its children nodes β₁, …, β_k. The construction utilizes the following lemma (the proof can be found in Supplementary Note 3):

Lemma 1

Let T[v, X] be an induced v-subtree. Then there exist a C-feasible surjective homomorphism f: Ψ_α → T[v, X] with f(α) = v and f(Λ_α) = C if and only if there exist tokens $({v}_{1},{X}_{1},{C}_{1})\in {H}_{{\beta }_{1}}$,…,$({v}_{k},{X}_{k},{C}_{k})\in {H}_{{\beta }_{k}}$ satisfying the following conditions:

(a1) v ~ v_i or v = v_i for all i ∈ {1, …, k};
(b1) $(V(T[{v}_{i},{X}_{i}])\setminus \{v\})\cap (V(T[{v}_{j},{X}_{j}])\setminus \{v\})={{\emptyset}}$ for all i, j ∈ {1, …, k}, i ≠ j;
(c1) v ∈ V(T[v_i, X_i]) if and only if v_i= v.
(d1) v ∉ C_i, if α is labeled.
(e1) X_i = N_T(v_i)⧹{v}, if v_i ≠ v.
(f1) X = {v₁, …, v_k}⧹{v};

For convenience, we also state the following condition, which is logically implied by the others but aids in the forthcoming algorithm’s description:

(g1) C = C₁ ∪ ⋯ ∪ C_k, if α is unlabeled, and C = C₁ ∪ ⋯ ∪ C_k ∪ {v}, if α is labeled.

To convert Lemma 1 into an algorithm constructing the set of tokens for a node α using the tokens of its children, we first need to identify tokens of children that satisfy conditions (a1)-(g1). This can be achieved through the following steps. For each vertex v ∈ V(T), we construct a multipartite graph ${{\mathcal{G}}}={{\mathcal{G}}}(\alpha,v)$ (i.e., a graph partitioned into k independent sets or parts), as described below:

(i)
The parts A₁, …, A_k correspond to children of α.
(ii)
The vertices of the set A_i are tokens from the set ${\Psi }_{{\beta }_{i}}$ satisfying the conditions (a1),(c1), (d1) and (e1).
(iii)
Two tokens from the sets A_i and A_j are adjacent whenever they satisfy the condition (b1).

In the constructed graph, sets of partial homomorphism tokens that satisfy Lemma 1 can be identified as k-vertex cliques. We employ the Bron-Kerbosch algorithm¹⁵³ to generate these cliques. For each identified clique, we use conditions (f1) and (g1) to construct a new token (v, X, C) for the node α.

In addition, for each token (v, X, C) ∈ H_α, we maintain pointers p(v, X, C) that link to the children tokens used in its construction. These pointers are used in the subsequent phase of the algorithm, which aims to reconstruct full C-feasible homomorphisms f.

During this phase, the algorithm executes a pre-order traversal of Ψ. As it progresses, it recursively assigns a specific token to each node. When a token t = (v, X, C) is assigned to node α, the algorithm sets f(α) = v. It then retrieves tokens for t via the pointers p(t) and assigns them to children, β₁, …, β_k. This ensures that by the end of the traversal, each node in Ψ has been assigned a homomorphic image, completing the construction of the homomorphism f.

The outlined method is formalized as Supplementary Algorithm 2 and illustrated by Fig. 6. Its efficiency can be improved in several ways, with specific details discussed in Supplementary Note 3. It should be noted that, strictly speaking, the described algorithm is not polynomial, since the number of tokens for a node of Ψ theoretically can be exponential. In practical settings, however, the algorithm is extremely fast, and require split seconds to finish.

Migration inference with convexity and compactness constraints

A similar approach to the one outlined in Subsection 4 can be employed to identify homomorphisms that are both convex and compact. However, the dynamic programming algorithm can be further optimized by taking advantage of the specific nature of these constraints.

In what follows, homomorphisms that are both convex and compact will be referred to as CC-feasible. Following a similar methodology to that used in Supplementary Algorithm 2, we begin by contracting the paths in the phylogeny Ψ. We then construct partial homomorphism tokens similar to those used previously, with the exception that subsets C of images of labeled nodes are not required. Thus, the tokens are simplified to pairs (v, X), where v ∈ V(T) and X ⊆ N_T(v). A pair (v, X) is included in H_α if there exists a CC-feasible surjective homomorphism f: Ψ_α → T[v, X] such that f(α) = v, and it also satisfies the following condition:

$$| {\Lambda }_{\alpha }|=| T[v,X]| .$$

(10)

This condition is necessary for a partial homomorphism to be extendable to a full compact homomomorphism Ψ → T.

The algorithm is initialized by setting ${H}_{\lambda }=\{(v,{{\emptyset}}):v\in V(T)\}$ for leafs λ. For an internal node α ∈ V(Ψ), its token set H_α is constructed from the tokens of its children β₁, …β_k using Lemma 2 (see Supplementary Note 4 for the proof).

Lemma 2

(v, X) ∈ H_α if and only if one of the following conditions hold:

1.
α is not labeled and there exist w ∈ X such that $(v,X\setminus \{w\})\in {H}_{{\beta }_{1}}$ and $(w,{N}_{T}(w)\setminus \{v\})\in {H}_{{\beta }_{2}}$.
2.
α is labeled, ∣X∣ = k, and there exist a permutation (v₁, …, v_k) of elements of X such that v ~ v₁, …, v_k, and $({v}_{1},{N}_{T}({v}_{1})\setminus \{v\})\in {H}_{{\beta }_{1}},\ldots,({v}_{k},{N}_{T}({v}_{k})\setminus \{v\})\in {H}_{{\beta }_{k}}$.

Lemma 2 can be straightforwardly used to construct tokens of α from tokens of its children, if α is unlabeled. When α is labeled, then finding a permutation (v₁, …, v_k) required by Lemma 2 is more complicated and can be achieved using the approach described next.

Suppose that ${{\mathcal{S}}}=({S}_{1},\ldots,{S}_{k})$ is a collection of sets. A vector (x₁, …, x_k) is termed a transversal of ${{\mathcal{S}}}$¹⁵⁴ if x_i ∈ S_i and all x_i are distinct. Let now ${S}_{i}=\{u:\left.(u,Y)\right)\in {H}_{{\beta }_{i}}\,{\mbox{and}}\,v \sim u\}$. Then the vector (v₁, …, v_k) satisfies the condition 2) of Lemma 2 if and only if it is a transversal of ${{\mathcal{S}}}$.

Given this, the set H_α can be obtained by generating all transversals of ${{\mathcal{S}}}$. This can be done by reducing the problem to finding all maximal matchings in a bipartite graph (a graph consisting of two disjoint vertex sets with edges only between them; a matching is a set of its non-overlapping edges, see Supplementary Note 4 for specific details. The full method is described in Supplementary Algorithm 3.

SMiTH: Sampling MIgrations and Transmissions with Homomorphisms

The algorithms discussed can be effectively integrated into an Unlabeled Migration Sampling framework (Problem 5). It allows for the identification of homomorphic images of a given phylogeny Ψ within a collection of candidate migration trees sampled from a given migration pattern represented by a specified tree distribution.

The obtained sample of migration trees can be directly analyzed to estimate the probabilities of specific migration routes or to obtain summary statistics and confidence intervals for derivative evolutionary parameters. It can be also synthesized into a single weighted consensus graph, where each edge is weighted by the number of candidate trees that support it. When the homomorphism reconstruction includes an objective function, the consensus graph is constructed from a subsample comprising the top κ% of trees ranked by their objective values. For applications requiring a specific output tree – such as for benchmarking and comparison with other methods – the tree is determined by calculating the maximum-weight spanning tree of the consensus graph. The entire algorithmic pipeline, named SMiTH (SMiTH: Sampling MIgrations and Transmissions with Homomorphisms), is illustrated in Fig. 7.

**Fig. 7: SMiTH: Sampling MIgrations and Transmissions with Homomorphisms.**

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.

Data availability

Simulated data generated in this study are available at https://github.com/compbel/SMiTH. Experimental HCV datasets are available at https://doi.org/10.5281/zenodo.6792964. The cancer clone trees used in this study were originally published in ref. ⁴³, and the raw datasets analyzed in that study are available from the database of Genotypes and Phenotypes (dbGaP) database under accession numbers phs000676 and phs000941.v1.p1 and from the European Genome-phenome Archive (EGA) under accessions EGAS00001000547, EGAS00001000262, EGAS00001000730 and EGAS00001000756. Source data are provided in this paper.

Code availability

The code and software developed in this study are available at https://github.com/compbel/SMiTH and https://doi.org/10.5281/zenodo.15036855.

References

DeGregori, J. Adaptive Oncogenesis: A New Understanding of How Cancer Evolves Inside Us. (Harvard University Press, Cambridge, MA, 2018).
Book Google Scholar
Nowell, P. C. The clonal evolution of tumor cell populations: Acquired genetic lability permits stepwise selection of variant sublines and underlies tumor progression. Science 194, 23–28 (1976).
Article ADS CAS PubMed Google Scholar
Greaves, M. & Maley, C. C. Clonal evolution in cancer. Nature 481, 306 (2012).
Article ADS CAS PubMed PubMed Central Google Scholar
Yates, L. R. & Campbell, P. J. Evolution of the cancer genome. Nat. Rev. Genet. 13, 795 (2012).
Article CAS PubMed PubMed Central Google Scholar
Bonavia, R., Cavenee, W. K. & Furnari, F. B. et al. Heterogeneity maintenance in glioblastoma: a social network. Cancer Res. 71, 4055–4060 (2011).
Article CAS PubMed PubMed Central Google Scholar
Burrell, R. A., McGranahan, N., Bartek, J. & Swanton, C. The causes and consequences of genetic heterogeneity in cancer evolution. Nature 501, 338–345 (2013).
Article ADS CAS PubMed Google Scholar
Doyle, M. A., Li, J., Doig, K., Fellowes, A. & Wong, S. Q. Studying cancer genomics through next-generation dna sequencing and bioinformatics. Methods Mol. Biol. 11168, 83–98 (2014).
Kuipers, J., Jahn, K. & Beerenwinkel, N. Advances in understanding tumour evolution through single-cell sequencing. Biochim. Biophys. Acta Rev. Cancer 1867, 127–138 (2017).
Article CAS PubMed PubMed Central Google Scholar
Greaves, M. Nothing in cancer makes sense except…. BMC Biol. 16, 1–8 (2018).
Article Google Scholar
Phillips, R. E. et al. Human immunodeficiency virus genetic variation that can escape cytotoxic t cell recognition. Nature 354, 453–459 (1991).
Article ADS CAS PubMed Google Scholar
Boutwell, C. L., Rolland, M. M., Herbeck, J. T., Mullins, J. I. & Allen, T. M. Viral evolution and escape during acute hiv-1 infection. J. Infect. Dis. 202, S309 (2010).
Article PubMed Google Scholar
Ramachandran, S. et al. Temporal variations in the hepatitis c virus intrahost population during chronic infection. J. Virol. 85, 6369–6380 (2011).
Article CAS PubMed PubMed Central Google Scholar
Frost, S. D., Dumaurier, M.-J., Wain-Hobson, S. & Brown, A. J. L. Genetic drift and within-host metapopulation dynamics of hiv-1 infection. Proc. Natl. Acad. Sci. USA 98, 6975–6980 (2001).
Article ADS CAS PubMed PubMed Central Google Scholar
Lara, J., López-Labrador, F. X., González-Candelas, F., Berenguer, M. & Khudyakov, Y. E. Computational models of liver fibrosis progression for hepatitis c virus chronic infection. BMC Bioinform. 15, 1–9 (2014).
Article Google Scholar
Zhang, Y. et al. Sars-cov-2 intra-host single-nucleotide variants associated with disease severity. Virus Evol. 8, veac106 (2022).
Article PubMed PubMed Central Google Scholar
Lee, H. Y., Perelson, A. S., Park, S.-C. & Leitner, T. Dynamic correlation between intrahost hiv-1 quasispecies evolution and disease progression. PLoS Comput. Biol. 4, e1000240 (2008).
Article MathSciNet PubMed PubMed Central Google Scholar
Shankarappa, R. et al. Consistent viral evolutionary changes associated with the progression of human immunodeficiency virus type 1 infection. J. Virol. 73, 10489–10502 (1999).
Article CAS PubMed PubMed Central Google Scholar
Xu, M. et al. Genome sequencing analysis identifies epstein–barr virus subtypes associated with high risk of nasopharyngeal carcinoma. Nat. Genet. 51, 1131–1136 (2019).
Article CAS PubMed PubMed Central Google Scholar
Van Loy, T. et al. Jc virus quasispecies analysis reveals a complex viral population underlying progressive multifocal leukoencephalopathy and supports viral dissemination via the hematogenous route. J. Virol. 89, 1340–1347 (2015).
Article PubMed Google Scholar
Voloch, C. M. et al. Intra-host evolution during sars-cov-2 prolonged infection. Virus Evol. 7, veab078 (2021).
Article PubMed PubMed Central Google Scholar
Skums, P., Bunimovich, L. & Khudyakov, Y. Antigenic cooperation among intrahost hcv variants organized into a complex network of cross-immunoreactivity. Proc. Natl. Acad. Sci. USA 112, 6653–6658 (2015).
Article ADS CAS PubMed PubMed Central Google Scholar
Campo, D. S., Zhang, J., Ramachandran, S. & Khudyakov, Y. Transmissibility of intra-host hepatitis c virus variants. BMC Genom. 18, 11–19 (2017).
Article Google Scholar
Salemi, M. The intra-host evolutionary and population dynamics of human immunodeficiency virus type 1: a phylogenetic perspective. Infect. Dis. Rep. 5, e3 (2013).
Article PubMed PubMed Central Google Scholar
Hill, A. L. Mathematical Models of HIV Latency (pp. 131–156. Springer International Publishing, Cham, 2018).
Google Scholar
Smith, E. C. The not-so-infinite malleability of rna viruses: Viral and cellular determinants of rna virus mutation rates. PLoS Pathogens 13, e1006254 (2017).
Article PubMed PubMed Central Google Scholar
Duffy, S. Why are rna virus mutation rates so damn high? PLoS Biol. 16, e3000003 (2018).
Article PubMed PubMed Central Google Scholar
Sanjuán, R. & Domingo-Calap, P. Mechanisms of viral mutation. Cell. Mol. Life Sci. 73, 4433–4448 (2016).
Article PubMed PubMed Central Google Scholar
Tomlinson, I. P., Novelli, M. & Bodmer, W. The mutation rate and cancer. Proc. Natl. Acad. Sci. USA 93, 14800–14803 (1996).
Article ADS CAS PubMed PubMed Central Google Scholar
Grady, W. M. Genomic instability and colon cancer. Cancer Metastasis Rev. 23, 11–27 (2004).
Article CAS PubMed Google Scholar
Charames, G. S. & Bapat, B. Genomic instability and cancer. Curr. Mol. Med. 3, 589–596 (2003).
Article CAS PubMed Google Scholar
Utro, F., Levovitz, C., Rhrissorrakrai, K. & Parida, L. A common methodological phylogenomics framework for intra-patient heteroplasmies to infer sars-cov-2 sublineages and tumor clones. BMC Genom. 22, 1–13 (2021).
Article Google Scholar
Stadler, T., Pybus, O. G. & Stumpf, M. P. Phylodynamics for cell biologists. Science 371, eaah6266 (2021).
Article CAS PubMed Google Scholar
Sashittal, P. & El-Kebir, M. Sampling and summarizing transmission trees with multi-strain infections. Bioinformatics 36, i362–i370 (2020).
Article CAS PubMed PubMed Central Google Scholar
Sanborn, J. Z. et al. Phylogenetic analyses of melanoma reveal complex patterns of metastatic dissemination. Proc. Natl. Acad. Sci. USA 112, 10995–11000 (2015).
Article ADS CAS PubMed PubMed Central Google Scholar
Nguyen, D. X. & Massagué, J. Genetic determinants of cancer metastasis. Nat. Rev. Genet. 8, 341–352 (2007).
Article CAS PubMed Google Scholar
Grubaugh, N. D. et al. Tracking virus outbreaks in the twenty-first century. Nat. Microbiol. 4, 10–19 (2019).
Article CAS PubMed Google Scholar
Armstrong, G. L. et al. Pathogen genomics in public health. N. Engl. J. Med. 381, 2569–2580 (2019).
Article PubMed PubMed Central Google Scholar
Black, A., MacCannell, D. R., Sibley, T. R. & Bedford, T. Ten recommendations for supporting open pathogen genomic analysis in public health. Nat. Med. 26, 832–841 (2020).
Article CAS PubMed PubMed Central Google Scholar
Lähnemann, D. et al. Eleven grand challenges in single-cell data science. Genome Biol. 21, 1–35 (2020).
Article Google Scholar
Navin, N. et al. Tumour evolution inferred by single-cell sequencing. Nature 472, 90–94 (2011).
Article ADS CAS PubMed PubMed Central Google Scholar
Knyazev, S., Hughes, L., Skums, P. & Zelikovsky, A. Epidemiological data analysis of viral quasispecies in the next-generation sequencing era. Brief. Bioinform. 22, 96–108 (2021).
Article CAS PubMed Google Scholar
Quinn, J. J. et al. Single-cell lineages reveal the rates, routes, and drivers of metastasis in cancer xenografts. Science 371, eabc1944 (2021).
Article CAS PubMed PubMed Central Google Scholar
El-Kebir, M., Satas, G. & Raphael, B. J. Inferring parsimonious migration histories for metastatic cancers. Nat. Genet. 2, 5 (2018).
Google Scholar
Jombart, T. et al. Bayesian reconstruction of disease outbreaks by combining epidemiologic and genomic data. PLoS Comput. Biol. 10, e1003457 (2014).
Article PubMed PubMed Central Google Scholar
Campbell, F. et al. outbreaker2: a modular platform for outbreak reconstruction. BMC Bioinform. 19, 1–8 (2018).
Article Google Scholar
Jombart, T., Eggo, R., Dodd, P. & Balloux, F. Reconstructing disease outbreaks from genetic data: a graph approach. Heredity 106, 383–390 (2011).
Article CAS PubMed Google Scholar
De Maio, N., Wu, C.-H. & Wilson, D. J. Scotti: efficient reconstruction of transmission within outbreaks with the structured coalescent. PLoS Comput. Biol. 12, e1005130 (2016).
Article PubMed PubMed Central Google Scholar
Skums, P. et al. Sophie: viral outbreak investigation and transmission history reconstruction in a joint phylogenetic and network theory framework. Cell Syst. 13, 844–856 (2022).
Article CAS PubMed PubMed Central Google Scholar
Klinkenberg, D., Backer, J. A., Didelot, X., Colijn, C. & Wallinga, J. Simultaneous inference of phylogenetic and transmission trees in infectious disease outbreaks. PLoS Comput. Biol. 13, e1005495 (2017).
Article ADS PubMed PubMed Central Google Scholar
Worby, C. J. et al. Reconstructing transmission trees for communicable diseases using densely sampled genetic data. Ann. Appl. Stat. 10, 395 (2016).
Article MathSciNet PubMed PubMed Central Google Scholar
De Maio, N., Worby, C. J., Wilson, D. J. & Stoesser, N. Bayesian reconstruction of transmission within outbreaks using genomic variants. PLoS Comput. Biol. 14, e1006117 (2018).
Article PubMed PubMed Central Google Scholar
Wymant, C. et al. Phyloscanner: inferring transmission from within-and between-host pathogen genetic diversity. Mol. Biol. Evol. 35, 719–733 (2017).
Article PubMed Central Google Scholar
de Bernardi Schneider, A. et al. Strainhub: A phylogenetic tool to construct pathogen transmission networks. Bioinformatics 36, 945–947 (2020).
Article PubMed Google Scholar
Didelot, X., Fraser, C., Gardy, J. & Colijn, C. Genomic infectious disease epidemiology in partially sampled and ongoing outbreaks. Mol. Biol. Evol. 34, 997–1007 (2017).
CAS PubMed PubMed Central Google Scholar
Didelot, X., Gardy, J. & Colijn, C. Bayesian inference of infectious disease transmission from whole-genome sequence data. Mol. Biol. Evol. 31, 1869–1879 (2014).
Article CAS PubMed PubMed Central Google Scholar
Carson, J., Keeling, M., Wyllie, D., Ribeca, P. & Didelot, X. Inference of infectious disease transmission through a relaxed bottleneck using multiple genomes per host. Mol. Biol. Evol. 41, msad288 (2024).
Article CAS PubMed PubMed Central Google Scholar
Hall, M. D. & Colijn, C. Transmission trees on a known pathogen phylogeny: Enumeration and sampling. Mol. Biol. Evol. 36, 1333–1343 (2019).
Article CAS PubMed PubMed Central Google Scholar
Sledzieski, S., Zhang, C., Mandoiu, I. & Bansal, M. S. Treefix-tp: Phylogenetic error-correction for infectious disease transmission network inference. Pac. Symp. Biocomput. 26, 119–130 (2019).
Skums, P. et al. Quentin: reconstruction of disease transmissions from viral quasispecies genomic data. Bioinformatics 34, 163–170 (2017).
Article PubMed Central Google Scholar
Glebova, O. et al. Inference of genetic relatedness between viral quasispecies from sequencing data. BMC Genom. 18, 918 (2017).
Article Google Scholar
Kosakovsky Pond, S. L., Weaver, S., Leigh Brown, A. J. & Wertheim, J. O. Hiv-trace (transmission cluster engine): a tool for large scale molecular epidemiology of hiv-1 and other rapidly evolving pathogens. Mol. Biol. Evol. 35, 1812–1819 (2018).
Article PubMed PubMed Central Google Scholar
Longmire, A. G. et al. Ghost: global hepatitis outbreak and surveillance technology. BMC Genom. 18, 916 (2017).
Article Google Scholar
Campbell, E. M. et al. Microbetrace: retooling molecular epidemiology for rapid public health response. PLoS Comput. Biol. 17, e1009300 (2021).
Article CAS PubMed PubMed Central Google Scholar
Sashittal, P. & El-Kebir, M. SharpTNI: Counting and sampling parsimonious transmission networks under a weak bottleneck. Preprint at https://doi.org/10.1101/842237 (2019).
Dhar, S., Zhang, C., Mandoiu, I. & Bansal, M. S. Tnet: Transmission network inference using within-host strain diversity and its application to geographical tracking of covid-19 spread. IEEE/ACM Trans. Comput. Biol. Bioinform 19, 230–242 (2021).
Ke, Z. & Vikalo, H. Graph-based reconstruction and analysis of disease transmission networks using viral genomic data. J. Comput. Biol. 7, 796–813 (2023).
Colijn, C., Hall, M. D. & Bouckaert, R. Taking a breath (Bayesian reconstruction and evolutionary analysis of transmission histories) to simultaneously infer phylogenetic and transmission trees for partially sampled outbreaks. Preprint at https://doi.org/10.1101/2024.07.11.603095 (2024).
Ypma, R. J., van Ballegooijen, W. M. & Wallinga, J. Relating phylogenetic trees to transmission trees of infectious disease outbreaks. Genetics 195, 1055–1062 (2013).
Article PubMed PubMed Central Google Scholar
Mollentze, N. et al. A bayesian approach for inferring the dynamics of partially observed endemic infectious diseases from space-time-genetic data. Proc. R. Soc. Lond. B Biol. Sci. 281, 20133251 (2014).
Google Scholar
Morelli, M. J. et al. A bayesian inference framework to reconstruct transmission trees using epidemiological and genetic data. PLoS Comput. Biol. 8, e1002768 (2012).
Article MathSciNet CAS PubMed PubMed Central Google Scholar
Cottam, E. M. et al. Integrating genetic and epidemiological data to determine transmission pathways of foot-and-mouth disease virus. Proc. R. Soc. Lond. B Biol. Scie. 275, 887–895 (2008).
Google Scholar
Campbell, F., Cori, A., Ferguson, N. & Jombart, T. Bayesian inference of transmission chains using timing of symptoms, pathogen genomes and contact data. PLoS Comput. Biol. 15, e1006930 (2019).
Article ADS PubMed PubMed Central Google Scholar
Hall, M., Woolhouse, M. & Rambaut, A. Epidemic reconstruction in a phylogenetics framework: transmission trees as partitions of the node set. PLoS Comput. Biol. 11, e1004613 (2015).
Article ADS PubMed PubMed Central Google Scholar
Senghore, M. et al. Inferring bacterial transmission dynamics using deep sequencing genomic surveillance data. Nat. Commun. 14, 6397 (2023).
Article ADS CAS PubMed PubMed Central Google Scholar
Campo, D. S. et al. Accurate genetic detection of hepatitis c virus transmissions in outbreak settings. J. Infect. Dis. 213, 957–965 (2016).
Article PubMed Google Scholar
Wertheim, J. O. et al. Social and genetic networks of hiv-1 transmission in new york city. PLoS Pathog. 13, e1006000 (2017).
Article PubMed PubMed Central Google Scholar
Ratmann, O. et al. Inferring hiv-1 transmission networks and sources of epidemic spread in africa with deep-sequence phylogenetic analysis. Nat. Commun. 10, 1–13 (2019).
Article CAS Google Scholar
Zhang, Y. et al. Evaluation of phylogenetic methods for inferring the direction of human immunodeficiency virus (hiv) transmission: Hiv prevention trials network (hptn) 052. Clin. Infect. Dis. 72, https://doi.org/10.1093/cid/ciz1247 (2020).
Ramachandran, S. et al. A large hcv transmission network enabled a fast-growing hiv outbreak in rural indiana, 2015. EBioMedicine 37, 374–381 (2018).
Article PubMed PubMed Central Google Scholar
Campbell, E. M. et al. Detailed transmission network analysis of a large opiate-driven outbreak of hiv infection in the united states. J. Infect. Dis. 216, 1053–1062 (2017).
Article PubMed PubMed Central Google Scholar
Jones, M. G. et al. Inference of single-cell phylogenies from lineage tracing data using cassiopeia. Genome Biol. 21, 1–27 (2020).
Article CAS Google Scholar
Kumar, S. et al. Pathfinder: Bayesian inference of clone migration histories in cancer. Bioinformatics 36, i675–i683 (2020).
Article CAS PubMed PubMed Central Google Scholar
Roddur, M. S., Snir, S. & El-Kebir, M. Inferring temporally consistent migration histories. In 23rd International Workshop on Algorithms in Bioinformatics (WABI 2023), 9.1–9.22 (Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2023).
Hessey, S., Fessas, P., Zaccaria, S., Jamal-Hanjani, M. & Swanton, C. Insights into the metastatic cascade through research autopsies. Trends Cancer 6, 490–502 (2023).
Al Bakir, M. et al. The evolution of non-small cell lung cancer metastases in tracerx. Nature 616, 534–542 (2023).
Article ADS CAS PubMed PubMed Central Google Scholar
Chen, H.-N. et al. Genomic evolution and diverse models of systemic metastases in colorectal cancer. Gut 71, 322–332 (2022).
Article CAS PubMed Google Scholar
Schwartz, R. & Schäffer, A. A. The evolution of tumour phylogenetics: principles and practice. Nat. Rev. Genet. 18, 213–229 (2017).
Article CAS PubMed PubMed Central Google Scholar
Romero-Severson, E. O., Bulla, I. & Leitner, T. Phylogenetically resolving epidemiologic linkage. Proc. Natl. Acad. Sci. USA 113, 2690–2695 (2016).
Article ADS CAS PubMed PubMed Central Google Scholar
Leitner, T. & Fitch, W. The Evolution of HIV. (Johns Hopkins University Press, Baltimore, 1999).
Fischer, G. E. et al. Hepatitis c virus infections from unsafe injection practices at an endoscopy clinic in las vegas, nevada, 2007–2008. Clin. Infect. Dis. 51, 267–273 (2010).
Article PubMed Google Scholar
Erdős, P. & Rényi, A. et al. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci 5, 17–60 (1960).
MathSciNet Google Scholar
Barabási, A.-L. & Albert, R. Emergence of scaling in random networks. Science 286, 509–512 (1999).
Article ADS MathSciNet PubMed Google Scholar
Watts, D. J. & Strogatz, S. H. Collective dynamics of ‘small-world’networks. Nature 393, 440 (1998).
Article ADS CAS PubMed Google Scholar
Colijn, C. & Gardy, J. Phylogenetic tree shapes resolve disease transmission patterns. Evol. Med. Public Health 2014, 96–108 (2014).
Article PubMed PubMed Central Google Scholar
Leventhal, G. E. et al. Inferring epidemic contact structure from phylogenetic trees. PLoS Comput. Biol. 8, e1002413 (2012).
Article MathSciNet CAS PubMed PubMed Central Google Scholar
Carnegie, N. B. Effects of contact network structure on epidemic transmission trees: implications for data required to estimate network structure. Stat. Med. 37, 236–248 (2018).
Article MathSciNet PubMed Google Scholar
Grenfell, B. T. et al. Unifying the epidemiological and evolutionary dynamics of pathogens. Science 303, 327–332 (2004).
Article ADS CAS PubMed Google Scholar
Volz, E. M., Koelle, K. & Bedford, T. Viral phylodynamics. PLoS Comput. Biol. 9, e1002947 (2013).
Article ADS MathSciNet CAS PubMed PubMed Central Google Scholar
Noble, R. et al. Spatial structure governs the mode of tumour evolution. Nat. Ecol. Evol. 6, 207–217 (2022).
Article PubMed Google Scholar
Lewinsohn, M. A., Bedford, T., Müller, N. F. & Feder, A. F. State-dependent evolutionary models reveal modes of solid tumour growth. Nat. Ecol.Evol. 7, 581–596 (2023).
Article PubMed PubMed Central Google Scholar
Robinson, K., Fyson, N., Cohen, T., Fraser, C. & Colijn, C. How the dynamics and structure of sexual contact networks shape pathogen phylogenies. PLoS Comput. Biol. 9, e1003105 (2013).
Article ADS CAS PubMed PubMed Central Google Scholar
Villandre, L. et al. Assessment of overlap of phylogenetic transmission clusters and communities in simple sexual contact networks: applications to hiv-1. PloS ONE 11, e0148459 (2016).
Article PubMed PubMed Central Google Scholar
Welch, D. Is network clustering detectable in transmission trees? Viruses 3, 659–676 (2011).
Article PubMed PubMed Central Google Scholar
Giardina, F., Romero-Severson, E. O., Albert, J., Britton, T. & Leitner, T. Inference of transmission network structure from hiv phylogenetic trees. PLoS Comput. Biol. 13, e1005316 (2017).
Article ADS PubMed PubMed Central Google Scholar
McCloskey, R. M., Liang, R. H. & Poon, A. F. Reconstructing contact network parameters from viral phylogenies. Virus Evol. 2, vew029 (2016).
Article PubMed PubMed Central Google Scholar
Sankoff, D. Minimal mutation trees of sequences. SIAM J. Appl. Math. 28, 35–42 (1975).
Article MathSciNet Google Scholar
Fitch, W. Towards defining the course of evolution: minimum change for a specified tree topology. Syst. Biol. 20, 406–416 (1971).
Article Google Scholar
Gusfield, D. Integer Linear Programming in Computational and Systems Biology: An Entry-Level Text and Course (Cambridge University Press, 2019).
Switzer, W. M. et al. High hiv diversity, recombination, and superinfection revealed in a large outbreak among persons who inject drugs in kentucky and ohio, usa. Virus Evol. 10, veae015 (2024).
Article PubMed PubMed Central Google Scholar
Kenah, E., Britton, T., Halloran, M. E. & Longini Jr, I. M. Molecular infectious disease epidemiology: survival analysis and algorithms linking phylogenies to transmission trees. PLoS Comput. Biol. 12, e1004869 (2016).
Article ADS PubMed PubMed Central Google Scholar
Hell, P. & Nešetril, J.Graphs and Homomorphisms (Oxford University Press, 2004).
Lovász, L. Graph minor theory. Bull. Am. Math. Soc. 43, 75–86 (2006).
Article MathSciNet Google Scholar
Robertson, N. & Seymour, P. D. Graph minors. xx. wagner’s conjecture. J. Comb.Theory Ser. B 92, 325–357 (2004).
Article MathSciNet Google Scholar
Lemey, P., Rambaut, A., Drummond, A. J. & Suchard, M. A. Bayesian phylogeography finds its roots. PLoS Comput. Biol. 5, e1000520 (2009).
Article ADS MathSciNet PubMed PubMed Central Google Scholar
Moshiri, N., Ragonnet-Cronin, M., Wertheim, J. O. & Mirarab, S. Favites: simultaneous simulation of transmission networks, phylogenetic trees and sequences. Bioinformatics 35, 1852–1861 (2019).
Article CAS PubMed Google Scholar
Wertheim, J. O. et al. The global transmission network of hiv-1. J. Infect. Dis. 209, 304–313 (2014).
Article PubMed Google Scholar
Brown, A. J. L. et al. Transmission network parameters estimated from hiv sequences for a nationwide epidemic. J. Infect. Dis. 204, 1463–1469 (2011).
Article Google Scholar
Castillo, S. P., Rebolledo, R. A., Arim, M., Hochberg, M. E. & Marquet, P. A. Metastatic cells exploit their stoichiometric niche in the network of cancer ecosystems. Sci. Adv. 9, eadi7902 (2023).
Article CAS PubMed PubMed Central Google Scholar
Newton, P. K. et al. Spreaders and sponges define metastasis in lung cancer: a markov chain monte carlo mathematical model. Cancer Res. 73, 2760–2769 (2013).
Article CAS PubMed PubMed Central Google Scholar
Posada, D. Cellcoal: coalescent simulation of single-cell sequencing samples. Mol. Biol. Evol. 37, 1535–1542 (2020).
Article MathSciNet CAS PubMed PubMed Central Google Scholar
Leventhal, G. E., Günthard, H. F., Bonhoeffer, S. & Stadler, T. Using an epidemiological model for phylogenetic inference reveals density dependence in hiv transmission. Mol. Biol. Evol. 31, 6–17 (2014).
Article CAS PubMed Google Scholar
Caminiti, S., Finocchi, I. & Petreschi, R. On coding labeled trees. Theor. Comput. Sci. 382, 97–108 (2007).
Article MathSciNet Google Scholar
Wymant, C. et al. Easy and accurate reconstruction of whole hiv genomes from short-read sequence data with shiver. Virus Evol. 4, vey007 (2018).
Article PubMed PubMed Central Google Scholar
Newman, M. Networks. (Oxford University Press, New York, NY, 2018).
Book Google Scholar
McPherson, A. et al. Divergent modes of clonal spread and intraperitoneal mixing in high-grade serous ovarian cancer. Nat. Genet. 48, 758–767 (2016).
Article CAS PubMed Google Scholar
Kaipio, J. & Somersalo, E. Statistical and Computational Inverse Problems 160 (Springer Science & Business Media, Berlin, Germany, 2006).
Google Scholar
Jahn, K., Kuipers, J. & Beerenwinkel, N. Tree inference for single-cell data. Genome Biol. 17, 86 (2016).
Article PubMed PubMed Central Google Scholar
Alroy-Preis, S. et al. Large outbreak of hepatitis c virus associated with drug diversion by a healthcare technician. Clin. Infect. Dis. 67, 845–853 (2018).
Article PubMed Google Scholar
Wertheim, J. O. et al. Natural selection favoring more transmissible hiv detected in united states molecular transmission network. Nat. Commun. 10, 5788 (2019).
Article ADS CAS PubMed PubMed Central Google Scholar
Spielman, D. A. & Srivastava, N. Graph sparsification by effective resistances. SIAM J. Comput. 40, 1913–1926 (2011).
Article MathSciNet Google Scholar
Novitsky, V., Moyo, S., Wang, R., Gaseitsiwe, S. & Essex, M. Deciphering multiplicity of hiv-1c infection: transmission of closely related multiple viral lineages. PloS ONE 11, e0166746 (2016).
Article PubMed PubMed Central Google Scholar
Van der Kuyl, A. C. & Cornelissen, M. Identifying hiv-1 dual infections. Retrovirology 4, 1–12 (2007).
Google Scholar
Pacold, M. et al. Comparison of methods to detect hiv dual infection. AIDS Res. Hum. Retroviruses 26, 1291–1298 (2010).
Article CAS PubMed PubMed Central Google Scholar
Pipek, O. A. et al. Systematic detection of co-infection and intra-host recombination in more than 2 million global sars-cov-2 samples. Nat. Commun. 15, 517 (2024).
Article ADS CAS PubMed PubMed Central Google Scholar
Brandstädt, A., Le, V. B. & Spinrad, J. P. Graph Classes: A Survey. (SIAM, Philadelphia, PA, 1999).
Book Google Scholar
Scott, J., Kuhn, P. & Anderson, A. R. Unifying metastasis-integrating intravasation, circulation and end-organ colonization. Nat. Rev. Cancer 12, 445–446 (2012).
Article CAS PubMed PubMed Central Google Scholar
Gerlee, P. & Johansson, M. Inferring rates of metastatic dissemination using stochastic network models. PLoS Comput. Biol. 15, e1006868 (2019).
Article ADS CAS PubMed PubMed Central Google Scholar
Liljeros, F., Edling, C. R., Amaral, L. A. N., Stanley, H. E. & Åberg, Y. The web of human sexual contacts. Nature 411, 907–908 (2001).
Article ADS CAS PubMed Google Scholar
Hughes, G. J. et al. Molecular phylodynamics of the heterosexual hiv epidemic in the United Kingdom. PLoS Pathog. 5, e1000590 (2009).
Article PubMed PubMed Central Google Scholar
Romano, C. M. et al. Social networks shape the transmission dynamics of hepatitis c virus. PLoS ONE 5, e11170 (2010).
Article ADS PubMed PubMed Central Google Scholar
Sashittal, P., Schmidt, H., Chan, M. & Raphael, B. J. Startle: a star homoplasy approach for crispr-cas9 lineage tracing. Cell Syst. 14, 1113–1121 (2023).
Article CAS PubMed PubMed Central Google Scholar
Meacham, C. A. & Duncan, T. The necessity of convex groups in biological classification. Syst. Bot. 78–90 (1987).
Moran, S. & Snir, S. Convex recolorings of strings and trees: Definitions, hardness results and algorithms. J.Comput. Syst. Sci. 74, 850–869 (2008).
Article MathSciNet Google Scholar
Diestel, R. Graph Theory (2012).
Kammer, F. & Tholey, T. The complexity of minimum convex coloring. Discrete Appl. Math. 160, 810–833 (2012).
Article MathSciNet Google Scholar
Garey, M. R., Graham, R. L., Johnson, D. S. & Knuth, D. E. Complexity results for bandwidth minimization. SIAM J. Appl. Math. 34, 477–495 (1978).
Article MathSciNet Google Scholar
Dubey, C., Feige, U. & Unger, W. Hardness results for approximating the bandwidth. J. Comput. Syst. Sci. 77, 62–90 (2011).
Article MathSciNet Google Scholar
Puller, V., Neher, R. & Albert, J. Estimating time of hiv-1 infection from next-generation sequence diversity. PLOS Comput. Biol. 13, e1005775 (2017).
Article ADS PubMed PubMed Central Google Scholar
Russell, M. L. et al. Using viral sequence diversity to estimate time of hiv infection in infants. PLoS Pathog. 19, e1011861 (2023).
Article CAS PubMed PubMed Central Google Scholar
Baykal, P. B. I., Lara, J., Khudyakov, Y., Zelikovsky, A. & Skums, P. Quantitative differences between intra-host hcv populations from persons with recently established and persistent infections. Virus Evol. 6, veaa103 (2021).
Article Google Scholar
Kilpeläinen, P. & Mannila, H. Ordered and unordered tree inclusion. SIAM J. Comput. 24, 340–356 (1995).
Article MathSciNet Google Scholar
Matoušek, J. & Thomas, R. On the complexity of finding iso-and other morphisms for partial k-trees. Discrete Math. 108, 343–364 (1992).
Article MathSciNet Google Scholar
Bron, C. & Kerbosch, J. Algorithm 457: finding all cliques of an undirected graph. Commun. ACM 16, 575–577 (1973).
Article Google Scholar
Lovász, L. & Plummer, M. D. Matching Theory 367 (American Mathematical Soc., Providence, RI, 2009).
Google Scholar

Download references

Acknowledgements

KK has been supported by a Georgia State University Molecular Basis of Disease fellowship. P.S. has been supported by the NSF grants 2415564 and 2415562. The authors thank the organizers of the Computational Genomics Summer Institute (UCLA, July 12 – August 4, 2023 and July 10 - August 2, 2024), where several ideas of this paper were conceived.

Author information

Authors and Affiliations

Department of Computer Science, Georgia State University, Atlanta, GA, USA
Kiril Kuzmin
Department of Computer Science, Princeton University, Princeton, NJ, USA
Henri Schmidt & Benjamin J. Raphael
School of Computing, University of Connecticut, Storrs, CT, USA
Maryam Kafi Kang & Pavel Skums
Department of Evolutionary and Environmental Biology, University of Haifa, Haifa, Israel
Sagi Snir

Authors

Kiril Kuzmin
View author publications
Search author on:PubMed Google Scholar
Henri Schmidt
View author publications
Search author on:PubMed Google Scholar
Maryam Kafi Kang
View author publications
Search author on:PubMed Google Scholar
Sagi Snir
View author publications
Search author on:PubMed Google Scholar
Benjamin J. Raphael
View author publications
Search author on:PubMed Google Scholar
Pavel Skums
View author publications
Search author on:PubMed Google Scholar

Contributions

P.S., S.S., and B.R. conceived the study. P.S. developed the theoretical foundations of the methodology with input from H.S., S.S., and B.R. K.K. and P.S. designed and implemented the algorithms. K.K., M.K.K. and P.S. conducted simulation experiments, analyzed results, and benchmarked the algorithms. K.K. and P.S. analyzed the experimental data provided and pre-processed by H.S. and B.R. P.S. supervised the study and wrote the manuscript with contributions from all co-authors.

Corresponding author

Correspondence to Pavel Skums.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Communications thanks Yangjing Long, and the other, anonymous, reviewer(s) for their contribution to the peer review of this work. A peer review file is available.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Reporting Summary

Transparent Peer Review file

Source data

Source Data

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

Reprints and permissions

About this article

Cite this article

Kuzmin, K., Schmidt, H., Kafi Kang, M. et al. A graph homomorphism approach for unraveling histories of metastatic cancers and viral outbreaks under evolutionary constraints. Nat Commun 16, 8027 (2025). https://doi.org/10.1038/s41467-025-63411-4

Download citation

Received: 05 September 2024
Accepted: 19 August 2025
Published: 28 August 2025
DOI: https://doi.org/10.1038/s41467-025-63411-4

Subjects

Abstract

Similar content being viewed by others

Introduction

Results

Simulated data generation

Simulated data results

Experimental viral data

Experimental cancer data

Discussion

Methods

Basic definitions

Migration inference under structural constraints

Labeled migration inference

Unconstrained unlabeled migration inference

Theorem 1

Theorem 2

Migration inference under convexity constraints

Lemma 1

Migration inference with convexity and compactness constraints

Lemma 2

SMiTH: Sampling MIgrations and Transmissions with Homomorphisms

Reporting summary

Data availability

Code availability

References

Acknowledgements

Author information

Authors and Affiliations

Contributions

Corresponding author

Ethics declarations

Competing interests

Peer review

Peer review information

Additional information

Supplementary information

Source data

Rights and permissions

About this article

Cite this article

Share this article

Search

Quick links