Main

Ever since its earliest descriptions in the 1940s1,2, autism has been thought of as a condition that emerges in early childhood. However, a greater proportion of autistic individuals are now receiving an autism diagnosis from mid-childhood onwards than in early childhood3,4,5. One factor that may explain these findings is a shift in the conceptualization of the condition over time, including the recognition that the behavioural signs of autism may not manifest clearly in the first three years of life6,7,8,9. Supporting this, some studies have demonstrated that a subset of children who do not initially meet the criteria for an autism diagnosis receive a diagnosis later7,10,11,12,13,14. Later autism diagnosis is associated with elevated co-occurring mental-health conditions15,16, highlighting the need to understand why some autistic people are not diagnosed until later in life.

Several social, demographic and clinical factors have been linked to age at autism diagnosis17. However, previous studies have shown that individual clinical and sociodemographic factors explain only a small proportion (typically less than 15%) of the variance in age at autism diagnosis (Extended Data Fig. 1 and Supplementary Table 1). This indicates that other factors contribute to age at autism diagnosis. One of these additional factors could be genetic differences between autistic individuals. Despite the relatively high heritability of autism18, the role of genetics in age at autism diagnosis has not to our knowledge been previously studied.

Two theoretical models can explain how genetics affect age at autism diagnosis. In the first model, autism has a single polygenic aetiology, with the same set of genetic variants underlying autism, regardless of age at diagnosis (the ‘unitary model’; Extended Data Fig. 2). In this model, later-diagnosed autism may have subtle clinical features that are harder to recognize early in life, so individuals do not cross a diagnostic threshold earlier in life, perhaps because they have a lower genetic predisposition to autism. As these people get older, environmental factors may alter their clinical features, eventually bringing individuals above the clinical threshold to receive an autism diagnosis later in life.

An alternative model is that earlier- and later-diagnosed autism have different underlying developmental trajectories and polygenic aetiologies (the ‘developmental model’; Extended Data Fig. 2). This model aligns with existing evidence that the genetic influences on traits related to autism vary across development19,20,21. This model does not preclude a role for environmental factors influencing when someone receives an autism diagnosis but implies that different sets of genetic variants are associated with earlier- and later-diagnosed autism.

Here we examined the evidence for these two models through four linked aims (Extended Data Fig. 3). First, we investigated whether the trajectories of socioemotional and behavioural development are associated with age at autism diagnosis in birth cohorts. Although variable developmental trajectories have been observed among autistic individuals and their younger siblings22, it is unclear whether these differences in trajectories are associated with age at diagnosis. Second, we estimated the proportion of variance in age at autism diagnosis that is explained by common single nucleotide polymorphisms (SNP-based heritability). We then tested whether this is attenuated by a range of clinical and demographic factors, as predicted by the unitary model. Third, we investigated whether different polygenic factors are associated with earlier and later autism diagnosis, as predicted by the developmental model. Finally, we estimated the genetic correlation between the autism polygenic factors related to age at diagnosis and other mental-health and developmental phenotypes.

We provide a summary of the study and address potential questions regarding the implications of the findings in the Supplementary Summary and FAQs.

Behavioural trajectories and diagnosis age

In Aim 1, we investigated whether autistic individuals have varying socioemotional and behavioural trajectories, and whether these are associated with age at autism diagnosis in three birth cohorts (n = 89 to 188 autistic individuals with recorded age at diagnosis between 5 years and 17 years). These are the Millennium Cohort Study (MCS, participants born in 2000) and the Longitudinal Study of Australian Children: Kindergarten cohort (LSAC-K, 1999) and Birth cohort (LSAC-B, 2003) (Extended Data Fig. 4, Supplementary Table 2 and Supplementary Note 1). All three cohorts collected longitudinal information on socioemotional and behavioural development using the carer-reported Strengths and Difficulties Questionnaire (SDQ)23. The SDQ has five subscales (emotional, conduct, hyperactivity/inattention, peer problems and prosocial behaviours), and the total score of difficulties (hereafter ‘total difficulties’) is the summed score of the first four subscales. The SDQ is widely used, has excellent psychometric properties24,25,26 and is largely invariant across age, sex and different populations27,28,29, indicating that it is measuring the same latent trait across these demographic variables. Furthermore, the SDQ is moderately correlated with autism-specific measures30,31,32,33, although it does not capture all of the core diagnostic features of autism. Because not all cohorts recorded the exact age when children received their autism diagnosis, we used the child’s age during the study data collection when carers first reported the diagnosis as an approximation of age at autism diagnosis.
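The composition of the total difficulties score described above can be illustrated with a minimal sketch. The dictionary keys and the function name are hypothetical, and the 0–10 range per subscale is assumed from standard SDQ scoring rather than taken from this study's data files.

```python
# Hypothetical helper: key names and data layout are illustrative only.
DIFFICULTY_SUBSCALES = ("emotional", "conduct", "hyperactivity", "peer_problems")


def sdq_total_difficulties(subscales):
    """Sum the four difficulty subscales; the prosocial subscale is excluded."""
    missing = [name for name in DIFFICULTY_SUBSCALES if name not in subscales]
    if missing:
        raise ValueError(f"missing subscale scores: {missing}")
    for name in DIFFICULTY_SUBSCALES:
        # Assumption: each subscale is scored from 0 to 10.
        if not 0 <= subscales[name] <= 10:
            raise ValueError(f"{name} score out of range: {subscales[name]}")
    return sum(subscales[name] for name in DIFFICULTY_SUBSCALES)


example = {
    "emotional": 3,
    "conduct": 2,
    "hyperactivity": 7,
    "peer_problems": 4,
    "prosocial": 8,  # recorded, but not part of total difficulties
}
print(sdq_total_difficulties(example))  # 16
```

Because the prosocial subscale is scored in the opposite direction (higher is better), it is kept separate rather than added to the total.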

To identify latent trajectories, we used growth mixture models of the SDQ total difficulties and subscale scores among autistic individuals in all three cohorts. Growth mixture models do not require grouping based on an a priori hypothesis but can identify latent subgroups based on longitudinal differences in SDQ scores.

Across all three birth cohorts, growth mixture modelling identified a two-trajectory model as being optimal for SDQ total difficulties and most subscale scores (Supplementary Table 3 and Supplementary Figs. 1–6). The first latent trajectory was characterized by difficulties in early childhood that remained stable or modestly attenuated in adolescence (termed ‘early childhood emergent latent trajectory’). The second latent trajectory was characterized by fewer difficulties in early childhood that increased in late childhood and adolescence (termed ‘late childhood emergent latent trajectory’) (Fig. 1a–c).

Fig. 1: Trajectory analyses in three of the four birth cohorts.

a–c, Longitudinal growth mixture models of SDQ total scores in autistic individuals, demonstrating the presence of two groups in the MCS (a), LSAC-B (b) and LSAC-K (c) cohorts. Shaded areas indicate 95% confidence intervals of the line of best fit. d–f, Stacked bar charts show the proportion of individuals who had been diagnosed as autistic at specific ages, categorized by membership in the latent trajectories identified from the growth mixture models in MCS (d), LSAC-B (e), and LSAC-K (f) cohorts. Darker colours indicate male individuals and lighter colours indicate female individuals. P-values are from χ2 tests (two-sided) comparing the distribution of age at autism diagnosis between the two latent trajectories (pooling the two sexes).

Autistic individuals in the early childhood emergent latent trajectory were more likely to be diagnosed as autistic in childhood than autistic individuals in the late childhood emergent latent trajectory in MCS (P = 1.42 × 10−4, χ2 test) and LSAC-B (P = 2.24 × 10−2, χ2 test) (Fig. 1d–f and Supplementary Table 4). This difference was not significant in LSAC-K, possibly because age 11 was the earliest time an autism diagnosis was recorded.
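As an illustration of the χ2 test used for this comparison, the following sketch computes the test statistic and degrees of freedom for a table of trajectory membership by age band at diagnosis. The counts are invented for illustration and are not the cohort data; the function name is likewise hypothetical.

```python
def chi_square_independence(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table.

    Assumes every row and column total is non-zero, so that all expected
    counts are positive.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Rows: early vs late childhood emergent trajectory (hypothetical counts).
# Columns: three age bands at first reported diagnosis (hypothetical bands).
hypothetical_counts = [
    [30, 15, 5],
    [10, 15, 25],
]
stat, dof = chi_square_independence(hypothetical_counts)
print(f"chi-square = {stat:.2f}, d.f. = {dof}")
```

Converting the statistic to a P-value requires the χ2 survival function, which is available in statistical libraries (for example `scipy.stats.chi2.sf`).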

Sensitivity analyses in MCS confirmed the robustness of the two latent trajectories and their association with age at diagnosis among autistic children. We identified consistent results after expanding the sample to include individuals with co-occurring ADHD (n = 238; Supplementary Tables 3 and 4), and after imputing missing data to increase the statistical power and reduce bias (n = 623) (Supplementary Table 5 and Supplementary Notes 2 and 3). We also obtained consistent results when restricting the analyses to only male individuals (n = 136; Supplementary Tables 3 and 4), indicating that these results were not driven by sex differences in age of diagnosis. We were unable to run equivalent female-only analyses owing to the low sample sizes.

To assess the specificity of this result to autism, we tested whether similar latent trajectories were also observed in children with ADHD but not autism (n = 89, imputed n = 325) in MCS using growth mixture models. Two latent SDQ trajectory classes emerged, but these were not significantly linked to age of ADHD diagnosis, except for the SDQ hyperactivity/inattention and conduct problems subscales in the imputed sample (Supplementary Figs. 7 and 8 and Supplementary Table 6), indicating that the findings are relatively specific to autism, rather than to neurodevelopmental conditions more broadly.

Although female individuals receive an autism diagnosis later than male individuals on average34, in all three cohorts, the sex ratio was similar between the two latent trajectories (Supplementary Table 4), possibly because of their relatively small sample sizes. Individuals in the late childhood emergent latent trajectories were more likely to report mental-health conditions (Supplementary Table 7), consistent with previous epidemiological observations among later-diagnosed autistic individuals15,16.

Using multiple regression models, we then examined the extent to which these two latent trajectories contributed to differences in age at autism diagnosis over and above sociodemographic and cognitive characteristics (Supplementary Table 8). In these models, SDQ latent trajectories explained 11.7% (LSAC-B) to 30.3% (MCS) of the variance in age of autism diagnosis. By contrast, sociodemographic variables explained 4.8% (LSAC-B) to 5.5% (MCS) of the total variance across cohorts, consistent with previous reports (Extended Data Fig. 1). In the imputed MCS sample (n = 623; Supplementary Table 5 and Supplementary Note 2), the SDQ latent trajectories and sociodemographic variables explained 56.6% and 3.2% of the variance, respectively. The effects of the sociodemographic variables were not mediated by the SDQ latent trajectories (Supplementary Table 8 and Supplementary Note 4).
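The notion of variance explained by trajectory membership can be sketched in its simplest, unadjusted form. For a single categorical predictor, the R² of a linear regression equals the between-group sum of squares divided by the total sum of squares. The models reported above additionally adjust for sociodemographic and cognitive covariates, which this sketch does not do, and the ages below are invented for illustration.

```python
from statistics import mean


def variance_explained_by_groups(groups):
    """Proportion of variance in the outcome explained by group membership.

    `groups` maps a group label to a list of outcome values. Assumes the
    outcome is not constant across all individuals (non-zero total variance).
    """
    all_values = [v for values in groups.values() for v in values]
    grand_mean = mean(all_values)
    ss_total = sum((v - grand_mean) ** 2 for v in all_values)
    ss_between = sum(
        len(values) * (mean(values) - grand_mean) ** 2
        for values in groups.values()
    )
    return ss_between / ss_total


# Hypothetical ages at diagnosis (years) in two latent trajectories.
hypothetical_ages = {
    "early_childhood_emergent": [4, 5, 6],
    "late_childhood_emergent": [10, 11, 12],
}
r_squared = variance_explained_by_groups(hypothetical_ages)
print(f"variance explained = {r_squared:.1%}")
```

The toy data are deliberately well separated, so the proportion is far higher than the estimates reported for the cohorts.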

The associations between different SDQ trajectories and age at autism diagnosis were also supported by latent growth curve models fitted on earlier- and later-diagnosed autistic individuals (Supplementary Note 5, Supplementary Table 9 and Supplementary Figs. 1–6, 9 and 10). These results confirm that the association between SDQ trajectories and age at autism diagnosis is robust to methodological choices.

Age at autism diagnosis is heritable

The above analyses demonstrate that variation in socioemotional and behavioural trajectories, measured using the SDQ, is associated with age at autism diagnosis. Previous research has demonstrated that developmental variation in traits related to autism is partly explained by genetic factors19,20,35,36,37,38. A corollary of this is that genetic factors may also be associated with age at autism diagnosis.

Subsequently, in Aim 2, we tested whether age at autism diagnosis is heritable in two large cohorts of autistic individuals using genetic data and information on age at autism diagnosis. This includes: first, the Danish-based iPSYCH cohort (ntotal = 18,965), a population-based sample derived from the Danish national registries that includes autistic individuals; and second, the US-based cohort of autistic individuals (SPARK39; ntotal = 28,165; Extended Data Figs. 5 and 6), which recruits families with at least one autistic individual through online platforms and medical centres across the United States. In SPARK, we conducted initial analyses in a discovery subset of 18,809 autistic individuals (SPARK Discovery), and replicated key findings in a second sample of 9,356 autistic individuals that became available only after the initial analyses were completed (SPARK Replication). SPARK and iPSYCH differed in the diagnostic classification system (iPSYCH: International Classification of Diseases (ICD) and SPARK: Diagnostic and Statistical Manual of Mental Disorders (DSM)) and median age at diagnosis (iPSYCH, median = 10 years, median absolute deviation = 4 years; SPARK, median = 4 years, median absolute deviation = 2.7 years).

We conducted a genome-wide association study (GWAS) in iPSYCH and across both the Discovery and Replication samples of SPARK, using age at autism diagnosis as a quantitative trait. In all three samples, we identified significant and consistent SNP-based heritability of approximately 11% for age at autism diagnosis (Fig. 2a and Supplementary Table 10). This is larger than, or similar to, the variance explained by several other clinical and sociodemographic factors tested in SPARK (Extended Data Fig. 1 and Supplementary Table 1).

Fig. 2: Heritability of age at autism diagnosis.

a, SNP-based heritability (h2) for age at autism diagnosis in the SPARK cohorts, calculated using single-component genome-wide complex trait analysis with a genomic-relatedness-based restricted maximum-likelihood approach (GCTA-GREML) for the SPARK Discovery cohort (orange dashed line, n = 16,786), SPARK Replication cohort (purple dashed line, n = 8,558) and a meta-analysis of the two (light blue solid line, n = 25,344), and iPSYCH, calculated using linkage disequilibrium score regression coefficient (LDSC) (solid green line, n = 18,965). b, SNP-based heritability (GCTA-GREML) in the SPARK cohorts after accounting for various clinical and sociodemographic factors. A ‘+’ indicates the baseline model in addition to the specified covariates. The x axis has been truncated at 0 and 0.25. In a and b, central points represent SNP-based heritability estimates and error bars indicate 95% confidence intervals. Sample sizes for b are provided in Supplementary Table 10. PC, genetic principal component; RBS-R, Repetitive Behavior Scale-Revised; SCQ, Social Communication Questionnaire; SES, socioeconomic status.

In contrast to the effect of common genetic variants, in a subsample of SPARK with available data for both parents and their autistic child (n = 6,206 trios), we observed no association between age at autism diagnosis and rare de novo variants or inherited protein-truncating or missense variants in highly constrained genes (Supplementary Table 11). This may be the result of low statistical power, or may reflect later autism diagnosis in some carriers of de novo mutations owing to diagnostic overshadowing by co-occurring intellectual disability or global developmental delay40,41.

We next tested whether this SNP-based heritability of age at autism diagnosis is consistent with either of the two theoretical models outlined earlier. The unitary model assumes that later diagnosis reflects subtle or less-severe clinical features, so the SNP-based heritability of age at autism diagnosis may simply reflect the severity of autism features. Alternatively, the SNP-based heritability may reflect additional genetic influences associated with co-occurring developmental delays, developmental regression or intellectual disability, which may lead to an earlier diagnosis. It may also reflect the heritable component of parental socioeconomic status and neighbourhood deprivation, which are proxies for parental awareness and healthcare access that affect diagnostic timing. Controlling for any of these measures should attenuate the SNP-based heritability.

By contrast, the developmental model assumes that SNP-based heritability of age at autism diagnosis reflects a mixture of different polygenic factors that are correlated with age at diagnosis but are independent of these covariates. Under this model, a significant SNP-based heritability should persist after controlling for clinical, developmental and sociodemographic measures.

In line with the developmental model, we found that the SNP-based heritability did not significantly attenuate after controlling for parental sociodemographic measures, clinical features or co-occurring developmental delays and conditions (Fig. 2b and Supplementary Table 10). This is inconsistent with the unitary model, although imperfect and incomplete measurement of clinical and developmental phenotypes may limit this conclusion.

A second prediction of the unitary model is that earlier diagnosis is associated with a greater polygenic propensity for autism compared with later diagnosis (Extended Data Fig. 2). In this model, all autistic individuals would have a higher polygenic propensity for autism compared with non-autistic controls. This would result in negative genetic correlations between GWAS of age at autism diagnosis and GWAS of autism, with the magnitude of this negative correlation decreasing as the median age at diagnosis in the autism GWAS samples increases.

We tested this prediction using 13 different but partly overlapping autism GWASs, including six GWASs stratified by age at diagnosis and two GWASs stratified by sex (Box 1 and Fig. 3). The genetic correlation between age at autism diagnosis and different autism GWASs varied systematically, becoming increasingly positive as the median age at diagnosis increased (Fig. 3 and Supplementary Table 12). However, contrary to the expectation under the unitary model, we observed positive genetic correlations between age at autism diagnosis and autism GWAS comprising later-diagnosed autistic individuals. These findings support the existence of different genetic architectures across diagnostic age groups, aligning with the developmental model.

Fig. 3: Median age at autism diagnosis and genetic correlations with age at autism diagnosis across different GWAS cohorts.

Left, median age at diagnosis (years) with error bars representing median absolute deviation. Circle size indicates the number of autistic individuals (cases) in the GWAS, and exact sample sizes are provided in Box 1. Beige circles represent GWAS unstratified by age at diagnosis; red circles represent GWAS stratified by age at diagnosis. Lighter (more transparent) circles indicate studies with no information about age at autism diagnosis, and the median ages have been inferred from other available information (PGC-2017 (ref. 45) and Grove et al.46). Right, genetic correlations with age at autism diagnosis for both SPARK (blue, meta-analysis, n = 28,165) and iPSYCH (green, n = 18,965) datasets, with error bars representing 95% confidence intervals.

Furthermore, the male- and female-stratified autism GWAS from iPSYCH had a similar genetic correlation with age at autism diagnosis. This indicates that the pattern of genetic correlation between age at diagnosis and the autism GWASs does not reflect differences in the sex ratio of participants across the autism GWASs.

The genetic correlation with some of the autism GWASs differs significantly between the age at diagnosis GWASs in iPSYCH versus meta-analysed SPARK (Fig. 3). This reflects the fact that these two age at diagnosis GWASs are only moderately genetically correlated with each other (genetic correlation (rg) = 0.51, standard error (s.e.) = 0.19, P = 7.56 × 10−3), which may be due to the different recruitment strategies and resulting differences in the median age at autism diagnosis in the two cohorts (Supplementary Note 6).
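The relationship between a genetic-correlation estimate, its standard error and a two-sided P-value can be sketched with a Wald-type z-test. This assumes the P-value is derived from a normal approximation to estimate divided by standard error, which is the usual convention for LDSC output but is an assumption here; the function name is hypothetical.

```python
import math


def two_sided_p_from_estimate(estimate, standard_error):
    """Return (z, two-sided P) under a normal approximation."""
    z = estimate / standard_error
    # For a standard normal Z, P(|Z| > |z|) = erfc(|z| / sqrt(2)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p


# Rounded values quoted in the text for iPSYCH versus meta-analysed SPARK.
z, p = two_sided_p_from_estimate(0.51, 0.19)
print(f"z = {z:.2f}, P = {p:.2e}")
```

Because the inputs are rounded to two decimal places, the result is expected to be of the same order as the reported P-value rather than identical to it.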

Two autism polygenic factors

These findings indicate that the age at autism diagnosis reflects a mixture of different age-dependent polygenic traits (developmental model), rather than a single polygenic trait (unitary model). In the developmental model, one would expect that the genetic correlations between different autism GWASs will differ according to the difference in the median age of diagnosis between the GWASs.

In Aim 3, we tested this by estimating genetic correlations among the 13 autism GWASs (Box 1). We observed genetic correlations ranging from 0.02 (s.e. = 0.13) to 1.00 (s.e. = 0.01) (Fig. 4a and Supplementary Table 13). We observed a gradient in the genetic correlations related to the similarity in median age at diagnosis between cohorts. Cohorts with the most similar median ages at diagnosis (differing by a maximum of 2 years) showed the highest genetic correlations (rg = 1, s.e. = 0.04), and correlations progressively decreased as age differences increased. The lowest genetic correlation (rg = 0.02, s.e. = 0.13) was observed between SPARKbefore6 (median age of around 3) and SPARKafter10 (median age of around 16).

Fig. 4: Two genetic latent factors in autism.

a, Left, genetic correlation heatmaps of all GWASs of autism as described in Box 1. Asterisks indicate significant genetic correlations after Benjamini–Yekutieli adjustment. Right, median age at autism diagnosis for the same GWASs (indicated by the number on top of the circle). Error bars indicate median absolute deviation; the size of the circles indicates the sample size. For both panels, GWASs have been ordered based on hierarchical clustering of the genetic correlations. b, Structural equation model illustrating the two-correlated genetic-factor model for autism, using six minimally overlapping autism GWAS datasets. F1, factor 1; F2, factor 2. One-headed arrows depict the regression relationship pointing from the independent variables to the dependent variables; the numbers on the arrows represent the regression coefficients of the factor loadings, with standard errors provided in parentheses. Covariance between variables is represented by two-headed arrows linking the variables. The numbers on the two-headed arrows can be interpreted as genetic-correlation estimates with the standard error provided in parentheses. Residual variances for each GWAS dataset are represented using a two-headed arrow connecting the residual variable (u) to itself. Standard errors are shown in parentheses.

Hierarchical clustering of the genetic correlations identified two broad, overlapping clusters that differed by age at autism diagnosis. One cluster comprised GWAS of autism in cohorts with predominantly childhood-diagnosed individuals, and the other comprised GWAS of autism in cohorts with a large fraction of individuals diagnosed in adolescence or later, consistent with predictions from the developmental model.

We formally tested this by modelling the genetic covariance using the structural equation models in GenomicSEM42, testing six theoretical models (Supplementary Table 14). We used six minimally overlapping GWASs for autism with wide variation in age at autism diagnosis among those listed in Box 1. We found that a correlated two-factor model was the most parsimonious and fit the data best (Akaike information criterion, 38.64; comparative fit index, 0.99; standardized root mean residual, 0.08; Fig. 4b). Factor 1 (earlier-diagnosed autism factor) was defined by the GWASs with predominantly early childhood-diagnosed individuals (PGC-2017, SPARKbefore6, with a median age at diagnosis of 3). Factor 2 (later-diagnosed autism factor) was defined primarily by GWASs with adolescent- or adult-diagnosed individuals (iPSYCHafter10, FinnGen and SPARKafter10). The cross-loading of iPSYCHbefore9 (median age at diagnosis of around 5.7) indicates that factor 2 may impact behaviours in mid-to-late childhood as well. The two factors had a small genetic correlation (rg = 0.38, s.e. = 0.06). Sensitivity analyses confirmed the robustness of the above results using partly different GWASs, in which we identified a two-correlated-factor model as the best-fitting model, with similar moderate genetic correlations between the two factors (rg = 0.37, s.e. = 0.06 to rg = 0.52, s.e. = 0.10; Supplementary Table 14).

The earlier-diagnosed autism factor was negatively genetically correlated with age at autism diagnosis (Fig. 5). The later-diagnosed autism factor was positively genetically correlated only with age at autism diagnosis from SPARK. Genetic correlation with the autism GWAS stratified by sex showed that both autism factors had stronger genetic correlations with autism in male than in female individuals, with a larger difference for the earlier-diagnosed autism factor (Fig. 5), consistent with established sex differences in age at autism diagnosis.

Fig. 5: Genetic correlation between the two autism polygenic factors and a range of mental-health, neurodevelopmental and cognitive traits.

Central points indicate the estimate (genetic correlation), error bars indicate 95% confidence intervals and asterisks indicate significant P-values (two-sided) with Benjamini–Yekutieli adjustment. Sample sizes are shown in Supplementary Table 18.
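The Benjamini–Yekutieli adjustment referred to in the legend can be sketched as follows. This is a generic implementation of the step-up procedure, which controls the false discovery rate under arbitrary dependence between tests; it is not the authors' code, and the P-values are invented for illustration.

```python
def benjamini_yekutieli(pvalues):
    """Return Benjamini-Yekutieli adjusted P-values in the input order."""
    m = len(pvalues)
    if m == 0:
        return []
    # Harmonic-sum correction that distinguishes BY from Benjamini-Hochberg.
    c_m = sum(1.0 / k for k in range(1, m + 1))
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0  # adjusted values are capped at 1
    # Walk from the largest to the smallest P-value to enforce monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        candidate = pvalues[idx] * m * c_m / rank
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


hypothetical_pvalues = [0.01, 0.04, 0.03]
adjusted = benjamini_yekutieli(hypothetical_pvalues)
print([round(p, 4) for p in adjusted])  # [0.055, 0.0733, 0.0733]
```

A test would be marked with an asterisk when its adjusted P-value falls below the chosen threshold; the threshold itself is not stated in this legend.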

Further analyses using polygenic scores confirmed that the association between the two polygenic autism factors and age at autism diagnosis is not due to several confounding factors (tested using within-family approaches), nor to differences in clinical and demographic factors or co-occurring developmental conditions (Supplementary Note 7 and Supplementary Tables 15–17). Taken together, the above findings demonstrate that earlier- and later-diagnosed autism have different polygenic aetiologies, supporting the developmental model.

Genetic correlations with autism factors

The above analyses indicate that there are at least two polygenic factors associated with age at autism diagnosis. Because later-diagnosed autistic individuals have higher rates of mental-health conditions15,16, we proposed that this might partly be because earlier- and later-diagnosed autism factors have differing genetic correlations with mental health and cognitive phenotypes. In Aim 4, we investigated this hypothesis using genetic correlation analyses.

The earlier-diagnosed autism factor (factor 1) had a low (rg of around 0.1–0.2) but significant genetic correlation with educational attainment, cognitive aptitude, ADHD and various mental-health and related conditions (Fig. 5 and Supplementary Table 18). The later-diagnosed autism factor (factor 2) showed a statistically similar genetic correlation with educational attainment but significantly higher genetic correlations (rg of around 0.5–0.7) with ADHD and a range of other mental-health and related conditions, including depression, post-traumatic stress disorder (PTSD), childhood maltreatment and self-harm. After accounting for genetic effects on ADHD, we saw an attenuated but significant moderate genetic correlation between later-diagnosed autism (factor 2) and mental-health conditions, indicating that shared genetics with ADHD do not fully explain the elevated correlation between later-diagnosed autism and mental-health phenotypes (Supplementary Table 18 and Supplementary Fig. 11). Sensitivity analyses using age at diagnosis-stratified GWASs from iPSYCH and SPARK yielded largely consistent genetic correlation results, indicating that these results are not due to cohort differences (Supplementary Table 19 and Extended Data Fig. 7).

The higher genetic correlation between later-diagnosed autism and other mental-health conditions may indicate diagnostic misclassification (in which individuals with other conditions incorrectly receive an autism diagnosis) or diagnostic overshadowing (where the presence of co-occurring mental-health conditions can delay an autism diagnosis). In the unitary model, the genetics of later-diagnosed autism would reflect the additive genetic effects of earlier-diagnosed autism and of other mental-health conditions. However, decomposition of the autism genetic signal using genomicSEM indicated that later-diagnosed autism cannot be entirely attributed to the polygenic effects of earlier-diagnosed autism and six other mental-health conditions tested: schizophrenia, ADHD, anorexia nervosa, depression, bipolar disorder and PTSD (Supplementary Note 8). Thus, the later-diagnosed autism genetic factor does not represent the additive genetic effects of earlier-diagnosed autism and mental-health conditions.

Finally, we investigated whether the two autism polygenic factors related to age at diagnosis differed in their associations with developmental traits. Cross-sectionally, these genetic factors showed few significant differences in genetic correlation or polygenic score association with developmental phenotypes measured at age 3 or earlier. The exceptions were age at onset of walking43 and expressive vocabulary at an age of 2–3 years44, with which the earlier-diagnosed autism factor was positively genetically correlated but the later-diagnosed factor was not (Supplementary Note 9 and Supplementary Tables 20–22). However, longitudinal polygenic score analyses across two birth cohorts revealed differential genetic effects of earlier- versus later-diagnosed autism factors on SDQ total difficulties scores over time (Supplementary Note 9 and Supplementary Table 23).

Discussion

The results of this study indicate that earlier- and later-diagnosed autism are associated with different developmental trajectories, and are only moderately genetically correlated with each other. This possible framework provides one axis of heterogeneity to describe the widely acknowledged clinical and genetic heterogeneity within autism that thus far has been challenging to identify. This finding of two or more developmentally variable polygenic latent traits for autism is robust to various observed clinical and demographic factors (Supplementary Note 10), including sex and intellectual disability.

Consistent with the wider literature on developmental variation in autism22, our analyses of socioemotional and behavioural trajectories across multiple birth cohorts converge with the genetic findings: polygenic scores for earlier- and later-diagnosed autism have different associations with developmental changes in SDQ total difficulties scores (Supplementary Note 9). These results indicate that the timing of autism diagnosis may partly reflect aetiologically different developmental pathways, rather than purely environmental or diagnostic factors. Our findings are consistent with the wider literature that demonstrates that genetic influences on traits related to autism vary across development in the general population19,20.

This two-polygenic-trait genetic model provides one framework to understand genetic heterogeneity in autism and the varying patterns of genetic correlations between different GWASs of autism and other phenotypes. For example, previous GWASs of autism (including PGC-2017 (ref. 45)) found limited genetic correlation with ADHD, contrary to findings from more-recent autism GWAS (such as Grove et al.46). We show that this is explained by the different average age at diagnosis across these GWASs (Box 1 and Fig. 3), because the genetic correlation between autism and ADHD increases with later age at autism diagnosis (Fig. 5). These findings were confirmed using within-family analyses that demonstrated over-transmission of ADHD polygenic scores, mainly to individuals with a later autism diagnosis (Supplementary Table 24).

Both the later-diagnosed genetic autism factor (Fig. 4) and the late childhood emergent latent trajectory of SDQ total difficulties scores (Fig. 1) are associated with greater mental-health problems (Fig. 5 and Supplementary Table 7). This indicates that epidemiological findings of greater mental-health difficulties among later-diagnosed autistic individuals15,16 may be partly explained by the developmental model of autism. Given that autistic female individuals are, on average, diagnosed later in life, research that investigates sex and gender differences in both autism and co-occurring conditions16,47 needs to account for genetic confounding associated with age at autism diagnosis. Findings that may seem to reflect sex differences may also partly reflect differences associated with age at diagnosis. For example, the higher prevalence of mental-health problems in autistic female individuals16,47 compared with male individuals attenuates when restricting to autistic individuals diagnosed before the age of 5 (ref. 16).

These findings must be interpreted considering several limitations. First, the SNP-based heritability for age at autism diagnosis is only about 11% (Fig. 2), and other observed developmental and demographic factors typically explain less than 15% of the variance (Extended Data Fig. 1). This indicates that there are several other factors that contribute to age at autism diagnosis. We find that the genetic effects on age at autism diagnosis are not mediated by several of these measured developmental and demographic factors, but we acknowledge that there may be several unmeasured factors that may mediate the genetic effects. Furthermore, the substantial variation across the datasets explored highlights that age at autism diagnosis is immensely complex and varies across geography and time. Local cultural factors, access to health care, gender bias, stigma, ethnicity and camouflaging probably have an effect on who receives a diagnosis and when. Second, our trajectory models (Fig. 1) were built using only the SDQ, which measures a wide range of parent-reported neurodevelopmental and mental-health traits. Although the SDQ is correlated with an autism diagnosis, it does not fully capture core autistic traits, and other measures of autistic traits were not available in the birth cohorts. Third, it is likely that other dimensions contribute to the genetic heterogeneity in autism. For example, a significant proportion of the variation in the FinnGen autism GWAS was not explained by either of the two factors (Fig. 4). Fourth, we use earlier- and later-diagnosed autism as relative terms, reflecting that developmental and polygenic differences represent a gradient (Fig. 4 and Supplementary Fig. 10), rather than being discrete categories, and because there is no consensus on age thresholds for early versus late diagnosis48.
Fifth, autism diagnoses in the birth cohorts used in the current study rely on community-based carer or self-reporting, rather than standardized clinical assessments. As such, there may be varying delays between the emergence of autistic features and a formal autism diagnosis. However, our findings can guide future research using longitudinal cohorts, and particularly sibling studies, that systematically track the emergence of autism features over time. Finally, our genetic analyses focused on common genetic variants in genetically inferred European ancestries, because we had only limited GWAS data from other populations, highlighting the need for future research to examine the transferability of these findings across diverse genetic ancestries.

In conclusion, we find that the developmental trajectories and polygenic architecture of autism vary with age at diagnosis. These findings partly explain the varying genetic correlations among the different GWASs of autism and between autism and various mental-health conditions. These findings provide further support for the hypothesis that the umbrella term ‘autism’ describes multiple phenomena with differing aetiologies, developmental trajectories and correlations with mental-health conditions. These findings have implications for how we conceptualize neurodevelopment more broadly, and for understanding diagnosis, sex and gender differences, and co-occurring health profiles in autism.

Methods

Terminology

We use the terms autistic and non-autistic to refer to people with and without an autism diagnosis51. For sex, male and female refer to sex assigned at birth.

Explaining variance in age at autism diagnosis

To contextualize the SNP-based heritability and the variance explained by the SDQ total difficulties and subscale scores, we reviewed the variance in the age at autism diagnosis explained by various sociodemographic and clinical factors, including sex and autism severity. Using Google Scholar and PubMed, we searched for studies published between 1998 and 10 December 2024 using combinations of the following terms in the title or abstract: ‘age at diagnosis’ AND ‘autism’; ‘autism’ AND ‘age’; and ‘diagnosis age’ AND ‘autism’. We also used these search terms with alternative terminology for autism, including ‘autism spectrum condition’, ‘autism spectrum disorder’ and ‘ASD’. This search resulted in more than 1,700 studies. A manual review identified 184 studies that investigated factors associated with age at autism diagnosis. Of these, 13 quantified the variance explained using measures of proportion of variance explained such as R2 (coefficient of determination) or η2 (effect size) and these were included in our final analyses (Supplementary Table 1).

We also calculated the variance explained in the US-based cohort of autistic individuals and their families, SPARK39, using the v.9 release of the phenotypic data. We focused on the variance explained by sociodemographic factors (sex, reported race, household income, mother’s education and father’s education), cognitive and developmental factors (reported IQ score, reported intellectual disability, age at walking independently, age at first words, language regression and other regression) and autism severity (scores on the Social Communication Questionnaire and Repetitive Behaviour Scale-Revised). After excluding individuals with missing data, we quantified the variance explained by these factors using relative importance analysis (see method in ref. 52) for 5,773 autistic individuals diagnosed before the age of 22. Analyses were done using the relaimpo (v.2.2-7) package in R, which allowed us to examine the contributions of all variables simultaneously53.

Trajectory analyses of birth cohorts

Cohorts

We used four population-based birth cohorts that varied both in the ages at which the data were collected from participants and the calendar years of the data collection (Extended Data Fig. 4). In brief, the four cohorts included were the UK-based MCS54, the Australia-based LSAC-B and LSAC-K cohorts55,56 and the Ireland-based Growing Up in Ireland (GUI) Child cohort (aka Cohort 98)55,56. All children included in the cohorts were born in the twenty-first century. Further details about the cohorts are provided in Supplementary Note 1. We used data from MCS, LSAC-B and LSAC-K for the growth mixture models. We did not use data from GUI for growth mixture models because there were only three time points, which is not enough to identify two or more trajectories. All four cohorts were used for latent growth curve models.

As indicated in Supplementary Table 2, these cohorts were selected because they were longitudinal in nature, were nationally representative and included key data on behavioural profiles and neurodevelopmental diagnosis. These overlapping features across datasets allowed for cross-country comparisons and generalization57.

Measures

Autism and ADHD diagnosis and age at diagnosis

In all cohorts, across multiple sweeps, the main carer was asked whether the participant had a diagnosis of autism (Extended Data Fig. 4). For age at diagnosis, we used the age at the sweep when carers first reported their child’s autism diagnosis in every cohort, to maximize sample sizes and ensure consistency across cohorts for effective comparisons. For instance, if an autism diagnosis was reported for the participant for the first time at the age 11 sweep, we considered the age at diagnosis to be 11 years. Although the specific age at diagnosis was provided for LSAC-B and LSAC-K, we opted not to use this, because we identified errors in some reports in which months and years of diagnosis were swapped or not reported.
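
As an illustration of the rule described above, the following minimal Python sketch derives age at diagnosis from sweep-level carer reports. The data layout (a mapping from sweep age to a yes/no report) and the function name are hypothetical and are not taken from the cohort data dictionaries.

```python
def age_at_first_report(reports):
    """Return the age (in years) of the first sweep with a reported diagnosis.

    reports: dict mapping sweep age -> True, False or None (not asked/missing).
    This layout is an assumption made for illustration only.
    """
    ages = sorted(age for age, reported in reports.items() if reported)
    return ages[0] if ages else None


# A child first reported as autistic at the age 11 sweep is assigned
# an age at diagnosis of 11 years, as in the example in the text.
child = {5: False, 7: False, 11: True, 14: True}
first_age = age_at_first_report(child)
```

Because the sweep age is an upper bound on the true age at diagnosis, this approach trades precision for consistency across cohorts.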

For MCS, we used reports of both autism and ADHD diagnoses for further sensitivity analyses. For our primary analyses, we included a narrowly defined sample of children with consistently reported autism diagnoses by primary and proxy carers (when both were available) and no other reported neurodevelopmental diagnosis (particularly ADHD). To assess the generalizability of our results and increase the sample size, we then expanded the sample to include all children with any reported diagnosis of autism (results are reported in Supplementary Note 3). This expanded sample included cases regardless of whether the diagnoses were consistent across sweeps or carers, and included those with co-occurring ADHD. Furthermore, we imputed the independent variables and covariates for autistic individuals with missing information, as detailed below (more details are in Supplementary Note 2 and Supplementary Note 3). Finally, to assess the specificity of the trajectories for autism, we conducted analyses among children who had a consistent ADHD diagnosis but no diagnosis of autism. Imputation was also performed within this ADHD-only sample, to increase the sample size.

SDQ

We used the SDQ to obtain social, emotional and behavioural profiles of participants, with repeated measures from 3 years to 18 years across cohorts. The SDQ comprises 25 statements that carers were asked to rate on a three-point Likert scale (‘not true’, ‘somewhat true’ and ‘certainly true’) based on the child’s symptoms or behaviours over the past six months. There were five subscales, each containing five items, which assessed emotional symptoms, conduct problems, hyperactivity−inattention, peer relationship problems and prosocial behaviours23. The first four subscales assessed difficulties, and their combined total score ranged from 0 to 40, with higher scores indicating greater difficulties. The fifth subscale (prosocial behaviours) represented strengths and ranged from 0 to 10, with higher scores indicating more prosocial behaviours. We analysed the total score and each subscale separately. The SDQ demonstrates good test–retest reliability and criterion validity across countries24,25,26. Its five-factor structure (each subscale as a factor) has shown consistency and invariance across age, sex and ethnic background24,28. The SDQ captures several core features of mental health and neurodevelopmental conditions, including autism and ADHD58. Only children with complete SDQ data across all sweeps were included in the analyses, except for the imputation analyses.
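
The scoring structure described above can be sketched in a few lines of Python. This is an illustrative sketch only: the subscale keys are hypothetical labels, items are assumed to be coded 0, 1 or 2, and any reverse-keyed items are assumed to have been recoded beforehand.

```python
# Hypothetical subscale labels; the four difficulty subscales feed the total.
DIFFICULTY_SUBSCALES = ("emotional", "conduct", "hyperactivity", "peer")


def score_sdq(responses):
    """Return subscale scores (0-10 each) and total difficulties (0-40).

    responses: dict mapping subscale label -> list of five item scores,
    each coded 0 ('not true'), 1 ('somewhat true') or 2 ('certainly true').
    """
    scores = {}
    for subscale, items in responses.items():
        if len(items) != 5 or any(i not in (0, 1, 2) for i in items):
            raise ValueError(f"{subscale}: expected five items coded 0, 1 or 2")
        scores[subscale] = sum(items)
    # The prosocial subscale measures strengths and is excluded from the total.
    scores["total_difficulties"] = sum(scores[s] for s in DIFFICULTY_SUBSCALES)
    return scores


example = {
    "emotional": [1, 0, 2, 1, 0],
    "conduct": [0, 0, 1, 0, 0],
    "hyperactivity": [2, 2, 1, 2, 1],
    "peer": [1, 1, 0, 2, 0],
    "prosocial": [2, 2, 1, 2, 2],
}
result = score_sdq(example)
```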

Sociodemographic measures

Sociodemographic measures were included as covariates to account for their impact on age at diagnosis in each cohort (Supplementary Table 25). Specific measures and available information vary across cohorts, but we generally included sex, ethnic background, maternal age at delivery, child’s cognitive aptitude, household SES and deprivation level of the living area, to account for factors that may affect the age when someone received an autism diagnosis59,60. Only subsets of children in the complete-SDQ samples, with complete data for these sociodemographic factors, were included in the analyses.

For MCS, although various census classifications for ethnic groups were available, we opted to use a binary indicator to identify non-white ethnic minorities. Ethnicity data were not collected in either LSAC cohort. Instead, visible ethnic minority status was determined mainly by parental country of birth and the language(s) spoken at home61. Maternal age at delivery was collected only for MCS. For other cohorts, we used maternal age (in years) at the first sweep of data collection to reflect the variation in maternal age at delivery.

For MCS, we identified multiple variables linked to cognitive ability, SES and area deprivation. Similarly, for LSAC-B, we identified multiple measures linked to SES, although there were no measures with sufficient sample size linked to cognitive ability or area deprivation. Subsequently, we conducted principal component analysis (PCA) for cognitive abilities, SES and area deprivation in MCS, and for SES in LSAC-B. PCA was done using a wide range of measures collected across sweeps (Supplementary Table 25), with one variable excluded from any pair with a correlation coefficient greater than 0.70 to address multicollinearity. The first principal component (PC1) explained more than 40% of the variance for each corresponding factor. By contrast, subsequent components contributed substantially less, supporting the use of the respective PC1 as the summary measure for cognitive ability, SES and area deprivation (Supplementary Table 25).
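
The correlation-based exclusion step that precedes PCA can be illustrated with the Python sketch below. It assumes a greedy rule that keeps the first variable of any highly correlated pair and compares absolute correlations; the text does not specify which member of a pair was dropped, and the variable names and values are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def prune_correlated(columns, threshold=0.70):
    """Keep a variable only if |r| <= threshold with every variable kept so far.

    Assumption: the earlier variable in each correlated pair is retained.
    """
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical SES indicators for five households.
ses = {
    "income": [1, 2, 3, 4, 5],
    "income_band": [1, 2, 3, 4, 6],       # nearly collinear with income
    "parent_education": [2, 1, 4, 3, 3],
}
retained = prune_correlated(ses)
```

The retained variables would then be passed to PCA, with PC1 used as the summary measure.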

Intellectual disability was defined as scoring two standard deviations below the mean on the PC1 of the cognitive aptitude factor, consistent with previous studies. No autistic children in the MCS or LSAC cohorts who had measures of cognitive aptitude met the criteria for intellectual disability, probably because of participation bias. All PCA analyses were done in R using the prcomp() function62.

Statistical analyses

Growth mixture models and latent growth curve models

We used two methods to model the longitudinal trajectories of SDQ total and subscale scores. First, we used growth mixture models to identify whether there were latent groups of autistic individuals, based on their trajectories of SDQ total and subscale scores. Growth mixture models assume that the sample consists of multiple mixed effects models, each capturing a subgroup trajectory with shared intercept and slope63. We fitted models with one to four groups for each subscale and SDQ total scores in each cohort, using the lcmm (v.2.1.0) package in R64. The optimal number of latent trajectories was then determined by comparing fit indices, including Bayesian information criterion values, classification quality measure (entropy) and substantive interpretation. Models with lower Bayesian information criterion values and higher entropy were favoured65. Models identifying subgroups with fewer than 5% of the sample were not considered, owing to poor statistical reliability and limited practical significance66. We compared the distribution of group memberships across diagnostic ages and across sexes using χ2 tests.
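
The quantitative part of this model-selection logic can be sketched as follows. This Python sketch is not the lcmm implementation: it assumes that each fitted model supplies a log-likelihood, a parameter count and posterior class probabilities, uses the standard definitions of the Bayesian information criterion and relative entropy, and leaves substantive interpretation to the analyst. All input values are made up.

```python
import math


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion; lower values indicate better fit."""
    return -2.0 * log_likelihood + n_params * math.log(n_obs)


def relative_entropy(posteriors):
    """Classification quality in [0, 1]; higher means cleaner separation."""
    n_classes = len(posteriors[0])
    if n_classes < 2:
        return None  # not defined for a one-class model
    raw = -sum(p * math.log(p) for row in posteriors for p in row if p > 0)
    return 1.0 - raw / (len(posteriors) * math.log(n_classes))


def smallest_class_share(posteriors):
    """Share of the sample in the smallest class under modal assignment."""
    counts = [0] * len(posteriors[0])
    for row in posteriors:
        counts[row.index(max(row))] += 1
    return min(counts) / len(posteriors)


def select_model(candidates, n_obs, min_share=0.05):
    """Drop models with a class below min_share, then pick the lowest BIC."""
    eligible = [c for c in candidates
                if smallest_class_share(c["posteriors"]) >= min_share]
    return min(eligible, key=lambda c: bic(c["loglik"], c["n_params"], n_obs))


# Made-up fits for ten individuals; the three-class model has an empty class.
candidates = [
    {"name": "two-class", "loglik": -260.0, "n_params": 5,
     "posteriors": [[0.9, 0.1]] * 6 + [[0.2, 0.8]] * 4},
    {"name": "three-class", "loglik": -250.0, "n_params": 8,
     "posteriors": [[0.6, 0.3, 0.1]] * 6 + [[0.2, 0.7, 0.1]] * 4},
]
chosen = select_model(candidates, n_obs=10)
```

In this toy example the three-class model has the lower Bayesian information criterion but is excluded by the 5% rule, so the two-class model is retained.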

Second, we used linear latent growth curve models to identify the latent trajectories of SDQ total and subscale scores in the three groups (childhood diagnosed, adolescent diagnosed and the general population) for all cohorts. Each linear model included a latent intercept to represent the initial level of the outcome variable, and a linear latent slope to represent the mean rate of change over time. Childhood-diagnosed (diagnosed before ages 9–11, depending on the cohort) and adolescent-diagnosed (diagnosed after the ages of 9–11, depending on the cohort) were defined in advance. We chose this 9–11 age window as our cut-off because it aligns with the onset of puberty and the transition from primary to secondary school, and aligns with epidemiological evidence showing increased autism incidence in female individuals during this window34. An earlier cut-off was not feasible because only MCS (ages 5–7) and GUI (age 7) recorded autism diagnoses before this window. A later cut-off was not possible because there were no autism diagnoses in MCS after age 14. To further examine the relationship between age at diagnosis and socioemotional and behavioural outcomes, we conducted further latent growth curve models for autistic children using stepwise groupings by age at diagnosis in MCS and LSAC-B (see Supplementary Note 5 and Supplementary Fig. 10). All individuals who lacked an autism diagnosis (and also an ADHD diagnosis in MCS) were included in the general-population group.

Given the sex differences in age at autism diagnosis34, we also applied the same models stratified by sex, estimating latent intercept and slope for each sex, within the autistic samples. All latent growth curve models were fitted under the structural equation modelling framework using the lavaan (v.0.6-19) package in R67.

Association with sex and mental-health phenotypes

To examine the association between growth mixture models-derived SDQ latent trajectories and mental-health phenotypes in MCS, LSAC-B and LSAC-K, adjusting for sex, we used multiple regression in autistic individuals.

Variance explained in age at autism diagnosis and mediation analysis

Multiple regression analyses were conducted in MCS, LSAC-B and LSAC-K to investigate the association between age at autism diagnosis (the outcome variable), SDQ total difficulties and subscale latent trajectories memberships identified in optimal growth mixture models, and also accounting for other sociodemographic covariates. We did not detect any multicollinearity among the variables using variance inflation factors.

The relative importance of each predictor was assessed using dominance analysis68. We used the misty69 (v.0.6.8) package in R for this analysis, using a correlation matrix extracted from the fitted model via the lavInspect function from the lavaan (v.0.6-19) package67. This approach leverages the correlation matrix to consider not only individual predictors, but also the correlations between them, providing a more comprehensive assessment of their relative importance70.
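
To make the idea concrete, the following Python sketch computes general dominance weights directly from a correlation matrix. It is a from-scratch illustration, not the misty implementation; the function names and the toy correlation values are hypothetical. For each predictor, the incremental R2 is averaged within each subset size of the remaining predictors and then across sizes, so that the weights sum to the R2 of the full model.

```python
from itertools import combinations


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [x - f * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def r_squared(corr, outcome, subset):
    """R2 of the outcome regressed on `subset`, using correlations only."""
    if not subset:
        return 0.0
    rxx = [[corr[i][j] for j in subset] for i in subset]
    rxy = [corr[i][outcome] for i in subset]
    beta = solve(rxx, rxy)  # standardized regression coefficients
    return sum(b * r for b, r in zip(beta, rxy))


def general_dominance(corr, outcome, predictors):
    """Average incremental R2 of each predictor across all subset sizes."""
    weights = {}
    for p in predictors:
        others = [q for q in predictors if q != p]
        by_size = []
        for k in range(len(others) + 1):
            gains = [r_squared(corr, outcome, list(s) + [p])
                     - r_squared(corr, outcome, list(s))
                     for s in combinations(others, k)]
            by_size.append(sum(gains) / len(gains))
        weights[p] = sum(by_size) / len(by_size)
    return weights


# Toy correlation matrix: indices 0 and 1 are predictors, index 2 the outcome.
corr = [[1.0, 0.3, 0.5],
        [0.3, 1.0, 0.4],
        [0.5, 0.4, 1.0]]
weights = general_dominance(corr, outcome=2, predictors=[0, 1])
```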

To examine potential causal pathways, mediation analyses were done, allowing sociodemographic factors to indirectly influence the age at diagnosis through their effects on SDQ latent trajectory memberships identified in the optimal growth mixture model. Using structural equation modelling in the lavaan (v.0.6-19) package67, both direct and indirect effects were assessed, with their significance calculated using bootstrapping analysis. Further details are provided in Supplementary Note 4.

To investigate the specificity of our findings to autism, we used growth mixture models, latent growth curve models, regression and mediation analyses in individuals with ADHD, but without a co-occurring autism diagnosis in the MCS cohort (n = 89, Supplementary Table 6, with results presented in Supplementary Note 5). ADHD diagnoses were available in the same sweeps as autism diagnoses, reported at ages 5, 7, 11 and 14. Carers were asked the following question: ‘Has a doctor or other health professional ever told you that <child’s name> had attention deficit hyperactivity disorder (ADHD)?’.

Imputation

To assess the impact of missingness, we used softImpute (v.1.4-1) to impute missing data for all children with an autism or ADHD diagnosis reported by any carer in any sweep in the MCS cohort (autism: n = 623, Supplementary Table 5; ADHD: n = 325, Supplementary Table 6). We chose softImpute because of its computational efficiency in handling large-scale matrices through low-rank approximation, effectively preserving underlying structure of input data. Further information is provided in Supplementary Note 2.

SPARK: genotyping, quality control and imputation

We used data from the SPARK cohort39 iWES2 v.1 dataset (released in February 2022), which included data from 70,487 autistic individuals and their families as the SPARK Discovery cohort. Data from SPARK iWES v.3 (released in August 2024), which included an additional 71,267 autistic individuals and their families, was included in the SPARK Replication cohort. All participants in the Discovery cohort were genotyped using the Illumina Global Screening Array (GSA_24v2-0_A2) and in the Replication cohort using the Twist Bioscience genome-wide SNP capture panel for genotyping by sequencing. To avoid false positives caused by fine-scale population stratification, we restricted the analyses to individuals of genetically inferred European ancestries (Discovery, n = 51,869; Replication, n = 50,211 autistic and non-autistic participants), which was provided by the SPARK consortium. From this, we excluded individuals with a genotyping rate of less than 98%, individuals with sex mismatches and those with excess heterozygosity (3 standard deviations from the mean heterozygosity). Where trio data were available, trios with greater than 5% Mendelian errors were excluded, resulting in 47,170 (Discovery) and 48,750 (Replication) autistic and non-autistic individuals. We included genetic variants with minor allele frequency (MAF) greater than 1%, genotyping rate more than 95% and that were in Hardy–Weinberg equilibrium (P > 1 × 10−6), resulting in 518,189 (Discovery) and 1,225,308 (Replication) SNPs.
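
The variant-level filters listed above can be illustrated with a short Python sketch. This is not the pipeline used in the study: genotypes are assumed to be coded as 0, 1 or 2 copies of one allele with None for missing calls, and an asymptotic one-degree-of-freedom chi-square test stands in for whichever Hardy–Weinberg test the original tooling applied.

```python
import math


def variant_passes_qc(genotypes, min_maf=0.01, min_call_rate=0.95, hwe_p=1e-6):
    """Apply MAF, call-rate and Hardy-Weinberg filters to one variant.

    genotypes: list of 0/1/2 allele counts, with None for missing calls
    (a hypothetical coding chosen for this sketch).
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return False
    call_rate = len(called) / len(genotypes)
    n = len(called)
    q = sum(called) / (2 * n)          # frequency of the counted allele
    p = 1.0 - q
    maf = min(p, q)
    if maf == 0:
        return False                   # monomorphic variant
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    observed = [called.count(0), called.count(1), called.count(2)]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square distribution with 1 d.f.
    p_hwe = math.erfc(math.sqrt(chi2 / 2.0))
    return call_rate > min_call_rate and maf > min_maf and p_hwe > hwe_p


in_equilibrium = [0] * 25 + [1] * 50 + [2] * 25   # matches HWE expectations
all_heterozygous = [1] * 100                      # strong HWE violation
low_call_rate = [None] * 10 + [0] * 22 + [1] * 46 + [2] * 22
```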

We used these quality-controlled genotype data for imputation, calculating genetic principal components and inferring relatedness among individuals. We inferred genetic relatedness using KING71 (v.2.3.2). For genetic PCA, we pruned SNPs for linkage disequilibrium (maximum r2 = 0.1) and removed the human leukocyte antigen region. Using PC-AiR72 in GENESIS (v.2.22.2), we first calculated PCs in genetically unrelated individuals and then projected the PCs onto related individuals. We imputed genotypes using the TOPMED imputation panel73 on the Michigan imputation server (v.1.7.3)74 using Minimac4 (ref. 74), after phasing using Eagle v.2.5 (ref. 75). After imputation, variants were converted from GRCh38/hg38 to GRCh37/hg19 using liftOver. We restricted downstream analyses to variants with MAF > 0.1% and with an imputation R2 > 0.6.

SPARK: association analyses

Polygenic score association analyses

Polygenic scores (PGSs) were calculated using PRScs76 (v.1.1.0), which has a Bayesian shrinkage prior. PGSs were calculated for autism diagnosed before age 11 (iPSYCHbefore11) and autism diagnosed after age 10 (iPSYCHafter10), generated using the iPSYCH2015 (ref. 77) cohort, details of which are provided below. For simplicity, we refer to this cohort as iPSYCH throughout. PGSs were calculated and analysed separately for the Discovery and Replication cohorts and meta-analysed using inverse-variance-weighted meta-analysis78.
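
For readers unfamiliar with the approach, a fixed-effect inverse-variance-weighted meta-analysis of two cohort-level estimates can be sketched in Python as below. The function name is hypothetical and the effect sizes and standard errors are placeholders, not results from this study.

```python
import math


def ivw_meta(estimates):
    """Fixed-effect inverse-variance-weighted meta-analysis.

    estimates: list of (beta, standard_error) pairs from independent cohorts.
    Returns the pooled beta, its standard error, the Z score and a
    two-sided P value from the normal distribution.
    """
    weights = [1.0 / se ** 2 for _, se in estimates]
    beta = sum(w * b for w, (b, _) in zip(weights, estimates)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, z, p


# Placeholder Discovery and Replication estimates for one polygenic score.
pooled_beta, pooled_se, z_score, p_value = ivw_meta([(0.10, 0.02), (0.06, 0.04)])
```

More precise estimates receive larger weights, so the pooled estimate here lies closer to the first (smaller standard error) cohort.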

We ran separate linear regression analyses between each of the two PGSs and age at autism diagnosis (converted to years in all analyses) in the quality-controlled dataset. We excluded individuals older than 22 to focus on those who had an autism diagnosis using either the DSM-IV79 or DSM-5, retaining a maximum of 18,809 (Discovery cohort) and 9,383 (Replication cohort) autistic individuals for PGS analyses. This criterion also allowed us to focus on individuals who received their diagnosis in childhood or adolescence, because older adults may have missed an earlier diagnosis of autism owing to secular changes in societal attitudes towards autism. The baseline model included intellectual disability (16.34% had carer-reported intellectual disability), sex and the first 10 genetic PCs as covariates. We ran eight different sensitivity analyses by including various covariates as well as the covariates included in the baseline model. First, we ran three models to account for developmental and clinical covariates: age at walking and age at first words (model 2), age at walking and first words, autism severity (SCQ and RBS-R total scores), carer-reported IQ scores, and language or other regression (model 3), and stratified analyses restricted to individuals without intellectual disability who can speak in longer sentences (model 4). Next, we ran two models accounting for sociodemographic factors: parental SES (model 5) and also area deprivation (model 6). We controlled for any attentional and behavioural diagnosis (model 7), for DSM edition (DSM-IV versus DSM-5; model 8) and trio status (model 9). Finally, we also ran sensitivity analyses after stratifying by sex.

In the SPARK cohort, we obtained data for age at achieving nine developmental milestones (in months) for autistic individuals. For all milestones, we excluded individuals who were more than five median absolute deviations from the median. We ran multiple linear regression with PGS for iPSYCHbefore11 and iPSYCHafter10 GWASs with sex and the first ten genetic PCs as covariates. Again, we ran the analyses for both the Discovery and Replication cohorts and meta-analysed them using inverse-variance-weighted meta-analysis.
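
The outlier rule for milestones can be illustrated with the following Python sketch. It assumes an unscaled median absolute deviation (no normal-consistency constant), because the text does not state whether a scaling factor was applied; the milestone ages are made up.

```python
from statistics import median


def within_mad(values, n_mads=5.0):
    """Keep values within n_mads median absolute deviations of the median.

    Assumption: the MAD is used unscaled. If the MAD is zero, only values
    equal to the median are retained.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    return [v for v in values if abs(v - med) <= n_mads * mad]


# Made-up ages at walking independently, in months.
ages_walking = [10, 11, 12, 12, 13, 14, 15, 60]
retained_ages = within_mad(ages_walking)
```

Here the median is 12.5 months and the median absolute deviation is 1.5 months, so the value of 60 months falls outside the 7.5-month window and is excluded.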

Rare high-impact de novo variants and inherited variants

We identified rare (MAF < 0.1%) de novo and inherited variants in complete trios from SPARK, as previously described40. We identified high-impact protein-truncating variants by restricting to variants in genes in the highly constrained decile of loss-of-function observed/expected upper bound fraction (LOEUF)80 (LOEUF < 0.37) that were annotated as frameshift, stop gained or start lost, and that had a loss-of-function transcript effect estimator (LOFTEE) high-confidence annotation. To identify high-impact de novo missense variants, we restricted to variants in LOEUF highly constrained genes (LOEUF < 0.37) that had an MPC (missense badness, PolyPhen-2, and constraint) score81 > 2.
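
These annotation filters can be summarized in a small Python sketch. The dictionary keys, consequence labels and the 'HC' flag are hypothetical stand-ins for whatever fields an annotation pipeline produces, and the sketch does not distinguish de novo from inherited variants.

```python
# Hypothetical consequence labels for protein-truncating variants.
PTV_CONSEQUENCES = {"frameshift", "stop_gained", "start_lost"}


def is_high_impact(variant, loeuf_max=0.37, mpc_min=2.0, maf_max=0.001):
    """Flag rare variants in constrained genes that meet the impact criteria.

    variant: dict with hypothetical keys 'maf', 'loeuf', 'consequence',
    and optionally 'loftee' (for PTVs) or 'mpc' (for missense variants).
    """
    if variant["maf"] >= maf_max or variant["loeuf"] >= loeuf_max:
        return False
    if variant["consequence"] in PTV_CONSEQUENCES:
        return variant.get("loftee") == "HC"   # high-confidence annotation
    if variant["consequence"] == "missense":
        return variant.get("mpc", 0.0) > mpc_min
    return False


ptv = {"maf": 0.0001, "loeuf": 0.20, "consequence": "stop_gained", "loftee": "HC"}
mild_missense = {"maf": 0.0001, "loeuf": 0.20, "consequence": "missense", "mpc": 1.2}
unconstrained = {"maf": 0.0001, "loeuf": 0.90, "consequence": "frameshift", "loftee": "HC"}
```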

We ran regression analyses for: high-impact de novo and inherited protein-truncating variants; missense variants; and by combining both protein-truncating and missense variants. We included sex and the first ten genetic PCs as covariates.

GWAS

GWAS of age at autism diagnosis

We generated a GWAS of age at autism diagnosis (in years) in the quality-controlled dataset from SPARK, restricting it to autistic individuals who were under 22 years of age (Discovery, n = 18,809; Replication, n = 9,356) and SNPs with MAF > 1%. GWAS was generated using FastGWA82 with sex, intellectual disability and the first ten genetic PCs as covariates. We meta-analysed the GWAS from the SPARK Discovery and Replication cohorts using inverse-variance-weighted meta-analysis in PLINK83,84 (v.2.0). In iPSYCH, we generated an additional GWAS of age at autism diagnosis (in years) in a quality-controlled dataset of unrelated individuals with sex and intellectual disability included as covariates using FastGWA82, restricting it to SNPs with MAF > 1%. To keep it consistent with SPARK, we excluded individuals who were diagnosed after age 22, leaving a total sample of 18,965 individuals. In brief, pre-imputation quality control of the iPSYCH data was done using the Ricopili pipeline85, prephased using Eagle v.2.3.5 and imputed using Minimac3 (ref. 86), using the downloadable version of the Haplotype Reference Consortium87 (accession number EGAD00001002729). Further details of quality control and imputation are provided in ref. 88.

GWAS of autism stratified by age at diagnosis

We generated three age at autism diagnosis stratified GWASs in SPARK using (unscreened) non-autistic parents and siblings as controls (Discovery, ncontrol = 24,965; Replication, ncontrol = 33,302). The three GWASs were: first, SPARK, diagnosed before age 6 (SPARKbefore6; Discovery, nautistic = 14,578; Replication, nautistic = 6,857); second, SPARK, diagnosed before age 11 (SPARKbefore11; Discovery, nautistic = 18,719; Replication, nautistic = 9,162); and third, SPARK, diagnosed after age 10 (SPARKafter10; Discovery, nautistic = 3,358; Replication, nautistic = 2,885). For these analyses, we did not restrict it to individuals under 22, to increase the sample size. Of note, SPARKbefore11 overlaps with the SPARKbefore6 cohort. GWASs were generated using quality-controlled SNPs with MAF > 1% using FastGWA-GLMM89. We included age at recruitment in the study (to account for the use of parents as controls, who potentially lack an autism diagnosis owing to secular changes in attitudes and diagnosis), sex and the first ten genetic PCs as covariates. FastGWA-GLMM can account for relatedness and fine-scale population stratification, even in family-based samples such as SPARK. Given the relatively low sample size of the Replication cohort, we did the meta-analysis of the Replication and Discovery cohort GWAS using inverse-variance-weighted meta-analyses in PLINK83,84 (v.2.0).

Although inclusion of unscreened related individuals as controls can decrease heritability and statistical power to identify loci90, we used the GWAS to primarily conduct genetic correlation and related analyses. To ensure the robustness of these models we did the following: first, we confirmed that the attenuation ratio for all GWASs was not significantly greater than 1; second, we generated an additional GWAS of SPARK without stratifying by age at autism diagnosis using the same methods and confirmed a high genetic correlation (rg = 0.92, s.e. = 0.17) with a previous SPARK GWAS49, which used a case-pseudocontrol approach; and third, in the genomicSEM analyses, we ran sensitivity analyses using a trio-based SPARK GWAS49 in lieu of the age at diagnosis stratified GWAS from SPARK and confirmed our findings.

We generated three GWASs of autism stratified by age at autism diagnosis in the iPSYCH cohort77. The primary GWASs used in the analyses were GWASs of autism diagnosed before age 11 (iPSYCHbefore11, n = 9,500 autistic and n = 36,667 non-autistic individuals) and autism diagnosed after age 10 (iPSYCHafter10, n = 9,231 autistic and n = 36,667 non-autistic individuals). We chose to subdivide the iPSYCH cohort at age 10 because we observed an increase in SDQ scores in birth cohorts at this age, and because age 10 is associated with an increase in diagnosis of female individuals in epidemiological samples34. We also conducted a GWAS of autism diagnosed before age nine (iPSYCHbefore9, n = 5,451 autistic and n = 36,667 non-autistic individuals).

All individuals included in these GWASs from iPSYCH were born between May 1980 and December 2008 to mothers who were living in Denmark. The GWAS was done for unrelated individuals of European ancestry, with the first ten genetic PCs included as covariates using logistic regression as provided in PLINK.

Heritability, genetic correlation, and genomicSEM

Heritability analyses for age at autism diagnosis were conducted using a single-component genome-wide complex trait analysis with a genomic-relatedness-based restricted maximum likelihood approach (GCTA-GREML v.1.94.1)91,92 in unrelated autistic individuals using the quality-controlled genetic data in SPARK. We estimated SNP-based heritability first after including sex and the first ten genetic PCs as covariates, and then also with intellectual disability as a covariate (baseline model). We ran several sensitivity analyses after including additional covariates: first, developmental milestones (age at first words and age at walking); second, developmental milestones and developmental regression (language regression and other regression); third, developmental milestones, developmental regression and IQ scores; fourth, SCQ and RBS-R scores; fifth, SCQ and RBS-R scores, developmental milestones and developmental regression; sixth, developmental milestones and SES; and finally, developmental milestones, SES and deprivation.

We conducted genetic correlation analyses using LDSC, with linkage disequilibrium scores from the northwest European populations.

We did genetic correlation analyses among different autism GWASs using LDSC (v.1.0.1). This included a European-only case-pseudocontrol GWAS in SPARK49 (4,535 case-pseudocontrol pairs); GWAS from FinnGen (Data Release r10)93 (646 cases and 301,879 controls), the PGC-2017 autism GWAS45 (7,387 cases and 8,567 controls), GWAS from iPSYCH, and age at diagnosis-stratified GWAS from SPARK. The iPSYCH GWAS included an unstratified (19,870 autistic individuals, comprising 15,025 male individuals and 4,845 female individuals, and 39,078 controls) and sex-stratified GWAS50, and three age at diagnosis-stratified GWASs, as mentioned earlier.

For genomicSEM42 (v.0.0.5) analyses, we restricted the input to six GWASs with minimal sample overlap, without high genetic correlation (rg > 0.95) and with wide variation in age at diagnosis, and used autosomes only. Using the patterns of genetic correlations observed, we tested an age at diagnosis-related correlated two-factor model. We also tested: first, a single-factor model; second, a correlated two-factor ‘geography’ model, in which three US-based autism GWASs loaded onto one factor, and three Europe-based autism GWASs loaded onto a second factor; third, a bifactor model based on age at diagnosis; fourth, a bifactor model based on the geography of the cohorts; and finally, a hierarchical factor model based on age at diagnosis. The two-factor model was chosen because it had lower root mean square error of approximation and higher comparative fit index and was more parsimonious than the bifactor model. We ran sensitivity analyses using different GWASs of autism as input and confirmed that the two-correlated-factor model was the best-fitting model of those tested.

Analyses in ALSPAC and MCS

Genetic quality control for ALSPAC

We obtained quality-controlled and imputed genotype data from ALSPAC94,95,96. Further details about the cohort are provided in Supplementary Note 1. In brief, ALSPAC children were genotyped using the Illumina HumanHap550 quad chip genotyping platforms by 23andMe. Some individuals were excluded owing to sex mismatches, excess heterozygosity, missingness greater than 3% and insufficient sample replication (identical by descent score of less than 0.8). After multidimensional scaling and comparison with HapMap II (release 22), only individuals of genetically inferred European ancestries were retained. SNPs with low frequency (MAF < 1%), poor genotyping (call rate < 95%) and deviations from Hardy–Weinberg equilibrium (P < 5 × 10−7) were removed; 9,115 subjects and 500,527 SNPs passed quality control. Genotypes were phased using ShapeIT and imputed against the Haplotype Reference Consortium panel on the Michigan imputation server. After imputation, we further removed low-frequency SNPs (MAF < 1%). Further details of the quality control and imputation of ALSPAC are provided here: https://proposals.epi.bristol.ac.uk/alspac_omics_data_catalogue.html#org89bb79b. Genome-wide genotype data were generated by the Sample Logistics and Genotyping facilities at the Wellcome Sanger Institute and LabCorp (Laboratory Corporation of America) using support from 23andMe.

Genetic quality control for MCS

We also obtained quality-controlled and imputed data from MCS. In brief, MCS samples were genotyped using the Illumina Global Screening Array97. Some individuals were excluded owing to sex mismatches, excess heterozygosity and missingness greater than 2%. We identified European samples using the GenoPred pipeline98 (https://github.com/opain/GenoPred). SNPs with low frequency (MAF < 1%), poor genotyping (call rate < 97%) and deviations from Hardy–Weinberg equilibrium (P < 1 × 10−6) were removed. Imputation was conducted using Minimac4 (ref. 74) with the TOPMed reference panel73 on the Michigan imputation server74. After imputation, SNPs with an imputation R2 INFO score of less than 0.8, with more than 3% missingness or with MAF < 1% were excluded. Further details are available here: https://cls-genetics.github.io/docs/MCS.html.
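The pre-imputation SNP filters for the two cohorts differ only in their thresholds, which can be expressed as a single parameterized mask. This is a minimal sketch assuming per-SNP summary statistics (MAF, call rate, Hardy–Weinberg P value) have already been computed; the function name and the toy input values are hypothetical and this is not the pipeline used by either cohort.

```python
import numpy as np


def snp_qc_mask(maf, call_rate, hwe_p,
                min_maf=0.01, min_call_rate=0.95, min_hwe_p=5e-7):
    """Return a boolean array that is True for SNPs passing all three filters.

    Defaults follow the ALSPAC thresholds stated in the text.
    """
    maf = np.asarray(maf, dtype=float)
    call_rate = np.asarray(call_rate, dtype=float)
    hwe_p = np.asarray(hwe_p, dtype=float)
    return (maf >= min_maf) & (call_rate >= min_call_rate) & (hwe_p >= min_hwe_p)


# Toy per-SNP statistics (hypothetical).
maf = [0.005, 0.20, 0.30, 0.40]
call_rate = [0.99, 0.96, 0.99, 0.99]
hwe_p = [0.5, 0.5, 1e-8, 0.5]

keep_alspac = snp_qc_mask(maf, call_rate, hwe_p)
# MCS thresholds from the text: call rate >= 97%, HWE P >= 1e-6.
keep_mcs = snp_qc_mask(maf, call_rate, hwe_p,
                       min_call_rate=0.97, min_hwe_p=1e-6)
```

In the toy data, the second SNP passes the ALSPAC call-rate threshold but fails the stricter MCS one.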

PGS association with SDQ

PGSs for both ALSPAC and MCS were calculated for individuals of genetically inferred European ancestries. Genetic PCs were calculated for both cohorts using PC-AiR, as described earlier. We calculated PGS for iPSYCHbefore11 and iPSYCHafter10 and used these in all analyses in the MCS for consistency with the analyses in SPARK.

We obtained scores on the SDQ total and subscales for six ages in the MCS and five ages in ALSPAC. We ran cross-sectional analysis at each age using multiple linear regression with PGS for iPSYCHbefore11 and iPSYCHafter10, with sex, age and the first ten genetic PCs as covariates. We also ran multiple linear mixed effects regression using the lme4 (v.1.1.27.1) package in R99, fitting a PGS by age interaction term to investigate whether the effects of PGS on SDQ change over time.
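The cross-sectional step can be sketched as an ordinary least-squares fit at each age, reading off the PGS coefficient. The example below uses simulated data only; the function name, sample size, effect sizes and ages are hypothetical. It does not reproduce the mixed-effects model with a PGS by age interaction, which was fitted with lme4 in R.

```python
import numpy as np


def pgs_effect(outcome, pgs, covariates):
    """OLS coefficient of the PGS, adjusting for the supplied covariates."""
    n = len(outcome)
    design = np.column_stack([np.ones(n), pgs, covariates])
    coef, *_ = np.linalg.lstsq(design, outcome, rcond=None)
    return coef[1]  # column 0 is the intercept, column 1 the PGS


rng = np.random.default_rng(0)
n = 5000
pgs = rng.standard_normal(n)           # standardized polygenic score
sex = rng.integers(0, 2, n)            # binary sex indicator
pcs = rng.standard_normal((n, 10))     # first ten genetic PCs

# Simulated (hypothetical) true PGS effects at two assessment ages.
true_effects = {5: 0.10, 14: 0.30}
estimates = {}
for age, beta in true_effects.items():
    age_at_visit = age + rng.uniform(-0.5, 0.5, n)
    sdq = beta * pgs + 0.2 * sex + rng.standard_normal(n)
    covariates = np.column_stack([sex, age_at_visit, pcs])
    estimates[age] = pgs_effect(sdq, pgs, covariates)
```

Comparing `estimates` across ages gives an informal view of whether the PGS association changes over development; the interaction term in the mixed model tests this formally while accounting for repeated measures.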

To investigate whether the differences in association between MCS and ALSPAC were due to differences in ascertainment between the two cohorts, we matched ALSPAC to MCS using entropy balancing100 and re-ran the PGS association analyses. Entropy balancing is a reweighting technique that matches specified moments of the covariate distributions between groups. This method uses an optimization algorithm to assign weights to individuals such that the weighted averages of the covariates in ALSPAC (the larger genotyped cohort) match those of MCS (the smaller genotyped cohort), reducing confounding and increasing comparability. We used the child’s biological sex, maternal age at delivery and maternal highest educational qualification at first data collection in each cohort as matching factors. Entropy balancing was done using the ebal (v.0.1-8) package in R101.
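The reweighting idea can be sketched in a few lines: weights take an exponential-tilting form and are chosen so that the weighted covariate means of the reweighted cohort equal the target cohort's means. This is a hand-written illustration of first-moment entropy balancing with uniform base weights and plain Newton steps, not the ebal implementation; the function names and simulated data are hypothetical.

```python
import numpy as np


def _tilt(Z, lam):
    """Normalized exponential-tilting weights for centred covariates Z."""
    s = Z @ lam
    s = s - s.max()  # stabilize the exponentials
    w = np.exp(s)
    return w / w.sum()


def entropy_balance(X, target_means, max_iter=100, tol=1e-10):
    """Weights (summing to 1) whose weighted means of X equal target_means.

    Minimizes log(sum(exp(Z @ lam))) over lam by Newton's method, where
    Z = X - target_means; the gradient is the weighted mean of Z.
    """
    Z = np.asarray(X, dtype=float) - np.asarray(target_means, dtype=float)
    lam = np.zeros(Z.shape[1])
    for _ in range(max_iter):
        w = _tilt(Z, lam)
        grad = w @ Z
        if np.max(np.abs(grad)) < tol:
            break
        Zc = Z - grad
        hess = (Zc * w[:, None]).T @ Zc  # weighted covariance of Z
        lam = lam - np.linalg.solve(hess, grad)
    return _tilt(Z, lam)


# Simulated covariates for the cohort being reweighted (hypothetical).
rng = np.random.default_rng(1)
X_source = rng.standard_normal((2000, 3))
target = np.array([0.3, -0.2, 0.1])  # hypothetical target-cohort means
weights = entropy_balance(X_source, target)
balanced_means = weights @ X_source
```

The resulting weights can then be passed to a weighted regression so that the reweighted cohort resembles the target cohort on the matching factors.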

PGS association with communication skills and autism diagnosis

For ALSPAC, we obtained understanding of simple phrases (such as “Do you want that?” or “Come here”) and gesture scores from the MacArthur-Bates Communicative Development Inventories102 at 15 months of age (Supplementary Note 9). We conducted multiple linear regression using PGS for iPSYCHbefore11 and iPSYCHafter10, with sex, age and the first ten genetic PCs as covariates.

Autism diagnosis for the MCS was obtained using parent/carer reports of autism/Asperger’s syndrome diagnosis by a doctor at ages 5, 7, 11 and 14. We identified individuals with an autism diagnosis at age 7 or earlier, age 11 or earlier, or between ages 11 and 14. We conducted Firth’s bias-reduced multiple logistic regression (logistf v.1.26.0 package in R) using PGS for iPSYCHbefore11 and iPSYCHafter10, with sex, age and the first ten genetic PCs as covariates.
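Firth's correction is useful here because diagnoses are rare, and standard maximum likelihood can give biased or infinite estimates with few events. The sketch below illustrates the modified-score iteration on a toy data set with quasi-complete separation. It is a hand-written illustration, not the logistf implementation: it omits step-halving and profile-likelihood confidence intervals, and the function name and data are hypothetical.

```python
import numpy as np


def firth_logistic(X, y, max_iter=50, tol=1e-8):
    """Firth bias-reduced logistic regression; returns [intercept, slopes...]."""
    y = np.asarray(y, dtype=float)
    X = np.column_stack([np.ones(len(y)), np.asarray(X, dtype=float)])
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        W = p * (1.0 - p)
        info_inv = np.linalg.inv(X.T @ (X * W[:, None]))
        # Diagonal of the hat matrix W^(1/2) X (X'WX)^(-1) X' W^(1/2).
        h = W * np.einsum("ij,jk,ik->i", X, info_inv, X)
        # Firth-modified score: adds h * (0.5 - p) to the usual residual.
        score = X.T @ (y - p + h * (0.5 - p))
        step = info_inv @ score
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


# Toy data: 2 of 10 events when x = 0, 10 of 10 events when x = 1.
x = np.array([0] * 10 + [1] * 10)
y = np.array([1, 1] + [0] * 8 + [1] * 10)
beta = firth_logistic(x, y)
```

Ordinary logistic regression has no finite slope estimate for these data, whereas the Firth estimate is finite. For a single binary predictor it coincides with adding 0.5 to each cell of the 2 × 2 table.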

Ethics

Ethical approval for individual cohorts was obtained independently of the current study. Ethical approval for ALSPAC was obtained from the ALSPAC Ethics and Law Committee and the local research ethics committees. Ethical approval for each sweep of MCS was obtained from NHS research ethics committees (MREC). Ethical approval for LSAC was obtained from the Australian Institute of Family Studies Human Research Ethics Committee. Ethical approval for GUI was obtained from a dedicated research ethics committee set up by the Department of Children, Equality, Disability, Integration and Youth. Ethical approval for SPARK was obtained from the Western Institutional Review Board Copernicus Group (IRB protocol 20151664). The Danish Scientific Ethics Committee, the Danish Health Data Authority, the Danish data protection agency and the Danish Neonatal Screening Biobank Steering Committee approved the iPSYCH study. Ethical approval for the analyses of de-identified data used in this study was obtained from the Cambridge Human Biology Research Ethics Committee (HBREC.2020.07).

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.