Introduction

Structural transformation is a central concept in economic development. It refers to the shift from agrarian-based economies to more diversified industrial and service-oriented systems1. While this process has been extensively studied in high-income countries, where data is comprehensive and reliable, comparative research in low- and middle-income countries (LMICs) remains limited due to data constraints2. The scarcity and unreliability of data in these contexts make in-depth analysis difficult3,4. Traditional models and methods often fail to address these limitations effectively3,5, which restricts their ability to capture structural changes accurately.

LMICs, and particularly Sub-Saharan African economies, present a critical and under-explored testing ground for structural transformation analysis. These countries face persistent challenges such as high informality6, weak statistical capacity7, and sectoral volatility8, all of which limit the reliability of conventional growth diagnostics. At the same time, policy imperatives in LMICs are closely tied to understanding sectoral transitions9, given their implications for employment, industrial policy, and public investment. Yet, most available analytical tools are either ill-suited to sparse datasets or built on assumptions that do not reflect the structural complexity of LMIC economies8,10.

Recent efforts, such as those by McMillan et al.2, have drawn attention to the shortcomings of conventional econometric models when applied to sparse and noisy datasets. While these studies motivated the need for better tools, they stopped short of offering a generalizable methodology capable of integrating imputation, uncertainty modeling, and dimensionality reduction into a single framework. The current paper addresses this methodological gap by proposing and validating an explicitly unified statistical architecture that merges Bayesian hierarchical modeling, machine learning-based imputation, and factor analysis11. These components are not treated as interchangeable tools but are operationally interlinked. That is, imputation enhances data quality12, factor analysis identifies latent structures that inform priors, and Bayesian modeling produces posterior estimates that reflect both local structure and global uncertainty.

This integrated structure represents a departure from earlier sparse-data toolkits, which typically apply individual techniques in isolation or limit their use to sector-specific settings. In contrast, the proposed framework is designed to operate across countries and sectors simultaneously, and it is validated through simulations, error diagnostics, and out-of-sample checks. Furthermore, we benchmark the framework against conventional econometric baselines to demonstrate its empirical and inferential advantages.

This study addresses a critical question: how can structural transformation be reliably measured when data is incomplete or unreliable? Our objective is to develop a statistical framework specifically adapted to the realities of LMICs. By combining modern statistical learning and Bayesian methods, the framework responds to the urgent need for tools that remain robust under missing data, noise, and high dimensionality. Many policy decisions in LMICs rely on projections of economic growth, sectoral productivity, and employment shifts. When based on inadequate data, these projections can mislead policymakers13. A key contribution of this paper is the demonstration that context-aware imputation, coupled with model-based inference, can significantly improve projection accuracy and reduce uncertainty.

To summarize, this paper proposes and empirically validates a coherent, scalable framework that brings together methods often used separately. The novelty lies in the operational interdependence of these components, tested under realistic data conditions, and their application to LMICs where structural transformation analysis is typically weakest11,14. This architecture provides a statistically grounded tool for sectoral forecasting and structural diagnostics in development contexts.

Structural transformation in Sub-Saharan Africa: existing literature and methodological challenges

Structural transformation in low- and middle-income countries, particularly in Sub-Saharan Africa (SSA), has attracted growing scholarly attention8 due to the region’s persistent development constraints and shifting growth patterns15. Much of the literature aims to understand how economies transition from agriculture toward industry and services, yet these studies have encountered major methodological challenges, chiefly the pervasive issue of sparse, inconsistent, and unreliable data.

Early empirical efforts in this area, such as McMillan et al.2, demonstrated how sectoral reallocation contributed to productivity growth. However, they also exposed systematic data gaps, especially concerning the informal sector, which often led to overestimation of structural change16,17. Other econometric approaches have relied on national accounts or time-series models to examine sectoral productivity and labor dynamics18, but these tools depend heavily on data completeness and linear assumptions, conditions rarely met in SSA.

To overcome these constraints, recent studies have explored imputation and machine learning techniques. For instance, researchers have used supervised learning methods to fill in missing data for agriculture and industry1,19, achieving promising results. Nevertheless, these models are frequently narrow in application, designed for single-country or single-sector analysis, and do not generalize well across contexts. Separately, Bayesian hierarchical models have been applied to incorporate uncertainty and regional heterogeneity20, particularly in agricultural productivity studies21, but have not been systematically combined with other advanced techniques.

At the same time, scholars have noted that structural transformation in Africa often departs from classical growth pathways22,23. Rather than following a sequential shift from agriculture to industry and then services, many economies are skipping industrialization altogether and moving directly into services11,24. This non-linear trajectory presents additional modeling difficulties, particularly for traditional econometric models that assume stylized growth patterns.

Recent work has called for more comprehensive and adaptive methods. Kweka et al.25 emphasized the need for better handling of missing data to assess labor reallocation, while others have highlighted the limitations of satellite and survey-based data collection due to cost and accessibility26,27. Factor analysis has begun to be used as a dimensionality reduction technique to uncover latent structures in economic indicators28, but this too has largely remained exploratory and context-specific.

Despite these advances, no existing study has proposed an integrated framework that combines machine learning-based imputation, Bayesian inference, and latent structure extraction into a coherent, validated approach applicable across countries and sectors. This paper fills that methodological and empirical gap by offering a generalizable framework that not only imputes missing data, but also incorporates hierarchical dependencies and identifies underlying economic forces through factor analysis. The framework is tested under simulated and real data conditions and benchmarked against conventional econometric approaches.

Therefore, while the literature has made substantial strides in highlighting the limitations of data-scarce environments and proposing partial remedies, there remains a pressing need for unified, scalable methodologies capable of supporting robust structural transformation analysis in SSA. This study responds directly to that call.

Methods

To address the challenge of analyzing structural transformation in data-constrained low- and middle-income countries (LMICs), we propose a modular, data-efficient statistical framework. The architecture integrates three key components: Bayesian Hierarchical Modeling, Machine Learning-based Imputation, and Factor Analysis. Each method contributes a distinct advantage: handling uncertainty, recovering missing data, and reducing dimensionality, respectively. Together, they offer a robust and interpretable approach for examining sectoral shifts, productivity patterns, and macroeconomic transformation.

Bayesian hierarchical modeling

Bayesian Hierarchical Modeling (BHM) provides a probabilistic framework to estimate sectoral outcomes while accounting for regional, sectoral, and data-driven heterogeneity. This is particularly important in LMIC contexts where datasets are sparse and highly variable across space and time29.

Let the structural transformation outcome for sector \(i\) in region \(j\) be denoted as

$$Y_{ij} \sim \text {Normal}(\mu _{ij}, \sigma ^2),$$

with the mean structure defined by:

$$\mu _{ij} = \alpha _j + \beta _i X_{ij},$$

where \(X_{ij}\) includes observed covariates such as labor allocation, capital formation, and sectoral output shares. The parameters \(\alpha _j\) and \(\beta _i\) capture region-specific intercepts and sectoral slopes, respectively.

Priors are defined as

$$\alpha _j \sim \text {Normal}(\mu _\alpha , \tau _\alpha ^2), \quad \beta _i \sim \text {Normal}(\mu _\beta , \tau _\beta ^2),$$

under weak, moderate, and strong variance configurations. Posterior estimation is performed via Markov Chain Monte Carlo (MCMC).
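For concreteness, a minimal sketch of this specification in PyMC is given below. The DataFrame df and the column names region, sector, X, and Y are placeholders, and the hyperprior scales are illustrative rather than the configurations used in the reported runs.

    import pandas as pd
    import pymc as pm

    # df: long-format panel with columns "region", "sector", "X" (covariate), "Y" (outcome)
    region_idx, regions = pd.factorize(df["region"])
    sector_idx, sectors = pd.factorize(df["sector"])

    with pm.Model() as bhm:
        # Hyperpriors on the region intercepts and sector slopes (scales illustrative)
        mu_alpha = pm.Normal("mu_alpha", 0.0, 1.0)
        tau_alpha = pm.HalfNormal("tau_alpha", 1.0)
        mu_beta = pm.Normal("mu_beta", 0.0, 1.0)
        tau_beta = pm.HalfNormal("tau_beta", 1.0)

        # Region-specific intercepts alpha_j and sector-specific slopes beta_i
        alpha = pm.Normal("alpha", mu_alpha, tau_alpha, shape=len(regions))
        beta = pm.Normal("beta", mu_beta, tau_beta, shape=len(sectors))
        sigma = pm.HalfNormal("sigma", 1.0)

        # Likelihood: Y_ij ~ Normal(alpha_j + beta_i * X_ij, sigma^2)
        mu = alpha[region_idx] + beta[sector_idx] * df["X"].values
        pm.Normal("Y_obs", mu=mu, sigma=sigma, observed=df["Y"].values)

        idata = pm.sample(2000, tune=1000, target_accept=0.9)  # MCMC via NUTS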

While theoretically well-suited for hierarchically structured and noisy economic data, empirical application showed that posterior means for both \(\alpha _j\) and \(\beta _i\) collapsed to zero across all prior strengths. Moreover, the resulting RMSE exceeded \(1.48 \times 10^{11}\), suggesting non-identifiability, poor data scaling, or weak signal-to-noise ratio. These results indicate that the model failed to learn meaningful structure, not because it is ill-conceived, but because of practical limitations in data quality. We therefore interpret these outcomes as diagnostic. That is, Bayesian models require proper normalization and domain-specific priors to stabilize inference in sparse environments. Nevertheless, BHM remains conceptually integral to the framework. With better-scaled inputs and enriched priors, its ability to capture partial pooling and posterior uncertainty can significantly improve structural trend estimation.

Machine learning-based data imputation

Incomplete data is one of the major bottlenecks in studying structural change in LMICs. To address this, we implement machine learning-based imputation techniques capable of reconstructing missing values without assuming linearity or parametric distributional forms.

Given a dataset \(X = \{X_1, X_2, \ldots , X_n\}\) with missing entries, for each variable \(X_i\), we estimate its missing values as

$$\hat{X}_i = f(X_{-i}; \theta ),$$

where \(f\) is a machine learning model (e.g., Random Forest, Gradient Boosting Machine) parameterized by \(\theta\). The objective is to minimize the empirical loss:

$$\mathscr {L}(\theta ) = \frac{1}{N} \sum _{j=1}^{N} (X_{ij} - \hat{X}_{ij})^2.$$

This approach is nonparametric, data-adaptive, and robust to complex multivariate interactions. Cross-validation ensures generalizability and guards against overfitting. Imputed values are used as inputs for subsequent modeling steps.
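As an illustration, the sketch below fits this column-wise learner with scikit-learn's IterativeImputer and a random forest; the array X and the missingness rate are synthetic placeholders rather than the study data.

    import numpy as np
    from sklearn.experimental import enable_iterative_imputer  # noqa: F401
    from sklearn.impute import IterativeImputer
    from sklearn.ensemble import RandomForestRegressor

    # Synthetic stand-in: indicator matrix with roughly 40% of entries missing
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    X[rng.random(X.shape) < 0.4] = np.nan

    # Each column X_i is regressed on the remaining columns X_{-i} with a random
    # forest, cycling over columns until the imputed values stabilise
    imputer = IterativeImputer(
        estimator=RandomForestRegressor(n_estimators=200, random_state=0),
        max_iter=10,
        random_state=0,
    )
    X_imputed = imputer.fit_transform(X)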

GDP, being subject to high macro-volatility, exhibited greater imputation error than sectoral shares. Unlike agriculture or services, which evolve gradually, GDP reacts sharply to policy changes8,9, external shocks30, and currency fluctuations31. This makes it difficult for global-pattern models like SoftImpute to capture such short-term volatility.

To mitigate this, we combine SoftImpute with k-Nearest Neighbors (kNN) in a hybrid strategy. SoftImpute captures latent global structures, while kNN excels at preserving local anomalies. Together, they offer balanced performance across both stable and volatile indicators.
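A minimal sketch of this hybrid strategy follows, assuming a panel DataFrame df with the illustrative columns agriculture, industry, services, and gdp; SoftImpute is taken from the fancyimpute package (any SoftImpute implementation would serve) and kNN from scikit-learn's KNNImputer.

    from fancyimpute import SoftImpute   # assumed available; column names are placeholders
    from sklearn.impute import KNNImputer

    sector_cols = ["agriculture", "industry", "services"]

    # SoftImpute recovers the smooth, low-rank structure of the sectoral shares
    df[sector_cols] = SoftImpute().fit_transform(df[sector_cols].values)

    # kNN then fills GDP from locally similar country-year profiles,
    # using the completed sectoral shares as features
    cols = ["gdp"] + sector_cols
    df[cols] = KNNImputer(n_neighbors=5).fit_transform(df[cols])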

Alternative imputation techniques and justification

We evaluated several alternative imputation methods:

  • Mean/Median Imputation is computationally cheap but ignores inter-variable relationships and introduces bias.

  • kNN captures local structure but suffers under high sparsity or dimensionality.

  • Expectation-Maximization (EM) requires strong distributional assumptions, making it fragile under model misspecification.

  • Matrix Factorization (e.g., SoftImpute) leverages low-rank structures, but may overly smooth data and assume linearity.

Machine learning methods outperform these by adapting to nonlinear dependencies without strict assumptions. They are scalable, flexible, and more robust to real-world missingness patterns, making them well-suited for LMIC economic data.

Factor analysis

To reduce dimensionality and uncover latent drivers of structural transformation, we apply factor analysis. This statistical technique identifies unobserved common factors that explain co-movements among economic indicators.

Let \(Y = \{Y_1, Y_2, \ldots , Y_p\}\) be the observed variables. We model:

$$Y = \Lambda F + \varepsilon ,$$

where \(F = \{F_1, F_2, \ldots , F_k\}\) are latent factors, \(\Lambda\) is the factor loading matrix, and \(\varepsilon \sim \text {Normal}(0, \Psi )\) is the idiosyncratic noise. Latent factors follow:

$$F \sim \text {Normal}(0, I_k),$$

and parameters are estimated via maximum likelihood or Bayesian procedures.

Unlike Principal Component Analysis (PCA), which maximizes explained variance, factor analysis distinguishes between shared (systematic) and unique (random) variance. This makes it more suitable for economic interpretation, e.g., isolating sustained transformation from temporary fluctuations.
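For reference, a sketch of the estimation step using scikit-learn's FactorAnalysis (maximum likelihood, with an optional varimax rotation) is shown below; X denotes the imputed indicator matrix, and the choice of six factors mirrors the interpretation in the next subsection.

    from sklearn.decomposition import FactorAnalysis
    from sklearn.preprocessing import StandardScaler

    # X: imputed indicator matrix (rows: country-years, columns: indicators)
    X_std = StandardScaler().fit_transform(X)

    fa = FactorAnalysis(n_components=6, rotation="varimax", random_state=0)
    scores = fa.fit_transform(X_std)      # F: factor scores for each observation
    loadings = fa.components_.T           # Lambda: indicator-by-factor loadings
    uniquenesses = fa.noise_variance_     # diag(Psi): idiosyncratic variances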

Factor interpretation in structural transformation

Factor analysis yields interpretable dimensions aligned with core development theories:

  • Factor 1: Labor Reallocation. Captures the shift from agriculture to industry/services (Kuznets’ hypothesis).

  • Factor 2: Capital Deepening. Reflects industrial productivity and investment intensity (Lewis-Ranis-Fei model).

  • Factor 3: Services Expansion. Aligns with Baumol’s structural change theory on rising service employment.

  • Factor 4: Technological Progress. Consistent with endogenous growth models emphasizing innovation and scale.

  • Factor 5: Urbanization Effects. Mirrors spatial transformation and infrastructure development theories.

  • Factor 6: Trade Openness. Tracks global integration effects (Heckscher-Ohlin and open economy frameworks).

These latent factors anchor the framework in empirical regularities observed in economic development literature, offering a more rigorous and theory-aligned interpretation of structural change.

Factor scores are then passed as inputs into the Bayesian model and LASSO regressions, helping to isolate meaningful signals while mitigating noise. This integration ensures dimensional efficiency without sacrificing explanatory relevance.

Building on the components introduced in the Methods section, this framework provides a unified solution for analyzing structural transformation in data-constrained environments. It addresses three core challenges: data incompleteness, high dimensionality, and latent heterogeneity. By integrating low-rank matrix completion, machine learning-based imputation, Bayesian hierarchical modeling, and sparse econometric estimation, the framework enables reliable structural inference and robust economic forecasting, particularly in LMICs where standard approaches often fail.

Framework architecture and core components

The analytical pipeline comprises three key layers:

  (a) Data Structuring and Preprocessing: The raw panel data, consisting of country-sector-time combinations, is first organized into a matrix \(X \in \mathbb {R}^{N \times T}\), where \(N\) indexes country-sector units and \(T\) is the number of time periods:

    $$X = \begin{bmatrix} x_{11} & x_{12} & \dots & x_{1T} \\ x_{21} & x_{22} & \dots & x_{2T} \\ \vdots & \vdots & \ddots & \vdots \\ x_{N1} & x_{N2} & \dots & x_{NT} \end{bmatrix}$$

    To address missing entries, low-rank matrix factorization is applied:

    $$X = UV^\top + E,$$

    where \(U \in \mathbb {R}^{N \times r}\) and \(V \in \mathbb {R}^{T \times r}\) are latent factor matrices, and \(E\) captures noise. The objective is:

    $$\min _{U, V} \Vert UV^\top - X \Vert _F^2 + \lambda \left( \Vert U \Vert _* + \Vert V \Vert _* \right) ,$$

    where \(\Vert \cdot \Vert _F\) is the Frobenius norm, \(\Vert \cdot \Vert _*\) the nuclear norm, and \(\lambda\) controls model complexity.

    In parallel, machine learning-based imputation (e.g., SoftImpute, kNN) reconstructs missing values:

    $$\hat{X}_i = f(X_{-i}; \theta ),$$

    where \(f\) is a non-parametric learner (e.g., random forest or gradient boosting), trained to minimize:

    $$\mathscr {L}(\theta ) = \frac{1}{N} \sum _{j=1}^{N} (X_{ij} - \hat{X}_{ij})^2.$$

    The combined use of low-rank approximation and local learning ensures that both global structure and localized patterns are preserved.

  (b) Hierarchical and Factor-Based Modeling: Once imputed, the complete dataset is subjected to two inferential layers. First, factor analysis reduces dimensionality by identifying latent economic forces:

    $$Y = \Lambda F + \varepsilon , \quad \varepsilon \sim \mathscr {N}(0, \Psi ), \quad F \sim \mathscr {N}(0, I_k),$$

    where \(Y \in \mathbb {R}^{p}\) is the observed data vector, \(F \in \mathbb {R}^k\) are latent factors, and \(\Lambda \in \mathbb {R}^{p \times k}\) are loadings. These extracted factors guide downstream modeling by summarizing core transformation dynamics.

    Next, a Bayesian hierarchical model is specified to capture country-sector-time dependencies:

    $$y_{it} \sim \mathscr {N}(\mu _{it}, \sigma ^2), \quad \mu _{it} = \beta _0 + \beta _1 X_{it} + \gamma _i + \delta _t,$$

    where \(\gamma _i \sim \mathscr {N}(0, \tau ^2)\) are group-specific random effects and \(\delta _t \sim \mathscr {N}(0, \xi ^2)\) account for common time shocks. Inference is conducted via MCMC, generating posterior distributions for all model parameters.

  (c) Sparse Econometric Estimation (LASSO): Finally, a LASSO regression is used to select key covariates that drive structural transformation (a minimal end-to-end sketch of layers (b) and (c) follows this list):

    $$\min _{\beta } \sum _{i=1}^{N} \sum _{t=1}^{T} \left( y_{it} - \hat{\mu }_{it} \right) ^2 + \lambda \sum _{k=1}^{p} |\beta _k|,$$

    where \(\lambda\) tunes the sparsity constraint. This step ensures interpretability and guards against overfitting in high-dimensional, partially imputed datasets.
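The sketch below strings layers (b) and (c) together, assuming an imputed covariate matrix X_complete from layer (a) and an outcome vector y (both placeholders); LassoCV selects the penalty \(\lambda\) by cross-validation.

    import numpy as np
    from sklearn.decomposition import FactorAnalysis
    from sklearn.linear_model import LassoCV
    from sklearn.preprocessing import StandardScaler

    # X_complete: imputed covariates from layer (a); y: outcome series (placeholders)
    X_std = StandardScaler().fit_transform(X_complete)

    # Layer (b): latent factors summarising the covariates
    factors = FactorAnalysis(n_components=6, random_state=0).fit_transform(X_std)

    # Layer (c): sparse selection over covariates and factor scores
    design = np.hstack([X_std, factors])
    lasso = LassoCV(cv=5, random_state=0).fit(design, y)
    selected = np.flatnonzero(lasso.coef_)  # covariates retained by the L1 penalty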

Summary of framework components

Each layer in the framework addresses a distinct empirical challenge:

  • Imputation and Completion: Restore missing information while preserving sectoral dynamics and cross-country comparability.

  • Bayesian Modeling: Capture nested structure and uncertainty, allowing inference in sparse, grouped data with posterior quantification.

  • Factor Analysis: Extract latent economic drivers, reduce noise, and improve downstream model identifiability.

  • LASSO Selection: Identify key covariates without inflating model complexity, crucial in sparse-data environments.

Table 1 provides a comparative summary of each component’s strengths, limitations, and applications.

Table 1 Summary of Bayesian hierarchical modeling, machine learning-based imputation, and factor analysis.

Integrated flow diagram and validation

Figure 1 illustrates the pipeline, from sparse input data to actionable insights.

Fig. 1 Integrated Framework Flowchart.

Validation strategy

To evaluate performance, the framework was subjected to a multi-pronged validation process, as detailed below.

  (a) Simulations: Synthetic datasets with varying sparsity (up to 60%) and noise were generated (cf. Appendix C). Performance metrics included RMSE, MAE, and the Bayesian Information Criterion (BIC).

  (b) Sensitivity Analyses: Prior configurations, imputation ranks, and tuning parameters were varied to assess robustness.

  (c) Empirical Application: Historical data from Kenya, Nigeria, Ghana, and comparator economies were analyzed to assess real-world applicability and consistency across structural contexts.

These validation steps confirm that the framework delivers statistically sound and policy-relevant insights even under data scarcity. Its modular architecture also allows adaptation to other applications, such as labor mobility or sectoral productivity forecasting, in low-data environments.

Empirical framework and model validation

Data and sampling design

We focus our empirical analysis on four structurally diverse low- and middle-income countries (LMICs), namely, Kenya, Nigeria, Ghana, and South Africa. These countries were selected for three reasons. First, they exhibit different stages of structural transformation, ranging from agrarian to service-oriented economies. Second, they represent varying levels of data completeness, enabling robust testing of imputation methods. Third, they reflect regional and sectoral diversity within Sub-Saharan Africa, aligning with the paper’s goal of building generalizable tools for data-scarce contexts.

The dataset comprises annual values for GDP, agriculture, industry, and services (as shares of GDP) from 2000 to 2020. Data were sourced from the World Bank’s World Development Indicators to ensure quality and consistency. To simulate real-world data sparsity in LMICs, we randomly removed 43% of observed entries, mimicking typical gaps encountered in development statistics. This artificial missingness allows us to test the imputation methods’ performance under controlled but realistic conditions.
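The masking step can be reproduced in a few lines of pandas; the column names and the random seed below are illustrative.

    import numpy as np

    rng = np.random.default_rng(42)
    value_cols = ["gdp", "agriculture", "industry", "services"]

    # Randomly hide 43% of the observed entries to mimic LMIC reporting gaps
    mask = rng.random(df[value_cols].shape) < 0.43
    df_sparse = df.copy()
    df_sparse[value_cols] = df_sparse[value_cols].mask(mask)

    # Held-out originals serve as ground truth when scoring imputation RMSE
    truth = df[value_cols].where(mask)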

Data imputation and model setup

We evaluate three imputation techniques: SoftImpute, k-Nearest Neighbors (kNN), and Iterative Imputation, comparing their ability to reconstruct missing data accurately. SoftImpute uses matrix factorization to model latent structures and performs well with smooth and continuous trends. kNN estimates missing values based on local similarity, often excelling with structured macro indicators like GDP. Iterative Imputation builds chained regression models, offering flexibility but often struggling with sparsity and high variance.

Root Mean Squared Error (RMSE) is used to quantify imputation accuracy. A detailed comparison is provided in Fig. 2.

Fig. 2 Comparison of GDP RMSE for SoftImpute, k-nearest neighbours and Iterative Imputation methods.

Figure 2 shows that SoftImpute consistently outperforms the other two methods in agriculture, industry, and services, sectors characterized by seasonal variation and latent structures. However, for GDP, kNN achieves the lowest RMSE, likely due to its local-similarity advantage in modeling macro-level trends. Iterative Imputation yields middling performance, occasionally outperforming in specific sectors but generally less stable under high sparsity.

The choice of imputation method has practical implications for policymaking in LMICs:

  (i) Over- or under-estimation of agriculture may distort productivity assessments and misallocate rural investments.

  (ii) Poor GDP imputation can skew macroeconomic forecasting and debt sustainability analyses.

  (iii) Inaccurate service sector data may obscure digital and knowledge-economy trends critical for diversification strategies.

These findings underscore the need for sector-aware imputation strategies in structurally diverse and data-constrained settings.

Modeling strategy and validation

Following imputation, the dataset was standardized and prepared for modeling. We evaluate three complementary approaches:

  • Bayesian Hierarchical Modeling (BHM) to capture country- and sector-level heterogeneity, accounting for uncertainty and partial pooling.

  • Logistic Regression, used to classify GDP into tertiles for a simplified, interpretable categorization of growth phases.

  • LASSO Regression, applied to identify the most predictive sectoral indicators while imposing regularization to prevent overfitting.

Each model provides distinct insights. BHM is suited to handling sparse, multi-level data and extracting posterior distributions. Logistic regression simplifies structure-to-growth relationships. LASSO isolates key transformation drivers. By combining these tools, the framework offers a flexible but rigorous method for analyzing structural change, even in the presence of high data sparsity.
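As a concrete example of the second approach, the sketch below bins GDP into tertiles and fits a multinomial logistic classifier on the sectoral shares; the column names and cross-validation setup are illustrative.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.preprocessing import StandardScaler

    # df: imputed panel; sectoral shares predict the GDP tertile of a country-year
    X = StandardScaler().fit_transform(df[["agriculture", "industry", "services"]])
    y = pd.qcut(df["gdp"], q=3, labels=False)   # 0 = low, 1 = middle, 2 = high tertile

    clf = LogisticRegression(max_iter=1000)
    accuracy = cross_val_score(clf, X, y, cv=5).mean()  # out-of-sample accuracy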

To validate model performance and assess component reliability, we proceed with a series of empirical evaluations: RMSE comparisons, trace plots, prediction scatterplots, and residual analyses. These are presented in the sections that follow, with a focus on both interpretability and diagnostic transparency.

Fig. 3 SoftImpute’s performance on the respective sectoral data.

Figure 3 compares the original (blue) and imputed (orange) series using SoftImpute across four key indicators from 2000 to 2020. The method generally preserves long-term sectoral patterns while smoothing sharp fluctuations, particularly in years with substantial missing data. In agriculture, it captures seasonal trends but dampens volatility during extreme shifts, highlighting both its reliability and limitations when faced with prolonged data gaps. GDP reconstruction closely tracks long-run movements but underrepresents shocks, which may affect macro-level diagnostics. For industry, the model reproduces the sector’s volatility but slightly smooths higher-order fluctuations, possibly masking transient industrial dynamics. In services, the imputation aligns strongly with observed data, reinforcing the sector’s inherent stability and SoftImpute’s capacity to reconstruct trends with minimal error. Overall, while the method demonstrates robustness, combining it with uncertainty-aware techniques is crucial in high-variance settings.

Fig. 4 Hybrid imputation output with confidence intervals for the respective sectoral data.

Figure 4 illustrates the hybrid imputation output with confidence intervals (CIs), combining SoftImpute and probabilistic modeling. The close alignment between imputed and actual data, along with the CI bands, demonstrates that this hybrid strategy preserves signal while explicitly quantifying uncertainty, an essential consideration in development statistics. Models that lack uncertainty awareness may produce misleadingly precise estimates, leading to policy missteps. Integrating imputation accuracy with credible intervals enhances reliability in data-scarce environments.


The iterative log confirms that SoftImpute converges after 100 iterations with a final MAE of 0.0237 and a stabilized rank of 4. However, its total RMSE (15.17 billion) suggests only moderate accuracy, likely influenced by smoothing effects and latent structure assumptions. kNN Imputation, with the lowest RMSE (9.88 billion), demonstrates superior performance on this dataset, benefiting from its local, pattern-matching nature. In contrast, Iterative Imputation yields the highest RMSE (15.96 billion), possibly due to overfitting or its sensitivity to sparsity patterns.


Column-wise results confirm tradeoffs. SoftImpute performs best in “Industry” and “Services” (RMSE: 1.899 and 2.330, respectively), but underperforms in GDP (RMSE: 60.66 billion). kNN excels in GDP (39.50 billion) but fares worse in agriculture and industry. Iterative Imputation shows promise in “Industry” and “Services” but remains inconsistent overall. These variations suggest that no single method is universally optimal across indicators.

A note on imputation results

SoftImpute excels in modeling latent structures, ideal for relatively stable sectors like services, but struggles with high-magnitude aggregates like GDP. kNN outperforms in GDP imputation due to its localized distance-matching approach, making it more responsive to sectoral aggregates. Iterative Imputation, while statistically rigorous, is sensitive to missingness patterns and performs inconsistently across sectors. These findings suggest a hybrid strategy, one that combines SoftImpute for sector-level imputation with kNN for macro-level correction. Such complementarity improves reliability, reduces bias, and retains structure. This approach integrates well with Bayesian hierarchical modeling and factor analysis, ensuring imputed data remains usable for sectoral diagnostics and posterior inference. In LMIC contexts, where data are often noisy or incomplete, hybrid imputation offers a pragmatic solution for recovering transformation signals that are otherwise lost in sparse datasets.

Fig. 5 Comparison of RMSE values for Bayesian Hierarchical Modelling, LASSO Regression and Logistic Regression in all considered sectors.

Figure 5 compares RMSE across three models: Bayesian Hierarchical Modeling (BHM), LASSO, and logistic regression. BHM consistently outperforms across sectors, especially where interdependencies are strong. Its hierarchical structure captures both global and local effects, yielding lower RMSEs and more stable predictions. In contrast, Logistic and LASSO regressions exhibit higher RMSEs in agriculture and GDP, underscoring their limited ability to model nonlinearities and structural noise. For industry and services, BHM’s advantage lies in its capacity to model latent sectoral links and share information across units. This makes it particularly effective in sparse-data contexts where conventional models fail to learn. Despite its complexity, BHM offers interpretability, transparency, and superior diagnostics when combined with careful imputation.

Fig. 6 Scatter-plot and time series plot for GDP values.

The scatter plot in Fig. 6a confirms a strong correlation (0.93) between actual and imputed GDP, supporting the imputation method’s effectiveness. Deviations at low GDP values, however, reveal limitations in capturing informal activity or shocks. Figure 6b shows reconstructed GDP series for Ghana, Kenya, and Nigeria. Imputed values closely follow historical data, with notable deviations in Nigeria (2012–2014), possibly due to oil shocks or fiscal volatility. In Ghana and Kenya, trends remain stable, with early oscillations in Ghana suggesting transitional shifts. The ability of the imputation to capture these nuances strengthens its role in structural transformation modeling32.

Fig. 7 Productivity and error distribution for the GDP of Kenya.

Figure 7a shows the distribution of prediction errors. Most cluster near zero, though a few extreme values highlight localized over- or underestimation, often in informal or underreported sectors. These residuals indicate where imputation may bias factor analysis or Bayesian posteriors. Figure 7b demonstrates LASSO’s predictive structure. The estimated coefficients, i.e.,

$$\hat{\beta } = [0.79, -1.28, 0.00, 0.11, 2.17]$$

identify relevant predictors, shrinking irrelevant ones to zero. The low MSE (5.31) suggests a good fit for a single-country scenario, though cross-country generalizability remains limited.

Fig. 8 Posterior predictive distribution versus observed GDP.

Figure 8 reveals a key diagnostic mismatch. The observed GDP distribution is multimodal and dispersed, while the posterior predictive distribution is unimodal and narrow. This suggests that the model underrepresents volatility and may oversimplify underlying dynamics. In LMICs, such oversimplification, especially if it arises from poorly imputed inputs, risks producing biased policy recommendations. To mitigate this, future iterations should refine imputation strategies and incorporate structural priors to better model data complexity. This reinforces the paper’s core argument: data systems in development must be context-aware, uncertainty-inclusive, and methodologically transparent.
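The mismatch in Fig. 8 corresponds to a standard posterior predictive check; assuming the model context bhm and trace idata from the earlier hierarchical sketch, it can be reproduced as follows.

    import arviz as az
    import pymc as pm

    with bhm:
        # Draw replicated datasets from the fitted model
        pm.sample_posterior_predictive(idata, extend_inferencedata=True)

    # Overlay observed GDP against posterior predictive draws; a narrow, unimodal
    # predictive band against multimodal data flags under-dispersion
    az.plot_ppc(idata, var_names=["Y_obs"])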

Cross-country comparative analysis

The framework enables systematic comparative analysis of structural transformation trajectories across countries, even when data gaps differ substantially. Figure 9 illustrates the reconstructed evolution of sectoral shares for five countries over the 2000–2020 period, derived after imputation and dimensionality reduction.

Fig. 9 Comparison of structural transformation trajectories.

Despite having similar degrees of missing data, the countries exhibit notably distinct transformation patterns. In some cases, such as Countries C and E, the shift away from agriculture is more rapid and sustained, while in others, like Country A, agriculture retains a larger share of GDP16. These differences are made visible only through imputation, which recovers the underlying sectoral dynamics that were previously obscured. Table 2 presents the mean and standard deviation of sectoral GDP shares across agriculture, industry, and services for the five countries. These metrics offer a clearer statistical lens on each country’s transformation trajectory.

Table 2 Mean and standard deviation of shares for agriculture, industry, and services for the 5 countries.

These summary statistics reinforce several insights. First, agriculture consistently shows a downward trend, but the pace varies. Countries C and E demonstrate more rapid transitions, while Country A maintains a relatively high agricultural share. Second, industrial shares remain relatively stable, with modest fluctuations across all countries. Country E shows a temporary industrial surge, possibly driven by resource investments or export growth. Third, services exhibit the most consistent expansion, particularly in Countries D and E, suggesting growing integration with global value chains, finance, or ICT services. The variation in transformation pathways, despite similar imputation coverage, points to underlying socio-economic and institutional differences. These may include disparities in foreign direct investment, labor reallocation mechanisms, policy regimes, and infrastructure development. In Country D, the rise of services is especially notable and may signal a bypassing of industrialization, a pattern consistent with the “services-first” transformation observed in parts of Sub-Saharan Africa.

Quantification of growth patterns and sectoral shifts

To complement the sectoral share analysis, the framework also quantifies sectoral productivity and each sector’s contribution to economic growth. This dimension is particularly valuable in data-sparse contexts where official productivity statistics are often inconsistent or missing.

Sectoral productivity differentials

Productivity is evaluated using output per worker or per input unit across agriculture, industry, and services. As expected, services consistently lead in productivity growth, a finding echoed in other empirical studies of LMIC structural change14. Industry contributes substantially in countries transitioning from agrarian to industrial economies, although its performance varies with global commodity cycles and domestic capacity constraints.

Growth contributions from industry and services

The framework allows for the decomposition of overall growth into sectoral contributions. Country E, for example, exhibits a clear transition toward a service-led growth model, while Country C demonstrates a more balanced structure where both industry and services contribute. These patterns reflect differences in factor reallocation, investment targeting, and the absorptive capacity of the non-agricultural sectors.

Persistent productivity gaps between agriculture and other sectors

Despite progress, agriculture remains the least productive sector in all countries analyzed. This gap constrains aggregate growth in economies where agriculture still employs a large share of the labor force. Country A, for instance, shows a pronounced and persistent productivity lag in agriculture, pointing to the need for targeted investment in rural infrastructure, mechanization, and value-chain integration33. Addressing these gaps is not only essential for equity, but also for accelerating structural transformation and improving aggregate productivity11,14.

Application of the framework to historical data

We now apply our proposed framework to real-world economic data from selected low- and middle-income countries (LMICs) between 2000 and 2020. This empirical application builds on simulation-based validation and tests the framework’s robustness in handling real-world data gaps, sectoral volatility, and structural transformations. Using the pandas-datareader library, we retrieve sector-level indicators (agriculture, industry, services, and GDP) from the World Bank. We begin by presenting Bayesian hierarchical estimates of country and sector effects (Fig. 10), before examining imputation performance in this setting, where mean imputation surprisingly outperforms SoftImpute under LASSO regression (Fig. 11). To replicate realistic missingness patterns, we also simulate gaps by randomly removing observed values (Fig. 12a, b). These are then reconstructed using SoftImpute and mean imputation. The completed dataset is analyzed using our full framework to trace structural transformation trends and benchmark performance against conventional tools.
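A minimal retrieval sketch with pandas-datareader is shown below; the WDI indicator codes are assumed to map to the series named in the text, and the country list is illustrative.

    from pandas_datareader import wb

    # Assumed WDI codes for the series used in the analysis
    indicators = {
        "NY.GDP.MKTP.CD": "gdp",          # GDP (current US$)
        "NV.AGR.TOTL.ZS": "agriculture",  # agriculture, value added (% of GDP)
        "NV.IND.TOTL.ZS": "industry",     # industry, value added (% of GDP)
        "NV.SRV.TOTL.ZS": "services",     # services, value added (% of GDP)
    }

    df = (
        wb.download(indicator=list(indicators), country=["KE", "NG", "GH", "ZA"],
                    start=2000, end=2020)
          .rename(columns=indicators)
          .reset_index()
    )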

Fig. 10 Posterior densities of country effects (\(\alpha _j\)) and sector effects (\(\beta _i\)) from the BHM (MCMC). Vertical line at 0. Posterior means (moderate prior \(\mathscr {N}(0,2)\)): \(\bar{\alpha}\) ≈ 0.012, \(\bar{\beta}\) ≈ −0.006 (Table 3).

Fig. 11 SoftImpute versus mean imputation.

Fig. 12 Missing data before and after imputation.

Fig. 13 Distribution and time series of GDP.

Fig. 14 Reconstructed sectoral shares (% of GDP) and GDP (index), 2000–2020, with 95% CIs after imputation. Shaded bands mark 2008–2009 (global financial crisis) and ~2012 (commodity downturn).

Figure 12a and b reveal the extent of data sparsity and the effectiveness of imputation. Missingness was highest in industry and services, consistent with real reporting challenges in LMICs. Post-imputation, key structural relationships were preserved. Figure 13a shows that while imputed GDP distributions slightly underestimate variance, the time series in Fig. 13b closely tracks underlying trends. The imputation strategy thus balances smoothing and signal recovery effectively. Figure 11a shows that mean imputation outperforms SoftImpute in LASSO regression, yielding lower RMSE. This counterintuitive result reflects how SoftImpute’s variance retention conflicts with LASSO’s linear penalty (Fig. 11a). As confirmed by Fig. 11b, negative \(R^2\) values signal poor predictive performance, emphasizing that linear models are ill-suited to high-variance, sparse data environments. Finally, Fig. 14 presents the reconstructed sectoral and GDP time series with confidence intervals, highlighting long-run structural shifts and downturns (2008–2009, ~ 2012), providing the bridge to the forecasting analysis in Fig. 15. Figure 15 further demonstrates overfitting and underfitting patterns in GDP prediction, reinforcing the need for nonlinear or regularized alternatives.

Fig. 15 Residuals for SoftImpute LASSO predictions.

Fig. 16 Trace plot for country effect.

Fig. 17 Trace plot for sector effect.

The reconstructed structural trends across agriculture, industry, services, and GDP are shown in Fig. 14, along with 95% confidence intervals. Agriculture declines steadily, marking labor reallocation and productivity gains. Industry is volatile, reflecting susceptibility to policy shifts and external shocks. Services grow steadily, supporting the observation that many LMICs are leapfrogging industrialization. GDP mirrors these movements with a trend that is stable overall but features downturns around 2008 and 2012, likely linked to financial and commodity price crises. Figures 10, 16, and 17 show the Bayesian hierarchical model’s trace plots and posterior distributions. While trace plots demonstrate MCMC convergence, the posterior means for both country (\(\alpha _j\)) and sector (\(\beta _i\)) effects remain near zero. As discussed later in Table 3, this pattern reflects model non-identifiability rather than robust inference, likely due to insufficient signal in the covariates or data scaling challenges. Overall, the empirical results reveal that while the Bayesian component underperformed in this case, the machine learning and factor analysis modules provided meaningful reconstruction of economic trajectories. The framework’s modular design allows strengths in one component to compensate for limits in another, making it adaptable for future iterations and broader development applications. Policymakers may use these results to better estimate sectoral dynamics even where reporting gaps are significant.


We tested three prior configurations to assess how parameter estimates respond to differing assumptions: a weak prior \(\mathscr {N}(0,10)\), a moderate prior \(\mathscr {N}(0,2)\), and a strong prior \(\mathscr {N}(0,0.5)\). The posterior means for region-level effects (\(\alpha _j\)) and sector-level effects (\(\beta _i\)) were consistently close to zero across all cases. The RMSE also remained extremely large, above \(1.47 \times 10^{11}\), suggesting that the model failed to learn any substantive structure from the data. These patterns are not a sign of prior robustness but indicate poor identifiability, possible mis-scaling, or overly flat gradients during optimization. This result reinforces a central diagnostic insight: even theoretically appropriate models may underperform when applied to sparse, noisy, or weakly informative datasets. The identical behavior across prior strengths suggests that the posterior distribution is dominated by the likelihood, but in this case the likelihood surface itself may be uninformative. Prior sensitivity analysis is therefore not just a formality but a critical tool for detecting empirical limits. Figure 18 visualizes the resulting posterior distributions from each prior setup. As expected, the weak prior yields a wide, diffuse posterior, while the strong prior leads to a narrow, zero-centered curve. Although this conforms to theoretical expectations, the actual posterior shifts are minimal, further illustrating that the priors exert little influence on an otherwise flat likelihood.
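The sensitivity loop itself is straightforward; the sketch below reuses the index arrays and data from the earlier hierarchical sketch and treats the quoted prior parameters as variances, an assumption of this illustration.

    import numpy as np
    import pymc as pm

    prior_variances = {"weak": 10.0, "moderate": 2.0, "strong": 0.5}
    posterior_means = {}

    for label, var in prior_variances.items():
        sd = np.sqrt(var)  # quoted N(0, v) treated as variance v
        with pm.Model():
            alpha = pm.Normal("alpha", 0.0, sd, shape=len(regions))
            beta = pm.Normal("beta", 0.0, sd, shape=len(sectors))
            sigma = pm.HalfNormal("sigma", 1.0)
            mu = alpha[region_idx] + beta[sector_idx] * df["X"].values
            pm.Normal("Y_obs", mu=mu, sigma=sigma, observed=df["Y"].values)
            idata = pm.sample(1000, tune=1000, progressbar=False)
        post = idata.posterior
        posterior_means[label] = (float(post["alpha"].mean()), float(post["beta"].mean()))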

Fig. 18 Prior sensitivity analysis.

Taken together, these results validate the transparency and diagnostic value of our modeling process. They also underscore the importance of pairing Bayesian techniques with proper preprocessing, variable scaling, and context-specific modeling assumptions, particularly in low-information settings. Table 3 summarizes the full sensitivity analysis, integrating these posterior statistics and diagnostic results. The analysis reassesses the Bayesian component of our framework under weakly informative, moderate, and strongly informative priors for both \(\alpha _j\) (region-level intercepts) and \(\beta _i\) (sector-specific coefficients). As shown, the posterior means for both parameters remain near zero under all prior settings, while the RMSE remains extremely large.

Table 3 Sensitivity analysis summary table.

These results indicate that the Bayesian hierarchical model failed to identify meaningful regional or sectoral structure under any prior configuration. The near-zero posteriors, combined with consistently large RMSE values, suggest either data scaling problems, misspecification, or a lack of variation in the predictors \(X_{ij}\). Therefore, we do not interpret these results as robustness to prior assumptions, but rather as evidence that the model was unable to learn from the available data in its current form. This finding reinforces the need for careful scaling, model diagnostics, and possibly richer prior structures in future applications. Despite these limitations in the Bayesian module, the broader framework, including the machine learning-based imputation and factor analysis components, continues to offer meaningful predictive power. As demonstrated in Fig. 19, the full pipeline remains effective in reconstructing GDP trajectories, outperforming traditional models. This suggests that while the Bayesian layer contributed little in this implementation, the integrated framework retains utility when imputation and dimensionality reduction are well-tuned.

Fig. 19 GDP predictions using the Solow, CGE and our framework.

Figure 19 validates our framework’s effectiveness in reconstructing GDP trends in LMICs. The actual GDP exhibits high volatility, with pronounced seasonal fluctuations. Our framework closely tracks these variations, capturing both short-term shocks and long-term patterns. This represents a substantial improvement over the Solow and CGE models, which fail to capture the magnitude and frequency of structural shifts. The Solow model underpredicts GDP across the period, exhibiting a smoothed trajectory that overlooks key turning points. The CGE model, though somewhat responsive, still fails to represent high-frequency fluctuations accurately, likely due to rigid equilibrium assumptions. In contrast, our framework more closely replicates both trend and volatility, owing to the flexibility of machine learning-based imputation and the capacity of factor analysis to extract core economic signals from noisy data.

Computational trade-offs: MCMC versus variational inference

Bayesian hierarchical modeling requires posterior estimation techniques that balance accuracy with computational efficiency. This framework explores two such methods, Markov Chain Monte Carlo (MCMC) and Variational Inference (VI). MCMC offers precise posterior estimates through iterative sampling but is computationally intensive and slow for large datasets. VI approximates the posterior distribution via optimization, significantly improving runtime while sacrificing some inferential precision20. In practice, while MCMC yields more reliable uncertainty estimates, VI offers scalability, an important consideration for real-time economic diagnostics in data-constrained policy environments.
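In PyMC terms, the two estimators are essentially a one-line swap; the sketch below assumes the model context bhm from the earlier hierarchical sketch.

    import pymc as pm

    with bhm:
        # Full MCMC (NUTS): asymptotically exact posteriors, but slow on large panels
        idata_mcmc = pm.sample(2000, tune=1000)

        # Variational inference (ADVI): optimisation-based approximation, much faster,
        # at the cost of understating posterior uncertainty
        approx = pm.fit(n=30_000, method="advi")
        idata_vi = approx.sample(2000)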

Table 4 Computational performance of competing estimation techniques.

Table 4 compares competing models across accuracy and efficiency dimensions. The hybrid imputation approach delivers the lowest RMSE (2.35) at reasonable computational cost, while the Bayesian model performs well both in terms of precision (RMSE = 2.50) and interpretability. While MCMC was used in this study to explore the full posterior landscape, future implementations could explore VI for rapid deployment in applied settings such as budgeting or policy dashboards in LMICs.

Policy-relevant insights from structural transformation patterns

The framework reveals structural bottlenecks that impede transformation in LMICs. Specifically, three recurrent constraints emerge across countries:

  1. Infrastructure Deficits: Underdeveloped transport and energy systems limit industrial scale and hinder service sector expansion.

  2. Human Capital Mismatch: Inadequate education and training systems misalign labor supply with sectoral demands, reducing productivity gains.

  3. Rigid Labor Mobility: Weak retraining schemes and policy inertia restrict movement across sectors, slowing structural reallocation.

Targeted interventions in these areas, such as vocational upskilling, rural electrification, and policy reforms supporting labor reallocation, can significantly accelerate structural change and improve sectoral resilience to shocks.

Scalability and generalizability of the framework

A notable advantage of the proposed framework lies in its adaptability to varied data environments. Whether operating under 30% or 60% missingness, the framework delivers stable predictive performance and interpretable trends. Its modular design allows easy extension across sectors, time horizons, and geographies. While this study applies the framework to structural transformation, it is equally adaptable to other domains suffering from data sparsity, such as labor market transitions, poverty mapping, and regional convergence studies. This generalizability makes it a valuable tool in the growing field of sparse data econometrics.

Methodological contributions to sparse data econometrics

This study contributes a unified framework that advances how sparse economic datasets are analyzed. By integrating low-rank matrix imputation, Bayesian hierarchical modeling, and latent factor extraction, it addresses three fundamental barriers: data incompleteness, heterogeneity, and dimensionality. Unlike traditional econometric approaches that often treat missingness as a nuisance, this framework treats it as a feature to be modeled and leveraged. It formalizes how modern statistical learning methods can be combined with economic theory to reconstruct hidden structural patterns, offering a replicable approach to sparse data analysis in development contexts. Nonetheless, additional sensitivity tests, particularly on prior distribution choice and hierarchical variance structures, could further improve transparency and robustness. Future work should include simulation scenarios with 60–80% missingness to rigorously stress-test the framework’s limits.

Country-level applications and policy implications

The empirical application yields granular insights into country-specific transformation patterns, offering a springboard for tailored economic strategies as depicted below.

  1. Kenya: Sectoral data shows strong service-led growth, especially in ICT and finance. However, declining agricultural productivity suggests a need for investment in digital farming and rural logistics, particularly in arid regions like Turkana and semi-arid counties, where yield gaps persist despite mobile tech penetration.

  2. Uganda: Despite agriculture’s dominance, its productivity remains stagnant. Agro-industrial clusters in areas such as Namanve and sorghum value chains in Northern Uganda should be scaled, while eco-tourism in regions like Bwindi and Queen Elizabeth National Park offers sustainable diversification with export potential.

  3. Nigeria: Volatility in industrial output, largely tied to oil, underscores the urgency of diversifying into non-oil manufacturing and local value chains, such as textile clusters in Kano or cassava-based agro-processing in Benue, which show resilience and job absorption potential.

  4. Tanzania: Structural shifts toward mining and tourism are underway, but success hinges on investing in workforce skills through targeted TVET expansion in mineral-rich zones like Mwanza, and improving rural infrastructure to connect southern highlands producers to port corridors.

  5. Ghana: Sustained service expansion and latent demand in logistics and digital services call for targeted public-private partnerships, especially around Accra’s fintech corridor. Meanwhile, persistent agricultural inefficiencies in the Volta and Northern regions require mechanization and irrigation investments.

  6. Ethiopia: While industrial zones and urban expansion drive recent growth, rural-urban disparities and logistics bottlenecks remain pressing concerns. Bridging these gaps through improved rail access to rural zones and expanding power supply to agrarian regions can catalyze inclusive transformation.

These insights underscore the value of sector-sensitive diagnostics and highlight how imputation-informed frameworks can support smart policy allocation, even in contexts where baseline data is patchy.

Conclusion

This study presents a unified, data-efficient framework for analyzing structural transformation under data scarcity, a persistent challenge in LMICs. By integrating Bayesian hierarchical modeling, machine learning-based imputation, and factor analysis, the framework captures latent sectoral dynamics, accounts for uncertainty, and enables meaningful policy analysis even in the face of incomplete or noisy data.

Empirical validation using World Bank data (2000–2020) from six African economies shows that the framework reconstructs sectoral trajectories with high accuracy. SoftImpute performs best in agriculture (RMSE = 3.05), industry (1.89), and services (2.33), while kNN is more effective for GDP (RMSE = 39.5 billion), illustrating that no single method dominates across all domains. Structural insights, such as Kenya’s service-led shift, Nigeria’s industrial volatility, and Ghana’s diversified transformation, demonstrate the framework’s relevance and interpretive power.

The framework also highlights the limitations of conventional methods, particularly their inability to accommodate data irregularity or sectoral heterogeneity9. In contrast, this modular and scalable approach offers a credible alternative for analysts and policymakers in data-constrained environments.

Future research should further validate the framework under higher data sparsity thresholds (e.g., 60–80%) and refine hybrid imputation techniques for enhanced accuracy. The framework’s adaptability to other domains, like labor force transitions, fiscal simulations, and subnational development diagnostics, makes it a promising candidate for broader application in the field of development macroeconomics.