
Forecasting malaria incidence in the Southeast districts of Senegal using a machine learning approach

Abstract

Background

In Senegal, malaria remains endemic, with a highly heterogeneous distribution across the country. Despite the interventions implemented by the National Malaria Control Program (NMCP), the south-east of the country alone accounted for 78.5% of cases and 43.6% of deaths in 2021. This situation calls for new approaches to better understand and anticipate the dynamics of malaria. This study aims to develop a machine learning model that can predict malaria incidence in the south-eastern districts. By providing accurate and targeted information, such a model could help the NMCP optimize the deployment of interventions in south-east Senegal.

Methods

The study is based on 936 observations, covering the period from January 2016 to December 2021, and encompassing 13 health districts in the South-East regions of Senegal. These monthly data include detailed information on malaria cases as well as climatic, demographic and environmental characteristics. Data were divided into three sets: a training set, a testing set, and a validation set. We compared predictions of three machine learning models: Extreme Gradient Boosting (XGBoost), Histogram Gradient Boosting (HistGB) and Light Gradient Boosting Machine (LightGBM). The performance of the models was evaluated using performance metrics, while the Shapley values were used to analyze the impact and importance of the variables in the predictions.

Results

XGBoost was the best-performing model on all evaluation metrics. Overall, it achieved a coefficient of determination (R²) of 0.90 on the test set across all 13 health districts. In particular, it was most accurate in the districts of Kidira, Vélingara and Koumpentoum, where the predictions were very close to the observed values. The model thus provides forecasts that can serve as early warning signals to anticipate and reduce the impact of malaria epidemics.

Conclusion

This study forecasts malaria incidence in a region where existing strategies are struggling to reduce it. Accurate prediction of transmission peaks can help to better target the allocation of resources and increase the effectiveness of actions to combat the disease.


Introduction

Today, malaria remains a major public health challenge worldwide, particularly in Africa, which accounts for 94% of estimated cases and more than 50% of deaths [1]. Among the ten countries most affected by malaria on the continent, the majority are located in sub-Saharan Africa [1, 2]. In Senegal, the south-eastern regions of Kolda, Kédougou and Tambacounda (KKT) alone accounted, in 2021, for 78.5% of malaria cases and 43.6% of malaria-related deaths [3]. These areas face persistent endemicity of malaria [4] (Fig. 1), despite numerous interventions by the National Malaria Control Program (NMCP). These interventions include early diagnosis with rapid diagnostic tests, artemisinin-based combination therapies (ACT), insecticide-treated nets (ITNs), community case management (PECADOM), intermittent preventive treatment for pregnant women (IPTp), indoor residual spraying (IRS) and seasonal malaria chemoprevention (SMC). Additionally, the zone’s climatic diversity [5] and context of poverty [6] further complicate the fight against malaria. In this context, mathematical modeling is proving to be an essential tool for limiting the spread of malaria and assessing the effectiveness of interventions and prevention measures. For example, it can be used to predict the dynamics of the disease’s spread over the long term [7] and to help optimize, through data, the allocation and mobilization of limited public health resources [8].

Fig. 1

Malaria incidence trends in Senegal’s regions from 2016 to 2021. Evolution of malaria incidence in Senegal from 2016 to 2021, highlighting the regions of Kolda, Kédougou, and Tambacounda. These regions, shown in red and orange, consistently exhibit higher malaria incidence rates (exceeding 300 per 1000) compared to the rest of the country, where the incidence remains relatively low

Since the discovery of malaria, numerous mathematical models have been developed. These include the compartmental SIR (Susceptible-Infected-Recovered) model [9, 10] and its variants [11, 12], statistical models [4, 13,14,15] and, in particular, the regression models used, for example, to model and forecast malaria incidence in different countries [4, 14]. In recent decades, however, artificial intelligence-based approaches have become popular in infectious disease modeling, sometimes supplanting existing techniques. Their popularity is due to their ability to efficiently process large volumes of data in a variety of domains.

In 2019, Thakur Santosh et al. [16] used artificial neural networks to predict the abundance of malaria cases in four health centers in India. They incorporated various clinical and environmental variables, such as positive asymptomatic cases, humidity, temperature and rainfall. Model performance varied by location, with error rates ranging from 18 to 77%. In 2021, Odu Nkiruka et al. [17] developed a classification model for malaria incidence in six sub-Saharan African countries over a 28-year period. They selected relevant climate variables using Pearson correlation analysis, cleaned the data with the K-means clustering algorithm, then classified them with XGBoost. The latter performed better than the other models, revealing that non-seasonal changes in precipitation, temperature and surface radiation strongly influenced epidemics. In addition, David Harvey et al. [18] designed an alert system to predict malaria cases over a 13-week period in a health facility in Burkina Faso. Using Gaussian processes and a Random Forest regressor, their algorithm showed an accuracy of over 99% for a high alert threshold. A comparative study of several models [19], including SARIMA, Holt-Winters exponential smoothing, the harmonic model, the Seasonal and Trend decomposition using Loess (STL) model and artificial neural networks (ANNs), showed that ANNs outperformed the other models.

In this work, we propose to predict malaria incidence in the health districts of the KKT zone by carrying out a comparative study using three machine learning models (Extreme Gradient Boosting, Histogram Gradient Boosting and Light Gradient Boosting Machine).

This paper is organized as follows. Firstly, Sect. 2 describes the methodology in detail, covering the study area, data collection, models used, their training, evaluation criteria, as well as their interpretability. Next, Sect. 3 presents the results obtained and offers a discussion. Finally, Sect. 4 provides a conclusion to this work.

Methods

Study area

Our research focused on the south-east of Senegal, specifically the regions of Kolda, Kédougou and Tambacounda (Fig. 2). These are among the areas of Senegal where malaria is most endemic. The prevalence of the disease varies according to factors such as season, rainfall and prevention strategies. Senegal has seven bioclimatic zones, ranging from the Sahelian climate in the north to the Sudano-Guinean climate in the south [20]. This climatic variability is reflected in annual rainfall, which varies from 300 mm in the north to 1200 mm in the south [21]. Rainfall follows a seasonal pattern, with a clear distinction between a dry season and a rainy season. In southern Senegal, the rainy season extends from May to November, with peaks in July, August and September [3]. This seasonality shapes the malaria transmission cycle, which is characterized by low intensity during the dry season and intensified transmission with the onset of the rains [3].

Fig. 2

Study area. Map of the study areas for malaria incidence in Senegal, covering the health districts of the Kolda, Kédougou, and Tambacounda regions. The regional boundaries are indicated by distinct colors, with the study area boundary in red and the health districts marked with red crosses

In addition, malaria in Senegal continues to be unevenly distributed across the country’s different regions, a disparity which became more marked in 2021 [3], 2020 [22], 2019 [23] and 2018 [24]. The regions of Kolda, Tambacounda and Kédougou remain the most heavily impacted by the disease. Despite representing only 11.3% of the total population, these three regions registered 78.5% of malaria cases recorded in 2021 [3], down from 83.3%, 81% and 82% respectively observed in 2020 [22], 2019 [23], and 2018 [24].

In this study, we focused on the health districts located in the KKT zone (Kolda, Kédougou, Tambacounda). In the Kolda region, we examined the Kolda, Médina Yoro Foulah and Vélingara health districts. In the Kédougou region, we focused on the districts of Kédougou, Saraya and Salémata. In the Tambacounda region, we studied the districts of Bakel, Dianké Makha, Goudiry, Kidira, Koumpentoum, Makacoulibantang and Tambacounda.

Data

In this work, we collected various categories of information, encompassing elements related to climate, clinical aspects, environment, malaria control strategies (interventions and preventions), as well as demographic data.

The climate and environmental data came from the National Aeronautics and Space Administration (NASA) website [25]. We downloaded data for the following characteristics: humidity, soil moisture, precipitation, solar radiation and temperature.

Clinical parameters and control measures (interventions and preventions) were obtained from the Senegal National Malaria Control Program (NMCP). Clinical data consist of the number of confirmed malaria cases in all age categories. For intervention measures, we used the number of artemisinin-based combination therapy (ACT) treatments distributed to infants, small children, older children and adults. Preventive measures comprised Seasonal Malaria Chemoprevention (SMC) and Intermittent Preventive Treatment for Pregnant women (IPTp) coverage. We recoded the SMC variable into two values, True and False, where True indicates the periods when SMC is applied and False the periods when it is not. In addition, the incidence of malaria was calculated using data provided by the NMCP. We then shifted total confirmed malaria cases by 12 months, based on analysis of the autocorrelation graphs (Figs. 3 and 4).
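The 12-month shift of confirmed cases and the True/False recoding of SMC can be illustrated with a minimal sketch. The record layout, the function names `add_lag` and `recode_smc`, the toy case counts and the campaign months are all assumptions made for illustration; they are not taken from the study's scripts.

```python
from collections import defaultdict

def add_lag(records, lag=12):
    """Attach the confirmed-case count observed `lag` months earlier in the same district."""
    # Index cases by district and by a running month number (year * 12 + month - 1).
    cases = defaultdict(dict)
    for r in records:
        cases[r["district"]][r["year"] * 12 + r["month"] - 1] = r["cases"]
    lagged = []
    for r in records:
        previous = r["year"] * 12 + r["month"] - 1 - lag
        # None marks months for which no observation exists `lag` months back.
        lagged.append({**r, "cases_lag": cases[r["district"]].get(previous)})
    return lagged

def recode_smc(year, month, campaign_months):
    """True when an SMC campaign covers (year, month), False otherwise."""
    return (year, month) in campaign_months

# Toy data: two years of monthly counts for one district (values are made up).
records = [
    {"district": "Kolda", "year": y, "month": m, "cases": 100 * (y - 2015) + m}
    for y in (2016, 2017) for m in range(1, 13)
]
lagged = add_lag(records)
campaigns = {(2017, 8), (2017, 9)}  # hypothetical campaign months
print(lagged[0]["cases_lag"], lagged[14]["cases_lag"], recode_smc(2017, 8, campaigns))
```

Grouping by district before shifting matters: a plain shift over the stacked table would leak the last months of one district into the first months of the next.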

Fig. 3

Autocorrelation graph between confirmed cases and their lags. Subfigures (a) to (l) represent the relationship between confirmed cases and their lags, from the 1st (a) to the 12th month (l). The values in each subfigure indicate the corresponding correlation coefficient

Fig. 4

Partial autocorrelation graph between confirmed cases and their lags

For demographic data, we used population densities specific to each department of the KKT zone, taken from population statistics available from the National Agency for Statistics and Demography (ANSD in French) [26]. Because density figures were missing from some documents, we used both national and regional annual reports, as well as demographic projections, to extract the population sizes and surface areas of the relevant departments in the KKT zone. We then calculated their ratio to obtain the population density. The geographic coordinates and shapefiles used come from the Global Administrative Areas (GADM) site [27].

Models

We carried out a comparative study between three different boosting machine learning models with the aim of predicting the monthly incidence of malaria in certain districts of the KKT zone.

The first model, Extreme Gradient Boosting, is a tree-based ensemble technique that combines the predictions of several weak models to create a more robust and accurate prediction. Initially developed by Chen and Guestrin in 2016 [28], it has since been improved by other contributors, including Fang et al. in their work on COVID-19 [29]. Its objective function is defined by:

$$\mathrm{obj}^{(k)}=\sum\limits_{i=1}^{n}L\left(y_{i},\widehat{y}_{i}^{\,k-1}+f_{k}\left(x_{i}\right)\right)+\Omega\left(f_{k}\right)$$
(1)

where \(\mathrm{obj}^{(k)}\) is the objective minimized at the \(k^{th}\) iteration, \(n\) denotes the number of training data, \(x_{i}\) and \(y_{i}\) are respectively the feature vector and its label at the \(i^{th}\) instance, \(\widehat{y}_{i}^{\,k-1}\) represents the prediction for the \(i^{th}\) instance at the \((k-1)^{th}\) iteration, \(L\) is a loss function that measures the difference between the label \(y_{i}\) and the forecast, \(f_{k}\) is the new tree applied to the feature vector \(x_{i}\), and \(\Omega\left(f_{k}\right)\) is the regularization term that penalizes the complexity of the new tree.
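For reference, in the original XGBoost formulation of Chen and Guestrin [28] the regularization term is usually written in terms of the number of leaves of the tree and its leaf weights. The symbols \(T\) (number of leaves), \(w_{t}\) (weight of leaf \(t\)), \(\gamma\) and \(\lambda\) (penalty coefficients) are not used elsewhere in this paper and are introduced here only for illustration:

$$\Omega\left(f_{k}\right)=\gamma T+\frac{1}{2}\lambda\sum\limits_{t=1}^{T}w_{t}^{2}$$

Larger values of \(\gamma\) discourage additional leaves, while larger values of \(\lambda\) shrink the leaf weights towards zero; both make each new tree more conservative.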

The second model, Histogram Gradient Boosting, bins feature values into histograms to speed up the construction of decision trees [30]. Its objective function combines a loss function and a regularization term:

$$\mathrm{obj}\left(\theta\right)=\sum\limits_{i=1}^{n}L\left(y_{i},\widehat{y}_{i}\right)+\Omega\left(\theta\right)$$
(2)

The third model, the Light Gradient Boosting Machine, uses gradient-based one-side sampling when growing trees, which reduces memory usage and improves accuracy [31]. The objective function of LightGBM is similar to that of XGBoost, incorporating a loss and a regularization term to avoid overfitting.

The choice of hyperparameters is important, since it can strongly affect the results. Tables 1, 2 and 3 present the main hyperparameters of the different models. Any parameter not explicitly defined for a model keeps its default value.

Table 1 Regression parameters for the XGBoost model [32]
Table 2 Regression parameters for the HistGB model [33]
Table 3 Regression parameters for the LightGBM model [34]

Data preprocessing

The database used for this study covers the period from January 2016 to December 2021, with monthly records for the 13 health districts of the KKT area. A total of 936 observations were recorded, spread over 17 columns. There are no missing values. Malaria incidence is the target variable, while ACT, SMC, IPTp coverage, population density, humidity, soil moisture, rainfall, solar radiation, temperature, latitude and longitude serve as explanatory variables. For model training, we used data from January 2016 to December 2019. Data from 2020 were used for testing, and data from 2021 for final validation. The year 2022 was reserved for prediction, since observed data for that year were not available.
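The chronological split described above can be checked with a short sketch. The row layout and the placeholder district names are assumptions; only the year boundaries and the counts (13 districts, 72 months) come from the text.

```python
def split_by_year(rows):
    """Chronological split: 2016-2019 for training, 2020 for testing, 2021 for validation."""
    train = [r for r in rows if 2016 <= r["year"] <= 2019]
    test = [r for r in rows if r["year"] == 2020]
    validation = [r for r in rows if r["year"] == 2021]
    return train, test, validation

# One row per district and month; district names are placeholders.
districts = [f"district_{i}" for i in range(13)]
rows = [
    {"district": d, "year": y, "month": m}
    for d in districts for y in range(2016, 2022) for m in range(1, 13)
]
train, test, validation = split_by_year(rows)
print(len(rows), len(train), len(test), len(validation))  # 936 624 156 156
```

Splitting on the calendar year rather than at random keeps every test and validation month strictly later than the training months, which is what a forecasting evaluation requires.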

Model training

Fine tuning

Following data preprocessing, we modeled malaria incidence using HistGB, LightGBM and XGBoost. A Bayesian method [35] was used to tune the models’ hyperparameters, with fine-tuning performed on the training data. To ensure a fair and consistent comparison, all three models were trained and validated on the same folds using a 5-split TimeSeriesSplit cross-validation strategy. For each model, the mean RMSE over the folds was determined, along with the standard deviation, to reflect the variability of performance over time. To assess the uncertainty of the XGBoost predictions, a bootstrapping procedure was applied with 200 random samples drawn from the training set. For each iteration, a model was re-trained, and its predictions on the validation data were used to calculate a mean and a 95% confidence interval for the predictions.
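The fold structure of a 5-split time-series cross-validation, together with the mean and standard deviation of the per-fold RMSE, can be sketched in plain Python. The generator below is meant to mirror the expanding-window behaviour of scikit-learn's TimeSeriesSplit with default settings; it is a stand-in written for illustration, and the mean-predictor model and the toy series are assumptions.

```python
import math
from statistics import mean, stdev

def time_series_splits(n_samples, n_splits=5):
    """Expanding-window splits: each fold trains on all rows before its test block."""
    test_size = n_samples // (n_splits + 1)
    for i in range(n_splits):
        start = n_samples - (n_splits - i) * test_size
        yield list(range(start)), list(range(start, start + test_size))

def rmse(y_true, y_pred):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

def cross_validate(X, y, fit, predict, n_splits=5):
    """Return the mean and standard deviation of the per-fold RMSE."""
    scores = []
    for train_idx, test_idx in time_series_splits(len(y), n_splits):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = predict(model, [X[i] for i in test_idx])
        scores.append(rmse([y[i] for i in test_idx], preds))
    return mean(scores), stdev(scores)

# Placeholder model (training mean); any regressor with fit/predict could be plugged in.
fit_mean = lambda X, y: sum(y) / len(y)
predict_mean = lambda model, X: [model] * len(X)

# 48 chronologically ordered toy observations with a 12-month cycle (made-up values).
X = [[t % 12] for t in range(48)]
y = [10.0 + 5.0 * math.sin(2 * math.pi * t / 12) + 0.1 * t for t in range(48)]
folds = list(time_series_splits(len(y)))
cv_mean, cv_std = cross_validate(X, y, fit_mean, predict_mean)
```

The rows must be sorted by date before splitting; with that ordering, no fold ever trains on observations that come after its test block.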

For the HistGB model, Bayesian optimization [35] refined the essential hyperparameters, resulting in the following optimal values: a maximum depth of 7.26, a learning rate of 0.08, a number of iterations of 462.79, a maximum bin number of 196.88 and a beta regularization of 0.84. These optimized values and cross-validation via TimeSeriesSplit consolidated the model’s performance and reliability.

For LightGBM, hyperparameters were optimized, focusing on maximum depth (optimal value: 10), learning rate (optimal value: 0.3), number of estimators (optimal value: 301.59) and fraction of columns to sample (optimal value: 0.7). Bayesian optimization, followed by cross-validation with TimeSeriesSplit, yielded these results, ensuring the model’s best performance.

For XGBoost, hyperparameters were tuned through Bayesian optimization and cross-validation with TimeSeriesSplit, resulting in the following optimal values: a maximum depth of 3.66, a learning rate of 0.02, a number of estimators of 387.97, a column fraction of 0.61 and an alpha regularization of 0.03. These processes enhanced the model’s performance and robustness.
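The bootstrap procedure used for the XGBoost prediction intervals (200 resamples of the training set, one refit per resample) might look like the sketch below. A monthly-mean predictor stands in for the boosted model, the interval is taken from empirical percentiles, and the data are synthetic; all three choices are assumptions, since the text does not specify how the 95% interval was derived.

```python
import random
from statistics import mean

def fit_monthly_mean(rows):
    """Placeholder model: average incidence per calendar month, plus an overall fallback."""
    sums, counts = {}, {}
    for month, value in rows:
        sums[month] = sums.get(month, 0.0) + value
        counts[month] = counts.get(month, 0) + 1
    overall = sum(sums.values()) / sum(counts.values())
    return {m: sums[m] / counts[m] for m in sums}, overall

def predict(model, months):
    per_month, overall = model
    return [per_month.get(m, overall) for m in months]

def bootstrap_interval(train_rows, months, n_boot=200, seed=0):
    """Mean prediction and 95% percentile interval for each requested month."""
    rng = random.Random(seed)
    all_preds = []
    for _ in range(n_boot):
        sample = [rng.choice(train_rows) for _ in train_rows]  # resample with replacement
        all_preds.append(predict(fit_monthly_mean(sample), months))
    result = []
    for j in range(len(months)):
        column = sorted(p[j] for p in all_preds)
        lower = column[int(0.025 * (n_boot - 1))]
        upper = column[int(0.975 * (n_boot - 1))]
        result.append((mean(column), lower, upper))
    return result

# Synthetic training data: four years, higher incidence from August to October.
noise = random.Random(1)
train_rows = [
    (m, 20.0 + 10.0 * (m in (8, 9, 10)) + noise.uniform(-2, 2))
    for _ in range(4) for m in range(1, 13)
]
intervals = bootstrap_interval(train_rows, [1, 8])  # January and August
```

Each tuple holds the mean prediction followed by the lower and upper bounds; wider bounds indicate that the prediction is more sensitive to which training rows happen to be resampled.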

Training

This section details the training process for each of our models.

The Histogram Gradient Boosting model was configured with the best values obtained during hyperparameter optimization and fitted to the training set using the fit method. Performance was evaluated by generating progressive predictions at each iteration using the staged_predict method. The RMSE was calculated at each iteration on the training set and the test set, enabling us to follow the evolution of the errors, assess the model’s ability to generalize and detect overfitting.

The Light Gradient Boosting Machine was instantiated with the optimal hyperparameters obtained during the optimization phase. It was then fitted to the training data using the fit method. Evaluation during training was performed by passing the test set through the eval_set argument. The RMSE metric was computed at each iteration to monitor model performance and check that the fit did not drift into overfitting.

The Extreme Gradient Boosting model was also configured with the best hyperparameter values obtained after optimization. It was then trained on the training data. Real-time evaluation was carried out on the test set using the RMSE metric. Performance monitoring allowed for tracking error evolution and adjusting parameters to prevent overfitting during training.
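The per-iteration RMSE monitoring described for the three models can be illustrated with a deliberately small gradient-boosting loop. This is a plain-Python stand-in built on one-split trees (stumps) and squared-error loss, not the HistGB, LightGBM or XGBoost implementations; the function names, the learning rate and the seasonal toy data are assumptions.

```python
import math

def rmse(y_true, y_pred):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true))

def fit_stump(x, residuals):
    """Best single-threshold split on a 1-D feature under squared error."""
    best = None
    for t in sorted(set(x))[:-1]:
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def staged_rmse(x_tr, y_tr, x_te, y_te, n_iter=50, lr=0.1):
    """Train/test RMSE after each boosting iteration."""
    base = sum(y_tr) / len(y_tr)
    pred_tr, pred_te = [base] * len(y_tr), [base] * len(y_te)
    curve = []
    for _ in range(n_iter):
        residuals = [y - p for y, p in zip(y_tr, pred_tr)]
        stump = fit_stump(x_tr, residuals)  # new tree fitted to the current residuals
        pred_tr = [p + lr * stump(xi) for p, xi in zip(pred_tr, x_tr)]
        pred_te = [p + lr * stump(xi) for p, xi in zip(pred_te, x_te)]
        curve.append((rmse(y_tr, pred_tr), rmse(y_te, pred_te)))
    return curve

months = list(range(1, 13))
season = [5, 4, 4, 5, 8, 15, 30, 55, 70, 80, 40, 12]  # made-up incidence profile
x_train, y_train = months * 3, [v * s for s in (0.9, 1.0, 1.1) for v in season]
x_test, y_test = months, [v * 1.05 for v in season]
curve = staged_rmse(x_train, y_train, x_test, y_test)
# Iteration (1-based) at which the held-out RMSE is lowest.
best_iter = min(range(len(curve)), key=lambda i: curve[i][1]) + 1
```

Plotting the two columns of `curve` against the iteration number gives the usual learning curves: a training error that keeps falling while the held-out error flattens or rises is the signature of overfitting.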

Test and validation

All models used optimized hyperparameters and incremental evaluation to ensure accurate and robust predictions. The models were first trained on data from 2016 to 2019, enabling them to capture the relationship between the explanatory variables and malaria incidence. Once trained, they were evaluated by forecasting on the test data for the year 2020. This step used metrics such as MAE, RMSE and R² to assess generalizability and verify the accuracy of predictions on an independent data set. Finally, the models were applied to the 2021 validation data to make final forecasts. This validation on previously unseen data confirmed their performance under realistic conditions and supported their robustness and reliability in predicting malaria incidence.

Evaluation criteria

We evaluated the performance of our models using three popular metrics: mean absolute error (MAE), root mean square error (RMSE) and the Nash-Sutcliffe criterion (R²).

$$MAE=\frac{1}{m}\sum\limits_{i=1}^{m}\left|y_{i}-\widehat{y}_{i}\right|$$
(3)
$$RMSE=\sqrt{\frac{1}{m}\sum\limits_{i=1}^{m}\left(y_{i}-\widehat{y}_{i}\right)^{2}}$$
(4)
$$R^{2}=1-\frac{\sum\limits_{i=1}^{m}\left(y_{i}-\widehat{y}_{i}\right)^{2}}{\sum\limits_{i=1}^{m}\left(y_{i}-\overline{y}\right)^{2}}$$
(5)

where \(m\) is the number of observations in the evaluated set, \(y_{i}\) the observed incidence, \(\widehat{y}_{i}\) the predicted incidence and \(\overline{y}\) the mean of the observed incidences.
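Equations (3) to (5) translate directly into code. The sketch below is a plain-Python illustration with a three-point toy example; the function names and the toy values are assumptions.

```python
import math

def mae(y, y_hat):
    """Mean absolute error, Eq. (3)."""
    return sum(abs(a - b) for a, b in zip(y, y_hat)) / len(y)

def rmse(y, y_hat):
    """Root mean square error, Eq. (4)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y))

def r2(y, y_hat):
    """Nash-Sutcliffe criterion / coefficient of determination, Eq. (5)."""
    y_bar = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - y_bar) ** 2 for a in y)
    return 1 - ss_res / ss_tot

observed = [1.0, 2.0, 3.0]
predicted = [1.0, 2.0, 4.0]
print(mae(observed, predicted), rmse(observed, predicted), r2(observed, predicted))
```

In this toy case a single error of 1 gives an MAE of 1/3, an RMSE of about 0.577 and an R² of 0.5; the RMSE exceeds the MAE because squaring weights large errors more heavily.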

Interpretability of the model

After the modeling stage, we interpreted our model’s predictions using the Shapley value method. This mathematical approach, derived from cooperative game theory, enables us to understand both the importance and influence of each feature in the predictions made by a machine learning model [36]. The basic principle of Shapley values can be translated into the following formula. Let 𝑣𝑎𝑙 be the payoff or value function, the SHAP value for feature 𝑗 can be calculated as follows:

$$\begin{aligned}\phi_j(\text{val}) &=\sum_{S \subseteq \{1, \dots, p\} \setminus \{j\}}\frac{|S|! \, (p - |S| - 1)!}{p!}\times\left[ \text{val}(S \cup \{j\}) - \text{val}(S) \right]\end{aligned}$$
(6)

Where \(S\subseteq\left\{1,...,p\right\}\backslash\left\{j\right\}\) ranges over all possible coalitions not containing the player \(j\), \(S\) is a subgroup of players, \(p\) is the total number of players, \(j\) is our player of interest and \(x\) is the vector of feature values of the instance to be explained. \(\mathrm{val}\left(S\cup\left\{j\right\}\right)-\mathrm{val}\left(S\right)\) is the marginal contribution of the player \(j\) when it is added to the coalition \(S\). The value \(\mathrm{val}\left(S\right)\) represents the model’s prediction based on the players in the coalition \(S\):

$$\mathrm{val}_{x}\left(S\right)=\int\widehat{f}\left(x_{1},\dots,x_{p}\right)\,dP_{x\notin S}-E_{X}\left[\widehat{f}(X)\right]$$
(7)

Although the problem arises in the world of game theory, it can be translated into the machine learning paradigm by replacing the game by the prediction task, the payment by the model output and the players by the features.
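Equations (6) and (7) can be evaluated exactly when the number of features is small. The sketch below enumerates every coalition and approximates the integral in Eq. (7) by averaging over a background sample; the function name, the linear toy model and the background rows are assumptions. Libraries such as shap rely on much faster algorithms for tree models, so this brute-force version is for illustration only.

```python
from itertools import combinations
from math import factorial

def shapley_values(f, x, background):
    """Exact Shapley values of prediction f(x), marginalising absent features over `background`."""
    p = len(x)
    baseline = sum(f(b) for b in background) / len(background)  # estimate of E[f(X)]

    def val(coalition):
        # Features in the coalition keep their value from x; the others come from background rows.
        total = 0.0
        for b in background:
            total += f([x[k] if k in coalition else b[k] for k in range(p)])
        return total / len(background) - baseline

    phi = []
    for j in range(p):
        others = [k for k in range(p) if k != j]
        contribution = 0.0
        for size in range(p):
            weight = factorial(size) * factorial(p - size - 1) / factorial(p)
            for subset in combinations(others, size):
                s = set(subset)
                contribution += weight * (val(s | {j}) - val(s))
        phi.append(contribution)
    return phi, baseline

# Toy linear model and background sample (made-up values).
model = lambda z: 2 * z[0] + 3 * z[1] - z[2]
background = [[0, 0, 0], [2, 4, 6]]
x = [3, 1, 5]
phi, baseline = shapley_values(model, x, background)
```

For a linear model each Shapley value reduces to the coefficient times the distance of the feature from its background mean, and the values sum to the difference between the prediction and the average prediction (the efficiency property).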

Results and discussion

Temporal cross-validation on the training data shows that XGBoost achieved a mean RMSE of 7.59 (95% CI: 3.14–12.04), compared to 9.52 (95% CI: 6.87–12.16) for LightGBM and 9.16 (95% CI: 5.03–13.28) for HistGB. These wide confidence intervals highlight considerable variability across folds, likely due to temporal fluctuations in malaria incidence.

Figures 5 and 6 show the predicted malaria incidence on the test and validation samples, respectively. The green curve represents the observed data, while the orange, purple and violet curves represent the predictions of the HistGB, LightGBM and XGBoost models, respectively. Table 4 presents the metrics computed for the three models on the test and validation sets. Among them, XGBoost stood out for its superior performance, with the lowest RMSE and MAE values (12.74‰ and 6.38‰ on the test set, 10.58‰ and 5.95‰ on the validation set) and the highest R² values (0.90 on the test set and 0.93 on the validation set), indicating high accuracy and better capture of complex relationships. In comparison, HistGB showed higher RMSE (13.87‰ on the test set and 11.93‰ on the validation set) and MAE (6.43‰ on the test set and 6.16‰ on the validation set), as well as slightly lower R² (0.88 on the test set and 0.91 on the validation set). Similarly, LightGBM exhibited RMSE values of 13.73‰ on the test set and 12.3‰ on the validation set, MAE values of 6.57‰ on the test set and 6.96‰ on the validation set, and R² values of 0.89 on the test set and 0.91 on the validation set. These results confirm the superiority of XGBoost in minimizing errors and explaining variance, positioning it as the most effective of the three models for predicting malaria incidence.

Beyond overall performance, examining the prediction intervals provides further insight into the model’s reliability under different conditions. On the test data (Fig. 7), the 95% confidence intervals are generally narrow and stable, reflecting good reliability of the predictions, even during the transmission season. On the validation data (Fig. 8), by contrast, the intervals are often wider and more variable, particularly in high-incidence districts, reflecting greater predictive uncertainty in these contexts.

Given its overall performance, XGBoost was used to make predictions for the year 2022 (Fig. 9). We found that it effectively captured the trend of disease incidence in the KKT area, although incidence was overestimated or underestimated in some districts. Significant variations in malaria incidence were observed according to the time of year and the district. As early as June, the first signs of increased incidence were seen in several districts, including Bakel, Kédougou, Médina Yoro Foulah, Salémata and Saraya. In July, other districts such as Dianké Makha, Goudiry, Kolda, Koumpentoum, Makacoulibantang, Tambacounda and Vélingara experienced a similar increase. Incidence peaks also occurred at different times depending on the area. In October, several districts, such as Bakel, Dianké Makha, Goudiry, Kidira, Koumpentoum, Kédougou and Tambacounda, reached their peak incidence. In contrast, the districts of Kolda, Makacoulibantang, Médina Yoro Foulah and Vélingara reached their peak later, in November. Finally, in Salémata and Saraya, the peaks were recorded as early as August. With the exception of Salémata and Saraya, the results are in agreement with the studies of Diouf et al. [5, 37].

Fig. 5

Models’ prediction on test data. Forecasts of incidence made by XGBoost (violet), LightGBM (purple), and HistGB (orange) models compared to the test data (green) for the various health districts (subplots a-m) in the KKT area in 2020

Fig. 6

Models’ prediction on validation data. Forecasts of incidence made by XGBoost (violet), LightGBM (purple), and HistGB (orange) models compared to the validation data (green) for the various health districts (subplots a-m) in the KKT area in 2021

Table 4 Metrics table
Fig. 7

XGBoost predictions and confidence intervals on test data. Forecasts of malaria incidence by XGBoost (violet) with 95% confidence intervals (shaded) compared to test data (green) across KKT districts (subplots a-m) in 2020

Fig. 8

XGBoost predictions and confidence intervals on validation data. Forecasts of malaria incidence by XGBoost (violet) with 95% confidence intervals (shaded) compared to validation data (green) across KKT districts (subplots a-m) in 2021

Fig. 9

Forecast impact for 2022. Predicted incidence for the year 2022 in the various health districts (subplots a-m) of the KKT area. The plots illustrate the temporal evolution of the predicted values across all districts

Figure 10 illustrates both the importance and the influence of the XGBoost model features through SHAP values. For the ACT variable, SHAP values suggest that higher values (red dots) are generally associated with increased predicted malaria incidence in some districts, while lower values (blue dots) are more often linked to reduced predictions in most districts. This indicates a general association between higher use of ACT and lower predicted incidence, which aligns with findings from previous studies [38,39,40], though no causal inference can be drawn from these model outputs. For population density, SHAP values indicate a strong influence, as also noted by [41], but with varying effects. Data points appear on both sides of the SHAP axis, suggesting that population density may be associated with either increased or decreased predictions depending on local context. In general, low population densities (blue dots) tend to be associated with higher predicted incidence, while higher densities (red dots) often correspond to lower predicted values.

The features longitude and latitude show varying impacts across districts. Some areas exhibit patterns where higher geographic coordinates are associated with higher predicted incidence, while others show the opposite. This variation suggests that geographic location, potentially in combination with other clinical, environmental or socio-economic factors, may help identify malaria hotspots or high-risk areas [42]. High soil and air humidity (red dots) are frequently associated with increased predicted incidence in specific districts. These associations may reflect environmental conditions conducive to mosquito breeding, consistent with previous findings [37, 43, 44], though again no direct causal relationship is implied. The variable Confirmed.cases.lag12, representing confirmed malaria cases with a 12-month lag, shows that higher past values are associated with higher predicted current incidence. This may reflect persistent transmission patterns in historically affected areas.

The SHAP values for temperature indicate a more modest and variable influence. While moderate temperatures may be associated with higher predicted incidence in some KKT areas, extreme temperatures may have the opposite association, possibly due to impacts on mosquito survival, as noted in [37, 45]. SHAP values for rainfall also vary. Sufficient rainfall is necessary for the formation of breeding sites [46], but excessive rainfall can eliminate larvae by washing out the breeding sites [37, 45]. The rainy seasons in KKT are therefore critical for malaria control. For SMC, SHAP values are generally low, but higher SMC values (red dots) are typically associated with reduced predicted incidence, while lower values show the opposite. This pattern may indicate potential benefits of SMC where it is implemented effectively [47,48,49], though again causality cannot be inferred. Finally, SHAP values for solar radiation are near zero, indicating minimal direct influence in the model, though solar radiation may indirectly affect other environmental factors such as temperature and humidity [50], which in turn can influence malaria transmission risk.

Fig. 10

SHAP value - impact on model forecasts. The x-axis shows the SHAP value of each data point, which indicates the impact of a specific feature on the model’s prediction: a positive SHAP value means the feature increases the prediction for that data point, and a negative SHAP value means it decreases the prediction. Features are listed on the y-axis in order of importance, with the most important at the top of the graph. Each point corresponds to an observation (a data point) in the test set. The color of a dot represents the feature value for that observation: blue dots indicate low feature values and red dots indicate high feature values

To guide health professionals in their fight against malaria, we developed a web application (Fig. 11) based on the best-performing model, which predicts monthly malaria incidence. The application lets the user load a database containing all the variables mentioned in Sect. 2 and specify a forecast start date and a forecast horizon, i.e. the number of months to be covered by the predictions.

Fig. 11

Web application. PrevAPP predicts malaria incidence in the districts of southeastern Senegal. Based on explanatory data (see the data section in part 2) from the previous year, it generates forecasts for the following year with a 12-month horizon. The application also allows filtering forecasts by a specific month, as shown here from August 2022

Despite the promising results, our work has certain limitations. The relatively moderate size of our dataset may limit the generalizability of the results. Moreover, the climatic and clinical datasets used may themselves contain inherent biases and uncertainties. For instance, climatic data from satellite or reanalysis sources can suffer from spatial or temporal resolution issues, potentially misrepresenting local conditions. Clinical data, on the other hand, may be affected by underreporting, diagnostic errors or incomplete records, which could introduce noise or systematic biases into the models. Furthermore, the absence of variables describing complex socio-economic phenomena, such as human behavior, migration and the quality of health infrastructure, as well as the coverage of long-lasting insecticidal nets (LLINs), may limit how fully the models reflect the complexity of malaria.

Conclusion

Our study of malaria incidence prediction in southeastern Senegal highlights the value of advanced machine learning models in the fight against this disease. By comparing three ensemble models, Extreme Gradient Boosting (XGBoost), Histogram-based Gradient Boosting (HistGB) and Light Gradient Boosting Machine (LightGBM), we have shown that the XGBoost model offers the best performance for monthly malaria incidence prediction. The results show that predictions can effectively capture trends in malaria incidence, enabling us to anticipate peak periods and improve the responsiveness of health interventions. The web application built on the XGBoost model is a step forward for malaria surveillance and management in the KKT zone. It is intended to provide health authorities with a predictive tool that supports informed decision-making to reduce the incidence of malaria in the long term.

For future work, several lines of research can be envisaged. Firstly, integrating additional data, such as long-lasting insecticidal net coverage and population mobility, could improve prediction accuracy. Secondly, exploring other models, such as recurrent neural networks (RNN) or attention-based models, may better capture complex temporal dynamics. Finally, pilot trials in real-life conditions will be essential to test, adjust and validate our predictive application.

Overall, our study demonstrates the potential of artificial intelligence technologies in the fight against malaria, and paves the way for promising future developments to improve public health in the worst-affected regions.

Data availability

The malaria clinical parameters and control measures are available on request from the Senegal National Malaria Control Program (NMCP). The climate and environmental data can be obtained from https://power.larc.nasa.gov/. The demographic data are available at https://www.ansd.sn/Indicateur/donnees-de-population. The geographic coordinates and shapefiles can be downloaded from https://gadm.org/. Model scripts can be accessed from https://github.com/AYLY92/PhD/tree/main.

Abbreviations

ACT:

Artemisinin-based Combination Therapy

ANNs:

Artificial Neural Networks

GADM:

Global Administrative Areas

HistGB:

Histogram Gradient Boosting

KKT:

Kolda, Kédougou and Tambacounda

ITNs:

Insecticide-Treated Nets

IPTp:

Intermittent Preventive Treatment in pregnancy

IRS:

Indoor Residual Spraying

LightGBM:

Light Gradient Boosting Machine

LLINs:

Long-Lasting Insecticidal Nets

NASD:

National Agency for Statistics and Demography

MAE:

Mean Absolute Error

NMCP:

National Malaria Control Program

R²:

Nash-Sutcliffe criterion

RMSE:

Root Mean Square Error

RNN:

Recurrent Neural Networks

SARIMA:

Seasonal Autoregressive Integrated Moving Average

SMC:

Seasonal Malaria Chemoprevention

STL:

Seasonal and Trend decomposition using Loess

XGBoost:

Extreme Gradient Boosting

References

  1. World Health Organization (WHO). World malaria report 2023. 2023. https://www.who.int/teams/global-malaria-programme/reports/world-malaria-report-2023.

  2. Organisation mondiale de la Santé. D’une charge élevée à un fort impact: Une riposte ciblée contre le paludisme. 2019. https://www.who.int/fr/publications/i/item/WHO-CDS-GMP-2018.25.

  3. Programme National de Lutte contre le Paludisme (PNLP). Bulletin épidémiologique annuel 2021 du paludisme au Sénégal. 2021. https://www.pnlp.sn/ressources/documents.

  4. Diao O, Absil PA, Diallo M. Generalized linear models to forecast malaria incidence in three endemic regions of Senegal. Int J Environ Res Public Health. 2023;20(13):6303. https://doi.org/10.3390/ijerph20136303.

  5. Diouf I, Ndione JA, Gaye AT. Malaria in Senegal: recent and future changes based on bias-corrected CMIP6 simulations. Trop Med Infect Dis. 2022;7(11):345. https://doi.org/10.3390/tropicalmed7110345.

  6. Camara AM. Dimensions régionales de la pauvreté au Sénégal. Belg Rev Belge Géographie. 2002;1:17–28. https://doi.org/10.4000/belgeo.15432.

  7. Djidjou-Demasse R. Development and analysis of a malaria transmission mathematical model with seasonal mosquito life-history traits. Stud Appl Math. 2019;144. https://doi.org/10.1111/sapm.12296.

  8. Ozodiegwu I, Ambrose M, Galatas B, Runge M, Nandi A, Okuneye K, et al. Application of mathematical modelling to inform National malaria intervention planning in Nigeria. Malar J. 2023;22. https://doi.org/10.1186/s12936-023-04563-w.

  9. Devi R, Choudhury B. Analysis of SIR mathematical model for malaria disease: a study in Assam, India. J Ilmu Dasar. 2023;24:169.

  10. Safan M, Bichara D, Yong K, Alharthi A, Castillo-Chavez C. Role of differential susceptibility and infectiousness on the dynamics of an SIRS model for malaria transmission. Symmetry. 2023;15:1950.

  11. Khiri L, Gueye I, Naacke H, Sarr I, Gançarski S. A Malaria Control Model using Mobility Data: An Early Explanation of Kedougou Case in Senegal. In: 10th International Conference on Simulation and Modeling Methodologies, Technologies and Applications. on-line, France: SciTePress; 2020. pp. 35–46. Available from: https://hal.science/hal-03482056. Cited 4 Sept 2024.

  12. Kumar D, Singh J, Al Qurashi M, Baleanu D. A new fractional SIRS-SI malaria disease model with application of vaccines, antimalarial drugs, and spraying. Adv Differ Equ. 2019;2019.

  13. Abeku TA, de Vlas SJ, Borsboom G, Teklehaimanot A, Kebede A, Olana D, et al. Forecasting malaria incidence from historical morbidity patterns in epidemic-prone areas of Ethiopia: a simple seasonal adjustment method performs best. Trop Med Int Health. 2002;7(10):851–7.

  14. Chatterjee C, Sarkar RR. Multi-step polynomial regression method to model and forecast malaria incidence. PLoS One. 2009;4:e4726.

  15. Legendre E. Vers une stratégie d’élimination du paludisme saisonnier: cibler pour optimiser la surveillance et les interventions [Santé publique]. [Marseille]: Aix-Marseille Université; 2023.

  16. Thakur S, Dharavath R. Artificial neural network based prediction of malaria abundances using big data: a knowledge capturing approach. Clin Epidemiol Glob Health. 2019;7(1):121–6.

  17. Nkiruka O, Prasad R, Clement O. Prediction of malaria incidence using climate variability and machine learning. Inform Med Unlocked. 2021;22:100508.

  18. Harvey D, Valkenburg W, Amara A. Predicting malaria epidemics in Burkina Faso with machine learning. PLoS One. 2021;16(6):e0253302.

  19. Mohamed J, Mohamed AI, Daud EI. Evaluation of prediction models for the malaria incidence in Marodijeh region, Somaliland. J Parasit Dis Off Organ Indian Soc Parasitol. 2022;46(2):395–408.

  20. Diagne N, Fontenille D, Konate L, Faye O, Traore-Lamizana M, Legros F, et al. Les Anophèles du sénégal: liste commentée et illustrée [Anopheles of Senegal. An annotated and illustrated list]. Bull Soc Pathol Exot. 1994;87(4):267–77.

  21. Enquête sur les Indicateurs du Paludisme au Sénégal (EIPS) 2020–2021. https://www.ansd.sn/enquete-et-etude/enquete-sur-les-indicateurs-du-paludisme-2020-2021.

  22. Programme National de Lutte contre le Paludisme (PNLP). Bulletin épidémiologique annuel 2020 du paludisme au Sénégal. 2020. https://www.pnlp.sn/ressources/documents.

  23. Programme National de Lutte contre le Paludisme (PNLP). Bulletin épidémiologique annuel 2019 du paludisme au Sénégal. 2019. https://www.pnlp.sn/ressources/documents.

  24. Programme National de Lutte contre le Paludisme (PNLP). Bulletin épidémiologique annuel 2018 du paludisme au Sénégal. 2018. https://www.pnlp.sn/ressources/documents.

  25. National Aeronautics and Space Administration (NASA) Langley Research Center (LaRC). POWER | Data Access Viewer. Available from: https://power.larc.nasa.gov/data-access-viewer/. Cited 11 Dec 2023.

  26. Agence Nationale de la Statistique et de la Démographie (ANSD) du Sénégal. Données de population. 2023. Available from: https://www.ansd.sn/Indicateur/donnees-de-population. Cited 15 Dec 2023.

  27. Global Administrative Areas (GADM) | Geospatial Centre. GADM. Available from: https://gadm.org/download_country.html. Cited 28 Aug 2024.

  28. Chen T, Guestrin C. XGBoost: A Scalable Tree Boosting System. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. San Francisco, California, USA: ACM; 2016. pp. 785–94. Available from: https://doi.org/10.1145/2939672.2939785. Cited 5 Feb 2024.

  29. Fang ZG, Yang SQ, Lv CX, An S, Wu W. Application of a data-driven XGBoost model for the prediction of COVID-19 in the USA: a time-series study. BMJ Open. 2022;12(7):e056685.

  30. Guryanov A. Histogram-Based Algorithm for Building Gradient Boosting Ensembles of Piecewise Linear Decision Trees. In: Analysis of Images, Social Networks and Texts: 8th International Conference, AIST 2019, Kazan, Russia, July 17–19, 2019, Revised Selected Papers. Berlin, Heidelberg: Springer-Verlag; 2019. pp. 39–50. Available from: https://doi.org/10.1007/978-3-030-37334-4_4. Cited 19 June 2024.

  31. Ke G, Meng Q, Finley T, Wang T, Chen W, Ma W, et al. LightGBM: A Highly Efficient Gradient Boosting Decision Tree. In: Advances in Neural Information Processing Systems. Curran Associates, Inc.; 2017. Available from: https://proceedings.neurips.cc/paper_files/paper/2017/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html. Cited 19 June 2024.

  32. xgboost developers. XGBoost Parameters — xgboost 2.0.3 documentation. Available from: https://xgboost.readthedocs.io/en/stable/parameter.html. Cited 7 Feb 2024.

  33. scikit-learn developers. scikit-learn. HistGradientBoostingRegressor. Available from: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.HistGradientBoostingRegressor.html. Cited 29 Aug 2024.

  34. Microsoft Corporation. Parameters — LightGBM 4.0.0 documentation. Available from: https://lightgbm.readthedocs.io/en/stable/Parameters.html. Cited 19 June 2024.

  35. Wu J, Chen XY, Zhang H, Xiong LD, Lei H, Deng SH. Hyperparameter optimization for machine learning models based on bayesian optimization. J Electron Sci Technol. 2019;17(1):26–40.

  36. Molnar C. 9.5 Shapley Values | Interpretable Machine Learning. Available from: https://christophm.github.io/interpretable-ml-book/shapley.html. Cited 3 July 2024.

  37. Diouf I, Rodriguez-Fonseca B, Deme A, Caminade C, Morse AP, Cisse M, et al. Comparison of malaria simulations driven by meteorological observations and reanalysis products in Senegal. Int J Environ Res Public Health. 2017;14(10):1119.

  38. Gaye S, Kibler J, Ndiaye JL, Diouf MB, Linn A, Gueye AB, et al. Proactive community case management in Senegal 2014–2016: a case study in maximizing the impact of community case management of malaria. Malar J. 2020;19(1):166.

  39. Grueninger H, Hamed K. Transitioning from malaria control to elimination: the vital role of ACTs. Trends Parasitol. 2013;29(2):60–4.

  40. Okell LC, Drakeley CJ, Bousema T, Whitty CJM, Ghani AC. Modelling the impact of artemisinin combination therapy and long-acting treatments on malaria transmission intensity. PLoS Med. 2008;5(11):e226.

  41. Villena OC, Arab A, Lippi CA, Ryan SJ, Johnson LR. Influence of environmental, geographic, socio-demographic, and epidemiological factors on presence of malaria at the community level in two continents. Sci Rep. 2024;14(1):16734.

  42. Qayum A, Arya R, Kumar P, Lynn AM. Socio-economic, epidemiological and geographic features based on GIS-integrated mapping to identify malarial hotspots. Malar J. 2015;14(1):192.

  43. Ndiaye A, Niang EHA, Diène AN, Nourdine MA, Sarr PC, Konaté L, et al. Mapping the breeding sites of Anopheles gambiae s. l. in areas of residual malaria transmission in central western Senegal. PLoS One. 2020;15(12):e0236607.

  44. Duque C, Lubinda M, Matoba J, Sing’anga C, Stevenson J, Shields T, et al. Impact of aerial humidity on seasonal malaria: an ecological study in Zambia. Malar J. 2022;21:325.

  45. Fall P, Diouf I, Deme A, Diouf S, Sene D, Sultan B, et al. Bias-corrected CMIP5 projections for climate change and assessments of impact on malaria in Senegal under the VECTRI model. Trop Med Infect Dis. 2023;8(6):310.

  46. Sy O, Konaté L, Ndiaye A, Dia I, Diallo A, Taïrou F, et al. Identification des Gîtes larvaires d’anophèles Dans les foyers résiduels de faible transmission du paludisme « hotspots » Au centre-ouest du Sénégal. Bull Soc Pathol Exot. 2016;109(1):31–8.

  47. Kazanga B, Ba EH, Legendre E, Cissoko M, Fleury L, Bérard L, et al. Impact of seasonal malaria chemoprevention timing on clinical malaria incidence dynamics in the Kedougou region, Senegal. medRxiv. 2024. Available from: https://www.medrxiv.org/content/10.1101/2024.04.16.24305915v1. Cited 17 Nov 2024.

  48. Ndiaye JLA, Ndiaye Y, Ba MS, Faye B, Ndiaye M, Seck A, et al. Seasonal malaria chemoprevention combined with community case management of malaria in children under 10 years of age, over 5 months, in south-east Senegal: a cluster-randomised trial. PLoS Med. 2019;16(3):e1002762.

  49. Cissé B, Ba EH, Sokhna C, NDiaye JL, Gomis JF, Dial Y, et al. Effectiveness of seasonal malaria chemoprevention in children under ten years of age in Senegal: a stepped-wedge cluster-randomised trial. PLoS Med. 2016;13(11):e1002175.

  50. Shrestha AK, Thapa A, Gautam H. Solar Radiation, Air Temperature, Relative Humidity, and Dew Point Study: Damak, Jhapa, Nepal. Int J Photoenergy. 2019;2019:1–7.

Acknowledgements

The authors would like to thank the Bill and Melinda Gates Foundation, all supervisors for their contributions to the work and the NMCP Senegal for the data.

Funding

This study was funded by a grant from the Gates Foundation (INV-047051). The funders had no role in the design of the study, the interpretation of the data collected, or the writing of the manuscript.

Author information

Contributions

(A) Conception and design: AY Ly, M Bousso, JL Ndiaye; (B) Administrative support: JL Ndiaye; (C) Provision of study materials: JL Ndiaye, M Ndiop; (D) Collection and assembly of data: AY Ly; (E) Data analysis and interpretation: AY Ly, M Bousso, JL Ndiaye, MM Allaya, MA Loum, L Gning, FB Sall, O Sy; (F) Manuscript writing: AY Ly, MM Allaya, MA Loum, L Gning, M Bousso, JL Ndiaye; (G) Final approval: All authors.

Corresponding author

Correspondence to Almamy Youssouf Ly.

Ethics declarations

Ethics approval and consent to participate

The research used retrospective malaria data from the Senegal National Malaria Control Program, with no direct contact with patients. All data were kept strictly confidential, and no personal identifiers were collected.

Consent for publication

Not Applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.

About this article

Cite this article

Ly, A.Y., Allaya, M.M., Loum, M.A. et al. Forecasting malaria incidence in the Southeast districts of Senegal using a machine learning approach. BMC Artif. Intell. 1, 9 (2025). https://doi.org/10.1186/s44398-025-00007-4

  • DOI: https://doi.org/10.1186/s44398-025-00007-4

Keywords