Predicting the next Pandemic – Dengue

Predicting Dengue – DrivenData Competition

Goal:  Predict dengue outbreaks, measured as the total number of cases by year and week, for two cities (San Juan and Iquitos).

Data:  U.S. Centers for Disease Control and Prevention, the Department of Defense’s Naval Medical Research Unit 6 and the Armed Forces Health Surveillance Center, in collaboration with the Peruvian government and U.S. universities. Environmental and climate data are provided by the National Oceanic and Atmospheric Administration (NOAA), an agency of the U.S. Department of Commerce.

Background Knowledge:  A big nada, zero, nothin’.  I had no background or expertise in dengue or the mosquito lifecycle before beginning this challenge.

End Result:  Highest rank was #413.  It was highly competitive and, I’ll admit, a little addicting.  The deeper I dug into understanding the Aegypti mosquito, the more creative my variables became, which in turn improved my rankings.


Dengue Forecasting Project


“Anyone who thinks that they are too small to make a difference has never tried to fall asleep with a mosquito in the room” – Christine Whitman

The dengue virus is spread by mosquitos and causes dengue fever. With no widely available vaccine, approximately 50 to 528 million people are infected with the virus each year, 500,000 cases are severe and life-threatening, and 10 to 20 thousand people die from dengue hemorrhagic fever or dengue shock syndrome (Whitehorn & Farrar, 2010). Dengue affects people all over the world, but primarily in tropical and subtropical areas, where one-third of the world’s population is located (Centers for Disease Control and Prevention, 2018). Being able to accurately predict dengue epidemics could help focus prevention efforts, reduce their impact and save lives.

To predict the number of cases and where epidemics may occur, data from a variety of sources, including environmental conditions from ground observations, remote sensing and reanalysis, were used for two locations: Iquitos, Peru and San Juan, Puerto Rico. The goal is to predict the total cases for these two cities by year and week of the year, over a five-year span for San Juan and a three-year span for Iquitos. We begin with an overview of the problem and its significance, along with statistical information on the data. We then outline the literature on the types of models selected and the models built, including their formulation, performance and limitations, followed by recommended ideas for future work and a retrospective evaluation of what was learned.

Problem to Address

Predict the total cases of Dengue by year and week of the year in San Juan, Puerto Rico and Iquitos, Peru.

Significance of the Problem

Predicting the total cases of dengue by location as well as by time allows efforts to be focused on reducing the impact of outbreaks. Additionally, understanding where dengue outbreaks occur may inform efforts against related diseases such as the Zika virus, and help focus the development of vaccinations and medical treatments.

Exploratory Data Analysis

The DengAI project consists of several files, including training and test data along with separate files for the total number of cases and the DrivenData submission format, as follows:

                 San Juan, Puerto Rico               Iquitos, Peru
           Observ  Start Date  End Date     Observ  Start Date  End Date     Total
Training      936  4/30/1990   7/1/2000        520  7/1/2000    6/25/2010    1,456
Test          260  4/29/2008   4/30/2013       156  7/2/2010    6/25/2013      416
Total       1,196                              676                           1,872

The complete dataset contains 1,872 observations and 24 variables.  The variables include numerical data such as the amount of precipitation and humidity along with categorical data such as the city.  In order to gain a better understanding of the data, a description and statistical information are shown below.

Dengue Data

Dengue Variables – Descriptives

Visual representations are an effective way to view the data, such as variable correlations by city, where significant correlations (p-values < .01) are shown and insignificant ones are left blank.

Additionally, seasonal deviation plots reflect the clearly observable underlying yearly seasonal pattern.

Incomplete Observations

Distribution of NAs


The incomplete observations could mean either that the data was not obtained or that an error was made.  The volume of missing data, and whether it was spread over time or concentrated in a particular period, is also helpful for the imputation process.  The variable missing the most data is ndvi_ne with 12.66%, followed by ndvi_nw with 3.37%.
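The percentages above are simply the share of rows in which a variable has no value. As a minimal sketch in Python (the original analysis was done in R; the function names and the toy rows here are illustrative assumptions, not the project's code or data), the calculation could look like this:

```python
import math

def is_missing(value):
    """Treat None, empty strings, 'NA' and NaN as missing."""
    if value is None or value == "" or value == "NA":
        return True
    return isinstance(value, float) and math.isnan(value)

def missing_share(rows, columns):
    """Return {column: percent of rows in which the column is missing}."""
    shares = {}
    for col in columns:
        missing = sum(1 for row in rows if is_missing(row.get(col)))
        shares[col] = 100.0 * missing / len(rows) if rows else 0.0
    return shares

# Toy rows for illustration only (not the DengAI data).
rows = [
    {"ndvi_ne": None, "ndvi_nw": 0.10},
    {"ndvi_ne": 0.12, "ndvi_nw": 0.11},
    {"ndvi_ne": None, "ndvi_nw": None},
    {"ndvi_ne": 0.15, "ndvi_nw": 0.09},
]
shares = missing_share(rows, ["ndvi_ne", "ndvi_nw"])
print(shares)
```

Sorting the resulting dictionary by value gives the same kind of ranking as the one quoted above.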

A summary of the missing data is as follows:

Missing Data



To draw on how others have addressed similar challenges in predicting dengue cases, I reviewed several journal articles, including:

  • Climate Variability, Social and Environmental Factors, and Ross River Virus Transmission: Research Development and Future Research Needs (Tong, Dale, Mackenzie, Wolff, & McMichael, 2008) compares a variety of models including Poisson, logistic, ARIMA and SARIMA.
  • Time series analysis of malaria in Afghanistan: using ARIMA models to predict future trends in incidence (Anwar, Lewnard, Parikh, & Pitzer, 2016) developed ARIMA models based on temporal autocorrelation.
  • Weather variables and the El Niño Southern Oscillation may drive the epidemics of dengue in Guangdong Province, China (Xiao, et al., 2018) utilized GAM analysis, random forest and wavelet analysis.
  • Spatial and Temporal Dynamics of Dengue Fever in Peru: 1994-2006 (Chowell, et al., 2008) built their models using stepwise regression, with the best regression model reflecting a lag of five weeks.
  • The Temporal Spectrum of Adult Mosquito Population Fluctuations: Conceptual and Modeling Implications (Jian, Silvestri, Brown, Hickman, & Marani, 2014) constructed models using Individual-Based life cycle Simulation (IBS) along with two population models, the Ricker model and the Gompertz-logistic model.
  • Identification of the prediction model for dengue incidence in Can Tho city, a Mekong Delta area in Vietnam (Phung, et al., 2014) again uses SARIMA with a Box and Jenkins approach, along with a Poisson distributed lag model.

Types of Models

Based on the peer-reviewed journals and materials on time series, several types of models were used: Holt Winters Exponential Smoothing, Seasonal ARIMA, ARIMA with regression terms, and Neural networks.  Each of these models was deemed appropriate for forecasting dengue due to the seasons, trends and possible correlations within the data associated with outbreaks.

Mosquito Grow Index

Seasonal ARIMA

Autoregressive Integrated Moving Average (ARIMA) models can model seasonal data by using differences at a lag equal to the seasonal period to remove additive seasonal effects.  The seasonal ARIMA considers the statistical model for the irregular component of a time series while allowing for non-zero autocorrelations.  The model has a non-seasonal portion, reflected in lowercase letters, and a seasonal portion, reflected in capital letters:  ARIMA (p,d,q)(P,D,Q)m, where m is the number of observations per year.  Utilizing the auto.arima function in the ‘forecast’ package resulted in (1,1,1) for San Juan and (0,1,2)(0,0,1)[52] for Iquitos.  A plot can be found in Appendix A.
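To make the differencing step concrete, here is a minimal Python sketch (the project itself used R's auto.arima; the helper name and the toy series are assumptions for illustration). Differencing at lag m removes a repeating additive seasonal pattern, and an ordinary first difference then removes what is left of a linear trend:

```python
def difference(series, lag=1):
    """Return series[t] - series[t - lag] for every t where both values exist."""
    return [series[t] - series[t - lag] for t in range(lag, len(series))]

# Toy series: linear trend plus a repeating pattern with period m = 4.
# (Weekly dengue data would use m = 52.)
m = 4
pattern = [0, 5, 9, 2]
series = [10 + t + pattern[t % m] for t in range(12)]

seasonal_diff = difference(series, lag=m)      # seasonal pattern cancels out
first_diff = difference(seasonal_diff, lag=1)  # remaining constant cancels out

print(seasonal_diff)
print(first_diff)
```

In the ARIMA (p,d,q)(P,D,Q)m notation, the first operation corresponds to D = 1 and the second to d = 1.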

ARIMA with Regression Terms

Similar to the seasonal ARIMA, the auto.arima function was used, this time with the xreg argument supplying the regression terms, so that the function selects the best ARIMA model for the regression errors.  This resulted in (1,1,1)(1,1,1) for San Juan and (0,1,2)(0,0,1) for Iquitos, as shown in Appendix A.
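For readers who prefer notation, a regression with ARIMA errors can be written in the general textbook form below. The symbols are assumptions introduced here rather than taken from the project: y_t is the weekly case count, x_{1,t}, ..., x_{k,t} are the regressors passed through xreg, B is the backshift operator and \varepsilon_t is white noise.

```latex
\begin{align}
y_t &= \beta_0 + \beta_1 x_{1,t} + \dots + \beta_k x_{k,t} + \eta_t \\
\phi(B)\,\Phi(B^m)\,(1 - B)^d (1 - B^m)^D\,\eta_t &= \theta(B)\,\Theta(B^m)\,\varepsilon_t
\end{align}
```

The first line is an ordinary regression; the second says that its error term \eta_t follows a seasonal ARIMA (p,d,q)(P,D,Q)m process instead of being independent noise.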

Exponential Smoothing – Holt Winters

Exponential Smoothing models utilize a weighted mean of past values with a heavier weighting on more recent values.  The Holt Winters method adds seasonality, so three smoothing equations are used: one for the level, another for the trend and a final one for seasonality.  Again, plots for exponential smoothing can be found in Appendix A.
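The three smoothing equations can be sketched in a few lines of Python. This is an illustrative additive Holt Winters implementation with a simple initialisation chosen for this example; it is not the R routine used in the project, and the parameter values and toy series are assumptions:

```python
def holt_winters_additive(series, m, alpha, beta, gamma, horizon):
    """Additive Holt Winters smoothing; returns `horizon` forecasts.

    m is the seasonal period; alpha, beta and gamma are the smoothing
    weights for the level, trend and seasonal equations respectively.
    """
    if len(series) < 2 * m:
        raise ValueError("need at least two full seasons of data")
    first, second = series[:m], series[m:2 * m]
    level = sum(first) / m                        # mean of the first season
    trend = (sum(second) - sum(first)) / (m * m)  # average per-step change
    season = [y - level for y in first]           # initial seasonal offsets
    for t in range(m, len(series)):
        y = series[t]
        old_season = season[t % m]
        prev_level, prev_trend = level, trend
        # Level: deseasonalised observation blended with the previous projection.
        level = alpha * (y - old_season) + (1 - alpha) * (prev_level + prev_trend)
        # Trend: change in level blended with the previous trend.
        trend = beta * (level - prev_level) + (1 - beta) * prev_trend
        # Seasonality: detrended observation blended with last year's offset.
        season[t % m] = gamma * (y - prev_level - prev_trend) + (1 - gamma) * old_season
    n = len(series)
    return [level + h * trend + season[(n + h - 1) % m] for h in range(1, horizon + 1)]

# Toy series with a pure repeating pattern of period 4 and no trend.
pattern = [1, 3, 2, 6]
forecasts = holt_winters_additive(pattern * 3, m=4, alpha=0.5, beta=0.5, gamma=0.5, horizon=4)
print(forecasts)
```

On a series that is purely seasonal, the forecasts simply repeat the pattern, which is a quick sanity check that the seasonal equation is wired up correctly.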

Neural Networks

A neural network mimics the learning pattern of natural biological neural networks. It begins with a single perceptron that receives inputs, applies a weighting, then passes the result into an activation function to produce an output; perceptrons are then layered to create a network. Hidden layers are the layers between the input and output layers, where you cannot see the inputs or outputs (Portilla, n.d.).
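As a rough illustration of that description, the Python sketch below builds one perceptron and then a single hidden layer feeding a linear output unit. The weights, the sigmoid activation and the linear output are assumptions chosen for the example; they are not the fitted values or the exact architecture from the project:

```python
import math

def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))

def perceptron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

def forward(inputs, hidden_layer, output_unit):
    """One hidden layer of perceptrons followed by a linear output unit."""
    hidden = [perceptron(inputs, weights, bias) for weights, bias in hidden_layer]
    out_weights, out_bias = output_unit
    return sum(w * h for w, h in zip(out_weights, hidden)) + out_bias

# Two lagged inputs, two hidden nodes, all hidden weights zero for easy checking.
hidden_layer = [([0.0, 0.0], 0.0), ([0.0, 0.0], 0.0)]
output_unit = ([1.0, 1.0], 0.0)
prediction = forward([3.0, 5.0], hidden_layer, output_unit)
print(prediction)
```

Training consists of adjusting the weights and biases so that predictions match observed values; only the forward pass is shown here.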

The nnetar function in the ‘forecast’ package was used to produce a number of models using a variety of hidden nodes.  The default number of hidden nodes is half the number of input nodes plus 1; however, the best performing model used X hidden nodes for San Juan and Y for Iquitos.


In developing the models, the data was first reviewed, with the additional variable ‘total cases’ consolidated into the first training file.  Both the training and test data were then combined into one file.  As outlined earlier, there were missing values in the dataset that needed to be addressed before the models could be developed.  Depending upon the variable, a suitable method was chosen.  The method of imputation by variable was completed as follows:

  • ndvi_ne, ndvi_nw, ndvi_se, ndvi_sw were all updated using the Last Observation Carried Forward (na.locf) function
  • Precipitation: precipitation_amt_mm missing values were replaced with reanalysis_precip_amt_kg_per_m2
  • Minimum air temperature: reanalysis_min_air_temp_k missing values were updated with station_min_temp_c using the Kelvin to Celsius formula (reanalysis_min_air_temp_k – 273.15)
  • Maximum air temperature: reanalysis_max_air_temp_k missing values were updated with station_max_temp_c using the Kelvin to Celsius equivalent
  • Average air temperature: reanalysis_avg_temp_k missing values were updated with station_avg_temp_c using the Kelvin to Celsius equivalent
  • reanalysis_relative_humidity_percent and reanalysis_specific_humidity_g_per_kg were updated using Predictive Mean Matching (‘pmm’ in the mice package)
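Two of the imputation steps above are simple enough to sketch directly. The Python below is an illustration only (the project used R's na.locf and its own fill logic). It assumes that a missing Kelvin reading is filled from the Celsius station reading for the same week by adding 273.15; that direction of conversion, the function names and the toy values are all assumptions:

```python
def locf(values):
    """Last observation carried forward; leading gaps stay None."""
    filled, last = [], None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled

def kelvin_to_celsius(kelvin):
    return kelvin - 273.15

def celsius_to_kelvin(celsius):
    return celsius + 273.15

def fill_kelvin_from_celsius(kelvin_values, celsius_values):
    """Fill gaps in a Kelvin series from a Celsius series of the same length."""
    filled = []
    for k, c in zip(kelvin_values, celsius_values):
        if k is None and c is not None:
            k = celsius_to_kelvin(c)  # assumed direction: Celsius station -> Kelvin
        filled.append(k)
    return filled

ndvi = locf([0.1, None, None, 0.3, None])
min_temp_k = fill_kelvin_from_celsius([295.15, None, None], [22.0, 25.0, None])
print(ndvi)
print(min_temp_k)
```

A gap that has no station reading either stays missing, which leaves it available for a model-based method such as predictive mean matching.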

Additionally, after researching the ideal biological conditions for mosquitos and understanding the optimal temperature range of 22° to 32° Celsius, a variable termed ‘mosquito grow index’ was created.  It was based on the following:

The base temperature (20° Celsius) may not be ideal for the Aegypti mosquito species but was used as a starting point, with the understanding that temperatures under 16 degrees were considered less optimal.  A box plot of the Mosquito Grow Index reflects the distribution of the results.

Once all of the missing values were replaced, the datasets were separated back into their original training and test sizes by location, one for Iquitos, Peru and the other for San Juan, Puerto Rico.  Each was then represented as a time series within R.

The various data sets were then used in the different types of models – Seasonal ARIMA, ARIMA with regression terms, Exponential Smoothing – Holt Winters and Neural networks.  The Mean Absolute Error (MAE) was evaluated and, if deemed appropriate, the file was put into the correct submission format and submitted.  The results were then logged and the methodology adjusted accordingly.
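The MAE is the average of the absolute differences between the observed and predicted case counts. A minimal Python sketch (the function name and toy numbers are illustrative assumptions, not contest data):

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Toy weekly case counts and forecasts.
mae = mean_absolute_error([4, 10, 0, 7], [6, 7, 1, 7])
print(mae)
```

Because the errors are not squared, a single badly missed outbreak week affects the MAE less than it would a root mean square error.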


Overall, the best model was the ARIMA with regression terms, with an AICc value of 7235.6 for San Juan and 3273.8 for Iquitos.  This was a bit surprising, as it was also one of the easiest models to build.  Another iteration of this model with the Mosquito Grow Index negatively impacted performance by approximately 0.5; while this is quite a minor decline, the result was disappointing.  The Neural Network also performed well ‘out of the box’ without modification.  Other models, such as the Seasonal ARIMA, didn’t perform quite as well.  Based on the AICc values, a combination of two models was applied: an ARIMA for San Juan and a Neural Network for Iquitos.  This resulted in my best MAE on DrivenData of 25.4038.
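For reference, the AICc is the AIC with a small-sample correction. The Python sketch below uses the standard formula; the log-likelihood, parameter count and sample size in the example are made-up values for illustration, not the fitted models from this project:

```python
def aicc(log_likelihood, k, n):
    """Corrected Akaike Information Criterion.

    k is the number of estimated parameters and n the number of observations.
    """
    if n - k - 1 <= 0:
        raise ValueError("AICc is undefined unless n > k + 1")
    aic = 2 * k - 2 * log_likelihood
    return aic + (2 * k * (k + 1)) / (n - k - 1)

# Hypothetical model: log-likelihood -100 with 3 parameters on 50 observations.
score = aicc(log_likelihood=-100.0, k=3, n=50)
print(round(score, 4))
```

Lower values are better, but AICc values are only comparable between models fitted to the same series with the same order of differencing.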

In DrivenData’s DengAI: Predicting Disease Spread competition, my highest rank was #413 (username:  Christa), with a few of the following scores:


Given my unfamiliarity with mapping coordinates in R, there could be more effective methods for working with these coordinates, which may bear on the weather conditions and on the mosquito lifecycle.  Additionally, a more thorough understanding of humidity and its interaction with temperature could help in creating additional variables.  Furthermore, understanding the living population and turnover of mosquitos would help in understanding the ratio between the number of cases and the living mosquito population.  Lacking a background in atmospheric variables, I kept equivalent duplicate variables, which was not effective, as it resulted in decreased model performance.

Future Work

Next steps include developing a methodology for forecasting the living population of mosquitos and its interaction with the weather conditions.  Additionally, being able to estimate the amount of still water from precipitation, humidity and temperature, and comparing it to the number of dengue cases, could also be useful.  Overall, a more thorough understanding of the mosquito lifecycle and the interplay of weather conditions is likely key, and is where my efforts will be focused.


I found this case interesting from numerous perspectives.  The first is how effective various models are at handling the various seasonal components, along with standardizing the variables.  Secondly, determining how best to use geographic coordinates for missing data was a new challenge that I never resolved, but I was happy to discover and apply the Kelvin to Celsius equation to relevant variables that were missing data.  Also, having an understanding of what is being modelled is important.  Having some background in environmental conditions, the lifecycle of mosquitos, and their interaction would provide new and perhaps meaningful variables.  As my second online modelling contest, I found it quite addicting and am looking forward to doing more as a way to hone my modelling skills.


In conclusion, the DrivenData competition in predicting the total cases of dengue was very interesting and relevant to the application of time series models.  The use of geographic coordinates was a unique challenge in the imputation of missing data, and it showed the importance of having relevant subject knowledge, such as weather conversions from Kelvin to Celsius.  Applying research to the problem was also very helpful, both in providing ideas on which models to build and in understanding the issues and challenges.  Finally, being able to achieve results ‘out of the box’ with the various predictive models was surprising, and left me wanting to create unique variables and to learn more about how to tweak the models to reduce the mean absolute error and improve my ranking within the contest.

Appendix A

Total cases of dengue – Iquitos and San Juan

Figure 1:  ARIMA with Regression Terms – San Juan

Figure 2:  ARIMA with Regression Terms – Iquitos

Figure 3:  Exponential Smoothing – San Juan

Figure 4:  Neural Network – Iquitos

Figure 5:  Holt Winters – Iquitos


References

Anwar, M. Y., Lewnard, J. A., Parikh, S., & Pitzer, V. E. (2016, November 22). Time series analysis of malaria in Afghanistan: using ARIMA models to predict future trends in incidence. Malaria Journal, 15(1).

Centers for Disease Control and Prevention. (2018). Dengue.

Chowell, G., Torre, C., Munayco-Escate, C., Suárez-Ognio, L., López-Cruz, R., Hyman, R., & Castillo-Chavez, C. (2008). Spatial and Temporal Dynamics of Dengue Fever in Peru: 1994-2006. Epidemiology and Infection, 136(12), 1667-1677.

Hu, W., Tong, S., Mengersen, K., & Oldenburg, B. (2006). Rainfall, mosquito density and the transmission of Ross River virus: A time-series forecasting model. Ecological Modelling, 196, 505-514.

Jian, Y., Silvestri, S., Brown, J., Hickman, R., & Marani, M. (2014). The Temporal Spectrum of Adult Mosquito Population Fluctuations: Conceptual and Modeling Implications. PLoS ONE, 9(12).

Phung, D., Huang, C., Rutherford, S., Chu, C., Wang, X., Nguyen, M., . . . Manh, C. D. (2014). Identification of the prediction model for dengue incidence in Can Tho city, a Mekong Delta area in Vietnam. Acta Tropica, 141, Part A, 88-96.

Tong, S., Dale, P., Mackenzie, J. S., Wolff, R., & McMichael, A. J. (2008). Climate Variability, Social and Environmental Factors, and Ross River Virus Transmission: Research Development and Future Research Needs. Environmental Health Perspectives, 116(12), 1591+.

Whitehorn, J., & Farrar, J. (2010, September). Dengue. British Medical Bulletin, 95(1), 161-173.

Xiao, J., Liu, T., Lin, H., Zhu, G., Zeng, W., Li, X., . . . M, W. (2018). Weather variables and the El Niño Southern Oscillation may drive the epidemics of dengue in Guangdong Province, China. Science of The Total Environment, 624, 926-934.

Link to contest:
