Machine Learning in Urban Epidemiology

Dengue fever is the most common mosquito-borne viral disease in the world. It is caused by a virus transmitted through the bite of the Aedes mosquito. There is currently no drug for dengue fever, so preventing it means preventing the breeding of its carrier, the Aedes mosquito. Historically, Singapore has had significant dengue fever outbreaks. It is estimated that dengue cases may exceed 30,000 in 2016, and 5,122 cases have been reported since March 1st of this year alone.

In this project we use machine learning to predict Dengue Fever outbreaks in Singapore. Based on the national historical data, we aim to predict the time of future outbreaks. Furthermore, if possible we plan to assess different districts in Singapore regarding the risk of having Dengue Fever outbreaks.
Figure_1

Dengue fever is a vector-borne disease: a vector (the Aedes mosquito) spreads it by biting a host (an infected human), as shown in the figure above. Typically, after a female mosquito takes a blood meal from an infected person, the virus incubates for two weeks before the mosquito can infect a healthy person.

Temporal Analysis: Analysis of historical weather data and dengue case data shows that temperature is more strongly correlated with dengue cases than other features such as rainfall and wind speed. Rainfall and wind speed are more sporadic and volatile, which makes it hard to find a suggestive pattern. The correlation matrix confirms the intuition drawn from the figures. Scatterplots of each feature against the dependent variable are included in Appendix Figure III. Linear regression confirms that temperature and rainfall are the only significant features, with p-values of 0.000 and 0.03 respectively, and both are positively associated with dengue fever. The R-squared value is low, at 8%, so the linear model explains little of the variance. Random Forest (RF) performs relatively better, with an out-of-sample accuracy score of 34%, using the following parameters: 300 trees in the forest, log₂ n features considered at each split, and entropy as the impurity measure. Hence, meteorological data alone is not sufficient to predict dengue fever accurately.
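The correlation step described above can be sketched in a few lines of plain Python. The weekly values below are invented purely for illustration and are not the project's data, and `pearson` is a hypothetical helper name rather than the code actually used.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Toy weekly observations (made up for illustration only).
weather = {
    "temperature_c": [26.5, 27.0, 27.8, 28.4, 28.9, 28.1],
    "rainfall_mm": [80.0, 5.0, 120.0, 0.0, 60.0, 15.0],
    "wind_speed_kmh": [9.0, 14.0, 7.0, 12.0, 8.0, 13.0],
}
cases = [110, 125, 160, 190, 210, 170]

# Correlate each feature with the case counts and rank by absolute strength.
correlations = {name: pearson(values, cases) for name, values in weather.items()}
for name, r in sorted(correlations.items(), key=lambda kv: -abs(kv[1])):
    print(f"{name}: r = {r:+.2f}")
```

On real data, the same ranking by absolute correlation is what singles out one feature as more informative than the others.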

ML Presentation_Dengue_Yuan Lai (dragged) 1

Machine Learning Workflow:

Logistic Regression and a Random Forest Classifier are used for predictive modeling. Logistic regression predicts the probability of a dengue case in each area and makes it possible to determine the relative importance of the included features. This model is used to get a general sense of the data and of the model's significance, which is additionally judged by the Area Under Curve (AUC) score. Positive labels make up only approximately 1.2% of the data, so accuracy can look high even for a model that never predicts a case. AUC instead measures how often the model scores a randomly chosen positive example above a randomly chosen negative one, which makes it less affected by class imbalance and a better metric here than the accuracy score.
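A small sketch shows why accuracy is a poor guide at this level of imbalance. It assumes 1,000 hypothetical areas with 12 positives (1.2%); the helper names `accuracy` and `auc` are illustrative, and the AUC here is computed directly from its pairwise-ranking definition rather than with any particular library.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that equal the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 1,000 hypothetical areas, 12 of them (1.2%) with a dengue case.
y_true = [1] * 12 + [0] * 988

# A useless model: every area gets the same score and the label "no case".
constant_scores = [0.0] * 1000
constant_labels = [0] * 1000

print("accuracy:", accuracy(y_true, constant_labels))
print("AUC:", auc(y_true, constant_scores))
```

The constant model is right for every negative area, so its accuracy is very high, yet it cannot rank a single positive above a negative, and its AUC stays at the chance level of 0.5.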

Random Forest is used to cross-check the findings from logistic regression with a more sophisticated model that accounts for nonlinearity in the data. The RF parameters are chosen by 6-fold cross-validation, taking the best-performing set (highest AUC score) among combinations of the following: the number of estimators (trees in the forest), between 100 and 1000; the number of features considered for the best split, either n or log₂ n; and the function used to measure the quality of a split, either Gini impurity or information gain.
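The search procedure itself can be sketched without any modeling library. In the outline below, `k_fold_indices`, `grid_search`, and `dummy_fit_and_score` are hypothetical names, the three tree counts are arbitrary picks from the 100 to 1000 range, and the scorer is an arbitrary stand-in rather than a real model: in practice it would fit a Random Forest on the training folds and return the AUC on the held-out fold.

```python
import itertools
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and deal them into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def grid_search(param_grid, n_samples, k, fit_and_score):
    """Return (best_params, best_mean_score) over all parameter combinations."""
    folds = k_fold_indices(n_samples, k)
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = []
        for i, test_idx in enumerate(folds):
            # Train on every fold except the one held out for scoring.
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            scores.append(fit_and_score(params, train_idx, test_idx))
        mean_score = sum(scores) / k
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score

# Grid mirroring the text; the specific tree counts are illustrative.
param_grid = {
    "n_estimators": [100, 300, 1000],
    "max_features": ["n", "log2"],
    "criterion": ["gini", "entropy"],
}

def dummy_fit_and_score(params, train_idx, test_idx):
    """Arbitrary stand-in (NOT a real model): replace with fit-on-train, AUC-on-test."""
    return 0.5 + 0.1 * (params["criterion"] == "entropy") + 0.05 * (params["n_estimators"] == 300)

best_params, best_score = grid_search(
    param_grid, n_samples=600, k=6, fit_and_score=dummy_fit_and_score
)
print(best_params, best_score)
```

Averaging the score over all six held-out folds before comparing combinations is what keeps the chosen parameters from being tuned to a single lucky split.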

Flowchart

Spatial Analysis: Mosquito habitat is suspected to be correlated with the dependent variable by construction, because this data is collected by the same agency that records dengue fever cases, and it is highly likely that Singapore’s authorities identify mosquito habitats based on reported cases. With this predictor excluded, the AUC score is 50%, which means the algorithm performs no better than random guessing. However, the pseudo R-squared is approximately 20%, which suggests some predictive power. In addition, all included features are statistically significant at an alpha level of 5%. The marginal effects, i.e., the change in the probability of the dependent variable given changes in the independent variables, are most strongly positive for the transportation-related variables (street network and bus stops) and for trash bins. Parks and total population have a negative effect. This suggests that higher mobility contributes to higher dengue fever risk. The association with population is harder to interpret, because lot density and population do not fully capture population density and building density.
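For a logistic model, the marginal effect of feature j at a given point is the coefficient times p(1 - p), where p is the predicted probability. The sketch below averages that quantity over a few areas. The coefficients and feature values are hypothetical placeholders chosen only to match the signs described above, not the fitted values from this project, and `average_marginal_effects` is an illustrative helper name.

```python
import math

def sigmoid(z):
    """Logistic function mapping a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def average_marginal_effects(intercept, coefs, rows):
    """Average over rows of dP(y=1)/dx_j = coef_j * p * (1 - p)."""
    slopes = []
    for row in rows:
        p = sigmoid(intercept + sum(coefs[name] * row[name] for name in coefs))
        slopes.append(p * (1.0 - p))
    mean_slope = sum(slopes) / len(slopes)
    return {name: beta * mean_slope for name, beta in coefs.items()}

# Hypothetical coefficients on standardized features (NOT the project's fitted values).
intercept = -4.4
coefs = {
    "bus_stops": 0.6,
    "street_network": 0.5,
    "trash_bins": 0.4,
    "parks": -0.3,
    "population": -0.2,
}

# Three made-up areas with standardized feature values.
rows = [
    {"bus_stops": 1.2, "street_network": 0.8, "trash_bins": 0.5, "parks": -0.4, "population": 0.1},
    {"bus_stops": -0.5, "street_network": -0.2, "trash_bins": 0.0, "parks": 1.0, "population": 0.7},
    {"bus_stops": 0.0, "street_network": 0.3, "trash_bins": -0.6, "parks": 0.2, "population": -1.1},
]

for name, effect in average_marginal_effects(intercept, coefs, rows).items():
    print(f"{name}: {effect:+.4f}")
```

Because p(1 - p) is always positive and at most 0.25, each marginal effect keeps the sign of its coefficient, and the effects are small in absolute terms when the predicted probabilities are low, as they are with rare positive cases.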

Layers
Real Reported Cases:
RealCase
Our Prediction:
Prediction
This project was a collaboration with Lucas Chizzali, Diego F. Garzon, and Bibby Bilguun from NYU Center for Urban Science and Progress (CUSP).
