Dengue fever is the most common mosquito-borne viral disease in the world. It is caused by a virus transmitted through the bite of Aedes mosquitoes. There is currently no drug for dengue fever, so prevention depends on stopping its carrier, the Aedes mosquito, from breeding. Singapore has historically experienced significant dengue fever outbreaks. It is estimated that dengue cases may exceed 30,000 in 2016, and, strikingly, 5,122 cases have been recorded since March 1st of this year alone.
Dengue fever is a vector-borne disease: a vector (the Aedes mosquito) spreads it by biting a host (an infected human) (figure above). Typically, after a female mosquito takes a blood meal from an infected person, an incubation period of two weeks passes before the mosquito can infect a healthy person.
Temporal Analysis: Analysis of historical weather and dengue case data shows that temperature is more strongly correlated with dengue cases than other features such as rainfall and wind speed. Rainfall and wind speed are more sporadic and volatile, which makes it hard to find a suggestive pattern. The correlation matrix confirms the intuition drawn from the figures. Scatterplots of each feature against the dependent variable are contained in Appendix Figure III. Linear regression shows that temperature and rainfall are the only significant features, with p-values of 0.000 and 0.03, respectively, and both are positively associated with dengue fever. The low R-squared value of 8%, however, indicates that the model explains little of the variation in cases. Random Forest (RF) performs relatively better, with an out-of-sample accuracy score of 34%, using the following parameters: 300 trees in the forest, log₂ n features considered at each split, and entropy as the inequality measure. Hence, meteorological data alone is not sufficient to predict dengue fever accurately.
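The feature ranking described above can be illustrated with a minimal sketch that computes the Pearson correlation of each weather feature with the case counts and sorts the features by absolute correlation. The helper names (`pearson`, `rank_features`) and all numbers below are invented for illustration; they are not the study's data or code.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def rank_features(features, target):
    """Return (feature name, r) pairs sorted by |r|, strongest first."""
    correlations = {name: pearson(values, target) for name, values in features.items()}
    return sorted(correlations.items(), key=lambda item: abs(item[1]), reverse=True)


# Toy weekly observations, made up purely for illustration.
weather = {
    "temperature": [26.0, 27.0, 28.0, 29.0, 30.0],
    "rainfall": [10.0, 2.0, 8.0, 1.0, 9.0],
    "wind_speed": [5.0, 7.0, 6.0, 8.0, 5.0],
}
cases = [100, 120, 150, 160, 200]

ranking = rank_features(weather, cases)
for name, r in ranking:
    print(f"{name}: r = {r:+.3f}")
```

For a regression with a single predictor, R-squared equals the square of this correlation coefficient, which is one way to see how a weak correlation translates into a low R-squared.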
Logistic Regression and a Random Forest classifier are used for predictive modeling. Logistic regression predicts the probability of a dengue case in each area and allows the relative importance of the included features to be determined. This model is used to get a general sense of the data and of the model's significance, judged additionally by the Area Under the Curve (AUC) score. Since observations labeled as dengue cases represent approximately 1.2% of the data, AUC is a better metric than the accuracy score: it measures how often the model ranks a positive observation above a negative one, so it is less affected by sample balance.
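To make the argument for AUC concrete, the following sketch compares accuracy and AUC on a synthetic sample in which 1.2% of the labels are positive. AUC is computed here directly from its ranking interpretation (the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, with ties counted as one half). The function names and the synthetic labels are assumptions for illustration only.

```python
def auc_score(y_true, scores):
    """AUC as P(score of a random positive > score of a random negative)."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(positives) * len(negatives))


def accuracy(y_true, scores, threshold=0.5):
    """Share of observations whose thresholded score matches the label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, y_true)) / len(y_true)


# Synthetic labels: 12 positives out of 1000 observations (1.2%).
y_true = [1] * 12 + [0] * 988

# A useless model that assigns the same score to every observation.
constant_scores = [0.0] * 1000

print("accuracy:", accuracy(y_true, constant_scores))
print("AUC:", auc_score(y_true, constant_scores))
```

The constant model never predicts a dengue case, yet its accuracy is high simply because negatives dominate the sample, while its AUC stays at the chance level of 0.5.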
Random Forest is employed to cross-check the findings from logistic regression with a more sophisticated model that accounts for nonlinearity in the data. The RF parameters are chosen by cross-validation (K-fold with K = 6), taking the best-performing set (highest AUC score) among combinations of the following: the number of estimators (trees in the forest), between 100 and 1000; the number of features considered for the best split, either n or log₂ n; and the function that measures the inequality of a split, either Gini impurity or information gain.
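The selection procedure described above can be sketched as an exhaustive search over a parameter grid, scored by the mean AUC across six folds. This is a simplified illustration, not the authors' code: the grid points for the number of trees are assumed values within the stated 100 to 1000 range, the parameter names mirror scikit-learn's conventions (with `None` standing for all n features), and `score_fn` is a hypothetical callback that would fit a forest on the training indices and return the AUC on the test indices.

```python
from itertools import product
from statistics import mean

# Assumed grid; only the ranges and options come from the text.
PARAM_GRID = {
    "n_estimators": [100, 300, 1000],
    "max_features": [None, "log2"],   # None = all n features
    "criterion": ["gini", "entropy"],  # entropy = information gain
}


def k_fold_indices(n_samples, k=6):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def grid_search(score_fn, n_samples, grid=PARAM_GRID, k=6):
    """Return (best_params, best_mean_score) over all grid combinations.

    score_fn(params, train_idx, test_idx) is a hypothetical callback that
    returns a score (for example AUC) for one fold.
    """
    best_params, best_score = None, float("-inf")
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        params = dict(zip(keys, values))
        fold_scores = [
            score_fn(params, train, test)
            for train, test in k_fold_indices(n_samples, k)
        ]
        avg = mean(fold_scores)
        if avg > best_score:
            best_params, best_score = params, avg
    return best_params, best_score
```

In practice, scikit-learn's `GridSearchCV` with `scoring="roc_auc"` and `cv=6` provides the same search with stratification and parallelism handled by the library; the sketch only makes the loop structure explicit.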
Spatial Analysis: Mosquito habitat is suspected to be correlated with the dependent variable because this data is collected by the same agency that records dengue fever cases, and it is highly likely that Singapore’s authorities identify mosquito habitats based on reported cases. With this predictor excluded, the AUC score is 50%, which means that the algorithm performs no better than random guessing. However, the pseudo R-squared is approximately 20%, which suggests some predictive power. In addition, all included features are statistically significant at an alpha level of 5%. The marginal effects, i.e., the changes in the probability of the dependent variable given changes in the independent variables, are most strongly positive for transportation-related variables (street network and bus stops) and for trash bins. Parks and total population have a negative effect. This suggests that higher mobility contributes to higher dengue fever risk. The association with population is harder to interpret, because lot density and population do not fully capture population density and building density.
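As a brief illustration of what a marginal effect means in a logistic model: for a predicted probability p, the effect of a small change in feature j is its coefficient multiplied by p(1 - p). The sketch below evaluates this at a single point. The coefficient values and feature names are hypothetical, chosen only to show the sign pattern, and are not the estimates from this analysis.

```python
from math import exp


def sigmoid(z):
    """Logistic function mapping a linear score to a probability."""
    return 1.0 / (1.0 + exp(-z))


def marginal_effects(intercept, coefs, x):
    """Marginal effect of each feature at point x: beta_j * p * (1 - p)."""
    z = intercept + sum(coefs[name] * x[name] for name in coefs)
    p = sigmoid(z)
    return {name: beta * p * (1.0 - p) for name, beta in coefs.items()}


# Hypothetical coefficients for illustration only (not estimated values).
coefs = {"bus_stops": 0.8, "street_network": 0.6, "parks": -0.4}
point = {"bus_stops": 0.0, "street_network": 0.0, "parks": 0.0}

effects = marginal_effects(intercept=0.0, coefs=coefs, x=point)
for name, effect in effects.items():
    print(f"{name}: {effect:+.3f}")
```

Because p(1 - p) is always positive, a marginal effect carries the same sign as its coefficient, and its magnitude shrinks as the predicted probability approaches 0 or 1, which matters when positives are rare.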