Our inspiration for this model was the noticeable difference in how quickly the virus spread in different states.
Our model predicts the average daily percentage increase in cases for each state.
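To make the target concrete, here is a minimal sketch of one way such a rate could be computed from a state's cumulative case counts. The exact definition used in the project is not stated here, so the function name `average_daily_pct_increase`, the choice to average day-over-day percentage changes, and the sample counts are all assumptions for illustration.

```python
def average_daily_pct_increase(cumulative_cases):
    """Mean of the day-over-day percentage increases in a cumulative case series.

    Assumed definition for illustration; the project's exact target may differ.
    """
    pct_changes = []
    for prev, curr in zip(cumulative_cases, cumulative_cases[1:]):
        if prev > 0:  # skip days before the first reported case
            pct_changes.append((curr - prev) / prev * 100.0)
    if not pct_changes:
        raise ValueError("need at least two days with a nonzero starting count")
    return sum(pct_changes) / len(pct_changes)


# Made-up counts for illustration only, not real case data.
example_counts = [100, 110, 121]
rate = average_daily_pct_increase(example_counts)
print(f"average daily increase: {rate:.2f}%")
```

In this made-up series each day grows by about 10 percent over the previous one, so the average comes out at roughly 10 percent per day.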
We built this with an ensemble of CatBoost regressors and KNN regressors. We used data from credible sources, which are listed in our documentation.
The biggest challenge we encountered was showing that our model is not over-fit when there are only 50 observations, one for each state.
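With so few rows, one common way to probe for over-fitting is leave-one-out cross-validation, where each state is predicted by a model trained on the other 49. The sketch below shows the idea with scikit-learn. It is not necessarily the check used in this project, and the data, the KNN model, and its settings are assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data: 50 rows (one per state) and 4 hypothetical features.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
y = 5.0 + X[:, 0] + rng.normal(scale=0.5, size=50)

model = KNeighborsRegressor(n_neighbors=5)

# Each row is predicted by a model that never saw that row during training.
held_out_pred = cross_val_predict(model, X, y, cv=LeaveOneOut())
loo_mae = mean_absolute_error(y, held_out_pred)

# Error on the training data itself, for comparison.
train_mae = mean_absolute_error(y, model.fit(X, y).predict(X))

print(f"leave-one-out MAE: {loo_mae:.3f}")
print(f"training MAE:      {train_mae:.3f}")
```

A held-out error far above the training error would suggest the model is memorizing individual states rather than learning a pattern that generalizes.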
We are proud of our model's final score, the insights we gained, and the research we conducted to explain those insights.
From this study, we learned that it is important to test theories before asserting them. Most of the variables we identified as potential determinants in our thought experiment had little to no impact on the spread rate. Had we not built our model, we would not have realized this.
We will continue working to build a similar model for counties rather than states. We believe that a county-based model will be more accurate and less prone to over-fitting, since it would have far more observations to learn from.
Link to official write-up of process:
catboost, numpy, pandas, python, regression, scikit-learn