Data Science vs. Covid-19

Do you think it's possible to predict which countries are the most at risk for the next few weeks using Machine Learning and economic data?


We realized that what caused the virus to spread so quickly throughout the world was the passivity of governments. It felt as though every one of them jumped from stage 1 (“We are fine, it is far less dangerous than the typical winter flu”) to stage 10 (“Close everything! Stay at home! We need masks and respirators!”) in a couple of days. There are still nations that have not yet been heavily impacted by Covid-19, regions that have the opportunity to avoid making the same mistakes as Italy or the United States. We thought that by combining our knowledge of economics with our interest in machine learning and AI, we could develop an algorithm that shows these countries the exponential growth rate they are predicted to face, based on the characteristics of their economy, if they do not act.

What it does

It splits countries into two categories: “impacted” and “not impacted”. The former is made up of countries that have recorded 250 or more confirmed cases, the latter of countries with between 0 and 249. The algorithm then uses the larger body of data from the “impacted” countries to predict the exponential growth rate of the spread of coronavirus in countries that can still make a difference, and ranks them with the most at risk at the top. In this way, we believe we can identify where organizations and NGOs should be particularly alert to a crisis that may be coming soon.
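The split and the ranking described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the project's actual code: the function names, the country labels, and every number in the usage note are made-up placeholders, and a country with exactly 250 cases is assumed to count as “impacted”.

```python
IMPACT_THRESHOLD = 250  # confirmed-case cutoff taken from the description above


def split_countries(confirmed_cases):
    """Split a {country: confirmed case count} mapping into two groups.

    Assumption: a country with exactly IMPACT_THRESHOLD cases is "impacted".
    """
    impacted = {}
    not_impacted = {}
    for country, count in confirmed_cases.items():
        if count >= IMPACT_THRESHOLD:
            impacted[country] = count
        else:
            not_impacted[country] = count
    return impacted, not_impacted


def rank_by_risk(predicted_growth):
    """Sort (country, predicted growth rate) pairs, highest rate first."""
    return sorted(predicted_growth.items(), key=lambda item: item[1], reverse=True)
```

For example, with placeholder counts, split_countries({"Country A": 1200, "Country B": 249}) puts Country A in the first group and Country B in the second. Passing placeholder predictions such as {"Country B": 0.12, "Country D": 0.31} to rank_by_risk lists Country D first.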

How we built it

Our algorithm uses data from the World Bank, Johns Hopkins University, and a Kaggle dataset based on the US CIA Factbook. We worked extensively in Python with the following libraries: pandas, XGBoost, scikit-learn, and NumPy.
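The write-up does not say how the exponential growth rate was derived from the case counts. One common approach, assumed here purely for illustration, is to model cases as c0 * exp(r * t) and estimate r as the least-squares slope of ln(cases) against the day index. The function name below is hypothetical, and the sketch uses only the standard library rather than the libraries listed above.

```python
import math


def exponential_growth_rate(daily_cases):
    """Estimate r in cases(t) = c0 * exp(r * t) from cumulative daily counts.

    Assumed method (not confirmed by the project): ordinary least-squares
    slope of ln(cases) against the day index 0, 1, 2, ...
    """
    if len(daily_cases) < 2 or any(count <= 0 for count in daily_cases):
        raise ValueError("need at least two strictly positive counts")
    days = range(len(daily_cases))
    logs = [math.log(count) for count in daily_cases]
    mean_day = sum(days) / len(logs)
    mean_log = sum(logs) / len(logs)
    covariance = sum((t - mean_day) * (y - mean_log) for t, y in zip(days, logs))
    variance = sum((t - mean_day) ** 2 for t in days)
    return covariance / variance
```

A series that doubles every day, such as [100, 200, 400, 800], gives r = ln 2, roughly 0.693 per day. The corresponding doubling time is ln 2 / r.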

Challenges we ran into

First, we struggled to find information about countries' economies: many data providers were inconsistent or skipped some nations. Second, data cleaning took us a lot of time, as small countries tended to have many “NaN” values. Third, as we were joining 3 datasets together, we had to make sure that each of the 200+ countries was spelled the same way in all of them. Finally, because we only had 70 countries to train our model on, we had to be careful about the risk of overfitting.
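The country-name problem can be handled by mapping every spelling to one canonical name before joining. The sketch below uses plain Python dictionaries so that it stays self-contained; the project itself used pandas. The alias table, the function names, and the feature values in the usage note are illustrative assumptions, not the project's real mapping or data.

```python
# Illustrative alias table (an assumption): lower-cased variant -> canonical name.
ALIASES = {
    "korea, rep.": "South Korea",
    "korea, south": "South Korea",
    "russian federation": "Russia",
    "us": "United States",
}


def canonical_name(raw):
    """Collapse whitespace, then map known variants to one canonical spelling."""
    cleaned = " ".join(raw.split())
    return ALIASES.get(cleaned.lower(), cleaned)


def join_sources(*sources):
    """Inner-join several {country: {feature: value}} mappings.

    Returns the merged rows plus the sorted names that did not appear in
    every source, so that spelling mismatches can be reviewed by hand.
    """
    normalized = [
        {canonical_name(name): row for name, row in source.items()}
        for source in sources
    ]
    name_sets = [set(source) for source in normalized]
    common = set.intersection(*name_sets)
    merged = {}
    for name in sorted(common):
        combined = {}
        for source in normalized:
            combined.update(source[name])
        merged[name] = combined
    unmatched = sorted(set.union(*name_sets) - common)
    return merged, unmatched
```

With placeholder values, joining {"Korea, Rep.": {"gdp": 1.0}} with {"Korea, South": {"cases": 10}, "US": {"cases": 30}} yields a single merged row for South Korea and reports United States as unmatched. Returning the unmatched list makes it easier to spot names that still need an alias.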

Accomplishments that we are proud of

As Bachelor in Economics students, we are extremely proud to have completed this project. We believe our background gave us a unique view of the problem, focusing on important features and real-world applicability. We are also proud to have used what we learned about data science and machine learning in our Business Intelligence and Analytics class, as well as in online courses, for a good cause!

What's next for Data Science vs. Covid-19

We would like to present our project to international organizations (WHO and NGOs). With more time and data, we could improve the accuracy of our model and make it more powerful. While many studies have already estimated the number of cases, none of them looked into economic data to predict the growth rate. This project could act as a tool for public officials and NGOs to identify where growth is most likely to be especially high, putting hospitals and international cooperation in an immediate state of need. With this model, we believe we can see where extreme measures need to be taken, and where governments will most need assistance in the upcoming weeks.

For many countries, we believe it is #NotTooLate and that we can act now to prevent many deaths.

jupyter-notebook, machine-learning, pandas, python
