Authors - Pandemic Outbreak and mitigation prediction

1. Overview

This project is a tool for predicting epidemic trends, built with artificial intelligence and statistical methods. The forecasting model gives a rough estimate of the future scenario and suggests non-pharmaceutical/mitigation measures to control the outbreak with minimum effort. This gives authorities a head start in preparing to curb the pandemic before it costs lives.

Note: An NPI is the same as a mitigation measure.

2. What exactly is the problem?

During any pandemic, it is difficult to scale up the implementation of mitigation measures, often because of the chaos the pandemic causes. Authorities face an unfamiliar situation and may lack a sound basis for deciding which step to take next, which makes the situation even worse. It is not always necessary to implement the strongest mitigation measure: a medium-strength measure can often get the job done while giving more weight to economic stability and other concerns.

3. What can be done to tackle this issue?

A strategy that gives a rough picture of the future scenario, describing the number of cases and the area of spread, can show what could be done to reduce the impact in an easy and cost-effective manner. A record of previously successful steps strengthens this strategy further.

4. Our Goals

a. To forecast the number of cases and the trends in the spread, giving a good picture of how the scenario will unfold.

b. To suggest the most suitable mitigation measures based on previously successful steps, thus saving resources and avoiding chaos.

c. To make this approach a robust one, so that any agency working on


Prototype stage: We have completed our first stage of training and testing on COVID-19 data and have achieved over 90% accuracy in predicting new cases for the immediate next day and over 85% accuracy in predicting the long-term scenario.

On the mitigation prediction part, we achieved an accuracy of 91.8%, bringing the Hamming loss down to 8.2%.
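For readers unfamiliar with the metric, here is a minimal sketch of how Hamming loss relates to accuracy when each entry has a single category label, which would be consistent with the two figures above adding up to 100%. The labels below are made-up toy values, not the project's data, and the use of scikit-learn's metric functions is an assumption about how the numbers were computed.

```python
from sklearn.metrics import accuracy_score, hamming_loss

# Hypothetical NPI category labels (1, 2 or 3) for ten entries.
y_true = [1, 1, 2, 2, 3, 3, 1, 2, 3, 3]
y_pred = [1, 1, 2, 3, 3, 3, 1, 2, 3, 2]

# Hamming loss is the fraction of labels predicted wrongly.
loss = hamming_loss(y_true, y_pred)
accuracy = accuracy_score(y_true, y_pred)

# For single-label classification the two always sum to 1.
print(f"accuracy={accuracy:.3f}, hamming loss={loss:.3f}")
```

In a multi-label setting the two metrics diverge, because Hamming loss counts individual label slots while accuracy requires the whole label set to match.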

Accuracy: Based on the results above, we believe our method is among the more accurate approaches for predicting such trends.


Our submission is a script containing the machine-learning models, which can be extended with a UI like the one shown in the gallery picture.

7. Technical details

Major tools used:

a. Kalman filter: It’s an algorithm that uses a series of measurements observed over time, containing statistical noise and other inaccuracies, and produces estimates of unknown variables that tend to be more accurate than those based on a single measurement alone, by estimating a joint probability distribution over the variables for each timeframe.

b. Regression analysis: It’s a set of statistical processes for estimating the relationships between a dependent variable (often called the 'outcome variable') and one or more independent variables (often called 'predictors', 'covariates', or 'features').

c. Scikit-learn: Scikit-learn (formerly scikits.learn and also known as sklearn) is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means, and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.
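To make the three tools above concrete, here is a minimal sketch that smooths a noisy case series with a hand-written scalar Kalman filter and then fits a scikit-learn linear regression to the filtered values. This is an illustration under stated assumptions, not the project's implementation: the random-walk state model, the noise variances, the function name `kalman_smooth_1d` and the case counts are all hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def kalman_smooth_1d(measurements, process_var=1.0, measurement_var=25.0):
    """Scalar Kalman filter with a random-walk state model (assumed model)."""
    x = float(measurements[0])  # initial state estimate
    p = measurement_var         # initial estimate variance
    estimates = []
    for z in measurements:
        # Predict: a random walk keeps the state and grows the uncertainty.
        p = p + process_var
        # Update: blend the prediction with the new measurement.
        k = p / (p + measurement_var)  # Kalman gain, between 0 and 1
        x = x + k * (z - x)
        p = (1.0 - k) * p
        estimates.append(x)
    return np.array(estimates)


# Hypothetical noisy daily case counts.
noisy_cases = np.array([100, 112, 105, 130, 128, 150, 149, 171], dtype=float)
filtered_cases = kalman_smooth_1d(noisy_cases)

# Regress the filtered series on the day index and extrapolate one day ahead.
days = np.arange(len(filtered_cases)).reshape(-1, 1)
model = LinearRegression().fit(days, filtered_cases)
next_day_prediction = model.predict(np.array([[len(filtered_cases)]]))[0]
print(filtered_cases.round(1), round(next_day_prediction, 1))
```

The two variances control how strongly the filter trusts new measurements over its running estimate; in practice they would need tuning to the data.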

8. Dataset Description

Some details regarding the columns of mastersheet prediction are:

Each row is an entry/instance of a particular NPI being implemented.

Country: The country to which the entry belongs.

Net migration: The net migration rate is the difference between the number of immigrants (people coming into an area) and the number of emigrants (people leaving an area) throughout the year.

Population density: Population density is the number of individuals per unit geographic area, for example, number per square meter, per hectare, or per square kilometer.

Sex Ratio: The sex ratio is the ratio of males to females in a population.

Population age-distribution: Age distribution, also called age composition, is the proportionate number of persons in successive age categories in a given population. (0-14yrs/60+yrs %)

Health physicians per 1000 population: Number of medical doctors (physicians), including generalist and specialist medical practitioners, per 1 000 population.

Mobile cellular subscription per 100 inhabitants: Mobile cellular telephone subscriptions are subscriptions to a public mobile telephone service that provides access to the PSTN using cellular technology.

Active on the day: The number of active COVID-19 cases in that particular country on the day the NPI was implemented.

Seven-day, twelve-day and thirty-day predictions: Predicted active cases 7, 12 and 30 days after the date the NPI was implemented.

Date implemented: Converted to a weekday/weekend flag to make it usable for training.

The last column represents the category to which the implemented NPI belongs.
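The weekday/weekend conversion of the implementation date can be sketched with pandas as follows. The column names `date_implemented` and `is_weekend` and the sample dates are hypothetical; the actual sheet may use different names and encoding.

```python
import pandas as pd

# Hypothetical implementation dates for three NPI entries.
df = pd.DataFrame({"date_implemented": ["2020-03-14", "2020-03-16", "2020-03-22"]})

dates = pd.to_datetime(df["date_implemented"])
# dayofweek runs from Monday=0 to Sunday=6, so 5 and 6 are the weekend.
df["is_weekend"] = (dates.dt.dayofweek >= 5).astype(int)
print(df)
```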

9. I/O

Input: The epidemic data up to a certain date, such as the number of infected people, demographics, travel history of the infected patients, the dates, etc.

Output: 1) A prediction of the number of people who will be infected in the next 30 days. 2) The countries that will be affected in the next 30 days. 3) The mitigation/restriction measures to enforce, such as curfew or social distancing, to control the outbreak with minimal effort.

10. Dividing the measures into categories:

Category 1: Public-health measures and social distancing.

Category 2: Social and economic measures and movement restrictions.

Category 3: Partial/complete lockdown.

To categorize the NPIs, we followed a 5-step analysis:

Step 1: We chose 6 different countries that have each implemented at least one of the above-mentioned NPIs.

Step 2: We chose a particular date on which one of the NPIs was implemented.

Step 3: From the chosen date, we calculated the 5-day, 8-day and 12-day growth rates in the number of confirmed cases in that country.

Step 4: According to references 1) and 2), we assumed that over 50% of the people infected on day 1 show symptoms by day 5, over 30% of the people infected on day 1 show symptoms by day 8, and the last 20% start showing symptoms by day 12. Assuming that they get a checkup as soon as they show symptoms, we calculated a cumulative growth rate.

Step 5: This cumulative growth rate was not directly comparable across countries because their population densities differ. So we normalized the scores obtained in step 4 by the population densities. That gave us the following results.

More information can be found here: link

Format: (cumulative growth rate (normalized), country name, measure taken)

[ (896.4961042933885, 'CHINA', 'SOCIAL DISTANCING'),
(720.7571447424511, 'FRANCE', 'PUBLIC HEALTH MEASURES'),
(578.0345389562175, 'SPAIN', 'SOCIAL AND ECONOMIC MEASURES'),
(527.7087251438776, 'IRAN', 'MOV RESTRICTION'),
(484.1021819976962, 'ITALY', 'PARTIAL LOCKDOWN'),
(207.67676767676767, 'INDIA', 'COMPLETE LOCKDOWN') ]

The analysis above lists the measures in decreasing order of growth rate, which we read as increasing order of strength. This is not very accurate for various other reasons, but it gives a rough estimate of the effectiveness/strength of the NPIs.
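One possible reading of steps 3 to 5 is sketched below. The write-up does not give the exact formula, so everything here is an assumption: the weights 0.5, 0.3 and 0.2 mirror the symptom-onset shares from step 4, the growth rate is taken as the percentage increase over the day-0 count, and normalization is taken to mean division by population density. The function names and case numbers are hypothetical.

```python
def growth_rate(cases_start, cases_later):
    """Percentage increase in confirmed cases relative to the starting day."""
    return (cases_later - cases_start) / cases_start * 100.0


def npi_score(cases_by_day, population_density, weights=None):
    """Weighted cumulative growth rate, divided by population density.

    The weighting scheme and the division are assumptions, not the
    project's confirmed formula.
    """
    if weights is None:
        weights = {5: 0.5, 8: 0.3, 12: 0.2}
    start = cases_by_day[0]
    cumulative = sum(
        weight * growth_rate(start, cases_by_day[day])
        for day, weight in weights.items()
    )
    return cumulative / population_density


# Hypothetical confirmed-case counts on days 0, 5, 8 and 12 after an NPI.
example_cases = {0: 1000, 5: 1500, 8: 2000, 12: 3000}
score = npi_score(example_cases, population_density=100.0)
print(round(score, 3))
```

Under this reading, a lower score after an NPI would indicate slower growth and therefore a stronger measure, which matches the ordering described above.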

11. Working

a. The inputs describing the previous days’ record of the outbreak are first filtered by the Kalman filter, and the filtered inputs are then sent to the regression model, which predicted the scenario more accurately than the simple regression models we tried.

b. The predictions from the above models are then fed into the machine-learning model, which predicts the mitigation measures to use (for example, social distancing) based on the previous history given in the literature.

c. We performed 10-fold cross-validation by dividing our dataset into 10 chunks and running the model 10 times. For each run, we designate one chunk for testing and use the other 9 for training, so that every data point is used for testing once and for training in the remaining runs.
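The 10-fold cross-validation described in step c can be sketched with scikit-learn as follows. The write-up does not name the classifier, so the random forest here is a stand-in, and the synthetic three-class dataset only imitates the shape of the NPI category problem.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in data: 200 entries, 10 features, 3 NPI categories.
X, y = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# Hypothetical model choice; any scikit-learn classifier would fit here.
model = RandomForestClassifier(n_estimators=50, random_state=0)

# Ten folds: each chunk is the test set once and part of training 9 times.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
mean_accuracy = scores.mean()
print(f"mean accuracy over {len(scores)} folds: {mean_accuracy:.3f}")
```

Shuffling before splitting guards against any ordering in the rows, such as entries being grouped by country.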

12. Conclusions

This method can help authorities choose and predict mitigation measures that control the outbreak effectively with minimum effort and chaos.

13. What did we learn?

a. This project was challenging in terms of conceptualization and data collection, as no direct data was available. We learned how to take relevant data from different datasets, engineer it, and use it for our purpose.

b. The regular regression algorithms failed to give accurate results, so we had to think of something different to increase accuracy. We came across the idea of using the Kalman filter, and with the filtered inputs we achieved better accuracy.

c. Since we only took regions with more than 1000 cases to keep the data meaningful, the overall dataset became small and deep-learning models failed. This made us switch to classical machine-learning algorithms.

d. We also used clustering algorithms, which gave us a deeper understanding of why these work better in some situations.

e. Because of some problems we ran into, we ended up using both R and Python in a single notebook, which was exciting and added to our learning.

14. The drawbacks of our approach

a. The approach described above has several drawbacks, one of which is an incomplete dataset.

b. There are no strongly differentiating features in the dataset.

c. With our approach, we are not able to determine the effectiveness of NPIs or a go-to plan of action for deploying them.

All the data points are very similar to one another, which makes it difficult for the algorithm to learn.

15. What improvements do we want to make further?

a. A set of strong differentiating features could be added to the dataset, which would make generalization easier.

b. The NPIs can be categorized further for better implementation.

c. The dataset can also be combined with economic parameters to understand the economic feasibility of implementing an NPI.

d. It can further be used to predict the decrease in growth rates once an NPI is implemented, to track the real-time effectiveness of the NPIs in a particular demographic.

16. References

All the other references are mentioned in the submission notebook at every step.

17. Product Roadmap

The team has built the core functionality of the platform. We are currently bringing our front end up to speed with our UX designer's wireframes. Below is our product roadmap after the hackathon submission:

a. New security features

b. Admin dashboard

c. Analytical graphs

18. The team

a. Saketh Bachu - Machine-learning

b. Gauri Dixit - UI/UX development

c. Shaik Imran - Medical Expert/Design

Built with: kalmanfilter, matplotlib, numpy, pandas, scikit-learn
