Ayshwarya Srinivasan and Vivek Sahoo
This project aims to predict two aspects of COVID-19
The project is split into two parts:
There are multiple datasets used for the analysis
The data for COVID forecast was sourced from Kaggle. The dataset has 6 files.
Apart from this, we also scraped Wikipedia with BeautifulSoup to obtain the total population of each country.
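The scraping step can be sketched roughly as follows. This is a minimal illustration, not the project's actual scraper: the HTML fragment, the country names, the figures, and the helper `parse_population_table` are all hypothetical, and the code assumes the population table carries the `wikitable` class with the country in the first column and the population in the second.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment shaped like a Wikipedia "wikitable".
# Country names and figures are placeholders, not real data.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Country</th><th>Population</th></tr>
  <tr><td><a href="/wiki/Aland">Aland</a></td><td>1,234,567</td></tr>
  <tr><td><a href="/wiki/Borduria">Borduria</a></td><td>89,012<sup>[1]</sup></td></tr>
</table>
"""

def parse_population_table(html):
    """Return {country: population} from the first table with class 'wikitable'."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.find("table", class_="wikitable")
    populations = {}
    for row in table.find_all("tr"):
        cells = row.find_all("td")
        if len(cells) < 2:
            continue  # header rows contain only <th> cells
        country = cells[0].get_text(strip=True)
        # Remove footnote markers such as <sup>[1]</sup> before reading the number.
        for sup in cells[1].find_all("sup"):
            sup.decompose()
        # Keep digits only, which drops thousands separators and whitespace.
        digits = "".join(ch for ch in cells[1].get_text() if ch.isdigit())
        if digits:
            populations[country] = int(digits)
    return populations

populations = parse_population_table(SAMPLE_HTML)
print(populations)
```

For a live page, the HTML string would first be downloaded (for example with `requests.get(url).text`) and then passed to the same function; the column positions would need to be checked against the actual table.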
We start by exploring which measures taken by various governments are the most common. Here we visualize only the major categories; each major category contains several sub-categories of measures. For example, Public health includes health screenings in airports and border crossings and the introduction of quarantine policies, among many other sub-categories.
We next visualize the trend in case numbers across the world. Confirmed cases have been increasing exponentially, with growth accelerating sharply around March 20th.
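One way to check an "exponential growth" reading of such a curve is to fit a straight line to the logarithm of the counts: if growth is exponential, log(count) is roughly linear in time, and the slope gives the daily growth rate. The sketch below assumes a daily series of strictly positive cumulative counts; the function name `fit_exponential_growth` and the synthetic series are illustrative and are not taken from the project.

```python
import math

def fit_exponential_growth(counts):
    """Least-squares fit of log(count) = a + r * day.

    Returns (r, doubling_time_in_days). Counts must be strictly positive.
    """
    days = list(range(len(counts)))
    logs = [math.log(c) for c in counts]
    n = len(counts)
    mean_x = sum(days) / n
    mean_y = sum(logs) / n
    sxx = sum((x - mean_x) ** 2 for x in days)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, logs))
    rate = sxy / sxx
    return rate, math.log(2) / rate

# Synthetic series that doubles every 3 days (illustrative, not the Kaggle data).
synthetic_counts = [100 * 2 ** (day / 3) for day in range(15)]
rate, doubling_time = fit_exponential_growth(synthetic_counts)
print(f"daily growth rate: {rate:.3f}, doubling time: {doubling_time:.1f} days")
```

On real data the fit is usually applied to a window of days, since the growth rate changes as interventions take effect.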
To understand which countries have been affected the most, we look at the 15 countries with the highest fatality counts. As of the first week of April, Italy had the most fatalities. The counts have changed since, but we focus on the data up to April.
We wanted to dig deeper and understand whether population density hinders social distancing and leads to more positive cases. We plotted the number of confirmed cases against population density (people per sq km). Surprisingly, the trend was not linear as we had expected: even countries with low population density had disproportionately high case counts.
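A quick numerical companion to such a scatter plot is the Pearson correlation coefficient, which is close to zero when there is no linear relationship. The sketch below is only an illustration of that check: the density and case values are made up, and `pearson_r` is a helper written for this example rather than part of the project code.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up values for illustration only (not taken from the project data).
density_per_sq_km = [4, 35, 120, 240, 410, 1200]
confirmed_cases = [9000, 120000, 3000, 45000, 800, 15000]

r = pearson_r(density_per_sq_km, confirmed_cases)
print(f"Pearson r = {r:.2f}")
```

A coefficient near zero would agree with the visual impression of no linear trend, although it says nothing about other kinds of dependence or about confounders such as testing rates.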
Steps for predicting fatalities and confirmed cases
We follow the below steps for the modelling process:
Our focus in the modelling process is on predicting fatalities and confirmed cases, so we do not dwell on the inferences that can be drawn from each model and only evaluate their predictive performance.
The root mean squared logarithmic error (RMSLE) of each model is shown below:
Model | RMSLE |
---|---|
Linear Regression | 1.328103957784514 |
Lasso Regression | 1.419285223064493 |
Ridge Regression | 1.646777130221392 |
Decision tree | 0.9097906897935402 |
Random forest | 1.7823492895858402 |
Gradient Boosting | 0.9291709468808671 |
The lower the RMSLE, the better the model performs. Since the decision tree performed best in terms of RMSLE, we chose it to make predictions on the test set and submitted them to Kaggle.
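For reference, RMSLE is commonly defined as the square root of the mean squared difference between log(1 + prediction) and log(1 + actual). The sketch below implements that standard definition and then picks the model with the lowest score from the table above. The function `rmsle` and the dictionary name are illustrative; the project's own evaluation code may differ.

```python
import math

def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error using log(1 + x) on both inputs."""
    if len(y_true) != len(y_pred):
        raise ValueError("inputs must have the same length")
    squared = [(math.log1p(p) - math.log1p(t)) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))

# Scores copied from the table above.
validation_rmsle = {
    "Linear Regression": 1.328103957784514,
    "Lasso Regression": 1.419285223064493,
    "Ridge Regression": 1.646777130221392,
    "Decision tree": 0.9097906897935402,
    "Random forest": 1.7823492895858402,
    "Gradient Boosting": 0.9291709468808671,
}

# Lower is better, so the best model is the one with the minimum RMSLE.
best_model = min(validation_rmsle, key=validation_rmsle.get)
print(best_model)  # Decision tree
```

Because the error is computed on a log scale, predicting 99 when the true value is 9 costs about as much as predicting 999 when the true value is 99, which suits case counts that span several orders of magnitude.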
The test RMSLE is 1.23572, which places us in the top 200 submissions on Kaggle. These are vanilla models, and there is a lot of scope to improve the predictions through hyperparameter tuning and other more complex techniques.