Ayshwarya Srinivasan and Vivek Sahoo

Introduction

This project aims to predict two aspects of COVID-19:

  1. Number of Confirmed Cases
  2. Number of Fatalities

The project is split into two parts:

  1. Exploratory Data Analysis
  2. Machine Learning Models for Predictions and final submission to Kaggle

Multiple datasets are used for the analysis:

  • Kaggle files on cases to date
  • Government measures by country
  • COVID health indicators by country
  • Distance from China
  • Web-scraped population data for different countries

Data Collection

The data for the COVID-19 forecast was sourced from Kaggle. The dataset has six files:

  • One for the measures taken by governments
  • COVID indicators
  • Daily reports by Johns Hopkins
  • Train and test datasets
  • Submission file

Apart from this, we also scraped the total populations of different countries from Wikipedia using BeautifulSoup.
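As a rough illustration of the scraping step, the sketch below pulls country and population pairs out of an HTML table. It is not the authors' code: to stay free of third-party dependencies it uses Python's standard-library html.parser in place of BeautifulSoup, and the HTML fragment, country names, and population figures are made up. With BeautifulSoup the same idea would be a loop over the rows returned by `soup.find_all("tr")`.

```python
from html.parser import HTMLParser

# Hypothetical HTML fragment shaped like a Wikipedia population table.
# Country names and figures are invented for illustration.
SAMPLE_HTML = """
<table class="wikitable">
  <tr><th>Country</th><th>Population</th></tr>
  <tr><td>Exampleland</td><td>1,234,567</td></tr>
  <tr><td>Samplestan</td><td>89,000</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows
        self._row = None   # cells of the row currently being read
        self._cell = None  # text chunks of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def parse_population_table(html):
    """Return {country: population} from a two-column HTML table."""
    parser = TableParser()
    parser.feed(html)
    parser.close()
    _header, *body = parser.rows  # skip the header row
    return {country: int(pop.replace(",", "")) for country, pop in body}


populations = parse_population_table(SAMPLE_HTML)
print(populations)
```

Stripping the thousands separators before converting to int matters here, because population tables typically format numbers with commas.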

Exploratory Data Analysis

We start by exploring which measures taken by various governments are the most popular. Here we visualize only the major categories; each major category has several sub-categories. For example, Public health includes health screenings at airports and border crossings and the introduction of quarantine policies, among many others.

We next visualize the trend in numbers across the world. The number of confirmed cases has been increasing exponentially, with growth appearing to take off around March 20th.

To understand which countries have been affected the most, we look at the 15 countries with the highest fatality counts. As of the first week of April, Italy had the most fatalities. The counts have changed since, but we focus on the data up to April.

We wanted to dive deeper and understand whether population density contributes to a lack of social distancing and therefore to more positive cases. We visualized the number of confirmed cases against population density (per sq km). Surprisingly, the trend was not linear as we had expected: even in countries with low population density, the number of cases was disproportionately high.

Modelling

Steps for predicting fatalities and confirmed cases

We follow the below steps for the modelling process:

  1. Load Required Packages and Datasets
  2. Combine Train and Test Data
  3. Join Government Measures Data & Distance from China & COVID Indicators
  4. Prepare Data
  5. Split into Train and Test Sets
  6. Functions to make predictions
  7. Linear Models: Linear Regression, Lasso Regression, Ridge Regression
  8. Non-Linear Models: Decision Trees, Random Forests, Gradient Boosting
  9. Choosing the Best Model for Submission
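Steps 2 and 3 above might look roughly like the following sketch. It assumes pandas, which the write-up does not name, and every frame, column name, and value here is a hypothetical stand-in for the real Kaggle files.

```python
import pandas as pd

# Hypothetical miniature versions of the train and test files.
train = pd.DataFrame({
    "Country": ["A", "B"],
    "Date": ["2020-03-20", "2020-03-20"],
    "ConfirmedCases": [100.0, 80.0],
})
test = pd.DataFrame({
    "Country": ["A", "B"],
    "Date": ["2020-04-01", "2020-04-01"],
})

# Step 2: stack train and test so that features are built once for both,
# keeping a flag to separate them again later.
combined = pd.concat(
    [train.assign(is_train=True), test.assign(is_train=False)],
    ignore_index=True,
)

# Hypothetical country-level side tables.
measures = pd.DataFrame({"Country": ["A", "B"], "n_measures": [12, 7]})
distance = pd.DataFrame({"Country": ["A"], "distance_km": [8000.0]})

# Step 3: left joins keep every case row even when a side table
# has no entry for that country (the new columns are NaN there).
features = (
    combined
    .merge(measures, on="Country", how="left")
    .merge(distance, on="Country", how="left")
)
print(features)
```

A left join is used so that no case rows are dropped; the missing values it leaves behind would then be handled in the data preparation step.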

Our focus in the modelling process is on predicting fatalities and confirmed cases, so we do not dive into the inferences that can be drawn from each model and only evaluate their predictive performance.
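The comparison loop behind steps 5 to 9 can be sketched as below. This is an illustration under assumptions rather than the authors' pipeline: it assumes scikit-learn (not named in the write-up), default hyper-parameters, a random 80/20 split, and synthetic data, so the scores it prints are unrelated to the project's results.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_log_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic, non-negative target with roughly exponential growth in one feature.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 3))
y = np.exp(0.4 * X[:, 0]) + rng.poisson(2, size=300)

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "Linear Regression": LinearRegression(),
    "Lasso Regression": Lasso(),
    "Ridge Regression": Ridge(),
    "Decision tree": DecisionTreeRegressor(random_state=0),
    "Random forest": RandomForestRegressor(random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Linear models can predict negative counts, which the log error
    # cannot handle, so predictions are clipped at zero first.
    preds = np.clip(model.predict(X_val), 0, None)
    scores[name] = float(np.sqrt(mean_squared_log_error(y_val, preds)))

for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name:20s} {score:.4f}")

best = min(scores, key=scores.get)
```

Keeping the models in a dictionary means the fitting and scoring code is written once, and adding another candidate model is a one-line change.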

Result

The root mean squared logarithmic error (RMSLE) for each model is shown below:

Model               RMSLE
Linear Regression   1.328103957784514
Lasso Regression    1.419285223064493
Ridge Regression    1.646777130221392
Decision tree       0.9097906897935402
Random forest       1.7823492895858402
Gradient Boosting   0.9291709468808671

The lower the RMSLE, the better the model performs. Since the decision tree achieved the lowest RMSLE, we chose it to make predictions on the test set and submitted them to Kaggle.
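For reference, RMSLE is the square root of the mean squared difference between log(1 + prediction) and log(1 + actual). The sketch below is a plain-Python version written for illustration (the function name and the toy inputs are ours). It also picks the lowest score from the values reported in the table above.

```python
import math


def rmsle(y_true, y_pred):
    """Root mean squared logarithmic error for non-negative values."""
    squared_log_errors = [
        (math.log1p(pred) - math.log1p(true)) ** 2
        for true, pred in zip(y_true, y_pred)
    ]
    return math.sqrt(sum(squared_log_errors) / len(squared_log_errors))


# RMSLE penalizes relative error: predicting 19 for 9 and 199 for 99 both
# double (1 + value), so both give the same score, log(2).
small_scale = rmsle([9], [19])
large_scale = rmsle([99], [199])

# Validation scores as reported in the table above.
reported = {
    "Linear Regression": 1.328103957784514,
    "Lasso Regression": 1.419285223064493,
    "Ridge Regression": 1.646777130221392,
    "Decision tree": 0.9097906897935402,
    "Random forest": 1.7823492895858402,
    "Gradient Boosting": 0.9291709468808671,
}
best_model = min(reported, key=reported.get)
print(best_model)
```

Because the error is computed on a log scale, a miss on a country with few cases counts as much as a proportionally equal miss on a country with many, which suits counts that span several orders of magnitude.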

The test RMSLE is 1.23572, which places us in the top 200 submissions on Kaggle. These are vanilla models, so there is considerable scope to improve the predictions through hyper-parameter tuning and more complex techniques.