STAT 430 – Group Project Progress Report
#06 – Pythonites, End the Pandemics!
Eunjeong Ro (ro12), Tess Yang (tiannuo2), Virgil Chen (hongfei6)
Abstract – In this project, we predict the number of COVID-19
deaths by country. We gather data through an API and build
prediction models on each country's historical death counts. To
identify the best prediction model, we compare time series analysis
and generalized linear modeling methods. Finally, we will visualize
our results.
Keywords – COVID-19, Prediction modeling, time series,
generalized linear model, API, Visualization
I. INTRODUCTION
Because COVID-19 has had a significant impact worldwide,
most industries need to anticipate its future effects. We take
the insurer’s perspective, inspired by Harris & Yelowitz (2021),
“Did COVID-19 change life insurance offerings?”, which
summarized the impact of COVID-19 on the life insurance
market up to 2021. We will use the most recent data to see
whether the state of the pandemic has changed, so that
insurance providers can review their premium policies or
capital requirements and adjust them accordingly. The project
is also an opportunity for us to practice prediction modeling
with real-world data.
The input data are several features related to the number of
new deaths, including country and newly confirmed cases. The
expected output is the predicted number of COVID-19 deaths
for each country.
To build this prediction, we constructed the full pipeline in
Python, from API data collection to modeling and visualization.
We propose two models, a Time Series model and a Generalized
Linear Model (GLM). Both are fitted to the same dataset, and we
will recommend the one with the higher accuracy. [Fig. 1] shows
the entire process of this project.
Fig. 1. The entire process of this project.
II. RELATED WORK
Since many countries are experiencing very large numbers of
COVID-19 cases, it is important to predict trends in confirmed
cases and deaths so that control measures can be implemented
effectively. Many researchers have built prediction models of
future COVID-19 trends with various methods. [1] used SARIMA
time series models to forecast COVID-19 case trends. One of the
most widely used approaches for forecasting time series is the
ARIMA model, which assumes that past values of a series carry
information about its future values. [2] applied the ARIMA model
to predict the number of COVID-19 deaths. The SARIMA model
extends ARIMA with a seasonal (S) component, so it can model
seasonal patterns in addition to non-seasonal structure. In our
modeling process, we plan to use the SARIMA model to improve
prediction performance.
Another widely used approach for predicting future trends
is the Generalized Linear Model (GLM). GLMs cover several
regression methods, including logistic regression. [3] used
generalized logistic growth regression to predict excess deaths
during the first 4 weeks of 2021. [4] also applied logistic
growth modeling to forecast COVID-19 trends. In contrast, [5]
used GLMs with Poisson regression and Negative Binomial
regression. Based on these references, we first built a GLM
with linear regression at this step and will try other
regressions next. In the end, we will recommend the best GLM
based on its accuracy.
III. DATA
A. Data description
We work with COVID-19 confirmed-case and death data in CSV
format. The World Health Organization (WHO) collects and
updates these data on a daily basis, and the dataset is
available on the WHO Coronavirus (COVID-19) Dashboard webpage.
The dataset has 8 features and 248,613 observations from 237
countries. Details are given in [Table. 1]. Anyone can download
the data from the link, so no special step was needed to import
it.
Field name          Description
Date_reported       Date of reporting to WHO
Country_code        ISO Alpha-2 country code
Country             Country, territory, area
WHO_region          WHO regional offices
New_cases           New confirmed cases
Cumulative_cases    Cumulative confirmed cases
New_deaths          New confirmed deaths
Cumulative_deaths   Cumulative confirmed deaths
Table. 1. The detailed information about the dataset from
(https://covid19.who.int/data)
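A minimal sketch of reading a file with this layout into pandas is shown below. The three sample rows are made up so that the example is self-contained; in practice the downloaded CSV path would be passed to the same function. The helper name `load_who_csv` is hypothetical.

```python
import io
import pandas as pd

# Made-up rows in the Table 1 layout; the values are NOT real WHO figures.
SAMPLE_CSV = """Date_reported,Country_code,Country,WHO_region,New_cases,Cumulative_cases,New_deaths,Cumulative_deaths
2020-01-03,AF,Afghanistan,EMRO,0,0,0,0
2020-01-04,AF,Afghanistan,EMRO,2,2,0,0
2020-01-05,AF,Afghanistan,EMRO,3,5,1,1
"""

def load_who_csv(source):
    """Read a WHO-style CSV (path or file-like) and parse the report date."""
    return pd.read_csv(source, parse_dates=["Date_reported"])

df = load_who_csv(io.StringIO(SAMPLE_CSV))
print(df.dtypes)   # column types, with Date_reported parsed as a datetime
print(df.head())   # first rows of the dataset
```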
B. Data examples
Fig. 2. First 5 observations of the dataset
IV. EXPLORATORY DATA ANALYSIS
Before performing prediction modeling, we need to examine
our dataset more closely. [Fig. 3] and [Table. 2] show the
general information and numerical summaries of the COVID-19
dataset. Although ‘Country_code’ contains several null values,
we can disregard this field because the dataset also has
‘Country’. In addition, ‘WHO_region’ is not useful for our
prediction, so it should be removed.
Fig. 3. Brief information of the dataset
Table. 2. Numerical summary
Thus, we generated a new dataset that includes only 6
features, as shown in [Fig. 4].
Fig. 4. A new dataset
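The column reduction described above could be sketched as follows. The two-row frame is a made-up stand-in for the WHO data, used only so that the example runs on its own.

```python
import numpy as np
import pandas as pd

# Made-up rows in the Table 1 layout; the values are illustrative only.
raw = pd.DataFrame({
    "Date_reported": pd.to_datetime(["2020-01-03", "2020-01-04"]),
    "Country_code": ["AF", np.nan],
    "Country": ["Afghanistan", "Namibia"],
    "WHO_region": ["EMRO", "AFRO"],
    "New_cases": [0, 2],
    "Cumulative_cases": [0, 2],
    "New_deaths": [0, 0],
    "Cumulative_deaths": [0, 0],
})

# Count missing values per column before deciding what to drop.
null_counts = raw.isna().sum()

# Keep the six remaining features by dropping the two unused columns.
reduced = raw.drop(columns=["Country_code", "WHO_region"])

print(null_counts)
print(reduced.columns.tolist())
```

One possible cause of the null codes, which has not been checked against this dataset: pandas reads the string "NA" as a missing value by default, and "NA" is also the ISO Alpha-2 code for Namibia. Passing `keep_default_na=False` together with `na_values=[""]` to `pd.read_csv` would keep such a code as text.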
First, we examined the number of cumulative COVID-19 deaths
across the world since 01/03/2020. [Fig. 5] shows that
cumulative deaths grew at an accelerating rate until Jan 2022
and that growth has slowed since then. The overall shape
resembles logistic growth, so we fit a Logistic growth model
to the data.
Fig. 5. Distribution of the number of deaths due to COVID-19
Next, we examined growth by case type (confirmed cases,
deaths, and cumulative cases) in [Fig. 6]. Confirmed cases rose
sharply in Jan 2022 compared with the periods before and after.
A growth rate that peaks and then slows down further supports
the Logistic growth model.
Fig. 6. Growth of different types of cases
Because the dataset covers too many countries (237), we
selected target countries to make the modeling more meaningful
for real-world use. We took the 10 countries with the highest
number of deaths in the last 30 days, as shown in [Fig. 7].
Lastly, we plotted cumulative deaths for these Top 10 countries
to check that Logistic growth is a reasonable model for each of
them, as shown in [Fig. 8].
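The country selection can be sketched as a group-and-rank step. The function name `top_countries_by_recent_deaths` and the three synthetic countries are hypothetical; the column names follow Table 1.

```python
import pandas as pd

def top_countries_by_recent_deaths(df, days=30, n=10):
    """Rank countries by total New_deaths over the last `days` days of data."""
    last_day = df["Date_reported"].max()
    cutoff = last_day - pd.Timedelta(days=days)
    # Strictly after the cutoff keeps exactly `days` daily rows per country.
    recent = df[df["Date_reported"] > cutoff]
    totals = recent.groupby("Country")["New_deaths"].sum()
    return totals.nlargest(n)

# Synthetic data: three made-up countries with constant daily deaths for 60 days.
dates = pd.date_range("2022-09-01", periods=60, freq="D")
frames = [
    pd.DataFrame({"Date_reported": dates, "Country": name, "New_deaths": daily})
    for name, daily in [("A", 5), ("B", 20), ("C", 1)]
]
toy = pd.concat(frames, ignore_index=True)

top = top_countries_by_recent_deaths(toy, days=30, n=2)
print(top)
```

With the real data, calling the function with `n=10` on the reduced dataset would give the ranking behind the Top 10 selection, assuming the selection is based on summed daily deaths.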
Fig. 7. Top 10 countries with highest number of death cases in last 30 days
Numerical summary values for [Table. 2]:
        New cases   Cumulative cases   New deaths   Cumulative deaths
count   2.49E+05    2.49E+05           248613       2.49E+05
mean    2.55E+03    9.68E+05           26.522004    1.47E+04
std     2.87E+04    4.65E+06           250.072177   6.49E+04
min     -8.26E+03   0.00E+00           -60          0.00E+00
25%     0.00E+00    7.33E+02           0            7.00E+00
50%     1.70E+01    1.99E+04           0            2.46E+02
75%     4.35E+02    2.64E+05           5            4.07E+03
max     5.54E+06    9.68E+07           23278        1.06E+06
Fig. 8. Cumulative death cases for Top 10 countries
V. PRELIMINARY TECHNICAL DETAILS AND RESULTS
A. Proposed model
Our first proposed prediction model is the Generalized
Logistic Growth Model within the GLM framework. Fitting this
model required two preparatory steps. First, the dataset was
split into training and testing parts, with the testing set
being the most recent 30 days and the training set being the
historical data before that. Second, we prepared our logistic
model, $f(t) = \frac{c}{1 + a e^{-b t}}$, where $t$ is time in
days, with parameters (a, b, c) to be trained.
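The two steps above can be sketched with `scipy.optimize.curve_fit`. This is a sketch under stated assumptions, not the exact code behind the reported results: the curve form c / (1 + a·exp(-b·t)), the parameter bounds, the starting values, and the synthetic series are all chosen for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(t, a, b, c):
    """Logistic growth curve (assumed form): c is the plateau, b the growth rate."""
    return c / (1.0 + a * np.exp(-b * t))

# Synthetic cumulative series standing in for one country's deaths.
rng = np.random.default_rng(0)
t = np.arange(300, dtype=float)
true_params = (200.0, 0.05, 5000.0)  # made-up values for illustration
y = logistic(t, *true_params) + rng.normal(0, 10, size=t.size)

# Step 1: the last 30 days form the test set, everything before is training.
test_days = 30
t_train, t_test = t[:-test_days], t[-test_days:]
y_train, y_test = y[:-test_days], y[-test_days:]

# Step 2: fit (a, b, c) on the training part only.
p0 = (100.0, 0.1, y_train.max())   # assumed starting values
bounds = (0, [1e5, 3.0, 1e9])      # assumed non-negative, loosely capped bounds
params, _ = curve_fit(logistic, t_train, y_train, p0=p0, bounds=bounds)

# Predict the held-out 30 days with the fitted curve.
pred_test = logistic(t_test, *params)
print(params)
```

A fitted parameter that lands exactly on one of the bounds is a sign that the bound, rather than the data, determined its value, so the bounds may need widening in that case.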
Taking China as an example (see the Appendix for the model
results of the other nine Top 10 countries), [Fig. 9] to
[Fig. 12] show the fitted Logistic Growth Model against the
real data, together with the preliminary error results.
Fig. 9. Logistic model vs. the training data (China)
Fig. 10. Preliminary results of errors on the training data (China)
Fig. 11. Logistic model vs. the testing data (China)
Fig. 12. Preliminary results of errors on the testing data (China)
B. Data story
For China, the three parameters (a, b, c) were initialized at
[1.91712652, 2.26087307, 0.86812452] and converged to
[99999.93749892835, 0.003568183275081294,
72576545.49357598]. The model has higher error rates on the
training data because the series deviates more from the
Logistic Growth Model during the period of rapid growth. It
fits better, with lower error rates, on the testing data, where
growth has slowed. Judging by the preliminary error results,
the model fits the testing data fairly well. We therefore
consider the Logistic Growth Model a good fit for China’s
cumulative COVID-19 death trend.
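The error measure is not spelled out in this report. As one possible choice, the sketch below computes the mean absolute percentage error (MAPE) between observed and predicted values. The use of MAPE and the three sample values are assumptions for illustration.

```python
import numpy as np

def percentage_errors(actual, predicted):
    """Absolute percentage error per observation, skipping zero actual values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    nonzero = actual != 0  # avoid division by zero early in the series
    return np.abs(predicted[nonzero] - actual[nonzero]) / np.abs(actual[nonzero]) * 100.0

def mape(actual, predicted):
    """Mean absolute percentage error over the observations with nonzero actuals."""
    return float(np.mean(percentage_errors(actual, predicted)))

# Made-up numbers: errors of 10%, 5%, and 0% respectively.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 400.0]
print(mape(actual, predicted))
```

Computing the same measure separately on the training and testing parts would give one comparable number per country and per split.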
C. Results
Fig. 13. Preliminary results with train data by Top 10 countries
Fig. 14. Preliminary results with test data by Top 10 countries
D. Discussions
First, we changed the dataset used for this project because
the one chosen in our proposal stopped updating. Accordingly,
our subjects changed from “States” to “Countries”. In addition,
we used the Logistic Growth Model, which uses the same S-shaped
logistic function as logistic regression, to fit the logistic
trend identified in our data. As our next step, we will move on
to time series analysis and compare SARIMA models with the
current Logistic Growth Model to obtain a better result. We
will then explore whether the better model can explain the
difference between the pre-vaccine and post-vaccine periods.
VI. APPENDIX
A. Timeline of work
ID  Task name             Start date  End date  Duration (days)
1   Data preprocessing    11/5        11/11     6
2   GLM modeling          11/12       11/18     6
3   Time Series modeling  11/19       11/25     6
4   Assessment            11/26       12/2      6
5   Visualization         12/3        12/9      6
6   Final report          12/10       12/14     4
REFERENCES
[1] Tan, Cia, et al., Forecasting COVID-19 Case Trends Using
SARIMA Models during the Third Wave of COVID-19 in
Malaysia, International Journal of Environmental Research and
Public Health. (2022)
[2] Navid M., et al., Predicting number of Covid19 deaths using
Time Series Analysis (ARIMA MODEL), Towards Data
Science. (2020)
[3] Dahal S., et al., Characterizing all-cause excess mortality
patterns during COVID-19 pandemic in Mexico, BMC
Infectious Diseases 21:432. (2021)
[4] Elinor A-S., et al., Generalized logistic growth modeling of
the COVID-19 pandemic in Asia, Infectious Disease Modelling
5, 502-509. (2020)
[5] Temesgen B. B., Modeling Mortality from COVID-19 Using
Poisson Based Regressions: The Case of Sweden, U.U.D.M.
Project Report 2022:9. (2022)
CONTRIBUTIONS
Eunjeong Ro (34%): Coming up with the general idea of
how the project will be conducted based on the new data.
Conducting basic and exploratory data analysis.
Tess Yang (33%): Arranging introduction of the progress
report and introducing related scholarly work and how they
might be related to our project. Grammar check.
Virgil Chen (33%): Coming up with the general idea of how
the project will be conducted based on the new data. Fitting the
Logistic Growth Model and analyzing preliminary results of
errors.
[Gantt chart of the timeline of work: date axis from 11/4 to
12/14; tasks as listed in the timeline table; legend: in
progress, scheduled]