MATH 4044 – Statistics for Data Sciences
Assignment 1 SP2 2022
Due Sunday 24 April 2022 by 11:55pm
Instructions
• This assignment is worth 25% of your final mark. It is due no later than 11:55pm
on Sunday 24 April, in the first week after the mid-break.
• You will need to submit your assignment via learnonline, including a completed
cover sheet. A partially filled-in cover sheet can be downloaded from the
Assignments page on the course website.
• The submitted assignment needs to be a single file, in either Microsoft Word
(doc or docx) or PDF format, and at most 25 pages excluding any appendices.
• The assignment is out of 100 marks. To achieve maximum marks for each question,
you should aim to:
– Complete the requested statistical analysis in SAS using appropriate tasks or
procedures (40%).
– Include only the output most relevant to the question and interpret all key
results (40%). Do not include every piece of output produced by SAS!
– Discuss the results more broadly in the context of the given scenario (20%).
• Assignments submitted late without an approved extension will attract a
penalty of 10 marks for each working day, or any part thereof, beyond the due
date and time.
MATH 4044 Statistics for Data Sciences Assignment 1
Data Description
Life Expectancy
The World Health Organization (WHO) has been collecting data on the life expectancy
of countries around the world. The data has been collated with potential predictive
factors relating to immunization, mortality and socio-economic status.
We will attempt to identify the factors that have a potential impact on life
expectancy, and propose strategies to improve life expectancy.
References and Data Sources:
• Data was downloaded from
https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who.
Some variables that were deemed inaccurate were removed.
Data file for this assignment
The data file for this assignment is called expectancy.sas7bdat. Variables in the data
are:
Variable           Description
country            Country name
Year               Year of observation
status             Developed or Developing status
AdultMortality     Number of adult deaths (ages 15 to 60) per 1000 population
Alcohol            Alcohol consumption per capita, ages 15+ (litres)
HepatitisB         HepB immunization coverage among 1-year-olds (%)
BMI                Average body mass index of the population
polio              Polio immunization coverage among 1-year-olds
totalExpenditure   Expenditure on health relative to all government expenditures (%)
Diphtheria         DTP3 immunization coverage among 1-year-olds
HIVAIDS            Deaths from HIV/AIDS per 1000 live births (0-4 years)
GDP                Gross Domestic Product per capita (USD)
Population         Population of the country
thinness5to9       Prevalence of thinness among 5 to 9 year-olds (%)
thinness10to19     Prevalence of thinness among 10 to 19 year-olds (%)
IncomeComposition  Human Development Index in terms of income composition of resources
Schooling          Number of years of schooling (years)
Expectancy         Life expectancy at birth (years)
Assignment Tasks
Question 1 (20 marks)
(a) (12 marks) Use SAS to study the distribution of life expectancy (expectancy),
both overall and by country status. Obtain measures of location, dispersion,
skewness and kurtosis. Obtain a boxplot, a histogram and a quantile-quantile
plot. Also carry out Normal goodness-of-fit tests. What are the key
features of these distributions? What trends do you observe?
(b) (8 marks) Generate log and square-root transformations of the variable
GDP. Denote the transformed variables logGDP and sqrtGDP respectively.
What are the key features of the distributions of GDP, logGDP and sqrtGDP?
Select the transformation whose distribution is closest to normal. For
convenience, I will refer to your choice as tGDP in the following task
descriptions.
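The analysis itself must be carried out in SAS. Purely to illustrate the arithmetic behind comparing transformations in part (b), the following Python sketch computes moment-based skewness and excess kurtosis for a raw variable and its log and square-root transforms. The data values and the function names are made up for illustration and are not taken from expectancy.sas7bdat. Values reported by SAS may differ slightly, because SAS applies small-sample adjustments by default.

```python
import math


def central_moment(values, order):
    """Return the central moment of the given order (divisor n)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** order for v in values) / n


def skewness(values):
    """Moment-based skewness: m3 / m2^1.5 (0 for a symmetric sample)."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5


def excess_kurtosis(values):
    """Moment-based excess kurtosis: m4 / m2^2 - 3 (0 for a normal distribution)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0


# Made-up, right-skewed values standing in for GDP per capita.
# These are NOT the assignment data.
gdp = [300, 450, 600, 900, 1500, 2500, 4000, 8000, 20000, 45000]

candidates = {
    "GDP": gdp,
    "logGDP": [math.log(v) for v in gdp],    # natural log; requires v > 0
    "sqrtGDP": [math.sqrt(v) for v in gdp],  # requires v >= 0
}

summary = {
    name: (skewness(vals), excess_kurtosis(vals))
    for name, vals in candidates.items()
}

for name, (skew, kurt) in summary.items():
    print(f"{name:8s} skewness={skew:6.2f}  excess kurtosis={kurt:6.2f}")
```

A transformation whose skewness and excess kurtosis are both closer to zero is a candidate for tGDP, but the choice should also be supported by the histogram, the quantile-quantile plot and the goodness-of-fit tests.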
Question 2 (60 marks)
(a) (10 marks) Obtain a Pearson correlation matrix relating the variables
expectancy, tGDP, totalExpenditure and alcohol. Also obtain a scatterplot
matrix of the same variables. Discuss the relationships. Are there any
relationships that you find counter-intuitive? Explain briefly.
(b) (20 marks) Fit a simple regression model relating expectancy to tGDP,
with tGDP as the explanatory variable. Discuss the fitted relationship and
the goodness of fit. Examine residual plots and influence diagnostics and
comment on the residual patterns.
(c) (30 marks) Extend your regression model for expectancy by including the
other potential predictors. In building your model, consider as many potential
explanatory variables as possible (you may need to define additional dummy
variables). You can use stepwise selection to help you find the most
parsimonious (simplest) model with the highest R-square. Be sure to check for
collinearity.
Summarise how your final model was obtained, including the rationale for any
modelling decisions you have made, and indicate why that final model was
considered the ‘best’.
Report and interpret your final model in detail, including a discussion of the
main factors that affect life expectancy. Discuss the model diagnostics. Are
there observations that may require further inspection due to their influence
on the model? Identify any trend/commonality among these observations.
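The regression work must be done in SAS. As a rough illustration of the quantities discussed in parts (b) and (c), the Python sketch below fits a simple least-squares line and computes R-square, leverage and Cook's distance by hand. The data values are made up, and the names fit_simple_ols and cooks_distance are hypothetical helpers, not SAS procedures or output.

```python
def fit_simple_ols(x, y):
    """Fit y = intercept + slope * x by least squares and return key quantities."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sse = sum(e ** 2 for e in residuals)       # unexplained variation
    sst = sum((yi - y_bar) ** 2 for yi in y)   # total variation
    # Leverage for simple regression: h_i = 1/n + (x_i - x_bar)^2 / Sxx
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    return {
        "slope": slope,
        "intercept": intercept,
        "r_squared": 1 - sse / sst,
        "residuals": residuals,
        "leverage": leverage,
    }


def cooks_distance(residuals, leverage, n_params=2):
    """Cook's distance: D_i = e_i^2 / (p * s^2) * h_i / (1 - h_i)^2."""
    n = len(residuals)
    s2 = sum(e ** 2 for e in residuals) / (n - n_params)  # residual variance
    return [
        (e ** 2 / (n_params * s2)) * h / (1 - h) ** 2
        for e, h in zip(residuals, leverage)
    ]


# Made-up values (NOT the assignment data): a transformed GDP measure and
# life expectancy, with the last point deliberately placed off the trend.
t_gdp = [5.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 9.0, 9.5, 10.5]
expectancy = [52, 58, 60, 63, 66, 69, 71, 74, 77, 60]

fit = fit_simple_ols(t_gdp, expectancy)
cooks = cooks_distance(fit["residuals"], fit["leverage"])
most_influential = max(range(len(cooks)), key=lambda i: cooks[i])

print(f"slope={fit['slope']:.3f} intercept={fit['intercept']:.3f} "
      f"R-square={fit['r_squared']:.3f}")
print(f"most influential observation index: {most_influential}")
```

Observations with a large Cook's distance combine a large residual with high leverage, which is why they are the ones worth inspecting further.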
Question 3 (20 marks)
Write a summary of your findings from Questions 1 and 2. Keep the technical
details of the analyses that led you to these conclusions to the absolute minimum.
Rather, focus on practical significance and present your findings in non-specialist
terms. One to two paragraphs (up to a page) will be sufficient.