Date: 2021-11-30

Assignment 1 (2020) alternative questions
STATS4044 Intro to R (H) 2021-2022
Introduction
This document contains alternative questions for Assignment 1 (2020), additional to those available in
A1_2020.pdf and A1_2020.html.
Contents:
• 3 alternative versions of Question 1 (each worth 15 marks):
– Game of Thrones characters 1
– Game of Thrones characters 2
– Star Wars
• 3 alternative versions of Question 2 (each worth 15 marks)
– Plant growth
– Seals
– BloodLead
The original assignment contained 1 of each Question (30 marks total).
Question 1 - Game of Thrones characters 1 [15 marks total]
If you have not already done so, use the line of code below to load the dataframes for this assignment into R:
load(url("http://www.maths.gla.ac.uk/~rhaggarty/rp/a1data2020.RData"))
This task is based on a set of data derived from the HBO television series Game of Thrones (based on George
R. R. Martin’s series of fantasy novels). The data used for this task comes from the first 6 seasons of the
show and are stored in the dataframe got which contains the following columns:
name character name
screentime time on screen in minutes
episodes number of episodes the character has been in
portrayed_by name of the actor the character is portrayed by
allegiance name of the ‘house’ (family) the character is allied to
gender male (m) or female (f)
Answer the questions below without manually extracting information from the data.
(a) [1 mark]
Update the dataframe got so that the rows where the number of episodes is missing are removed.
(b) [2 marks]
Update the dataframe got so that the rows are ordered by screentime in descending order (from highest to
lowest).
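As an illustration of the two operations in parts (a) and (b), the sketch below filters out missing values and reorders rows on a small toy data frame. The data frame toy and its columns id and score are hypothetical and are not part of the assignment data.

```r
# Toy data frame (hypothetical values, not the assignment data)
toy <- data.frame(id = c("a", "b", "c", "d"),
                  score = c(3.5, NA, 9.1, 6.2),
                  stringsAsFactors = FALSE)

# Keep only the rows where score is not missing
toy <- toy[!is.na(toy$score), ]

# Reorder the rows by score, from highest to lowest
toy <- toy[order(toy$score, decreasing = TRUE), ]
print(toy)
```

Logical indexing with !is.na() and order() are only one possible approach; other base R functions could be used instead.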
(c) [2 marks]
Define a variable popular which contains the name of the actor who portrays the female character with the
most screentime.
(d) [2 marks]
Define a vector named house which contains the names of all characters which have the same allegiance as
the character “Robb Stark”.
(e) [3 marks]
Update the got dataframe by adding a new variable named role which contains the value “minor” if
episodes ≤ 5, “supporting” if 5 < episodes ≤ 35 and “major” if episodes > 35.
(f) [2 marks]
Define a vector named props which contains the proportion of characters that fall into each of the three
role categories as defined in part (e) above (i.e. minor, supporting and major).
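One possible way to bin a numeric variable into labelled categories and then compute category proportions is sketched below on a toy vector. The vector x, the cut points 2 and 4, and the labels are all hypothetical; the thresholds in the question are different.

```r
# Toy numeric vector (hypothetical values)
x <- c(1, 2, 3, 4, 5, 6, 2, 1)

# cut() uses right-closed intervals by default: (-Inf, 2], (2, 4], (4, Inf)
grp <- cut(x, breaks = c(-Inf, 2, 4, Inf),
           labels = c("low", "mid", "high"))

# Proportion of observations falling in each category
shares <- prop.table(table(grp))
print(shares)
```

cut() returns a factor; as.character(grp) converts it if a character column is preferred. Nested ifelse() calls are an alternative to cut().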
The following question is not based on the Game of Thrones data but instead requires you to simulate some
data.
(g) [3 marks]
One of the actors needs to be on set by 7.30am. He leaves his house at a random time between 6 and 7am,
with each leaving time being equally likely. You can use the function runif(n,6, 7) to simulate n such
leaving times.
The actor's journey to set can take anywhere between 30 minutes (i.e. half an hour) and 45 minutes (i.e. three
quarters of an hour) depending on traffic. Assume that each journey time is equally likely and independent
of the leaving time.
By first simulating n = 100,000 leaving times, and next simulating n = 100,000 journey times, define a
variable named ontime which contains the proportion of times the actor arrives on set after 7.30am (based on
your simulation).
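The general simulation pattern can be sketched as follows for a different, hypothetical scenario: leaving uniformly between 8am and 9am, a journey of 20 to 40 minutes, and a deadline of 9.15am. These numbers, the seed and the variable names are assumptions chosen only for illustration.

```r
set.seed(1)                           # arbitrary seed, for reproducibility only
n <- 100000

leave   <- runif(n, 8, 9)             # leaving times, in hours
journey <- runif(n, 20 / 60, 40 / 60) # journey times, converted to hours
arrive  <- leave + journey            # arrival times, in hours

# Proportion of simulated arrivals later than 9.15am (9.25 hours)
late <- mean(arrive > 9.25)
print(late)
```

Working in a single unit (hours here) avoids mixing minutes and hours. For this toy scenario the exact probability is 0.25, so the simulated value should be close to that.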
Question 1 - Game of Thrones characters 2 [15 marks total]
If you have not already done so, use the line of code below to load the dataframes for this assignment into R:
load(url("http://www.maths.gla.ac.uk/~rhaggarty/rp/a1data2020.RData"))
This task is based on a set of data derived from the HBO television series Game of Thrones (based on George
R. R. Martin’s series of fantasy novels). The data used for this task comes from the first 6 seasons of the
show and are stored in the dataset got which contains the following columns:
name character name
screentime time on screen in minutes
episodes number of episodes the character has been in
portrayed_by name of the actor the character is portrayed by
allegiance name of the ‘house’ (family) the character is allied to
gender male (m) or female (f)
Answer the questions below without manually extracting information from the data.
(a) [2 marks]
Update the got dataframe so that the rows are ordered by screentime in ascending order (from lowest to
highest).
(b) [2 marks]
Define a vector of length 6 called missing where each element corresponds to a column and contains
the number of missing values in that column. The elements of the vector should be named to show the
corresponding columns.
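A named vector of per-column missing-value counts can be obtained in one step, as sketched below on a toy data frame. The data frame toy and its columns a, b and c are hypothetical.

```r
# Toy data frame with some missing values (hypothetical)
toy <- data.frame(a = c(1, NA, 3),
                  b = c(NA, NA, "z"),
                  c = c(TRUE, FALSE, TRUE),
                  stringsAsFactors = FALSE)

# is.na() gives a logical matrix; colSums() counts the TRUEs per column
# and keeps the column names
n_missing <- colSums(is.na(toy))
print(n_missing)
```

sapply(toy, function(col) sum(is.na(col))) gives the same result column by column.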
(c) [1 mark]
Update the dataframe got so that the rows where the allegiance is missing are removed.
(d) [2 marks]
Define a variable mpop which contains the name of the actor who portrays the male character that has appeared
in the most episodes.
(e) [3 marks]
Update the got dataframe by adding a variable named role which contains the value “minor” if episodes
≤ 10 , “supporting” if 10 < episodes ≤ 30 and “major” if episodes > 30.
(f) [2 marks]
Compute the average amount of screentime per episode for each character and hence identify the most popular
character using this measure. You should store the character name in a variable called airtime.
The following question is not based on the Game of Thrones data but instead requires you to simulate some
data.
(g) [3 marks]
One of the actors needs to be on set by 9am. He leaves the house at a random time between 7 and 8am, with
each leaving time being equally likely. You can use the function runif(n,7, 8) to simulate n such leaving
times.
The actor's journey to set can take anywhere between 60 minutes (i.e. one hour) and 90 minutes (i.e. one and
a half hours) depending on traffic. Assume that each journey time is equally likely and independent of the
leaving time.
By first simulating n = 100,000 leaving times, and next simulating n = 100,000 journey times, define a variable
named ontime which contains the proportion of times the actor is on set by 9am (based on your simulation).
Question 1 - Star Wars [15 marks total]
If you have not already done so, use the line of code below to load the dataframes for this assignment into R:
load(url("http://www.maths.gla.ac.uk/~rhaggarty/rp/a1data2020.RData"))
In this task we will work with the dataframe starwars. Answer the questions below without manually
extracting information from the data. The dataframe contains data on characters from the Star Wars movie
series and contains the following 5 columns:
• name - name of the character
• height - height of the character in cm
• mass - mass of character in kg
• homeworld - homeworld the character belongs to
• species - species of the character
(a) [2 marks]
Remove the rows from the dataframe starwars where the height or mass of the character is missing. The
updated data frame should continue to be called starwars.
(b) [2 marks]
Add a new column named BMI to the starwars dataset which contains the body mass index of each character.
Body mass index is calculated as $10000 \times \mathrm{mass} / \mathrm{height}^2$ when mass is measured in kg and height is measured
in cm.
(c) [2 marks]
Define a variable extremes which contains the names of the characters with the lowest and highest weights.
(d) [3 marks]
Update the starwars data frame so that it includes a column named WeightStatus, which contains the
weight status as set out in the table below.
BMI (to 1 decimal place) Weight Status
Below 18.5 Underweight
18.5 – 24.9 Healthy
25.0 – 29.9 Overweight
30 and above Obese
(e) [2 marks]
Define a vector named chewie which contains the names of all characters which are from the same homeworld
as Chewbacca.
(f) [1 mark]
Update the starwars dataframe so that the rows are in decreasing order according to the values of mass.
The following question is not based on the starwars data but instead requires you to simulate some data.
(g) [3 marks]
When filming one of the new Star Wars movies one of the actors needs to be on set by 9am. He leaves the
house at a random time between 7 and 8am, with each leaving time being equally likely. You can use the
function runif(n,7, 8) to simulate n such leaving times.
The actor's journey to set can take anywhere between 60 minutes (i.e. an hour) and 90 minutes (i.e. one and a
half hours) depending on traffic. Assume that each journey time is equally likely and independent of the
leaving time.
By first simulating n = 100,000 leaving times, and next simulating n = 100,000 journey times, define a variable
named ontime which contains the proportion of times the actor arrives on set later than 9am (based on your
simulation).
Question 2 - Plant growth [15 marks total]
If you have not already done so, use the line of code below to load the dataframes for this assignment into R:
load(url("http://www.maths.gla.ac.uk/~rhaggarty/rp/a1data2020.RData"))
This task is based on the data set PlantGrowth, which contains data from an experiment to compare yields
of plants (as measured by dried weight of plants) obtained under a control and two different treatment
conditions. The PlantGrowth data frame has the following two columns:
• weight - the weight of the yield
• group - the treatment group (Control: ctrl, Treatment 1: trt1, or Treatment 2: trt2)
(a) [2 marks]
Use R to define a vector of length 5 which contains the treatment groups (trt1, trt2 or ctrl) corresponding to
the plants with the 5 highest weight yields.
(b) [3 marks]
Define a list of length 3 named group_summaries where each element of the list corresponds to a vector of
length 2 that contains the mean and median weight for a single group (trt1, trt2 or ctrl). The elements of the
list should have meaningful names.
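A named list of per-group summary vectors can be built by splitting one column by another, as in the sketch below. The toy data frame and its columns value and grp are hypothetical and stand in for any numeric measurement with a grouping variable.

```r
# Toy data: a numeric measurement and a grouping variable (hypothetical)
toy <- data.frame(value = c(1, 2, 3, 10, 20, 60),
                  grp = c("a", "a", "a", "b", "b", "b"),
                  stringsAsFactors = FALSE)

# split() returns a list of value vectors named by group;
# lapply() then computes a named summary vector for each group
summ <- lapply(split(toy$value, toy$grp),
               function(v) c(mean = mean(v), median = median(v)))
print(summ)
```

Because split() names the list elements after the group labels, the result already carries meaningful names.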
(c) [2 marks]
Define a new data frame named sub which contains the rows of PlantGrowth data set that correspond to
treatment 1 (trt1) and treatment 2 (trt2) only.
(d) [3 marks]
The two sample t-test is a formal statistical test which allows for checking whether there is a significant
difference in the mean weight of plants in treatment group 1 and treatment group 2.
Denote by $x_1, \ldots, x_{n_1}$ the weights of the $n_1$ plants in treatment group 1 (trt1) and by $y_1, \ldots, y_{n_2}$ the weights of the
$n_2$ plants in treatment group 2 (trt2).
The test statistic of the t-test is
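$$t = \frac{\bar{x} - \bar{y}}{\sqrt{\left(\frac{1}{n_1} + \frac{1}{n_2}\right) S_{xy}^2}},$$
where
$$S_{xy}^2 = \frac{1}{n_1 + n_2 - 2}\left(\sum_{i=1}^{n_1} (x_i - \bar{x})^2 + \sum_{i=1}^{n_2} (y_i - \bar{y})^2\right), \quad \bar{x} = \frac{1}{n_1}\sum_{i=1}^{n_1} x_i, \quad \bar{y} = \frac{1}{n_2}\sum_{i=1}^{n_2} y_i.$$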
Define a variable t which contains the test statistic t calculated using the formula shown above. You should
not use the in-built function t.test to compute t, but you are free to use this to check your calculation.
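As a sketch of how a pooled-variance statistic of this form can be computed by hand and then checked, the code below uses two short toy samples. The vectors x and y and their values are hypothetical; the check assumes the formula corresponds to t.test() with var.equal = TRUE.

```r
# Two toy samples (hypothetical values)
x <- c(5.1, 4.8, 6.0, 5.5, 5.9)
y <- c(4.2, 4.9, 4.4, 5.0)
n1 <- length(x)
n2 <- length(y)

# Pooled variance: sums of squared deviations over n1 + n2 - 2
s2 <- (sum((x - mean(x))^2) + sum((y - mean(y))^2)) / (n1 + n2 - 2)

# Test statistic computed directly from the formula
t_manual <- (mean(x) - mean(y)) / sqrt((1 / n1 + 1 / n2) * s2)

# Check against the built-in equal-variance t-test
t_check <- unname(t.test(x, y, var.equal = TRUE)$statistic)
print(c(manual = t_manual, builtin = t_check))
```

Note that t.test() defaults to the Welch test (var.equal = FALSE), which generally gives a different statistic.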
(e) [4 marks]
A non-parametric equivalent of the two sample t-test is the Mann-Whitney U test. Compute the test statistics
for a Mann-Whitney U test to compare the mean ranks of the Treatment Group 1 and Treatment Group
2 weights by following the steps below:
1. Combine all weight values for treatment groups 1 and 2 into one vector and assign ranks to each weight
value according to size with 1 being assigned to the smallest element, 2 to the second smallest etc. For
example, the elements of a vector (1,10,3,7) would be assigned ranks (1,4,2,3).
Note that you can use the function rank(z) to find the ranks of the elements of a numeric vector z.
2. Ri (where i = 1, 2) corresponds to the sum of the ranks for the observations in treatment group i.
Define two variables R1 and R2 which contain the values R1 and R2 respectively.
3. The test statistics Ui, where i = 1, 2 can be computed using the formula
$$U_i = R_i - \frac{n_i(n_i+1)}{2},$$
where ni is the sample size for treatment group i, and Ri, i = 1, 2 is as defined in step 2.
Define two variables U1 and U2 which contain the values U1 and U2 for the treatment group 1 and 2
plant growth data.
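The three steps above can be sketched on two small toy samples as follows. The vectors x and y are hypothetical and contain no ties; the variable names simply mirror the notation in the steps.

```r
# Two toy samples without ties (hypothetical values)
x <- c(1.1, 3.4, 2.2, 5.0)
y <- c(0.7, 4.1, 6.3)
n1 <- length(x)
n2 <- length(y)

# Step 1: rank the combined sample (1 = smallest)
r <- rank(c(x, y))

# Step 2: rank sums for each sample
R1 <- sum(r[seq_len(n1)])
R2 <- sum(r[n1 + seq_len(n2)])

# Step 3: U statistics
U1 <- R1 - n1 * (n1 + 1) / 2
U2 <- R2 - n2 * (n2 + 1) / 2
print(c(U1 = U1, U2 = U2))   # U1 = 5, U2 = 7
```

A useful sanity check is that U1 + U2 equals n1 * n2. For untied data, wilcox.test(x, y)$statistic reports the same value as U1.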
(f) [1 mark]
Define a vector of length two named auc which contains the AUC values for each value of $U_i$ ($i = 1, 2$) as
shown below:
$$\mathrm{AUC}_i = \frac{U_i}{n_1 n_2}$$
(Note: if you have not successfully managed to define U1 and U2 in part (e), you can use the values U1 = 70
and U2 = 130 to complete this question.)
Question 2 - Seals [15 marks total]
In this task you will work with a mathematical model for the dynamics of a population of one species. In
this question we model the numbers of Northern fur seals. To keep things simple, we only model the number
of female seals. The table below gives the birth and survival rates in a healthy ecosystem as well as the initial
population (in 1,000s).
AgeGroup BirthRate SurvivalRate InitialPopulation (in 1,000s)
0 0.00 0.91 230
1 0.02 0.88 176
2 0.70 0.85 74
3 1.53 0.80 50
4 1.67 0.74 54
5 1.65 0.67 21
6 1.56 0.59 10
7 1.45 0.49 12
8 1.22 0.38 6
9 0.91 0.27 3
10 0.70 0.17 1
11 0.22 0.15 0
12 0.00 0.00 0
We can interpret the fourth line of the table as follows. A female seal in age group 3 will have 1.53 female
offspring by the time it reaches age group 4. A female seal in group 3 has a probability of 0.8 of still being
alive by the time it reaches age group 4. There are around 50,000 female seals in age group 3 in the initial
time period.
Use R to answer the following questions.
(a) [1 mark]
Using the table provided and the code below, create a data frame named seals which contains all
of the data in the table above (i.e. your data frame should have 4 columns).
BirthRate <- c(0, 0.02, 0.7, 1.53, 1.67, 1.65, 1.56, 1.45, 1.22, 0.91, 0.7, 0.22, 0)
Survival <- c(0.91, 0.88, 0.85, 0.8, 0.74, 0.67, 0.59, 0.49, 0.38, 0.27, 0.17, 0.15, 0)
(b) [3 marks]
Define a vector d which contains the differences in the population birth rates of successive age groups. In
other words $d = (d_1, \ldots, d_{n-1})$ with
$$d_i = b_i - b_{i-1},$$
where $n$ is the number of age groups and $b_i$ is the birth rate in the $i$th age group.
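Successive differences of a vector can be computed without a loop, as sketched below on a toy vector. The vector b here is hypothetical and much shorter than the birth rates in the table.

```r
# Toy vector of rates (hypothetical values)
b <- c(0, 0.5, 1.2, 0.3)

# diff() returns b[2] - b[1], b[3] - b[2], ..., one element fewer than b
d <- diff(b)

# Equivalent formulation using shifted copies of the vector
d_alt <- b[-1] - b[-length(b)]
print(d)   # 0.5 0.7 -0.9
```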
(c) [2 marks]
The “Leslie” matrix L for this population is given by
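$$L = \begin{pmatrix}
0.00 & 0.02 & 0.70 & \cdots & 0.22 & 0.00 \\
0.91 & 0.00 & 0.00 & \cdots & 0.00 & 0.00 \\
0.00 & 0.88 & 0.00 & \cdots & 0.00 & 0.00 \\
0.00 & 0.00 & 0.85 & \cdots & 0.00 & 0.00 \\
\vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\
0.00 & 0.00 & 0.00 & \cdots & 0.15 & 0.00
\end{pmatrix}$$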
Define the matrix L without entering each element of the matrix manually.
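A matrix with this structure (rates along the first row, survival probabilities on the sub-diagonal, zeros elsewhere) can be assembled programmatically. The sketch below does so for a toy population with only three age groups; the vectors birth and surv are hypothetical.

```r
# Toy rates for three age groups (hypothetical values)
birth <- c(0, 1.5, 1.0)
surv  <- c(0.5, 0.8, 0)
k <- length(birth)

# Start from a k x k matrix of zeros
L <- matrix(0, nrow = k, ncol = k)

# First row: birth rates
L[1, ] <- birth

# Sub-diagonal: survival rates of all but the last age group,
# filled by indexing with a two-column matrix of (row, column) pairs
L[cbind(2:k, 1:(k - 1))] <- surv[-k]
print(L)
```

The last survival rate does not enter the matrix, which matches the final column of zeros below the first row in the display above.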
(d) [2 marks]
Define a vector p0 which holds the initial population of seals, i.e. its first entry is 230,000, its second entry is
176,000 etc. and a variable t0 which contains the total initial population of seals over all years.
(e) [2 marks]
Use the recursive formula $p_i = L p_{i-1}$ to define a 4 row matrix called mat where the $i$th row is the vector $p_i$
($i = 1, \ldots, 4$), with $p_0$ as defined in part (d).
For example, $p_1 = L p_0$ would be the first row of mat.
(f) [3 marks]
Define a vector t which contains $t_1, \ldots, t_4$, the total number of seals after each of the first four time periods.
Using this vector and the matrix mat defined above in part (e), define the matrix mat.norm which contains as
its rows the normalised age distribution of the seal population in each time period.
Hint: The normalised seal population in period $i$ can be obtained by computing $p_i / t_i$.
(g) [2 marks]
The average population growth rate during the first four time periods is given by
$$\sqrt[4]{\frac{t_4}{t_0}},$$
where t4 and t0 are defined as above. Define a new variable ag4 which contains this average growth rate.
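The projection, normalisation and growth-rate calculations in parts (e) to (g) can be sketched on a toy three-age-group model as follows. The matrix L, the starting vector p0 and all other names here are hypothetical and much smaller than the seal model.

```r
# Toy projection matrix and starting population (hypothetical values)
L <- matrix(c(0.0, 1.5, 1.0,
              0.5, 0.0, 0.0,
              0.0, 0.8, 0.0), nrow = 3, byrow = TRUE)
p0 <- c(10, 5, 2)
t0 <- sum(p0)
n_periods <- 4

# Apply p_i = L p_{i-1} repeatedly, storing p_i as row i
mat <- matrix(NA_real_, nrow = n_periods, ncol = length(p0))
p <- p0
for (i in seq_len(n_periods)) {
  p <- as.vector(L %*% p)
  mat[i, ] <- p
}

# Total population after each period
totals <- rowSums(mat)

# Dividing by a vector of length nrow(mat) recycles down the columns,
# so row i is divided by totals[i]
mat_norm <- mat / totals

# Average growth rate per period over the n_periods periods
growth <- (totals[n_periods] / t0)^(1 / n_periods)
print(mat[1, ])   # 9.5 5.0 4.0
```

Each row of mat_norm should sum to 1, which is a quick check that the division was applied row-wise.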
Question 2 - BloodLead [15 marks total]
If you have not already done so, use the line of code below to load the dataframes for this assignment into R:
load(url("http://www.maths.gla.ac.uk/~rhaggarty/rp/a1data2020.RData"))
This task will look at the dataset BloodLead which contains matched paired data corresponding to blood
lead levels for 33 children of parents who had worked in a lead related factory and 33 control children from
their neighbourhood.
The dataset contains the following 3 columns:
• Pair: factor identifying the matched pair of children
• Exposed: numeric, blood lead levels (mg/dl) for exposed children
• Control: numeric, blood lead levels for controls
(a) [2 marks]
Update the BloodLead dataframe so that it contains a new column named diff which contains, for each
matched pair (i.e. row), the difference between the Exposed and Control measurements.
(b) [1 mark]
Update the data set BloodLead so that any rows where the Exposed and the Control measurement are the
same are removed. You should work with this data set from this point forward.
(c) [3 marks]
Define a list of length 2 named sumblood where each element of the list contains a set of summary statistics
of blood lead levels for each group of children (i.e. the first element of the list corresponds to the Exposed
group, and the second to the Control group). The summary statistics each element of the list should contain
are
• mean
• median
• standard deviation
i.e. the two elements of the list should each be a vector of length three.
(d) [3 marks]
The paired t-test is a formal statistical test which allows for checking whether there is a significant difference
in mean blood lead levels in exposed and control children.
Denote by x1, ..., xn the blood lead levels of the n children in the “Exposed” group and y1, ..., yn the blood
lead levels of the n matched children in the “Control” group.
The test statistic of the t-test to compare the average blood lead levels of each group is then
$$t = \frac{\bar{d}}{\sqrt{s_d^2 / n}},$$
where
$$d_i = x_i - y_i, \quad \bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i, \quad s_d^2 = \frac{1}{n-1}\sum_{i=1}^{n} (d_i - \bar{d})^2.$$
Define a variable t which contains the test statistic t calculated using the formula shown above. Note: You
have already calculated the differences di in the first part of this question where they were stored in the
column named diff.
You should not use the in-built function t.test(...,paired=TRUE) to compute t, but you are free to use
this to check your calculation.
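A paired statistic of this form can be computed from the differences and compared with the built-in function, as sketched below on toy paired measurements. The vectors x and y are hypothetical.

```r
# Toy paired measurements (hypothetical values)
x <- c(12, 15, 9, 20, 14)
y <- c(10, 11, 10, 15, 13)

d <- x - y          # pairwise differences
n <- length(d)

# var() uses the n - 1 denominator, matching the sample variance of d
t_manual <- mean(d) / sqrt(var(d) / n)

# Check against the built-in paired t-test
t_check <- unname(t.test(x, y, paired = TRUE)$statistic)
print(c(manual = t_manual, builtin = t_check))
```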
(e) [3 marks]
A Sign test is a non-parametric test to assess if there is a statistically significant difference in the median blood
lead levels in exposed and control children.
Compute the test statistics for a Sign test to compare the median difference in blood lead levels of the two
groups (“Control” and “Exposed”) by following the steps below.
1. Define the vector tsign to be a vector of length two which contains the number of positive and the
number of negative differences, di .
2. Define a variable S which contains the maximum value of tsign.
3. Define a variable Z which contains the test statistic Z as calculated below. The value S is as defined by
the variable S in step 2 above.
$$Z = \frac{\left(S - \frac{1}{2}\right) - \frac{n}{2}}{\frac{\sqrt{n}}{2}}.$$
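The sign-test steps can be sketched on a toy vector of non-zero differences as follows. The values in d are hypothetical, and the sketch assumes the denominator of Z is read as the square root of n divided by 2 (the standard deviation of a Binomial(n, 1/2) count), with n taken as the number of differences.

```r
# Toy non-zero differences (hypothetical values)
d <- c(1.2, -0.5, 3.1, 0.7, -2.0, 4.4, 0.9, 1.5)
n <- length(d)

# Step 1: number of positive and negative differences
tsign <- c(positive = sum(d > 0), negative = sum(d < 0))

# Step 2: the larger of the two counts
S <- max(tsign)

# Step 3: continuity-corrected statistic
# (assumed reading of the denominator: sqrt(n) / 2)
Z <- ((S - 0.5) - n / 2) / (sqrt(n) / 2)
print(c(S = S, Z = Z))
```

Here there are 6 positive and 2 negative differences, so S is 6 and Z is about 1.06.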
(f) [3 marks]
Define a variable higher which contains the number of children in the “Exposed” group who have blood lead
levels more than 20% higher than the corresponding matched child in the “Control” group.