Date: 2022-11-04
STAT 231 Assignment 3: What Are You Waiting For? 
Due: 11am Eastern on Friday November 4 
Total marks: 50 
Please review the information on Page 1 of Assignment 1 for full details on how to submit your 
assignment. As a reminder, to complete this assignment you must: 
• Upload your typed/computer-generated Assignment 3 Report file as a PDF to Crowdmark. 
• Upload your Assignment 3 R code file as a .R file to the Assignment 3 R Code File dropbox. 
As with Assignments 1 and 2, your Report must be typeset, and your R code file should generate all 
the results presented in your Report. 
If you are unsure how to format an answer, please check the Layout Lowdown on pages 5-8! There 
are also template files available on LEARN for this assignment - you are not required to follow these 
templates, but you may if you wish! 
What’s this assignment about? 
This assignment covers the material up to and including Chapter 4, with a focus on interval estimation 
techniques. We will seek to model the tweet.gap variate, which measures the time (or ‘gap’) between 
the publication of tweets. More precisely, for a particular tweet, tweet.gap gives the number of 
seconds since the user’s previous tweet was published. 
Data about how often a user interacts with a website, service, or product are valuable for a 
variety of reasons. The regularity, and reliability, with which users return (sometimes referred to as 
‘stickiness’) is a key metric to assess product performance, as well as for testing the effectiveness of 
new features and initiatives. 
In addition to providing insights into how often users post tweets, the variate tweet.gap also provides 
an opportunity to explore some challenges commonly encountered in real-world data analysis. Many 
of you will find that tweet.gap contains some particularly large values, as a result of users not 
tweeting for several days, or even weeks. When working with real-world data it is common to 
encounter unusual behaviour such as this, which can make finding a suitable statistical model difficult. 
In this assignment we will explore two approaches for modelling data with unusual distributions. One 
of these is to consider a subset of the data, narrowing the focus of our research question in order to 
facilitate meaningful analysis. The other is data transformation, which we have used previously (such 
as in taking logs of the likes variate) and will now extend to other, more complex transformation 
procedures. 
Before we begin 
For the purposes of this assignment, the study population is defined as the set of tweets in the 
primary dataset from which you downloaded your sample at the start of term. 
In this analysis we will include all of the data in your Twitter dataset (that is, all five accounts). 
You may find it interesting to re-run your analyses on your personal and organizational accounts 
separately, while thinking about why we might expect these accounts to have different distributions 
for this variate. 
Because tweet.gap is measured in seconds, we will convert this to hours to make it easier to 
interpret our results. You should create the variate tweet.gap.hour, just as we created 
time.of.day.hour in Assignment 1. 
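A minimal sketch of the conversion, assuming it is a plain division by the 3600 seconds in an hour. The data frame below is a made-up stand-in for your own dataset; only the names mydata, tweet.gap, and tweet.gap.hour come from this assignment.

```r
# Hypothetical stand-in for your dataset (values are invented for illustration).
mydata <- data.frame(tweet.gap = c(90, 3600, 5400, 86400))

# Convert seconds to hours: there are 3600 seconds in one hour.
mydata$tweet.gap.hour <- mydata$tweet.gap / 3600

mydata$tweet.gap.hour
```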

Analysis 1: Time Between Tweets and an Exponential Model 
In Analyses 1 and 2 we will be exploring the distribution of tweet.gap.hour for tweets that are not 
the first tweet of the day. In the following, we refer to two sets of tweets denoted Tweet Set A and 
Tweet Set B as follows: 
Tweet Set A: All tweets in your dataset. 
Tweet Set B: Just tweets that are not the first tweet of the day. Note that these are the tweets 
for which first.tweet equals 0. 
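One possible way to build the two sets is sketched below. The toy data frame and the names set.a and set.b are illustrative assumptions, not part of the assignment; only first.tweet and tweet.gap.hour are variates named above.

```r
# Hypothetical stand-in for your dataset (values are invented for illustration).
mydata <- data.frame(
  tweet.gap.hour = c(30.2, 0.5, 2.1, 14.0, 0.1),
  first.tweet    = c(1, 0, 0, 1, 0)
)

# Tweet Set A: every tweet in the dataset.
set.a <- mydata$tweet.gap.hour

# Tweet Set B: only tweets that are not the first tweet of the day.
set.b <- mydata$tweet.gap.hour[mydata$first.tweet == 0]

length(set.a)
length(set.b)
```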
1a. [0.25 marks] To facilitate grading, please provide your 8-digit student ID number. 
1b. [2 marks] Do you have any concerns about measurement error in the first.tweet variate? 
Briefly explain why or why not. 
1c. [2 marks] State the sample size, and calculate the sample mean, sample median, sample minimum, 
sample maximum, and sample standard deviation of tweet.gap.hour for Tweet Set A 
and Tweet Set B. Display these values in a table in your Report. 
1d. [1 mark] Briefly explain why the maximum value of tweet.gap.hour for Tweet Set B should not 
be greater than 24. Note: This question is not asking you to simply verify that the maximum 
calculated in Analysis 1c is not larger than 24; your answer should explain why, based on how 
Tweet Set B is constructed, it should not contain a value larger than 24 for any possible sample. 
1e. [4 marks] Generate a relative frequency histogram and an empirical cumulative distribution 
function plot of the variate tweet.gap.hour for each of Tweet Set A and Tweet Set B (that 
is, you should include a total of four plots, two for each Tweet Set). All plots should feature 
a suitable superimposed Exponential probability density or cumulative distribution function 
curve. Hint: You may wish to use par(mfrow = c(2, 2)) so that your plots are displayed in 
a single image. 
1f. [7 marks] For each of Tweet Set A and Tweet Set B, discuss how well an Exponential model 
fits the data. Your answer should explain what you would expect to observe if the data were 
generated from an Exponential distribution, and compare this with what you observe in your 
sample. You should make at least three comparisons (of what you would expect, and what you 
observe) for each of Tweet Set A and Tweet Set B, and include an overall conclusion on which 
of Tweet Set A and Tweet Set B the Exponential model appears to fit better. 
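The summary statistics in 1c and the superimposed curves in 1e could be produced along the following lines. This is only a sketch on simulated data: the vector y is a stand-in for whichever Tweet Set you are analysing, and it assumes that a "suitable" Exponential curve is one whose mean is set to the sample mean (the maximum likelihood estimate of the mean under an Exponential model).

```r
# Simulated stand-in for tweet.gap.hour (NOT real tweet data).
set.seed(231)
y <- rexp(200, rate = 1 / 2)

# Summary statistics of the kind requested in 1c.
stats <- c(n = length(y), mean = mean(y), median = median(y),
           min = min(y), max = max(y), sd = sd(y))
round(stats, 3)

# Mean of the fitted Exponential model; R parametrises by rate = 1 / mean.
theta.hat <- mean(y)

par(mfrow = c(1, 2))

# Relative frequency histogram (total area 1) with the Exponential p.d.f.
hist(y, freq = FALSE, main = "Relative frequency histogram",
     xlab = "tweet.gap.hour")
curve(dexp(x, rate = 1 / theta.hat), add = TRUE, col = "red", lwd = 2)

# Empirical c.d.f. with the Exponential c.d.f.
plot(ecdf(y), main = "Empirical c.d.f.", xlab = "tweet.gap.hour")
curve(pexp(x, rate = 1 / theta.hat), add = TRUE, col = "red", lwd = 2)
```

Repeating the two plotting steps for both Tweet Sets with par(mfrow = c(2, 2)) gives the four-panel layout mentioned in the hint.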
Analysis 2: Interval Estimation Using an Exponential Model 
In this analysis we will use an Exponential model to describe the time between tweets that were 
not the first tweet of the day. Note that, regardless of your conclusion in Analysis 1f, you should 
complete Analysis 2 using Tweet Set B. 
Let Y ∼ Exponential(θ) denote the value of tweet.gap.hour for a randomly chosen tweet from the 
study population that was not the first tweet of the day. You are reminded that in our notation 
E[Y ] = θ. 
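For reference, the parametrisation with E[Y] = θ corresponds to the density below. The likelihood and relative likelihood shown here are a sketch that assumes independent observations y_1, ..., y_n; the symbols \bar{y} (sample mean) and \hat{\theta} (maximum likelihood estimate) are notation introduced for this sketch. Note that R's dexp() and pexp() use rate = 1/θ rather than θ.

```latex
\begin{align*}
f(y;\theta) &= \frac{1}{\theta}\, e^{-y/\theta}, \qquad y \ge 0,\ \theta > 0, \\
L(\theta)   &= \prod_{i=1}^{n} \frac{1}{\theta}\, e^{-y_i/\theta}
             = \theta^{-n} \exp\!\left(-\frac{n\bar{y}}{\theta}\right), \\
R(\theta)   &= \frac{L(\theta)}{L(\hat{\theta})}
             = \left(\frac{\hat{\theta}}{\theta}\right)^{\!n}
               \exp\!\left[\, n\left(1 - \frac{\hat{\theta}}{\theta}\right)\right],
             \qquad \hat{\theta} = \bar{y}.
\end{align*}
```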
2a. [0.25 marks] To facilitate grading, please provide your 8-digit student ID number. 
2b. [1 mark] What is the maximum likelihood estimate of θ based on your sample? 
2c. [3 marks] Generate a plot of R(θ), the relative likelihood function for θ based on your sample 
and the assumed Exponential(θ) model. Your plot should include a horizontal line that could 
be used to identify the 15% likelihood interval for θ. 

2d. [2 marks] Using uniroot() or uniroot.all(), calculate the 15% likelihood interval for θ. 
Give your answer to four decimal places. 
2e. [3 marks] Calculate approximate 15%, 95%, and 99% confidence intervals for θ based on a 
Central Limit Theorem approximation. Your Report should include an explanation of how this 
was calculated, which may be expressed algebraically or, if you wish, by including the relevant 
R command(s). 
2f. [2 marks] Which of the confidence intervals you calculated in Analysis 2e is most similar to the 
15% likelihood interval found in Analysis 2d? Is this what you would expect? Briefly explain 
why or why not. 
2g. [3 marks] Write 1-2 sentences that explain what the 95% confidence interval calculated in 
Analysis 2e means in the context of the study. Note: your answer should relate your interval 
to the real-world question under consideration, and not simply be written in terms of θ. 
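The steps in 2c to 2e can be sketched as follows on simulated data. The vector y, the helper names rel.lik and clt.ci, and the bracketing intervals passed to uniroot() are all assumptions made for illustration. The Central Limit Theorem interval uses one common form, in which the standard deviation of the sample mean is estimated by the sample mean divided by the square root of n; check this against the form given in your Course Notes.

```r
# Simulated stand-in for Tweet Set B (NOT real tweet data).
set.seed(231)
y <- rexp(200, rate = 1 / 2)
n <- length(y)
theta.hat <- mean(y)  # sample mean, used as the estimate of theta

# Relative likelihood for an Exponential model with mean theta,
# computed via the log scale for numerical stability.
rel.lik <- function(theta) {
  exp(n * (log(theta.hat / theta) + 1 - theta.hat / theta))
}

# Plot R(theta) with a horizontal line at 0.15.
curve(rel.lik(x), from = 0.7 * theta.hat, to = 1.4 * theta.hat,
      xlab = expression(theta), ylab = expression(R(theta)))
abline(h = 0.15, lty = 2)

# 15% likelihood interval: solve R(theta) = 0.15 on each side of theta.hat.
# The wide brackets and tight tolerance are choices made for this sketch.
g <- function(theta) rel.lik(theta) - 0.15
lower <- uniroot(g, lower = theta.hat / 100, upper = theta.hat, tol = 1e-10)$root
upper <- uniroot(g, lower = theta.hat, upper = 100 * theta.hat, tol = 1e-10)$root
round(c(lower, upper), 4)

# Approximate 100p% confidence interval from a Gaussian approximation.
clt.ci <- function(p) {
  a <- qnorm((1 + p) / 2)
  theta.hat + c(-1, 1) * a * theta.hat / sqrt(n)
}
round(rbind("15%" = clt.ci(0.15), "95%" = clt.ci(0.95), "99%" = clt.ci(0.99)), 4)
```

uniroot() is called twice because R(θ) crosses 0.15 once on each side of its maximum; uniroot.all() from the rootSolve package could find both roots in one call.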
Analysis 3: Time Between Tweets and a Gaussian Model 
In Analyses 3 and 4 we will be exploring the distribution of tweet.gap.hour for tweets that are 
the first tweet of the day. We will exclude tweets that were published more than 24 hours after the 
preceding tweet (think about why we might wish to do this). You can create this subset of tweets as 
follows: 
> tgh.first <- mydata$tweet.gap.hour[mydata$first.tweet == 1 & mydata$tweet.gap.hour <= 24] 
Note: We have called the variate tgh.first as shorthand for ‘tweet gap hour first tweets’; you are 
welcome to use your own choice of naming convention! 
The data in tgh.first are therefore the times between the first tweet sent on a particular day, and 
the last tweet sent the preceding day. Hint: Run summary(tgh.first) and check the results make 
sense based on how we have defined this variate. 
We will explore various transformations of the variate in an attempt to facilitate the use of a Gaussian 
model. In particular, we will consider the following three transformations, which we first define in 
general terms for data $y_1, y_2, \ldots, y_n$, recalling that $y_{(n)}$ denotes the maximum value in our sample. 
• Square Root: $s_i = \sqrt{(y_{(n)} - y_i) + 1}$ 
• Log: $l_i = \log\big((y_{(n)} - y_i) + 1\big)$ 
• Reciprocal: $r_i = \dfrac{1}{(y_{(n)} - y_i) + 1}$ 
You should generate three new variates as follows (where, again, you are welcome to use your own 
naming conventions): 
# Square Root 
> tf1 <- sqrt(max(tgh.first) - tgh.first + 1) 
# Log 
> tf2 <- log(max(tgh.first) - tgh.first + 1) 
# Reciprocal 
> tf3 <- 1/(max(tgh.first) - tgh.first + 1) 
We will refer to the non-transformed data as the ‘Original’ data. 

3a. [0.25 marks] To facilitate grading, please provide your 8-digit student ID number. 
3b. [4 marks] Generate a relative frequency histogram or an empirical cumulative distribution 
function plot of the Original, Square Root, Log, and Reciprocal transformations of the variate 
defined above as tgh.first. All four plots should be of the same type (that is, your Report 
should contain four histograms, or four e.c.d.f. plots). All four plots should feature a suitable 
superimposed Gaussian probability density or cumulative distribution function curve. Hint: 
You may wish to use par(mfrow = c(2, 2)) as you did in Analysis 1e. 
3c. [2 marks] Which of the Square Root, Log, or Reciprocal transformations leads to the best fit of 
a Gaussian model? Briefly justify your answer in 1-2 sentences. It is sufficient to refer only to 
your results in Analysis 3b, but if you wish to carry out additional analyses you are welcome 
to. Note that even if you believe the original dataset exhibits the best fit, you must choose one 
of the three transformation options detailed above. 
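A sketch of the four-panel histogram layout for 3b is given below, using simulated data in place of tgh.first. The list name variates and the simulated values are assumptions for illustration; each Gaussian curve uses the sample mean and sample standard deviation of the variate being plotted.

```r
# Simulated, left-skewed stand-in for tgh.first (NOT real tweet data).
set.seed(231)
tgh.first <- 24 - rexp(150, rate = 1 / 4)
tgh.first <- tgh.first[tgh.first > 0]

# Reflected and shifted values; the smallest possible value is exactly 1.
d <- max(tgh.first) - tgh.first + 1

variates <- list(
  "Original"    = tgh.first,
  "Square Root" = sqrt(d),
  "Log"         = log(d),
  "Reciprocal"  = 1 / d
)

par(mfrow = c(2, 2))
for (nm in names(variates)) {
  v <- variates[[nm]]
  # Relative frequency histogram (total area 1) with a Gaussian p.d.f.
  hist(v, freq = FALSE, main = nm, xlab = nm)
  curve(dnorm(x, mean = mean(v), sd = sd(v)), add = TRUE, col = "red", lwd = 2)
}
```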
Analysis 4: Interval Estimation Using a Gaussian Model 
In our final analysis, we will use the transformed variate chosen in Analysis 3c. Let X ∼ G(µ, σ) 
denote the value of the transformed variate for a randomly chosen tweet from the study population. 
Note that all questions in this analysis should be conducted using the transformed variate you chose 
in Analysis 3c. 
4a. [0.25 marks] To facilitate grading, please provide your 8-digit student ID number, and write 
down the name of the transformation you chose in Analysis 3c (that is, Square Root, Log, or 
Reciprocal). 
4b. [1 mark] State the sample size, and calculate the sample mean and sample standard deviation 
for your transformed variate. 
4c. [3 marks] Calculate a 95% confidence interval or approximate confidence interval for µ based 
on your sample. (You should decide which is the appropriate confidence interval to calculate.) 
Your Report should include an explanation of how this was calculated, which may be expressed 
algebraically or, if you wish, by including the relevant R command(s). 
4d. [1 mark] Is the confidence interval you calculated in Analysis 4c exact or approximate? Briefly 
justify your answer. (Note: this question concerns whether the interval is theoretically exact 
or approximate; your answer should not discuss numerical matters such as rounding, or 
approximations used within R itself.) You may cite results in the Course Notes without proof. 
4e. [3 marks] Write 1-2 sentences that explain what the interval calculated in Analysis 4c means 
in the context of the study. Note: your answer should relate your interval to the real-world 
question under consideration, and not simply be written in terms of µ. Note: Do not transform 
your interval back to the original scale on which tweet.gap.hour is measured. 
4f. [3 marks] Calculate a 95% confidence interval for σ based on your sample. Your Report should 
include an explanation of how this was calculated, which may be expressed algebraically or, if 
you wish, by including the relevant R command(s). 
4g. [2 marks] You are told that Alex, another STAT 231 student, has a sample which contains 
considerably fewer tweets than your sample. Would the interval Alex calculated in Analysis 
4f be narrower, wider, or about the same width as the interval you calculated in Analysis 4f? 
Justify your answer in 1-2 sentences. 
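For orientation, the intervals in 4c and 4f could be computed along these lines if the G(µ, σ) model is taken at face value with both parameters unknown. This is a sketch on simulated data: the vector x and the names mu.ci and sigma.ci are illustrative, and the choice of a t-based interval for µ is an assumption of the sketch, not a statement of which interval the assignment expects.

```r
# Simulated stand-in for the transformed variate (NOT real tweet data).
set.seed(231)
x <- rnorm(60, mean = 2, sd = 0.5)

n    <- length(x)
xbar <- mean(x)
s    <- sd(x)

# 95% interval for mu based on the t distribution with n - 1 degrees of freedom.
mu.ci <- xbar + c(-1, 1) * qt(0.975, df = n - 1) * s / sqrt(n)

# 95% interval for sigma based on (n - 1) S^2 / sigma^2 having a
# chi-squared distribution with n - 1 degrees of freedom.
# The larger quantile gives the lower bound, the smaller gives the upper bound.
sigma.ci <- sqrt((n - 1) * s^2 / qchisq(c(0.975, 0.025), df = n - 1))

round(mu.ci, 3)
round(sigma.ci, 3)
```

Rerunning the sketch with a smaller sample size is one informal way to explore the question posed in 4g.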

Layout Lowdown 
If you have not used the Layout Lowdown for your previous assignments, please review the information 
posted in Assignment 1! 
Reminder: Please start each of Analysis 1, 2, 3, and 4 on a new page. This will be necessary for 
uploading to Crowdmark to help facilitate grading. Thank you! 
Analysis 1 
1a: You are not required to submit this answer in complete sentences, but an example of how to 
phrase this is: 
“My ID number is [12345678].” 
1b: An example of how you could phrase your answer: 
“I [do/do not] have concerns about measurement error in the first.tweet variate. This is because 
[brief justification].” 
1c: Here is one way to display your table of results. You should round your answers to an appropriate 
number of decimal places: 
             Sample  Sample  Sample  Sample   Sample   Sample
             Size    Mean    Median  Minimum  Maximum  SD
Tweet Set A
Tweet Set B
1d: An example of how you could phrase your answer: 
“The maximum value of tweet.gap.hour for Tweet Set B should not be greater than 24 because 
[brief explanation].” 
1e: Your Report should contain two relative frequency histograms and two empirical cumulative 
distribution function plots (one of each for Tweet Set A and one of each for Tweet Set B, for four plots in total). 
You should use your judgment on how these should be formatted. 
1f: An example of how you could structure your answer: 
“Tweet Set A: Based on the results in Analysis [1c/1e], we can see that [...] while for data generated 
from an Exponential distribution we would expect to see [...]. [Add more comparisons here.] Overall, 
the Exponential model [description of whether the model fits well]. 
Tweet Set B: Based on the results in Analysis [1c/1e], we can see that [...] while for data generated 
from an Exponential distribution we would expect to see [...]. [Add more comparisons here.] Overall, 
the Exponential model [description of whether the model fits well]. 
Overall, the Exponential model appears to be a better fit for [Tweet Set A/Tweet Set B], because 
[brief justification].” 

Analysis 2 
2a: You may reuse your answer from Analysis 1a. 
2b: This does not have to be written in complete sentences, but an example of how you could phrase 
your answer: 
“The maximum likelihood estimate of θ based on my sample is [number].” 
2c: Your Report should contain one relative likelihood function plot for this analysis. You should 
use your judgment on how it should be formatted. 
2d: This does not have to be written in complete sentences, but an example of how you could phrase 
your answer: 
“The 15% likelihood interval for θ is [lower bound, upper bound].” 
2e: This does not have to be written in complete sentences, but an example of how you could phrase 
your answer: 
“The approximate 15%, 95% and 99% confidence intervals for θ are [15% lower bound, 15% upper 
bound], [95% lower bound, 95% upper bound], and [99% lower bound, 99% upper bound], respectively. 
These were calculated by [explanation].” 
2f: An example of how you could phrase your answer: 
“The approximate [15%/95%/99%] confidence interval is most similar to the 15% likelihood interval. 
This [is/is not] what I would expect, because [brief justification].” 
2g: An example of how you could phrase your answer: 
“The interval [95% lower bound, 95% upper bound] tells us that [answer continues].” 

Analysis 3 
3a: You may reuse your answer from Analysis 1a. 
3b: Your Report should contain either four relative frequency histograms or four empirical cumulative 
distribution function plots. You should use your judgment on how these should be formatted. 
3c: An example of how you could phrase your answer: 
“The Gaussian model appears to fit the [Square Root/Log/Reciprocal] transformed data best, because 
[brief justification].” 

Analysis 4 
4a: Note that, unlike Analysis 1a, 2a, and 3a, this requires the specification of your choice of 
transformed variate. An example of how you could phrase your answer: 
“My ID number is [12345678]. In Analysis 3c I chose the [Square Root/Log/Reciprocal] transformation.” 
4b: You may write these numbers in a list, a sentence, or a table, as long as it is clear which number 
corresponds to each sample statistic. We recommend giving your answers to three decimal places. 
One possible formatting would be: 
“The sample size is [value], the sample mean is [value], the sample standard deviation is [value].” 
4c: An example of how you could phrase your answer: 
“A 95% [confidence interval/approximate confidence interval] for µ is [lower bound, upper bound]. 
This was calculated by [explanation].” 
4d: An example of how you could phrase your answer: 
“This is an [exact/approximate] confidence interval, because [brief justification].” 
4e: This does not have to be written in complete sentences, but an example of how you could phrase 
your answer: 
“The interval [95% lower bound, 95% upper bound] tells us that [answer continues].” 
4f: An example of how you could phrase your answer: 
“A 95% confidence interval for σ is [lower bound, upper bound]. This was calculated by [explanation].” 
4g: An example of how you could phrase your answer: 
“I would conclude [conclusion] about the width of Alex’s confidence interval compared to the one I 
calculated in Analysis 4f. This is because [brief justification].” 