Date: 2024-03-17
Regression Modelling
Week 5
Abhinav Mehta
An Example in R - Video Game Dataset
• The dataset contains a list of video games on the PS4 platform in the
year 2016.
• Y : Global_Sales - Total worldwide sales (in millions), collected from
the Kaggle website.
• X : Critic_Score - Aggregate score compiled by Metacritic.
• We are interested in modelling the sales of a game using its critic
rating.
Exploratory Data Analysis
game_data <- read.csv("Video_Games.csv", header = TRUE)
head(game_data)
                            Name Global_Sales Critic_Score
1                        FIFA 17         7.59           85
2     Uncharted 4: A Thief's End         5.38           93
3 Call of Duty: Infinite Warfare         4.46           77
4                  Battlefield 1         4.08           88
5      Tom Clancy's The Division         3.80           80
6                Far Cry: Primal         2.26           76
sales <- game_data$Global_Sales
rating <- game_data$Critic_Score
cor(sales, rating)
[1] 0.3754284
plot(rating, sales, pch = 16)
[Figure: scatter plot of sales (y-axis, 0 to 6) against rating (x-axis, 30 to 90)]
Model Fitting
modelgame <- lm(sales ~ rating)
summary(modelgame)
Call:
lm(formula = sales ~ rating)

Residuals:
    Min      1Q  Median      3Q     Max
-1.0508 -0.5888 -0.3267  0.1453  6.5085

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.015575   0.675139  -2.985 0.003610 **
rating       0.036436   0.009278   3.927 0.000164 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.125 on 94 degrees of freedom
Multiple R-squared:  0.1409, Adjusted R-squared:  0.1318
F-statistic: 15.42 on 1 and 94 DF,  p-value: 0.0001638
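As a quick sanity check on this output: in simple linear regression the multiple R-squared equals the squared sample correlation between response and predictor, and the adjusted value follows from it. The sketch below uses only numbers printed on the slides. The sample size `n = 96` is an inference (94 residual degrees of freedom plus two estimated coefficients), and the variable names are illustrative.

```r
# Values copied from the printed output (not recomputed from the data)
r <- 0.3754284   # cor(sales, rating)
n <- 94 + 2      # residual df + 2 coefficients (inferred, not stated on the slides)

r_squared     <- r^2                                      # multiple R-squared
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - 2)  # adjusted R-squared

round(c(r_squared = r_squared, adj_r_squared = adj_r_squared), 4)
```

Both values agree with the summary (0.1409 and 0.1318), so the rating explains only about 14% of the variation in raw sales.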
Fitted Regression Line
plot(rating, sales, pch = 16, ylim = c(-2, 8))
abline(modelgame, col = 2)
[Figure: scatter plot of sales against rating with the fitted regression line in red; y-axis from -2 to 8]
# identify(rating, sales)
Diagnostics
par(mfrow = c(2,2))
plot(modelgame, which = c(1,2,4))
plot(hatvalues(modelgame), type = 'h', ylab="Leverage")
abline(h = 4/length(sales), lty = 2)
[Figure: 2 x 2 panel of diagnostic plots for modelgame: Residuals vs Fitted, Q-Q Residuals, Cook's distance by observation number, and leverage by index. Observations 1, 2 and 3 are labelled in the first two panels; observations 1, 2 and 4 are labelled in the Cook's distance panel.]
Variable Transformation
par(mfrow = c(2, 2))
plot(rating, sales)
plot(rating, log(sales))
plot(log(rating), sales)
plot(log(rating), log(sales))
[Figure: 2 x 2 panel of scatter plots: sales vs rating, log(sales) vs rating, sales vs log(rating), log(sales) vs log(rating)]
library(MASS)
#boxcox(Global_Sales ~ Critic_Score, data = game_data, plotit = TRUE)
boxcox(modelgame, plotit = TRUE)
[Figure: Box-Cox profile log-likelihood against λ for λ from -2 to 2, with the 95% confidence interval for λ marked]
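For reference, the Box-Cox family profiled by `boxcox()` is (y^λ - 1)/λ for λ ≠ 0 and log(y) for λ = 0, so an estimated λ close to 0 points towards a log transformation of the response. A minimal sketch of the transformation itself, using a hypothetical helper `box_cox()` (not part of MASS) and a few sales values taken from the `head()` output:

```r
# Hypothetical helper implementing the Box-Cox power transformation
box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

y <- c(7.59, 5.38, 4.46, 2.26)  # a few Global_Sales values from head(game_data)

# As lambda shrinks towards 0 the transformed values approach log(y)
cbind(lambda_1     = box_cox(y, 1),
      lambda_half  = box_cox(y, 0.5),
      lambda_small = box_cox(y, 1e-4),
      log_y        = log(y))
```

The last two columns are nearly identical, which is why λ = 0 is read as "take logs".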
• Take the natural log of the response variable
cor(rating, log(sales))
[1] 0.5469324
modelgame1 <- lm(log(sales) ~ rating)
Model Fitting
summary(modelgame1)
Call:
lm(formula = log(sales) ~ rating)

Residuals:
     Min       1Q   Median       3Q      Max
-2.78461 -0.93357 -0.03999  1.08761  2.94826

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.11176    0.84428  -8.423 4.10e-13 ***
rating       0.07349    0.01160   6.334 8.15e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.407 on 94 degrees of freedom
Multiple R-squared:  0.2991, Adjusted R-squared:  0.2917
F-statistic: 40.12 on 1 and 94 DF,  p-value: 8.148e-09
Fitted Regression Line
plot(rating, log(sales), pch = 16)
abline(modelgame1, col = 2)
[Figure: scatter plot of log(sales) (y-axis, -4 to 2) against rating (x-axis, 30 to 90) with the fitted regression line in red]
# identify(rating, log(sales))
# identify(rating, log(sales), labels = game_data$Name)
Diagnostics
par(mfrow = c(2,2))
plot(modelgame1, which = c(1,2,4))
plot(hatvalues(modelgame1), type = 'h', ylab="Leverage")
abline(h = 4/length(sales), lty = 2)
qf(0.5, 2, length(sales))  # median of an F(2, n) distribution, a rough reference value for Cook's distance
[1] 0.6981761
[Figure: 2 x 2 panel of diagnostic plots for modelgame1: Residuals vs Fitted, Q-Q Residuals, Cook's distance by observation number (vertical axis from 0.00 to 0.08), and leverage by index]
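The two reference values used in these diagnostics can be reproduced without the data. The sketch below assumes `n = 96` (inferred from the 94 residual degrees of freedom plus two coefficients) and `p = 2` parameters. The variant with `n - p` denominator degrees of freedom is included as an assumption about the more common textbook form, not something stated on the slides.

```r
n <- 94 + 2  # inferred sample size
p <- 2       # intercept and slope

leverage_cutoff <- 2 * p / n          # same as 4 / length(sales) in the slide code
cook_ref_slide  <- qf(0.5, p, n)      # as computed on the slide
cook_ref_alt    <- qf(0.5, p, n - p)  # assumed textbook variant: F(p, n - p)

round(c(leverage_cutoff = leverage_cutoff,
        cook_ref_slide  = cook_ref_slide,
        cook_ref_alt    = cook_ref_alt), 4)
```

The two versions of the Cook's distance reference differ only in the fourth decimal place here, and the Cook's distance axis in the plot stops below 0.1, so the conclusion does not depend on which one is used.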
plot(fitted(modelgame1), rstandard(modelgame1),
xlab="Fitted Values", ylab="Standardized Residuals")
abline(h=c(0, -2, 2), lty=2)
[Figure: standardized residuals against fitted values for modelgame1, with dashed horizontal lines at 0, -2 and 2]
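The dashed lines at ±2 are a rule of thumb: if the standardized residuals behave roughly like standard normal draws, only about 5% of them should fall outside that band. A small sketch of how many that would be here, assuming `n = 96` (inferred from the residual degrees of freedom) and using the normal approximation:

```r
n <- 94 + 2  # inferred sample size

prob_outside     <- 2 * pnorm(-2)    # P(|Z| > 2) for a standard normal Z
expected_outside <- n * prob_outside # expected count outside the +/-2 band

round(c(prob_outside = prob_outside, expected_outside = expected_outside), 3)
```

So roughly four or five points beyond the dashed lines would be unremarkable; many more, or a clear pattern, would suggest a problem with the model.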
Model Significance
• Is the model significant?
• Is β1 equal to 0?
• t test or F test
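With a single predictor the t test for the slope and the overall F test are the same test, since F = t². A minimal sketch using the rounded estimate and standard error printed in `summary(modelgame1)`; the variable names are illustrative, and small discrepancies against the printed output come from that rounding.

```r
# Values copied from the coefficient table of summary(modelgame1)
b1       <- 0.07349  # slope estimate for rating
se_b1    <- 0.01160  # its standard error
df_resid <- 94       # residual degrees of freedom

t_stat  <- b1 / se_b1                           # test statistic for H0: beta1 = 0
p_value <- 2 * pt(-abs(t_stat), df = df_resid)  # two-sided p-value
f_stat  <- t_stat^2                             # equals the F statistic in simple regression

c(t_stat = t_stat, f_stat = f_stat, p_value = p_value)
```

The p-value is far below 0.05, so H0: β1 = 0 is rejected and the model is significant.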
Model Fitting
summary(modelgame1)
Call:
lm(formula = log(sales) ~ rating)

Residuals:
     Min       1Q   Median       3Q      Max
-2.78461 -0.93357 -0.03999  1.08761  2.94826

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -7.11176    0.84428  -8.423 4.10e-13 ***
rating       0.07349    0.01160   6.334 8.15e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.407 on 94 degrees of freedom
Multiple R-squared:  0.2991, Adjusted R-squared:  0.2917
F-statistic: 40.12 on 1 and 94 DF,  p-value: 8.148e-09
ANOVA
anova(modelgame1)
Analysis of Variance Table

Response: log(sales)
          Df  Sum Sq Mean Sq F value    Pr(>F)
rating     1  79.421  79.421   40.12 8.148e-09 ***
Residuals 94 186.080   1.980
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
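The ANOVA table ties together several numbers from the model summary. The sketch below rebuilds the F statistic, R-squared and the residual standard error from the two sums of squares printed in the table; the variable names are illustrative.

```r
# Values copied from the ANOVA table
ss_reg <- 79.421   # sum of squares for rating
ss_res <- 186.080  # residual sum of squares
df_reg <- 1
df_res <- 94

ms_reg <- ss_reg / df_reg
ms_res <- ss_res / df_res

f_stat    <- ms_reg / ms_res             # F value
r_squared <- ss_reg / (ss_reg + ss_res)  # share of variation explained
resid_se  <- sqrt(ms_res)                # residual standard error

round(c(f_stat = f_stat, r_squared = r_squared, resid_se = resid_se), 4)
```

These reproduce the F statistic (40.12), the multiple R-squared (0.2991) and the residual standard error (1.407) reported for the log model.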
Model Interpretation
• The fitted model
$\widehat{\log(Y)} = -7.1118 + 0.0735X$
• A one-unit increase in the game rating corresponds to an estimated
0.0735 unit increase in the mean of the log of sales.
• Or
$\hat{Y} = e^{-7.1118 + 0.0735X} = e^{-7.1118} \times e^{0.0735X}$
• A one-unit increase in the game rating multiplies the estimated
mean of sales by exp(0.0735) ≈ 1.076.
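The multiplicative reading of the slope can be turned into percentages directly. A small sketch using the rounded slope 0.0735 from the fitted model; the 10-point comparison is an added illustration, not a figure from the slides.

```r
b1 <- 0.0735  # slope on the log scale, from the fitted model

mult_1  <- exp(b1)             # multiplier for a 1-point increase in rating
pct_1   <- 100 * (mult_1 - 1)  # the same effect as a percentage change
mult_10 <- exp(10 * b1)        # multiplier for a 10-point increase in rating

round(c(mult_1 = mult_1, pct_1 = pct_1, mult_10 = mult_10), 3)
```

Each extra rating point is associated with roughly 7.6% higher estimated sales, and a 10-point difference with roughly double the estimated sales.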
Intervals
Xh <- seq(30, 95, by = 2)
ci <- predict(modelgame1, newdata = data.frame(rating = Xh),
              interval = "confidence")
# Note: naming this object pi masks R's built-in constant pi for the session
pi <- predict(modelgame1, newdata = data.frame(rating = Xh),
              interval = "prediction")
plot(rating, log(sales), ylim = c(-8, 2.5))
abline(modelgame1)
lines(Xh, ci[,2], lty = 2, col = 2)  # lower 95% confidence limit
lines(Xh, ci[,3], lty = 2, col = 2)  # upper 95% confidence limit
lines(Xh, pi[,2], lty = 2, col = 4)  # lower 95% prediction limit
lines(Xh, pi[,3], lty = 2, col = 4)  # upper 95% prediction limit
legend(30, 2, c("95CI", "95PI"),
       col = c(2, 4), lty = c(2, 2))
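[Figure: log(sales) against rating with the fitted line, the 95% confidence band (red dashed, legend "95CI") and the 95% prediction band (blue dashed, legend "95PI"); y-axis from -8 to 2]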
CI and PI at Xh = 60
predict(modelgame1, newdata = data.frame(rating = 60),
        interval = "confidence")
        fit       lwr       upr
1 -2.702424 -3.094905 -2.309944
predict(modelgame1, newdata = data.frame(rating = 60),
        interval = "prediction")
        fit       lwr       upr
1 -2.702424 -5.523441 0.1185926
• When the rating is 60, the estimated mean of log sales / the predicted log of sales
is -2.7024.
• When the rating is 60, we are 95% confident that the mean value of log sales is
between -3.0949 and -2.3099.
• We are 95% confident that the log of sales for a game with a rating of 60 is between
-5.5234 and 0.1186.
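The prediction interval is much wider than the confidence interval because it adds the variability of a single game around the mean to the uncertainty in the estimated mean. The sketch below recovers both standard errors from the printed intervals and checks that se_pred² ≈ s² + se_fit², where `s = 1.407` is the rounded residual standard error from the summary. The variable names are illustrative.

```r
# Values copied from the predict() output at rating = 60
fit      <- -2.702424
ci_60    <- c(-3.094905, -2.309944)  # 95% confidence interval
pi_60    <- c(-5.523441, 0.1185926)  # 95% prediction interval
s        <- 1.407                    # residual standard error (rounded)
df_resid <- 94

t_crit  <- qt(0.975, df_resid)        # critical value behind both intervals
se_fit  <- diff(ci_60) / (2 * t_crit) # standard error of the estimated mean
se_pred <- diff(pi_60) / (2 * t_crit) # standard error of a new observation

c(se_fit = se_fit,
  se_pred_from_interval = se_pred,
  se_pred_from_formula  = sqrt(s^2 + se_fit^2))
```

The last two values agree up to rounding, and both intervals are centred on the same fitted value.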
Intervals
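# Back-transform the fitted line and the interval limits to the sales scale
plot(rating, sales)
lines(Xh, exp(ci[,1]))                    # fitted curve
lines(Xh, exp(ci[,2]), lty = 2, col = 2)  # 95% confidence limits
lines(Xh, exp(ci[,3]), lty = 2, col = 2)
lines(Xh, exp(pi[,2]), lty = 2, col = 4)  # 95% prediction limits
lines(Xh, exp(pi[,3]), lty = 2, col = 4)
legend(30, 6, c("95CI", "95PI"),
       col = c(2, 4), lty = c(2, 2))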
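[Figure: sales against rating with the back-transformed fitted curve, the 95% confidence band (red dashed, legend "95CI") and the 95% prediction band (blue dashed, legend "95PI"); y-axis from 0 to 6]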
CI and PI at Xh = 60
exp(predict(modelgame1, newdata = data.frame(rating = 60),
            interval = "confidence"))
         fit        lwr        upr
1 0.06704279 0.04527932 0.09926684
exp(predict(modelgame1, newdata = data.frame(rating = 60),
            interval = "prediction"))
         fit         lwr      upr
1 0.06704279 0.003992087 1.125911
• When the rating is 60, the estimated mean of sales / the predicted sales is
about 0.0670 million.
• When the rating is 60, we are 95% confident that the mean value of sales is
between 0.0453 and 0.0993 million.
• We are 95% confident that the sales for a game with a rating of 60 are between
0.0040 and 1.1259 million.
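Back-transforming simply exponentiates the fit and each interval endpoint, which is why the intervals on the sales scale are no longer symmetric around the fit. A minimal sketch using the log-scale values printed earlier for a rating of 60; the vector names are illustrative.

```r
# Log-scale fit and interval limits at rating = 60, copied from the predict() output
log_scale <- c(fit = -2.702424,
               ci_lwr = -3.094905, ci_upr = -2.309944,
               pi_lwr = -5.523441, pi_upr = 0.1185926)

sales_scale <- exp(log_scale)  # back-transform to sales in millions
round(sales_scale, 4)

# Symmetric on the log scale becomes symmetric in ratio on the sales scale
ratio_up   <- sales_scale[["ci_upr"]] / sales_scale[["fit"]]
ratio_down <- sales_scale[["fit"]] / sales_scale[["ci_lwr"]]
c(ratio_up = ratio_up, ratio_down = ratio_down)
```

One caveat, offered as a general property rather than something the slides state: if the errors on the log scale are normal, exp(fit) is strictly an estimate of the median of sales rather than the mean, so the back-transformed values are best read with that in mind.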
Read 10.2-10.4.