Date: 2022-11-19

STP 429
Lab #2

Executive Summary
The Baseball Reference website is a fantastic resource for writers, statisticians, and
baseball fans. It makes statistics and biographical information about Major League Baseball
players and managers available with a single mouse click. The year-by-year batting statistics
on this website provide multiple factors that are associated with team wins, such as runs, hits,
doubles, triples, home runs, runs batted in, stolen bases, walks, strikeouts, batting average,
and age of batters. In general, a higher chance of winning a game is strongly related to a higher
batting average, more hits (including doubles and triples) and runs, and higher runs
batted in totals.
To investigate the factors that have the most significant impact on the chance
of winning a baseball game, a study was conducted to predict the winning percentage
of the Chicago Cubs. The predictors considered critical for predicting the
winning percentage are hits (H), runs (R), walks (BB), and runs batted in (RBI). A statistical
analysis was performed to determine the most important of these factors in order to build a
model to predict the winning percentage over the period 1998 to 2022 (excluding 2020).
Using regression techniques that help analyze each of the predictors and their
relevance, I conclude that from 1998 to 2022 (excluding 2020), RBI and BB can be
used to predict winning percentage. The following report includes the details of the
analysis. The model that was developed will help baseball teams evaluate their
performance and increase their chance of winning games.
Introduction
This study was developed to determine whether there are factors in baseball
games that are related to winning, so that we might confidently build a model to predict the
winning percentage for a team. Several factors that might have a strong relationship with
winning, such as H, R, BB and RBI from 1998 to 2022 (excluding 2020), were considered in the
development of the model for this study. Runs batted in and walks are major components of
winning a game, so being able to better predict winning percentage from them will help the
Chicago Cubs achieve a higher probability of winning games in the following seasons.
Analysis
In order to determine the best model for predicting wins, I first had to determine whether
there was an association between each candidate variable and wins. A
correlation matrix allowed me to determine the strength of the relationship between the
independent variables and wins. Graphical representations of each independent variable vs.
wins visually represent the strength of the association and make it possible to identify
potential outliers. Once the four independent variables having the highest association with
wins were chosen, regression analyses were performed to identify the best model
(two predictors and one potential interaction) for predicting wins. The model was selected
based on root mean square error (RMSE), adjusted R-square, AIC (Akaike information criterion,
an estimate of the information lost when a given model approximates the true model), and sum
of squares due to error (SSE). The two-predictor model that yields the lowest RMSE and SSE,
the maximum adjusted R-square, and the smallest AIC is selected. Furthermore, all possible
two-way interactions between the selected predictors are tested for significance and
evaluated to check whether any interaction should be included in the model.
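The two-predictor search described above can be sketched in SAS as follows. This is a minimal sketch, not necessarily the exact call used for the report: it assumes the data set PROJECT (with W, R, RBI, BB and H) built in the SAS appendix, and it uses the START= and STOP= options of PROC REG to restrict the all-subsets search to models with exactly two predictors.

```sas
/* Assumption: PROJECT is the analysis data set created in the SAS appendix. */
proc reg data=project;
    /* START=2 STOP=2 limits the subsets reported to two-predictor models. */
    /* RMSE, SSE and AIC request those criteria for each candidate model.  */
    model W = R RBI BB H / selection=adjrsq start=2 stop=2 rmse sse aic;
run;
quit;
```

The output lists every pair of predictors ranked by adjusted R-square, so the four criteria can be compared side by side.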
Data Section
Descriptive statistics for the four predictor variables considered for the model are in the
following table:
Variable    N     Mean           Std Dev       Minimum        Maximum
R           24    728.9583333    71.3274623    602.0000000    855.0000000
RBI         24    694.0833333    69.2047477    570.0000000    811.0000000
BB          24    530.5000000    78.0980432    395.0000000    656.0000000
H           24    1408.29        80.7817668    1255.00        1552.00

Runs have a standard deviation of 71 and range from 602 to 855, showing that the Chicago
Cubs' run production varies considerably from year to year. This is further supported by the
wide RBI range, from 570 to 811, with a standard deviation of 69. Walks and hits also vary
substantially over the years.
Results
The correlation matrix in Table 1.1 for the period 1998 to 2022 (excluding 2020) shows that
R and RBI are the variables most strongly correlated with the dependent variable, W (wins):
each has a p-value of 0.0004, the smallest among the four selected variables. W and R have a
linear correlation coefficient of 0.66 (0.67 for W and RBI). The strength of this association
is further supported by the scatterplot matrix in Table 2.1. The strong, linear and positive
association is seen as the data points are fairly close together and follow a positive trend.
The strong association makes sense, since a player is awarded a run when he crosses home
plate to score for his team. BB and H, while positively associated with wins, show weaker
associations, as their points are spread out a little further (Table 2.1).
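As a worked check on the correlation between W and R, the test statistic behind the reported p-value can be reconstructed from r and N alone. The formula below is the standard t-test for a Pearson correlation under H0: Rho=0; the inputs r = 0.66191 and n = 24 come from Table 1.1.

```latex
t = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}}
  = \frac{0.66191\,\sqrt{22}}{\sqrt{1-0.66191^{2}}}
  \approx \frac{3.105}{0.7496}
  \approx 4.14,
\qquad \text{df} = n - 2 = 22
```

A t value of about 4.14 on 22 degrees of freedom is consistent with the two-sided p-value of 0.0004 that Table 1.1 reports.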
Two-variable model selection resulted in a model that included RBI and BB because
it has the lowest RMSE (9.23) and the lowest AIC (109.58). Testing the only possible two-way
interaction between these predictors (RBI*BB) gives a p-value of 0.91, which is not significant.
In addition, the RMSE of the model with the interaction is 9.67, which is larger than the 9.23
of the model without it. Given that the interaction term is not significant and the model
with the interaction has a higher prediction error, the model without the interaction is
chosen. The final model is the following:
Wins = -0.36752 + 0.09930 (RBI) + 0.02205 (BB)
Parameter Estimates
Variable     DF    Parameter Estimate    Standard Error    t Value    Pr > |t|
Intercept    1     -0.36752              19.89342          -0.02      0.9854
RBI          1      0.09930               0.04137           2.40      0.0257
BB           1      0.02205               0.03666           0.60      0.5539
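As a worked example of using the fitted equation, the short DATA step below plugs in the sample means of RBI and BB from the descriptive statistics table. The data set name predict_example and the variable name predicted_wins are illustrative only and do not appear in the original analysis.

```sas
/* Illustrative only: evaluate the fitted equation at the sample means. */
data predict_example;
    RBI = 694.0833333;   /* mean RBI from the descriptive statistics */
    BB  = 530.5;         /* mean BB from the descriptive statistics  */
    predicted_wins = -0.36752 + 0.09930*RBI + 0.02205*BB;
    put predicted_wins= 8.2;   /* writes the value to the SAS log */
run;
```

With these inputs the equation gives -0.36752 + 68.92 + 11.70, or about 80.25 predicted wins for a season with average RBI and walk totals.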

The final model has an R-square of 0.45, indicating that 45% of the variation in wins is
explained by RBI and BB. This is a relatively low R-square, which shows that a fair
amount of variation remains to be accounted for by other factors. The overall F-statistic
is 8.69 with a p-value less than 0.018, which is significant. Of the two predictors, only RBI
is individually significant (p = 0.0257); BB is not (p = 0.5539).
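The relationship between R-square and the overall F-statistic can be checked by hand. The sketch below uses the standard formulas with n = 24 observations and k = 2 predictors, and takes R-square as the rounded value 0.45 quoted in the text, so the results are approximate.

```latex
F = \frac{R^{2}/k}{(1-R^{2})/(n-k-1)}
  = \frac{0.45/2}{0.55/21}
  \approx 8.59,
\qquad
R^{2}_{\text{adj}} = 1 - (1-R^{2})\,\frac{n-1}{n-k-1}
  = 1 - 0.55 \cdot \frac{23}{21}
  \approx 0.40
```

The small gap between 8.59 and the reported 8.69 is what one would expect from using an R-square rounded to two decimals.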
The residual plot in Table 3.1 shows a slight pattern, as mid-range predicted
values have larger residuals. The normal probability plot in Table 3.1 shows that, while
the points are generally close to the line, they deviate from it in the middle part of the
graph. I would also look at a histogram of the residuals to confirm that the normality
assumption is not violated.
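One way to carry out the residual histogram check mentioned above is sketched here. It assumes the data set PROJECT from the SAS appendix; the names resid_out, pred and resid are introduced for illustration.

```sas
/* Refit the final model quietly and save predicted values and residuals. */
proc reg data=project noprint;
    model W = RBI BB;
    output out=resid_out p=pred r=resid;
run;
quit;

/* Histogram, Q-Q plot and formal normality tests for the residuals. */
proc univariate data=resid_out normal;
    var resid;
    histogram resid / normal;
    qqplot resid / normal(mu=est sigma=est);
run;
```

The NORMAL option adds tests such as Shapiro-Wilk, which give a numeric complement to the visual check; with only 24 observations these tests have limited power, so the plots remain the main evidence.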
In conclusion, I am confident that the model with the lowest prediction error for 1998 to 2022
(excluding 2020) predicts winning percentage using runs batted in and walks.
Future Work
Because these are longitudinal data, observations may be strongly correlated with each
other, and a simple multiple regression model may fail to take this correlation into
consideration. As a result, even though time may not be strongly correlated with the dependent
variable wins, it would be better to force time into the model as a covariate in future work.
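A minimal sketch of this extension is shown below. It assumes the data set PROJECT from the SAS appendix and its YEAR variable (the one used there to exclude 2020); project_sorted is an illustrative name. The DWPROB option requests the Durbin-Watson statistic with its p-values as one simple check for serial correlation in the residuals.

```sas
/* Order the seasons chronologically so the Durbin-Watson test is meaningful. */
proc sort data=project out=project_sorted;
    by year;
run;

/* Force time into the model as a covariate alongside RBI and BB. */
proc reg data=project_sorted;
    model W = year RBI BB / dwprob;
run;
quit;
```

This treats time as a linear trend only. Because 2020 is excluded, the seasons are not evenly spaced, which is a limitation of this simple check.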

Appendix
Table 1.1 – Correlation Matrix for Period 1998 to 2022 (excluding 2020)
Pearson Correlation Coefficients, N = 24
Prob > |r| under H0: Rho=0
        W          R          RBI        BB         H
W       1.00000    0.66191    0.66580    0.55007    0.42018
                   0.0004     0.0004     0.0054     0.0409
R       0.66191    1.00000    0.99645    0.72434    0.62049
        0.0004                <.0001     <.0001     0.0012
RBI     0.66580    0.99645    1.00000    0.72587    0.60463
        0.0004     <.0001                <.0001     0.0018
BB      0.55007    0.72434    0.72587    1.00000    0.12058
        0.0054     <.0001     <.0001                0.5746
H       0.42018    0.62049    0.60463    0.12058    1.00000
        0.0409     0.0012     0.0018     0.5746

Table 2.1 – Scatterplot Matrix for Period 1998 to 2022 (excluding 2020)

Table 3.1 – Diagnostic Plots for Final Model for Period 1998 to 2022 (excluding 2020)

SAS Codes:

proc import datafile = "C:\Users\china\Downloads\sportsref_download.csv"
out = data
DBMS = csv replace;
run;

proc print data = data;
run;

%macro DummyVars(DSIn, /* the name of the input data set */
VarList, /* the names of the categorical variables */
DSOut); /* the name of the output data set */
/* 1. add a fake response variable */
data AddFakeY / view=AddFakeY;
set &DSIn;
_Y = 0; /* add a fake response variable */
run;
/* 2. Create the design matrix. Include the original variables, if desired
*/
proc glmselect data=AddFakeY NOPRINT
outdesign(addinputvars)=&DSOut(drop=_Y);
class &VarList;
model _Y = &VarList / noint selection=none;
run;
%mend;

%DummyVars(data,LG, data1);

PROC PRINT data = data1;
run;

data data2(drop = LG);
set data1;
run;

data part1;
set data2;
where 1998 <= year <= 2019;
run;

data part2;
set data2;
where 2021 <= year <= 2022;
run;

data project;
set part1 part2;
run;

proc corr data = project;
run;

*choose R, RBI, BB, H ;

proc corr data = project nomiss plots=matrix(histogram);
var W R RBI BB H;
run;

proc means data = project;
var R RBI BB H;
RUN;

data project1;
set project;
R_RBI = R*RBI;
R_BB = R*BB;
R_H = R*H;
RBI_BB = RBI*BB;
RBI_H = RBI*H;
BB_H = BB*H;
run;

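/* All-subsets regression with an intercept; fit statistics saved to est2 */
proc reg data=project1 outest=est2;
model W = R RBI BB H /
selection=adjrsq sse aic adjrsq;
output out=out p=p r=r;
run; quit;

/* Same search without an intercept (noint); fit statistics saved to est3 */
proc reg data=project1 outest=est3;
model W = R RBI BB H /
noint selection=adjrsq sse aic adjrsq;
output out=out p=p r=r;
run; quit;

/* Stack both sets of candidate models and list them by increasing RMSE */
data both1; set est2 est3; run;
proc sort data=both1; by _rmse_; run;
proc print data=both1; run;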

PROC reg data = project1;
model W = RBI BB;
run; *RMSE = 9.2272;

PROC reg data = project1;
model W = RBI BB RBI_BB ;
run;