STAT 3450 Assignment 1 (35 points)
Quan Yuan
Banner: B00923505
Intro
In this assignment, you will use three datasets:
1. a1d1.csv
2. a1d2.csv
3. a1d3.csv
Please download them from BS and place them in the same folder as this Rmd file
before working on your assignment.
This assignment is composed of 4 parts:
1. Problem-1 (logistic regression): [10 pts]
2. Problem-2 (knn and naive bayes): [8 pts]
3. Problem-3 (discriminant analysis): [5 pts]
4. Problem-4 (logistic regression): [12 pts]
You will need to load a few libraries. These libraries have already been used in the
lectures. Make sure they are installed on your computer.
For example, the glm function belongs to the stats package, therefore I load this
package in order to fit generalized linear models. Using the argument quietly = T
disables some extra messages and makes the html/pdf report a bit tidier. I add some
options to the code chunk to make sure that verbosity is reduced in the report:
library(stats, quietly=T)
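The later problems also use the class, caret, e1071, MASS, ggplot2 and dplyr packages (all of them appear in the code below). A minimal sketch for checking that they are available before knitting; install.packages is only called for the ones that are missing:
# check that the packages used later in this assignment are installed
pkgs <- c("class", "caret", "e1071", "MASS", "ggplot2", "dplyr")
missing <- pkgs[!sapply(pkgs, requireNamespace, quietly = TRUE)]
if (length(missing) > 0) install.packages(missing)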
Problem 1 [10 pts]
TOPIC: Logistic regression
Question 1 [1 pt] [1/35]
Load and prepare the data.
The data used in this problem records the number of failures (column NFailed)
observed in a series of machines of the same type, at various temperatures (in
Fahrenheit) and pressures.
Read the data from file a1d1.csv. Call the resulting dataframe dc, and use str to print
the structure of dc.
dc <- read.csv("a1d1.csv")
Convert the Pressure column to a factor.
dc$Pressure <- as.factor(dc$Pressure)
Add a new column called Failure to the dataframe dc, containing the answer to the
boolean question: is NFailed positive?
dc$Failure <- dc$NFailed > 0
Question 2 [1 pt] [2/35]
Fit a logistic regression model of the variable Failure with Temperature as the only
predictor. Call the resulting model lrmod1, save the summary of lrmod1 to an
object called s1 and print s1.
lrmod1 <- glm(Failure~Temperature, family = "binomial", data = dc)
s1 <- summary(lrmod1)
s1
##
## Call:
## glm(formula = Failure ~ Temperature, family = "binomial", data = dc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.0611 -0.7613 -0.3783 0.4524 2.2175
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 15.0429 7.3786 2.039 0.0415 *
## Temperature -0.2322 0.1082 -2.145 0.0320 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 28.267 on 22 degrees of freedom
## Residual deviance: 20.315 on 21 degrees of freedom
## AIC: 24.315
##
## Number of Fisher Scoring iterations: 5
Extract the p-value of the summary for the regression coefficient of the Temperature
predictor. You can do this by extracting the second row and fourth column of the
coef function applied to s1.
s1$coefficients[2,4]
## [1] 0.0319561
Is the temperature effect statistically significant at the 5% level?
Yes, the p-value for Temperature (≈ 0.032) is less than 0.05.
Is the temperature effect statistically significant at the 1% level?
No, the p-value for Temperature (≈ 0.032) is greater than 0.01.
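A one-line programmatic check of both thresholds (a small aside, not required by the question):
# compare the Temperature p-value to the 5% and 1% significance levels
s1$coefficients[2, 4] < c(0.05, 0.01)  # TRUE, then FALSE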
Question 3 [1 pt] [3/35]
Does the sign of the regression coefficient associated with the Temperature variable
indicate that:
a. failure risk increases when temperature increases?
b. failure risk decreases when temperature increases?
Report the correct choice (a or b):
b
Question 4 [1 pt] [4/35]
Compute and print the estimated Odds ratios and 95% confidence interval of the
coefficients of the model lrmod1.
exp(coefficients(lrmod1))
## (Intercept) Temperature
## 3.412315e+06 7.928171e-01
exp(confint(lrmod1, level=0.95))
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 27.9546841 8.214986e+14
## Temperature 0.5972188 9.409919e-01
By how much do the odds of failure change (increase or decrease: choose the correct
option) if we decrease the temperature by 10°F?
# computation here:
exp(coefficients(lrmod1))[2]^10
## Temperature
## 0.09811378
increase or decrease ? by a factor of ?
Increase. The factor computed above (≈ 0.098) is the multiplier for a 10°F increase;
for a 10°F decrease the odds are multiplied by 1/0.0981 ≈ 10.2, i.e. the odds of
failure increase roughly tenfold.
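A direct computation for the decrease case (a small check; it is simply the reciprocal of the value printed above):
# odds multiplier for a 10°F decrease in temperature
exp(-10 * coef(lrmod1)["Temperature"])  # approximately 10.2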
Question 5 [2 pts] [6/35]
We want to produce a plot that shows the following elements:
1. the temperatures in F are on the x-axis
2. observations (Failure=0 or 1) are shown on the plot
3. the curve of the probability of Failure=1, as a function of Temperature
4. a horizontal red line passing through all failure observations (Failure=1)
5. a horizontal green line passing through all non-failure observations
(Failure=0)
6. 3 vertical lines from 0 to 1 passing through the temperatures 31, 41 and 51
(Fahrenheit)
HINT-1: to add the curve, you can call the function curve like this:
curve(f(x), add=TRUE)
where you will have to replace f(x) by the function predicted by the model.
HINT-2: Remember that, for a univariate logistic regression model with parameters
b0 and b1, the prediction formula is:
$$P(Y = 1 \mid x) = \frac{\exp(b_0 + b_1 x)}{1 + \exp(b_0 + b_1 x)}$$
# call the plot function
# use this label for the x-axis: Temperature (°F)
# use this label for the y-axis: Failure probability
# use the temperature range 25,90
# use a point choice (pch) of 16
plot(dc$Temperature, dc$Failure, xlab = "Temperature (°F)",
     ylab = "Failure probability", xlim = c(25, 90), pch = 16)
# add the logistic curve (call the curve function as described in the text)
curve(exp(s1$coefficients[1, 1] + s1$coefficients[2, 1] * x) /
        (1 + exp(s1$coefficients[1, 1] + s1$coefficients[2, 1] * x)), add = TRUE)
# use abline to draw the three vertical segments:
# use the function abline to draw a red vertical line that goes through
# the temperatures T=51, T=41 and T=31. Use line type 1 (lty=1)
abline(v = c(51, 41, 31), lty = 1, col = "red")
# now use abline to draw the two horizontal segments (height of 0 or 1)
# respect the colors
abline(h = 1, col = "red")
abline(h = 0, col = "green")
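Equivalently (an aside, not part of the graded answer), the same curve can be drawn with plogis from the stats package, since plogis(eta) = exp(eta) / (1 + exp(eta)):
# same fitted curve via the logistic CDF applied to the linear predictor
curve(plogis(coef(lrmod1)[1] + coef(lrmod1)[2] * x), add = TRUE)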
Question 6 [1 pt] [7/35]
Use the predict function to compute the predicted probabilities of failure at the
temperatures 51, 41 and 31 F. Save this to a vector predp1 and print predp1.
predp1 <- predict(lrmod1, newdata = data.frame(Temperature = c(51, 41, 31)),
                  type = "response")
print(predp1)
## 1 2 3
## 0.9609321 0.9960269 0.9996088
Question 7 [1 pt] [8/35]
We would like to know how much these probabilities would change if the dataset
had not included all the records.
You can sample 2 random records among those who fail like this:
sample(which(dc$Failure==1), 2)
Initialize an empty vector pp. Set the variable k to 2 (the number of records to
remove from the data).
Use a for loop with index i running from 1 to 10. For each iteration i, do the
following:
1. sample a set is of k records among the failing records
2. fit a logistic regression model named lrmod2 of Failure as a function of
Temperature on the subset of the dc dataframe that excludes the records in is
3. compute the predicted probability of failure at T=41 F (save this to a variable
named p)
4. append p to the vector pp
5. after the loop, print the vector pp.
pp <- c()
k <- 2
for(i in 1:10) {
  is <- sample(which(dc$Failure == 1), k)
  lrmod2 <- glm(Failure ~ Temperature, family = "binomial", data = dc[-is, ])
  p <- predict(lrmod2, newdata = data.frame(Temperature = 41), type = "response")
  pp <- c(pp, p)
}
print(pp)
##         1         1         1         1         1         1         1         1
## 0.9960551 0.9997801 0.9940336 0.9996740 0.9806156 0.9800026 0.9853660 0.9806156
##         1         1
## 0.9999882 0.9960551
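To summarise how much the predicted probability moves across these resampled fits (an optional aside, not required by the question):
# spread of the ten leave-two-out predictions at T = 41
range(pp)
sd(pp)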
Question 8 [2 pts] [10/35]
We want to know if pressure plays a role in the failure of the machines.
Fit a logistic regression model (lrmodf) of Failure as a function of Temperature and
Pressure. Use the function anova to compare the simple model (lrmod1) and more
complex model (lrmodf).
Which model should we keep?
lrmodf <- glm(Failure ~ Temperature + Pressure, family = "binomial", data = dc)
anova(lrmod1, lrmodf)
## Analysis of Deviance Table
##
## Model 1: Failure ~ Temperature
## Model 2: Failure ~ Temperature + Pressure
## Resid. Df Resid. Dev Df Deviance
## 1 21 20.315
## 2 19 18.971 2 1.3438
# Adding Pressure reduces the residual deviance by only 1.34 on 2 degrees of
# freedom, which is not significant (a chi-squared test gives p ≈ 0.51), so we
# should keep the simpler model lrmod1.
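A quick way to attach a p-value to this comparison (a sketch, not part of the original output):
# likelihood-ratio test for the nested models
anova(lrmod1, lrmodf, test = "Chisq")
# equivalently, from the deviance difference printed above:
pchisq(1.3438, df = 2, lower.tail = FALSE)  # approximately 0.51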
Problem 2 [8 pts]
TOPIC: K-nearest neighbours and naive bayes classifiers.
Question 1 [2 pts] [12/35]
Read the file a1d2.csv into a dataframe named df. Print the top of the dataframe and
the structure of the dataframe (use the function str).
df <- read.csv("a1d2.csv")
print(head(df))
##   y      a     b      c      d      e      f      g      h      i      j
## 1 3 -4.058 1.072  0.157  1.136  0.385 -0.687 -0.583 -0.185 -0.300  1.177
## 2 5 -2.563 2.829 -0.283 -0.368 -1.593  0.408  0.121  0.933 -0.946 -0.356
## 3 3 -3.232 1.551 -0.832  1.674 -0.042  0.296 -0.553 -0.443 -0.413  0.960
## 4 8 -3.840 3.745  0.121 -0.656 -1.066  0.329  0.734  0.875 -0.295  0.418
## 5 1 -4.047 0.913  0.734  1.479  0.258  0.206 -1.117 -0.423 -0.336  0.982
## 6 8 -3.958 3.922  0.150 -0.265 -1.201 -0.249  0.202  1.163 -0.049 -0.039
print(str(df))
## 'data.frame': 720 obs. of 11 variables:
## $ y: int 3 5 3 8 1 8 5 7 2 7 ...
## $ a: num -4.06 -2.56 -3.23 -3.84 -4.05 ...
## $ b: num 1.072 2.829 1.551 3.745 0.913 ...
## $ c: num 0.157 -0.283 -0.832 0.121 0.734 ...
## $ d: num 1.136 -0.368 1.674 -0.656 1.479 ...
## $ e: num 0.385 -1.593 -0.042 -1.066 0.258 ...
## $ f: num -0.687 0.408 0.296 0.329 0.206 -0.249 0.357 -0.215 0.257 0.195 ...
## $ g: num -0.583 0.121 -0.553 0.734 -1.117 ...
## $ h: num -0.185 0.933 -0.443 0.875 -0.423 ...
## $ i: num -0.3 -0.946 -0.413 -0.295 -0.336 -0.049 0.109 -0.641 -0.707 0.958 ...
## $ j: num 1.177 -0.356 0.96 0.418 0.982 ...
## NULL
We will treat the column y of dataframe df as a response. Convert this column to a
factor and use is.factor to check if y is now a factor.
df$y <- as.factor(df$y)
is.factor(df$y)
## [1] TRUE
Compute the number of rows of df and store it in a variable named n. We want to keep half of
this number for the training set. Compute the size of the training set and store it in a
variable named nt.
n <- nrow(df)
nt <- n/2
Do not change this
set.seed(135)
Now use the sample function to compute an array named index1 of nt random row
numbers among n rows. Use the row numbers in index1 to define a training
dataframe named traind and place the complement in a testing dataframe named
testd. Keep all columns of df in these dataframes.
index1 <- sample(1:n, nt)
traind <- df[index1,]
testd <- df[-index1,]
Question 2 [1 pt] [13/35]
Load the class library. Use the knn function to fit a KNN model. Use the square root
of the number of training data as the k parameter (number of neighbours). Store the
result of the function into an object called knnmodel.
Print a summary of this object.
library(class)
knnmodel <- knn(train = traind, test = testd, cl = traind$y, k = sqrt(nt))
summary(knnmodel)
## 1 2 3 4 5 6 7 8
## 35 61 46 48 48 40 40 42
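Note that traind and testd still contain the response column y, so the call above effectively uses y itself (its levels 1 to 8, coerced to numbers) as one of the features. If the intent was to classify from the ten numeric predictors only, a variant would be the sketch below (knnmodel2 is an illustrative name; the accuracy figures reported next would change):
# variant that excludes the response column from the feature matrices
knnmodel2 <- knn(train = traind[, names(traind) != "y"],
                 test  = testd[, names(testd) != "y"],
                 cl = traind$y, k = sqrt(nt))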
Question 3 [1 pt] [14/35]
Load the caret library. Use the confusionMatrix function to compute a confusion
matrix for the knn classifier. You will have to apply the function to the knn
predictions and to the ytest vector, which is the y column of the dataframe testd.
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
confusionMatrix(knnmodel, testd$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4 5 6 7 8
## 1 35 0 0 0 0 0 0 0
## 2 14 47 0 0 0 0 0 0
## 3 0 3 38 5 0 0 0 0
## 4 0 0 0 48 0 0 0 0
## 5 0 0 0 0 48 0 0 0
## 6 0 0 0 0 0 40 0 0
## 7 0 0 0 0 0 0 38 2
## 8 0 0 0 0 0 0 3 39
##
## Overall Statistics
##
## Accuracy : 0.925
## 95% CI : (0.8928, 0.95)
## No Information Rate : 0.1472
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9142
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity           0.71429   0.9400   1.0000   0.9057   1.0000   1.0000
## Specificity           1.00000   0.9548   0.9752   1.0000   1.0000   1.0000
## Pos Pred Value        1.00000   0.7705   0.8261   1.0000   1.0000   1.0000
## Neg Pred Value        0.95692   0.9900   1.0000   0.9840   1.0000   1.0000
## Prevalence            0.13611   0.1389   0.1056   0.1472   0.1333   0.1111
## Detection Rate        0.09722   0.1306   0.1056   0.1333   0.1333   0.1111
## Detection Prevalence  0.09722   0.1694   0.1278   0.1333   0.1333   0.1111
## Balanced Accuracy     0.85714   0.9474   0.9876   0.9528   1.0000   1.0000
## Class: 7 Class: 8
## Sensitivity 0.9268 0.9512
## Specificity 0.9937 0.9906
## Pos Pred Value 0.9500 0.9286
## Neg Pred Value 0.9906 0.9937
## Prevalence 0.1139 0.1139
## Detection Rate 0.1056 0.1083
## Detection Prevalence 0.1111 0.1167
## Balanced Accuracy 0.9603 0.9709
Question 4 [1 pt] [15/35]
Fit a naive Bayes model to the training set.
For this, you will use the naiveBayes function from the e1071 package. In your call to
this function, provide a model formula (the response is the y column and we use all
predictors in the training dataframe). Save the resulting model in an object called
modelnb.
library(e1071)
modelnb <- naiveBayes(y~.,data = traind)
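As an aside (not required by the question), the fitted model's posterior class probabilities can be inspected with predict and type = "raw":
# posterior probabilities of each class for the first few training records
head(predict(modelnb, traind, type = "raw"))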
Question 5 [1 pt] [16/35]
Now use the predict function to calculate the predictions of model modelnb on the
training set. Save these predictions into an object named prednbtrain, and then use
the table function to build the contingency table comparing the true values of the
response on the training set (columns) and the predictions (rows).
prednbtrain=predict(modelnb, traind)
table(prednbtrain, traind$y)
##
## prednbtrain  1  2  3  4  5  6  7  8
##           1 37  4  0  0  0  0  0  0
##           2  4 33  3  0  0  0  0  0
##           3  0  3 47  5  1  3  0  0
##           4  0  0  2 27  0  2  0  0
##           5  0  0  0  3 32 11 11  0
##           6  0  0  0  2  5 34  1  0
##           7  0  0  0  0  3  0 32  7
##           8  0  0  0  0  1  0  5 42
Question 6 [1 pt] [17/35]
Use the confusionMatrix function to obtain more detailed information on the
accuracy of predictions on the training dataset.
confusionMatrix(prednbtrain,traind$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4 5 6 7 8
## 1 37 4 0 0 0 0 0 0
## 2 4 33 3 0 0 0 0 0
## 3 0 3 47 5 1 3 0 0
## 4 0 0 2 27 0 2 0 0
## 5 0 0 0 3 32 11 11 0
## 6 0 0 0 2 5 34 1 0
## 7 0 0 0 0 3 0 32 7
## 8 0 0 0 0 1 0 5 42
##
## Overall Statistics
##
## Accuracy : 0.7889
## 95% CI : (0.743, 0.8299)
## No Information Rate : 0.1444
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7583
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity            0.9024  0.82500   0.9038  0.72973  0.76190  0.68000
## Specificity            0.9875  0.97813   0.9610  0.98762  0.92138  0.97419
## Pos Pred Value         0.9024  0.82500   0.7966  0.87097  0.56140  0.80952
## Neg Pred Value         0.9875  0.97812   0.9834  0.96960  0.96700  0.94969
## Prevalence             0.1139  0.11111   0.1444  0.10278  0.11667  0.13889
## Detection Rate         0.1028  0.09167   0.1306  0.07500  0.08889  0.09444
## Detection Prevalence   0.1139  0.11111   0.1639  0.08611  0.15833  0.11667
## Balanced Accuracy      0.9449  0.90156   0.9324  0.85867  0.84164  0.82710
## Class: 7 Class: 8
## Sensitivity 0.65306 0.8571
## Specificity 0.96785 0.9807
## Pos Pred Value 0.76190 0.8750
## Neg Pred Value 0.94654 0.9776
## Prevalence 0.13611 0.1361
## Detection Rate 0.08889 0.1167
## Detection Prevalence 0.11667 0.1333
## Balanced Accuracy 0.81045 0.9189
Question 7 [1 pt] [18/35]
Repeat the same operations (computation of the predictions of the fitted naive
bayes model and call to the confusionMatrix function) but now for the testing set.
prednbtest=predict(modelnb,testd)
confusionMatrix(prednbtest,testd$y)
## Confusion Matrix and Statistics
##
## Reference
## Prediction 1 2 3 4 5 6 7 8
## 1 38 6 0 0 0 0 0 0
## 2 11 36 4 0 0 0 1 0
## 3 0 8 32 7 0 3 0 0
## 4 0 0 2 40 0 1 0 0
## 5 0 0 0 2 39 9 9 0
## 6 0 0 0 4 3 27 0 0
## 7 0 0 0 0 5 0 24 5
## 8 0 0 0 0 1 0 7 36
##
## Overall Statistics
##
## Accuracy : 0.7556
## 95% CI : (0.7078, 0.7991)
## No Information Rate : 0.1472
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7203
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: 1 Class: 2 Class: 3 Class: 4 Class: 5 Class: 6
## Sensitivity            0.7755   0.7200  0.84211   0.7547   0.8125  0.67500
## Specificity            0.9807   0.9484  0.94410   0.9902   0.9359  0.97813
## Pos Pred Value         0.8636   0.6923  0.64000   0.9302   0.6610  0.79412
## Neg Pred Value         0.9652   0.9545  0.98065   0.9590   0.9701  0.96012
## Prevalence             0.1361   0.1389  0.10556   0.1472   0.1333  0.11111
## Detection Rate         0.1056   0.1000  0.08889   0.1111   0.1083  0.07500
## Detection Prevalence   0.1222   0.1444  0.13889   0.1194   0.1639  0.09444
## Balanced Accuracy      0.8781   0.8342  0.89310   0.8725   0.8742  0.82656
## Class: 7 Class: 8
## Sensitivity 0.58537 0.8780
## Specificity 0.96865 0.9749
## Pos Pred Value 0.70588 0.8182
## Neg Pred Value 0.94785 0.9842
## Prevalence 0.11389 0.1139
## Detection Rate 0.06667 0.1000
## Detection Prevalence 0.09444 0.1222
## Balanced Accuracy 0.77701 0.9265
Problem 3 [5 pts]
TOPIC: Linear discriminant Analysis for classification problems.
We will use the first 8 columns of df as our dataframe for this problem. The y column
is, again, the response variable.
Keep the following code (remove the sharp sign # from the template line and run it):
np=8
traind2=df[,c(1:np)]
You can check that all covariates are of type numeric like this (again, remove the
sharp sign # and run):
sapply(df, class)
##         y         a         b         c         d         e         f         g
##  "factor" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
##         h         i         j
## "numeric" "numeric" "numeric"
Question 1 [2 pts] [20/35]
Convert the y column to a factor. Load the MASS library; it is required for the lda
function. Use this function to fit a linear discriminant model to the dataframe
traind2. Use all non-response columns as predictors. Save the resulting fitted model
into an object called ldam. Print this object.
traind2$y <- as.factor(traind2$y)
library(MASS)
ldam <- lda(y~.,data = traind2)
print(ldam)
## Call:
## lda(y ~ ., data = traind2)
##
## Prior probabilities of groups:
## 1 2 3 4 5 6 7 8
## 0.125 0.125 0.125 0.125 0.125 0.125 0.125 0.125
##
## Group means:
##           a         b          c          d           e         f           g
## 1 -3.336467 0.0259000 -0.3266333 1.37886667  0.34801111 1.2185889 -0.02000000
## 2 -2.943167 0.7204556 -0.5066556 0.96433333  0.28738889 0.9738889 -0.14170000
## 3 -2.522022 0.9430556 -0.7457778 0.81323333  0.02162222 0.4640889 -0.20356667
## 4 -2.261022 1.6315444 -0.8529111 0.32836667 -0.38687778 0.2521333 -0.01732222
## 5 -2.758656 2.5190333 -0.6948000 0.10976667 -0.83483333 0.5694444  0.13108889
## 6 -2.705522 1.9477000 -0.5223000 0.25952222 -0.49555556 0.4967556  0.08701111
## 7 -3.168922 2.6410778 -0.3632000 0.22528889 -0.78877778 0.4486889  0.10490000
## 8 -3.951533 3.4632444 -0.4209222 0.06214444 -0.91896667 0.4450444  0.16994444
##
## Coefficients of linear discriminants:
##           LD1         LD2       LD3        LD4         LD5        LD6
## a  0.22076005 -1.35432649 1.5603782  0.7165815 -0.20911984  0.4649818
## b  1.34492610  0.09636508 0.6477478  1.1038340 -0.16393059  0.4886433
## c  0.02482872  0.92633496 1.7996928  0.4540135 -1.05481220  0.3005804
## d -0.53666996  0.07536166 0.5544593  0.1674645 -0.03061012  1.3320405
## e -1.22511892 -0.35666803 0.1543936  1.3703772 -0.77214253 -0.3381840
## f -0.34193462  1.84612651 1.3247223  1.0235430  0.72865236  0.0352725
## g -0.18749888  0.56605650 1.3163988 -1.3223487 -0.76782875 -0.3125613
## LD7
## a -0.74762244
## b -1.03205587
## c 0.02773546
## d -1.74299877
## e -1.02822897
## f 0.14127187
## g -2.07779319
##
## Proportion of trace:
## LD1 LD2 LD3 LD4 LD5 LD6 LD7
## 0.7729 0.2023 0.0163 0.0057 0.0019 0.0007 0.0001
Question 2 [3 pts] [23/35]
Mimic the template code shown in Lecture 2 to plot the data points of ‘traind2’ in
the LD1,LD2 plane. You will need to apply the predict function to retrieve the LD1
and LD2 components calculated by the model.
You can use the ggplot2 library to produce this scatterplot.
library(ggplot2)
pred <- predict(ldam, traind2)
# create first a dataframe which appends the LD coordinates as columns
datawithLDcoords <- data.frame(traind2, LD1 = pred$x[, 1], LD2 = pred$x[, 2])
# plot the records of traind2 in LD1,LD2 space (scatterplot colored by class)
ggplot(data = datawithLDcoords, aes(x = LD1, y = LD2, color = y)) +
  geom_point() + xlab("LD1") + ylab("LD2")
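An equivalent base-graphics version (a sketch, reusing the same pred object):
# LD1/LD2 scatterplot coloured by class, using base R graphics
plot(pred$x[, 1], pred$x[, 2], col = traind2$y, pch = 16,
     xlab = "LD1", ylab = "LD2")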
Problem 4 [12 pts]
TOPIC: Logistic regression
We will use this code to define the data for this problem:
dgc=read.csv('a1d3.csv')
names(dgc)
## [1] "Creditability" "Account.Balance"
## [3] "Duration.of.Credit..month." "Payment.Status.of.Previous.
Credit"
## [5] "Purpose" "Credit.Amount"
## [7] "Value.Savings.Stocks" "Length.of.current.employme
nt"
## [9] "Instalment.per.cent" "Sex...Marital.Status"
## [11] "Guarantors" "Duration.in.Current.addres
s"
## [13] "Most.valuable.available.asset" "Age..years."
## [15] "Concurrent.Credits" "Type.of.apartment"
## [17] "No.of.Credits.at.this.Bank" "Occupation"
## [19] "No.of.dependents" "Telephone"
## [21] "Foreign.Worker"
new=c('response','balance','monthdur','payment','purpose','amount', 'sa
vings','joblength','instal','status','guarant','duration','asset','age',
'credit','apartment','numcred','job','ndep','phone','foreign')
length(new)
## [1] 21
names(dgc)=new
names(dgc)
## [1] "response" "balance" "monthdur" "payment" "purpose" "am
ount"
## [7] "savings" "joblength" "instal" "status" "guarant" "du
ration"
## [13] "asset" "age" "credit" "apartment" "numcred" "jo
b"
## [19] "ndep" "phone" "foreign"
Question 1 [2 pts] [25/35]
Load the dplyr library and use the sapply function to calculate the number of distinct
elements in each column of the dgc dataframe (the function counting distinct
elements is called n_distinct).
library(dplyr)
sapply(dgc, n_distinct)
##  response   balance  monthdur   payment   purpose    amount   savings joblength
##         2         4        33         5        10       923         5         5
##    instal    status   guarant  duration     asset       age    credit apartment
##         4         4         3         4         4        53         3         3
##   numcred       job      ndep     phone   foreign
##         4         4         2         2         2
Convert the columns response, foreign, status and instal columns of dgc to factors.
dgc$response <- as.factor(dgc$response)
dgc$foreign <- as.factor(dgc$foreign)
dgc$status <- as.factor(dgc$status)
dgc$instal <- as.factor(dgc$instal)
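Equivalently, since dplyr is already loaded, the four conversions can be done in one step (a sketch using mutate and across):
# convert the four columns to factors in a single call
dgc <- dgc %>% mutate(across(c(response, foreign, status, instal), as.factor))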
Keep and run the following code. This will define our training and testing set.
n3=nrow(dgc)
n3
## [1] 1000
nt3=0.8*n3
nt3
## [1] 800
set.seed(135)
index3<-sample(n3,nt3)
traind3<-dgc[index3,]
testd3 <-dgc[-index3,]
# generate ytrain and ytest: outcome vectors for the train and test data
ytrain3=dgc$response[ index3]
ytest3 =dgc$response[-index3]
Question 2 [2 pts] [27/35]
Use the glm function to fit a Logistic regression model. Use response as the
dependent variable, and monthdur, amount,age,foreign, status and instal as the
independent variables. Save your model in an object called glm1 and print a
summary of this object.
glm1 <- glm(response ~ monthdur + amount + age + foreign + status + instal,
            family = "binomial", data = traind3)
summary(glm1)
##
## Call:
## glm(formula = response ~ monthdur + amount + age + foreign +
## status + instal, family = "binomial", data = traind3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.0288 -1.1384 0.6658 0.8661 1.6769
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 7.504e-01 5.159e-01 1.455 0.14580
## monthdur -2.852e-02 8.961e-03 -3.183 0.00146 **
## amount -9.363e-05 3.992e-05 -2.345 0.01902 *
## age 1.750e-02 7.720e-03 2.266 0.02343 *
## foreign2 1.089e+00 6.538e-01 1.666 0.09578 .
## status2 4.867e-01 3.737e-01 1.303 0.19272
## status3 1.078e+00 3.640e-01 2.962 0.00306 **
## status4 7.624e-01 4.355e-01 1.751 0.08000 .
## instal2 -1.160e-01 2.949e-01 -0.393 0.69396
## instal3 -4.044e-01 3.180e-01 -1.272 0.20347
## instal4 -7.185e-01 2.822e-01 -2.546 0.01089 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 985.71 on 799 degrees of freedom
## Residual deviance: 910.15 on 789 degrees of freedom
## AIC: 932.15
##
## Number of Fisher Scoring iterations: 4
Question 3 [1 pt] [28/35]
Use the appropriate function (named c…) to calculate a 95% confidence interval of
all coefficients of the model glm1.
confint(glm1)
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) -0.2559752443 1.771656e+00
## monthdur -0.0462304170 -1.103324e-02
## amount -0.0001725398 -1.557135e-05
## age 0.0025894265 3.289699e-02
## foreign2 -0.0464860985 2.596364e+00
## status2 -0.2541378849 1.218676e+00
## status3 0.3564693354 1.791971e+00
## status4 -0.0922730255 1.621710e+00
## instal2 -0.7017995096 4.573539e-01
## instal3 -1.0342685258 2.152645e-01
## instal4 -1.2850982984 -1.760707e-01
Question 4 [1 pt] [29/35]
Repeat question 2, but use only the predictor instal. Name this model glma and print
the model summary.
glma = glm(response ~ instal, family = "binomial", data = traind3)
summary(glma)
##
## Call:
## glm(formula = response ~ instal, family = "binomial", data = traind3)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6704 -1.4643 0.8150 0.9155 0.9155
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.1104 0.2179 5.096 3.47e-07 ***
## instal2 -0.1789 0.2726 -0.656 0.5116
## instal3 -0.2047 0.2906 -0.705 0.4811
## instal4 -0.4575 0.2438 -1.876 0.0606 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 985.71 on 799 degrees of freedom
## Residual deviance: 980.89 on 796 degrees of freedom
## AIC: 988.89
##
## Number of Fisher Scoring iterations: 4
Question 5 [2 pts] [31/35]
Now, pipe glma to the predict function to calculate the predicted probabilities of the
response. Save these probabilities in an object called predp.
Use the function ifelse and the predp object to calculate the predicted classes. Save
them in an object called predc.
Finally, compute (and print) the accuracy of the model on the training set. You can
do this by applying the mean function to a properly defined boolean expression that
compares predc to the true values of the response in the training set.
predp = glma %>% predict(type = "response")
predc = ifelse(predp > 0.5, 1, 0)
mean(predc==traind3$response)
## [1] 0.69375
What is the accuracy of prediction on the training set? (report in format 0.XXXX e.g
56.89% accuracy would be reported as 0.5689)
0.6938
Question 6 [2 pts] [33/35]
Compute and print the estimated Odds ratios for the glma model and their 95%
Confidence interval.
## odds ratios and 95% CI
exp(coef(glma))
## (Intercept) instal2 instal3 instal4
## 3.0357143 0.8361991 0.8148607 0.6328856
exp(confint(glma, level = 0.95))
## Waiting for profiling to be done...
## 2.5 % 97.5 %
## (Intercept) 2.0077743 4.733625
## instal2 0.4858800 1.418550
## instal3 0.4583199 1.436416
## instal4 0.3872392 1.010111
Question 7 [2 pts] [35/35]
Repeat question 5, but use a model that contains all the predictors.
glmall <- glm(response ~., family = "binomial", data = traind3)
predp = predict(glmall, newdata = traind3)
predc = ifelse(predp > 0.5, 1, 0)
mean(predc==traind3$response)
## [1] 0.775
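One caveat (an aside): without type = "response", predict returns values on the log-odds scale, so the 0.5 cutoff above is not the same rule as in Question 5. A probability-scale variant would be the sketch below (predp_all and predc_all are illustrative names; the resulting accuracy may differ slightly from the value printed above):
# same computation, but thresholding fitted probabilities as in Question 5
predp_all <- predict(glmall, type = "response")
predc_all <- ifelse(predp_all > 0.5, 1, 0)
mean(predc_all == traind3$response)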
What is the accuracy now? (report in format 0.XXXX e.g 56.89% accuracy would be
reported as 0.5689)
0.7750
Is the model less or more accurate compared to the univariate model?
More accurate.