GR5058 Final Practice Problems
Ben Goodrich
Answers will be available December 16, 2021
1 Text
Execute
data("constitution", package = "qss")
Each row of constitution contains the (English translation of the) text of the preamble to a constitution,
written by that country in that year, for a total of 155 preambles.
• Create a country_year variable inside the constitution data.frame that combines the country and
year variables so that it uniquely identifies each observation (since some countries rewrite their
constitutions over time).
• Use the functions in the tidytext R package to make a “tidy” data.frame from the information in
constitution in which the word is the unit of analysis
• Eliminate English “stop words” from the “tidy” data.frame to form a new data.frame
• How many words are left once the “stop words” have been removed from these Constitution preambles
and how many of those are unique?
• Which five words have the largest weight under the term frequency-inverse document frequency (tf-idf) metric?
• Use functions in the dplyr package to create a tibble that counts up all of the times that a word appears
in each constitution’s preamble
• Use the cast_dtm function to create a document-term matrix
• Apply K-means clustering to this document-term matrix for some value of K. Which constitutions
appear in each cluster and how would you interpret those clusters? (A sketch of one possible workflow
for this section follows the list.)
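Below is a minimal sketch of one way to approach the first several bullets. It assumes the text column in
constitution is named preamble and that the country and year columns are named country and year; check
names(constitution) and adjust as needed.
library(dplyr)
library(tidytext)

# combine country and year into an identifier that is unique per preamble
constitution <- mutate(constitution, country_year = paste(country, year, sep = "_"))

# one row per word per preamble
tidy_words <- unnest_tokens(constitution, output = word, input = preamble)

# drop English stop words using the stop_words lexicons bundled with tidytext
no_stops <- anti_join(tidy_words, stop_words, by = "word")

nrow(no_stops)             # how many words remain
n_distinct(no_stops$word)  # how many of those are unique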
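Continuing from that sketch, word counts per document feed both the tf-idf bullet and the document-term
matrix. bind_tf_idf() needs the term, the document identifier, and the count:
# times each word appears in each preamble
word_counts <- count(no_stops, country_year, word, sort = TRUE)

# tf-idf weights; the five largest answer the earlier bullet
tfidf <- bind_tf_idf(word_counts, term = word, document = country_year, n = n)
head(arrange(tfidf, desc(tf_idf)), 5)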
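Finally, cast_dtm() turns those counts into a tm-style document-term matrix, and kmeans() can be applied
to its dense form. K = 5 and the seed are arbitrary choices here, not recommendations, and interpreting
the clusters is left to the reader:
library(tm)  # cast_dtm() produces a tm::DocumentTermMatrix

dtm <- cast_dtm(word_counts, document = country_year, term = word, value = n)

set.seed(20211216)  # arbitrary seed
km <- kmeans(as.matrix(dtm), centers = 5, nstart = 25)
split(names(km$cluster), km$cluster)  # which constitutions fall in each cluster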
2 Classification
Execute
data("OJ", package = "ISLR2")
• Split the data into training and testing, stratifying on Purchase, which is the outcome variable (brand
of orange juice purchased)
• Estimate an appropriate but unpenalized logistic regression for this outcome in the training data
• Use Linear Discriminant Analysis to estimate a model with the same or similar set of predictors
• Use elastic net logistic regression to estimate a model with the same or similar set of predictors
• Which of the above models classifies best in the testing data if overall accuracy is the criterion? (One
possible approach to this section’s tasks is sketched below.)
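One possible way to set up this section, using rsample for the stratified split; the predictors shown
(LoyalCH, PriceDiff, SpecialCH, SpecialMM) are an illustrative choice rather than the required specification:
library(rsample)
data("OJ", package = "ISLR2")

set.seed(20211216)  # arbitrary seed
OJ_split <- initial_split(OJ, prop = 0.8, strata = Purchase)
OJ_train <- training(OJ_split)
OJ_test  <- testing(OJ_split)

# unpenalized logistic regression; Purchase is a factor with levels CH and MM,
# so the fitted probabilities are Pr(Purchase == "MM")
logit <- glm(Purchase ~ LoyalCH + PriceDiff + SpecialCH + SpecialMM,
             data = OJ_train, family = binomial)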
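LDA is available through MASS::lda(), and an elastic net logistic regression through parsnip’s glmnet
engine; the penalty and mixture values below are placeholders that would ordinarily be tuned:
library(MASS)  # note that MASS::select() masks dplyr::select() if both are attached
lda_fit <- lda(Purchase ~ LoyalCH + PriceDiff + SpecialCH + SpecialMM, data = OJ_train)

library(tidymodels)
enet <- logistic_reg(penalty = 0.01, mixture = 0.5) %>%  # placeholder values, not tuned
  set_engine("glmnet") %>%
  fit(Purchase ~ LoyalCH + PriceDiff + SpecialCH + SpecialMM, data = OJ_train)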
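Overall accuracy in the testing data can then be compared directly across the three models; which one
wins is not asserted here:
p_logit <- predict(logit, newdata = OJ_test, type = "response")
mean(ifelse(p_logit > 0.5, "MM", "CH") == OJ_test$Purchase)              # logistic regression
mean(predict(lda_fit, newdata = OJ_test)$class == OJ_test$Purchase)      # LDA
mean(predict(enet, new_data = OJ_test)$.pred_class == OJ_test$Purchase)  # elastic net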
3 Neural Network Regression
In a blog post at
https://matloff.wordpress.com/2018/06/20/neural-networks-are-essentially-polynomial-regression/
the author links to a presentation and a paper that support their contention that “neural networks are
essentially polynomial regression”, i.e. regression models with typically multivariate, sometimes high-degree
polynomials as predictors.
Execute
data(College, package = "ISLR2")
• Split the data into training and testing
• Use the tidymodels framework to estimate a neural network model using the keras engine, but otherwise
with the default values of the tuning parameters
• Use OLS with step_interact and step_poly in the recipe to perform a regression with many polynomials,
as suggested in the blog post
• Is there any (maximum) polynomial degree such that the polynomial regression predicts better in the
training data than the previously estimated neural network if root mean-squared error is the criterion?
(A sketch of one possible approach follows.)
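A sketch of the neural network piece, using parsnip’s mlp() with the keras engine (which requires a working
keras / TensorFlow installation). The problem does not name an outcome variable, so Outstate (out-of-state
tuition) is used below purely as an example:
library(tidymodels)
data(College, package = "ISLR2")

set.seed(20211216)  # arbitrary seed
College_split <- initial_split(College, prop = 0.8)
College_train <- training(College_split)
College_test  <- testing(College_split)

# dummy-code the one factor predictor and put the numeric predictors on a common scale
nn_rec <- recipe(Outstate ~ ., data = College_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_normalize(all_numeric_predictors())

nn_fit <- workflow() %>%
  add_recipe(nn_rec) %>%
  add_model(mlp(mode = "regression") %>% set_engine("keras")) %>%  # default tuning parameters
  fit(data = College_train)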
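For the polynomial regression, step_poly() and step_interact() expand the design matrix before an ordinary
lm() fit. The degree and the pair of selectors interacted below are illustrative; broadening the selectors
(or raising the degree) produces the “many polynomials” the blog post has in mind:
poly_rec <- recipe(Outstate ~ ., data = College_train) %>%
  step_poly(all_numeric_predictors(), degree = 2) %>%         # vary the (maximum) degree here
  step_dummy(all_nominal_predictors()) %>%
  step_interact(~ starts_with("Top"):starts_with("Expend"))   # illustrative cross-products only

ols_fit <- workflow() %>%
  add_recipe(poly_rec) %>%
  add_model(linear_reg() %>% set_engine("lm")) %>%
  fit(data = College_train)

# root mean-squared error in the training data, as the question asks
bind_cols(College_train, predict(nn_fit, new_data = College_train)) %>%
  rmse(truth = Outstate, estimate = .pred)
bind_cols(College_train, predict(ols_fit, new_data = College_train)) %>%
  rmse(truth = Outstate, estimate = .pred)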
