---
title: Environmental Systems Data Science HS21
output:
  html_document:
    number_sections: true
    pandoc_args: --number-offset=7
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Application 1: Variable selection
## Introduction
In Chapter 7, we noted that the coefficient of determination $R^2$ may
increase even when uninformative predictors are added to a model. In other
words, the model ascribes predictive power to an uninformative predictor,
misguided by its (random) correlation with the target variable. Often, we start
out formulating models without knowing beforehand which predictors should be
considered, and we are tempted to use them all because the full model will
always yield the best $R^2$. In such cases, we're prone to building
overconfident models that perform well on the training set, but poorly
when predicting on new data.
In this application session, we'll implement an algorithm that sequentially
searches for the best additional predictor to include in our model, starting
from a single one. This is called *stepwise-forward* regression (see definition
below). There is also *stepwise-backward* regression, where predictors are
sequentially removed from a model that initially includes them all. The challenge is
that we often lack the means to confidently assess generalisability. The
effect of spuriously increasing $R^2$ by adding uninformative predictors can
be mitigated, as we noted in Chapter 7, by considering alternative metrics
that penalize the number of predictors in a model. They balance the trade-off
between model complexity (the number of variables, in the linear regression case)
and goodness of fit. Such metrics include the *adjusted-*$R^2$, the Akaike
Information Criterion (AIC), and the Bayesian Information Criterion (BIC). In
cases where sufficient data are available, cross-validation can also be used to
assess the generalisability of alternative models. Here, we'll assess how
these different metrics behave for a sequence of linear regression models
with an increasing number of predictors. You'll learn to write code that
implements an algorithm determining the order in which variables enter the
model, starting from one and going up to fourteen predictors. You'll write your
own stepwise-forward regression code.
Let's get started!
## Application
### Warm-up 1: Nested for-loop
Given a matrix A and a vector B (see below), do the following tasks:
- Replace the missing values (`NA`) in the first row of A by the largest value of
B. After using that element of B for imputing A, drop that element from the
vector B and proceed with imputing the second row of A, using the (now)
largest value of the updated vector B, and drop that element from B after
using it for imputing A. Repeat the same procedure for all four rows in A.
- After imputing (replacing) in each step, calculate the mean of the remaining
values in B and record it as a single-row data frame with two columns
`row_number` and `avg`, where `row_number` is the row number of A where
the value was imputed, and `avg` is the mean of remaining values in B. As
the algorithm proceeds through rows in A, sequentially bind the single-row
data frame together so that after completion of the algorithm, the data frame
contains four rows (corresponding to the number of rows in A).
```{r}
A <- matrix(c(6, 7, 3, NA, 15, 6, 7,
8, 9, 12, 6, 11, NA, 3,
9, 4, 7, 3, 21, NA, 6,
7, 19, 6, NA, 15, 8, 10),
nrow = 4, byrow = TRUE)
B <- c(8, 4, 12, 9, 15, 6)
```
Before implementing these tasks, try to write down pseudo-code. This is
code-like text that may not be executable, but it describes the structure of real
code and details where and how the major steps are implemented. Next, you'll
need to write actual R code. For this, you will need to find answers to the
following questions:
+ How to go through each element of a matrix?
+ How to detect an `NA` value?
+ How to drop an element of a given value from a vector?
+ How to add a row to an existing data frame?
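As a sketch of these building blocks in base R (on a toy 2×2 matrix and a 3-element vector, not the exercise data), the impute-then-drop loop could look like this:

```r
## toy inputs, standing in for the exercise's A and B
A <- matrix(c(1, NA,
              3, NA), nrow = 2, byrow = TRUE)
B <- c(5, 9, 4)

df <- data.frame()
for (i in seq_len(nrow(A))) {
  A[i, is.na(A[i, ])] <- max(B)  # detect NAs in row i and impute with the largest value of B
  B <- B[-which.max(B)]          # drop that element from B
  df <- rbind(df, data.frame(row_number = i, avg = mean(B)))  # append a one-row data frame
}
df
```

`dplyr::bind_rows()` can be used in place of base `rbind()`; both stack single-row data frames.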
**Solution**:
```{r message=FALSE}
## write your code here
```
### Warm-up 2: Find the best single predictor
**The math behind forward stepwise regression:**
1. Let $\mathcal{M}_0$ denote the null model, which contains no predictors.
2. For $k = 0, \ldots, p-1$:
    (a) Consider all $p - k$ models that augment $\mathcal{M}_k$ with one additional predictor.
    (b) Choose the best model among these $p - k$ models, and call it $\mathcal{M}_{k+1}$. Here, _best_ is defined as having the highest $R^2$.
3. Select a single best model from among $\mathcal{M}_0, \ldots, \mathcal{M}_p$ using cross-validated prediction error, AIC, BIC, or adjusted $R^2$.
The first step of a stepwise forward regression is to find the single most
powerful predictor in a univariate linear regression model for the target
variable `GPP_NT_VUT_REF` among all fourteen available predictors in our
data set (all except those of type `date` or `character`). Implement this first
part of the search, using the definition of the stepwise-forward algorithm
above. Remove all rows with at least one missing value before starting the
predictor search.
- Which predictor achieves the highest $R^2$?
- What is its $R^2$ value?
- Visualise the $R^2$ of all univariate models, ordered by their respective $R^2$ values.
- Do you notice a particular pattern? Which variables yield similar $R^2$? How do you expect them to be included in the multivariate models of the subsequent steps of the stepwise forward regression?
_Hints_:
+ Model structure:

- The "counter" variables in the for loop can be provided as a vector, and
the counter will sequentially take on the value of each element in that vector.
For example: `for (var in all_predictors){ ... }`.
+ Algorithm:
- To record $R^2$ values for the different models, you may start by creating
an empty vector (`vec <- c()`) before the loop and then sequentially add
elements to that vector inside the loop (`vec <- c(vec, new_element)`).
Alternatively, you can do something similar, but with a data frame (initialising
with `df_rsq <- data.frame()` before the loop, and adding rows by `df_rsq <-
bind_rows(df_rsq, data.frame(pred = predictor_name, rsq = rsq_result))`
inside the loop).
  - A clever way to construct formulas dynamically is described, for example, in [this stackoverflow post](https://stackoverflow.com/questions/4951442/formula-with-dynamic-number-of-variables).

+ Value retrieving:
  - Extract the $R^2$ from the linear model object: `summary(fit_lin)[["r.squared"]]`
+ Visualising:
  - Search yourself for solutions for how to change the order of the levels to be plotted.
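Combining these hints, the single-predictor search might look like the following sketch. The data here are synthetic stand-ins; in the exercise, the target is `GPP_NT_VUT_REF` and `all_predictors` holds the fourteen predictor names:

```r
set.seed(42)
## synthetic data: x1 is informative, x2 is pure noise
df <- data.frame(y = rnorm(100))
df$x1 <- df$y + rnorm(100)
df$x2 <- rnorm(100)

all_predictors <- c("x1", "x2")
df_rsq <- data.frame()

for (pred in all_predictors) {
  ## construct the formula "y ~ <pred>" dynamically
  fit_lin <- lm(as.formula(paste("y ~", pred)), data = df)
  df_rsq <- rbind(df_rsq,
                  data.frame(pred = pred,
                             rsq = summary(fit_lin)[["r.squared"]]))
}

df_rsq <- df_rsq[order(-df_rsq$rsq), ]  # order by decreasing R2
df_rsq$pred[1]                          # best single predictor ("x1" here)
```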

```{r message=FALSE, warning=FALSE}
## write your code here ##
## read the CSV file
## determine the predictors (should be 14) and the target
## fit a linear regression of the target vs. each predictor and extract the R2 (should be 14 values)
## print the best single predictor and its R2
```
```{r}
## Plot the R2 corresponding to each predictor (think about how to change the order of levels)
```
### Full stepwise regression
Now, we take it to the next level and implement a full stepwise forward
regression as described above. For each step (number of predictors $k$),
record the following metrics: $R^2$, *adjusted-*$R^2$, the Akaike Information
Criterion (AIC), Bayesian Information Criterion (BIC), and the 5-fold cross-
validation $R^2$ and RMSE.
- Write pseudo-code for how you plan to implement the algorithm first.
- Implement the algorithm in R, run it and display the order in which predictors
enter the model.
- Display a table with the metrics of all $k$ steps, and the single variable added at each step.
_Hints_:
+ Model structure:

  - Recall what you learned in the breakout session; you may apply the same idea to this task. Think of the blueprint (*pseudo-code*) first: How do you go through the different models in each forward step? How do you store the predictors added to the model, and how do you update the candidate predictors?
+ Algorithm:
  - A complication is that the set of predictors is sequentially complemented at each step of the search through $k$. You may again use `vec <- c()` to create an empty vector, and then add elements to it by `vec <- c(vec, new_element)`.
- It may be helpful to explicitly define a set of "candidate predictors" that
may potentially be added to the model as a vector (e.g., `preds_candidate`),
and define predictors retained in the model from the previous step in a
separate vector (e.g., `preds_retained`). In each step, search through
`preds_candidate`, select the best predictor, add it to `preds_retained` and
remove it from `preds_candidate`.
- At each step, record the metrics and store them in a data frame for later
plots. As in the first "warm-up" exercise, you may record metrics at each step
as a single-row data frame and sequentially stack (bind) them together.
  - (As above) A clever way to construct formulas dynamically is described, for example, in [this stackoverflow post](https://stackoverflow.com/questions/4951442/formula-with-dynamic-number-of-variables).

  - The metrics for the $k$ models are assessed *after* the order of the added variables is determined. To determine the metrics, the $k$ models can be saved by constructing a list and sequentially adding elements to it (`mylist[[ name_new_element ]] <- new_element`). Alternatively, you can fit the model again after determining which predictor worked best.

  - Your code will most certainly have bugs at first. To debug efficiently, write your code first in a simple R script and use the debugging options in RStudio (see [here](https://support.rstudio.com/hc/en-us/articles/205612627-Debugging-with-RStudio)).


+ Value retrieving
- To get AIC and BIC values for a given model, use the base-R functions
`AIC()` and `BIC()`.

- To get the cross-validated $R^2$ and RMSE, use the caret function
`train()` with RMSE as the loss function, and `method = "lm"` (to fit a linear
regression model). Then extract the values by
`trained_model$results$Rsquared` and `trained_model$results$RMSE`.

+ Displaying:
  - To display a table nicely as part of the RMarkdown HTML output, use the function `knitr::kable()`.

  - To avoid reordering of the list of variable names when plotting, convert the variable names from "character" to "factor" with `pred <- factor(pred, levels = pred)`.
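Taken together, the hints above suggest a skeleton like the following (again with synthetic stand-in data; the caret-based cross-validation metrics are omitted for brevity and would be recorded alongside AIC and BIC):

```r
set.seed(42)
## synthetic data: y depends on x1 and x3, x2 is pure noise
df <- data.frame(y = rnorm(50), x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
df$y <- df$y + 2 * df$x1 + df$x3

preds_candidate <- c("x1", "x2", "x3")
preds_retained <- c()
df_metrics <- data.frame()

n_preds <- length(preds_candidate)
for (k in seq_len(n_preds)) {
  ## inner loop: find the candidate that yields the highest R2 when added
  rsq_best <- -Inf
  pred_best <- NA
  for (pred in preds_candidate) {
    forml <- as.formula(paste("y ~", paste(c(preds_retained, pred), collapse = " + ")))
    rsq <- summary(lm(forml, data = df))[["r.squared"]]
    if (rsq > rsq_best) {
      rsq_best <- rsq
      pred_best <- pred
    }
  }
  ## move the winner from the candidates to the retained predictors
  preds_retained <- c(preds_retained, pred_best)
  preds_candidate <- setdiff(preds_candidate, pred_best)
  ## re-fit the best model of step k and record its metrics
  fit_best <- lm(as.formula(paste("y ~", paste(preds_retained, collapse = " + "))), data = df)
  df_metrics <- rbind(df_metrics,
                      data.frame(k = k,
                                 pred = pred_best,
                                 rsq = summary(fit_best)[["r.squared"]],
                                 adj_rsq = summary(fit_best)[["adj.r.squared"]],
                                 aic = AIC(fit_best),
                                 bic = BIC(fit_best)))
}
preds_retained  # order in which the predictors entered the model
```

Note that $R^2$ is non-decreasing across the recorded steps because the models are nested, while AIC and BIC may start to rise once uninformative predictors enter.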






```{r message=FALSE}
## determine the predictors (should be 14) and the target
## initialize the "candidate predictors" and the predictors retained in the model from the previous step
## create an empty vector or list for the metrics before the loop

## Implement your algorithm ##
## You have to create two loops: an outer one over the number of predictors k, and an inner one over each single additional predictor
## The inner loop determines the best additional predictor w.r.t. R2, to be added to the retained predictors (as in the warm-up example)
## The outer loop handles the addition of the best predictor to the retained predictors
## Every time a new predictor is added to the retained ones, get the cross-validated R2 and RMSE, and record the AIC, BIC, and adjusted-R2 of the respective model
## At the end, print the order in which the variables enter the model
```
- Visualise all metrics as a function of the number of predictors (adding labels with the variable name of the predictor added at each step). Highlight the best-performing model according to each metric. How many predictors does the best-performing model contain, when assessed by each metric?
```{r eval = FALSE}
## write your code here ##
```
### Bonus: Stepwise regression out-of-the-box

In R, you can also run the above variable-selection procedure
automatically. `regsubsets()` from the leaps package provides a convenient way
to do so. It performs model selection by exhaustive search by default, and also
supports forward and backward stepwise regression.
- Do a stepwise forward regression using `regsubsets()`.
- Create a data frame with the same metrics as above (except the cross-validated $R^2$).
- Visualise the metrics as above.
- Are your results consistent?

_Hints_:
- Specify stepwise *forward* by setting `method = "forward"`.
- Specify the number of predictors to examine (that is, all fourteen) by setting `nvmax = 14`.
- Mallows' $C_p$ values (a close analogue of AIC) for each step are stored in `summary(regfit.fwd)$cp`, and BIC values for each step are in `summary(regfit.fwd)$bic`, etc.
- Get the order in which the predictors are added (which corresponds to the values returned by `summary(regfit.fwd)$bic`) via `all_predictors[regfit.fwd$vorder[2:15] - 1]`. `vorder` is the order in which the variables enter the model. Note that the intercept is counted as the first to enter and should be removed.
- To avoid reordering of the list of variable names when plotting, convert the variable names from "character" to "factor" with `preds_enter <- factor(preds_enter, levels = preds_enter)`.
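A minimal sketch of these hints (synthetic data with three predictors, so `nvmax = 3` and `vorder[2:4]` here; with the exercise data, use fourteen predictors and `vorder[2:15]`):

```r
library(leaps)

set.seed(42)
df <- data.frame(y = rnorm(50), x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
df$y <- df$y + 2 * df$x1 + df$x3
all_predictors <- c("x1", "x2", "x3")

regfit.fwd <- regsubsets(y ~ ., data = df, method = "forward", nvmax = 3)

## order in which the predictors enter; vorder counts the intercept first, so drop it
preds_enter <- all_predictors[regfit.fwd$vorder[2:4] - 1]

## metrics per step from the regsubsets summary
df_metrics <- data.frame(k = 1:3,
                         pred = preds_enter,
                         adj_rsq = summary(regfit.fwd)$adjr2,
                         cp = summary(regfit.fwd)$cp,
                         bic = summary(regfit.fwd)$bic)
```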
```{r}
# get variables in their order added to the model
## create metrics data frame
## visualize metrics
```