Programming assignment sample: WS 2021/2022
Date: 2022-01-29
Introduction to Machine Learning Exercise sheet 1
https://slds-lmu.github.io/i2ml/ WS 2021/2022
Exercise 1: Car Price Prediction
Imagine you work at a second-hand car dealer and are tasked with finding for-sale vehicles your company can
acquire at a reasonable price. You decide to address this challenge in a data-driven manner and develop a model
that predicts adequate market prices (in EUR) from vehicles’ properties.
a) Characterize the task at hand: supervised or unsupervised? Regression or classification? Learning to explain
or learning to predict? Justify your answers.
b) How would you set up your data? Name potential features along with their respective data type and state the
target variable.
c) Assume now that you have data on vehicles’ age (days), mileage (km), and price (EUR). Explicitly define the
feature space X and target space Y.
d) You choose to use a linear model (LM) for this task. For this, you assume the targets to be conditionally
independent given the features, i.e., $y^{(i)} \mid x^{(i)} \perp y^{(j)} \mid x^{(j)}$ for all $i, j \in \{1, 2, \dots, n\}$, $i \neq j$, with sample size $n$. The
LM models the target as a linear function of the features with Gaussian error term: $y = X\theta + \epsilon$,
$\epsilon \sim N(0, \mathrm{diag}(\sigma^2))$, $\sigma > 0$.
State the hypothesis space for the corresponding model class. For this, assume the parameter vector $\theta$ to include
the intercept coefficient.
e) Which parameters need to be learned? Define the corresponding parameter space $\Theta$.
f) State the loss function for the i-th observation using L2 loss.
g) In classical statistics, you would estimate the parameters via maximum likelihood estimation (MLE). The
likelihood for the LM is given by:
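$$\mathcal{L}(\theta \mid x) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{1}{2\sigma^2} \left( y^{(i)} - \theta^T x^{(i)} \right)^2 \right)$$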
Describe how you can make use of the likelihood in empirical risk minimization (ERM) and write down the
resulting empirical risk.
h) Now you need to optimize this risk to find the best parameters, and hence the best model, via empirical risk
minimization. State the optimization problem formally and list the necessary steps to solve it.
Congratulations, you just designed your first machine learning project!
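The link between the Gaussian likelihood in g) and L2 loss can be sketched as follows. This is a minimal derivation under the assumption that $\sigma^2$ is treated as a fixed constant, so that only $\theta$ is optimized; the symbol $\mathcal{R}_{\text{emp}}$ for the empirical risk is notation introduced here for illustration.

```latex
\begin{aligned}
-\log \mathcal{L}(\theta \mid x)
  &= -\sum_{i=1}^{n} \log\left( \frac{1}{\sqrt{2\pi\sigma^2}}
     \exp\left( -\frac{1}{2\sigma^2} \left( y^{(i)} - \theta^T x^{(i)} \right)^2 \right) \right) \\
  &= \frac{n}{2} \log\left( 2\pi\sigma^2 \right)
     + \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( y^{(i)} - \theta^T x^{(i)} \right)^2 \\
\Rightarrow \quad
\arg\max_{\theta} \mathcal{L}(\theta \mid x)
  &= \arg\min_{\theta} \sum_{i=1}^{n} \left( y^{(i)} - \theta^T x^{(i)} \right)^2
   = \arg\min_{\theta} \mathcal{R}_{\text{emp}}(\theta)
\end{aligned}
```

The first term and the factor $\frac{1}{2\sigma^2}$ do not depend on $\theta$, which is why they can be dropped without changing the minimizer.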
Introduction to Machine Learning Exercise sheet 2
Exercise 1: HRO in mlr3
Throughout the lecture, we will frequently use the R package mlr3 and its descendants, providing an integrated
ecosystem for all common machine learning tasks. Let’s recap the HRO principle and see how it is reflected in
mlr3. An overview of the most important objects and their usage, illustrated with numerous examples, can be
found at https://mlr3book.mlr-org.com/basics.html.
a) How are the key concepts (i.e., hypothesis space, risk and optimization) you learned about in the lecture videos
implemented in mlr3?
b) Have a look at mlr3::tsk("iris"). What attributes does this task object store?
c) Pick an mlr3 learner of your choice. What are the different settings for this learner?
(Hint: use mlr3::mlr_learners$keys() to see all available learners.)
Exercise 2: Loss Functions for Regression Tasks
In this exercise, we will examine loss functions for regression tasks somewhat more in depth.
[Figure: scatter plot of y against x for a linear regression task]
a) Consider the above linear regression task. How will the model parameters be affected by adding the new outlier
point (orange) if you use
i) L1 loss
ii) L2 loss
in the empirical risk? (You do not need to actually compute the parameter values.)
[Figure: second plot, y against x]
b) The second plot visualizes another loss function popular in regression tasks, the so-called Huber loss (depending
on $\epsilon > 0$; here: $\epsilon = 5$). Describe how the Huber loss deals with residuals as compared to L1 and L2 loss.
Can you guess its definition?
c) Derive the least-squares estimator, i.e., the solution to the linear model when using L2 loss, analytically via
$\hat{\theta} = \arg\min_{\theta \in \Theta} \| y - X\theta \|_2^2$.
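An analytical result for c) can be checked numerically. The sketch below assumes the well-known closed form that solves the normal equations $X^T X \theta = X^T y$ and compares it with lm() on simulated data; the data, the true coefficients and all variable names are hypothetical.

```r
# Simulated data (hypothetical): intercept 2, slope 0.5, Gaussian noise
set.seed(1)
n <- 100
x <- runif(n, 0, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 0.3)

# Design matrix with a leading column of ones for the intercept
X <- cbind(1, x)

# Solve the normal equations (X^T X) theta = X^T y without forming an explicit inverse
theta_hat <- solve(crossprod(X), crossprod(X, y))

# Reference fit from lm() for comparison
theta_lm <- coef(lm(y ~ x))
print(cbind(normal_equations = drop(theta_hat), lm = theta_lm))
```

Using solve() with two arguments avoids inverting $X^T X$ explicitly, which is usually the more stable choice.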
Exercise 3: Polynomial Regression
Assume the following (noisy) data-generating process from which we have observed 50 realizations:
$y = -3 + 5 \cdot \sin(0.4 \pi x) + \epsilon$
with $\epsilon \sim N(0, 1)$.
[Figure: scatter plot of the observed data, y against x]
a) We decide to model the data with a cubic polynomial (including intercept term). State the corresponding
hypothesis space.
b) Demonstrate that this hypothesis space is simply a parameterized family of curves by plotting in R curves for
3 different models belonging to the considered model class.
c) State the empirical risk w.r.t. $\theta$ for a member of the hypothesis space. Use L2 loss and be as explicit as possible.
d) We can minimize this risk using gradient descent. In order to make this somewhat easier, we will denote the
transformed feature matrix, containing x to the power from 0 to 3, by $\tilde{X}$, such that we can express our model
by $\tilde{X}\theta$ (note that the model is still linear in its parameters, even if X has been transformed in a non-linear
manner!). Derive the gradient of the empirical risk w.r.t. $\theta$.
e) Using the result from d), state the calculation to update the current parameter $\theta^{[t]}$.
f) You will not be able to fit the data perfectly with a cubic polynomial. Describe the advantages and disadvantages
that a more flexible model class would have. Would you opt for a more flexible learner?
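The gradient descent procedure from d) and e) can be sketched in R as follows. This is a minimal illustration, assuming the empirical risk is the plain sum of squared residuals $\| y - \tilde{X}\theta \|_2^2$ with gradient $-2 \tilde{X}^T (y - \tilde{X}\theta)$. The simulated data, the x grid, the step size rule and the iteration count are assumptions chosen for illustration, not part of the exercise.

```r
# Simulated data loosely following the sinusoidal process from the exercise;
# the constants and the x grid are illustrative assumptions
set.seed(1)
n <- 50
x <- seq(-3, 3, length.out = n)
y <- -3 + 5 * sin(0.4 * pi * x) + rnorm(n)

# Transformed feature matrix containing x to the powers 0, 1, 2, 3
X_tilde <- cbind(1, x, x^2, x^3)

# Gradient of the sum of squared residuals with respect to theta
risk_gradient <- function(theta) {
  drop(-2 * crossprod(X_tilde, y - X_tilde %*% theta))
}

# Assumed step size: inverse of the largest eigenvalue of the Hessian 2 * X^T X,
# which keeps the iteration stable for this quadratic objective
alpha <- 1 / max(eigen(2 * crossprod(X_tilde), symmetric = TRUE)$values)

# Update rule: theta[t + 1] = theta[t] - alpha * gradient(theta[t])
theta <- rep(0, 4)
for (t in seq_len(20000)) {
  theta <- theta - alpha * risk_gradient(theta)
}

# Compare with the least-squares solution from lm()
theta_ls <- unname(coef(lm(y ~ x + I(x^2) + I(x^3))))
print(rbind(gradient_descent = unname(theta), least_squares = theta_ls))
```

A fixed, hand-picked step size would work as well, provided it is small enough; the eigenvalue-based choice is only a convenience for this sketch.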
Exercise 4: Predicting abalone
We want to predict the age of an abalone using its longest shell measurement and its weight.
See https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/ for more details.
url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data"
abalone <- read.table(url, sep = ",", row.names = NULL)
colnames(abalone) <- c(
"sex", "longest_shell", "diameter", "height", "whole_weight",
"shucked_weight", "visceral_weight", "shell_weight", "rings")
abalone <- abalone[, c("longest_shell", "whole_weight", "rings")]
a) Plot longest_shell and whole_weight on the x- and y-axis, respectively, and color points according to rings.
Using mlr3:
b) Create an mlr3 task for the abalone data.
c) Define a linear regression learner (for this you will need to load the mlr3learners extension package first)
and use it to train a linear model on the abalone data.
d) Compare the fitted and observed targets visually.
(Hint: use autoplot().)
e) Assess the model’s training loss in terms of MAE.
(Hint: losses are retrieved by calling score(), which accepts different mlr_measures, on the
prediction object.)
https://en.wikipedia.org/wiki/Abalone#/media/File:LivingAbalone.JPG
Introduction to Machine Learning Exercise sheet 3
Exercise 1: Logistic Regression Basics
a) What is the relationship between softmax
$$\pi_k(x \mid \theta) = \frac{\exp(\theta_k^\top x)}{\sum_{j=1}^{g} \exp(\theta_j^\top x)}, \quad k \in \{1, \dots, g\}$$
and the logistic function
$$\pi(x \mid \theta) = \frac{1}{1 + \exp(-\theta^T x)}$$
for g = 2 (binary classification)?
b) The likelihood function for a multinomially distributed target variable with g target classes is given by¹
$$L_i(\theta) = P(y^{(i)} \mid x^{(i)}, \theta_1, \theta_2, \dots, \theta_g) = \prod_{j=1}^{g} \pi_j\left( x^{(i)} \mid \theta \right)^{I(y^{(i)} = j)}$$
where the posterior class probabilities $\pi_1(x^{(i)} \mid \theta), \pi_2(x^{(i)} \mid \theta), \dots, \pi_g(x^{(i)} \mid \theta)$ are modeled with softmax
regression. Derive the likelihood function for n independent observations.
c) We have already addressed the connection that holds between maximum likelihood estimation and empirical
risk minimization. Transform the joint likelihood function into an empirical risk function.
Hints:
By following the maximum likelihood principle, we should look for parameters $\theta_1, \theta_2, \dots, \theta_g$ that maximize
the likelihood function.
The expressions $\prod L_i$ and $\log \prod L_i$, if defined, are maximized by the same parameters.
Minimizing a scalar function multiplied with -1 is equivalent to maximizing the original function.
State the associated risk function.
d) Write down the discriminant functions of multiclass logistic regression resulting from this minimization objective.
How do we arrive at the final prediction?
e) State the parameter space $\Theta$ and corresponding hypothesis space $\mathcal{H}$ for the multiclass case.
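The relationship asked for in a) can be explored numerically. The sketch below assumes the logistic function with a negative exponent, $1 / (1 + \exp(-s))$, and checks that for g = 2 the softmax probability of class 1 coincides with the logistic function evaluated at the score difference. The parameter vectors and the observation are hypothetical values.

```r
softmax <- function(scores) {
  # Subtract the maximum score for numerical stability (does not change the result)
  z <- exp(scores - max(scores))
  z / sum(z)
}

logistic <- function(s) 1 / (1 + exp(-s))

# Hypothetical parameter vectors for g = 2 classes and one hypothetical observation
theta_1 <- c(0.5, -1.2, 2.0)
theta_2 <- c(-0.3, 0.4, 1.1)
x <- c(1, 0.7, -1.5)

# Softmax over both class scores versus logistic function of the score difference
pi_softmax <- softmax(c(sum(theta_1 * x), sum(theta_2 * x)))
pi_logistic <- logistic(sum((theta_1 - theta_2) * x))

print(c(softmax_class_1 = pi_softmax[1], logistic = pi_logistic))
```

Only the difference of the two parameter vectors matters in the binary case, which is one way to see why a single parameter vector suffices for logistic regression.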
Exercise 2: Decision Boundaries & Thresholds in Logistic Regression
In logistic regression (binary case), we estimate the probability $P(y = 1 \mid x, \theta) = \pi(x \mid \theta)$. In order to decide about
the class of an observation, we set $\hat{y} = 1$ iff $\hat{\pi}(x \mid \theta) \geq \alpha$ for some $\alpha \in (0, 1)$.
a) Show that the decision boundary of the logistic classifier is a (linear!) hyperplane.
Hint: derive the value of $\theta^T x$ (depending on $\alpha$) starting from which you predict y = 1 rather than y = 0.
¹ While this might look somewhat complicated, it is actually just a very concise way to express the multinomial likelihood: for each
observation, all factors but the one corresponding to the true class $j'$ will be 1 (due to the 0 exponent), so the result is simply
$\pi_{j'}(x^{(i)} \mid \theta)$.
b) Below you see the logistic function for a binary classification problem with two input features for different values
$\theta = (\theta_1, \theta_2)$ (plots 1-3) as well as $\alpha$ (plot 4). What can you deduce for the values of $\theta_1$, $\theta_2$ and $\alpha$? What are
the implications for classification in the different scenarios?
[Figure: four panels, Plot (1), Plot (2), Plot (3) and Plot (4)]
c) Derive the equation for the decision boundary hyperplane if we choose $\alpha = 0.5$.
d) Explain when it might be sensible to set $\alpha$ to 0.5.
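The threshold on $\theta^T x$ mentioned in the hint to a) can be sketched as a chain of equivalences. This derivation assumes the logistic function $\hat{\pi}(x \mid \theta) = 1 / (1 + \exp(-\theta^T x))$ and the decision rule $\hat{y} = 1$ iff $\hat{\pi}(x \mid \theta) \geq \alpha$.

```latex
\begin{aligned}
\hat{\pi}(x \mid \theta) \geq \alpha
  &\iff \frac{1}{1 + \exp(-\theta^T x)} \geq \alpha \\
  &\iff 1 + \exp(-\theta^T x) \leq \frac{1}{\alpha} \\
  &\iff \exp(-\theta^T x) \leq \frac{1 - \alpha}{\alpha} \\
  &\iff \theta^T x \geq \log\left( \frac{\alpha}{1 - \alpha} \right)
\end{aligned}
```

The boundary $\theta^T x = \log(\alpha / (1 - \alpha))$ is linear in $x$, and for $\alpha = 0.5$ the right-hand side is 0.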
Introduction to Machine Learning Exercise sheet 4
Exercise 1: Naive Bayes
You are given the following table with the target variable Banana:
ID Color Form Origin Banana
1 yellow oblong imported yes
2 yellow round domestic no
3 yellow oblong imported no
4 brown oblong imported yes
5 brown round domestic no
6 green round imported yes
7 green oblong domestic no
8 red round imported no
a) We want to use a Naive Bayes classifier to predict whether a new fruit is a Banana or not. Estimate the posterior
probability $\hat{\pi}(x_*)$ for a new observation $x_* = (\text{yellow}, \text{round}, \text{imported})$. How would you classify the object?
b) Assume you have an additional feature Length that measures the length in cm. Describe in 1-2 sentences how
you would handle this numeric feature with Naive Bayes.
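The computation in a) can be sketched in base R. The code below assumes plain relative frequencies as estimates (no Laplace smoothing) and the usual Naive Bayes factorization into prior times per-feature conditional probabilities; the helper name class_score is hypothetical.

```r
# Table from the exercise
fruit <- data.frame(
  color  = c("yellow", "yellow", "yellow", "brown", "brown", "green", "green", "red"),
  form   = c("oblong", "round", "oblong", "oblong", "round", "round", "oblong", "round"),
  origin = c("imported", "domestic", "imported", "imported", "domestic", "imported",
             "domestic", "imported"),
  banana = c("yes", "no", "no", "yes", "no", "yes", "no", "no")
)
x_new <- list(color = "yellow", form = "round", origin = "imported")

# Unnormalized class score: prior times product of per-feature relative frequencies
class_score <- function(class) {
  in_class <- fruit[fruit$banana == class, ]
  prior <- nrow(in_class) / nrow(fruit)
  likelihoods <- sapply(names(x_new), function(f) mean(in_class[[f]] == x_new[[f]]))
  prior * prod(likelihoods)
}

# Normalize the scores so that they sum to one
scores <- sapply(c(yes = "yes", no = "no"), class_score)
posterior <- scores / sum(scores)
print(round(posterior, 3))
```

With unsmoothed frequencies, a feature value never observed within a class would force that class's score to zero, which is one reason smoothing is often added in practice.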
Exercise 2: Discriminant Analysis
[Figure: scatter plot of y against x]
The above plot shows $\mathcal{D} = \left( \left( x^{(1)}, y^{(1)} \right), \dots, \left( x^{(n)}, y^{(n)} \right) \right)$, a data set with n = 200 observations of a continuous
target variable y and a continuous, 1-dimensional feature variable x. In the following, we aim at predicting y with
a machine learning model that takes x as input.
a) To prepare the data for classification, we categorize the target variable y in 3 classes and call the transformed
target variable z, as follows:
$$z^{(i)} = \begin{cases} 1, & y^{(i)} \in (-\infty, 2.5] \\ 2, & y^{(i)} \in (2.5, 3.5] \\ 3, & y^{(i)} \in (3.5, \infty) \end{cases}$$
Now we can apply quadratic discriminant analysis (QDA):
i) Estimate the class means $\mu_k = E(x \mid z = k)$ for each of the three classes $k \in \{1, 2, 3\}$ visually from the plot.
Do not overcomplicate this, a rough estimate is sufficient here.
ii) Make a plot that visualizes the different estimated densities per class.
iii) How would your plot from ii) change if we used linear discriminant analysis (LDA) instead of QDA? Explain
your answer.
iv) Why is QDA preferable over LDA for this data?
b) Given are two new observations $x_{*1} = 10$ and $x_{*2} = 7$. State the prediction for QDA and explain how you
arrive there.
Exercise 3: Decision Boundaries for mlr3 Learners
We will now visualize how well different learners classify the three-class mlbench::mlbench.cassini data set.
Generate 1000 points from cassini, perturb the x.2 dimension with Gaussian noise (mean 0, standard deviation
0.5), and consider the classifiers already introduced in the lecture:
LDA,
QDA, and
Naive Bayes.
Plot the learners’ decision boundaries. Can you spot differences in separation ability?
(Note that logistic regression cannot handle more than two classes and is therefore not listed here.)
Introduction to Machine Learning Exercise sheet 5
Exercise 1: Evaluating regression learners
Imagine you work for a data science start-up and sell turn-key statistical models. Based on a set of training
data, you develop a regression model to predict a customer’s legal expenses from the average monthly number of
indictments brought against their firm.
a) Due to the financial sensitivity of the situation, you opt for a very flexible learner that fits the customer’s data
($n_{\text{train}} = 50$ observations) well, and end up with a degree-21 polynomial (blue, solid). Your colleague is skeptical
and argues for a much simpler linear learner (gray, dashed). Which of the models will have a lower empirical
risk if standard L2 loss is used?
[Figure: legal expenses in million EUR (y-axis) against average number of indictments per month (x-axis)]
b) Why might evaluation based on training error not be a good idea here?
c) Evaluate both learners on the following test data ($n_{\text{test}} = 10$), using
i) mean squared error (MSE), and
ii) mean absolute error (MAE).
State your performance assessment and explain potential di↵erences.
(Hint: use R if you don’t feel like computing a degree-21 polynomial regression by hand.)
set.seed(123)
x_train <- seq(10, 15, length.out = 50)
y_train <- 10 + 3 * sin(0.15 * pi * x_train) + rnorm(length(x_train), sd = 0.5)
data_train <- data.frame(x = x_train, y = y_train)
set.seed(321)
x_test <- seq(10, 15, length.out = 10)
y_test <- 10 + 3 * sin(0.15 * pi * x_test) + rnorm(length(x_test), sd = 0.5)
data_test <- data.frame(x = x_test, y = y_test)
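One possible way to carry out the comparison in c) with base R is sketched below. It assumes that the degree-21 polynomial is fitted with lm() and orthogonal polynomials from poly(), which is a choice made here for numerical stability and not prescribed by the exercise; the helper names mse, mae and evaluate are hypothetical.

```r
# Data-generating code copied from the exercise
set.seed(123)
x_train <- seq(10, 15, length.out = 50)
y_train <- 10 + 3 * sin(0.15 * pi * x_train) + rnorm(length(x_train), sd = 0.5)
data_train <- data.frame(x = x_train, y = y_train)
set.seed(321)
x_test <- seq(10, 15, length.out = 10)
y_test <- 10 + 3 * sin(0.15 * pi * x_test) + rnorm(length(x_test), sd = 0.5)
data_test <- data.frame(x = x_test, y = y_test)

# Simple linear learner and flexible degree-21 polynomial learner
fit_linear <- lm(y ~ x, data = data_train)
fit_poly <- lm(y ~ poly(x, 21), data = data_train)

mse <- function(truth, response) mean((truth - response)^2)
mae <- function(truth, response) mean(abs(truth - response))

# Compute both measures for a fitted model on a given data set
evaluate <- function(fit, data) {
  pred <- predict(fit, newdata = data)
  c(mse = mse(data$y, pred), mae = mae(data$y, pred))
}

results <- rbind(
  linear_train = evaluate(fit_linear, data_train),
  poly21_train = evaluate(fit_poly, data_train),
  linear_test  = evaluate(fit_linear, data_test),
  poly21_test  = evaluate(fit_poly, data_test)
)
print(round(results, 3))
```

Placing training and test errors side by side makes it easier to see how each learner's performance changes on unseen data.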
Exercise 2: Importance of train-test split
We consider the BostonHousing data for which we would like to predict the nitric oxides concentration (nox) from
the distance to a number of firms (dis).
library(mlbench)
data(BostonHousing)
data_pollution <- data.frame(dis = BostonHousing$dis, nox = BostonHousing$nox)
data_pollution <- data_pollution[order(data_pollution$dis), ]
head(data_pollution)
## dis nox
## 373 1.1296 0.668
## 375 1.1370 0.668
## 372 1.1691 0.631
## 374 1.1742 0.668
## 407 1.1781 0.659
## 371 1.2024 0.631
ggplot2::ggplot(data_pollution, ggplot2::aes(x = dis, y = nox)) +
  ggplot2::geom_point() +
  ggplot2::theme_classic()
[Figure: scatter plot of nox against dis]
a) Use the first ten observations as training data to compute a linear model with mlr3 and evaluate the performance
of your learner on the remaining data using MSE.
b) What might be disadvantageous about the train-test split in a)?
c) Now, sample your training observations from the data set at random. Use a share of 0.1 through 0.9, in 0.1
steps, of observations for training and repeat this procedure ten times. Afterwards, plot the resulting test errors
(in terms of MSE) in a suitable manner.
(Hint: rsmp is a convenient function for splitting data; you will want to choose the "holdout" strategy.
Afterwards, resample can be used to repeatedly fit the learner.)
d) Interpret the findings from c).
Introduction to Machine Learning Exercise sheet 6
Exercise 1: Overfitting & underfitting
Assume a polynomial regression model with a continuous target variable y and a continuous, p-dimensional feature
vector x and polynomials of degree d, i.e.,
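$$f\left( x^{(i)} \right) = \sum_{j=1}^{p} \sum_{k=0}^{d} \theta_{j,k} \left( x_j^{(i)} \right)^k,$$
and $y^{(i)} = f\left( x^{(i)} \right) + \epsilon^{(i)}$ where the $\epsilon^{(i)}$ are iid with $\mathrm{Var}(\epsilon^{(i)}) = \sigma^2 \ \forall i \in \{1, \dots, n\}$.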
a) For each of the following situations, indicate whether we would generally expect the performance of a flexible
polynomial learner (high d) to be better or worse than an inflexible one (low d). Justify your answer.
(i) The sample size n is extremely large, and the number of features p is small.
(ii) The number of features p is extremely large, and the number of observations n is small.
(iii) The true relationship between the features and the response is highly non-linear.
(iv) The variance of the error terms, $\sigma^2$, is extremely high.
b) Are overfitting and underfitting properties of a learner or of a fixed model? Explain your answer.
c) Should we aim to completely avoid both overfitting and underfitting?
Exercise 2: Resampling strategies
a) Why would we apply resampling rather than a single holdout split?
b) Using mlr3, classify the German credit data into solvent and insolvent debtors using logistic regression. Compute
the training error w.r.t. MCE.
c) In order to evaluate your learner, compare test MCE using
i) three times ten-fold cross validation (3x10-CV)
ii) 10x3-CV
iii) 3x10-CV with stratification for the feature foreign_worker to ensure equal representation in all folds
iv) a single holdout split with 90% training data
(Hint: you will need rsmp, resample and aggregate.)
d) Discuss and compare your findings from c) and compare them to the training error from b).
e) Would you consider LOO-CV to be a good alternative?
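The mechanics behind repeated cross-validation can be sketched in base R without mlr3. The example below uses simulated binary data rather than the German credit data, a probability threshold of 0.5 and a hypothetical helper cv_mce; it only illustrates how folds are formed and how errors are aggregated, and it does not implement stratification.

```r
set.seed(1)
n <- 300
# Hypothetical binary classification data
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, size = 1, prob = plogis(1 + 2 * dat$x1 - dat$x2))

# Misclassification error at an assumed probability threshold of 0.5
mce <- function(truth, prob) mean(truth != as.integer(prob >= 0.5))

# Repeated k-fold cross-validation: average test MCE over folds, then over repetitions
cv_mce <- function(data, folds, reps) {
  rep_errors <- replicate(reps, {
    # Randomly assign each observation to one of the folds
    fold_id <- sample(rep(seq_len(folds), length.out = nrow(data)))
    fold_errors <- sapply(seq_len(folds), function(k) {
      fit <- glm(y ~ x1 + x2, family = binomial, data = data[fold_id != k, ])
      test <- data[fold_id == k, ]
      mce(test$y, predict(fit, newdata = test, type = "response"))
    })
    mean(fold_errors)
  })
  mean(rep_errors)
}

fit_all <- glm(y ~ x1 + x2, family = binomial, data = dat)
train_mce <- mce(dat$y, predict(fit_all, type = "response"))
cv_3x10 <- cv_mce(dat, folds = 10, reps = 3)
cv_10x3 <- cv_mce(dat, folds = 3, reps = 10)
print(c(train = train_mce, cv_3x10 = cv_3x10, cv_10x3 = cv_10x3))
```

In mlr3 the same steps are handled by the functions named in the hint above; the manual version is only meant to make the procedure explicit.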
Introduction to Machine Learning Exercise sheet 7
Exercise 1: ROC metrics
Consider a binary classification algorithm that yielded the following results on 10 observations. The table shows
true classes and predicted probabilities for class 1:
ID True class Prediction
1 0 0.33
2 0 0.27
3 0 0.11
4 1 0.38
5 1 0.17
6 1 0.63
7 1 0.62
8 1 0.33
9 0 0.15
10 0 0.57
a) Create a confusion matrix assuming a threshold of 0.5. Point out which values correspond to true positives
(TP), true negatives (TN), false positives (FP), and false negatives (FN).
b) Calculate: PPV, NPV, TPR, FPR, ACC, MCE and F1 measure.
c) Draw the ROC curve and interpret it. Feel free to use R for the drawing.
d) Calculate the AUC.
e) How would the ROC curve change if you had chosen a different threshold in a)?
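The quantities from a), b) and d) can be computed in base R as a cross-check for hand calculations. The sketch assumes that an observation is predicted as class 1 when its probability is at least 0.5, and it computes the AUC as the share of correctly ranked (positive, negative) pairs with ties counted as one half; variable names are hypothetical.

```r
# True classes and predicted probabilities from the table
truth <- c(0, 0, 0, 1, 1, 1, 1, 1, 0, 0)
prob <- c(0.33, 0.27, 0.11, 0.38, 0.17, 0.63, 0.62, 0.33, 0.15, 0.57)

# Hard predictions at threshold 0.5
pred <- as.integer(prob >= 0.5)

# Entries of the confusion matrix
tp <- sum(pred == 1 & truth == 1)
fp <- sum(pred == 1 & truth == 0)
fn <- sum(pred == 0 & truth == 1)
tn <- sum(pred == 0 & truth == 0)

metrics <- c(
  PPV = tp / (tp + fp),
  NPV = tn / (tn + fn),
  TPR = tp / (tp + fn),
  FPR = fp / (fp + tn),
  ACC = (tp + tn) / length(truth),
  MCE = (fp + fn) / length(truth),
  F1  = 2 * tp / (2 * tp + fp + fn)
)

# AUC: share of (positive, negative) pairs ranked correctly, ties counting one half
pos <- prob[truth == 1]
neg <- prob[truth == 0]
auc <- mean(outer(pos, neg, ">") + 0.5 * outer(pos, neg, "=="))

print(table(truth = truth, pred = pred))
print(round(c(metrics, AUC = auc), 3))
```

The pairwise formulation of the AUC does not depend on the threshold, since it only uses the ranking of the predicted probabilities.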
Exercise 2: k-NN
a) Let the two-dimensional feature vectors in the following figure be instances of two different classes (triangles
and circles). Classify the point (7, 6) – represented by a square in the picture – with a k-NN classifier using L1
norm (Manhattan distance):
$$d_{\text{Manhattan}}(x, \tilde{x}) = \sum_{j=1}^{p} |x_j - \tilde{x}_j|.$$
As a decision rule, use the unweighted number of the individual classes in the k-neighborhood, i.e., assign the
point to the class that represents most neighbors.
i) k = 3
ii) k = 5
iii) k = 7
[Figure: points of two classes (triangles and circles) and a square query point, with x- and y-axes ranging from 0 to 10]
b) Now consider the same constellation but assume a regression problem this time, where the circle-shaped points
have a target value of 2 and the triangles have a value of 4.
Again, predict for the square point (7, 9), using both the unweighted and the weighted mean in the neighborhood
(still with Manhattan distance).
i) k = 3
ii) k = 5
iii) k = 7
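The k-NN procedure with Manhattan distance can be sketched in R as follows. The training points below are hypothetical and are NOT the ones from the figure, so the output does not answer the exercise. The weighted mean uses inverse-distance weights, which is an assumption; other weighting schemes are possible.

```r
# Hypothetical training points (not taken from the figure) with class labels
train_x <- rbind(c(6, 6), c(7, 7), c(8, 5), c(5, 8), c(10, 8), c(4, 4), c(7, 3))
train_class <- c("circle", "triangle", "circle", "triangle", "triangle", "circle", "circle")
# Regression targets as in b): circles have value 2, triangles have value 4
train_value <- ifelse(train_class == "circle", 2, 4)
query <- c(7, 6)

manhattan <- function(a, b) sum(abs(a - b))

knn <- function(k) {
  # Manhattan distance from every training point to the query point
  d <- apply(train_x, 1, manhattan, b = query)
  nn <- order(d)[seq_len(k)]
  votes <- table(train_class[nn])
  # Assumed weighting scheme: inverse distance (requires all distances > 0)
  w <- 1 / d[nn]
  list(
    class = names(votes)[which.max(votes)],
    mean_unweighted = mean(train_value[nn]),
    mean_weighted = sum(w * train_value[nn]) / sum(w)
  )
}

for (k in c(3, 5, 7)) {
  res <- knn(k)
  cat("k =", k, "| class:", res$class,
      "| unweighted mean:", round(res$mean_unweighted, 3),
      "| weighted mean:", round(res$mean_weighted, 3), "\n")
}
```

Ties in distance at the edge of the neighborhood are broken here by the order of the training points, which is a simplification; the hypothetical data are chosen so that no such ties occur for k = 3, 5, 7.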
Introduction to Machine Learning Exercise sheet 8
Exercise 1: Splitting criteria
Given are the data set
x 1.0 2.0 7.0 10.0 20.0
y 1.0 1.0 0.5 10.0 11.0
and the same with log-transformed feature x:
log x 0.0 0.7 1.9 2.3 3.0
y 1.0 1.0 0.5 10.0 11.0
a) Compute the first split point the CART algorithm would find for each data set (with pen and paper or in R).
b) State the optimal constant predictor for a node N when minimizing the empirical risk under L2 loss and explain
why this is equivalent to minimizing “variance impurity”.
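The split search in a) can be sketched in R. The code assumes L2 loss with the node mean as constant predictor, and it takes midpoints between consecutive feature values as candidate split points; the midpoint convention and the helper name best_split are assumptions. The log-transformed feature is computed with log() rather than taken from the rounded table values.

```r
best_split <- function(x, y) {
  ord <- order(x)
  x <- x[ord]
  y <- y[ord]
  # Sum of squared errors around the node mean (empirical risk under L2 loss)
  sse <- function(v) sum((v - mean(v))^2)
  # Candidate split points: midpoints between consecutive distinct feature values
  ux <- unique(x)
  candidates <- (head(ux, -1) + tail(ux, -1)) / 2
  # Total risk of the two child nodes for every candidate
  risks <- sapply(candidates, function(s) sse(y[x <= s]) + sse(y[x > s]))
  list(split = candidates[which.min(risks)], risk = min(risks))
}

x <- c(1, 2, 7, 10, 20)
y <- c(1, 1, 0.5, 10, 11)

split_raw <- best_split(x, y)
split_log <- best_split(log(x), y)
print(split_raw)
print(split_log)
```

Comparing the two results on the original scale shows how a monotone feature transformation affects where the split is placed between the same two observations.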
Exercise 2: Impurity reduction
The fractions of the classes $k = 1, \dots, g$ in node $\mathcal{N}$ of a decision tree are $\pi_1^{(\mathcal{N})}, \dots, \pi_g^{(\mathcal{N})}$. Assume we replace the
classification rule in node $\mathcal{N}$
$$\hat{k} \mid \mathcal{N} = \arg\max_k \pi_k^{(\mathcal{N})}$$
with a randomizing rule
$$\hat{k} \sim \mathrm{Cat}\left( \pi_1^{(\mathcal{N})}, \dots, \pi_g^{(\mathcal{N})} \right),$$
in which we draw the classes in one node from the categorical distribution of their estimated probabilities (i.e.,
class k is predicted with probability $\pi_k^{(\mathcal{N})}$).
Compute the expected MCE in node N for data distributed i.i.d. like the training data. What do you notice?
(Hint : The observations and the predictions using the randomizing rule follow the same distribution.)
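A possible line of argument is sketched below. It assumes, as the hint suggests, that the true class $y$ and the randomized prediction $\hat{k}$ follow the same categorical distribution, and additionally that they are drawn independently of each other.

```latex
\begin{aligned}
\mathbb{E}\left[ \mathrm{MCE} \right]
  &= P(\hat{k} \neq y)
   = 1 - \sum_{k=1}^{g} P(\hat{k} = k, \, y = k) \\
  &= 1 - \sum_{k=1}^{g} \pi_k^{(\mathcal{N})} \cdot \pi_k^{(\mathcal{N})}
   = \sum_{k=1}^{g} \pi_k^{(\mathcal{N})} \left( 1 - \pi_k^{(\mathcal{N})} \right)
\end{aligned}
```

The last expression has the same form as the Gini impurity of the node.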
Introduction to Machine Learning Exercise sheet 9
Exercise 1: Classifying spam
a) Take a look at the spam dataset (?mlr3::mlr_tasks_spam). Briefly describe what kind of classification
problem this is and access the corresponding task predefined in mlr3.
b) Use a decision tree to predict spam. Re-fit the tree using two random subsets of the data (each comprising
60% of observations). How stable are the trees?
(Hint: Use rpart.plot() from the package rpart.plot to visualize the trees.)
c) Forests come with a built-in estimate of their generalization ability via the out-of-bag (OOB) error.
i) Show that the probability for each observation to be OOB in an arbitrary bootstrap sample converges
to $\frac{1}{e}$.
ii) Verify this result empirically by a small simulation. For this, draw 1000 bootstrap samples from a set of
1000 IDs and compute the average relative frequency of being OOB over all IDs.
iii) Use the random forest learner classif.ranger to fit the model and state the out-of-bag (OOB) error.
d) You are interested in which variables have the greatest influence on the prediction quality. Explain how to
determine this in a permutation-based approach and compute the importance scores for the spam data.
(Hint: use an adequate variable importance filter as described in
https://mlr3filters.mlr-org.com/#variable-importance-filters.)
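The simulation described in c) ii) can be sketched in base R as follows. The seed and variable names are arbitrary choices; the finite-sample value $(1 - 1/n)^n$ is printed next to the limit $1/e$ for comparison.

```r
set.seed(1)
n_ids <- 1000   # number of IDs to sample from
n_boot <- 1000  # number of bootstrap samples

# For each bootstrap sample, compute the share of IDs that were never drawn (OOB)
oob_share <- replicate(n_boot, {
  in_bag <- sample(n_ids, size = n_ids, replace = TRUE)
  mean(!(seq_len(n_ids) %in% in_bag))
})

print(c(
  simulated = mean(oob_share),
  finite_n  = (1 - 1 / n_ids)^n_ids,
  limit     = exp(-1)
))
```

Averaging the OOB share over bootstrap samples gives the same number as averaging each ID's relative OOB frequency over all IDs, since both are means over the same indicator matrix.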
Exercise 2: Decision boundaries
Simulate 500 samples from the mlbench.spirals data with a standard deviation of 0.1, and 4 cycles.
Visualize the decision boundaries of a random forest (classif.ranger learner from mlr3learners), using
mlr3viz::plot_learner_prediction, for forest sizes $M \in \{1, 2, 10, 100, 1000\}$ trees. Explain what you see.