Machine Learning Algorithms代写-COGS 118A|学霸联盟

Machine Learning Algorithms代写-COGS 118A

时间：2022-05-31

COGS 118A: Supervised Machine Learning Algorithms
Midterm Exam II Practice Problems
1 True or False
1. The false positive rate is the how frequently truly positive data samples are misclassified as
negative class.
[True]
[False]
2. For a linear classifier that is well-calibrated (i.e., wTx+ b can be interpreted as a probabil-
ity), the ROC curve is constructed by varying the classification threshold t ∈ [−1, 1] when
predicting outputs on a given dataset according to
f(x;w, b) =
{
+1, if wTx+ b ≥ t,
−1, otherwise. (1)
and then plotting the true positive rate vs the false positive rate for each choice of t
[True]
[False]
3. The best possible ROC curve lies along the diagonal.
[True]
[False]
4. Perceptrons, logistic regressions, and SVMs all share the identical prediction formulation
sign(w⊤x+ b) at test time.
[True]
[False]
5. Typically, when the gradient descent algorithm is used to train a classifier f(x;w, b), the
better the current model (wt, bt) predicts a sample xi, the less contribution xi makes to
the gradient in the update of wt.
[True]
[False]
6. The regularization term in the SVM loss function improves the robustness of the classifier
because it increases the margin through maximizing the magnitude of the weights.
[True]
[False]
7. The kernel trick can only be applied to SVMs, not other ML algorithms.
[True]
[False]
8. Perceptrons can learn the XOR task.
[True]
[False]
9. Even with a convex loss function, if you pick the wrong step size, gradient descent could end
up never converging on the answer or taking a very long time to do so
[True]
[False]
10. A one-vs-rest multi-class prediction scheme uses fewer classifiers than a one-vs-one scheme.
[True]
[False]
2 Conceptual Questions
Select the correct option(s). Note that there might be multiple correct options.
1. Choose all the valid answers to the description about gradient descent from the options
below:
A. With different initial weights, it is possible for the gradient descent algorithm to obtain
to different local minimum.
B. Every gradient descent iteration can always decrease the value of loss function even when
the gradient of the loss function is zero.
C. It is possible for the gradient descent algorithm to find the global minimum, though it is
not guaranteed.
D. Large learning rate is always better than a small one because it leads to faster convergence.
2. In the Figure 1, we use a black line to mark the decision boundary of a linear SVM classifier
parameterized by weight vector w and bias b. If we start to decrease the margin of the
classifier, what would happen to w and b?
7SCR©
Computing e margin w dth
Figure 1: A linear SVM classifier with data points.
A. b will be larger.
B. w will have smaller magnitude.
C. w will have greater magnitude.
D. Both A and C.
3. Pick all of the following classifiers that can fit a non-linear decision boundary.
A. Decision tree
B. Perceptron
C. Kernel SVM
D. Logistic regression
E. k-NN
4. Suppose we are given a dataset S = {(xi, yi), i = 1, . . . , n} where x is the feature vector and
y ∈ {−1,+1} is the corresponding label. We train a logistic regression classifier parameterized
by weight w and bias b on the dataset S. Please choose the correct loss formulation L(w, b)
of the logistic regression classifier:
A. L(w, b) = 2√
w⊤w
+ C
∑n
i=1max (0, 1− yi(w⊤xi + b))
B. L(w, b) = −∑ni=1 ln 11+e−yi(w⊤xi+b)
C. L(w, b) = −∑ni=1 ln 11+e−(w⊤xi+b)
D. L(w, b) =∑ni=1max(0,−yi(w⊤xi + b))
E. L(w, b) = 2√
w⊤w
− C∑ni=1max (0, 1− yi(w⊤xi + b))
5. Choose all the valid answers to the question from the options below.
When training a classifier, the goal of cross-validation is to:
A. Reduce the training error.
B. Reduce the training time.
C. Tune/Choose the hyper-parameters.
D. Estimate the generalization error.
3 Error Metrics
There are 20 people tested for SARS-CoV-2 by the DeepMLBuzzword3000 test. If someone is
actually well or sick they have the appropriate emoji. Our test results are always negative (no
virus) unless the thermometer icon appears on that person’s head; those people test positive for
the virus. Evaluate the following metrics for this (not very good) virus test.
1. Compute the Recall using the following formula:
Recall = number of true positives (correct guesses)number of actual targets
2. Compute the Precision using the following formula:
Precision = number of true positives (correct guesses)number of guesses
3. Compute the F-value using the following formula:
F-value = 2 * Precision * RecallPrecision + Recall
4 Support Vector Machine
Consider a dataset {(xi, yi), i = 1, 2, ..., 10} where x = [x1, x2]⊤ and y ∈ {−1,+1}, which is shown
in the Figure 2 below. Suppose we have trained a support vector machine (SVM) classifier on the
dataset, which has the decision boundary ℓ : w⊤x + b = 0. The SVM classifier is optimized as
following:
Find: argmin
w
1
2
||w||22
Subject to, ∀i :w⊤xi + b ≥ +1, if yi = +1,
w⊤xi + b ≤ −1, if yi = −1.
We define ℓ+ : w
⊤x + b = +1 as the positive plane and ℓ− : w⊤x + b = −1 as the negative
plane. After training the SVM classifier using the gradient descent method for several iterations
to reach an intermediate state, we have obtained the positive plane and the negative plane, which
are indicated as solid lines in the Figure 2 below.
Figure 2: the positive plane and the negative plane, and the data points
1. Calculate the w and b from your drawn positive plane and negative plane.
2. Calculate the size of the margin.
5 K Nearest Neighbors
Consider a training dataset Straining = {(xi, yi), i = 1, 2, ..., 10} where each data point (x, y) has
a feature vector x = [x1, x2]
⊤ and the corresponding label y ∈ {−1,+1}. The points with the
corresponding labels in the training set are shown in the figure below. In this question, we attempt
to predict the label of a new point with the k nearest neighbors (k-NN) algorithm. Specifically, in
the k-NN, we use Euclidean distance as the distance metric and k is set to 1.
1. Predict the label for feature vector (1, 3).
2. Predict the label for feature vector (5, 4).
6 Perceptron
In this section, we will apply perceptron learning algorithm to solve a binary classification problem:
We need to predict a binary label y ∈ {−1,+1} for a feature vector x = [x0, x1]⊤. The decision
rule of the perceptron model is defined as:
f(x;w, b) =
{
+1, if wTx+ b ≥ 0,
−1, otherwise. (2)
where w = [w0, w1]
⊤ is the weight vector, and b is the bias scalar. Given a training dataset
Straining = {(xi, yi)}, i = 1, . . . , n}, the outline of the perceptron algorithm is shown as below:
• Initialize weight w and bias b.
• If not all data points in Straining are correctly predicted:
– Obtain the prediction for each feature vector in Straining.
– If the prediction is wrong, then update the parameters w and b.
• Otherwise return current weight w and bias b.
In this problem, you will implement a perceptron algorithm. Suppose X is the matrix of all feature
vectors xi and Y is the matrix of all labels yi in the training set Straining. Besides, assume a function
calc error(X, Y, W, b) is given, which computes the error from the feature matrix X, the label
matrix Y, the weight W and the bias b.
Please reorder the following lines of Python code to complete your perceptron algorithm. Fill the
indexes (a-g) into the blanks.
def perceptron_algorithm(X, Y):
# Initialization.
W = np.zeros(2)
b = 0.0
lam = 1
# Main algorithm.
while calc_error(X, Y, W, b) > 0:
for xi, yi in zip(X, Y):
__________________________________________________
______________________________________________
else:
______________________________________________
__________________________________________________
______________________________________________
else:
______________________________________________
______________________________________________
return W, b
(a) W = W + lam * (yi - yi_pred) * xi
(b) yi_pred = -1
(c) b = b + lam * (yi - yi_pred)
(d) yi_pred = +1
(e) if W.T.dot(x) + b >= 0:
(f) continue
(g) if yi_pred == yi:
7 Model selection
For the tasks you have to complete below, you have been given a dataset S = {(xi, yi), i = 1, 2, ..., n},
where xi is a vector of d dimensions.
1. You wish to estimate the generalization error of a classifier that is VERY computationally
intensive to train. The number of samples in the dataset, n, is huge, on the order of the
number of cat videos on YouTube. You should use
A. leave one out cross-validation
B. 10 fold cross-validation
C. 100 bootstrap resamples
D. 50%/50% train/test split
2. You wish to find out which one of h different sets of hyperparameters optimize logistic regres-
sion’s performance on S. This classifier has training time complexity O(nd) and testing time
complexity O(d). Assuming you use k fold cross-validation to select the hyperparameters,
approximately how long will it take to do the model selection?
A. O(kd)
B. O(hkd)
C. O(knd)
D. O(hknd)
3. You want to both select the correct hyperparameters and also estimate the generalization
error. The number of samples is probably just barely enough. You should
A. do a randomized 50%/50% train/test split; select hyperparameters on the training set, fit
the best hyperparameters on the training set, estimate generalization error with the fitted
best version on the test set
B. do a randomized 50%/25%/25% train/validation/test split; train on the training set,
select hyperparameters on the validation set, fit the best hyperparameters on the train-
ing+validation sets, estimate generalization error with the fitted best version on the test
set
C. do a randomized 70%/30% train/test split; use 5 fold cross validation on the training set
to select hyperparameters, fit the best hyperparameters on the training set, and estimate
generalization error with the fitted best version on the test set
8 Kernels
Consider a toy dataset S = {(xi, yi), i = 1, 2, 3} where each example is represented by a 2 dimen-
sional feature vector. The design matrix for this problem is as follows:
X =
2 43 5
0 0
 (3)
Based on your domain expertise, you would like to use the kernel trick to make predictions, e.g.,
predictions will be make using a linear combination of the training points based on kernel similarity.
In other words, you have a good intuition of how to compute the similarity between data points
rather than an underlying feature transformation ϕ(x) (of possibly unbounded dimensionality).
Given a new test point x′ = [2, 3]⊤, you would like to make a prediction. Assuming (1) you chose
the following Gaußian kernel (please remember that exp is base e and that ∥x∥2 is the square of
the L2 norm of x):
k(x, x′) = exp(
−∥x− x′∥2
2
)
and assuming (2) that k(x′) has entries ki(x′) = k(xi, x′) and represents the similarity of your
test point to all points in the training dataset, please compute the kernel similarity of x′ (up to 3
decimal points):
k(x′) =

 (4)
9 Logistic Regression
Consider a logistic regression classifier that predicts a binary label y ∈ {−1, 1} for a feature vector
x = [x1, x2]
⊤ ∈ R2:
y =
{
+1, if f(x;w, b) ≥ 0.5,
−1, otherwise.
P (y = +1|x) = f(x;w, b) = 1
1 + e−(w⊤x+b)
(5)
where weight vector w = [w1, w2]
⊤ ∈ R2 and bias b ∈ R. Besides, P (y = +1|x) represents the
probability of prediction y = +1 given a feature vector x.
1. Write down the formula for P (y = −1|x) (in a similar form as Equation (5) above).
2. Suppose that after training the logistic regression classifier on a dataset, we have learned the
weight vector w = [1,−2]⊤ and the bias b = 2.
(a) Sketch the decision boundary. Make sure you have marked the x1 and x2 axes and the
intercept points on these axes.
(b) Predict the label of xpred = [3, 4]
⊤ using the learned logistic regression classifier.