xuebaunion@vip.163.com

3551 Trousdale Rkwy, University Park, Los Angeles, CA

留学生论文指导和课程辅导

无忧GPA：https://www.essaygpa.com

工作时间：全年无休-早上8点到凌晨3点

微信客服：xiaoxionga100

微信客服：ITCS521

机器学习代写 - CSE 142 Machine Learning

时间：2020-11-12

CSE 142 Machine Learning Due: November 7th Homework 2

Directions: This homework is to be done individually. Please upload a set of solutions containing your

name and @ucsc.edu e-mail address to Canvas. Typeset (e.g. TeX) solutions are preferred, but scans or

photographs of hand-written solutions are acceptable provided that they are neat and legible. The TA may

deduct points for poorly organized or illegible solutions.

Question: 1 2 3 4 Total

Points: 20 30 20 30 100

Bonus Points: 0 0 0 0 0

Score:

Questions:

1. Logistic Regression. Logistic regression treats a binary classification (e.g. is/is-not a dog) probabal-

istically, where the probability for any example x to be assigned the class y = 1 is given in terms the

logistic sigmoid function g and a weight vector w that must be learned:

g(w · x) = exp (w · x)

1 + exp (w · x)

p(1 | x,w) = g(w · x)

p(0 | x,w) = 1− g(w · x)

(a) (10 points) Let g(w · x) = q

From the definitions above, prove that w · x is a logit or “log-odds" function when 0 < q < 1

w · x = ln q

1− q

(b) (10 points) Just as p(1|w,x) = g(w · x) is the probability for a positive classification conditioned

on a set of weights, it is the likelihood of weights w given a classification y = 1. Show that the

gradient (in parameter space - derivatives should be taken with respect to the components of w) of

this log likelihood function reduces to

∇ log g(w · x) = x(1− g(w · x))

which is a vector quantity parallel to x.

2. (30 points) Naive Bayes. Use Naive Bayes to estimate whether a student will be an honor student

(H) or normal student (N) in college based on their high school performance. Each instance has two

features: the student’s high school GPA (a real number) and whether or not the student took any AP

courses (a Boolean value, yes=1, no=0). Based on the following training data, create (by hand or with

computational tools) a Naive Bayes prediction rule using normal (Gaussian) distributions to estimate

the conditional probability density of high school GPAs given honors status (H or N) (this assigns non-

zero probability to negative or greater-than-four GPA values, but that is fine for our purposes) and a

Bernoulli distribution for the AP feature.

label AP GPA

H yes 4.0

H yes 3.7

H no 2.5

N no 3.8

N yes 3.3

N yes 3.0

N no 3.0

N no 2.7

N no 2.2

Recall that Naive Bayes makes the simplifying assumption that the features are conditionally independent

given a class / label:

p(gpa, ap | honors) = p(gpa | honors)p(ap | honors).

Use maximum likelihood estimation (not the unbiased or Laplace estimates) for the distributions of the

two features conditioned on the two classes. Give the mean and variance for each distribution over GPA

values. For the variance here, you only need to calculate the biased sample variance estimator (divided

by n), not the unbiased one (divided by n− 1).

Describe your prediction rule in the following form:

If AP courses are taken, predict H if the GPA is in a ⊂ R

if AP courses are not taken, predict H if the GPA is in b ⊂ R

where a and b must be found.

Hint: It is probably easier to get this description if you take logarithms. 3 digits of precision should

sufficient. Also, the logarithm of the Gaussian densities are quadratic, possibly yielding two distrinct

zeros that correspond to the boundaries of the intervals a or b.

3. Nearest Neighbors. Assume that examples are drawn uniformly from the unit square. Independent

of example features (i.e., location (x1, x2) in the unit square), labels are generated at random such that

a proportion q where 0.5 < q < 1 are assigned label y = 1 and the remainder are assigned label y = 0.

The Bayes-optimal hypothesis will minimize the error rate of its predictions given (x1, x2) by always

predicting y = 1 and suffering an error rate of 1− q.

How should we expect the 3-Nearest-Neighbors algorithm perform? Assume that the algorithm is trained

on a large set of known labels. For each new sample, the algorithm finds the three closest (according to

any metric!) points in its training set, finds the label shared by the majority of these three, and assigns

this label to the new example.

(a) (10 points) What is the expected error rate of the 3-Nearest-Neighbor algorithm, in terms of q?

(b) (10 points) When is this better or worse than the Naive Bayes solution (recall, 0.5 < q < 1)?

4. Decision Tree. Sammy-the slug owns a car dealership that sells two types of cars: Honda(H) and

BMW(B). He collected the following data of his customers that records their Gender, Annual income

and the type of car they purchased:

Gender AI (Annual Income in thousands) Preference

F 100 B

M 150 B

F 80 B

M 75 B

F 90 H

F 85 H

M 85 H

F 70 H

Page 2

This question is about building decision trees to predict the car type that a new customer will prefer.

Note that one of the attributes, Annual Income (AI), is a continuous variable. Lets assume that we will

only allow binary splits for this attribute of the form AI < a and AI ≥ a, where a lies in the dataset.

However, there can be multiple such splits in one path from root to leaf.

For all your calculations, use log base 2.

(a) (3 points) For this part of the question, lets assume that for some unknown reason, Sammy insists

on keeping the “Annual Income” at the root node. How many possible values of a does he need to

consider? What are they?

(b) (3 points) What is the entropy of labels (car type) in the training dataset?

(c) (9 points) What is the optimal root node for this dataset? Show your calculations.

(d) (15 points) Draw the Decision Tree that would be learned by ID3 on this dataset. Only binary splits

are allowed i.e. a node can have only two children. Your tree should contain all the information

necessary to read it. At each node, indicate:

1. the attribute you are splitting on (when splitting on AI, also include the a used);

2. the label distribution of the instances at that node before the split (e.g. if there are 3 instances

at a node and 2 belong to H and 1 belongs to B class, then label the node as 〈2, 1〉).

3. Additionally, for each non-leaf node indicate the gain attained by the corresponding split, and

label each leaf node by its class-label (H or B).

4. All edges between a parent and a child should be labeled with the value of the attribute.

5. It is okay to draw the tree by hand and include a clear picture in your pdf.

6. Don’t forget to include your calculations (6 points here).

Page 3