COMS 4721 Spring 2022 Homework 1
Due at 11:59 PM on February 11, 2022
About this assignment. This homework assignment is a mix of some problems that involve programming
and data analysis using Python, as well as some problems that do not involve programming but may require
a bit of mathematical reasoning, simple derivations, and/or “calculations” (usually done by hand).
Data files needed for the programming/data analysis parts are available on Courseworks (under “Files”).
Working in pairs is explicitly permitted on this assignment (subject to the conditions detailed in the syllabus).
Submission. For the programming and data analysis portions (specifically, the problems in Section 2),
please create and use a Jupyter Notebook to solve these problems. Include all of the code you write for this
assignment in the notebook, and add appropriate comments so that the code is easy to understand. Make
sure your answers to the various questions are clearly stated and labeled in separate Markdown cells (even
if the answer also appears as an output of a code cell). You will need to submit the Jupyter Notebook on
Courseworks as part of your homework submission.
You will also need to create a PDF document containing your solutions to all problems (including those
involving programming and data analysis). Please read the homework policy (in the syllabus) for information
about how to write-up your solutions. You will need to submit this document on Gradescope.
• For redundancy purposes, please make sure your name and UNI appear prominently on the first page
of the PDF. If you are working in pairs, both students’ names and UNIs should appear in this manner.
• On Gradescope, for each problem, you need to select all of the pages that contain your solutions
for that problem. If part of your solution is on an unselected page, we will consider it missing. See the
instructions for submitting a PDF on Gradescope.
• On Gradescope, if you are working in pairs, then only one student in the pair should make the submission.
That student must make sure to add both students in the pair to the submission. This is
something that has to be done at the time of submission within the Gradescope submission system.
See the instructions for adding group members on Gradescope. Do not forget to do this, or else one
student will not receive credit for the homework submission!
• On Courseworks, submit only the Jupyter Notebook (a .ipynb file). Do not include any of the data
files—we already have those :-)
• On Courseworks, if you are working in pairs, then only one student in the pair should make the
submission of the Jupyter Notebook. (We are only using this to collect your Jupyter Notebooks; the
grading will happen on Gradescope.) If both students submit, we’ll only look at one of the submissions
(and we’ll pick one arbitrarily and ignore the other).
• There is a strict deadline for the submission on both Gradescope and Courseworks. Do not wait until
the last minute to make your submissions! No late submissions will be accepted.
Please consult the Gradescope Help Center if you need further help with submitting on Gradescope, and the
Canvas Guides if you need further help with submitting on Courseworks.
1
1 Decision trees and the greedy training heuristic
Recall the greedy training heuristic in the context of learning classification trees. In the problems below,
regard K as an arbitrary integer that is at least two. (In the Iris classification problem from lecture, we had
K = 3.) Also recall the notation [K] := {1, 2, . . . ,K}.
1.1 Categorical features
Problem 1.1. We saw that in the greedy training heuristic, if all features are numerical features, then
the total number of “decision stumps” that need to be considered in the first “greedy step” is at most
d× (n− 1), where d is the total number of features, and n is the total number of training data. Suppose,
instead, that all of the features are categorical features, with each feature taking 26 possible values (e.g.,
letters of the English alphabet). Recall that for categorical features, the types of predicates we consider are
those that check for membership in a subset of the possible values. How many possible “decision stumps”
may need to be considered in the first “greedy step” of the greedy training heuristic? Here, you should
give an upper-bound on this number that is as small as possible, but that holds for all training datasets of
size n with d categorical features as described above. Explain why your bound is correct.
1.2 Uncertainty and entropy
Problem 1.2. Define the “uncertainty” of a collection of labeled examples S by
uncertainty(S) := min
y∗∈[K]
[
fraction of examples (x⃗, y) ∈ S with y ̸= y∗
]
.
What must be true about a classification tree T What property should a classification tree T possess so
that, for any collection of labeled examples S,
mistakes(T ;S) =
∑
ℓ∈leaves(T )
|Sℓ| × uncertainty(Sℓ)
where Sℓ denotes the examples in S that are routed to leaf ℓ, and |Sℓ| is the number of examples in Sℓ.
Explain your answer.
Problem 1.3. Many implementations of the greedy training heuristic use a different notion of “uncertainty”
instead of the one defined in the previous problem. One example is entropy:
entropy(S) := log2(n)−
1
n
[ ∑
y∗∈[K]
ny∗(S) log2
(
ny∗(S)
)]
,
where ny∗(S) is the number of examples in S with label y∗, and n is the total number of examples in S.
If all of the examples in S have the same label, what is entropy(S)? And if all K possible labels appear
equally often in S, what is entropy(S)? Explain your answers.
Entropy is a somewhat mysterious term that vaguely refers to the “information” or “randomness” content of
a system. The specific notion of entropy we use in the previous problem comes from Shannon’s entropy of a
discrete random variable X. If the range of X is [K], then the Shannon entropy of X is defined to be
H(X) := −
K∑
k=1
Pr(X = k) log2 Pr(X = k).
(In this definition, 0 log2 0 is understood to be equal to 0.) It is a useful exercise to identify the connection
between Shannon entropy and the notion of entropy from the previous problem. (Do this on your own . . . )
2
1.3 Entropy objective
When entropy is used in place of uncertainty in the greedy training heuristic, the objective function is no
longer mistakes(T ;S), but rather
J(T ;S) :=
∑
ℓ∈leaves(T )
|Sℓ| × entropy(Sℓ),
where Sℓ and |Sℓ| are as defined above. In the next two problems, let S denote the following collection of
n = 15 labeled examples given in the table below, one example per column:
x 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1.1 1.2 1.3 1.4
y 2 2 2 1 2 2 2 1 1 1 2 2 2 2 1
(Here, K = 2, and there is a single numerical feature x.)
Problem 1.4.What should be the initial (single-node) classification tree at the start of the greedy training
heuristic? And what is the objective value achieved by this tree? Explain your answers.
Problem 1.5. Consider “splitting” the optimal single-node classification tree using the predicate “x ≤ 0.9?”,
assuming that when this “split” is done, the return values of the leaf nodes are chosen to minimize the
objective function. What is the reduction in the objective value J(T ;S) achieved by this split? That is, if
T1 is the initial classification tree (from Problem 1.4), and T2 is the decision stump that minimizes the
objective, then what is the value of
J(T1;S)− J(T2;S) ?
And what would be the reduction in objective value if we had used the original objective function
mistakes(T ;S) instead of J(T ;S)? Explain your answers.
3
2 Decision trees and the “Covertype” dataset
The problems in this section are intended to help you become familiar with the mechanics of using decision
trees with a classification dataset. Most of the problems will involve the specific software library scikit-learn
(sklearn). Please remember to submit your code on Courseworks for the problems in this section.
The “Covertype” dataset records cartographic variables of different wilderness areas in Roosevelt National
Forest. Each area is also associated with one of seven types of forest cover. The synthetic prediction
problem we consider here is predicting the forest cover type of a given wilderness area from the values of the
cartographic variables.
Load (a modified version of) the dataset from the file covtype3.pickle (available on Courseworks) using
the following code.
import pickle
with open('covtype3.pickle', 'rb') as f:
covtype = pickle.load(f)
The variable covtype refers to a dictionary that contains training data and test data. The training data’s
feature vectors and labels are in covtype['train_data'] and covtype['train_labels'], respectively. The
same for the test data are in covtype['test_data'] and covtype['test_labels']. The class names are
in covtype['class_names']. The feature names are in covtype['feature_names'].
2.1 Learn a decision tree
The following code can be used to learn a decision tree using the training data. Here, we just use the default
training algorithm implemented in sklearn with default hyperparameters.
from sklearn import tree
my_tree = tree.DecisionTreeClassifier(random_state=0)
my_tree.fit(covtype['train_data'], covtype['train_labels'])
Throughout, we will refer to the decision tree learned using the above code as my_tree.
Read the sklearn documentation to find answers to the following questions about the decision tree learning
algorithm implemented in sklearn.
Problem 2.1. What is the default criterion used to measure the quality of a split? What is the default
stopping criterion used by the learning algorithm? If you wanted to ensure that the learned decision tree
does not have more than 20 leaf nodes, how would you change the call to the DecisionTreeClassifier
constructor?
The learning algorithm implemented in sklearn uses pseudorandom numbers to decide how to break ties.
Setting a random seed (e.g., via the argument random_state=0, as done in the code example above) ensures
that the same pseudorandom numbers are used every time this code is executed.
The following code can be used to assess the quality of the learned decision tree my_tree on the training
data and the test data.
from sklearn.metrics import classification_report
print('Performance on training data')
print(classification_report(covtype['train_labels'],
my_tree.predict(covtype['train_data']),
target_names=covtype['class_names']))
print('Performance on test data')
print(classification_report(covtype['test_labels'],
4
my_tree.predict(covtype['test_data']),
target_names=covtype['class_names']))
The classification report displays various performance measures. The classification accuracy (which is one
minus the error rate) is given in the third-to-last line.
Problem 2.2. What is the training error rate of the learned decision tree my_tree? And what is the test
error rate of the learned decision tree my_tree?
2.2 Examine the structure of the learned decision tree
The following code (which uses matplotlib) can be used to visually display (part of) the learned decision tree
my_tree.
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(100, 50))
tree.plot_tree(my_tree,
max_depth=2,
class_names=covtype['class_names'],
feature_names=covtype['feature_names'])
plt.show()
By increasing the max_depth parameter, you can display more of the learned decision tree. (No need to
include this visual rendition of the tree in your write-up.)
The following code can be used to determine the depth of each node in the learned decision tree my_tree.
(This is based on a code example from the sklearn documentation.)
import numpy as np
node_depth = np.zeros(shape=my_tree.tree_.node_count, dtype=np.int64)
stack = [(0, 0)]
while len(stack) > 0:
node_id, depth = stack.pop()
node_depth[node_id] = depth
if my_tree.tree_.children_left[node_id] != my_tree.tree_.children_right[node_id]:
stack.append((my_tree.tree_.children_left[node_id], depth + 1))
stack.append((my_tree.tree_.children_right[node_id], depth + 1))
The root node is considered to be at depth 0.
Problem 2.3. What is the maximum depth of the learned decision tree my_tree, and how many nodes
are at that depth? And which depth has the most number of nodes in my_tree? (You may want to write
some code to compute the answers to these questions. You can start with the code given above if you like.)
2.3 Use cross-validation to choose a hyperparameter value
Suppose you want to only learn decision trees that are limited in depth. This can be achieved using a
“maximum depth” hyperparameter in the decision tree learning algorithm implemented in sklearn. But how
should you choose this hyperparameter value? Using hold-out cross-validation is one possibility.
To implement hold-out cross-validation, you will first need to randomly partition the training data into two
parts. The following code does exactly this.
from sklearn.model_selection import train_test_split
5
X_train, X_val, y_train, y_val = train_test_split(covtype['train_data'],
covtype['train_labels'], test_size=0.2, random_state=0)
You will also need to decide on which values to consider for the “maximum depth” hyperparameter. Let’s
fixed this to the following ten values: 5, 10, 15, 20, 25, 30, 35, 40, 45, 50.
Problem 2.4. Use hold-out cross-validation to pick a value for the “maximum depth” hyperparameter.
(You may want to use sklearn.metrics.accuracy_score, and also save some of the intermediate results
for Problem 2.6 below.) What is the chosen hyperparameter value?
Problem 2.5. Use matplotlib (or any other plotting software with sufficient capabilities) to plot
the validation accuracy (or validation error rate, if you prefer) as a function of the “maximum depth”
hyperparameter value. Label the axes of the plot and give the plot an appropriate title. Please include
this plot in your write-up.
Problem 2.6. Now use the value for the “maximum depth” hyperparameter chosen in Problem 2.4
to learn another decision tree, using the entirety of the training data (covtype[’train_data’] and
covtype[’train_labels’], not just X_train and y_train). What is the test error rate of this new
learned decision tree?
6
3 Maximum likelihood estimators
For the problem of predicting the outcome of a coin toss, we came up with a natural prediction learning
strategy: look at the outcomes of n previous coin tosses Y1, . . . , Yn, and then predict with the majority
outcome:
• If Y1 + · · ·+ Yn > n2 , then predict heads (1)
• Else, predict tails (0)
This seems like a natural idea, but one can view it as a special case of a more general principle, which is the
plug-in principle.
Recall, if we had known the “bias” (i.e., the parameter θ = Pr[heads]) of the coin whose outcome we want to
predict, then we have a formula for the optimal prediction:
optimal prediction =
{
1 if θ > 1/2,
0 otherwise.
If we don’t know θ, the plug-in principle says that we should just estimate it, and then act as if our estimate is
correct. That is, we plug our estimate of the unknown quantity θ into the formula for the optimal prediction.
So, how can we estimate θ using the data Y1, . . . , Yn available to us? In many cases, there is a relatively
simple rule for deriving a good estimator, called the principle of maximum likelihood, and estimators derived
this way are called maximum likelihood estimators (MLEs).
The “Coin tosses handout” on the course website walks through the steps of deriving the MLE for θ is the
coin toss model. (See also §9.2 in A Course in Machine Learning or §7.1 in Statistical Models: Theory and
Practice.)
The next problems offer practice on deriving formulae for maximum likelihood estimators for other models.
Problem 3.1. Consider a statistical model for IID random variables X1, . . . , Xn, parameterized by θ ∈ Θ.
The distribution of X1 has probability density function fθ satisfying
fθ(x) =
1
Zθ
x2e−x/θ for all x > 0,
and fθ(x) = 0 for all x ≤ 0. Here, Zθ =
∫∞
0 x
2e−x/θ dx is the “normalization constant” that ensures the
probability density function integrates to one. The parameter space is the positive reals Θ = {θ ∈ R : θ > 0}.
Derive a simple formula for the MLE for θ (given data (X1, . . . , Xn) = (x1, . . . , xn)), and briefly justify
each step of the derivation.
Problem 3.2. Now consider the following statistical model for random variables X1, X2, which are not
independent. The distribution of X1 is Poisson(θ) (i.e., Poisson with rate θ), and for any non-negative
integer x2, the conditional distribution of X2 | X1 = x1 is Poisson(x1θ). The parameter space is the
positive reals Θ = {θ ∈ R : θ > 0}. Write a formula for the joint probability mass function of (X1, X2)
with parameter θ; derive a simple formula for the MLE for θ (given data (X1, X2) = (x1, x2)); and briefly
justify each step of the derivation.
Recall that the Poisson distribution with rate λ > 0 (which we denote by Poisson(λ)) is the discrete probability
distribution over the non-negative integers where the probability mass assigned to any non-negative k is
λke−λ
k! .
A non-standard special case is Poisson(λ) for λ = 0; we define this to be the distribution with all of the
probability mass at 0 (i.e., for X ∼ Poisson(0), we have Pr(X = 0) = 1).
7
4 Randomized response
Suppose you would like to estimate the proportion of people in a population with a positive COVID19 test
result, i.e., the positivity rate. (For the purpose of this problem, assume everyone has been tested.) You
randomly sample n individuals from the population.1 You could just ask each person whether or not they
have tested positive. However, they may be uncomfortable sharing that information with you outright.
The randomized response method was developed to address this privacy concern. In the randomized response
method, surveyed individuals are not asked to directly tell you whether or not they have tested positive.
Instead, you ask each surveyed individual to do the following:
• Toss a fair coin three times (without letting you see the outcomes).
• If at least one toss comes up heads:
– Respond truthfully (i.e., say 1 if the individual has tested positive; 0 if not).
• Else:
– Give the opposite response (i.e., say 0 if the individual has tested positive; 1 if not).
Because you do not observe the outcomes of the coin tosses, you never learn, with certainty, whether the
surveyed individual has tested positive or not. This is a very strong privacy guarantee for the surveyed
individuals: even if you know the true test results for all but one individuals, you remain uncertain about the
test result for that remaining individual. (This property is called differential privacy.) Moreover, the collected
information can still be used to obtain a good estimate of the proportion of individuals in the population
with a positive test.
Consider a statistical model for n responses collected in this way, where the responses are regarded as IID
{0, 1}-valued random variables Y1, . . . , Yn, and all coin tosses are independent. Let θ ∈ [0, 1] denote the
parameter of the model that equals the true positivity rate.
Problem 4.1. What is the probability that Y1 = 1? Give your answer in terms of the parameter θ.
And what is the log-likelihood of θ given data y1, . . . , yn ∈ {0, 1}? Write your answer in terms of θ and
y1, . . . , yn.
Problem 4.2. Suppose n = 100, and the number of yi that are equal to 1 is 40 (i.e.,
∑n
i=1 yi = 40). Plot
(e.g., using matplotlib) the log-likelihood as a function of θ ∈ [0, 1], and include this plot in your write-up.
What appears to be the θ with highest likelihood? (No need to submit any code for this problem.)
1When sampling from a finite population, we typically use sampling without replacement. But to simplify the problem,
assume that sampling with replacement is used.
8
5 Generative model
Consider a problem of predicting the virus infection status of an individual on the basis of a self-administered
test. We’ll encode “positive status” as 1, and “negative status” as 0. To administer the test, an individual
provides a sample from their body to the testing device, and then the device reports either 1 or 0. However,
the test may be imperfect, and hence the prediction of the virus infection status from the test result may
also be imperfect. We model this problem using the generative model for binary classification described
in lecture. The virus infection status is Y , and we assume it is a random variable with Y ∼ Bernoulli(π)
for some parameter π ∈ [0, 1]. The result of the self-administered test is X, and we assume it is a random
variable such that
(X | Y = 0) ∼ Bernoulli(q0), (X | Y = 1) ∼ Bernoulli(q1)
for some parameters q0, q1 ∈ [0, 1]. So the model parameters are θ⃗ = (π, q0, q1). (Recall the notation used
here: For each y ∈ {0, 1}, “the conditional distribution of X given Y = y is Bernoulli with parameter qy.”)
Problem 5.1. Consider the binary classifier f(x) = x (i.e., predict the virus infection status to be the
same as the test result). What is the error rate of f? Give your answer in terms of the model parameters.
Problem 5.2. Suppose the test is purported to “correctly identify 84.6% of positive specimens and 98.5%
of negative specimens”. (Here, “positive specimen” means a sample from an individual with positive (1)
virus infection status, and “negative specimen” means a sample from an individual with negative (0)
virus infection status.) What does this suggest about the values of the model parameters (assuming the
generative model described above is correct)?
Now suppose an individual is able to self-administer 10 identical tests whose results can be regarded as
independent conditional on the virus infection status of the individual. We would like to develop a binary
classifier to predicting the virus infection status on the basis of the 10 test results. Furthermore, we would
like the classifer to be as accurate as possible when used by individuals in a particular population, which
we model using a generative model similar to that described above, but now with X1, . . . , X10 denoting the
results of the 10 identical tests self-administered by an individual with virus infection status Y . In this new
generative model, we have Y ∼ Bernoulli(π), and
(X1, . . . , X10 | Y = 0) ∼iid Bernoulli(q0), (X1, . . . , X10 | Y = 1) ∼iid Bernoulli(q1).
Again, the model parameters are θ⃗ = (π, q0, q1). (Recall the notation used here: for each y ∈ {0, 1},
“X1, . . . , X10, given Y = y, are conditionally independent and each of the Xi has the same conditional
distribution, which is Bernoulli with parameter qy.”)
Problem 5.3. Assume the generative model described above with parameters π = 0.05, q0 = 0.01, and
q1 = 0.9. Succinctly but precisely describe the optimal classifer (in words). And what is the error rate of
this optimal classifier? Explain your answers.
Problem 5.4. How would your answers to Problem 5.3 change if π = 0.025 (instead of π = 0.05)? Explain
your answer.
Problem 5.5. If you didn’t know the values of these parameters θ⃗, how would you estimate them from
data? Describe the nature of the data you would use, any modeling assumptions that you would adopt,
and the specific estimation methods (with formulae and/or pseudo-code as appropriate).
9