CSE 347/447 Data Mining: Homework 1
Due at 11:59 PM, February 21, 2021
Standard and General Requirements
• Work on it by yourself, and try to write as perfect a solution as you can. Discussion is allowed at the level
of technical conversation only; if you do discuss the problems with others, give proper acknowledgments.
Students are expected to abide by the Lehigh Academic Integrity Policy.
• Typed solutions are encouraged, especially if your handwriting is messy. However, there will be no
• Partial credit will be given for partial solutions, but not for long off-topic discussion that leads nowhere.
Overall, think before you write, and try to give concise and crisp answers.
• Late policy: You can be at most 3 days late; for every late day you lose 10% of your grade, unless
some other arrangement is agreed to before the due date.
Exercise 1 (20 points): Classify the following attributes as binary, discrete, or continuous. Also classify them as
qualitative (nominal or ordinal) or quantitative (interval or ratio). Some cases may have more than one
interpretation, so briefly indicate your reasoning if you think there may be some ambiguity.
Example: Age in years. Answer: Discrete, quantitative, ratio
(a) Time in terms of AM or PM.
(b) Brightness as measured by a light meter.
(c) Brightness as measured by people’s judgments.
(d) Angles as measured in degrees between 0 and 360.
(e) Bronze, Silver, and Gold medals as awarded at the Olympics.
(f) Height above sea level.
(g) Number of patients in a hospital.
(h) ISBN numbers for books. (Look up the format on the Web.)
(i) Ability to pass light in terms of the following values: opaque, translucent, transparent.
(j) Military rank.
(k) Distance from the center of campus.
(l) Density of a substance in grams per cubic centimeter.
(m) Coat check number. (When you attend an event, you can often give your coat to someone who, in turn, gives
you a number that you can use to claim your coat when you leave.)
Exercise 2 (20 points): The Jaccard similarity between two sets X and Y is defined as:
$$\mathrm{JSim}(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}.$$
The Jaccard distance between sets X and Y is defined as:
$$\mathrm{JDist}(X, Y) = 1 - \mathrm{JSim}(X, Y).$$
Prove or disprove that the JDist function is a metric.
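The two definitions above can be tried out numerically before attempting the proof. The following is a minimal sketch, not a proof: the function names `jsim` and `jdist`, the example sets, and the convention for two empty sets are assumptions made for illustration only.

```python
def jsim(x: set, y: set) -> float:
    """Jaccard similarity |X ∩ Y| / |X ∪ Y|."""
    union = x | y
    if not union:
        # Assumption: two empty sets are treated as identical (similarity 1),
        # since the definition is 0/0 in that case.
        return 1.0
    return len(x & y) / len(union)


def jdist(x: set, y: set) -> float:
    """Jaccard distance 1 - JSim(X, Y)."""
    return 1.0 - jsim(x, y)


# Hypothetical example sets, chosen only to exercise the definitions.
a = {1, 2, 3}
b = {2, 3, 4}
c = {3, 4, 5}

print(jdist(a, a))                # identity: distance of a set to itself
print(jdist(a, b), jdist(b, a))   # symmetry
print(jdist(a, c) <= jdist(a, b) + jdist(b, c))  # one triangle-inequality instance
```

A spot check on a few sets can reveal a counterexample if one exists, but it cannot establish that a property holds for all sets; the exercise still requires a general argument.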
Exercise 3 (20 points): Consider a set of n points $X = \{x_1, \dots, x_n\}$ in some d-dimensional space, and distance
function $d(x_i, x_j) = L_2^2(x_i, x_j)$. Let $\bar{x}$ be the d-dimensional vector that is the mean of all the vectors in X. Prove
that $\bar{x}$ minimizes $\sum_{x_i \in X} d(x_i, \bar{x})$, i.e., that the mean is the representative for distance function d().
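The claim can be sanity-checked numerically before proving it. The sketch below is only an empirical check under stated assumptions: the helper names `sum_sq_dist` and `mean_vector`, the randomly generated points, and the perturbation range are all hypothetical and not part of the assignment.

```python
import random


def sum_sq_dist(points, c):
    """Sum over all points of the squared Euclidean distance to candidate c."""
    return sum(sum((p_k - c_k) ** 2 for p_k, c_k in zip(p, c)) for p in points)


def mean_vector(points):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]


random.seed(0)
# Hypothetical data: 20 random points in 3 dimensions.
points = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(20)]

x_bar = mean_vector(points)
cost_at_mean = sum_sq_dist(points, x_bar)

# Compare against 1000 randomly perturbed candidates near the mean.
never_beaten = all(
    sum_sq_dist(points, [m + random.uniform(-0.5, 0.5) for m in x_bar])
    >= cost_at_mean - 1e-12  # small tolerance for floating-point rounding
    for _ in range(1000)
)
print(never_beaten)
```

Observing that no perturbed candidate does better is consistent with the claim but does not prove it; the proof has to show the inequality for every candidate vector.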
Exercise 4 (40 points):
1. [20 points]
Assume there are four one-dimensional data points X = {1, 2, 4, 5}. Use the Euclidean distance metric to classify them
into two groups G = {G1, G2} such that the within-group variance (sum of squares) is minimized. Formally, the
objective is to find:
$$\arg\min_{G} \sum_{i=1}^{2} \sum_{x \in G_i} \|x - u_i\|_2^2$$
where $u_i$ is the mean of all $x \in G_i$. Please provide details of your solution.
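With only four points, the objective can be evaluated for every possible split into two non-empty groups. The sketch below takes that exhaustive approach as one way to check a hand-derived answer; it is not necessarily the intended solution method, and the function names `within_ss` and `best_two_group_split` are hypothetical.

```python
from itertools import combinations


def within_ss(group):
    """Sum of squared deviations of a 1-D group from its own mean."""
    mean = sum(group) / len(group)
    return sum((x - mean) ** 2 for x in group)


def best_two_group_split(points):
    """Try every split into two non-empty groups and keep the cheapest one."""
    best = None
    indices = range(len(points))
    for size in range(1, len(points)):
        for chosen in combinations(indices, size):
            g1 = [points[i] for i in chosen]
            g2 = [points[i] for i in indices if i not in chosen]
            cost = within_ss(g1) + within_ss(g2)
            # Strict comparison keeps the first split found among ties
            # (each split also appears once more with G1 and G2 swapped).
            if best is None or cost < best[0]:
                best = (cost, g1, g2)
    return best


cost, g1, g2 = best_two_group_split([1, 2, 4, 5])
print(cost, g1, g2)
```

Exhaustive enumeration grows exponentially with the number of points, so it is practical only for tiny inputs like this one; a written solution should still explain why the chosen split is optimal.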
2. [10 points]
Assume there are four two-dimensional data points X = {(1, 2), (3, 5), (4, 6), (8, 9)}. Use the Euclidean distance metric
to classify them into two groups G = {G1, G2}. Please provide details of your solution.
3. [10 points]
Assume G1 = {(1, 2), (3, 5)} and G2 = {(4, 6), (8, 9)}. Use the correlation coefficient to determine the relationship
between the two sets G1 and G2 (i.e., calculate corr(G1, G2)). Is corr(G1, G2) equal to corr(G2, G1)?
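The exercise does not say how a set of two-dimensional points should be turned into the inputs of a correlation coefficient, so any computation has to fix an interpretation first. The sketch below assumes the Pearson correlation coefficient and, purely as an illustrative assumption, flattens each group into one vector of coordinates; the function name `pearson` and the vectors `u` and `v` are hypothetical, and other readings of corr(G1, G2) are possible.

```python
from math import sqrt


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(u) != len(v):
        raise ValueError("sequences must have the same length")
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)


# Assumption: each group is flattened into a single coordinate vector.
u = [1, 2, 3, 5]   # G1 = {(1, 2), (3, 5)} flattened
v = [4, 6, 8, 9]   # G2 = {(4, 6), (8, 9)} flattened

print(pearson(u, v))
print(pearson(v, u))
```

Comparing the two printed values gives a numerical hint about the symmetry question, but the written answer should justify it from the formula itself and state which interpretation of corr(G1, G2) was used.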
Note∗: You could think about how to extend your solution to the case of n-dimensional (n > 2) sample points
and k > 2 groups.