CSE 347/447 Data Mining: Homework 1

Due at 11:59 PM, February 21, 2021

Standard and General Requirements

• Work on it by yourself, and try to write as perfect a solution as you can. Discussion is allowed at the level of technical conversation only; if you do discuss the homework with others, give proper acknowledgments. Students are expected to abide by the Lehigh Academic Integrity Policy.

• Typed solutions are encouraged, especially if your handwriting is messy. However, there will be no

extra marks for typed answers.

• Partial credit will be given for partial solutions, but not for long off-topic discussion that leads nowhere.

Overall, think before you write, and try to give concise and crisp answers.

• Late policy: You can be at most 3 days late; for every late day you lose 10% of your grade, unless some other arrangement is agreed to before the due date.

• Submission: Please submit your answers as a PDF file to CourseSite.

Exercise 1 (20 points): Classify the following attributes as binary, discrete, or continuous. Also classify them as qualitative (nominal or ordinal) or quantitative (interval or ratio). Some cases may have more than one interpretation, so briefly indicate your reasoning if you think there may be some ambiguity.

Example: Age in years. Answer: Discrete, quantitative, ratio

(a) Time in terms of AM or PM.

(b) Brightness as measured by a light meter.

(c) Brightness as measured by people’s judgments.

(d) Angles as measured in degrees between 0 and 360.

(e) Bronze, Silver, and Gold medals as awarded at the Olympics.

(f) Height above sea level.

(g) Number of patients in a hospital.

(h) ISBN numbers for books. (Look up the format on the Web.)

(i) Ability to pass light in terms of the following values: opaque, translucent, transparent.

(j) Military rank.

(k) Distance from the center of campus.

(l) Density of a substance in grams per cubic centimeter.

(m) Coat check number. (When you attend an event, you can often give your coat to someone who, in turn, gives

you a number that you can use to claim your coat when you leave.)

Exercise 2 (20 points): The Jaccard similarity between two sets X and Y is defined as:

\[ \mathrm{JSim}(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}. \]

The Jaccard distance between sets X and Y is defined as:

\[ \mathrm{JDist}(X, Y) = 1 - \mathrm{JSim}(X, Y). \]


Prove or disprove that the JDist function is a metric.
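Before attempting a proof, it can help to probe the triangle inequality numerically. The sketch below is only a sanity check on random subsets of a small universe, not a proof. The helper names (`jsim`, `jdist`, `find_triangle_violation`) are illustrative, and treating two empty sets as having similarity 1 is an assumption, since the definition above leaves that case undefined.

```python
import itertools
import random


def jsim(x, y):
    """Jaccard similarity |X ∩ Y| / |X ∪ Y|."""
    union = x | y
    if not union:
        # Assumption: two empty sets are treated as identical (similarity 1).
        return 1.0
    return len(x & y) / len(union)


def jdist(x, y):
    """Jaccard distance 1 - JSim(X, Y)."""
    return 1.0 - jsim(x, y)


def find_triangle_violation(sets, tol=1e-12):
    """Return (a, b, c) with JDist(a, c) > JDist(a, b) + JDist(b, c), or None."""
    for a, b, c in itertools.permutations(sets, 3):
        if jdist(a, c) > jdist(a, b) + jdist(b, c) + tol:
            return a, b, c
    return None


# Random subsets of a 6-element universe (fixed seed for repeatability).
rng = random.Random(0)
universe = range(6)
samples = [frozenset(e for e in universe if rng.random() < 0.5) for _ in range(30)]

# A returned triple would be a counterexample; None means none was found here.
print(find_triangle_violation(samples))
```

A search that finds no counterexample only suggests which direction to argue; the remaining metric axioms (non-negativity, identity, symmetry) still need to be checked by hand.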

Exercise 3 (20 points): Consider a set of n points $X = \{x_1, \cdots, x_n\}$ in some d-dimensional space, and distance function $d(x_i, x_j) = L_2^2(x_i, x_j)$. Let $\bar{x}$ be the d-dimensional vector that is the mean of all the vectors in X. Prove that $\bar{x}$ minimizes $\sum_{x_i \in X} d(x_i, \bar{x})$, i.e., that the mean is the representative for distance function d().
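The claim can be checked numerically before proving it. The sketch below assumes $L_2^2$ denotes the squared Euclidean distance and uses randomly generated points; the function names are illustrative. It compares the cost at the mean against randomly perturbed candidate centers and also checks the decomposition cost(c) = cost(mean) + n * ||mean - c||^2, which is one possible route to a proof.

```python
import random


def mean_vector(points):
    """Component-wise mean of a list of equal-length tuples."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def sq_dist(p, q):
    """Squared Euclidean (L2) distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def total_cost(points, center):
    """Sum of squared distances from every point to `center`."""
    return sum(sq_dist(p, center) for p in points)


# 20 random points in 3 dimensions (fixed seed for repeatability).
rng = random.Random(0)
points = [tuple(rng.uniform(-5, 5) for _ in range(3)) for _ in range(20)]
xbar = mean_vector(points)
best = total_cost(points, xbar)

for _ in range(1000):
    candidate = tuple(c + rng.gauss(0, 1) for c in xbar)
    cost = total_cost(points, candidate)
    # Decomposition: cost(c) = cost(xbar) + n * ||xbar - c||^2
    expected = best + len(points) * sq_dist(xbar, candidate)
    assert abs(cost - expected) < 1e-6
    # No candidate should beat the mean (small tolerance for rounding).
    assert cost >= best - 1e-9
```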

Exercise 4 (40 points):

1. [20 points]

Assume there are four one-dimensional data points X = {1, 2, 4, 5}. Use the Euclidean distance metric to classify them into two groups G = {G1, G2} such that the within-group variance (sum of squares) is minimized. Formally, the objective is to find:

\[ \arg\min_{G} \sum_{i=1}^{2} \sum_{x \in G_i} \| x - u_i \|_2^2 \]

where $u_i$ is the mean of all $x \in G_i$. Please provide details of your solution.
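For one-dimensional data, the objective can be evaluated directly. The sketch below relies on the observation that, in one dimension, each group in an optimal solution is a contiguous run of the sorted values, so only the n-1 split points need to be scanned. The function names `sse` and `best_two_way_split` are illustrative.

```python
def sse(group):
    """Within-group sum of squared deviations from the group mean."""
    mu = sum(group) / len(group)
    return sum((x - mu) ** 2 for x in group)


def best_two_way_split(values):
    """Scan the n-1 contiguous splits of the sorted values; return (cost, G1, G2)."""
    ordered = sorted(values)
    candidates = []
    for cut in range(1, len(ordered)):
        g1, g2 = ordered[:cut], ordered[cut:]
        candidates.append((sse(g1) + sse(g2), g1, g2))
    return min(candidates, key=lambda c: c[0])


cost, g1, g2 = best_two_way_split([1, 2, 4, 5])
print(cost, g1, g2)  # 1.0 [1, 2] [4, 5]
```

A written solution should still justify why the other splits are worse, for example by listing their sums of squares.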

2. [10 points]

Assume there are four two-dimensional data points X = {(1, 2), (3, 5), (4, 6), (8, 9)}. Use the Euclidean distance metric to classify them into two groups G = {G1, G2}. Please provide details of your solution.
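With only four points, every two-group split can be enumerated. This part does not restate the objective, so the sketch below assumes the same within-group sum of squares as in part 1; a different criterion would rank the splits differently. It also assumes the points are distinct, and the function names are illustrative.

```python
from itertools import combinations


def centroid(group):
    """Component-wise mean of a non-empty list of tuples."""
    return tuple(sum(c) / len(group) for c in zip(*group))


def within_ss(group):
    """Sum of squared Euclidean distances from each point to the group centroid."""
    mu = centroid(group)
    return sum(sum((a - b) ** 2 for a, b in zip(p, mu)) for p in group)


def all_two_way_partitions(points):
    """Yield every split into two non-empty groups, each unordered pair once."""
    first, rest = points[0], points[1:]
    # Fix the first point in G1 so (G1, G2) and (G2, G1) are not both generated.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            g1 = [first, *extra]
            g2 = [p for p in rest if p not in extra]  # assumes distinct points
            yield g1, g2


def split_cost(split):
    """Total within-group sum of squares of a (G1, G2) pair."""
    return within_ss(split[0]) + within_ss(split[1])


points = [(1, 2), (3, 5), (4, 6), (8, 9)]
# Print all splits from lowest to highest cost.
for g1, g2 in sorted(all_two_way_partitions(points), key=split_cost):
    print(round(split_cost((g1, g2)), 3), g1, g2)
```

Enumeration grows as 2^(n-1) - 1 splits, so it is practical only for very small n.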

3. [10 points]

Assume G1 = {(1, 2), (3, 5)} and G2 = {(4, 6), (8, 9)}. Use the correlation coefficient to determine the relationship between the two sets G1 and G2 (i.e., calculate corr(G1, G2)). Is corr(G1, G2) equal to corr(G2, G1)?
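The exercise does not say how a set of points becomes a vector for the correlation. The sketch below takes one possible reading, which is an assumption: flatten each group into a single coordinate vector, (1, 2, 3, 5) and (4, 6, 8, 9), and compute the Pearson correlation coefficient. The function name `pearson` is illustrative.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(u)
    mu_u, mu_v = sum(u) / n, sum(v) / n
    cov = sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v))
    var_u = sum((a - mu_u) ** 2 for a in u)
    var_v = sum((b - mu_v) ** 2 for b in v)
    return cov / math.sqrt(var_u * var_v)


# Assumption: flatten each group's points into one coordinate vector.
g1 = [c for point in [(1, 2), (3, 5)] for c in point]  # [1, 2, 3, 5]
g2 = [c for point in [(4, 6), (8, 9)] for c in point]  # [4, 6, 8, 9]

print(pearson(g1, g2), pearson(g2, g1))
```

Swapping the arguments only exchanges the roles of the two sequences in a formula that treats them identically, which is the point the last question probes.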

Note∗: You could think about how to extend your solution to the case of n-dimensional (n > 2) sample points

and k > 2 groups.
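One way to approach the general case is Lloyd's iteration (the usual k-means heuristic), which works for points of any dimension and any number of groups. The sketch below is a minimal version under stated assumptions: the initial centers are supplied by the caller, ties go to the lowest-index center, and an empty group keeps its old center. It converges to a local optimum of the sum-of-squares objective, which need not be the global one.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(group):
    """Component-wise mean of a non-empty list of tuples."""
    return tuple(sum(c) / len(group) for c in zip(*group))


def lloyd(points, centers, max_iter=100):
    """Alternate assignment and centroid update until the centers stop changing."""
    centers = [tuple(c) for c in centers]
    groups = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: sq_dist(p, centers[i]))
            groups[nearest].append(p)
        # Update step: keep the old center if a group is empty (a simplification).
        new_centers = [centroid(g) if g else c for g, c in zip(groups, centers)]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, groups


# The one-dimensional data from part 1, written as 1-tuples, with k = 2.
centers, groups = lloyd([(1,), (2,), (4,), (5,)], centers=[(1,), (5,)])
print(centers, groups)
```

Because the result depends on the starting centers, a common practice is to run the iteration from several initializations and keep the grouping with the lowest total sum of squares.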
