程序代写案例-COMP20008

时间：2022-06-05

COMP20008
1

Element of Data processing-期末班 4
Echo
Machine Learning
Discretization
1. 利用 domain knowledge 来处理
Car speed:
0-40: slow
40-60: medium
>60: high
2. Equal-length bin
每一个 bin之间的长度都是相等的
Divide the range of the continuous feature into equal length intervals (bins). If speed
ranges from 0-100, then the 10 bins are [0,10), [10,20), [20,30), …[90,100]
3. Equal frequency bin
每一个 bin中所含数据的数量是相等的
Divide range of continuous feature into equal frequency intervals (bins). Sort the values
and divide so that each bin has same number of objects

COMP20008
2

Regression (Simple Linear Regression Model)
 Dependent variable: 需要被预测的值，the variable we wish to predict or explain（y）
 Independent variable: 用来预测的值 the variable used to explain the dependent
variable（X）
Regression的作用：
 根据一个 independent variable变化的值来预测 dependent variable的变化
Predict the value of a dependent variable based on the value of at least one independent
variable。
 解释 dependent variable变化对 independent variable的影响
Explain the impact of changes in an independent variable on the dependent variable
Regression vs Correlation
Correlation Regression
Indicate the relationship between
independent variable and dependent
variable
Describes how an independent variable is
numerically related to the dependent
variable
X and Y Y and X0 X1 X2
Describe current data Predict future data
Necessarily imply causality Imply causality
Correlation analysis is used to measure
strength/ direction of the relationship
between two variables
Predict future data, find relationship

Simple Linear Regression Model
 只有一个 independent variable X，Only one independent variable, X
 X和 Y之间是线性关系。Relationship between and Y is described by a linear function
 Assume Y是由于 X改变而改变。Changes in Y are assumed to be caused by changes
in X（两者之间有 correlation）
COMP20008
3

Model

i：第 i条 instance
Dependent Variable Yi: 我们需要预测的值
Intercept b0：constant，当 x为 0时，y的值是多少。estimated average value of Y when
the value of X is zero(intercept)
Slope coefficient b1：代表 regression line 的 slope，每增长一个单位的 b1，y会因此改变
多少。b1 is the estimated change in the average value of Y as a result of a one-unit change
in X(slope)
Independent variable Xi: feature
Error term: 估算值与实际值的差。Error term越小则说明 estimate的 y越接近实际值。

COMP20008
4

Least Squares Method
这里我们使用 Least Squares的方法来找出最合适的 b0和 b1

将 error term最小化。（计算不涉及）
Interpolation vs. Extrapolation
Regression模型的另外一个局限就是只能预测 Relevant range for interpolation。
只能预测所给范围内的数据，而超出范围的数据无法保证。

Evaluate Regression Model

COMP20008
5


1. SST = total sum of squares
Measures the variation of the Yi values around their mean Y
所有 response variable的 total variation
2. SSR = regression sum of squares
Explained variation attributable to the relationship between X and Y
可以被模型解释出来的 variation
3. SSE = error sum of squares
Variation attributable to factors other than the relationship between X and Y
无法被模型解析出来的 variation

Coefficient of Determination 2
2 (r-squared) = SSR/ SST , 0 < 2 ≤ 1
代表模型解析出来的 variation占全部 variation的比例.
当2越接近 1，说明我们的模型把 variation解析得越完整。
SSE = 0，all points perfectly fit to our model.

让2越小，则说明我们的模型可以解释出的 variation越小。
Assumptions of Linear Regression
1. Linearity: The underlying relationship between X and Y is linear
假设 x 和 y之间的关系是 linear relationship
COMP20008
6

2. Independence of residuals: Residual values (also known as fitting errors) are
statistically sound and sum to zero (residual 之间需要 independent，有 sum to
zero来判断)
3. Examine for constant variance for all levels of X
检查所有的 variance 是不是都是 constant variance

Residual是每一个 predict value 和 observed value之间的差。
通过 residual的 visualization我们可以 check linear regression的三个 assumption
如果 linear regression模型表现不好需要考虑一下是不是 dataset 不满足 assumption

COMP20008
7

COMP20008
8

COMP20008
9

Classification
Regression VS Classification
y: target variable is continuous (key difference to classification)
classification：用来预测 class label
Regression：用来预测一个 continuous value
Decision Trees
Decision Tree 决策树，树形结构分类器。
Tree Structure (A flow-chart-like tree structure)

根据 feature的不同选择，同一个 dataset可以建造出无数个不同的决策树。

COMP20008
10

如何决定 split nod 和如何评判 best split？
DT适合 discrete data，如果 continuous data需要做额外的 Discretization
对于 DT我们更加倾向于 node结果为 homogeneous class 的节点。

在这门课里我们用 entropy来计算节点的纯度，用 Information Gain来评判 best split。

Gain越大，则说明 split越好。注意，这门课里 information gain和 Mutual information的
计算方法是一样的。
DT Advantages
简单易懂 Easy to interpret
Tree的 data structure在 construct的时候非常有效率。
做 test set时 traversal tree速度很快。
DT Disadvantages
在面对复杂的 data structure/data set的时候这种简单的方法可能并不适用。可能会建造出
来一个非常复杂难以得出结果的 tree。
需要自己决定 Depth
COMP20008
11

COMP20008
12

COMP20008
13

COMP20008
14

KNN
K Nearest Neighbors Classifier, 顾名思义是寻找和 instance最为相近的 instance进行分
类。
Algorithm：
1. 计算每个点与每个点之间的距离，并且记录。Compute distance to other training
records
2.为每个点找出与之最接近的 k个邻居。Identify k nearest neighbors
3.利用 k个邻居的 class label来判断当前 data应该属于哪个 class。（本门课采用majority
voting的方法来 break the tie。）Use class labels of nearest neighbors to determine the
class label of unknown record (e.g., by taking a majority vote)

Knn不像 DT，它没有一个真实的模型，相反它全部的 instance就是他的模型。
用 KNN时，我们每次拿到一个新的 test set时，就要将他和全部 instance进行距离计算，
找到最近的 n个 instance并进行分类。因此每次都需要重新计算，花费时间很大。
K的选取
K的选取也很重要，通常情况我们倾向于使用一个奇数作为 k。
如果 K太大，则很容易收到 noisy point的影响
如果 K太大，容易把其他 class的 instance也包含进来。（极端情况：0R）
在遇到 tie的情况时采用Majority vote或者Weight the vote的方法

COMP20008
15

COMP20008
16

COMP20008
17

Clustering
K-means
The best-known clustering method
Algorithm分为四个 step
1. Randomly select k seed points as the initial cluster centroids，随机选择初始点作为
centroids
2. Assign each object to the cluster with the nearest centroid 通过计算公式计算每一个点
和每一个 centroid之间的距离，确认出每一个点应该属于那个 centroid所在从 cluster
3. Compute the centroids of the clusters of the current partitioning (the centroid is the
center, i.e., mean point, of the cluster) 算出每一个 cluster的新的 centroid，
4. Go back to Step 2, stop when the assignment does not change。重复 step2，直到
centroid的位置不再更新，每个点的 cluster也不再改变。
注意：
由于我们每次选择的初始点为随机选择，无法保证每次 clustering的结果都是一样的。
默认选择使用 Euclidean distance，也可以选择其他距离公式
一定会 converge 到 local optimum。
Cluster-based outlier detection
思路: An outlier is expected to be far away from any groups of normal objects，outlier应当
是距离每一个 group 都很远。我们这里将每一个 point与其 cluster centroid的距离用源泉
的大小来表示，圆圈越大代表距离 centroid越远。

COMP20008
18

Visual Assessment of Clustering Tendency (VAT)
用 Heat map 通过 (dis)similarity matrix 来做 cluster的 visualization
通常情况下，dissimilarity matrix是乱序的，我们无法从这样的 heat map中找到对应的
clusters，因为 reorder显得格外重要。
VAT algorithm 在面对 complex shaped datasets (either significant overlap or irregular
geometries between different clusters), 表现欠佳。
这也是很多 clustering method 都面临的问题，肉眼可以算出的分类机器却不能。

COMP20008
19

COMP20008
20

COMP20008
21

Hierarchical Clustering
Agglomerative(聚集) hierarchical clustering method从单个点开始，每一步找到
与之最接近的 pair进行 merge
在每次选择 closest cluster的时候我们有两种方法
1. single linkage： minimum distance between the clusters,
2. complete linkage： maximum distance between the clusters,

Strength of MIN: Single linkage: can handle non-similar shapes

Limitations of MIN: Single linkage: sensitive to noise and outliers

COMP20008
22

Strength of MAX: Complete linkage: Less susceptible to noise and outliers

Limitations of Max: Complete linkage: Tends to break large clusters

Divisive（分裂）
从全部 instance开始，持续不停地分出 cluster直到每个 cluster只含有一个点
Start with one, all-inclusive cluster
At each step, split a cluster until each cluster contains a point (or there are k clusters)

Evaluation
一组好的 Clusters应该满足，Cluster内 datapoint的距离很小，而 Clusters之间的距离很
大。这里我们用 sse来进行 evaluation。
COMP20008
23

SSE (sum of squared errors). is the sum of distances of objects from their cluster centroids
SSE-elbow method

COMP20008
24

COMP20008
25

COMP20008
26

COMP20008
27