CS280: Elements of Data Processing
Homework 1: Due by 23:59 Sunday, 2nd July
Instructor: Quan Li
2023 SJTU LEC International Summer Sessions
June 30, 2023
Q1. (Getting to Know Data)
A. For positively skewed data, is the mean always larger than the median, and the median always
larger than the mode? If so, briefly explain why. If not, give a counterexample.
(10 marks)
B. Consider a data set that contains an outlier. Which of the following displays guarantee that a
reader can tell there is an outlier: boxplot, histogram, quantile plot, scatter plot?
Briefly explain your answer. (10 marks)
C. Given two data points A(2,5,3) and B(1,4,5), compute the distance from A to B under each of
the following measures: Manhattan distance and Euclidean distance. (10 marks)
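As a reminder of the two definitions, the following minimal sketch computes both measures for a pair of points. The points P and Q are hypothetical and are not the ones in the question; the function names are illustrative only.

```python
import math

def manhattan(p, q):
    # Sum of absolute coordinate differences (L1 distance).
    return sum(abs(a - b) for a, b in zip(p, q))

def euclidean(p, q):
    # Square root of the sum of squared coordinate differences (L2 distance).
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Hypothetical points, chosen only to illustrate the formulas.
P = (0, 0, 0)
Q = (1, 2, 2)
print(manhattan(P, Q))  # 5
print(euclidean(P, Q))  # 3.0
```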
Q2. (Frequent Pattern Mining)
Suppose there are 9 items: 1, 2, 3, ..., 9. The transactions are:
TID   Itemset
1     1,2,3,4,5,6
2     7,2,3,4,5,6
3     1,8,4,5
4     1,9,4,6
5     9,2,2,4,5
The minimum support (minsup) threshold is 3.
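To make the threshold concrete, here is a minimal sketch of support counting. It assumes minsup is an absolute count of transactions (not a fraction), which is an interpretation rather than something stated above. The toy transactions and the helper name support_count are hypothetical and unrelated to the table.

```python
def support_count(itemset, transactions):
    # Number of transactions that contain every item of `itemset`.
    target = set(itemset)
    return sum(1 for t in transactions if target <= set(t))

# Hypothetical toy data, not the homework table.
toy = [["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"]]
minsup = 2  # assumed to be an absolute count

print(support_count(["a", "b"], toy))                 # 2
print(support_count(["a", "b", "c"], toy) >= minsup)  # False
```

An itemset is frequent when its support count is at least minsup.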
1) Use the Apriori algorithm to find all frequent itemsets. (30 marks)
2) List all closed frequent itemsets and all maximal frequent itemsets. (10 marks)
3) Use FP-growth to find all frequent patterns again, showing each step. Compare the
efficiency of the two mining processes (FP-tree and Apriori algorithm). (30 marks)
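For checking hand-worked results on parts 1) and 2), the level-wise search and the closed/maximal definitions can be sketched as below. This is one simple rendering of Apriori (join, prune, count), not an optimized or official implementation. It assumes minsup is an absolute count, and the toy transactions and function names are hypothetical.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {frozenset: support count} for every frequent itemset."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= minsup}
    frequent = dict(current)
    k = 2
    while current:
        prev = set(current)
        # Join step: unions of two frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Count step: one scan of the transactions per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {s: c for s, c in counts.items() if c >= minsup}
        frequent.update(current)
        k += 1
    return frequent

def closed_and_maximal(frequent):
    """Split frequent itemsets into closed and maximal ones."""
    closed, maximal = set(), set()
    for s, sup in frequent.items():
        supersets = [t for t in frequent if s < t]  # frequent proper supersets
        if not supersets:
            maximal.add(s)  # no frequent proper superset
        if all(frequent[t] < sup for t in supersets):
            closed.add(s)   # no proper superset with the same support
    return closed, maximal

# Hypothetical toy data, not the homework table.
toy = [{"a", "b", "c"}, {"a", "b"}, {"a", "b"}, {"c"}, {"a"}]
freq = apriori(toy, minsup=2)
closed, maximal = closed_and_maximal(freq)
print(sorted((sorted(s), c) for s, c in freq.items()))
print(sorted(sorted(s) for s in closed))
print(sorted(sorted(s) for s in maximal))
```

On the toy data, {b} is frequent but not closed because {a,b} has the same support, while {a} is closed but not maximal. Checking only frequent supersets is sufficient for closedness, since any superset with equal support is itself frequent.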