CITY UNIVERSITY OF HONG KONG
DEPARTMENT OF MANAGEMENT SCIENCES
MS4252 Big Data Analytics
2018-2019 Semester B
Test
Date: 2018 Mar 9 Time allowed: 2.5 hours
Name:
Student ID:
EID (Login ID):
Seat No:
• Write down the answer in the space provided below each question
• There are 9 questions and a total of 9 pages in this paper
• Answer ALL questions
• Show detailed calculations; NO marks will be given for the final answer alone
• Give answers to 5 decimal places
• The total mark is 100%
Grade:
MS4252 Big Data Analytics, 2018-2019 Semester B Test
Department of Management Sciences, City University of Hong Kong 2
Patterson Tam
Question 1 (22 marks)
(a) Perform the MapReduce Word Count process with shuffle and sort for the following texts:
<231, It is a beautiful dinosaur. Where it is?>
• Show the steps clearly; (4 marks)
<231, It is a beautiful dinosaur. Where it is?>
map
< It , 1 > < is , 1 > < a , 1 > < beautiful , 1 > < dinosaur , 1 > < Where , 1 > < it , 1 > < is , 1 >
Shuffle and sort
< It , (1,1) > < is , (1,1) > < a , 1 > < beautiful , 1 > < dinosaur , 1 > < Where , 1 >
Reduce
< It , 2 >
< is , 2 >
< a , 1 >
< beautiful , 1 >
< dinosaur , 1 >
< Where , 1 >
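
A small Python sketch (not part of the model answer) that mirrors the map, shuffle-and-sort and reduce steps above; lower-casing and punctuation stripping are assumptions made so that "It" and "it" end up under one key, as in the answer.

```python
# A minimal pure-Python simulation of the MapReduce word-count flow:
# map -> shuffle & sort -> reduce.
from collections import defaultdict
import re

def map_phase(doc_id, text):
    """Emit a <word, 1> pair for every token (doc_id 231 is not needed for counting)."""
    for word in re.findall(r"[a-z]+", text.lower()):
        yield (word, 1)

def shuffle_and_sort(pairs):
    """Group the values of identical keys together, e.g. <it, (1, 1)>."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted(grouped.items())

def reduce_phase(grouped):
    """Sum the grouped counts to get <word, total>."""
    return [(key, sum(values)) for key, values in grouped]

pairs = list(map_phase(231, "It is a beautiful dinosaur. Where it is?"))
print(reduce_phase(shuffle_and_sort(pairs)))
# [('a', 1), ('beautiful', 1), ('dinosaur', 1), ('is', 2), ('it', 2), ('where', 1)]
```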
(b) What is the advantage of using a combiner in MapReduce? (2 marks)
It reduces the amount of intermediate map output, which saves storage space, network transfer, and processing time.
(c) Given that the servers have enough space and memory, is it possible to do big data analytics without Hadoop
and MapReduce? What is/are the disadvantage(s) of not using Hadoop and MapReduce? (3 marks)
Yes, it is possible to work without Hadoop and MapReduce.
Hadoop and MapReduce help to process the data faster.
Without Hadoop and MapReduce, the task takes a longer time to complete.
(d) In which step of data parsing can MapReduce help a lot? (1 mark)
Identifying tokens and words (tokenization)
(e) There is a book about cats. How do you handle the word "Cat" in data parsing? Why? (3 marks)
Use a stop list to remove it,
because "Cat" appears throughout a book about cats and therefore provides little information about the topic.
(f) Given a list of words/phrases, which specified data parsing steps (Normalization, Synonym, and Entity)
is/are used to identify each word/phrase? (9 marks)
Word/Phrase Data Parsing Step(s)
{ CityU } Synonym and Entity
{ Holiday , Vacation } Synonym
{ Broke } Normalization
{ Department of Management Sciences } Entity
{ He is such a pain } Synonym
{ localize } Normalization
{ Part of Speech } Synonym
{ Hong Kong Dollar } Entity
Question 2: Given 2 documents and a search expression (18 marks)
Document 1:
Data analysis is a process of inspecting, cleansing, transforming, and modelling data with the goal of
discovering useful information, informing conclusions, and supporting decision-making. Data analysis has
multiple facets and approaches, encompassing diverse techniques under a variety of names while being used
in different business, science, and social science domains.
Document 2:
Data science is a "concept to unify statistics, data analysis, machine learning and their related methods" in
order to "understand and analyze actual phenomena" with data. It employs techniques and theories drawn
from many fields within the context of mathematics, statistics, information science, and computer science.
https://en.wikipedia.org/wiki/Main_Page
Search Expression:
Data science
a) Calculate the frequency weights using Binary for the search expression (2 marks)
Term Document 1 Document 2
Data 1 1
Science 1 1
b) Calculate the scores for the search expression with the weight scheme log-entropy. The initial weight for each
search term is 0.3. (12 marks)
Search term "Data":
Query TF (log) = 1, query weight = 0.3
Global frequency g = 6 (3 occurrences in Doc 1, 3 in Doc 2)
Entropy weight = 1 + [ (3/6)·log2(3/6) + (3/6)·log2(3/6) ] / log2(2) = 0
Doc 1: TF (log) = log2(3 + 1) = 2, Log-Entropy = 2 × 0 = 0
Doc 2: TF (log) = log2(3 + 1) = 2, Log-Entropy = 2 × 0 = 0

Search term "Science":
Query TF (log) = 1, query weight = 0.3
Global frequency g = 5 (2 occurrences in Doc 1, 3 in Doc 2)
Entropy weight = 1 + [ (2/5)·log2(2/5) + (3/5)·log2(3/5) ] / log2(2) = 0.02905
Doc 1: TF (log) = log2(2 + 1) = 1.58496, Log-Entropy = 1.58496 × 0.02905 = 0.04604
Doc 2: TF (log) = log2(3 + 1) = 2, Log-Entropy = 2 × 0.02905 = 0.0581

score(Q, D1) = 0.3 × 0 + 0.3 × 0.04604 = 0.013812
score(Q, D2) = 0.3 × 0 + 0.3 × 0.0581 = 0.01743
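
A sketch (not from the paper) of the log-entropy scoring above; the raw term counts (Data: 3 and 3, Science: 2 and 3) are read off the two documents, and the weighting formulas follow the calculation above.

```python
# Log-entropy weighting and query scoring, assuming:
#   log frequency   = log2(f + 1)
#   entropy weight  = 1 + sum_j (p_ij * log2(p_ij)) / log2(n),  p_ij = f_ij / g_i
#   score(Q, D)     = sum_i (query weight) * (log-entropy weight of term i in D)
import math

counts = {"data": [3, 3], "science": [2, 3]}   # raw frequency in Doc 1 and Doc 2
n_docs = 2
query_weight = 0.3                             # initial weight of each search term

def entropy_weight(freqs):
    total = sum(freqs)
    return 1 + sum((f / total) * math.log2(f / total) for f in freqs if f) / math.log2(n_docs)

scores = [0.0] * n_docs
for term, freqs in counts.items():
    g = entropy_weight(freqs)
    for j, f in enumerate(freqs):
        log_entropy = math.log2(f + 1) * g     # TF(log) x entropy weight
        scores[j] += query_weight * log_entropy

print(scores)   # roughly [0.01381, 0.01743], matching score(Q, D1) and score(Q, D2)
```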
c) Comment on the part (b) result (Circle your answer) (1 mark)
Document 1 / Document 2 is a better match with the query
d) What is/are the advantage(s) and disadvantage(s) of using Term Frequency (Binary)? (3 marks)
Advantage: it smooths out the big differences among terms; only the existence of a term matters.
Disadvantage: it discounts the importance of frequently occurring terms.
Question 3 (19 marks)
Given 2 data points: { 15, 19 }
a) Use Expectation-Maximization (EM) to classify the data points into two groups, group 1 and group 2, and
calculate the mean and standard deviation of each cluster. Groups 1 and 2 initially follow N(3, 100) and N(5, 64)
respectively. The initial probability for group 1 (G1) is 0.4. (16 marks)
E-step:
p(X=15 | G1) = 1/√(2π·10²) × exp( -(15 - 3)² / (2·10²) ) = 0.01942
p(X=19 | G1) = 1/√(2π·10²) × exp( -(19 - 3)² / (2·10²) ) = 0.01109
p(X=15 | G2) = 1/√(2π·8²) × exp( -(15 - 5)² / (2·8²) ) = 0.02283
p(X=19 | G2) = 1/√(2π·8²) × exp( -(19 - 5)² / (2·8²) ) = 0.01078
p(G1 | X=15) = (0.01942 × 0.4) / (0.01942 × 0.4 + 0.02283 × 0.6) = 0.36187
p(G1 | X=19) = (0.01109 × 0.4) / (0.01109 × 0.4 + 0.01078 × 0.6) = 0.40727
p(G2 | X=15) = 1 - 0.36187 = 0.63813
p(G2 | X=19) = 1 - 0.40727 = 0.59273

M-step:
G1:
Mean = (15 × 0.36187 + 19 × 0.40727) / (0.36187 + 0.40727) = 17.11805
Standard deviation = √[ (0.36187 × (15 - 17.11805)² + 0.40727 × (19 - 17.11805)²) / (0.36187 + 0.40727) ] = 1.9965
Probability = (0.36187 + 0.40727) / 2 = 0.38457
G2:
Mean = (15 × 0.63813 + 19 × 0.59273) / (0.63813 + 0.59273) = 16.92623
Standard deviation = √[ (0.63813 × (15 - 16.92623)² + 0.59273 × (19 - 16.92623)²) / (0.63813 + 0.59273) ] = 1.9986
Probability = 1 - 0.38457 = 0.61543
Cluster   Items (data points)
G1        (none)
G2        15, 19
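
A sketch (not part of the model answer) of the single EM iteration above: E-step responsibilities, then M-step updates. The results match the table up to small rounding differences, since the table's intermediate values are rounded to 5 decimal places.

```python
# One EM iteration for a two-component Gaussian mixture, with the question's
# initial parameters: G1 ~ N(3, 100), G2 ~ N(5, 64), P(G1) = 0.4.
import math

data = [15, 19]
params = {"G1": {"mean": 3, "sd": 10, "p": 0.4},
          "G2": {"mean": 5, "sd": 8,  "p": 0.6}}

def normal_pdf(x, mean, sd):
    return math.exp(-(x - mean) ** 2 / (2 * sd ** 2)) / math.sqrt(2 * math.pi * sd ** 2)

# E-step: responsibility p(G | x) for each data point
resp = {}
for x in data:
    weighted = {g: normal_pdf(x, q["mean"], q["sd"]) * q["p"] for g, q in params.items()}
    total = sum(weighted.values())
    resp[x] = {g: w / total for g, w in weighted.items()}

# M-step: weighted mean, standard deviation and mixing probability for each group
for g in params:
    weights = [resp[x][g] for x in data]
    w_sum = sum(weights)
    mean = sum(w * x for w, x in zip(weights, data)) / w_sum
    sd = math.sqrt(sum(w * (x - mean) ** 2 for w, x in zip(weights, data)) / w_sum)
    print(g, round(mean, 5), round(sd, 5), round(w_sum / len(data), 5))
# G1: mean ~ 17.12, sd ~ 2.0, probability ~ 0.38
# G2: mean ~ 16.93, sd ~ 2.0, probability ~ 0.62
```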
b) Do you agree that "the probabilities of the clusters for the last two steps of the converged EM model must
be equal"? Why? (3 marks)
No, the model tries to maximize the likelihood, not the cluster probabilities. There may still be small deviations
in the cluster probabilities between the last two steps.
Question 4 (6 marks)
(a) The following shows a distance matrix (Table 1) with the centroid distances of 3 types of biscuit: A, B,
and C.

Step 1    A    B    C
A         0
B         3    0
C         4    5    0

Step 2    A, B     C
A, B      0
C         4.272    0
Table 1: (Centroid) Distance matrix
(i) Perform Ward cluster analysis: (5 marks)

Step 1    A                   B                   C
A         0
B         3² / (1+1) = 4.5    0
C         4² / (1+1) = 8      5² / (1+1) = 12.5   0

A and B have the smallest Ward distance, so they are merged first.

Step 2    A & B                       C
A & B     0
C         4.272² / (1/2+1) = 12.167   0
(ii) If two clusters are required, write down the classification. (1 mark)
Cluster Type of biscuits
1 C
2 A & B
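
A small pure-Python sketch (not from the paper) of the Ward merging cost used in part (i); the formula d²/(1/n_i + 1/n_j) and the centroid distance 4.272 are taken from the model answer above.

```python
# Ward merging cost for clusters of sizes n_i and n_j at centroid distance d.
def ward_cost(d_centroid, n_i, n_j):
    return d_centroid ** 2 / (1 / n_i + 1 / n_j)

# Step 1: all clusters are singletons
print(ward_cost(3, 1, 1))   # A-B  -> 4.5
print(ward_cost(4, 1, 1))   # A-C  -> 8.0
print(ward_cost(5, 1, 1))   # B-C  -> 12.5

# A and B have the smallest cost, so they are merged first.
# Step 2: {A, B} (size 2) versus C (size 1)
print(ward_cost(4.272, 2, 1))   # -> about 12.167
```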
Question 5: Based on the transactions of the supermarket, perform association analysis: (3 marks)
DocID Items
D1 I’m crazy for bananas
D2 Is a banana and cereal a good breakfast?
D3 Cereal or oatmeal is a naturally low-fat.
D4 A banana is an edible fruit
D5 The wafer banana pudding cereal will let you have dessert for breakfast
Table 2: Transaction data
(a) Calculate the confidence for the rule { Banana } -> { Cereal } (1 mark)
Confidence Equation
c{ Banana } -> { Cereal } 2/4 = 0.5
(b) Calculate the confidence for the rule { Cereal } -> { Breakfast } (1 mark)
Confidence Equation
c{ Cereal } -> { Breakfast } 2/3 = 0.6667
(c) Based on parts (a) & (b), which rule is more useful? (Circle your answer) (1 mark)
The rule { Banana } -> { Cereal } / { Cereal } -> { Breakfast } is more useful
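
A sketch (not part of the model answer) of the two confidence calculations above. The item sets are a hand parsing of Table 2; items such as Oatmeal, Fruit and Dessert are only illustrative assumptions and do not affect the two rules.

```python
# confidence(X -> Y) = support(X and Y) / support(X), computed over the parsed transactions.
transactions = [
    {"Banana"},                                   # D1
    {"Banana", "Cereal", "Breakfast"},            # D2
    {"Cereal", "Oatmeal"},                        # D3
    {"Banana", "Fruit"},                          # D4
    {"Banana", "Cereal", "Breakfast", "Dessert"}, # D5
]

def confidence(antecedent, consequent):
    has_antecedent = [t for t in transactions if antecedent <= t]
    has_both = [t for t in has_antecedent if consequent <= t]
    return len(has_both) / len(has_antecedent)

print(confidence({"Banana"}, {"Cereal"}))      # 2/4 = 0.5
print(confidence({"Cereal"}, {"Breakfast"}))   # 2/3 = 0.6667
```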
Question 6: Given 5 documents (5 marks)
Doc Contents
1 Tony Stark is an iron man.
2 Robert Downey plays Tony Stark.
3 Tony Stark has won an Oscar. The film Iron man has won an Oscar, too.
4 Iron man is dead.
a) Calculate the strength of association between a keyword (Iron man) and the exploring word (Tony Stark) by
concept link. (i.e. term B is Tony Stark) (3 marks)
Equation
Strength = ln[ 1 / ( 3!/(2!·1!) × (2/3)² × (1/3)^(3-2) + 3!/(3!·0!) × (2/3)³ × (1/3)^(3-3) ) ] = 0.3001
b) Comment on the part (a) result (Circle your answer) (1 mark)
The association between 2 terms is very strong/weak.
c) Calculate the conditional count of the concept link between a keyword (Iron man) and the exploring word
(Tony Stark) while the centered word is "Iron man" (1 mark)
Conditional count of concept link for Tony Stark = 2/3
Question 7: Given a statement and 3 probability constraints (6 marks)
< He is a funny guy >
Constraint 1 P(He) + P(is) = 2/5
Constraint 2 P(He) + P(funny) + P(guy) = 3/7
Constraint 3 P(He) + P(is) + P(a) + P(funny) + P(guy) = 1
(a) Use Maximum Entropy to calculate the probability for each word
The intuitive answers (two possible equal-split allocations):

Allocation 1 (split Constraint 1 equally first):
P(He) = 1/5 = 0.2
P(is) = 1/5 = 0.2
P(a) = 1 - 2/5 - 4/35 - 4/35 = 0.37143
P(funny) = (3/7 - 1/5)/2 = 4/35 = 0.11429
P(guy) = (3/7 - 1/5)/2 = 4/35 = 0.11429

Allocation 2 (split Constraint 2 equally first):
P(He) = 1/7 = 0.14286
P(is) = 2/5 - 1/7 = 9/35 = 0.25714
P(a) = 1 - 3/7 - 9/35 = 0.31429
P(funny) = 1/7 = 0.14286
P(guy) = 1/7 = 0.14286
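
The two allocations above are intuitive equal splits. As a cross-check that is not part of the model answer, the maximum-entropy distribution under the three constraints can also be found numerically; the sketch below uses scipy.optimize, an assumed (not required) tool.

```python
# Maximize entropy (i.e. minimize sum p*log(p)) subject to the three constraints.
import numpy as np
from scipy.optimize import minimize

words = ["He", "is", "a", "funny", "guy"]

def neg_entropy(p):
    p = np.clip(p, 1e-12, 1)          # guard against log(0)
    return np.sum(p * np.log(p))

constraints = [
    {"type": "eq", "fun": lambda p: p[0] + p[1] - 2 / 5},          # P(He) + P(is) = 2/5
    {"type": "eq", "fun": lambda p: p[0] + p[3] + p[4] - 3 / 7},   # P(He) + P(funny) + P(guy) = 3/7
    {"type": "eq", "fun": lambda p: np.sum(p) - 1},                # probabilities sum to 1
]

x0 = np.full(5, 0.2)                  # start from the uniform distribution
result = minimize(neg_entropy, x0, method="SLSQP",
                  bounds=[(0, 1)] * 5, constraints=constraints)
print(dict(zip(words, np.round(result.x, 5))))
```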
Question 8: Use data parsing to identify the root words (Positive, Negative, Neutral, and Peripheral) in the
five classified comments (documents) in the training data set (Table 3) and the one unclassified comment in the
testing data set (Table 4).
• The number of terms in the vocabulary is FOUR after data parsing (12 marks)
Document ID Objects in content Classification
1 Good, Negative, Excellent Good
2 No Comment, Worst Bad
3 Nice, Great Good
4 Bad, Monitor Bad
5 Good, Keyboard, USB Good
Table 3. Training data set
Testing data set
Document ID Content
7 Peripheral, Positive, Negative, Positive, Positive
Table 4. Testing data set
a) Use the Naïve Bayes classifier to classify document 7 (11 marks)
Probability              Equation
Pr(Good)                 3/5 = 0.6
Pr(Bad)                  2/5 = 0.4
Pr(Positive | Good)      (5+1)/(8+4) = 6/12 = 0.5
Pr(Negative | Good)      (1+1)/(8+4) = 2/12 = 0.1667
Pr(Peripheral | Good)    (2+1)/(8+4) = 3/12 = 0.25
Pr(Positive | Bad)       (0+1)/(4+4) = 1/8 = 0.125
Pr(Negative | Bad)       (2+1)/(4+4) = 3/8 = 0.375
Pr(Peripheral | Bad)     (1+1)/(4+4) = 2/8 = 0.25
Pr(Good | Doc 7)         0.6 × 0.25 × 0.5³ × 0.1667 = 0.003126
Pr(Bad | Doc 7)          0.4 × 0.25 × 0.125³ × 0.375 = 0.00007
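
A sketch (not part of the model answer) of the Naïve Bayes computation above. It assumes the data-parsing step maps Good/Excellent/Nice/Great to Positive, Negative/Worst/Bad to Negative, No Comment to Neutral, and Monitor/Keyboard/USB to Peripheral, which is the mapping that reproduces the counts used above.

```python
# Priors from class proportions, word likelihoods with add-one (Laplace) smoothing
# over the four-word vocabulary, then the unnormalised posterior for document 7.
from collections import Counter

vocab = ["Positive", "Negative", "Neutral", "Peripheral"]
training = {
    "Good": [["Positive", "Negative", "Positive"],       # Doc 1
             ["Positive", "Positive"],                   # Doc 3
             ["Positive", "Peripheral", "Peripheral"]],  # Doc 5
    "Bad":  [["Neutral", "Negative"],                    # Doc 2
             ["Negative", "Peripheral"]],                # Doc 4
}
doc7 = ["Peripheral", "Positive", "Negative", "Positive", "Positive"]

n_docs = sum(len(docs) for docs in training.values())
for label, docs in training.items():
    prior = len(docs) / n_docs
    counts = Counter(word for doc in docs for word in doc)
    total = sum(counts.values())
    score = prior
    for word in doc7:
        score *= (counts[word] + 1) / (total + len(vocab))   # add-one smoothing
    print(label, round(score, 6))
# Good ~ 0.003125, Bad ~ 0.000073 -> document 7 is classified as Good
```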
b) Comment on the part (a) result (Circle your answer) (1 mark)
Document 7 is a better match with Good / Bad opinion
Question 9: Given the testing result of sentiment analysis shown in figure 1: (9 marks)
Figure 1: Results of sentiment analysis
a) According to the testing result of sentiment analysis (Figure 1), fill in the confusion matrix (2 marks)

                          Predicted Class
                          Negative    Positive
Actual Class   Negative   3           4
               Positive   2           4
b) Calculate the positive precision, negative precision and overall precision. Show detailed calculations.
(3 marks)
Equation
Positive Precision: 4/6 = 66.7%
Negative Precision: 3/7 = 42.857%
Overall Precision: (3+4)/13 = 0.53846 = 53.846%
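
A sketch (not part of the model answer) reproducing the part (b) arithmetic directly from the confusion matrix in part (a); Figure 1 itself is not reproduced here.

```python
# Confusion matrix as filled in part (a): rows = actual class, columns = predicted class,
# in the order Negative, Positive. Note: the model answer divides each correct count by
# its actual-class total (row sum); many textbooks call that ratio recall, while precision
# would divide by the predicted-class total (column sum).
cm = [[3, 4],   # actual Negative: 3 predicted negative, 4 predicted positive
      [2, 4]]   # actual Positive: 2 predicted negative, 4 predicted positive

negative_precision = cm[0][0] / sum(cm[0])                        # 3/7
positive_precision = cm[1][1] / sum(cm[1])                        # 4/6
overall_precision = (cm[0][0] + cm[1][1]) / sum(map(sum, cm))     # 7/13

print(round(negative_precision, 5), round(positive_precision, 5), round(overall_precision, 5))
# 0.42857 0.66667 0.53846
```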
c) Comment on the model based on the result of part (b). (Circle your answer) (2 marks)
The model is better at predicting the positive / negative opinion.
Overall, the performance is Strong / Fair / Bad
d) Justify your choice on the overall performance in one sentence (2 marks)
All the precision values are only around 50%