Date: 2023-12-03
COMP9313: Big Data Management
Revisit
myExperience Survey
❖ “Please participate in the myExperience Survey and take the opportunity
to share your constructive thoughts on your learning experience. Your
contributions help your teachers and shape the future of education at
UNSW.”
❖ You can access the survey by logging into Moodle or accessing
myexperience.unsw.edu.au directly.
❖ The myExperience survey closes on 2023-11-23.
❖ As mentioned in the WebCMS3 notice, if the response rate from the class
exceeds 50%, everybody gets 1 bonus mark added to their final mark.
Final Exam
❖ Final written exam (50 pts)
❖ Six questions in total on five topics
❖ Four hours (do not wait until the last minute to submit!)
❖ Online exam. Submit through Moodle
❖ If you are ill on the day of the exam, do not attend it: no medical special
consideration claims will be accepted from anyone who has already
attempted the exam.
Chapters Required in Exam
❖ Hadoop MapReduce (Chapters 1, 2, and 3)
➢ HDFS
➢ MapReduce Concepts and Mechanism
➢ MapReduce algorithm design
❖ Spark (Chapters 4 and 5)
➢ RDD
➢ DataFrame
❖ Mining Data Streams (Chapter 6)
❖ Finding Similar Items (Chapter 7)
➢ Shingling, minhash, LSH
➢ Exact solution
❖ Graph Data Management (Chapter 8)
❖ NoSQL and Hive (Chapter 9)
Exam Questions
❖ Question 1: HDFS, MapReduce, and Spark concepts
❖ Question 2: MapReduce algorithm design (pseudo-code only)
❖ Question 3: Spark algorithm design
➢ RDD
➢ DataFrame
❖ Question 4: Finding Similar Items
❖ Question 5: Mining Data Streams
❖ Question 6: Graph Data Management
Question 0
❖ (a) (2 marks) Explain the data flow in MapReduce using the word
count problem as an example.
❖ (b) (2 marks) Explain the data flow in Spark using the word count
problem as an example.
Map and Reduce Functions
❖ Programmers specify two functions:
➢ map (k1, v1) → list [(k2, v2)]
 Map transforms the input into key-value pairs to process
➢ reduce (k2, [v2]) → list [(k3, v3)]
 Reduce aggregates the list of values for each key
 All values with the same key are sent to the same reducer
❖ Optionally, also:
➢ combine (k2, [v2]) → list [(k2, v2)]
➢ partition (k2, number of partitions) → partition for k2
➢ Grouping comparator: controls which keys are grouped together
for a single call to Reducer.reduce() function
❖ The execution framework handles everything else… (see the word count sketch below)
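As a running illustration (and the example asked about in Question 0(a)), below is a minimal mrjob-style word count sketch; the class and file names are placeholders rather than anything prescribed by the slides.

from mrjob.job import MRJob

class WordCount(MRJob):
    def mapper(self, _, line):
        # map (k1, v1) -> [(k2, v2)]: emit one (word, 1) pair per token
        for word in line.lower().split():
            yield word, 1

    def combiner(self, word, counts):
        # optional map-side aggregation of the pairs emitted above
        yield word, sum(counts)

    def reducer(self, word, counts):
        # all counts for the same word reach the same reducer
        yield word, sum(counts)

if __name__ == '__main__':
    WordCount.run()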
MapReduce Data Flow
Sample Questions
❖ Assume that you are given a data set crawled from a location-based
social network, in which each line of the data is in the format (userID, list
of locations the user has visited). Your task is to compute, for each
location, the set of users who have visited it, with the users sorted in
ascending order of their IDs.
Sample Solution
from mrjob.job import MRJob

class Question1(MRJob):
    # Split the composite map output key "loc,userID" into two fields,
    # partition by the location only, and sort by location then userID,
    # so each reducer sees a location's users in ascending ID order.
    JOBCONF = {
        'mapreduce.map.output.key.field.separator': ',',
        'mapreduce.partition.keypartitioner.options': '-k1,1',
        'mapreduce.partition.keycomparator.options': '-k1,1 -k2,2',
    }

    def mapper(self, userID, locations):
        # input record assumed already parsed as (userID, list of locations)
        for loc in locations:
            yield "%s,%s" % (loc, userID), ""

    def reducer_init(self):
        self.current_loc = ""
        self.current_list = []

    def reducer(self, key, values):
        loc, userID = key.split(",")
        if loc != self.current_loc:
            if self.current_loc != "":
                yield self.current_loc, self.current_list
            self.current_list = []
            self.current_loc = loc
        self.current_list.append(userID)

    def reducer_final(self):
        yield self.current_loc, self.current_list
Sample Questions
❖ Given the table shown below, find the person(s) with the maximum salary
in each department (employees may have the same salary).
❖ Solution (sketched in code after the table):
➢ Mapper: for each record, Emit(department + "," + salary, name)
➢ Combiner: find all persons with the local maximum salary for each
department
➢ Reducer: receives the records for a department ordered by salary,
largest first, so the first record carries the department's maximum
salary. Keep reading records while the salary equals this maximum and
ignore the rest. Output all persons with the maximum salary in the
department.
➢ JOBCONF: key partitioned by "-k1,1", sorted by "-k1,1 -k2,2nr"
(numeric and reversed, so the largest salary comes first)
EmployeeID Name DepartmentID Salary
001 Emma 1 100,000
002 Helen 2 85,000
003 Jack 3 85,000
004 James 1 110,000
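A minimal mrjob-style sketch of this design; the record parsing (and the assumption that salaries are written without thousands separators) is illustrative only, and the combiner is omitted for brevity.

from mrjob.job import MRJob

class MaxSalaryPerDept(MRJob):
    # Partition by department only; sort by department, then by salary
    # in reverse numeric order, so the highest salary arrives first.
    JOBCONF = {
        'mapreduce.map.output.key.field.separator': ',',
        'mapreduce.partition.keypartitioner.options': '-k1,1',
        'mapreduce.partition.keycomparator.options': '-k1,1 -k2,2nr',
    }

    def mapper(self, _, line):
        # assumed record format: EmployeeID,Name,DepartmentID,Salary
        emp_id, name, dept, salary = line.split(',')
        yield '%s,%s' % (dept, salary), name

    def reducer_init(self):
        self.current_dept = None
        self.max_salary = None

    def reducer(self, key, names):
        dept, salary = key.split(',')
        if dept != self.current_dept:
            # the first key seen for a department carries its maximum salary
            self.current_dept, self.max_salary = dept, salary
        if salary == self.max_salary:
            for name in names:
                yield dept, (name, salary)
        # smaller salaries for the same department are simply ignored

if __name__ == '__main__':
    MaxSalaryPerDept.run()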
Sample Questions
❖ Given a large text dataset, find the top-k frequent terms (assume that
you can use multiple reducers, and that the efficiency of your method will
be evaluated).
❖ Solution:
➢ Two rounds (see the sketch below):
 First round: compute the term frequencies using multiple reducers,
where each reducer keeps only its local top-k.
 Second round: collect the local top-k lists and compute the final top-k
using a single reducer.
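A hedged mrjob sketch of the two-round design; the value of k and the min-heap bookkeeping are illustrative choices, not prescribed by the question.

from heapq import heappush, heappushpop
from mrjob.job import MRJob
from mrjob.step import MRStep

K = 10  # hypothetical k

class TopKTerms(MRJob):
    def steps(self):
        return [
            MRStep(mapper=self.mapper_tf, combiner=self.combiner_tf,
                   reducer_init=self.reducer_topk_init,
                   reducer=self.reducer_local_topk,
                   reducer_final=self.reducer_topk_final),
            MRStep(reducer=self.reducer_global_topk),
        ]

    def mapper_tf(self, _, line):
        for term in line.lower().split():
            yield term, 1

    def combiner_tf(self, term, counts):
        yield term, sum(counts)

    def reducer_topk_init(self):
        self.heap = []                      # local top-k kept in a min-heap

    def reducer_local_topk(self, term, counts):
        item = (sum(counts), term)
        if len(self.heap) < K:
            heappush(self.heap, item)
        else:
            heappushpop(self.heap, item)

    def reducer_topk_final(self):
        for count, term in self.heap:
            yield None, (count, term)       # send every local top-k to one key

    def reducer_global_topk(self, _, items):
        # a single reducer sees all local candidates and keeps the global top-k
        for count, term in sorted(items, reverse=True)[:K]:
            yield term, count

if __name__ == '__main__':
    TopKTerms.run()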
What is RDD
❖ Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-
Memory Cluster Computing. Matei Zaharia, et al. NSDI’12
➢ RDD is a distributed memory abstraction that lets programmers
perform in-memory computations on large clusters in a fault-
tolerant manner.
❖ Resilient
➢ Fault-tolerant, is able to recompute missing or damaged partitions
due to node failures.
❖ Distributed
➢ Data residing on multiple nodes in a cluster.
❖ Dataset
➢ A collection of partitioned elements, e.g. tuples or other objects
(that represent records of the data you work with).
❖ RDD is the primary data abstraction in Apache Spark and the core of
Spark. It enables operations on collections of elements in parallel.
RDD Operations
❖ Transformation: returns a new RDD.
➢ Nothing gets evaluated when you call a transformation function; it
just takes an RDD and returns a new RDD.
➢ Transformation functions include map, filter, flatMap, groupByKey,
reduceByKey, aggregateByKey, join, etc.
❖ Action: evaluates and returns a new value.
➢ When an action function is called on an RDD object, all the data
processing queries are computed at that time and the result value
is returned (see the sketch below).
➢ Action operations include reduce, collect, count, first, take,
countByKey, foreach, saveAsTextFile, etc.
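For Question 0(b), a minimal PySpark sketch of the same word count; the input path is a placeholder. Note that nothing runs until the final action.

from pyspark import SparkContext

sc = SparkContext(appName='WordCount')
lines = sc.textFile('input.txt')            # hypothetical input path

# Transformations only record the lineage; no job is executed yet.
counts = (lines.flatMap(lambda line: line.lower().split())
               .map(lambda w: (w, 1))
               .reduceByKey(lambda a, b: a + b))

# The action triggers the actual computation and returns a value.
print(counts.take(10))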
DataFrame
❖ A DataFrame is more like a table in a traditional two-dimensional
database: besides the data itself, it also records the structural
information of the data, i.e. its schema.
➢ Although RDD[Person] takes Person as its type parameter, the
Spark framework itself does not understand the internal structure of
the Person class.
➢ A DataFrame provides detailed structural information, so Spark SQL
knows exactly which columns the dataset contains, and the name and
type of each column. Spark SQL's query optimizer can therefore apply
targeted optimizations (see the example below).
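A small illustration of the difference; the Person-like data here is made up for the example.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('DataFrameExample').getOrCreate()

# As an RDD, the rows are opaque Python objects to Spark.
rdd = spark.sparkContext.parallelize([('Alice', 30), ('Bob', 25)])

# As a DataFrame, the same data carries a schema (column names and types),
# which Spark SQL's optimizer can exploit.
df = spark.createDataFrame(rdd, schema=['name', 'age'])
df.printSchema()
df.filter(df['age'] > 26).show()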
Sample Questions
❖ RDD: Given a large text file, your task is to find out the top-k most
frequent co-occurring term pairs. The co-occurrence of (w, u) is
defined as: u and w appear in the same line (this also means that (w,
u) and (u, w) are treated equally). Your Spark program should
generate a list of k key-value pairs ranked in descending order
according to the frequencies, where the keys are the pair of terms and
the values are the co-occurring frequencies (Hint: you need to define
a function which takes an array of terms as input and generates all
possible pairs).
val textFile = sc.textFile(inputFile)
val words = textFile.map(_.toLowerCase.split(" "))
// fill your code here, and store the result in a pair RDD topk
topk.foreach(x => println(x._1, x._2))
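The skeleton above is in Scala; below is a hedged PySpark sketch of one possible way to fill it in, where k and the input path are placeholders.

from itertools import combinations
from pyspark import SparkContext

sc = SparkContext(appName='TopKPairs')
k = 10                                        # hypothetical k

def pairs_in_line(words):
    # emit each unordered pair once per line, in a canonical order,
    # so that (w, u) and (u, w) are counted as the same pair
    return combinations(sorted(set(words)), 2)

lines = sc.textFile('input.txt')              # hypothetical input path
words = lines.map(lambda line: line.lower().split())
counts = (words.flatMap(pairs_in_line)
               .map(lambda pair: (pair, 1))
               .reduceByKey(lambda a, b: a + b))

# keep the k most frequent pairs as a small pair RDD, descending by frequency
topk = sc.parallelize(counts.top(k, key=lambda kv: kv[1]))
for pair, freq in topk.collect():
    print(pair, freq)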
Sample Questions
❖ Given a set of marks from different courses (the input format is as
shown below), the task is to compute the average mark for every course
and sort the result by course_name in alphabetical order.
❖ Solution:
Input:
student1:course1,90;course2,92;course3,80;course4,79;course5,93
student2:course1,92;course2,77;course5,85
student3:course3,64;course4,97;course5,82

Output:
course1:91
course2:84.5
course3:72
course4:88
course5:86.67
from pyspark.sql.functions import split, explode, avg

fileDF = spark.read.text("file:///home/comp9313/tinydoc")
student = fileDF.select(split(fileDF['value'], ':').getItem(0).alias('sid'),
                        split(fileDF['value'], ':').getItem(1).alias('courses'))
scDF = student.withColumn('course', explode(split('courses', ';')))
scDF2 = scDF.select(split(scDF['course'], ',').getItem(0).alias('cname'),
                    split(scDF['course'], ',').getItem(1).alias('mark'))
avgDF = scDF2.groupBy('cname').agg(avg('mark')).orderBy('cname')
Mining Data Streams
❖ Sampling from a data stream
❖ Sliding window – counting bits (DGIM)
❖ Filtering data stream – (counting) bloom filter
(Figures: a length-N bit stream for sliding-window counting, and a B-bit Bloom filter array into which elements xi and xj are hashed.)
Mining Data Streams
❖ Finding Frequent Elements
➢ Boyer-Moore voting algorithm, Misra-Gries algorithm
➢ lossy counting, count-min sketch
❖ Counting distinct elements in a data stream – FM-Sketch
➢ Estimate d = c · 2^R for a scaling constant c ≈ 1.3 (from the original paper)
(Figure: the FM bitmap. Positions well below log(d) are all 1s, positions well above log(d) are all 0s, with a fringe of mixed 0s and 1s around position log(d).)
Sample Questions
❖ Use an example to explain the reservoir sampling algorithm (a short code sketch follows the steps below)
➢ Store all the first s elements of the stream to S
➢ Suppose we have seen n − 1 elements, and now the nth element arrives (n > s)
✓ With probability s/n, keep the nth element, else discard it
✓ If we picked the nth element, then it replaces one of the s elements in
the sample S, picked uniformly at random
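A short Python sketch of the algorithm; the stream and sample size are arbitrary placeholders.

import random

def reservoir_sample(stream, s):
    sample = []
    for n, element in enumerate(stream, start=1):
        if n <= s:
            sample.append(element)                  # keep the first s elements
        elif random.random() < s / n:               # keep element n with probability s/n
            sample[random.randrange(s)] = element   # replace a uniformly chosen element
    return sample

print(reservoir_sample(range(10000), 10))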
Sample Questions
Suppose we are maintaining a count of 1s using the DGIM method. We
represent a bucket by (i, t), where i is the number of 1s in the bucket and t
is the bucket timestamp (the time of the most recent 1).
Consider that the current time is 200, the window size is 60, and the current
list of buckets is: (16, 148) (8, 162) (8, 177) (4, 183) (2, 192) (1, 197) (1, 200).
At the next ten clocks, 201 through 210, the stream is 0101010101. What will
the sequence of buckets be at the end of these ten inputs?
Sample Solution
❖ There are five 1s in the arriving stream (at times 202, 204, 206, 208, and 210). Each of them updates the buckets as follows:
➢ (1) (16, 148)(8, 162)(8, 177)(4, 183)(2, 192)(1, 197)(1, 200), (1, 202)
=> (16, 148)(8, 162)(8, 177)(4, 183)(2, 192)(2, 200), (1, 202)
➢ (2) (16, 148)(8, 162)(8, 177)(4, 183)(2, 192)(2, 200), (1, 202), (1, 204)
➢ (3) (16, 148)(8, 162)(8, 177)(4, 183)(2, 192)(2, 200), (1, 202), (1, 204), (1, 206)
=> (16, 148)(8, 162)(8, 177)(4, 183)(2, 192)(2, 200), (2, 204), (1, 206)
=> (16, 148)(8, 162)(8, 177)(4, 183)(4, 200), (2, 204), (1, 206)
➢ (4) The window size is 60, so (16, 148) now falls outside the window and is dropped.
(16, 148)(8, 162)(8, 177)(4, 183)(4, 200), (2, 204), (1, 206), (1, 208)
=> (8, 162)(8, 177)(4, 183)(4, 200), (2, 204), (1, 206), (1, 208)
➢ (5) (8, 162)(8, 177)(4, 183)(4, 200), (2, 204), (1, 206), (1, 208), (1, 210)
=> (8, 162)(8, 177)(4, 183)(4, 200), (2, 204), (2, 208), (1, 210)
Sample Questions
❖ Consider a Bloom filter of size m = 7 (i.e., 7 bits) and 2 hash functions
that both take a string (lowercase) as input:
h1(str) = ∑(c in str)(c-'a') mod 7
h2(str) = str.length mod 7
Here, c - ‘a’ is used to compute the position of the letter c in the 26
alphabetical letters, e.g., h1(“bd”) = (1 + 3) mod 7 = 4.
➢ (i) Given a set of string S = {“hi”, “big”, “data”}, show the update of
the Bloom filter
➢ (ii) Given a string “spark”, use the Bloom filter to check whether it
is contained in S.
➢ (iii) Given S in (i) and the Bloom filter with 7 bits, what is the
percentage of the false positive probability (a correct expression is
sufficient: you need not give the actual number)?
Sample Solution
❖ (i) The three strings hash to the following positions, so bits 1, 2, 3, and 4 of the filter are set:
 hi: h1 = (7 + 8) mod 7 = 1, h2 = 2 mod 7 = 2
 big: h1 = (1 + 8 + 6) mod 7 = 1, h2 = 3 mod 7 = 3
 data: h1 = (3 + 0 + 19 + 0) mod 7 = 1, h2 = 4 mod 7 = 4
❖ (ii) h1("spark") = (18 + 15 + 0 + 17 + 10) mod 7 = 4
 h2("spark") = 5 mod 7 = 5
 "spark" is not in S, since bit 4 is 1 but bit 5 is 0.
❖ (iii) With k = # of hash functions, m = # of inserted elements, and n = # of bits,
the false positive probability is approximately (1 − e^(−km/n))^k = (1 − e^(−2·3/7))^2 ≈ 0.33, i.e. about 33%.
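A small Python sketch reproducing this example; the function and variable names are just for illustration.

M = 7                                     # number of bits in the filter

def h1(s):
    return sum(ord(c) - ord('a') for c in s) % M

def h2(s):
    return len(s) % M

bits = [0] * M
for word in ['hi', 'big', 'data']:        # build the filter from S
    bits[h1(word)] = 1
    bits[h2(word)] = 1
print(bits)                               # bits 1, 2, 3 and 4 are set

def maybe_contains(word):
    return bits[h1(word)] == 1 and bits[h2(word)] == 1

print(maybe_contains('spark'))            # False: h2('spark') = 5 and bit 5 is 0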
Sample Questions
❖ Assume that we have 5 buckets and three hash functions:
➢ h0(str) = str.length * 2 mod 5
➢ h1(str) = str.length mod 5
➢ h2(str) = (str[0]-‘a’) mod 5
Given a stream of terms: "big", "data", "data", "set", "data",
"analytics", show the steps of building the CM-Sketch. Then use the built
CM-Sketch to get the count for the word "data".
❖ Solution:
➢ big: h0 = 1, h1 = 3, h2 = 1
➢ data: h0 = 3, h1 = 4, h2 = 3
➢ set: h0 = 1, h1 = 3, h2 = 3
➢ analytics: h0 = 3, h1 = 4, h2 = 0
Sample Solution
Initially:
b0 b1 b2 b3 b4
h0 0 0 0 0 0
h1 0 0 0 0 0
h2 0 0 0 0 0

big:
b0 b1 b2 b3 b4
h0 0 1 0 0 0
h1 0 0 0 1 0
h2 0 1 0 0 0

data:
b0 b1 b2 b3 b4
h0 0 1 0 1 0
h1 0 0 0 1 1
h2 0 1 0 1 0

data:
b0 b1 b2 b3 b4
h0 0 1 0 2 0
h1 0 0 0 1 2
h2 0 1 0 2 0

set:
b0 b1 b2 b3 b4
h0 0 2 0 2 0
h1 0 0 0 2 2
h2 0 1 0 3 0

data:
b0 b1 b2 b3 b4
h0 0 2 0 3 0
h1 0 0 0 2 3
h2 0 1 0 4 0

analytics:
b0 b1 b2 b3 b4
h0 0 2 0 4 0
h1 0 0 0 2 4
h2 1 1 0 4 0
Min(CMS[0][3], CMS[1][4], CMS[2][3]) = 4, which overestimates the true count of 3 because of hash collisions (a code sketch reproducing this example follows).
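A small Python sketch of the same count-min sketch; names are illustrative.

def h0(s): return (len(s) * 2) % 5
def h1(s): return len(s) % 5
def h2(s): return (ord(s[0]) - ord('a')) % 5

HASHES = [h0, h1, h2]
cms = [[0] * 5 for _ in HASHES]           # 3 rows x 5 buckets, all zero

for term in ['big', 'data', 'data', 'set', 'data', 'analytics']:
    for row, h in enumerate(HASHES):
        cms[row][h(term)] += 1

# point query: the minimum over the rows is an overestimate of the true count
estimate = min(cms[row][h('data')] for row, h in enumerate(HASHES))
print(estimate)                           # 4, although 'data' appears only 3 times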
Finding Similar Items
❖ The Big Picture
➢ Document → (shingling) → the set of strings of length k that appear in the document
➢ → (minhashing) → signatures: short integer vectors that represent the sets and reflect their similarity
➢ → (locality-sensitive hashing) → candidate pairs: those pairs of signatures that we need to test for similarity
Sample Questions
❖ MinHash:
We want to compute the min-hash signatures for two columns, C1 and C2,
using two pseudo-random permutations of the rows defined by the following
hash functions:
h1(n) = (3n + 2) mod 7
h2(n) = (2n − 1) mod 7
Here, n is the row number in the original ordering. Instead of explicitly
reordering the rows for each hash function, we use the implementation
discussed in class, in which we read the data in each column once in
sequential order and update the min-hash signatures as we pass through
the rows.
Complete the steps of the algorithm and give the resulting signatures for
C1 and C2.
Solution
h1(n) = (3n + 2) mod 7, h2(n) = (2n − 1) mod 7

Row processed                    Sig(C1)    Sig(C2)
(initially)                      (∞, ∞)     (∞, ∞)
Row 0: h1(0) = 2, h2(0) = 6      (∞, ∞)     (2, 6)
Row 1: h1(1) = 5, h2(1) = 1      (5, 1)     (2, 6)
Row 2: h1(2) = 1, h2(2) = 3      (5, 1)     (1, 3)
Row 4: h1(4) = 0, h2(4) = 0      (0, 0)     (0, 0)

Final signatures: Sig(C1) = (0, 0), Sig(C2) = (0, 0).
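A small Python sketch of the streaming computation, assuming, as the trace above implies, that C1 has 1s in rows {1, 4} and C2 has 1s in rows {0, 2, 4}.

INF = float('inf')

def h1(n): return (3 * n + 2) % 7
def h2(n): return (2 * n - 1) % 7

columns = {'C1': {1, 4}, 'C2': {0, 2, 4}}     # rows where each column has a 1
sig = {c: [INF, INF] for c in columns}        # one signature slot per hash function

for row in range(5):                          # scan the rows once, in order
    for col, ones in columns.items():
        if row in ones:
            sig[col][0] = min(sig[col][0], h1(row))
            sig[col][1] = min(sig[col][1], h2(row))

print(sig)                                    # {'C1': [0, 0], 'C2': [0, 0]}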
Sample Questions
❖ Suppose we wish to find similar sets, and we do so by minhashing the
sets 10 times and then applying locality-sensitive hashing using 5
bands of 2 rows (minhash values) each. If two sets had Jaccard
similarity 0.6, what is the probability that they will be identified in the
locality-sensitive hashing as candidates (i.e. they hash at least once to
the same bucket)? You may assume that there are no coincidences,
where two unequal values hash to the same bucket. A correct
expression is sufficient: you need not give the actual number.
❖ Solution: 1 − (1 − t^r)^b, with t = 0.6, r = 2 rows per band, and b = 5 bands
➢ 1 − (1 − 0.6^2)^5 ≈ 0.89
Graph – Shortest Path (Iteration 1)
❖ Map:
Read s --> 0 | n1: 10, n2: 5
Emit: (n1, 10), (n2, 5), and the adjacency list (s, n1: 10, n2: 5)
The adjacency lists of the other nodes are also read and emitted, but they do
not contribute to this iteration and are thus ignored here.
❖ Reduce:
Receives: (n1, 10), (n2, 5), (s, <0, (n1: 10, n2: 5)>)
The adjacency list of each node is also received by its reducer; these are omitted from the example.
Emit:
s --> 0 | n1: 10, n2: 5
n1 --> 10 | n2: 2, n3:1
n2 --> 5 | n1: 3, n3:9, n4:2
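A schematic Python sketch of the mapper and reducer for one iteration; the representation of node values as (distance, adjacency list) pairs follows the example above, everything else is illustrative.

def mapper(node, value):
    dist, adj = value                        # e.g. 's', (0, {'n1': 10, 'n2': 5})
    yield node, ('GRAPH', adj)               # pass the graph structure along
    yield node, ('DIST', dist)               # keep the node's current distance
    for neighbour, weight in adj.items():
        yield neighbour, ('DIST', dist + weight)

def reducer(node, values):
    best, adj = float('inf'), {}
    for kind, v in values:
        if kind == 'GRAPH':
            adj = v                          # recover the adjacency list
        else:
            best = min(best, v)              # keep the shortest distance seen
    yield node, (best, adj)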
PageRank in MapReduce (One Iteration)
(Figure: one PageRank iteration over the graph with adjacency lists n1 [n2, n4], n2 [n3, n5], n3 [n4], n4 [n5], n5 [n1, n2, n3]. The Map phase sends each node's rank contributions to its out-neighbours and passes the adjacency lists along; the Reduce phase sums the contributions received by each node and re-attaches its adjacency list. A code sketch follows.)
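A schematic Python sketch of the mapper and reducer for one PageRank iteration without the random-jump term; names and value layout are illustrative only.

def mapper(node, value):
    rank, adj = value                        # e.g. 'n1', (0.2, ['n2', 'n4'])
    yield node, ('GRAPH', adj)               # pass the adjacency list along
    for neighbour in adj:
        yield neighbour, ('RANK', rank / len(adj))

def reducer(node, values):
    total, adj = 0.0, []
    for kind, v in values:
        if kind == 'GRAPH':
            adj = v
        else:
            total += v                       # sum the incoming contributions
    yield node, (total, adj)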
Sample Questions
❖ A directed graph G has the set of nodes {1,2,3,4,5,6} with the edges
arranged as follows.
❖ Set up the PageRank equations, assuming β = 0.8 (jump probability =
1 − β). Denote the PageRank of node a by r(a).
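➢ Recall the general form: with N nodes and jump probability 1 − β, each node a satisfies r(a) = (1 − β)/N + β · Σ_{b → a} r(b)/out-degree(b); the concrete equations are obtained by plugging in β = 0.8, N = 6, and the out-degrees from the edge figure.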
Solution
Thank you!