程序代写案例-COMPSCI 752-Assignment 4|学霸联盟

程序代写案例-COMPSCI 752-Assignment 4

时间：2022-05-15

COMPSCI 752
Big Data Management
Assignment 4 / Semester 1, 2022
Big data application
Student ID:
Name:
Intruction:
There are 2 exercises that are worth 5 marks. Each exercise has several questions with
its mark distribution. Good luck.
Submission:
Please submit it as a single pdf file on CANVAS by 23:59, Sunday 22 May 2022.
Penalty Dates
The assignment will not be accepted after the last penalty date unless there are special
circumstances (e.g., sickness with certificate). Penalties will be calculated as follows as
a percentage of the mark for the assignment.
By 5pm, Fri 6 May 2022 – No penalty
By 5pm, Sat 7 May 2022 – 25% penalty
By 5pm, Sun 8 May 2022 – 50% penalty
1
1 Odd sketch [5 marks]
Given two set A and B, the symmetric difference size between A and B is |A| + |B| −
2|A ∩B|. Consider 4 sets below:
S1 = {1, 2, 4, 6, 8} ,
S2 = {1, 2, 4, 6, 9} ,
S3 = {1, 2, 4, 7, 9} ,
S4 = {1, 3, 5, 7, 9} .
1. Finding all pairs of sets such that their symmetric difference size is at most 2.
[1 mark]
2. Using the hash function h(x) = x + 1 mod 4, construct odd sketches of 4 bits
length for these sets above. [2 marks]
3. How can we use these odd sketches to solve the question 1? Explain your solution.
[2 marks]
2
2 Similarity join [5 marks]
Consider the collection of 4 pre-processed documents. Each document is represented as
a list of terms together with its weight as follows:
d1 {(andrew, 6), (teach, 4), (logic, 2), (database, 5)}
d2 {(peter, 6), (study, 2), (python, 6), (logic, 3)}
d3 {(andrew, 5), (teach, 1), (python, 5)}
d4 {(peter, 5), (study, 4), (like, 2), (python, 5)}
We represent each document as a high-dimensional vector and use the inner product as
the similarity measure. We will use the prefix-filtering algorithm to find similar pairs of
documents whose inner product values are larger than a threshold t.
1. Compute the similarity of each pair of non-identical documents. [1 mark]
2. Assume that the dimension indexes are as follows:
{(andrew, 1), (peter, 2), (teach, 3), (study, 4), (like, 5),
(logic, 6), (database, 7), (python, 8)}.
Construct the signature of each document given the threshold t = 37. What are
the candidate pairs that we need to compute the similarity? [2 marks]
3. There are several way to index the dimensions, for example
{(peter, 1), (andrew, 2), (teach, 3), (study, 4), (like, 5),
(logic, 6), (database, 7), (python, 8)}.
Are there any way to index the dimensions so that we use less space to store our
inverted index? Explain your solution. [2 marks]
3