xuebaunion@vip.163.com

3551 Trousdale Rkwy, University Park, Los Angeles, CA

留学生论文指导和课程辅导

无忧GPA：https://www.essaygpa.com

工作时间：全年无休-早上8点到凌晨3点

微信客服：xiaoxionga100

微信客服：ITCS521

程序代写案例-COMP6714

时间：2020-12-29

Name: ,

(Family name) (Given name)

Student ID:

THE UNIVERSITY OF NEW SOUTH WALES

Final Exam

COMP6714

Information Retrieval and Web Search

TERM T3, 2020

• Time allowed: 10 minutes reading time + 2 hours

– Exception: students with extra exam time approved by Equitable Learning Services

(ELS) can make submissions after 14:30, 2 December 2020 within their approved

extra time.

• Total number of questions: 8.

• Total number of marks: 100

• Total number of pages: 8 excluding this cover page

• This is an open-book exam. You are allowed to use textbook(s), lecture notes and other

study materials. However, you are not allowed to (1) communicate with anyone else or (2)

use the Internet during the exam.

• Items allowed: UNSW approved calculators.

• You can answer the questions in any order.

• Start each question on a new page.

• Answers must be written in ink on A4 papers and scanned into a PDF file. Alternatively,

you can use any software to directly generate the answers in a PDF file.

Question 1 (10 marks)

Write down the pseudo-code of answering the Boolean query not A and B. You need to

use the function skipTo(id) wherever it is possible.

You can refer to the following skeleton code. Note that it is by no means the complete

pseudo-code, and you should add multiple lines or modify existing line(s) to complete

this task.

Algorithm 1: Q1(p1, p2)

1 answer ← ∅;

2 while . . . do

3 if docID(p1) > docID(p2) then

4 ;

5 else

6 if docID(p1) < docID(p2) then

7 ;

8 else

9 ;

10 return answer ;

COMP6714 Page 1 of 8

Question 2 (14 marks)

Consider applying γ-encoding to a posting list of n document IDs (within the range of

[1, N ]).

Prove that:

• For a value x, its γ-encoded value takes at most 2 log2(x) + 1 bits.

• The compressed posting list (using γ codes on the gaps) takes at most n · log2

(

2N2

n2

)

bits.

COMP6714 Page 2 of 8

Question 3 (14 marks)

(a) Consider a dictionary that contains v terms, and each term has exactly l characters.

Assume we build both a permuterm index and a bi-gram index for the dictionary.

What are the sizes of these two indexes (note, do not include the size of the dic-

tionary and the inverted lists), respectively? You may assume that a pointer (to a

term in the dictionary) is 4-bytes and all terms only contain characters from a to

z.

(b) Consider a query of P*Q*R, where P, Q, and R are a sequence of characters. Briefly

describe how a permuterm index can be used to efficiently answer this query.

(c) The above query can also be answer without using list interaction. Briefly describe

how this can be done, and give a reason why this might be more efficient than the

previous query processing method.

COMP6714 Page 3 of 8

Question 4 (14 marks)

Consider the logarithmic merge algorithm for dynamic index construction. The sub-

indexes created on the disk are: I3, I2, I1, and I0 (note: I0 is the smallest sub-index on

the disk). Assume the current in-memory index is full and needs to be dumped to the

disk.

(a) What are the sub-indexes after dumping the current in-memory index to the disk?

(b) How many sub-indexes will the algorithm create to index a document collection of

size |C| when the memory size is M .

(c) Let |C| = 14 ·M . What is the total number of times sub-indexes are merged? You

need to include merges of a sub-index on the disk and the index in the memory.

COMP6714 Page 4 of 8

Question 5 (10 marks)

The figure below shows the output of two information retrieval systems on the same two

queries in a competitive evaluation. The top 10 ranks are shown. Crosses correspond

to a document which has been judged relevant by a human judge; dashes correspond to

irrelevant documents. Assume that System 1 retrieved all the relevant documents for

both queries.

System 1: System 2:

Rank Q1 Q2

1 X X

2 X -

3 X -

4 X -

5 - -

6 - X

7 - -

8 - -

9 X X

10 X X

Rank Q1 Q2

1 X X

2 X -

3 X -

4 - X

5 X X

6 X -

7 - -

8 - -

9 X -

10 - -

(a) Explain the following evaluation metrics and give results for query Q2 for both

systems.

1. Precision at rank 8.

2. Recall at precision 1

3

.

(b) Give the formula for mean average precision (MAP), and calculating MAP for both

systems.

(c) Consider Q1 for System 1. Compute the interpolated precisions at recall levels 0.5

and 0.8, respectively.

COMP6714 Page 5 of 8

Question 6 (14 marks)

Consider the following documents and the ground-truth of their relevance to the query

{x1, x3}. Give the order that these documents will be ordered under the BIM model

with add 1

2

smoothing? You need to justify your answer.

Document Relevant? x1 x2 x3

D1 T 1 1 1

D2 T 0 1 0

D3 F 1 0 0

D4 T 0 0 1

D5 F 0 1 0

COMP6714 Page 6 of 8

Question 7 (10 marks)

Suppose we have a document collection with an extremely small vocabulary with only

6 words w1, w2, . . . , w6. The following table shows the estimated background language

model p(w|C) using the whole collection of documents (2nd column) and the word counts

for document d1 (3rd column) and d2 (4th column), where c(w, di) is the count of word

w in document di. Let Q = {w1, w2, w3, w4, w5, w6} be a query.

Word p(w|C) c(w, d1) c(w, d2)

w1 0.800 2 7

w2 0.100 3 1

w3 0.025 1 1

w4 0.025 2 1

w5 0.025 2 0

w6 0.025 0 0

(a) Suppose we do not smooth the language model for d1 and d2. Compute the likeli-

hood of the query for both d1 and d2, i.e., p(Q|d1) and p(Q|d2) (Do not compute

the log-likelihood. You should use the scientific notation (e.g., 0.0061 should be

6.1× 10−3) Which document would be ranked higher?

(b) Suppose we now smooth the language model for d1 and d2 using the Jelinek-Mercer

smoothing method with λ = 0.8 (i.e., p(w|d) = λ ·pmle(w|Md)+(1−λ) ·pmle(w|Mc)).

Recompute the likelihood of the query for both d1 and d2, i.e., p(Q|d1) and p(Q|d2)

(Do not compute the log-likelihood. You should use the scientific notation) Which

document would be ranked higher?

COMP6714 Page 7 of 8

Question 8 (14 marks)

Consider the basic architecture of a crawler in the figure below.

(a) Explain why the “content seen?” module is needed before the “Dup URL elim”

module?

(b) Assume that the search engine uses MinHash algorithm to detect near-duplicate

documents. Given a document with following hashed shingles: { 1, 7, 15, 81 }, and

two universal hashing functions:

• h1(x) = (7x+ 1 mod 31) mod 13

• h2(x) = (18x+ 26 mod 31) mod 13

What are the resulting MinHash signatures of the document?

END OF EXAM PAPER

COMP6714 Page 8 of 8

(Family name) (Given name)

Student ID:

THE UNIVERSITY OF NEW SOUTH WALES

Final Exam

COMP6714

Information Retrieval and Web Search

TERM T3, 2020

• Time allowed: 10 minutes reading time + 2 hours

– Exception: students with extra exam time approved by Equitable Learning Services

(ELS) can make submissions after 14:30, 2 December 2020 within their approved

extra time.

• Total number of questions: 8.

• Total number of marks: 100

• Total number of pages: 8 excluding this cover page

• This is an open-book exam. You are allowed to use textbook(s), lecture notes and other

study materials. However, you are not allowed to (1) communicate with anyone else or (2)

use the Internet during the exam.

• Items allowed: UNSW approved calculators.

• You can answer the questions in any order.

• Start each question on a new page.

• Answers must be written in ink on A4 papers and scanned into a PDF file. Alternatively,

you can use any software to directly generate the answers in a PDF file.

Question 1 (10 marks)

Write down the pseudo-code of answering the Boolean query not A and B. You need to

use the function skipTo(id) wherever it is possible.

You can refer to the following skeleton code. Note that it is by no means the complete

pseudo-code, and you should add multiple lines or modify existing line(s) to complete

this task.

Algorithm 1: Q1(p1, p2)

1 answer ← ∅;

2 while . . . do

3 if docID(p1) > docID(p2) then

4 ;

5 else

6 if docID(p1) < docID(p2) then

7 ;

8 else

9 ;

10 return answer ;

COMP6714 Page 1 of 8

Question 2 (14 marks)

Consider applying γ-encoding to a posting list of n document IDs (within the range of

[1, N ]).

Prove that:

• For a value x, its γ-encoded value takes at most 2 log2(x) + 1 bits.

• The compressed posting list (using γ codes on the gaps) takes at most n · log2

(

2N2

n2

)

bits.

COMP6714 Page 2 of 8

Question 3 (14 marks)

(a) Consider a dictionary that contains v terms, and each term has exactly l characters.

Assume we build both a permuterm index and a bi-gram index for the dictionary.

What are the sizes of these two indexes (note, do not include the size of the dic-

tionary and the inverted lists), respectively? You may assume that a pointer (to a

term in the dictionary) is 4-bytes and all terms only contain characters from a to

z.

(b) Consider a query of P*Q*R, where P, Q, and R are a sequence of characters. Briefly

describe how a permuterm index can be used to efficiently answer this query.

(c) The above query can also be answer without using list interaction. Briefly describe

how this can be done, and give a reason why this might be more efficient than the

previous query processing method.

COMP6714 Page 3 of 8

Question 4 (14 marks)

Consider the logarithmic merge algorithm for dynamic index construction. The sub-

indexes created on the disk are: I3, I2, I1, and I0 (note: I0 is the smallest sub-index on

the disk). Assume the current in-memory index is full and needs to be dumped to the

disk.

(a) What are the sub-indexes after dumping the current in-memory index to the disk?

(b) How many sub-indexes will the algorithm create to index a document collection of

size |C| when the memory size is M .

(c) Let |C| = 14 ·M . What is the total number of times sub-indexes are merged? You

need to include merges of a sub-index on the disk and the index in the memory.

COMP6714 Page 4 of 8

Question 5 (10 marks)

The figure below shows the output of two information retrieval systems on the same two

queries in a competitive evaluation. The top 10 ranks are shown. Crosses correspond

to a document which has been judged relevant by a human judge; dashes correspond to

irrelevant documents. Assume that System 1 retrieved all the relevant documents for

both queries.

System 1: System 2:

Rank Q1 Q2

1 X X

2 X -

3 X -

4 X -

5 - -

6 - X

7 - -

8 - -

9 X X

10 X X

Rank Q1 Q2

1 X X

2 X -

3 X -

4 - X

5 X X

6 X -

7 - -

8 - -

9 X -

10 - -

(a) Explain the following evaluation metrics and give results for query Q2 for both

systems.

1. Precision at rank 8.

2. Recall at precision 1

3

.

(b) Give the formula for mean average precision (MAP), and calculating MAP for both

systems.

(c) Consider Q1 for System 1. Compute the interpolated precisions at recall levels 0.5

and 0.8, respectively.

COMP6714 Page 5 of 8

Question 6 (14 marks)

Consider the following documents and the ground-truth of their relevance to the query

{x1, x3}. Give the order that these documents will be ordered under the BIM model

with add 1

2

smoothing? You need to justify your answer.

Document Relevant? x1 x2 x3

D1 T 1 1 1

D2 T 0 1 0

D3 F 1 0 0

D4 T 0 0 1

D5 F 0 1 0

COMP6714 Page 6 of 8

Question 7 (10 marks)

Suppose we have a document collection with an extremely small vocabulary with only

6 words w1, w2, . . . , w6. The following table shows the estimated background language

model p(w|C) using the whole collection of documents (2nd column) and the word counts

for document d1 (3rd column) and d2 (4th column), where c(w, di) is the count of word

w in document di. Let Q = {w1, w2, w3, w4, w5, w6} be a query.

Word p(w|C) c(w, d1) c(w, d2)

w1 0.800 2 7

w2 0.100 3 1

w3 0.025 1 1

w4 0.025 2 1

w5 0.025 2 0

w6 0.025 0 0

(a) Suppose we do not smooth the language model for d1 and d2. Compute the likeli-

hood of the query for both d1 and d2, i.e., p(Q|d1) and p(Q|d2) (Do not compute

the log-likelihood. You should use the scientific notation (e.g., 0.0061 should be

6.1× 10−3) Which document would be ranked higher?

(b) Suppose we now smooth the language model for d1 and d2 using the Jelinek-Mercer

smoothing method with λ = 0.8 (i.e., p(w|d) = λ ·pmle(w|Md)+(1−λ) ·pmle(w|Mc)).

Recompute the likelihood of the query for both d1 and d2, i.e., p(Q|d1) and p(Q|d2)

(Do not compute the log-likelihood. You should use the scientific notation) Which

document would be ranked higher?

COMP6714 Page 7 of 8

Question 8 (14 marks)

Consider the basic architecture of a crawler in the figure below.

(a) Explain why the “content seen?” module is needed before the “Dup URL elim”

module?

(b) Assume that the search engine uses MinHash algorithm to detect near-duplicate

documents. Given a document with following hashed shingles: { 1, 7, 15, 81 }, and

two universal hashing functions:

• h1(x) = (7x+ 1 mod 31) mod 13

• h2(x) = (18x+ 26 mod 31) mod 13

What are the resulting MinHash signatures of the document?

END OF EXAM PAPER

COMP6714 Page 8 of 8