COMP6714 2025T2 Project: Ranked Retrieval

In this project, you are going to implement (using Python3 on CSE Linux machines) a simple search engine that ranks the output documents based on query term coverage, proximity of matched terms, and preservation of query term order. A search query in this project is a list of space-separated search terms; each search term may contain numeric digits or uppercase/lowercase letters, and will not contain any punctuation.

You will need to implement an indexer (to index the files) and a search program (to search the files based on the index generated by your indexer). As a core requirement for this project, you must implement your solution using an inverted index with positional information (for example, the positional index described in Lecture Week 1). You may also implement any additional indexes as appropriate, if you wish.

Given a search query, each matching document from the search result must contain terms that match all the search terms following the term matching rules below (a tokenization sketch illustrating these rules follows the list):

- Search is case insensitive.
- Full stops for abbreviations are ignored, e.g., U.S. and US are the same.
- Hyphenated terms are split unless the first part is fewer than 3 letters, in which case the full term is preserved, e.g., D-Kans, co-author and in-depth are preserved, while set-aside and five-year are split.
- Singular/plural is ignored, e.g., cat, cats, cat's and cats' are all the same; ex-wives and ex-wife are the same.
- Tense is ignored, e.g., breaches, breach, breached and breaching are all the same; co-author and co-authored are the same.
- A sentence can only end with a full stop, a question mark, or an exclamation mark.
- Numbers with decimal places may be omitted from the index, if you wish, as a decimal number is not a valid search term (since '.' is not allowed).
- Numeric tokens such as years and integers should be indexed accordingly and be searchable. Commas in numeric tokens are ignored, e.g., 1,000,000 and 1000000 are the same.
- Except for the cases above, all other punctuation should be treated as a token divider.
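For illustration only, the sketch below shows one possible tokenizer for these rules. It is a minimal sketch, not the required implementation: the helper names (normalize, tokenize) and the regular expressions are our own, and using NLTK's WordNetLemmatizer (with the WordNet data installed) to fold plurals and tense is one reasonable choice among several.

# A minimal tokenizer sketch for the term matching rules above, assuming
# NLTK's WordNetLemmatizer for singular/plural and tense folding. The
# helper names and the exact regular expressions are illustrative choices.
import re
from nltk.stem import WordNetLemmatizer

LEMMATIZER = WordNetLemmatizer()

def normalize(token):
    token = token.lower()
    token = token.replace(".", "")            # abbreviation dots: U.S. -> us
    if token.endswith("'s"):
        token = token[:-2]                    # possessive: cat's -> cat
    token = token.rstrip("'")                 # plural possessive: cats' -> cats
    if re.fullmatch(r"\d{1,3}(?:,\d{3})+", token):
        token = token.replace(",", "")        # 1,000,000 -> 1000000
    token = LEMMATIZER.lemmatize(token, "n")  # plural folding: cats -> cat
    return LEMMATIZER.lemmatize(token, "v")   # tense folding: breached -> breach

def tokenize(line):
    terms = []
    # letters/digits plus '.', ',', an apostrophe and '-' may appear inside
    # a raw token; all other punctuation acts as a token divider
    for raw in re.findall(r"[A-Za-z0-9](?:[A-Za-z0-9.,'-]*[A-Za-z0-9'])?", line):
        if re.fullmatch(r"\d+\.\d+", raw):
            continue                          # decimal numbers may be skipped
        head, hyphen, _ = raw.partition("-")
        if hyphen and len(head) >= 3:
            terms += [normalize(p) for p in raw.split("-") if p]  # set-aside -> set, aside
        else:
            terms.append(normalize(raw))      # co-author, D-Kans stay whole
    return terms

# e.g., with the WordNet data available:
# tokenize("Co-authored set-aside breaches: 1,000,000 cats.")
#   -> ['co-author', 'set', 'aside', 'breach', '1000000', 'cat']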
As a requirement of this project, matching documents are ranked based on the number of query terms they cover, the proximity between those terms, and how well the term order matches the query. Documents that match more query terms, have those terms appear closer together, and preserve the query's term order will be ranked higher. Further details are provided in the Ranking section.

You are provided with approximately 1000 small documents (named with their document IDs) available in /home/cs6714/Public/data. You can find these files by logging into a CSE machine and going to the folder /home/cs6714/Public/data. Your submitted project will be tested against a similar collection of up to 1000 documents (i.e., we may replace some of these documents to avoid any hard-coded solutions).

Your submission must include two main programs: index.py and search.py, as described below. It is your responsibility to submit any other auxiliary Python files if they are needed for the two main programs to work properly. This project will be auto-marked, and will then be checked manually against the other requirements described in this specification (for example, whether a positional index is implemented).

To ensure your project satisfies the input and output formatting requirements, a simple sanity test script is available in /home/cs6714/Public/sanity. You should run the sanity test script before you submit your solution. To run it on a CSE Linux machine, simply go inside the folder that contains your index.py and search.py and type:

bash /home/cs6714/Public/sanity

This will run tests based on the examples presented below. Note that it is only a sanity test, primarily for formatting, and you are expected to test your project more thoroughly.

The Indexer

Your indexer is run by

python3 index.py [folder-of-documents] [folder-of-indexes]

where [folder-of-documents] is the path to the directory for the collection of documents to be indexed and [folder-of-indexes] is the path to the directory where the index file(s) should be created. All the files in [folder-of-documents] should be opened as read-only, as you may not have the write permission for these files. If [folder-of-indexes] does not exist, create a new directory as specified. You may create multiple index files, although too many index files may slow down your performance. The total size of all your generated index files shall not exceed 20MB (which should be plenty for this project).

After the indexing is completed, the indexer outputs the total number of documents, the total number of tokens (after any preprocessing and filtering) to be indexed, and the total number of terms to be indexed. The following example illustrates the required input and output formats:

$ python3 index.py ~cs6714/Public/data ./MyTestIndex
Total number of documents: 1000
Total number of tokens: 268,568
Total number of terms: 259,182
$

Note: each line of the output of index.py ends with one newline ('\n') character; and, except for the total number of documents, your indexer may report different numbers of tokens and terms depending on your preprocessing choices.
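As a concrete illustration of the required positional index, here is a minimal index.py skeleton. It is a sketch under stated assumptions: the single-JSON-file layout and the file name postings.json are our own choices, tokenize() is the helper sketched in the term matching section, and the token/term counting here is only one possible interpretation (the Note above allows yours to differ).

# A minimal index.py sketch: a positional inverted index mapping
# term -> {doc_id -> [0-based token positions]}, saved as one JSON file.
# The file name "postings.json" and the JSON layout are illustrative.
import json
import os
import sys
from collections import defaultdict

def build_index(doc_folder):
    postings = defaultdict(lambda: defaultdict(list))
    doc_ids = sorted(os.listdir(doc_folder))
    for doc_id in doc_ids:
        position = 0                                   # resets per document
        with open(os.path.join(doc_folder, doc_id), encoding="utf-8") as f:
            for line in f:
                for term in tokenize(line):            # sketched earlier
                    postings[term][doc_id].append(position)
                    position += 1
    return doc_ids, postings

if __name__ == "__main__":
    doc_folder, index_folder = sys.argv[1], sys.argv[2]
    doc_ids, postings = build_index(doc_folder)
    os.makedirs(index_folder, exist_ok=True)           # create it if missing
    with open(os.path.join(index_folder, "postings.json"), "w") as f:
        json.dump(postings, f)
    # here, tokens are counted as indexed positions and terms as distinct
    # terms; the Note above leaves the precise definitions to you
    num_tokens = sum(len(pos) for docs in postings.values() for pos in docs.values())
    print(f"Total number of documents: {len(doc_ids)}")
    print(f"Total number of tokens: {num_tokens:,}")
    print(f"Total number of terms: {len(postings):,}")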
The Search

Your search program is run by

python3 search.py [folder-of-indexes]

where [folder-of-indexes] is the path to the directory containing the index file(s) generated by the indexer. After the above command is executed, it accepts a search query from the standard input and outputs the result to the standard output as a sequence of document names (the same as their document IDs), one per line, sorted according to their ranking as described in the Ranking section below. It then continues to accept search queries from the standard input and output the results to the standard output until the end of input (i.e., a Ctrl-D).

Note: since a single query may produce a significant number of output lines, we only list the top 10 lines of output for the query demonstrations below. Your solution should produce all output lines.

The following example illustrates the required input and output formats:

$ python3 search.py ~/Proj/MyTestIndex
Apple
1361
Australia Technology
3454
10
18
105
311
504
742
798
839
882
…
$

Ranking

A document's relevance score is based on three main factors: (1) the proportion of query terms it contains, (2) the average proximity distance between matched terms, and (3) the number of consecutive query term pairs that appear in the same left-to-right order in the document as in the query. Search results are sorted by their total scores in descending order, with ties broken by ascending document ID (e.g., 72 appears before 125).

A proximity distance is defined as the number of terms (ignoring decimal numbers when counting) between consecutive matched query terms in a document, measured from left to right, ignoring any unmatched terms.

For each document, we select the positions of terms that yield the minimum total proximity distance. The average minimum distance, which is the minimum total distance divided by the number of matched query term pairs, is then used in scoring. The matched term combinations are also used to compute the ordered pair value ordered_pairs: each term pair in the same order as the query terms increases ordered_pairs by 1. In cases where multiple term combinations yield the same minimum proximity distance, we select the one with the maximum number of ordered pairs as the ordered_pairs value. ordered_pairs reflects how much of the query term order is preserved in the document and is incorporated into the final score. The total score formula is shown below:

score(d) = α * (#matched_terms / #query_terms) + β * 1/(1 + avg_min_distance) + γ * ordered_pairs

where α = 1.0, β = 1.0 and γ = 0.1. If only one term is matched, the proximity score and the order score are both 0. The parameters α, β, and γ determine the relative importance of term coverage, proximity, and query term order, respectively. You should use the default values provided, but you are welcome to experiment with different values to better understand their impact on ranking behavior.

To elaborate this further, consider an example of six documents with document IDs 1-6:

DocID  Content
1      apple durian cherry bread egg fennel garlic ham
2      bread garlic ham
3      egg bread cherry apple egg fennel ham garlic bread
4      ham garlic bread
5      garlic chili
6      egg apple banana bread

Given a search query of "garlic bread", all documents contain at least one of these two search terms. The document scores are calculated as below:

Query: garlic bread
Expected computation:

DocID  Coverage  Proximity_score  Order_score  Total_score
3      1         1/(1+0)          0.1*1        2.1
4      1         1/(1+0)          0.1*1        2.1
2      1         1/(1+0)          0            2
1      1         1/(1+2)          0            1.33
5      1/2       0                0            0.5
6      1/2       0                0            0.5

The coverage part is straightforward. We use Document 3 to illustrate how the proximity distance is computed. In Document 3, the term bread appears twice. We select the second occurrence because it is adjacent to garlic, resulting in the smallest possible distance of 0. The matched term that yields the minimum proximity is referred to as the closest matching term. For ordered_pairs, Document 3 has ordered_pairs = 1 because the closest matching term pair (garlic bread) appears in the same order as in the query. In contrast, Document 2 also has a proximity distance of 0 between bread and garlic, but since the order is reversed relative to the query, ordered_pairs is 0. Furthermore, since Documents 5 and 6 have the same total score, they are then sorted by their document IDs.

Consider another search query "egg ham bread" for the example documents above. The resulting document rankings are shown below.

Query: egg ham bread
Expected computation:

DocID  Coverage  Proximity_score  Order_score  Total_score
3      1         1/(1+(1+1)/2)    0.1*(1+1)    1.7
1      1         1/(1+(2+3)/2)    0.1*1        1.39
4      2/3       1/(1+1)          0.1*1        1.27
2      2/3       1/(1+1)          0            1.17
6      2/3       1/(1+2/1)        0.1*1        1.1

Note that, in Document 6, the matched term pair (egg bread) is in the same order as in the query, not the opposite, so ordered_pairs = 1.
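To make the computation above concrete, here is a sketch of one way to score a single document given the positions of each query term. It is our own brute-force illustration, not the required algorithm: it enumerates every combination of matched positions, which is adequate for short queries but would need optimization for long ones.

# A sketch of scoring one document under the formula above. It brute-forces
# every combination of matched term positions to find the minimum total
# proximity distance and, among ties, the maximum ordered_pairs. Positions
# are 0-based token offsets such as those produced by the indexer sketch.
from itertools import product

ALPHA, BETA, GAMMA = 1.0, 1.0, 0.1

def score(query_terms, positions):
    # positions: term -> list of this term's positions in the document
    matched = [t for t in query_terms if positions.get(t)]
    coverage = len(matched) / len(query_terms)
    if len(matched) < 2:
        return ALPHA * coverage          # proximity and order scores are 0
    best = None                          # (total_distance, -ordered_pairs)
    for combo in product(*(positions[t] for t in matched)):
        pairs = list(zip(combo, combo[1:]))
        total = sum(abs(b - a) - 1 for a, b in pairs)   # terms in between
        ordered = sum(1 for a, b in pairs if a < b)     # pairs in query order
        if best is None or (total, -ordered) < best:
            best = (total, -ordered)
    avg_min_distance = best[0] / (len(matched) - 1)
    return (ALPHA * coverage
            + BETA / (1 + avg_min_distance)
            + GAMMA * -best[1])

# Document 3 ("egg bread cherry apple egg fennel ham garlic bread") against
# the query "egg ham bread":
# score(["egg", "ham", "bread"], {"egg": [0, 4], "ham": [6], "bread": [1, 8]})
#   -> 1.7, matching the table above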
Using the real documents available in ~cs6714/Public/data, the following examples present the expected ranked output from your search.py; you can study the rankings further by inspecting the content of the corresponding documents.

$ python3 search.py ~/Proj/MyTestIndex
australia technology
3454
10
18
105
311
504
742
798
839
882
…
bank expect distribution
3077
203
1919
5727
5769
4367
4019
875
441
1156
…
$

Since the input is from the standard input, your search engine should be able to accept input redirected from a file, as shown below:

$ cat ~cs6714/Public/test/query6.txt
US finance COMPANY investor
$ python3 search.py ~/Proj/MyTestIndex < ~cs6714/Public/test/query6.txt
1499
1656
2054
5171
3396
5778
1682
714
302
…
$

Displaying Lines Containing Matching Terms

Given a search query starting with '>' and a space, in addition to displaying the matching document IDs, the lines of text that contain the closest matching terms are also displayed according to the following rules (a sketch of a query loop handling this prefix follows the examples below):

- Each matching document ID is displayed with '>' and a space in front, followed by the lines of text that contain the closest matching terms.
- For each matching document, only one line of text that contains its corresponding closest matching term is displayed for each search term.
- Lines of text are displayed according to the order of lines in a document, i.e., output line 1 before line 2.
- In case more than one closest matching term is determined for a search term and they are at different lines, only the first of these lines is displayed. This includes the case when a search query consists of only one search term such that matching terms can be found in multiple lines (refer to the apple example below).

The examples below illustrate some of these rules and the display format.

$ python3 search.py ~/Proj/MyTestIndex
> Apples
> 1361
The department said stocks of fresh apples in cold storage
> AUStralia Technology
> 3454
marketing of high-technology smelting processes invented in
Australia, notably the Siromelt Zinc Fuming Process.
> 10
its Dot Matrix impact technology, including any future
> 18
in Australia, Canada, Brazil and Japan.
> 105
AUSTRALIA nil 75,530
> 311
…
$
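The sketch referenced above shows one possible shape for the search.py main loop: reading queries until EOF and recognizing the '>' prefix. The helpers load_index, rank and matching_lines are hypothetical placeholders for your own index loading, ranking and line lookup; only the input/output behaviour is fixed by this specification.

# A sketch of the search.py main loop. load_index(), rank() and
# matching_lines() are hypothetical placeholders for your own code.
import sys

def main(index_folder):
    index = load_index(index_folder)         # placeholder: your own loader
    for raw in sys.stdin:                    # loop ends at EOF (Ctrl-D)
        query = raw.strip()
        if not query:
            continue
        show_lines = query.startswith("> ")
        if show_lines:
            query = query[2:]
        for doc_id in rank(index, query):    # placeholder: ranked doc IDs
            if show_lines:
                print(">", doc_id)
                # placeholder: one line per search term, in document order
                for line in matching_lines(index, query, doc_id):
                    print(line)
            else:
                print(doc_id)

if __name__ == "__main__":
    main(sys.argv[1])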
Marking

This project is worth 40 points. Your submission will be tested and marked on CSE Linux machines using Python3. Therefore, please make sure you have tested your solution on these machines using Python3 before you submit. You will not receive any marks if your program does not work on CSE Linux machines and only works in other environments such as your own laptop.

Full marks will be awarded to submissions that follow this specification and pass all the test cases. Although we do not measure the runtime speed, your indexing program will be terminated if it does not finish within one minute, and you will receive zero marks for the project (since we cannot get the index generated successfully for further testing); and your search program will be terminated if it does not finish within 10 seconds per search query, and you will receive zero marks for that search query.

Partial Marks

For this project, a search result from your search engine is only considered correct if it contains exactly the same set of document names as the expected answer, in exactly the same order. We will grant you some partial marks based on the F-measure to evaluate how close your result is to the expected answer (in which any correct document names that are in the wrong order will be treated as wrong documents). The F-measure

F-measure = 2 * (precision * recall) / (precision + recall)

will be used to calculate the final score for each test when your search results differ from the expected output. The final score for each test will be calculated by:

[F-measure rounded to 2 decimal places] * [full marks of that test]

The following examples further illustrate how this scheme works. Given the expected answer Ans, suppose that there are 6 results returned from 6 different search programs:

Ans  Search1  Search2  Search3  Search4  Search5  Search6
1    1        3        2        1        1        1
2    2        2        1        2        2        2
3    3        1        4        3        3        3
4                      3        4        5        5
                                5        4        6

Their F-measures, based on their precision and recall, are calculated as below:

Search1 - recall: 0.75 precision: 1.00 fmeasure: 0.86
Search2 - recall: 0.25 precision: 0.33 fmeasure: 0.28
Search3 - recall: 0.50 precision: 0.50 fmeasure: 0.50
Search4 - recall: 1.00 precision: 0.80 fmeasure: 0.89
Search5 - recall: 1.00 precision: 0.80 fmeasure: 0.89
Search6 - recall: 0.75 precision: 0.60 fmeasure: 0.67

If this search query is worth 2 pts, Search6 will get 0.67 * 2 = 1.34 pts.
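For reference, the figures above can be reproduced with the short sketch below. Counting "correct, in-order" documents as the longest common subsequence (LCS) with the expected answer is our interpretation of the wrong-order rule, not published marking code; the worked figures also appear to round recall and precision to 2 decimal places before combining them, which the sketch mirrors.

# A sketch reproducing the partial-marks examples above. The LCS-based
# count of correct in-order documents is our interpretation of the
# wrong-order rule; recall and precision are rounded to 2 decimal places
# before being combined, matching the worked figures.
def lcs_length(a, b):
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[-1][-1]

def f_measure(expected, result):
    correct = lcs_length(expected, result)
    if correct == 0:
        return 0.0
    recall = round(correct / len(expected), 2)
    precision = round(correct / len(result), 2)
    return 2 * precision * recall / (precision + recall)

ans = ["1", "2", "3", "4"]
for name, res in [("Search1", ["1", "2", "3"]),
                  ("Search2", ["3", "2", "1"]),
                  ("Search6", ["1", "2", "3", "5", "6"])]:
    print(name, round(f_measure(ans, res), 2))   # 0.86, 0.28, 0.67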
For search queries starting with '>' (the queries with matching lines displayed), no partial marks will be awarded. In order to get full marks, a search result has to exactly match the expected answer (including the document IDs, the matching lines and their order). Therefore, please check your solution with the provided sanity test before submitting the project (and do not leave this till the last minute). For your information, search queries starting with '>' will contribute in total less than 10 points (out of 40) of this project.

Python3 Libraries

As a requirement of this project, you are not allowed to use any Python libraries other than the Python Standard Library (https://docs.python.org/3.9/library/index.html) and NLTK. For NLTK, you may assume the entire collection (using "all") has been downloaded before the marking process starts, i.e., you should delete or comment out all nltk.download statements, if any, in your code before your project submission. Note that during your project development, you may want to download only the individual required packages rather than "all", as your CSE account has limited space. Please also set the quiet parameter to True in case you forget to remove the nltk.download statements and the download output affects the auto-marking. For example:

nltk.download('package_name', quiet=True)

Submission

Deadline: Friday 1st August, 23:59. The penalty for late submission of assignments will be 5% (of the worth of the assignment) subtracted from the raw mark per day of being late. In other words, earned marks will be lost. No assignments will be accepted later than 5 days after the deadline. All .py files should be stored in a .zip file and submitted via Moodle. The submission entry can be found on Moodle under the Assessment section.

Plagiarism

The work you submit must be your own work. Group submissions will not be allowed. Your program must be entirely your own work. Plagiarism detection software will be used to compare all submissions pairwise (including submissions for similar assignments in previous years, if applicable) and serious penalties will be applied, particularly in the case of repeat offences. Your submissions will also be checked against any relevant code available on the Internet. In particular:

- Do not copy ideas or code from others.
- Do not use a publicly accessible repository or allow anyone to see your code.

Please refer to the Academic Honesty and Plagiarism section under Other Useful Information in the course outline to help you understand what plagiarism is and how it is dealt with at UNSW. Finally, reproducing, publishing, posting, distributing or translating this assignment is an infringement of copyright and will be referred to UNSW Student Conduct and Integrity for action.

© Copyright 2025, Xubo Wong