CISC-235 Data Structures W21
Assignment 2
February 18, 2021
How to Submit
You need to submit your python code for all questions. You must comment
your code. Name your folder as your netid-A2, zip the folder and upload it
to OnQ. Check whether your code is runable before submitting.
1 Binary Search Tree: 5+5+10 points
Binary search tree (BST) is a special type of binary tree which satisfies the
binary search property, i.e., the key in each node must be greater than any key
stored in the left sub-tree, and less than any key stored in the right sub-tree.
See a sample binary search tree below:
Figure 1: Binary Search Tree 1
Your task is to create a class named BinarySearchTree satisfying the
following requirements (you can create more methods/attributes if needed):
1
1) Must have an insert function (i.e., insert a new node). You may assume
that the values to be stored in the tree are integers.
2) Must have a getTotalHeight function that computes the sum of the
heights of all nodes in the tree. Your getTotalHeight function should run
in O(n) time.
3) Must have a getWeightBalanceFactor function. The weight balance
factor of a binary tree is a measure of how well-balanced it is, i.e., how
evenly its nodes are distributed between the left and right subtrees of each
node. Note that, WeightBalanceFactor is different from the balance factor
introduced in lecture.
More specifically, weight balance factor of a binary tree is the maximum
value of the absolute value of difference between the number of nodes in
left subtree and in right subtree for all nodes in the tree.
For example, given the binary search tree shown in Figure 2, your function
should return 2. Why? We need to calculate the difference between the
number of nodes in left and right subtree for all nodes in the tree. For root
node 6, this value is 1 (3 nodes in right subtree - 2 nodes in left subtree).
Similarly, for node 4, 9, 5, 8, 7, we have the difference at each node being
1, 2, 0, 1, 0. Thus the return value, is 2, which is the max of 1, 1, 2, 0, 1,
0 .
Figure 2: Binary Search Tree 2
4) You also need to write test code for the three requested functions. You can
use the following code https://www.geeksforgeeks.org/print-binary-tree-2-dimensions/
to print the structure of your tree before and after inserting a new node.
Note, you can also write your own method to print a tree.
2 A Simple Search Engine: Background
You probably use Google at least 10 times a day to dig through the vast amount
of information on the web. But do you know how search engines store web pages
2
and retrieve the most relevant web pages given a query? In this assignment,
we will be trying out one technique that is used by most search engines via
building an index of the words on a page. For each word we encounter on
a web page, we will count how many times the word appears in the page and
keep a list of all locations for each word (the locations can help search engines
locate words). For instance, given the following text: “A data structure is a
way to store and organize data” If we scan the text and index all words in the
text (start from position 0, letters are lower cased), we will have Table 1.
Table 1: Word Index Table for “A data structure is a way to store and organize
data”
Word Count Positions
a 2 0,4
data 2 1,10
structure 1 2
is 1 3
way 1 5
to 1 6
store 1 7
and 1 8
organize 1 9
The key idea of search engines is to identify the most relevant documents to
a given search query. A search query is a sequence of words representing what
you would like to search for, e.g., ”how to sort dict in Python”.
A simple way to measure the relevance of a web page to a given query is to
count how many times each word in the query appears in each candidate web
page. For instance, given “how to sort dict in Python” as the search query,
if we find a web page mentioning the word ”sort” and ”dict” multiple times,
the web page is probably more relevant to the query than a web page does not
contain any words in query. Of-course, you can think of other ways to compute
a relevance score that measures the similarity between a query and a web page.
In the rest of this assignment, you will be guided to create a simple search engine
step by step.
2.1 AVLTreeMap: 20 points
Your first task is to implement a new data structure named AVLTreeMap.
AVLTreeMap is just an AVL tree in disguise, where each node in the tree con-
tains a “key-value” pair. We know that AVL tree is a special type of search
tree. Similarily, in AVLTreeMap, the position of each node is determined based
on the key. The key can be an integer or a string, and the value paired with
each key can be a string or a list.
Your AVLTreeMap class should satisfy the following requirements (you can
create more instance variables/functions if you want):
3
1) Each node has left and right child, key, value, and height of the node for
adjusting tree structure.
2) Must have a get(key) function that returns the value if the given key
exists in the AVL tree. Return null if the key isn’t in the tree, or the
associated value if it is.
3) Must have a put(key, value) function (i.e., insert a new key-value pair
into the tree). Your put function should adjust the structure in order to
maintain a valid AVL trees, i.e., perform rotations if needed and update
node height if needed. You can assume that we do not have key-value
pairs with the same key.
4) Create an AVLTreeMap by inserting the following key-value pairs in order:
15-bob, 20-anna, 24-tom, 10-david, 13-david, 7-ben, 30-karen, 36-erin, 25-
david. Use the above AVLTreeMap test your get and put function.
2.2 WebPageIndex: 20 points
Write a WebPageIndex class that will contain the index representation of
a web page. This class will help you transfer a web page (text file in this
assignment) into an AVLTreeMap, where each key refers to each word appearing
in the document, and each value represents a list containing the positions of this
word in the file.
Your WebPageIndex class should satisfy the following requirements (you can
create more instance variables/functions if you want):
1) Implement a init (self, file) function that initilizes an instance of Web-
PageIndex from a txt file. Must have an attribute specifying the path of
the file used for initilizing the WebPageIndex instance. Lower case all
words appearing in the input file.
2) Implement a getCount(s) function that returns the number of times the
word s appeared on the page.
3) Test your WebPageIndex class using data/doc1.txt.
2.3 WebpagePriorityQueue: 30 points
Your task is to write a heap-based implementation of a PriorityQueue named
WebpagePriorityQueue. It should contain an array-based list to hold the
data items in the priority queue. Your WebpagePriorityQueue should be a
maxheap. The most relevant web page given a specific query will have the
highest priority value so that this web page can be retrieved first.
Your WebpagePriorityQueue class should satisfy the following requirements
(you can create more instance variables/functions if you want):
4
1) Must have an initilization function which takes a query (string) and a set
of WebpageIndex instances as input. This function will create a max heap
where each node in the max heap represents a WebpageIndex instance.
The priority of a WebpageIndex instance is defined as the sum of the
word counts of the page for the words in the query. For instance,
given a query “data structures”, and a WebpageIndex instance, the prior-
ity value of this web page should be: sum(number of times “data” appears
in the page + number of times “structures” appears in the page).
2) Must contain an attribute specifying the query string that determines the
current WebpagePriorityQueue.
3) Must have a peek function. Return the WebpageIndex with the highest
priority in the WebpagePriorityQueue, without removing it.
4) Must have a poll function. Remove and return the WebpageIndex with
the highest priority in the WebpagePriorityQueue.
5) Must have a reheap function which takes a new query as input and reheap
the WebpagePriorityQueue. This is required as the priority value of each
web page is query-dependent.
2.4 Implement a simple web search engine: 10 points
Create a python file named ”processQueries.py”. This script implements a sim-
ple web search engine.
First, implement a function named readFiles, which takes a folder path as
input, and returns a list of WebPageIndex instances, representing the txt files
in the given input folder.
In the main function, given a query file (each line represents one search
query), create a WebpagePriorityQueue instance to process queries in the query
file in a loop. For every subsequent search query, use the new query to re-
heap your collection of WebPageIndex instances in the created WebpagePriori-
tyQueue instance. Then prints out the matching filenames in order from best to
worst match until there are no matches(no words in the query appear in the txt
file) or the user specified limit (e.g., return at most 3 matched files) is reached.
Test your web search engine using the provided data folder and query file.
5
学霸联盟