Python Assignment - INT2222
Date: 2020-10-27
Overview
Objective
To be able to use and analyse readily available Hash Table ADTs. To be able to implement and use
recursive sorting algorithms. To be able to use readily available List ADTs. To continue practicing the
implementation of correct, high quality and well-documented code.
Task 1
• The aim of this question is to assess your knowledge in using an efficient Hash Table and being
able to perform operations on it.
Task 2
• The aim of this task is to assess your knowledge in computing the number of conflicts and probe
chain lengths for a given hash table, as well as your ability to use the resulting statistics to
analyse the performance of the hash table.
Task 3
• The aim of this question is to assess your knowledge of handling a Hash Table with associated
data.
Task 4
• The aim of this question is to assess your knowledge of generating meaningful data from our
Hash Table and using a recursive sort in descending order.
General Information
Constraints and assumptions
• You are required to use the names provided for internal and external methods, as we will use
them in our testing harnesses.
• You are required to provide appropriate documentation for each file, class, function, and
method you define or modify.
• For any method where you are asked to extend the Unit Testing methods in the test harness,
you must include at least 3 extra unittest style assertion tests.
• Your code cannot use any of the Python functionality that would make the job trivial (like
dictionaries), but you are allowed to use any Python list or string functionality you want (unless
the question forbids it).
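The "at least 3 extra unittest style assertion tests" requirement can be read as follows. This is only an illustrative sketch: the small Dictionary class below is a list-backed stand-in so the example runs on its own; your real tests would import the Dictionary class you write for Task 1.

```python
import unittest

class Dictionary:
    """List-backed stand-in so this sketch is runnable; the real
    Task 1 class wraps LinearProbeHashTable instead."""
    def __init__(self, hash_base: int, table_size: int) -> None:
        self.words = []

    def add_word(self, word: str) -> None:
        self.words.append(word.lower())

    def find_word(self, word: str) -> bool:
        return word.lower() in self.words

    def delete_word(self, word: str) -> None:
        if word.lower() in self.words:
            self.words.remove(word.lower())

class TestDictionaryExtra(unittest.TestCase):
    """Shape of the extra assertion tests: several assertions per
    behaviour, exercising both the True and False paths."""
    def setUp(self) -> None:
        self.dictionary = Dictionary(31, 101)  # example base/size

    def test_add_and_find(self) -> None:
        self.dictionary.add_word("Hello")
        self.assertTrue(self.dictionary.find_word("hello"))   # stored lowercase
        self.assertFalse(self.dictionary.find_word("absent"))

    def test_delete(self) -> None:
        self.dictionary.add_word("word")
        self.dictionary.delete_word("word")
        self.assertFalse(self.dictionary.find_word("word"))

if __name__ == '__main__':
    unittest.main()
```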
Additional Information
To get you started, you will find the scaffold and also some supporting files in Tasks 1-4:
• hash_table.py - an implementation of the Hash Table ADT using Linear Probe for collision
handling.
• list_adt.py - an implementation of the List ADT.
• referential_array.py - internal representation of our ArrayList.
• test_dictionary.py - skeleton code for unit testing of Tasks 1 and 2.
• test_frequency.py - skeleton code for unit testing of Tasks 3 and 4.
Your code will be tested with an extended version of the test harness provided, so make sure your code
can run the harness. To make the testing work, we have made assumptions about the naming of
instance variables, so please follow the naming used in the skeletons. Your modules should guard any
computation/printing/etc. for the tasks with if __name__ == '__main__':, so they can be safely
imported by the testing harness. The harness you have been provided is not exhaustive, so remember to
add your own tests as well.
Task 1
Add a new class Dictionary in file dictionary.py , which has the following methods:
• __init__(self, hash_base: int, table_size: int) -> None, which creates a new Hash
Table with the given hash base and initial table size, and uses it to initialise the instance variable
self.hash_table.
• load_dictionary(self, filename: str, time_limit: int = None) -> int, which reads
a file filename containing one word per line, and adds each word to self.hash_table with
integer 1 as the associated data, and returns the number of words successfully read into the
hash table. If time_limit is specified and loading the dictionary exceeds time_limit seconds, a
TimeoutError is immediately raised.
• add_word(self, word: str) -> None, which adds the given word to the hash table with
integer 1 as the associated data.
• find_word(self, word: str) -> bool, which returns True if the word is in the hash table
and False otherwise.
• delete_word(self, word: str) -> None, which deletes the given word from the hash table.
• menu(self) -> None, which displays a menu and allows the user to read a file, add a word, find
a word, delete a word or exit. It also handles any bad user input or errors raised by methods in
the class appropriately.
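A minimal sketch of the Dictionary class is shown below, assuming the method signatures above. A plain list stands in for the LinearProbeHashTable from hash_table.py so the sketch is self-contained; the timeout and newline handling in load_dictionary are the parts worth noting. The menu method is omitted.

```python
import time

class Dictionary:
    """Sketch of Task 1's Dictionary. The real version would set
    self.hash_table = LinearProbeHashTable(hash_base, table_size);
    a list of (word, data) pairs stands in here."""

    def __init__(self, hash_base: int, table_size: int) -> None:
        self.hash_table = []  # stand-in for LinearProbeHashTable

    def add_word(self, word: str) -> None:
        if word == "":
            raise ValueError("word cannot be an empty string")
        self.hash_table.append((word.lower(), 1))

    def find_word(self, word: str) -> bool:
        return any(key == word.lower() for key, _ in self.hash_table)

    def delete_word(self, word: str) -> None:
        self.hash_table = [(k, v) for k, v in self.hash_table
                           if k != word.lower()]

    def load_dictionary(self, filename: str, time_limit: int = None) -> int:
        """Read one word per line, stripping the newline, and raise
        TimeoutError as soon as time_limit seconds have elapsed."""
        words_read = 0
        start = time.time()
        with open(filename, 'r', encoding='utf-8') as file:
            for line in file:
                if time_limit is not None and time.time() - start > time_limit:
                    raise TimeoutError("load_dictionary exceeded time limit")
                word = line.strip()  # ignore "newline" characters
                if word:
                    self.add_word(word)
                    words_read += 1
        return words_read
```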
Testing
Add extra tests to test_dictionary.py for the Dictionary class. Test any methods requiring user
input manually.
Constraints and assumptions
• load_dictionary should ignore any "newline" characters.
• load_dictionary does not need to handle exceptions raised by calls to the I/O functions such
as a FileNotFoundError.
• add_word, find_word, and delete_word should convert the given word to lowercase before
operating on it. A word cannot be an empty string.
• find_word and delete_word should not raise any exceptions.
• menu should be called from your if __name__ == '__main__' section and not from your
__init__ method.
Additional Info
• You may encounter encoding errors depending on your system. Please ensure UTF-8 is used
when reading the dictionary.
• You may add additional utility methods if you wish, especially for modularisation purposes.
Task 2
Method Creation
In hash_table.py:
• Modify the LinearProbeHashTable class with a method statistics(self) -> Tuple, which
returns a tuple (collision_count, probe_total, probe_max, rehash_count), consisting
of:
o the total number of collisions,
o the sum of all the probe chain lengths,
o the length of the longest probe chain,
o the number of times rehash has been called.
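One plausible way to accumulate these counters inside the linear-probe insertion is sketched below. This is a stripped-down table, not the one in hash_table.py, and the counting convention (one collision per insertion that misses its home slot, probe chain length = number of extra slots examined) is one reading of the spec; check it against your own understanding.

```python
class LinearProbeHashTable:
    """Minimal linear-probe table showing one way to accumulate the
    Task 2 counters; the real class also supports lookup, deletion
    and rehashing."""

    def __init__(self, hash_base: int = 31, table_size: int = 17) -> None:
        self.hash_base = hash_base
        self.table = [None] * table_size
        self.collision_count = 0   # insertions that missed their home slot
        self.probe_total = 0       # sum of all probe chain lengths
        self.probe_max = 0         # longest single probe chain
        self.rehash_count = 0      # times rehash() has been called

    def hash(self, key: str) -> int:
        value = 0
        for char in key:
            value = (value * self.hash_base + ord(char)) % len(self.table)
        return value

    def __setitem__(self, key: str, data: int) -> None:
        position = self.hash(key)
        probe_length = 0
        for _ in range(len(self.table)):
            if self.table[position] is None or self.table[position][0] == key:
                if probe_length > 0:  # we had to probe past the home slot
                    self.collision_count += 1
                    self.probe_total += probe_length
                    self.probe_max = max(self.probe_max, probe_length)
                self.table[position] = (key, data)
                return
            position = (position + 1) % len(self.table)
            probe_length += 1
        raise KeyError("table is full")

    def statistics(self) -> tuple:
        return (self.collision_count, self.probe_total,
                self.probe_max, self.rehash_count)
```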
Implement a Statistics class in your dictionary.py file with the following methods:
• load_statistics(self, hash_base: int, table_size: int, filename: str,
max_time: int) -> Tuple, which creates a new dictionary with hash_base and table_size
and returns the tuple (words, time, collision_count, probe_total, probe_max,
rehash_count), where:
o words is the number of words added to the table
o time is the time taken for load_dictionary to complete if <= max_time and max_time
otherwise
o The rest of the tuple elements are as in statistics above.
• table_load_statistics(self, max_time:int) -> None, which, for each possible
combination of values in the table below, uses load_statistics to time how long it takes for
load_dictionary to run. For each combination a line should be printed to output_task2.csv
which contains:
o The name of the dictionary file read.
o The TABLESIZE and b used.
o The number of words added to the hash table.
o The four counters computed in statistics.
o Either the time taken to run load_dictionary, or max_time, whichever is smaller.
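The time-capping behaviour of load_statistics can be isolated as the small pattern below. `timed_load` and its `loader` parameter are illustrative names: `loader` stands in for Dictionary.load_dictionary, and the fallback word count on timeout is an assumption (the spec does not say what `words` should be when loading times out).

```python
import time

def timed_load(loader, filename: str, max_time: float):
    """Time a load function and cap the reported time at max_time,
    as load_statistics is required to do. `loader` takes (filename,
    time_limit) and returns the number of words read."""
    start = time.time()
    try:
        words = loader(filename, max_time)
        elapsed = min(time.time() - start, max_time)
    except TimeoutError:
        words = 0          # assumption: count lost on timeout
        elapsed = max_time  # report the cap, not the actual time
    return words, elapsed
```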
Combinations
Execution and Analysis
• Execute table_load_statistics() with max_time = 10 and graph the results from
output_task2.csv.
• Create the file explanation_task2.pdf which contains the above graph together with a short
analysis on:
o What values work well, which work poorly, and why.
o The relationship between the counters and the runtime.
o The length of the longest probe chain and the promise of O(1) time complexity.
o Explain why rehash_count is 0 in all runs.
Testing
Add tests for your statistics method, as well as to test_dictionary.py for the Statistics class.
Background (Tasks 3 & 4)
In most large collections of written language, the frequency of a given word in that collection is inversely
proportional to its frequency rank. That is to say:
the second most common word appears about half as often as the most common word, the third most
common word appears about a third as often as the most common word, and so on.
This is known as Zipf's law.
If we count the number of occurrences of each word in such a collection, we can use just the number of
occurrences to determine the approximate rank of each word. Taking the number of occurrences of the
most common word to be max and the relationship described earlier, we can give a rarity score to each
word:
• Any word that appears at least max/100 times is considered to be common (Rarity.COMMON).
• Any word that appears less than max/1000 times is considered to be rare (Rarity.RARE).
• Any word in between (less than max/100 and greater than or equal to max/1000) is considered to
be uncommon (Rarity.UNCOMMON).
• Any word which has never been observed is a misspelling (Rarity.MISSPELT).
In the following we will use a hash table to facilitate a frequency analysis on a given text file.
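The thresholds above translate directly into an enumerated type and a classification function. The member values of the Rarity enum and the `classify` helper name are assumptions for illustration; only the four member names are fixed by the task.

```python
from enum import Enum

class Rarity(Enum):
    """Rarity scores from the Task 3 background; member values are
    arbitrary, only the names matter to the task."""
    COMMON = 0
    UNCOMMON = 1
    RARE = 2
    MISSPELT = 3

def classify(occurrences: int, max_count: int) -> Rarity:
    """Map an observed count to a rarity score using the thresholds
    above; max_count is the count of the most common word."""
    if occurrences == 0:               # never observed: a misspelling
        return Rarity.MISSPELT
    if occurrences >= max_count / 100:
        return Rarity.COMMON
    if occurrences < max_count / 1000:
        return Rarity.RARE
    return Rarity.UNCOMMON             # between the two thresholds
```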
Task 3
In frequency.py create a new class Frequency which defines the following methods:
• __init__(self) -> None, which initialises a new Hash Table, as well as an instance of
Dictionary which has read in english_large.txt.
• add_file(self, filename: str) -> None, which reads each word from filename into the
hash table, where each word's associated data is the number of times it has been read in.
• rarity(self, word: str) -> Rarity, which given a word, returns its rarity score as an
enumerated value.
The class should also keep track of the word with the most occurrences using the instance variable max_word.
Edit the __setitem__() method in hash_table.py to call rehash() when the load of the hash table is
above 0.5.
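The __setitem__ change amounts to a load-factor check before (or after) each insertion. The toy class below shows only that check; `count` and `table_size` are assumed names for the item count and array length in your LinearProbeHashTable, and whether you check before or after inserting is a design choice to document.

```python
class LoadAwareTable:
    """Toy table showing only the Task 3 load-factor check; the real
    change goes inside LinearProbeHashTable.__setitem__."""

    def __init__(self, table_size: int = 8) -> None:
        self.table_size = table_size
        self.count = 0
        self.rehash_count = 0

    def rehash(self) -> None:
        self.table_size *= 2   # the real rehash also reinserts every item
        self.rehash_count += 1

    def __setitem__(self, key, data) -> None:
        self.count += 1                            # (real code: probe + store)
        if self.count / self.table_size > 0.5:     # load above 0.5: grow
            self.rehash()
```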
Testing
You must add your own tests for the __init__, add_file and rarity methods in the provided
test_frequency.py file.
Constraints and assumptions
• add_file should only add a word if it exists in the Dictionary instance created in
__init__.
• Variable max_word should be a Tuple containing the (word, frequency) pair.
• Convert each word added in add_file to lowercase.
• An appropriate initial table size value should be chosen for the hash table.
Additional Info
• Enumerated values are pretty easy to use in Python, for an example see the Python
documentation here: https://docs.python.org/3/library/enum.html.
• You may encounter encoding errors depending on your system. Please ensure UTF-8 is used
when reading the dictionary.
• Importing punctuation from the string module can be helpful for stripping punctuation.
• rehash may not be called, but is more realistic for when the size or number of the files read is
unknown.
Task 4
In your Frequency class add:
• Method ranking(self) -> ArrayList[Tuple], which returns a list of known words (word,
frequency) ordered by their frequency count in descending order. The returned list
must have been sorted using Quicksort.
• Function frequency_analysis() -> None, which creates an instance of class Frequency,
adds 215-0.txt to it and generates the ranking list. It should prompt the user for the number of
rankings to show.
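A descending Quicksort over (word, frequency) tuples might look like the sketch below. The task requires sorting the ArrayList; a plain Python list is used here so the sketch is self-contained, and this out-of-place partitioning is just one variant (an in-place Hoare/Lomuto version works equally well and uses less memory).

```python
def quicksort_descending(items: list) -> list:
    """Recursive quicksort over (word, frequency) tuples, ordering by
    frequency from highest to lowest."""
    if len(items) <= 1:
        return items
    pivot = items[len(items) // 2][1]  # middle element's frequency
    higher = [pair for pair in items if pair[1] > pivot]
    equal = [pair for pair in items if pair[1] == pivot]
    lower = [pair for pair in items if pair[1] < pivot]
    return quicksort_descending(higher) + equal + quicksort_descending(lower)
```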
Testing
You must add your own tests for the ranking method in the provided file called test_frequency.py.
Constraints and assumptions
• In ranking you must copy and insert all your non-empty (key, value) tuples from the Hash
Table into an ArrayList.
• Complexity documentation is not required for frequency_analysis.
• The output of frequency_analysis should include the ranking, the word, the frequency and the
rarity of the word.
• Appropriate validation should be done to ensure the user enters a valid value in
frequency_analysis.
• In frequency_analysis, if the user enters 5, the 5 most common words and statistics should be
shown.
Additional Info
• You may need to increase the Python recursion limit, by using the following command
sys.setrecursionlimit(limit). Note that you need to import the sys module to use this
function.
• You may encounter encoding errors depending on your system. Please ensure UTF-8 is used
when reading the dictionary.
• There might be more efficient ways to produce the rankings list, however this is outside the
scope of the prac.
• frequency_analysis should be called from within the if __name__ == '__main__' section
of the file.