Python代写 - INT2222
Overview Objective To be able to use and analyse readily available Hash Table ADTs. To be able to implement and use recursive sorting algorithms. To be able to use readily available List ADTs. To be able to use readily available List ADTs. To continue practicing the implementation of correct, high quality and well￾documented code. Task 1 • The aim of this question is to assess your knowledge in using an efficient Hash Table and being able to perform operations on it. Task 2 • The aim of this task is to assess your knowledge in computing the number of conflicts and probe chain lengths for a given hash table, as well as your ability to use the resulting statistics to analyse the performance of the hash table. Task 3 • The aim of this question is to assess your knowledge of handling a Hash Table with associated data. Task 4 • The aim of this question is to assess your knowledge of generating meaningful data from our Hash Table and using a recursive sort in descending order. General Information Constraints and assumptions • You are required to use the names provided for internal and external methods, as we will use them in our testing harnesses. • You are required to provide appropriate documentation for each file, class, function, and method you define or modify. • For any method where you are asked to extend the Unit Testing methods in the test harness, you must include at least 3 extra unittest style assertion tests. • Your code cannot use any of the Python functionality that would make the job trivial (like dictionaries), but you are allowed to use any Python list or string functionality you want (unless the question forbids it). Additional Information To get you started, you will find the scaffold and also some supporting files in Tasks 1-4: • - an implementation of the Hash Table ADT using Linear Probe for collision handling. • - an implementation of the List ADT. • - internal representation of our ArrayList. • - skeleton code for unit testing of Tasks 1 and 2. • - skeleton code for unit testing of Tasks 3 and 4. Your code will be tested with an extended version of the test harness provided, so make sure your code can run the harness. To make the testing work, we have made assumptions about the naming of instance variables, so please follow the naming used in the skeletons. Your modules should guard any computation/printing/etc. for the tasks with if __name__ == '__main__':, so they can be safely imported by the testing harness. The harness you have been provided is not exhaustive, so remember to add your own tests as well. Task 1 Add a new class Dictionary in file , which has the following methods: • __init__(self, hash_base: int, table_size: int) -> None, which creates a new Hash Table with the given hash base and initial table size, and uses it to initialise the instance variable self.hash_table. • load_dictionary(self, filename: str, time_limit: int = None) -> int, which reads a file filename containing one word per line, and adds each word to self.hash_table with integer 1 as the associated data, and returns the number of words successfully read into the hash table. If time_limit is specified and loading the dictionary exceedstime_limit seconds, a TimeoutError is immediately raised. • add_word(self, word: str) -> None, which adds the given word to the hash table with integer 1 as the associated data. • find_word(self, word: str) -> bool, which returns True if the word is in the hash table and False otherwise. • delete_word(self, word: str) -> None, which deletes the given word from the hash table. • menu(self) -> None, which displays a menu and allows the user to read a file, add a word, find a word, delete a word or exit. It also handles any bad user input or errors raised by methods in the class appropriately. Testing Add extra tests to for the Dictionary class. Test any methods requiring user input manually. Constraints and assumptions • load_dictionary should ignore any "newline" characters. • load_dictionary does not need to handle exceptions raised by calls to the I/O functions such as a FileNotFoundError. • add_word,find_word, and delete_word should convert the given word to lowercase before adding it. A word cannot be an empty string. • find_word and delete_word should not raise any exceptions. • menu should be called from your if __name__ == '__main__' section and not from your __init__ method. Additional Info • You may encounter encoding errors depending on your system. Please ensure UTF-8 is used when reading the dictionary. • You may add additional utility methods if you wish, especially for modularisation purposes Task 2 Method Creation In • Modify the LinearProbeHashTable class with a method statistics(self) -> Tuple, which returns a tuple (collision_count, probe_total, probe_max, rehash_count), consisting of • o the total number of collisions, o the sum of all the probe chain lengths, o the length of the longest probe chain, o the number of times rehash has been called. Implement a Statistics class in your file with the following methods: • load_statistics(self, hash_base: int, table_size: int, filename: str, max_time: int) -> Tuple, which creates a new dictionary with hash_base and table_size and returns the tuple (words, time, collision_count, probe_total, probe_max, rehash_count)where: o words is the number of words added to the table o time is the time taken for load_dictionary to complete if <= max_time and max_time otherwise o The rest of the tuple elements are as in statistics above. • table_load_statistics(self, max_time:int) -> None, which, for each possible combination of values in the table below, uses load_statistics to time how long it takes for load_dictionary to run. For each combination a line should be printed to output_task2.csv which contains: o The name of the dictionary file read. o The TABLESIZE and b used. o The number of words added to the hash table. o The four counters computed in statistics. o Either the time taken to run load_dictionary, or max_time, whichever is smaller. Combinations Execution and Analysis • Execute table_load_statistics() with max_time = 10 and graph the results from output_task2.csv. • Create the file explanation_task2.pdf which contains the above graph together with a short analysis on: o What values work well, which work poorly, and why. o The relationship between the counters and the runtime. o The length of the longest probe chain and the promise of O(1) time complexity. o Explain why rehash_count is 0 in all runs. Testing Add tests for your statistics method, as well as to for the Statistics class. Background (Tasks 3 & 4) In most large collections of written language, the frequency of a given word in that collection is inversely proportional to its rank in the words. That is to say: the second most common word appears about half as often as the most common word, the third most common word appears about a third as often as the most common word and so on. This is known as Zipf's law. If we count the number of occurrences of each word in such a collection, we can use just the number of occurrences to determine the approximate rank of each word. Taking the number of occurrences of the most common word to be max and the relationship described earlier, we can give a rarity score to each word: • Any word that appears at least max/100 times is considered to be common (Rarity.COMMON). • Any word that appears less than max/1000 times is considered to be rare (Rarity.RARE). • Any word in between (less than max/100 and greater or equal to max/1000) is considered to be uncommon (Rarity.UNCOMMON). • Any word which has never been observed is a misspelling (Rarity.MISSPELT). In the following we will use a hash table to facilitate a frequency analysis on a given text file. Task 3 In create a new class Frequency which defines the following methods: • __init__(self) -> None, which initialises a new Hash Table, as well as an instance of Dictionary which has read in english_large.txt. • add_file(self, filename: str) -> None, which reads each word from filename into the hash table, and its data is the number of times the word has been read into the hash table. • rarity(self, word: str) -> Rarity, which given a word, returns its rarity score as an enumerated value. The class should also keep track of the word with most occurrences using the variable max_word. Edit the __setitem__() method in to call rehash() when the load of the hash table is above 0.5. Testing You must add your own tests for the __init__, add_file and rarity methods in the provided file. Constraints and assumptions • add_file should only add the word if it exists in the dictionary instance in the __init__ method • Variable max_word should be a Tuple containing the (word, frequency) pair. • Convert each word added in add_file to lowercase. • An appropriate initial table size value should be chosen for the hash table. Additional Info • Enumerated values are pretty easy to use in Python, for an example see the Python documentation here: • You may encounter encoding errors depending on your system. Please ensure UTF-8 is used when reading the dictionary. • Importing punctuation from the String class can be helpful for striping punctuation • rehash may not be called, but is more realistic for when the size or number of the files read is unknown. Task 4 In your Frequency class add: • Method ranking(self) -> ArrayList[Tuple], which returns a list of known words (word, frequency) ordered by their frequency count in descending order. The returned list of words must be have been sorted by Quicksort. • Function frequency_analysis() -> None, which creates an instance of class Frequency, adds 215-0.txt to it and generates the ranking list. It should prompt the user for the number of rankings to show. Testing You must add your own tests for the ranking method in the provided file called Constraints and assumptions • In ranking you must copy and insert all your non-empty (key, value) tuples from the Hash Table into an ArrayList. • Complexity documentation is not required for frequency_analysis. • The output of frequency_analysis should include the ranking, the word, frequency and the rarity of the word. • Appropriate validation should be done to ensure the user enters a valid value in frequency_analysis. • In frequency_analysis, if the user enters 5, the 5 most common words and statistics should be shown. Additional Info • You may need to increase the Python recursion limit, by using the following command sys.setrecursionlimit(limit). Note that you need to import the sys module to use this function. • You may encounter encoding errors depending on your system. Please ensure UTF-8 is used when reading the dictionary. • There might be more efficient ways to produce the rankings list, however this is outside the scope of the prac. • frequency_analysis should be called from within the if __name__ == '__main__' section of the file.