FIT3155 S2/2021: Assignment 3 (Due midnight 11:55pm on FRIDAY 22 Oct 2021) [Weight: 10 = 2.5 + 2.5 + 5 marks.] Your assignment will be marked on the performance/efficiency of your program. You must write all the code yourself, and should not use any external library routines (that conflict with the core components being assessed. Standard data structures (such as dictionary, list, etc.) are allowed, but you will have to consider their efficiency in places of use. Also, use of bitarray is permitted for this assignment. Follow these procedures while submitting this assignment: The assignment should be submitted online via moodle strictly as follows: All your scripts MUST contain your name and student ID. Use gzip or Winzip to bundle your work into an archive which uses your student ID as the filename. (STRICTLY AVOID UPLOADING .rar ARCHIVES!) – Your archive should extract to a directory which is your student ID. – This directory should contain a subdirectory for each of the three questions, named as: q1/, q2/ and q3/. – Your corresponding scripts and work should be tucked within those subdirectories. Submit your zipped file electronically via Moodle. Academic integrity, plagiarism and collusion Monash University is committed to upholding high standards of honesty and academic integrity. As a Monash student your responsibilities include developing the knowl- edge and skills to avoid plagiarism and collusion. Read carefully the material available at https://www.monash.edu/students/academic/policies/academic-integrity to understand your responsibilities. As per FIT policy, all submissions will be scanned via MOSS. Assignment Questions 1. Towards a string-based proof of an observation you have learnt in this unit: In this question you will implement something really straightforward. Although the exercise itself is easy, the result your program will produce will inform a proof for an observation we made in one of the weeks in this unit. (Feel assured that you will be assessed only for Prepared by: Arun Konagurthu, Monash University your implementation and output. Yet it is encouraged that you reason and analyse the observed behaviour of your output.) Your program takes two inputs: (i) the size of a (small) alphabet, say |A|. For practicality during testing, restrict |A| within the range [1, 5]. Also, assume the characters in the alphabet start from ‘a’ onwards. That is, if |A| = 3, assume the alphabet A is: {‘a’,‘b’,‘c’}. (ii) the length of the string, say N . For practicality during testing, restrict N within the range [1, 10]. Using the two input parameters, |A| and N , write a program to: generate all possible strings of length N from the alphabet A, and for each string, report the number of distinct cyclic rotations that are possible for that string. (Refer to Week 1 tute sheet, questions 4 and 5, for the definition of a cyclic rotation.) Note: Although the inputs (|A| and N) are restricted within a range for testing, your implementation however must be generalizable to any values (≥ 1) of those parameters. Strictly follow the following specification to address this question: Program name: stringtest.py Arguments to your program: Two integers: (i) alphabet size (during testing, restrict to the range [1, 5]) (ii) string length (during testing, restrict to the range [1, 10]). Command line usage of your script: stringtest.py
Output to the terminal the following statistics (in a single line, space separated)
(i) The number of strings of length N from alphabet A with ≥ 2 distinct cyclic rotations. (ii) The number of strings of length N from alphabet A with exactly N distinct cyclic rotations. (iii) The number of strings of length N from alphabet A with exactly 1 distinct cyclic rotation. (iv) Is the number of strings with ≥ 2 distinct cyclic rotations an integer multiple of N? If yes, print true, else print false. Sample outputs (with supporting debug information, which you do not have to report) (SAMPLE 1) Input values: 2 4 Output: 14 12 2 false --------------------------------------------------------------------------------------------- Debug information for |A| = 2 and N = 4: 2 cyclicrotations(aaaa) | cyclicrotations(abaa) | cyclicrotations(baaa) | cyclicrotations(bbaa) aaaa | abaa | baaa | bbaa aaaa | baaa | aaab | baab aaaa | aaab | aaba | aabb aaaa | aaba | abaa | abba nDistinct = 1 | nDistinct = 4 | nDistinct = 4 | nDistinct = 4 cyclicrotations(aaab) | cyclicrotations(abab) | cyclicrotations(baab) | cyclicrotations(bbab) aaab | abab | baab | bbab aaba | baba | aabb | babb abaa | abab | abba | abbb baaa | baba | bbaa | bbba nDistinct = 4 | nDistinct = 2 | nDistinct = 4 | nDistinct = 4 cyclicrotations(aaba) | cyclicrotations(abba) | cyclicrotations(baba) | cyclicrotations(bbba) aaba | abba | baba | bbba abaa | bbaa | abab | bbab baaa | baab | baba | babb aaab | aabb | abab | abbb nDistinct = 4 | nDistinct = 4 | nDistinct = 2 | nDistinct = 4 cyclicrotations(aabb) | cyclicrotations(abbb) | cyclicrotations(babb) | cyclicrotations(bbbb) aabb | abbb | babb | bbbb abba | bbba | abbb | bbbb bbaa | bbab | bbba | bbbb baab | babb | bbab | bbbb nDistinct = 4 | nDistinct = 4 | nDistinct = 4 | nDistinct = 1 --------------------------------------------------------------------------------------------- (SAMPLE 2) Input values: 3 3 Output: 24 24 3 true --------------------------------------------------------------------------------------------- Debug information for |A| = 3 and N = 3: cyclicrotations(aaa) | cyclicrotations(acb) | cyclicrotations(bbc) | cyclicrotations(cba) aaa | acb | bbc | cba aaa | cba | bcb | bac aaa | bac | cbb | acb nDistinct = 1 | nDistinct = 3 | nDistinct = 3 | nDistinct = 3 cyclicrotations(aab) | cyclicrotations(acc) | cyclicrotations(bca) | cyclicrotations(cbb) aab | acc | bca | cbb aba | cca | cab | bbc baa | cac | abc | bcb nDistinct = 3 | nDistinct = 3 | nDistinct = 3 | nDistinct = 3 cyclicrotations(aac) | cyclicrotations(baa) | cyclicrotations(bcb) | cyclicrotations(cbc) aac | baa | bcb | cbc aca | aab | cbb | bcc caa | aba | bbc | ccb nDistinct = 3 | nDistinct = 3 | nDistinct = 3 | nDistinct = 3 cyclicrotations(aba) | cyclicrotations(bab) | cyclicrotations(bcc) | cyclicrotations(cca) aba | bab | bcc | cca baa | abb | ccb | cac aab | bba | cbc | acc nDistinct = 3 | nDistinct = 3 | nDistinct = 3 | nDistinct = 3 cyclicrotations(abb) | cyclicrotations(bac) | cyclicrotations(caa) | cyclicrotations(ccb) abb | bac | caa | ccb bba | acb | aac | cbc bab | cba | aca | bcc nDistinct = 3 | nDistinct = 3 | nDistinct = 3 | nDistinct = 3 cyclicrotations(abc) | cyclicrotations(bba) | cyclicrotations(cab) | cyclicrotations(ccc) abc | bba | cab | ccc bca | bab | abc | ccc cab | abb | bca | ccc nDistinct = 3 | nDistinct = 3 | nDistinct = 3 | nDistinct = 1 cyclicrotations(aca) | cyclicrotations(bbb) | cyclicrotations(cac) | aca | bbb | cac | caa | bbb | acc | aac | bbb | cca | nDistinct = 3 | nDistinct = 1 | nDistinct = 3 | --------------------------------------------------------------------------------------------- Non-examinable component: Systematically generate output values by varying the input parameters |A| and N in the ranges specified above, and see if you can detect a pattern 3 that emerges whenever the output line reports ‘true’. Does that inform any observation we have considered in this unit? If so, reason why to glean a ‘string based’ proof for that key observation we learnt. 2. A prefix-free code for positive integers based on Fibonacci numbers: The se- quence of Fibonacci numbers can be used to construct a uniquely decodable prefix-free code for positive integers. As you know, any Fibonacci number, denoted by Fn, is defined by the second-order recurrence Fn = Fn−1 +Fn−2, starting from F1 = 1 and F2 = 2. This recurrence results in the infinite Fibonacci sequence: {1, 2, 3, 5, 8, 13, 21, . . .}. A Belgian medical doctor, Edouard Zeckendorf, proved that every positive integer N > 0 can be uniquely represented as a sum of one or more distinct Fibonacci numbers, that is, N = Fi1 + Fi2 + Fi3 + . . ., such that the sum does NOT contain any two consecutive Fibonacci numbers from the Fibonacci sequence. For example, 73 = 55 + 13 + 5 = F9 + F6 + F4. Finding such a unique decomposition for any N > 0 translates to a simple greedy algo- rithm. Start from N and iteratively decompose by choosing the largest possible Fibonacci number at each iteration, to identify the terms Fi1 , Fi2 , . . . in a strictly decreasing order (subject to the constraint highlighted in red above). Then, the prefix-free Fibonacci code word for any positive integer N consists of i1 + 1 bits. If some Fibonacci number Fk appears in unique decomposition of N , then the k-th bit of the code word is set to ‘1’ otherwise it is ‘0’. Finally, the (i1 + 1)-th bit is set to ‘1’ to signify the end of the code word. For example, the Fibonacci code word for N = 73 is: FibonacciCode(73) = 0001010011. It is easy to see that this encoding is decodable and prefix-free, because the above con- straint that Zeckendorf defined ensures that one never observes two consecutive Fibonacci numbers in the decomposition of any N . Write a program that accepts an integer MaxN > 0 and generates Fibonacci code words for all integers in the range [1, MaxN]. Strictly follow the following specification to address this question: Program name: fibonaccicode.py Arguments to your program: A positive integer MaxN > 0. Command line usage of your script: fibonaccicode.py Output to terminal: Fibonacci code words of each integer N in the range [1, MaxN], one per line (see output format below). Sample output for MaxN = 10 : 1 11 2 011 3 0011 4 1011 5 00011 6 10011 7 01011 8 000011 4 9 100011 10 010011 3. BWT-based lossless data compression (encoder and decoder) scheme: In this exercise, you will be implementing a simple Burrows-Wheeler Transform (BWT) based encoder and decoder. Ideally, the input to BWT-based encoders should be a string that we want to compress. The general pipeline of the encoding involves reading the original string, computing its BWT, and compressing the BWT using a set of encoding schemes. The decoder on the other hand decodes the BWT and then inverts the BWT to recover the original string. However, in this exercise, to simplify matters for you, your encoder will be reading a BWT directly (rather than computing it from the original string). Similarly, the decoder will only have to recover the encoded BWT (without having to invert it to the original string). Having said this, as a non-examinable target, it would be an extremely satisfactory exercise for your encoder and decoder to accept and recover (respectively) the original string (rather than its BWT). You already know how to construct a BWT of a string and invert it, both in linear time. Your encoder will read a BWT of some original string from an ASCII file, containing all characters strictly in the range [36,126], and with ‘$’ (ASCII value 36) being the smallest (terminal) character in the string. Your encoder will then perform a run-length encoding on the BWT using Huffman code words (see Lecture 9) for encoding characters and Fibonacci code words (see Question 2 above) for encoding any positive integer. What is run-length encoding? As the name suggests, run-length encoding of a string (here, BWT) is based on encoding the lengths of runs of individual characters as they appear, when scanning the string from left to right. The observed characters and their corresponding run-lengths can be encoded as tuples of the form 〈character, run-length〉. For example, the run-length encoding of this BWT string aaac$bb results in the tuples: 〈a, 3〉, i.e. a appears over a run of length=3, followed by 〈c, 1〉, i.e. c appears over a run of length=1, followed by 〈$, 1〉, i.e. $ appears over a run of length=1, followed by 〈b, 2〉, i.e. b appears over a run of length=2. To generate a fully-decodable binary stream, your encoder encodes (as a continuous stream of bits) the following pieces of information (in that order): (a) The length of the BWT, encoded using its Fibonacci code word. (b) The number of distinct BWT characters, encoded using its Fibonacci code word. (c) For each distinct character in the BWT (in any order of these characters): the 7-bit ASCII code word of that character, the length of its constructed Huffman code word, where the length is encoded using its Fibonacci code word, and the Huffman code word of that character. (d) For each run-length encoded tuple of the BWT (in left-to-right order): the Huffman code word of the character being encoded, and the Fibonacci code word of its run-length. Strictly follow the following specification to address this question. Write a BWT-based encoder and decoder using the following specification. For dealing with packed binary input/output, you are free to use Python’s bitarray. 5 ENCODER SPEC: [3 marks] Program name: mybzip.py Arguments to your program: An input file containg the BWT of some original string. (Note: You do not have to compute the BWT in this exercise.) Command line usage of your script: mybzip.py Output file name: encodedBWT.bin Output format: The output is a binary stream of bits that losslessly (run- length) encodes the input BWT, by concatenating component code words for all pieces of information listed above. Encoding example: If the INPUT FILE contained the BWT string: "b$aaaaaaaaaa" the OUTPUT FILE will contain the following stream of bits: 10101100111100001111110001001100010010001101001101111010011 The above binary stream encodes the following pieces of information ------------------------------------------------------------------------ Length(BWT) = 12 => FibonacciCode(12) = 101011 nUniqChars = 3 (i.e., {‘a’,‘b’,‘$’}) => FibonacciCode(3) = 0011 Encoding unique characters in the BWT and their constructed... ...Huffman code words: {‘a’ = 1, ‘b’ = 00, ‘$’ = 01}: ASCII(‘a’)= 1100001 FibonacciCode(codelen=1) = 11 HuffmanCode(‘a’)= 1 ASCII(‘b’)= 1100010 FibonacciCode(codelen=2) = 011 HuffmanCode(‘b’)= 00 ASCII(‘$’)= 0100100 FibonacciCode(codelen=2) = 011 HuffmanCode(‘$’)= 01 Run-length encoded tuples of the BTW string b$aaaaaaaaaa <‘b’,1> => HuffmanCode(‘b’) = 00 FibonacciCode(runlen=1) = 11 <‘$’,1> => HuffmanCode(‘$’) = 01 FibonacciCode(runlen=1) = 11 <‘a’,10>=> HuffmanCode(‘a’) = 1 FibonacciCode(runlen=10)= 010011 ENCODING COMPLETE!------------------------------------------------------- Illustration of the packed (binary) representation of the OUTPUT: |--------|--------|--------|--------|--------|--------|--------|--------| |Byte-1 |Byte-2 |Byte-3 |Byte-4 |Byte-5 |Byte-6 |Byte-7 |Byte-8 | |--------|--------|--------|--------|--------|--------|--------|--------| |10101100|11110000|11111100|01001100|01001000|11010011|01111010|011ooooo| |--------|--------|--------|--------|--------|--------|--------|--------| Note: The binary output has all the bits packed into bytes. Therefore, depending on the length of your stream, you may have to pad the encoded stream with additional bits so that the length of the encoded binary stream becomes a perfect multiple of 8. In the example above, the encoded stream is 59 bits long. So, the last byte (Byte-8) is padded with five extra ‘0’ bits (shown above as ‘o’s, for readability). 6 DECODER SPEC: [2 marks] The decoder is a separate program that should be able to recover the BWT from the encoded file, without any additional information besides what it can gather from the binary file it is reading. Program name: mybunzip.py Arguments to your program: Any output file generated by your encoder (above). Command line usage of your script: mybunzip.py Output file name: decodedBWT.txt If the input file encodedBWT.bin file contained the following bit stream: 10101100111100001111110001001100010010001101001101111010011 then, the output decodedBWT.txt file will contain: b$aaaaaaaaaa -=o0o=- END -=o0o=- 7