CS 240 – Data Structures and Data Management
Module 10: Compression
T. Biedl E. Schost O. Veksler
Based on lecture notes by many previous cs240 instructors
David R. Cheriton School of Computer Science, University of Waterloo
Winter 2021
Outline
Compression
Encoding Basics
Huffman Codes
Run-Length Encoding
Lempel-Ziv-Welch
bzip2
Burrows-Wheeler Transform
Data Storage and Transmission
The problem: How to store and transmit data?
Source text
the original data, string S of characters from the source alphabet ΣS
Coded text
the encoded data, string C of characters from the coded alphabet ΣC
Encoding
an algorithm mapping source texts to coded texts
Decoding
an algorithm mapping coded texts back to their original source text
Notes
source “text” can be any sort of data (not always text)
usually the coded alphabet is just binary: ΣC = {0, 1}
usually S and C are stored as streams
read/write only one character at a time
convenient for handling huge texts
can start processing text while it is still being loaded
input stream supports methods: top(), pop(), isEmpty()
output stream supports methods: append(), reset()
Judging Encoding Schemes
Can measure time/space efficiency of encoding/decoding
algorithms, as for any usual algorithm
What other goals make sense?
reliability
error-correcting codes
security
encryption
size (our main objective in this course)
Encoding schemes that try to minimize the size of the coded
text perform data compression
We will measure the compression ratio:
(|C| ∙ log |ΣC|) / (|S| ∙ log |ΣS|)
Types of Data Compression
Logical vs. Physical
Logical Compression
uses the meaning of the data
only applies to a certain domain (e.g. sound recordings)
Physical Compression
only know physical bits in data, not their meaning
Lossy vs. Lossless
Lossy Compression
achieves better compression ratios
decoding is approximate
exact source text is not recoverable
Lossless Compression
always decodes exactly
Lossy, logical compression is useful
media files: JPEG, MPEG
But we will concentrate on physical, lossless compression
can be safely used for any application
Character Encodings
Definition: a character encoding E maps each character in the source
alphabet to a string in the coded alphabet
E : ΣS → ΣC*
for c ∈ ΣS, E(c) is called the codeword (or code) of c
Character encoding is sometimes called character-by-character encoding
encode one character at a time
Two possibilities
Fixed-length code: all codewords have the same length
Variable-length code: codewords may have different lengths
Fixed Length Codes
Example: ASCII (American Standard Code for Information Interchange), 1963
char in ΣS:            null     start of heading  start of text  ...  0        1        ...  A        B        ...  ~        delete
code:                  0        1                 2              ...  48       49       ...  65       66       ...  126      127
code as binary string: 0000000  0000001           0000010        ...  0110000  0110001  ...  1000001  1000010  ...  1111110  1111111
7 bits to encode 128 possible characters
control codes, spaces, letters, digits, punctuation
A P P L E → (65, 80, 80, 76, 69) → 1000001 1010000 1010000 1001100 1000101
Standard in all computers and often our source alphabet
Not well-suited for non-English text
ISO-8859 extends to 8 bits, handles most Western languages
Other (earlier) fixed-length codes: Baudot code, Murray code
To decode a fixed-length code (say codewords have k bits), we look up each
k-bit pattern in a table
Variable-Length Codes
Overall goal: Find an encoding that is short
Observation: Some alphabet letters occur more often than others
idea: use shorter codes for more frequent characters
example: frequency of letters in typical English text
e 12.70% d 4.25% p 1.93%
t 9.06% l 4.03% b 1.49%
a 8.17% c 2.78% v 0.98%
o 7.51% u 2.76% k 0.77%
i 6.97% m 2.41% j 0.15%
n 6.75% w 2.36% x 0.15%
s 6.33% f 2.23% q 0.10%
h 6.09% g 2.02% z 0.07%
r 5.99% y 1.97%
Variable-Length Codes
Example 1: Morse code
Example 2: UTF-8 encoding of Unicode
there are more than 107,000 Unicode characters
uses 1-4 bytes to encode any Unicode character
Encoding
Assume we have some character encoding E : ΣS → ΣC*
E is a dictionary with keys in ΣS
Typically E would be stored as an array indexed by ΣS

charByChar::encoding(E, S, C)
E: encoding dictionary, S: input stream with characters in ΣS
C: output stream
while S is non-empty
  x ← E.search(S.pop())
  C.append(x)
Example: encode text “WATT” with Morse code
W → ·−−, A → ·−, T → −, so WATT → ·−− ·− − −
Decoding
The decoding algorithm must map ΣC* to ΣS*
The code must be uniquely decodable
false for Morse code as described: the Morse encoding of WATT also decodes as EAJ
Morse code uses an ‘end of character’ pause to avoid this ambiguity
From now on, only consider prefix-free codes
E(c) is not a prefix of E(c′) for any c ≠ c′ ∈ ΣS
Store codes in a trie with the characters of ΣS at the leaves
do not need an end symbol $: codewords are prefix-free by definition
Example: Prefix-free Encoding/Decoding
Code as table:
c ∈ ΣS: ⊔ → 000, A → 01, E → 101, N → 001, O → 100, T → 11
Code as trie: the same code drawn as a binary trie (0 = left branch, 1 = right branch) with the characters of ΣS at the leaves
Encode AN⊔ANT → 01 001 000 01 001 11
Decode 111000001010111 → TO⊔EAT
Decoding of Prefix-Free Codes
PrefixFree::decoding(T, C, S)
T: trie of a prefix-free code, C: input-stream with characters in ΣC
S: output-stream
while C is non-empty
  r ← T.root
  while r is not a leaf
    if C is empty or r has no child labelled C.top()
      return “invalid encoding”
    r ← child of r that is labelled with C.pop()
  S.append(character stored at r)
Run-time: O(|C|)
Any prefix-free code is uniquely decodable
Encoding from the Trie
PrefixFree::encoding(T, S, C)
T: prefix-free code trie, S: input-stream with characters in ΣS
C: output-stream
L ← array of nodes in T indexed by ΣS
for all leaves ℓ in T
  L[character at ℓ] ← ℓ
while S is non-empty
  w ← empty string; v ← L[S.pop()]
  while v is not the root
    w.prepend(character from v to its parent)
    v ← parent(v)
  // now w is the encoding of the popped character
  C.append(w)
Run-time: O(|T| + |C|)
Can encode directly from the trie: walk from each leaf up to the root, collecting the bits in reverse
|T| ∈ O(|ΣS|) if T has no nodes with only one child, so the run-time is then O(|ΣS| + |C|)
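The two routines above can be sketched in Python; the nested-dict trie representation and the helper names below are assumptions for illustration, using the slides' example code table.

```python
# Prefix-free encoding/decoding with a trie (a sketch; the nested-dict
# trie and helper names are illustrative assumptions).
CODE = {' ': '000', 'A': '01', 'E': '101', 'N': '001', 'O': '100', 'T': '11'}

def build_trie(code):
    root = {}
    for ch, word in code.items():
        node = root
        for bit in word:
            node = node.setdefault(bit, {})
        node['char'] = ch              # leaf stores the decoded character
    return root

def encode(s, code):
    # one table lookup per source character
    return ''.join(code[ch] for ch in s)

def decode(bits, root):
    out, node = [], root
    for bit in bits:
        if bit not in node:
            raise ValueError("invalid encoding")
        node = node[bit]               # walk down the trie
        if 'char' in node:             # reached a leaf
            out.append(node['char'])
            node = root                # restart at the root
    if node is not root:
        raise ValueError("invalid encoding")
    return ''.join(out)
```

With the slides' example code, `decode('111000001010111', build_trie(CODE))` recovers "TO EAT" (⊔ written as a space).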
Huffman’s Algorithm: Building the Best Trie
For a given source text S, the best trie can be constructed with Huffman's algorithm
the resulting trie gives the shortest coded text if the coded alphabet is binary, ΣC = {0, 1}
the trie is tailored to the frequencies in that particular S
Example: Huffman Tree Construction
Example text: GREENENERGY, ΣS = {G, R, E, N, Y}
Calculate character frequencies: G: 2, R: 2, E: 4, N: 2, Y: 1
Put each character into its own (single-node) trie
each trie has a weight; initially, the weight equals the character frequency
Repeatedly join the two tries of least weight into a new trie
weight of the new trie = sum of the old trie weights
Construction steps (weights in parentheses):
  Y (1) + G (2) → (G Y) (3)
  R (2) + N (2) → (R N) (4)
  (G Y) (3) + E (4) → ((G Y) E) (7)
  ((G Y) E) (7) + (R N) (4) → final trie (11)
Final Huffman tree (0 = left, 1 = right) gives the codewords:
  G = 000, Y = 001, E = 01, R = 10, N = 11
Compression ratio: 25 / (11 ∙ log 5) ≈ 98%
GREENENERGY → 000 10 01 01 11 01 11 01 10 000 001
These frequencies are not skewed enough to lead to good compression
Huffman Algorithm Summary
For a given source text S, to determine the trie that minimizes the length of C:
1) determine the frequency of each character c ∈ ΣS in S
2) for each c ∈ ΣS, create a trie of height 0 holding only c
   call it the c-trie
3) assign a weight to each trie
   sum of the frequencies of all letters in the trie
   initially, these are just the character frequencies
4) find the two tries with the minimum weight
5) merge these tries with a new interior node
   the new weight is the sum of the merged tries' weights
   this adds one bit to the encoding of each character in the merged tries
6) repeat Steps 4–5 until there is only 1 trie left
   this is T, the final trie used for encoding and decoding
Data structure for making this efficient?
min-ordered heap
step 4 is two delete-min
step 5 is insert
Heap Storing Tries during Huffman Tree Construction
Efficient data structure to store the tries: a min-ordered heap
(key, value) = (trie weight, link to trie)
step 4 is two delete-mins, step 5 is an insert
Huffman’s Algorithm Pseudocode
Huffman::encoding(S, C)
S: input-stream with characters in ΣS, C: output-stream
f ← array indexed by ΣS, initialized to 0
while S is non-empty do increase f[S.pop()] by 1 // get frequencies, O(|S|)
Q ← min-oriented priority queue to store tries
for all c ∈ ΣS with f[c] > 0 // O(|ΣS| log |ΣS|)
  Q.insert(single-node trie for c with weight f[c])
while Q.size() > 1 // O(|ΣS| log |ΣS|)
  T1 ← Q.deleteMin(), w1 ← weight of T1
  T2 ← Q.deleteMin(), w2 ← weight of T2
  Q.insert(trie with T1, T2 as subtries and weight w1 + w2)
T ← Q.deleteMin() // trie for decoding
reset input-stream S
PrefixFree::encoding(T, S, C) // perform actual encoding, O(|ΣS| + |C|)
Total time: O(|ΣS| log |ΣS| + |C|)
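The algorithm maps directly onto Python's heapq; the sketch below uses an arbitrary tie-breaking counter (an assumption not fixed by the slides), so the individual codewords may differ from the worked example, but the total coded length is still optimal.

```python
import heapq
from collections import Counter

# A sketch of Huffman tree construction with a binary heap; the unique
# integer in each heap entry breaks ties between equal weights.
def huffman_code(text):
    freq = Counter(text)
    # heap entries: (weight, tiebreak, trie); a trie is a char or a pair (left, right)
    heap = [(w, i, ch) for i, (ch, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)          # step 4: two delete-mins
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, tiebreak, (t1, t2)))  # step 5: merge
        tiebreak += 1
    code = {}
    def walk(trie, prefix):                      # read codewords off the trie
        if isinstance(trie, tuple):
            walk(trie[0], prefix + '0')
            walk(trie[1], prefix + '1')
        else:
            code[trie] = prefix or '0'           # lone-character corner case
    walk(heap[0][2], '')
    return code
```

For "GREENENERGY" the resulting code always uses 25 bits in total, matching the slides' compression-ratio calculation.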
Huffman Coding Evaluation
Constructed trie is not unique (why?)
So decoding trie must be transmitted along with the coded text
This may make encoding bigger than source text!
Encoding must pass through the stream twice
once to compute frequencies and once to encode
cannot use stream unless it can be reset
Time to compute the trie and encode
O(|ΣS| log |ΣS| + |C|)
Decoding run-time
O(|C|)
The constructed trie is optimal in the sense that
the coded text C is shortest among all prefix-free character encodings with ΣC = {0, 1}
proof omitted
Many variations
give tie-breaking rules, estimate frequencies, adaptively
change encoding, …
Single-Character vs Multi-Character Encoding
Single-character encoding: each source-text character receives one codeword
S = b a n a n a: each of the six characters gets its own codeword
Multi-character encoding: multiple source-text characters can receive one codeword
S = b a n a n a: substrings such as ‘ban’ and ‘ana’ can each receive a single codeword
Run-Length Encoding
RLE is an example of multi-character encoding
Source alphabet and coded alphabet are both binary: ΣS = ΣC = {0, 1}
can be extended to non-binary alphabets
Useful when S has long runs of the same character: 00000 111 0000
Dictionary is uniquely defined by algorithm
no need to store it explicitly
Encoding idea
give the first bit of S (either 0 or 1)
then give a sequence of integers indicating run lengths
do not have to give the bit for each run, since runs alternate
Example: S = 00000 111 0000 becomes: 0, 5, 3, 4
Need to encode the run lengths in binary, but how?
variable-length binary representations cannot be parsed: 0 101 11 100 → 010111100, and we do not know how to split it back into individual run lengths
fixed-length binary encoding (say 16 bits per length) wastes space, bad compression:
0 0000000000000101 0000000000000011 0000000000000100
Prefix-free Encoding for Positive Integers
Use the Elias gamma code to encode a run length k ≥ 1
⌊log k⌋ copies of 0, followed by
the binary representation of k (which always starts with 1)

k  ⌊log k⌋  k in binary  encoding
1     0          1          1
2     1         10         010
3     1         11         011
4     2        100        00100
5     2        101        00101
6     2        110        00110

Easy to decode
(number of 0s + 1) tells you the length of k in binary representation
after the 0s, read that many bits: the binary representation of k (it starts with 1)
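The two rules above can be written out directly; a small sketch:

```python
# A sketch of the Elias gamma code: floor(log k) zeros, then k in binary.
def gamma_encode(k):
    assert k >= 1
    b = bin(k)[2:]                      # binary representation, starts with 1
    return '0' * (len(b) - 1) + b

def gamma_decode(bits, pos=0):
    """Decode one integer starting at position pos; return (k, next position)."""
    zeros = 0
    while bits[pos + zeros] == '0':     # count leading zeros
        zeros += 1
    k = int(bits[pos + zeros:pos + 2 * zeros + 1], 2)  # next zeros+1 bits are k
    return k, pos + 2 * zeros + 1
```

Returning the next position from `gamma_decode` lets a caller decode a concatenation of gamma codes left to right, which is exactly what RLE decoding needs.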
RLE Example: Encoding
S = 11111110010000000000000000000011111111111
runs: seven 1s, two 0s, one 1, twenty 0s, eleven 1s
first bit of S: 1
run lengths: 7, 2, 1, 20, 11
Elias gamma codes: 7 → 00111, 2 → 010, 1 → 1, 20 → 000010100, 11 → 0001011
C = 1 00111 010 1 000010100 0001011 = 10011101010000101000001011
Compression ratio: 26/41 ≈ 63%
RLE Encoding
RLE::encoding(S, C)
S: input-stream of bits, C: output-stream
b ← S.top()
C.append(b)
while S is non-empty do
  k ← 1; S.pop() // initialize run length
  while (S is non-empty and S.top() = b) // compute run length
    k++; S.pop()
  // compute Elias gamma code (binary string K) for k
  K ← empty string
  while (k > 1)
    C.append(0) // leading 0s appended to the output directly
    K.prepend(k mod 2) // K is built from the last digit backwards
    k ← ⌊k/2⌋
  K.prepend(1) // the leading 1 of the binary representation
  C.append(K)
  b ← 1 − b
RLE Example: Decoding
C = 00001101001001010
first bit: b = 0
0001101: three 0s, so k has four bits, 1101 → k = 13 → output thirteen 0s
00100: two 0s, so k has three bits, 100 → k = 4 → output four 1s
1: no 0s, so k has one bit → k = 1 → output one 0
010: one 0, so k has two bits, 10 → k = 2 → output two 1s
S = 00000000000001111011
Recall that (number of 0s + 1) tells you the length of k in binary representation
RLE Decoding
RLE::decoding(C, S)
C: input-stream of bits, S: output-stream
b ← C.pop() // bit-value for the first run
while C is non-empty
  ℓ ← 0 // (length of k in binary) − 1
  while C.pop() = 0 // count 0s; this also consumes the leading 1 of k
    ℓ++
  k ← 1 // the leading 1 of k, just consumed
  for (j = 1 to ℓ) // translate k from binary string to integer
    k ← k ∗ 2 + C.pop()
  // if C runs out of bits then the encoding was invalid
  for (j = 1 to k)
    S.append(b)
  b ← 1 − b // alternate bit-value
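Both RLE routines can be sketched in Python over bit-strings (using strings rather than bit streams is an assumption for readability):

```python
# A sketch of binary RLE with Elias gamma run lengths, mirroring the
# pseudocode above; bits are kept in Python strings for readability.
def rle_encode(s):
    out, i = [s[0]], 0                       # first bit of S
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:   # measure the run
            j += 1
        b = bin(j - i)[2:]                   # run length in binary
        out.append('0' * (len(b) - 1) + b)   # Elias gamma code
        i = j
    return ''.join(out)

def rle_decode(c):
    b, pos, out = c[0], 1, []                # bit-value of the first run
    while pos < len(c):
        zeros = 0
        while c[pos] == '0':                 # zeros give the length of k
            zeros += 1
            pos += 1
        k = int(c[pos:pos + zeros + 1], 2)   # k in binary, starts with 1
        pos += zeros + 1
        out.append(b * k)
        b = '1' if b == '0' else '0'         # runs alternate
    return ''.join(out)
```

Running `rle_encode` on the example S above reproduces the slides' coded text of 26 bits.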
RLE Properties
Variable length encoding
Dictionary is uniquely defined by an algorithm
no need to explicitly store or send dictionary
Best compression is for S = 000…000 of length n
compressed to 2⌊log n⌋ + 2 ∈ o(n) bits
1 bit for the initial bit
⌊log n⌋ zeros to encode the length of the binary representation of n
⌊log n⌋ + 1 digits that represent n itself in binary
Usually not that lucky
no compression until run-length ≥ 6
expansion when run-length = 2 or 4
Method can be adapted to larger alphabet sizes
but then the encoding for each run must also store the character
Method can be adapted to encode only runs of 0
we will need this soon
Used in some image formats (e.g. TIFF)
Outline
Compression
Encoding Basics
Huffman Codes
Run-Length Encoding
Lempel-Ziv-Welch
bzip2
Burrows-Wheeler Transform
Longer Patterns in Input
Huffman and RLE take advantage of frequent or repeated single characters
Observation: certain substrings are much more frequent than others
Examples
English text
most frequent digraphs: TH, ER, ON, AN, RE, HE, IN, ED, ND, HA
most frequent trigraphs: THE, AND, THA, ENT, ION, TIO, FOR, NDE
HTML
markup tags and attributes repeat frequently
Video
repeated background between frames, shifted sub-image
Ingredient 1 for Lempel-Ziv-Welch compression
Encode characters and frequent substrings
no need to know which substrings are frequent
will discover frequent substring as we process text
will encode them as we read the text
dictionary constructed during encoding/decoding, no need to send it with encoding
how?
Adaptive Dictionaries
ASCII, UTF-8, and RLE use fixed dictionaries
same dictionary for any text encoded
no need to pass the dictionary to the decoder
In Huffman, the dictionary is not fixed, but it is static
each text has its own dictionary
the dictionary does not change for the entire encoding/decoding
need to pass dictionary to the decoder
Ingredient 2 for LZW: adaptive dictionary
start with some initial dictionary D0
usually ASCII
at iteration i ≥ 0, Di is used to determine the ith output
after iteration i, update Di to Di+1
a new character combination is inserted
encoder and decoder must both know how dictionary changes
compute dictionary during encoding/decoding
LZW Overview
Start with dictionary D0 for ΣS
usually ΣS = ASCII, codes from 0 to 127
every step adds a multi-character string to the dictionary, using code numbers 128, 129, …
Store the current dictionary in a trie
Output is a list of numbers (codewords)
each number is usually converted to a bit-string of fixed width
e.g. encoding with 12 bits limits code numbers to 4096
Iteration i of encoding with current dictionary Di
example: S = abbbcbbad with the pointer at the final ‘bba’, Di ⊇ {a: 65, ab: 140, bb: 145, bbc: 146}
find the longest substring that starts at the current pointer and is already in the dictionary: ‘bb’
encode ‘bb’ (code 145)
Di+1 = Di.insert(‘bba’, next available code number)
(logic: ‘bba’ would have been useful at iteration i, so it may be useful in the future)
Tries for LZW
Key-value pairs are (string,code)
Trie stores KVP at all nodes (external and internal) except the root
works because a string is inserted only after all its prefixes are inserted
We show code (value) at each node, because the key can be read off from
the edges
Example: for a text over {A, N, S}, after a few insertions the trie might store
(‘A’, 65), (‘N’, 78), (‘S’, 83), (‘AN’, 128), (‘NA’, 129), (‘ANA’, 130), …
LZW Example
Text: A N A N A N A N N A
Start dictionary: ASCII characters, codes from 0 to 127
next inserted code will be 128
variable idx keeps track of the next available code; initialize idx = 128
Add to the dictionary: “string just encoded” + “first character of the next string to be encoded”
inserting the new item is O(1), since the search already stopped at the right node in the trie
Encoding steps:
  encode ‘A’ → 65, insert ‘AN’ with code 128
  encode ‘N’ → 78, insert ‘NA’ with code 129
  encode ‘AN’ → 128, insert ‘ANA’ with code 130
  encode ‘ANA’ → 130, insert ‘ANAN’ with code 131
  encode ‘N’ → 78, insert ‘NN’ with code 132
  encode ‘NA’ → 129
Final output: 65 78 128 130 78 129
Use fixed length (12 bits) per code
12-bit binary string representation for each code:
000001000001 000001001110 000010000000 000010000010 000001001110 000010000001
total of 2^12 = 4096 codes available during encoding
LZW encoding pseudocode
LZW::encoding(S, C)
S: input stream of characters, C: output-stream
initialize dictionary D with ASCII in a trie
idx ← 128
while there is input in S do
  v ← root of trie
  while S is non-empty and v has a child c labelled S.top() // trie search
    v ← c
    S.pop()
  C.append(code number stored at v)
  if S is non-empty // new dictionary entry
    create child of v labelled S.top() with code idx
    idx++
Running time: O(|S|)
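A Python sketch of the encoder; for brevity the trie is replaced by a hash table of strings (an implementation substitution that changes the constants but not the output):

```python
# A sketch of LZW encoding; the slides' trie is replaced by a dict of
# strings keyed by substring, mapping to code numbers.
def lzw_encode(s):
    d = {chr(i): i for i in range(128)}      # D0: ASCII
    idx = 128                                # next available code number
    out, i = [], 0
    while i < len(s):
        j = i + 1
        while j <= len(s) and s[i:j] in d:   # longest prefix already in D
            j += 1
        j -= 1
        out.append(d[s[i:j]])                # encode it
        if j < len(s):                       # insert encoded string + next char
            d[s[i:j + 1]] = idx
            idx += 1
        i = j
    return out
```

On the worked example, `lzw_encode("ANANANANNA")` yields the slides' code sequence 65 78 128 130 78 129.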
LZW Encoder vs Decoder
For decoding, need a dictionary
Construct the dictionary during decoding, but one step behind
at iteration i of decoding we can reconstruct the substring that the encoder inserted into its dictionary at iteration i − 1
the delay is due to not having access to the original text
LZW Decoding Example
Given the encoding C = 65 78 128 130 78 129, decode back to the source text
Build the dictionary adaptively, while decoding
decoding starts with the same initial dictionary (ASCII) as encoding; idx = 128
use an array indexed by code number instead of a trie: we need efficient search by code
we do not need the original text to decode
Let sprev be the string decoded at the previous iteration; after decoding the current string s, add sprev + s[0] to the dictionary
this reconstructs, one iteration late, exactly what the encoder inserted
  iteration 0: 65 → ‘A’; nothing added yet
  iteration 1: 78 → ‘N’; add (‘AN’, 128)
  iteration 2: 128 → ‘AN’; add (‘NA’, 129)
  iteration 3: 130 → ??? — code 130 is not in the dictionary yet!
    the encoder added (‘ANA’, 130) at iteration 2 and used it immediately
    general rule: if the code equals the next unassigned code number, then s = sprev + sprev[0]
    here: s = ‘AN’ + ‘A’ = ‘ANA’; output it and add (‘ANA’, 130)
  iteration 4: 78 → ‘N’; add (‘ANAN’, 131)
  iteration 5: 129 → ‘NA’; add (‘NN’, 132)
Decoded text: A · N · AN · ANA · N · NA = ANANANANNA
LZW Decoding Pseudocode
LZW::decoding(C, S)
C: input-stream of integers, S: output-stream
D ← dictionary that maps {0, …, 127} to ASCII
idx ← 128 // next available code
code ← C.pop()
s ← D[code]
S.append(s)
while there are more codes in C do
  sprev ← s
  code ← C.pop()
  if code < idx
    s ← D[code] // code in D, look up string
  else if code = idx // code not in D yet, reconstruct string
    s ← sprev + sprev[0]
  else Fail // invalid encoding
  S.append(s)
  D.insert(idx, sprev + s[0])
  idx++
Running time: O(|S|)
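A matching Python sketch of the decoder, with the dictionary as a list indexed by code number (storing full strings here, rather than the more compact prefix-code pairs):

```python
# A sketch of LZW decoding; dictionary entries store full strings, and
# appending to the list assigns the next code number automatically.
def lzw_decode(codes):
    d = [chr(i) for i in range(128)]         # codes 0..127 are ASCII
    s = d[codes[0]]
    out = [s]
    for code in codes[1:]:
        prev = s
        if code < len(d):
            s = d[code]                      # code already in dictionary
        elif code == len(d):
            s = prev + prev[0]               # special case: code not in D yet
        else:
            raise ValueError("invalid encoding")
        out.append(s)
        d.append(prev + s[0])                # one step behind the encoder
    return ''.join(out)
```

`lzw_decode([65, 78, 128, 130, 78, 129])` recovers "ANANANANNA", exercising the special case at code 130.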
LZW decoding
To save space, store each new dictionary entry as (code of its longest proper prefix, one character) instead of the full string
e.g. instead of 128: ‘AN’, 129: ‘NA’, 130: ‘ANA’, 131: ‘ANAN’ (wasteful storage), store
128: (65, N), 129: (78, A), 130: (128, A), 131: (130, N)
for each codeword, we can still find the corresponding string in O(|S|) time by following the prefix codes
Lempel-Ziv Family
Lempel-Ziv is a family of adaptive compression algorithms
LZ77 Original version (“sliding window”)
Derivatives: LZSS, LZFG, LZRW, LZP, DEFLATE, . . .
DEFLATE used in (pk)zip, gzip, PNG
LZ78 Second (slightly improved) version
Derivatives: LZW, LZMW, LZAP, LZY, . . .
LZW used in compress, GIF
patent issues
LZW Summary
Encoding is O(|S|) time, uses a trie of encoded substrings to store the
dictionary
Decoding is O(|S|) time, uses an array indexed by code numbers to
store the dictionary
Encoding and decoding need to go through the string only one time and
do not need to see the whole string
can do compression while streaming the text
Works badly if no repeated substrings
dictionary gets bigger, but no new useful substrings inserted
In practice, compression rate is around 45% on English text
Overview of bzip2
Text transform changes input text into a different text
not necessarily shorter
but has properties likely to lead to better compression at a later stage
To achieve better compression, bzip2 chains the following text transforms:
text T0
→ (Burrows-Wheeler transform) → T1
  if T0 has repeated longer substrings, then T1 has long runs of characters
→ (Move-to-front transform) → T2
  if T1 has long runs of characters, then T2 has long runs of zeros and skewed frequencies
→ (Modified RLE) → T3
  if T2 has long runs of zeros, then T3 is shorter and the skewed frequencies remain
→ (Huffman encoding) → T4
  compresses well since the frequencies are skewed
Example: T0 = alfeatsalfalfa → T1 = affs$eflllaaata → T2 = 1 3 0 5 3 4 3 5 0 0 6 0 0 6
Move-to-Front transform
Recall the MTF heuristic
after an element is accessed, move it to array front
A B C D E → search(D) → D A B C E
Use this idea for MTF (move to front) text transformation
MTF Encoding
Source alphabet ΣS with size |ΣS| = m
Store the alphabet in an array, e.g. for ΣS = {A, …, Z}:
index: 0 1 2 3 … 25
char:  A B C D … Z
This gives us the encoding dictionary (single-character encoding)
Code of a character = index of the array cell where the character is currently stored
e.g. initially E(‘B’) = 1, E(‘H’) = 7
Coded alphabet is ΣC = {0, 1, …, m − 1}
Change the dictionary dynamically (like LZW), but unlike LZW
no new items are added to the dictionary
the codewords of one or more letters can change at each iteration
MTF Encoding Example
S = MISSISSIPPI, dictionary initially A B C D … Z (indices 0–25)
  M → 12, move M to front: M A B C D E F G H I J K L N … Z
  I → 9, move I to front: I M A B C … Z
  S → 18, move S to front: S I M A B C … Z
  S → 0 (already at the front)
  I → 1, move I to front
  S → 1, move S to front
  S → 0
  I → 1, move I to front
  P → 16, move P to front
  P → 0
  I → 1
C = 12 9 18 0 1 1 0 1 16 0 1
What does a run in S mean for the coded text C?
zeros in C correspond to runs of consecutive identical characters in S
MTF Decoding Example
C = 12 9 18 0 1 1 0 1 16 0 1
Decoding is similar
start with the same dictionary as encoding
apply the same move-to-front transformation at each iteration
the dictionary undergoes exactly the same transformations as during encoding
no delays: the dictionaries are identical at each encoding and decoding iteration, so we can always decode the original letter
  12 → M, move M to front; 9 → I, move I to front; 18 → S, move S to front; …
S = MISSISSIPPI
Move-to-Front Transform: Properties
S = a f f s $ e f l l l a a a t a → (MTF) → C = 1 3 0 5 3 4 3 5 0 0 6 0 0 6
If a character in S repeats k times, then C has a run of k − 1 zeros
C contains a lot of small numbers and only a few big ones
C has the same length as S, but better properties for encoding
Move-to-Front Encoding/Decoding Pseudocode
MTF::encoding(S, C)
L ← array with ΣS in some pre-agreed, fixed order (e.g. ASCII)
while S is non-empty do
  c ← S.pop()
  i ← index such that L[i] = c
  C.append(i)
  for j = i − 1 down to 0
    swap L[j] and L[j + 1]

MTF::decoding(C, S)
L ← array with ΣS in some pre-agreed, fixed order (e.g. ASCII)
while C is non-empty do
  i ← next integer of C
  S.append(L[i])
  for j = i − 1 down to 0
    swap L[j] and L[j + 1]
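Both routines in Python, over the uppercase alphabet of the example (the list operations cost O(m) per character, matching the pseudocode's swap loop):

```python
import string

# A sketch of MTF encoding/decoding over A..Z, mirroring the pseudocode.
def mtf_encode(s, alphabet=string.ascii_uppercase):
    L = list(alphabet)
    out = []
    for ch in s:
        i = L.index(ch)            # codeword = current index of the character
        out.append(i)
        L.insert(0, L.pop(i))      # move the character to the front
    return out

def mtf_decode(codes, alphabet=string.ascii_uppercase):
    L = list(alphabet)
    out = []
    for i in codes:
        out.append(L[i])
        L.insert(0, L.pop(i))      # same transformation as the encoder
    return ''.join(out)
```

On "MISSISSIPPI" this reproduces the example's code sequence 12 9 18 0 1 1 0 1 16 0 1.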
Move-to-Front Transform Summary
MTF text transform
source alphabet is ΣS with size |ΣS| = m
store the alphabet in an array
code of a character = index of the array cell where the character is stored
coded alphabet is ΣC = {0, 1, …, m − 1}
Dictionary is adaptive
nothing new is added, but the meaning of the codewords changes
MTF is an adaptive text-transform algorithm
it does not compress input
the output has the same length as input
but output has better properties for compression
Burrows-Wheeler Transform
Transformation (not compression) algorithm
transforms source text to a coded text with the same letters but
in different order
source and coded alphabets are the same
if original text had frequently occurring substrings, then
transformed text should have many runs of the same character
more suitable for MTF transformation
Example: S = alfeatsalfalfa$ → (Burrows-Wheeler Transform) → C = affs$eflllaaata
Required: the source text S ends with an end-of-word character $
$ occurs nowhere else in S
Based on cyclic shifts of a string
a cyclic shift of a string X of length n is the concatenation of
X[i+1 … n−1] and X[0 … i], for 0 ≤ i < n
e.g. cdeab is a cyclic shift of abcde
example
alfeatsalfalfa$
lfeatsalfalfa$a
featsalfalfa$al
eatsalfalfa$alf
atsalfalfa$alfe
tsalfalfa$alfea
salfalfa$alfeat
alfalfa$alfeats
lfalfa$alfeatsa
falfa$alfeatsal
alfa$alfeatsalf
lfa$alfeatsalfa
fa$alfeatsalfal
a$alfeatsalfalf
$alfeatsalfalfa
BWT Algorithm and Example
S = alfeatsalfalfa$
Write down all consecutive cyclic shifts
this forms an array of shifts
the last letter in any row is the first letter of the previous row
alfeatsalfalfa$
lfeatsalfalfa$a
featsalfalfa$al
eatsalfalfa$alf
atsalfalfa$alfe
tsalfalfa$alfea
salfalfa$alfeat
alfalfa$alfeats
lfalfa$alfeatsa
falfa$alfeatsal
alfa$alfeatsalf
lfa$alfeatsalfa
fa$alfeatsalfal
a$alfeatsalfalf
$alfeatsalfalfa
the first column spells the original S
BWT Algorithm and Example
S = alfeatsalfalfa$
Array of cyclic shifts
S has ‘alf’ repeated 3 times
3 different shifts start with ‘lf’ and
end with ‘a’
$alfeatsalfalfa
a$alfeatsalfalf
alfa$alfeatsalf
alfalfa$alfeats
alfeatsalfalfa$
atsalfalfa$alfe
eatsalfalfa$alf
fa$alfeatsalfal
falfa$alfeatsal
featsalfalfa$al
lfa$alfeatsalfa
lfalfa$alfeatsa
lfeatsalfalfa$a
salfalfa$alfeat
tsalfalfa$alfea
S = alfeatsalfalfa$
BWT Algorithm and Example
Array of cyclic shifts
Sort (lexicographically) cyclic shifts
strict sorting order due to ‘$’
First column (of course) has many
consecutive character runs
But also the last column has many
consecutive character runs
sort groups ‘lf’ lines together, and
they all end with ‘a’
sorted shifts array
$alfeatsalfalfa
a$alfeatsalfalf
alfa$alfeatsalf
alfalfa$alfeats
alfeatsalfalfa$
atsalfalfa$alfe
eatsalfalfa$alf
fa$alfeatsalfal
falfa$alfeatsal
featsalfalfa$al
lfa$alfeatsalfa
lfalfa$alfeatsa
lfeatsalfalfa$a
salfalfa$alfeat
tsalfalfa$alfea
S = alfeatsalfalfa$
BWT Algorithm and Example
it could happen that another pattern
will interfere with the sort
e.g. ‘hlfd’ would be broken into ‘h’ and ‘lfd’
the longer the repeated pattern,
the less chance of interference
Array of cyclic shifts
Sort (lexicographically) cyclic shifts
strict sorting order due to ‘$’
First column (of course) has many
consecutive character runs
But also the last column has many
consecutive character runs
sort groups ‘lf’ lines together, and
they all end with ‘a’
sorted shifts array
$alfeatsalfalfa
a$alfeatsalfalf
alfa$alfeatsalf
alfalfa$alfeats
alfeatsalfalfa$
atsalfalfa$alfe
eatsalfalfa$alf
fa$alfeatsalfal
falfa$alfeatsal
featsalfalfa$al
lfa$alfeatsalfa
lfalfa$alfeatsa
lfeatsalfalfa$a
salfalfa$alfeat
tsalfalfa$alfea
S = alfeatsalfalfa$
BWT Algorithm and Example
Sorted array of cyclic shifts
First column is useless for encoding
cannot decode it
Last column can be decoded
BWT Encoding
last characters from sorted shifts
i.e. the last column
C = affs$eflllaaata
sorted shifts array
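The encoding just described can be sketched naively (quadratic space, for illustration only; the suffix-array approach on the following slides is what makes it efficient):

```python
def bwt_encode(s):
    # s must end with the unique end-of-word character '$'
    # ('$' sorts before letters in ASCII, matching the slides' order).
    n = len(s)
    shifts = sorted(s[i:] + s[:i] for i in range(n))   # sorted cyclic shifts
    return ''.join(row[-1] for row in shifts)          # last column is C
```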
BWT Fast Encoding: Efficient Sorting
S = alfeatsalfalfa$
i cyclic shift
0 a l f e a t s a l f a l f a $
1 l f e a t s a l f a l f a $ a
2 f e a t s a l f a l f a $ a l
3 e a t s a l f a l f a $ a l f
4 a t s a l f a l f a $ a l f e
5 t s a l f a l f a $ a l f e a
6 s a l f a l f a $ a l f e a t
7 a l f a l f a $ a l f e a t s
8 l f a l f a $ a l f e a t s a
9 f a l f a $ a l f e a t s a l
10 a l f a $ a l f e a t s a l f
11 l f a $ a l f e a t s a l f a
12 f a $ a l f e a t s a l f a l
13 a $ a l f e a t s a l f a l f
14 $ a l f e a t s a l f a l f a
Can refer to a cyclic shift by the start
index in the text, no need to write it out
explicitly
For sorting, letters after ‘$’ do not matter
lfa$alfeatsalfa
alfalfa$alfeats
BWT Fast Encoding: Efficient Sorting
S = alfeatsalfalfa$
i cyclic shift
0 a l f e a t s a l f a l f a $
1 l f e a t s a l f a l f a $ a
2 f e a t s a l f a l f a $ a l
3 e a t s a l f a l f a $ a l f
4 a t s a l f a l f a $ a l f e
5 t s a l f a l f a $ a l f e a
6 s a l f a l f a $ a l f e a t
7 a l f a l f a $ a l f e a t s
8 l f a l f a $ a l f e a t s a
9 f a l f a $ a l f e a t s a l
10 a l f a $ a l f e a t s a l f
11 l f a $ a l f e a t s a l f a
12 f a $ a l f e a t s a l f a l
13 a $ a l f e a t s a l f a l f
14 $ a l f e a t s a l f a l f a
Can refer to a cyclic shift by the start
index in the text, no need to write it out
explicitly
For sorting, letters after ‘$’ do not matter
lfa$alfeatsalfa
salfalfa$alfeat
BWT Fast Encoding: Efficient Sorting
S = alfeatsalfalfa$
i cyclic shift
0 a l f e a t s a l f a l f a $
1 l f e a t s a l f a l f a $ a
2 f e a t s a l f a l f a $ a l
3 e a t s a l f a l f a $ a l f
4 a t s a l f a l f a $ a l f e
5 t s a l f a l f a $ a l f e a
6 s a l f a l f a $ a l f e a t
7 a l f a l f a $ a l f e a t s
8 l f a l f a $ a l f e a t s a
9 f a l f a $ a l f e a t s a l
10 a l f a $ a l f e a t s a l f
11 l f a $ a l f e a t s a l f a
12 f a $ a l f e a t s a l f a l
13 a $ a l f e a t s a l f a l f
14 $ a l f e a t s a l f a l f a
Can refer to a cyclic shift by the start
index in the text, no need to write it out
explicitly
For sorting, letters after ‘$’ do not matter
lfalfa$alfeatsa
lfa$alfeatsalfa
BWT Fast Encoding: Efficient Sorting
S = alfeatsalfalfa$
i cyclic shift
0 a l f e a t s a l f a l f a $
1 l f e a t s a l f a l f a $ a
2 f e a t s a l f a l f a $ a l
3 e a t s a l f a l f a $ a l f
4 a t s a l f a l f a $ a l f e
5 t s a l f a l f a $ a l f e a
6 s a l f a l f a $ a l f e a t
7 a l f a l f a $ a l f e a t s
8 l f a l f a $ a l f e a t s a
9 f a l f a $ a l f e a t s a l
10 a l f a $ a l f e a t s a l f
11 l f a $ a l f e a t s a l f a
12 f a $ a l f e a t s a l f a l
13 a $ a l f e a t s a l f a l f
14 $ a l f e a t s a l f a l f a
Can refer to a cyclic shift by the start
index in the text, no need to write it out
explicitly
For sorting, letters after ‘$’ do not matter
This is the same as sorting suffixes of S
We already know how to do it
exactly as for suffix arrays, with
MSD-Radix-Sort
O(n log n) running time
BWT Fast Encoding: Efficient Sorting
S = alfeatsalfalfa$
i cyclic shift
0 a l f e a t s a l f a l f a $
1 l f e a t s a l f a l f a $ a
2 f e a t s a l f a l f a $ a l
3 e a t s a l f a l f a $ a l f
4 a t s a l f a l f a $ a l f e
5 t s a l f a l f a $ a l f e a
6 s a l f a l f a $ a l f e a t
7 a l f a l f a $ a l f e a t s
8 l f a l f a $ a l f e a t s a
9 f a l f a $ a l f e a t s a l
10 a l f a $ a l f e a t s a l f
11 l f a $ a l f e a t s a l f a
12 f a $ a l f e a t s a l f a l
13 a $ a l f e a t s a l f a l f
14 $ a l f e a t s a l f a l f a
j SA[j] sorted cyclic shifts
0 14 $ a l f e a t s a l f a l f a
1 13 a $ a l f e a t s a l f a l f
2 10 a l f a $ a l f e a t s a l f
3 7 a l f a l f a $ a l f e a t s
4 0 a l f e a t s a l f a l f a $
5 4 a t s a l f a l f a $ a l f e
6 3 e a t s a l f a l f a $ a l f
7 12 f a $ a l f e a t s a l f a l
8 9 f a l f a $ a l f e a t s a l
9 2 f e a t s a l f a l f a $ a l
10 11 l f a $ a l f e a t s a l f a
11 8 l f a l f a $ a l f e a t s a
12 1 l f e a t s a l f a l f a $ a
13 6 s a l f a l f a $ a l f e a t
14 5 t s a l f a l f a $ a l f e a
BWT Fast Encoding: Efficient Sorting
S = alfeatsalfalfa$
Can read BWT encoding from suffix array in O(n) time
i:     0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
S[i]:  a l f e a t s a l f a l f a $
j:     0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
SA[j]: 14 13 10 7 0 4 3 12 9 2 11 8 1 6 5
cyclic shift starts at S[SA[0]] = S[14]
we need the last letter of that cyclic shift, it is at S[13]
C = a
BWT Fast Encoding: Efficient Sorting
S = alfeatsalfalfa$
Can read BWT encoding from suffix array in O(n) time
i:     0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
S[i]:  a l f e a t s a l f a l f a $
j:     0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
SA[j]: 14 13 10 7 0 4 3 12 9 2 11 8 1 6 5
cyclic shift starts at S[SA[1]] = S[13]
we need the last letter of that cyclic shift, it is at S[12]
C = af
BWT Fast Encoding: Efficient Sorting
S = alfeatsalfalfa$
Can read BWT encoding from suffix array in O(n) time
i:     0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
S[i]:  a l f e a t s a l f a l f a $
j:     0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
SA[j]: 14 13 10 7 0 4 3 12 9 2 11 8 1 6 5
cyclic shift starts at S[SA[14]] = S[5]
we need the last letter of that cyclic shift, it is at S[4]
C = affs$eflllaaata
BWT Fast Encoding: Efficient Sorting
j SA[j] sorted cyclic shifts
0 14 $ a l f e a t s a l f a l f a
1 13 a $ a l f e a t s a l f a l f
2 10 a l f a $ a l f e a t s a l f
3 7 a l f a l f a $ a l f e a t s
4 0 a l f e a t s a l f a l f a $
5 4 a t s a l f a l f a $ a l f e
6 3 e a t s a l f a l f a $ a l f
7 12 f a $ a l f e a t s a l f a l
8 9 f a l f a $ a l f e a t s a l
9 2 f e a t s a l f a l f a $ a l
10 11 l f a $ a l f e a t s a l f a
11 8 l f a l f a $ a l f e a t s a
12 1 l f e a t s a l f a l f a $ a
13 6 s a l f a l f a $ a l f e a t
14 5 t s a l f a l f a $ a l f e a
C = a f f s $ e f l l l a a a t a
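The suffix-array reading can be sketched as follows (the suffix array here is built naively by sorting suffixes, standing in for the MSD-radix-sort construction on the earlier slides):

```python
def bwt_from_suffix_array(s):
    n = len(s)
    # Naive suffix array: sort suffix start positions lexicographically.
    sa = sorted(range(n), key=lambda i: s[i:])
    # The cyclic shift starting at sa[j] ends with the letter just before
    # it; for sa[j] = 0 that is the last character, s[-1] = '$'.
    return ''.join(s[i - 1] for i in sa)
```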
BWT Decoding
alfeatsalfalfa$
lfeatsalfalfa$a
featsalfalfa$al
eatsalfalfa$alf
atsalfalfa$alfe
tsalfalfa$alfea
salfalfa$alfeat
alfalfa$alfeats
lfalfa$alfeatsa
falfa$alfeatsal
alfa$alfeatsalf
lfa$alfeatsalfa
fa$alfeatsalfal
a$alfeatsalfalf
$alfeatsalfalfa
the first column is the original S
Unsorted array of cyclic shifts
C = affs$eflllaaata
unsorted shifts array
BWT Decoding
C = affs$eflllaaata
. . . . . . . . . . a
. . . . . . . . . . f
. . . . . . . . . . f
. . . . . . . . . . s
. . . . . . . . . . $
. . . . . . . . . . e
. . . . . . . . . . f
. . . . . . . . . . l
. . . . . . . . . . l
. . . . . . . . . . l
. . . . . . . . . . a
. . . . . . . . . . a
. . . . . . . . . . a
. . . . . . . . . . t
. . . . . . . . . . a
Given C, the last column of the sorted shifts array
Can reconstruct the first column of sorted
shifts array by sorting
first column has exactly the same
characters as the last column
and they must be sorted
sorted shifts array
BWT Decoding
C = affs$eflllaaata
. . . . . . . . . . a , 0
. . . . . . . . . . f , 1
. . . . . . . . . . f , 2
. . . . . . . . . . s , 3
. . . . . . . . . . $ , 4
. . . . . . . . . . e , 5
. . . . . . . . . . f , 6
. . . . . . . . . . l , 7
. . . . . . . . . . l , 8
. . . . . . . . . . l , 9
. . . . . . . . . . a , 1 0
. . . . . . . . . . a , 1 1
. . . . . . . . . . a , 1 2
. . . . . . . . . . t , 1 3
. . . . . . . . . . a , 1 4
Given C, the last column of the sorted shifts array
Can reconstruct the first column of sorted
shifts array by sorting
first column has exactly the same
characters as the last column
and they must be sorted
Also need row number for decoding
sorted shifts array
BWT Decoding
Now have the first and the last columns of
sorted shifts array
use stable sort
equal letters stay in the same order
C = affs$eflllaaata
$ , 4 . . . . . . . a , 0
a , 0 . . . . . . . f , 1
a , 1 0 . . . . . . f , 2
a , 1 1 . . . . . . s , 3
a , 1 2 . . . . . . $ , 4
a , 1 4 . . . . . . e , 5
e , 5 . . . . . . . f , 6
f , 1 . . . . . . . l , 7
f , 2 . . . . . . . l , 8
f , 6 . . . . . . . l , 9
l , 7 . . . . . . . a , 1 0
l , 8 . . . . . . . a , 1 1
l , 9 . . . . . . . a , 1 2
s , 3 . . . . . . . t , 1 3
t , 1 3 . . . . . . a , 1 4
sorted shifts array
BWT Decoding
Now have the first and the last columns of
sorted shifts array
Key for decoding is figuring out where in
the sorted shifts array are the unsorted
rows 0, 1, …
Where is row 0 of unsorted shifts array?
must end with ‘$’
C = affs$eflllaaata
$ , 4 . . . . . . . a , 0
a , 0 . . . . . . . f , 1
a , 1 0 . . . . . . f , 2
a , 1 1 . . . . . . s , 3
a , 1 2 . . . . . . $ , 4
a , 1 4 . . . . . . e , 5
e , 5 . . . . . . . f , 6
f , 1 . . . . . . . l , 7
f , 2 . . . . . . . l , 8
f , 6 . . . . . . . l , 9
l , 7 . . . . . . . a , 1 0
l , 8 . . . . . . . a , 1 1
l , 9 . . . . . . . a , 1 2
s , 3 . . . . . . . t , 1 3
t , 1 3 . . . . . . a , 1 4
row 0
sorted shifts array
BWT Decoding
S = a
Therefore
string S starts with a
C = affs$eflllaaata
Row 0 of unsorted shifts starts with a
$ , 4 . . . . . . . a , 0
a , 0 . . . . . . . f , 1
a , 1 0 . . . . . . f , 2
a , 1 1 . . . . . . s , 3
a , 1 2 . . . . . . $ , 4
a , 1 4 . . . . . . e , 5
e , 5 . . . . . . . f , 6
f , 1 . . . . . . . l , 7
f , 2 . . . . . . . l , 8
f , 6 . . . . . . . l , 9
l , 7 . . . . . . . a , 1 0
l , 8 . . . . . . . a , 1 1
l , 9 . . . . . . . a , 1 2
s , 3 . . . . . . . t , 1 3
t , 1 3 . . . . . . a , 1 4
row 0
sorted shifts array
Where is row 1 of the unsorted shifts array?
BWT Decoding
C = affs$eflllaaata
alfeatsalfalfa$
lfeatsalfalfa$a
featsalfalfa$al
eatsalfalfa$alf
atsalfalfa$alfe
tsalfalfa$alfea
salfalfa$alfeat
alfalfa$alfeats
lfalfa$alfeatsa
falfa$alfeatsal
alfa$alfeatsalf
lfa$alfeatsalfa
fa$alfeatsalfal
a$alfeatsalfalf
$alfeatsalfalfa
unsorted shifts array
In the unsorted shifts array, any row ends
with the first letter of the previous row
unsorted row 1 ends with the
same letter that unsorted row 0
begins with
BWT Decoding
S = a
Therefore
string S starts with a
C = affs$eflllaaata
Row 0 of unsorted shifts starts with a
$ , 4 . . . . . . . a , 0
a , 0 . . . . . . . f , 1
a , 1 0 . . . . . . f , 2
a , 1 1 . . . . . . s , 3
a , 1 2 . . . . . . $ , 4
a , 1 4 . . . . . . e , 5
e , 5 . . . . . . . f , 6
f , 1 . . . . . . . l , 7
f , 2 . . . . . . . l , 8
f , 6 . . . . . . . l , 9
l , 7 . . . . . . . a , 1 0
l , 8 . . . . . . . a , 1 1
l , 9 . . . . . . . a , 1 2
s , 3 . . . . . . . t , 1 3
t , 1 3 . . . . . . a , 1 4
row 0
sorted shifts array
Where is row 1 of the unsorted shifts array?
row 1 of the unsorted shifts array ends with a
BWT Decoding
Multiple rows end with a, which
one is row 1 of unsorted shifts?
$ , 4 . . . . . . . a , 0
a , 0 . . . . . . . f , 1
a , 1 0 . . . . . . f , 2
a , 1 1 . . . . . . s , 3
a , 1 2 . . . . . . $ , 4
a , 1 4 . . . . . . e , 5
e , 5 . . . . . . . f , 6
f , 1 . . . . . . . l , 7
f , 2 . . . . . . . l , 8
f , 6 . . . . . . . l , 9
l , 7 . . . . . . . a , 1 0
l , 8 . . . . . . . a , 1 1
l , 9 . . . . . . . a , 1 2
s , 3 . . . . . . . t , 1 3
t , 1 3 . . . . . . a , 1 4
row 0
sorted shifts array
S = a
Therefore
string S starts with a
C = affs$eflllaaata
Row 0 of unsorted shifts starts with a
Where is row 1 of the unsorted shifts array?
row 1 of the unsorted shifts array ends with a
S = alfeatsalfalfa$
BWT Algorithm and Example
$alfeatsalfalfa
a$alfeatsalfalf
alfa$alfeatsalf
alfalfa$alfeats
alfeatsalfalfa$
atsalfalfa$alfe
eatsalfalfa$alf
fa$alfeatsalfal
falfa$alfeatsal
featsalfalfa$al
lfa$alfeatsalfa
lfalfa$alfeatsa
lfeatsalfalfa$a
salfalfa$alfeat
tsalfalfa$alfea
Consider all patterns in sorted
array that start with ‘a’
0
1
2
3
4
sorted shifts array
S = alfeatsalfalfa$
BWT Algorithm and Example
$alfeatsalfalfa
a$alfeatsalfalf
alfa$alfeatsalf
alfalfa$alfeats
alfeatsalfalfa$
atsalfalfa$alfe
eatsalfalfa$alf
fa$alfeatsalfal
falfa$alfeatsal
featsalfalfa$al
lfa$alfeatsalfa
lfalfa$alfeatsa
lfeatsalfalfa$a
salfalfa$alfeat
tsalfalfa$alfea
Consider all patterns in sorted
array that start with ‘a’
Take their cyclic shifts (by one letter)
$alfeatsalfalfa
lfa$alfeatsalfa
lfalfa$alfeatsa
lfeatsalfalfa$a
tsalfalfa$alfea
Find them in sorted array of cyclic shifts
They have ‘a’ at the end, and are the only rows that have ‘a’ at the end
They appear in the same relative order as before cyclic shift
a$alfeatsalfalf
alfa$alfeatsalf
alfalfa$alfeats
alfeatsalfalfa$
atsalfalfa$alfe
for patterns with the same first letter, a cyclic shift by one letter does not
change their relative sorting order
sorted shifts array
S = alfeatsalfalfa$
BWT Algorithm and Example
Consider all patterns in sorted
array that start with ‘a’
Take their cyclic shifts (by one letter)
$alfeatsalfalfa
lfa$alfeatsalfa
lfalfa$alfeatsa
lfeatsalfalfa$a
tsalfalfa$alfea
Unsorted row 1 is a cyclic shift by 1 letter of unsorted row 0
unsorted row 0 is #4 among all rows starting with ‘a’
unsorted row 1 is #4 among all rows ending with ‘a’
a$alfeatsalfalf
alfa$alfeatsalf
alfalfa$alfeats
alfeatsalfalfa$
atsalfalfa$alfe
$alfeatsalfalfa
a$alfeatsalfalf
alfa$alfeatsalf
alfalfa$alfeats
alfeatsalfalfa$
atsalfalfa$alfe
eatsalfalfa$alf
fa$alfeatsalfal
falfa$alfeatsal
featsalfalfa$al
lfa$alfeatsalfa
lfalfa$alfeatsa
lfeatsalfalfa$a
salfalfa$alfeat
tsalfalfa$alfea
0
10
11
12
14
0
10
11
12
14
row 0
sorted shifts array
S = alfeatsalfalfa$
BWT Algorithm and Example
Consider all patterns in sorted
array that start with ‘a’
Take their cyclic shifts (by one letter)
$alfeatsalfalfa
lfa$alfeatsalfa
lfalfa$alfeatsa
lfeatsalfalfa$a
tsalfalfa$alfea
Unsorted row 1 is a cyclic shift by 1 letter of unsorted row 0
unsorted row 0 is #4 among all rows starting with ‘a’
unsorted row 1 is #4 among all rows ending with ‘a’
a$alfeatsalfalf
alfa$alfeatsalf
alfalfa$alfeats
alfeatsalfalfa$
atsalfalfa$alfe
$alfeatsalfalfa
a$alfeatsalfalf
alfa$alfeatsalf
alfalfa$alfeats
alfeatsalfalfa$
atsalfalfa$alfe
eatsalfalfa$alf
fa$alfeatsalfal
falfa$alfeatsal
featsalfalfa$al
lfa$alfeatsalfa
lfalfa$alfeatsa
lfeatsalfalfa$a
salfalfa$alfeat
tsalfalfa$alfea
0
10
11
12
14
0
10
11
12
14
row 0
sorted shifts array
BWT Decoding
S = a
Multiple rows end with a, which
one is row 1 of unsorted shifts?
C = affs$eflllaaata
$ , 4 . . . . . . . a , 0
a , 0 . . . . . . . f , 1
a , 1 0 . . . . . . f , 2
a , 1 1 . . . . . . s , 3
a , 1 2 . . . . . . $ , 4
a , 1 4 . . . . . . e , 5
e , 5 . . . . . . . f , 6
f , 1 . . . . . . . l , 7
f , 2 . . . . . . . l , 8
f , 6 . . . . . . . l , 9
l , 7 . . . . . . . a , 1 0
l , 8 . . . . . . . a , 1 1
l , 9 . . . . . . . a , 1 2
s , 3 . . . . . . . t , 1 3
t , 1 3 . . . . . . a , 1 4
row 0
Unsorted row 1 is located in
row 12 of the sorted shifts
row 1
S[1] = l
sorted shifts array
BWT Decoding
S = al
C = affs$eflllaaata
$ , 4 . . . . . . . a , 0
a , 0 . . . . . . . f , 1
a , 1 0 . . . . . . f , 2
a , 1 1 . . . . . . s , 3
a , 1 2 . . . . . . $ , 4
a , 1 4 . . . . . . e , 5
e , 5 . . . . . . . f , 6
f , 1 . . . . . . . l , 7
f , 2 . . . . . . . l , 8
f , 6 . . . . . . . l , 9
l , 7 . . . . . . . a , 1 0
l , 8 . . . . . . . a , 1 1
l , 9 . . . . . . . a , 1 2
s , 3 . . . . . . . t , 1 3
t , 1 3 . . . . . . a , 1 4
row 0
Unsorted row 2 is located in row
9 of the sorted shifts
row 1
S[2] = f
row 2
sorted shifts array
BWT Decoding
S = alf
C = affs$eflllaaata
$ , 4 . . . . . . . a , 0
a , 0 . . . . . . . f , 1
a , 1 0 . . . . . . f , 2
a , 1 1 . . . . . . s , 3
a , 1 2 . . . . . . $ , 4
a , 1 4 . . . . . . e , 5
e , 5 . . . . . . . f , 6
f , 1 . . . . . . . l , 7
f , 2 . . . . . . . l , 8
f , 6 . . . . . . . l , 9
l , 7 . . . . . . . a , 1 0
l , 8 . . . . . . . a , 1 1
l , 9 . . . . . . . a , 1 2
s , 3 . . . . . . . t , 1 3
t , 1 3 . . . . . . a , 1 4
row 0
Unsorted row 3 is located in row 6
of the sorted shifts
row 1
S[3] = e
row 2
row 3
sorted shifts array
BWT Decoding
S = alfe
$ , 4 . . . . . . . a , 0
a , 0 . . . . . . . f , 1
a , 1 0 . . . . . . f , 2
a , 1 1 . . . . . . s , 3
a , 1 2 . . . . . . $ , 4
a , 1 4 . . . . . . e , 5
e , 5 . . . . . . . f , 6
f , 1 . . . . . . . l , 7
f , 2 . . . . . . . l , 8
f , 6 . . . . . . . l , 9
l , 7 . . . . . . . a , 1 0
l , 8 . . . . . . . a , 1 1
l , 9 . . . . . . . a , 1 2
s , 3 . . . . . . . t , 1 3
t , 1 3 . . . . . . a , 1 4
row 0
Unsorted row 4 is located in
row 5 of the sorted shifts
row 1
S[4] = a
row 2
row 3
row 4
C = affs$eflllaaata
sorted shifts array
BWT Decoding
$ , 4 . . . . . . . a , 0
a , 0 . . . . . . . f , 1
a , 1 0 . . . . . . f , 2
a , 1 1 . . . . . . s , 3
a , 1 2 . . . . . . $ , 4
a , 1 4 . . . . . . e , 5
e , 5 . . . . . . . f , 6
f , 1 . . . . . . . l , 7
f , 2 . . . . . . . l , 8
f , 6 . . . . . . . l , 9
l , 7 . . . . . . . a , 1 0
l , 8 . . . . . . . a , 1 1
l , 9 . . . . . . . a , 1 2
s , 3 . . . . . . . t , 1 3
t , 1 3 . . . . . . a , 1 4
row 0
row 1
row 2
row 3
row 4
S = alfeatsalfalfa$
row 5
row 6
row 7
row 8
row 9
row 10
row 11
row 12
row 14
row 13
sorted shifts array
BWT Decoding
BWT::decoding(C[0..n−1], S)
C: string of n characters over alphabet ΣC , S: output stream
A ← array of size n // leftmost column
for i = 0 to n − 1
    A[i] ← (C[i], i) // store character and index
stably sort A by character
for j = 0 to n − 1 // find $
    if C[j] = $ then break
repeat
    S.append(character stored in A[j])
    j ← index stored in A[j]
until we have appended $
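A runnable version of this pseudocode (Python's sort is stable, as the algorithm requires):

```python
def bwt_decode(c):
    n = len(c)
    # a[j] = row of C holding the j-th smallest character; this is the
    # first column of the sorted shifts array, with each entry
    # remembering which row of C (last column) it came from.
    a = sorted(range(n), key=lambda i: c[i])  # stable sort by character
    j = c.index('$')       # row of the sorted array whose shift ends with '$'
    out = []
    while True:
        ch = c[a[j]]       # character stored at sorted position j
        out.append(ch)
        j = a[j]           # follow the stored index to the next row
        if ch == '$':
            break
    return ''.join(out)
```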
BWT Overview
Encoding cost
O(n log n) with special sorting algorithm
in practice MSD-Radix-Sort is good enough, but its worst case is Θ(n²)
read encoding from the suffix array
Decoding cost
faster than encoding
O(n + |ΣS|)
Encoding and decoding both use O(n) space
They need all of the text (no streaming possible)
can use on blocks of text (block compression method)
BWT tends to be slower than other methods
But combined with MTF, RLE and Huffman leads to better compression
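A toy end-to-end check of why BWT helps MTF (a sketch combining the two transforms from this module; names illustrative):

```python
def bwt(s):
    # Naive BWT: last column of the sorted cyclic shifts.
    n = len(s)
    return ''.join(row[-1] for row in sorted(s[i:] + s[:i] for i in range(n)))

def mtf(s):
    # Move-to-front over the sorted alphabet of s.
    d = sorted(set(s))
    out = []
    for ch in s:
        i = d.index(ch)
        out.append(i)
        d.pop(i)
        d.insert(0, ch)
    return out

c = bwt("alfeatsalfalfa$")   # long character runs appear in c
codes = mtf(c)               # runs become runs of 0s, ideal for RLE/Huffman
```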
Compression Summary
Huffman
variable-length
single-character
2-pass
60% compression on English text
optimal 01-prefix-code
requires uneven frequencies
rarely used directly
part of pkzip, JPEG, MP3

Run-length encoding
variable-length
multi-character
1-pass
bad on text, good on long runs (e.g., pictures)
requires runs
rarely used directly
fax machines, old picture-formats

Lempel-Ziv-Welch
fixed-length
multi-character
1-pass
45% compression on English text
good on English text
requires repeated substrings
frequently used
GIF, some variants of PDF, Unix compress

bzip2 (uses Burrows-Wheeler)
multi-step
not streamable
70% compression on English text
better on English text
requires repeated substrings
used but slow
bzip2 and variants