CS 240 – Data Structures and Data Management
Module 10: Compression
T. Biedl E. Schost O. Veksler
Based on lecture notes by many previous cs240 instructors
David R. Cheriton School of Computer Science, University of Waterloo
Winter 2021
Outline
Compression
Encoding Basics
Huffman Codes
Run-Length Encoding
Lempel-Ziv-Welch
bzip2
Burrows-Wheeler Transform
Data Storage and Transmission
The problem: How to store and transmit data?
Source text
the original data, string S of characters from the source alphabet ΣS
Coded text
the encoded data, string C of characters from the coded alphabet ΣC
Encoding
an algorithm mapping source texts to coded texts
Decoding
an algorithm mapping coded texts back to their original source text
Notes
source “text” can be any sort of data (not always text)
usually the coded alphabet is just binary: ΣC = {0, 1}
usually S and C are stored as streams
read/write only one character at a time
convenient for handling huge texts
can start processing text while it is still being loaded
input stream supports methods: top(), pop(), isEmpty()
output stream supports methods: append(), reset()
Judging Encoding Schemes
Can measure time/space efficiency of encoding/decoding
algorithms, as for any usual algorithm
What other goals make sense?
reliability
error-correcting codes
security
encryption
size (our main objective in this course)
Encoding schemes that try to minimize the size of the coded
text perform data compression
We will measure the compression ratio:
(|C| ∙ log |ΣC|) / (|S| ∙ log |ΣS|)
Types of Data Compression
Logical vs. Physical
Logical Compression
uses the meaning of the data
only applies to a certain domain (e.g. sound recordings)
Physical Compression
only know physical bits in data, not their meaning
Lossy vs. Lossless
Lossy Compression
achieves better compression ratios
decoding is approximate
exact source text is not recoverable
Lossless Compression
always decodes exactly
Lossy, logical compression is useful
media files: JPEG, MPEG
But we will concentrate on physical, lossless compression
can be safely used for any application
Character Encodings
Definition: a character encoding E maps each character in the source
alphabet to a string in the coded alphabet
E : ΣS → ΣC*
for c ∈ ΣS, E(c) is called the codeword (or code) of c
Character encoding is sometimes called character-by-character encoding
encode one character at a time
Two possibilities
Fixed-length code: all codewords have the same length
Variable-length code: codewords may have different lengths
Fixed Length Codes
Example: ASCII (American Standard Code for Information Interchange), 1963
char in ΣS:            null     start of heading  start of text  ...  0        1        ...  A        B        ...  ~        delete
code:                  0        1                 2              ...  48       49       ...  65       66       ...  126      127
code as binary string: 0000000  0000001           0000010        ...  0110000  0110001  ...  1000001  1000010  ...  1111110  1111111
7 bits to encode 128 possible characters
control codes, spaces, letters, digits, punctuation
A P P L E → (65, 80, 80, 76, 69) → 1000001 1010000 1010000 1001100 1000101
Standard in all computers and often our source alphabet
Not well-suited for non-English text
ISO-8859 extends to 8 bits, handles most Western languages
Other (earlier) fixed-length codes: Baudot code, Murray code
To decode a fixed-length code (say codewords have k bits), we look up each
k-bit pattern in a table
Variable-Length Codes
Overall goal: Find an encoding that is short
Observation: Some alphabet letters occur more often than others
idea: use shorter codes for more frequent characters
example: frequency of letters in typical English text
e 12.70% d 4.25% p 1.93%
t 9.06% l 4.03% b 1.49%
a 8.17% c 2.78% v 0.98%
o 7.51% u 2.76% k 0.77%
i 6.97% m 2.41% j 0.15%
n 6.75% w 2.36% x 0.15%
s 6.33% f 2.23% q 0.10%
h 6.09% g 2.02% z 0.07%
r 5.99% y 1.97%
Variable-Length Codes
Example 1: Morse code
Example 2: UTF-8 encoding of Unicode
there are more than 107,000 Unicode characters
uses 1-4 bytes to encode any Unicode character
Encoding
Assume we have some character encoding E : ΣS → ΣC*
E is a dictionary with keys in ΣS
Typically E would be stored as an array indexed by ΣS

charByChar::encoding(E, S, C)
E: encoding dictionary, S: input stream with characters in ΣS
C: output stream
while S is non-empty
  x ← E.search(S.pop())
  C.append(x)
Example: encode text “WATT” with Morse code
W → ·−−, A → ·−, T → −, so WATT → ·−− ·− − −
Decoding
The decoding algorithm must map ΣC* to ΣS*
The code must be uniquely decodable
false for Morse code as described: the Morse encoding of WATT also decodes as EAJ
Morse code uses an ‘end of character’ pause to avoid this ambiguity
From now on, only consider prefix-free codes
E(c) is not a prefix of E(c′) for any c ≠ c′ ∈ ΣS
Store codes in a trie with the characters of ΣS at the leaves
do not need an end symbol $: codewords are prefix-free by definition
Example: Prefix-free Encoding/Decoding
Code as table:
c ∈ ΣS: ⊔ → 000, A → 01, E → 101, N → 001, O → 100, T → 11
Code as trie: the same code drawn as a binary trie (0 = left branch, 1 = right branch) with the characters of ΣS at the leaves
Encode AN⊔ANT → 01 001 000 01 001 11
Decode 111000001010111 → TO⊔EAT
Decoding of Prefix-Free Codes
PrefixFree::decoding(T, C, S)
T: trie of a prefix-free code, C: input-stream with characters in ΣC
S: output-stream
while C is non-empty
  r ← T.root
  while r is not a leaf
    if C is empty or r has no child labelled C.top()
      return “invalid encoding”
    r ← child of r that is labelled with C.pop()
  S.append(character stored at r)
Run-time: O(|C|)
Any prefix-free code is uniquely decodable
Encoding from the Trie
PrefixFree::encoding(T, S, C)
T: prefix-free code trie, S: input-stream with characters in ΣS
C: output-stream
L ← array of nodes in T indexed by ΣS
for all leaves ℓ in T
  L[character at ℓ] ← ℓ
while S is non-empty
  w ← empty string; v ← L[S.pop()]
  while v is not the root
    w.prepend(character from v to its parent)
    v ← parent(v)
  // now w is the encoding of the popped character
  C.append(w)
Run-time: O(|T| + |C|)
Can encode directly from the trie: walk from each leaf up to the root, collecting the bits in reverse
|T| ∈ O(|ΣS|) if T has no nodes with only one child, so the run-time is then O(|ΣS| + |C|)
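The two routines above can be sketched in Python; the nested-dict trie representation and the helper names below are assumptions for illustration, using the slides' example code table.

```python
# Prefix-free encoding/decoding with a trie (a sketch; the nested-dict
# trie and helper names are illustrative assumptions).
CODE = {' ': '000', 'A': '01', 'E': '101', 'N': '001', 'O': '100', 'T': '11'}

def build_trie(code):
    root = {}
    for ch, word in code.items():
        node = root
        for bit in word:
            node = node.setdefault(bit, {})
        node['char'] = ch              # leaf stores the decoded character
    return root

def encode(s, code):
    # one table lookup per source character
    return ''.join(code[ch] for ch in s)

def decode(bits, root):
    out, node = [], root
    for bit in bits:
        if bit not in node:
            raise ValueError("invalid encoding")
        node = node[bit]               # walk down the trie
        if 'char' in node:             # reached a leaf
            out.append(node['char'])
            node = root                # restart at the root
    if node is not root:
        raise ValueError("invalid encoding")
    return ''.join(out)
```

With the slides' example code, `decode('111000001010111', build_trie(CODE))` recovers "TO EAT" (⊔ written as a space).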
Huffman’s Algorithm: Building the Best Trie
For a given source text S, the best trie can be constructed with Huffman's algorithm
the resulting trie gives the shortest coded text if the coded alphabet is binary, ΣC = {0, 1}
the trie is tailored to the frequencies in that particular S
Example: Huffman Tree Construction
Example text: GREENENERGY, ΣS = {G, R, E, N, Y}
Calculate character frequencies: G: 2, R: 2, E: 4, N: 2, Y: 1
Put each character into its own (single-node) trie
each trie has a weight; initially, the weight equals the character frequency
Repeatedly join the two tries of least weight into a new trie
weight of the new trie = sum of the old trie weights
Construction steps (weights in parentheses):
  Y (1) + G (2) → (G Y) (3)
  R (2) + N (2) → (R N) (4)
  (G Y) (3) + E (4) → ((G Y) E) (7)
  ((G Y) E) (7) + (R N) (4) → final trie (11)
Final Huffman tree (0 = left, 1 = right) gives the codewords:
  G = 000, Y = 001, E = 01, R = 10, N = 11
Compression ratio: 25 / (11 ∙ log 5) ≈ 98%
GREENENERGY → 000 10 01 01 11 01 11 01 10 000 001
These frequencies are not skewed enough to lead to good compression
Huffman Algorithm Summary
For a given source text S, to determine the trie that minimizes the length of C:
1) determine the frequency of each character c ∈ ΣS in S
2) for each c ∈ ΣS, create a trie of height 0 holding only c
   call it the c-trie
3) assign a weight to each trie
   sum of the frequencies of all letters in the trie
   initially, these are just the character frequencies
4) find the two tries with the minimum weight
5) merge these tries with a new interior node
   the new weight is the sum of the merged tries' weights
   this adds one bit to the encoding of each character in the merged tries
6) repeat Steps 4–5 until there is only 1 trie left
   this is T, the final trie used for encoding and decoding
Data structure for making this efficient?
min-ordered heap
step 4 is two delete-min
step 5 is insert
Heap Storing Tries during Huffman Tree Construction
Efficient data structure to store the tries: a min-ordered heap
(key, value) = (trie weight, link to trie)
step 4 is two delete-mins, step 5 is an insert
Huffman’s Algorithm Pseudocode
Huffman::encoding(S, C)
S: input-stream with characters in ΣS, C: output-stream
f ← array indexed by ΣS, initialized to 0
while S is non-empty do increase f[S.pop()] by 1 // get frequencies, O(|S|)
Q ← min-oriented priority queue to store tries
for all c ∈ ΣS with f[c] > 0 // O(|ΣS| log |ΣS|)
  Q.insert(single-node trie for c with weight f[c])
while Q.size() > 1 // O(|ΣS| log |ΣS|)
  T1 ← Q.deleteMin(), w1 ← weight of T1
  T2 ← Q.deleteMin(), w2 ← weight of T2
  Q.insert(trie with T1, T2 as subtries and weight w1 + w2)
T ← Q.deleteMin() // trie for decoding
reset input-stream S
PrefixFree::encoding(T, S, C) // perform actual encoding, O(|ΣS| + |C|)
Total time: O(|ΣS| log |ΣS| + |C|)
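The algorithm maps directly onto Python's heapq; the sketch below uses an arbitrary tie-breaking counter (an assumption not fixed by the slides), so the individual codewords may differ from the worked example, but the total coded length is still optimal.

```python
import heapq
from collections import Counter

# A sketch of Huffman tree construction with a binary heap; the unique
# integer in each heap entry breaks ties between equal weights.
def huffman_code(text):
    freq = Counter(text)
    # heap entries: (weight, tiebreak, trie); a trie is a char or a pair (left, right)
    heap = [(w, i, ch) for i, (ch, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)          # step 4: two delete-mins
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, tiebreak, (t1, t2)))  # step 5: merge
        tiebreak += 1
    code = {}
    def walk(trie, prefix):                      # read codewords off the trie
        if isinstance(trie, tuple):
            walk(trie[0], prefix + '0')
            walk(trie[1], prefix + '1')
        else:
            code[trie] = prefix or '0'           # lone-character corner case
    walk(heap[0][2], '')
    return code
```

For "GREENENERGY" the resulting code always uses 25 bits in total, matching the slides' compression-ratio calculation.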
Huffman Coding Evaluation
Constructed trie is not unique (why?)
So decoding trie must be transmitted along with the coded text
This may make encoding bigger than source text!
Encoding must pass through the stream twice
once to compute frequencies and once to encode
cannot use stream unless it can be reset
Time to compute the trie and encode
O(|ΣS| log |ΣS| + |C|)
Decoding run-time
O(|C|)
The constructed trie is optimal in the sense that
the coded text C is shortest among all prefix-free character encodings with ΣC = {0, 1}
proof omitted
Many variations
give tie-breaking rules, estimate frequencies, adaptively
change encoding, …
Single-Character vs Multi-Character Encoding
Single-character encoding: each source-text character receives one codeword
S = b a n a n a: each of the six characters gets its own codeword
Multi-character encoding: multiple source-text characters can receive one codeword
S = b a n a n a: substrings such as ‘ban’ and ‘ana’ can each receive a single codeword
Run-Length Encoding
RLE is an example of multi-character encoding
Source alphabet and coded alphabet are both binary: ΣS = ΣC = {0, 1}
can be extended to non-binary alphabets
Useful when S has long runs of the same character: 00000 111 0000
Dictionary is uniquely defined by algorithm
no need to store it explicitly
Encoding idea
give the first bit of S (either 0 or 1)
then give a sequence of integers indicating run lengths
do not have to give the bit for each run, since runs alternate
Example: S = 00000 111 0000 becomes: 0, 5, 3, 4
Need to encode the run lengths in binary, but how?
variable-length binary representations cannot be parsed: 0 101 11 100 → 010111100, and we do not know how to split it back into individual run lengths
fixed-length binary encoding (say 16 bits per length) wastes space, bad compression:
0 0000000000000101 0000000000000011 0000000000000100
Prefix-free Encoding for Positive Integers
Use the Elias gamma code to encode a run length k ≥ 1
⌊log k⌋ copies of 0, followed by
the binary representation of k (which always starts with 1)

k  ⌊log k⌋  k in binary  encoding
1     0          1          1
2     1         10         010
3     1         11         011
4     2        100        00100
5     2        101        00101
6     2        110        00110

Easy to decode
(number of 0s + 1) tells you the length of k in binary representation
after the 0s, read that many bits: the binary representation of k (it starts with 1)
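The two rules above can be written out directly; a small sketch:

```python
# A sketch of the Elias gamma code: floor(log k) zeros, then k in binary.
def gamma_encode(k):
    assert k >= 1
    b = bin(k)[2:]                      # binary representation, starts with 1
    return '0' * (len(b) - 1) + b

def gamma_decode(bits, pos=0):
    """Decode one integer starting at position pos; return (k, next position)."""
    zeros = 0
    while bits[pos + zeros] == '0':     # count leading zeros
        zeros += 1
    k = int(bits[pos + zeros:pos + 2 * zeros + 1], 2)  # next zeros+1 bits are k
    return k, pos + 2 * zeros + 1
```

Returning the next position from `gamma_decode` lets a caller decode a concatenation of gamma codes left to right, which is exactly what RLE decoding needs.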
RLE Example: Encoding
S = 11111110010000000000000000000011111111111
runs: seven 1s, two 0s, one 1, twenty 0s, eleven 1s
first bit of S: 1
run lengths: 7, 2, 1, 20, 11
Elias gamma codes: 7 → 00111, 2 → 010, 1 → 1, 20 → 000010100, 11 → 0001011
C = 1 00111 010 1 000010100 0001011 = 10011101010000101000001011
Compression ratio: 26/41 ≈ 63%
RLE Encoding
RLE::encoding(S, C)
S: input-stream of bits, C: output-stream
b ← S.top()
C.append(b)
while S is non-empty do
  k ← 1; S.pop() // initialize run length
  while (S is non-empty and S.top() = b) // compute run length
    k++; S.pop()
  // compute Elias gamma code (binary string K) for k
  K ← empty string
  while (k > 1)
    C.append(0) // leading 0s appended to the output directly
    K.prepend(k mod 2) // K is built from the last digit backwards
    k ← ⌊k/2⌋
  K.prepend(1) // the leading 1 of the binary representation
  C.append(K)
  b ← 1 − b
RLE Example: Decoding
C = 00001101001001010
first bit: b = 0
0001101: three 0s, so k has four bits, 1101 → k = 13 → output thirteen 0s
00100: two 0s, so k has three bits, 100 → k = 4 → output four 1s
1: no 0s, so k has one bit → k = 1 → output one 0
010: one 0, so k has two bits, 10 → k = 2 → output two 1s
S = 00000000000001111011
Recall that (number of 0s + 1) tells you the length of k in binary representation
RLE Decoding
RLE::decoding(C, S)
C: input-stream of bits, S: output-stream
b ← C.pop() // bit-value for the first run
while C is non-empty
  ℓ ← 0 // (length of k in binary) − 1
  while C.pop() = 0 // count 0s; this also consumes the leading 1 of k
    ℓ++
  k ← 1 // the leading 1 of k, just consumed
  for (j = 1 to ℓ) // translate k from binary string to integer
    k ← k ∗ 2 + C.pop()
  // if C runs out of bits then the encoding was invalid
  for (j = 1 to k)
    S.append(b)
  b ← 1 − b // alternate bit-value
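Both RLE routines can be sketched in Python over bit-strings (using strings rather than bit streams is an assumption for readability):

```python
# A sketch of binary RLE with Elias gamma run lengths, mirroring the
# pseudocode above; bits are kept in Python strings for readability.
def rle_encode(s):
    out, i = [s[0]], 0                       # first bit of S
    while i < len(s):
        j = i
        while j < len(s) and s[j] == s[i]:   # measure the run
            j += 1
        b = bin(j - i)[2:]                   # run length in binary
        out.append('0' * (len(b) - 1) + b)   # Elias gamma code
        i = j
    return ''.join(out)

def rle_decode(c):
    b, pos, out = c[0], 1, []                # bit-value of the first run
    while pos < len(c):
        zeros = 0
        while c[pos] == '0':                 # zeros give the length of k
            zeros += 1
            pos += 1
        k = int(c[pos:pos + zeros + 1], 2)   # k in binary, starts with 1
        pos += zeros + 1
        out.append(b * k)
        b = '1' if b == '0' else '0'         # runs alternate
    return ''.join(out)
```

Running `rle_encode` on the example S above reproduces the slides' coded text of 26 bits.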
RLE Properties
Variable length encoding
Dictionary is uniquely defined by an algorithm
no need to explicitly store or send dictionary
Best compression is for S = 000…000 of length n
compressed to 2⌊log n⌋ + 2 ∈ o(n) bits
1 bit for the initial bit
⌊log n⌋ zeros to encode the length of the binary representation of n
⌊log n⌋ + 1 digits that represent n itself in binary
Usually not that lucky
no compression until run-length ≥ 6
expansion when run-length = 2 or 4
Method can be adapted to larger alphabet sizes
but then the encoding for each run must also store the character
Method can be adapted to encode only runs of 0
we will need this soon
Used in some image formats (e.g. TIFF)
Outline
Compression
Encoding Basics
Huffman Codes
Run-Length Encoding
Lempel-Ziv-Welch
bzip2
Burrows-Wheeler Transform
Longer Patterns in Input
Huffman and RLE take advantage of frequent or repeated single characters
Observation: certain substrings are much more frequent than others
Examples
English text
most frequent digraphs: TH, ER, ON, AN, RE, HE, IN, ED, ND, HA
most frequent trigraphs: THE, AND, THA, ENT, ION, TIO, FOR, NDE
HTML
markup tags and attributes repeat frequently
Video
repeated background between frames, shifted sub-image
Ingredient 1 for Lempel-Ziv-Welch compression
Encode characters and frequent substrings
no need to know which substrings are frequent
will discover frequent substring as we process text
will encode them as we read the text
dictionary constructed during encoding/decoding, no need to send it with encoding
how?
Adaptive Dictionaries
ASCII, UTF-8, and RLE use fixed dictionaries
same dictionary for any text encoded
no need to pass the dictionary to the decoder
In Huffman, the dictionary is not fixed, but it is static
each text has its own dictionary
the dictionary does not change for the entire encoding/decoding
need to pass dictionary to the decoder
Ingredient 2 for LZW: adaptive dictionary
start with some initial dictionary D0
usually ASCII
at iteration i ≥ 0, Di is used to determine the ith output
after iteration i, update Di to Di+1
a new character combination is inserted
encoder and decoder must both know how dictionary changes
compute dictionary during encoding/decoding
LZW Overview
Start with dictionary D0 for ΣS
usually ΣS = ASCII, codes from 0 to 127
every step adds a multi-character string to the dictionary, using code numbers 128, 129, …
Store the current dictionary in a trie
Output is a list of numbers (codewords)
each number is usually converted to a bit-string of fixed width
e.g. encoding with 12 bits limits code numbers to 4096
Iteration i of encoding with current dictionary Di
example: S = abbbcbbad with the pointer at the final ‘bba’, Di ⊇ {a: 65, ab: 140, bb: 145, bbc: 146}
find the longest substring that starts at the current pointer and is already in the dictionary: ‘bb’
encode ‘bb’ (code 145)
Di+1 = Di.insert(‘bba’, next available code number)
(logic: ‘bba’ would have been useful at iteration i, so it may be useful in the future)
Tries for LZW
Key-value pairs are (string,code)
Trie stores KVP at all nodes (external and internal) except the root
works because a string is inserted only after all its prefixes are inserted
We show code (value) at each node, because the key can be read off from
the edges
Example: for a text over {A, N, S}, after a few insertions the trie might store
(‘A’, 65), (‘N’, 78), (‘S’, 83), (‘AN’, 128), (‘NA’, 129), (‘ANA’, 130), …
LZW Example
Text: A N A N A N A N N A
Start dictionary: ASCII characters, codes from 0 to 127
next inserted code will be 128
variable idx keeps track of the next available code; initialize idx = 128
Add to the dictionary: “string just encoded” + “first character of the next string to be encoded”
inserting the new item is O(1), since the search already stopped at the right node in the trie
Encoding steps:
  encode ‘A’ → 65, insert ‘AN’ with code 128
  encode ‘N’ → 78, insert ‘NA’ with code 129
  encode ‘AN’ → 128, insert ‘ANA’ with code 130
  encode ‘ANA’ → 130, insert ‘ANAN’ with code 131
  encode ‘N’ → 78, insert ‘NN’ with code 132
  encode ‘NA’ → 129
Final output: 65 78 128 130 78 129
Use fixed length (12 bits) per code
12-bit binary string representation for each code:
000001000001 000001001110 000010000000 000010000010 000001001110 000010000001
total of 2^12 = 4096 codes available during encoding
LZW encoding pseudocode
LZW::encoding(S, C)
S: input stream of characters, C: output-stream
initialize dictionary D with ASCII in a trie
idx ← 128
while there is input in S do
  v ← root of trie
  while S is non-empty and v has a child c labelled S.top() // trie search
    v ← c
    S.pop()
  C.append(code number stored at v)
  if S is non-empty // new dictionary entry
    create child of v labelled S.top() with code idx
    idx++
Running time: O(|S|)
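A Python sketch of the encoder; for brevity the trie is replaced by a hash table of strings (an implementation substitution that changes the constants but not the output):

```python
# A sketch of LZW encoding; the slides' trie is replaced by a dict of
# strings keyed by substring, mapping to code numbers.
def lzw_encode(s):
    d = {chr(i): i for i in range(128)}      # D0: ASCII
    idx = 128                                # next available code number
    out, i = [], 0
    while i < len(s):
        j = i + 1
        while j <= len(s) and s[i:j] in d:   # longest prefix already in D
            j += 1
        j -= 1
        out.append(d[s[i:j]])                # encode it
        if j < len(s):                       # insert encoded string + next char
            d[s[i:j + 1]] = idx
            idx += 1
        i = j
    return out
```

On the worked example, `lzw_encode("ANANANANNA")` yields the slides' code sequence 65 78 128 130 78 129.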
LZW Encoder vs Decoder
For decoding, need a dictionary
Construct the dictionary during decoding, but one step behind
at iteration i of decoding we can reconstruct the substring that the encoder inserted into its dictionary at iteration i − 1
the delay is due to not having access to the original text
LZW Decoding Example
Given the encoding C = 65 78 128 130 78 129, decode back to the source text
Build the dictionary adaptively, while decoding
decoding starts with the same initial dictionary (ASCII) as encoding; idx = 128
use an array indexed by code number instead of a trie: we need efficient search by code
we do not need the original text to decode
Let sprev be the string decoded at the previous iteration; after decoding the current string s, add sprev + s[0] to the dictionary
this reconstructs, one iteration late, exactly what the encoder inserted
  iteration 0: 65 → ‘A’; nothing added yet
  iteration 1: 78 → ‘N’; add (‘AN’, 128)
  iteration 2: 128 → ‘AN’; add (‘NA’, 129)
  iteration 3: 130 → ??? — code 130 is not in the dictionary yet!
    the encoder added (‘ANA’, 130) at iteration 2 and used it immediately
    general rule: if the code equals the next unassigned code number, then s = sprev + sprev[0]
    here: s = ‘AN’ + ‘A’ = ‘ANA’; output it and add (‘ANA’, 130)
  iteration 4: 78 → ‘N’; add (‘ANAN’, 131)
  iteration 5: 129 → ‘NA’; add (‘NN’, 132)
Decoded text: A · N · AN · ANA · N · NA = ANANANANNA
LZW Decoding Pseudocode
LZW::decoding(C, S)
C: input-stream of integers, S: output-stream
D ← dictionary that maps {0, …, 127} to ASCII
idx ← 128 // next available code
code ← C.pop()
s ← D[code]
S.append(s)
while there are more codes in C do
  sprev ← s
  code ← C.pop()
  if code < idx
    s ← D[code] // code in D, look up string
  else if code = idx // code not in D yet, reconstruct string
    s ← sprev + sprev[0]
  else Fail // invalid encoding
  S.append(s)
  D.insert(idx, sprev + s[0])
  idx++
Running time: O(|S|)
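A matching Python sketch of the decoder, with the dictionary as a list indexed by code number (storing full strings here, rather than the more compact prefix-code pairs):

```python
# A sketch of LZW decoding; dictionary entries store full strings, and
# appending to the list assigns the next code number automatically.
def lzw_decode(codes):
    d = [chr(i) for i in range(128)]         # codes 0..127 are ASCII
    s = d[codes[0]]
    out = [s]
    for code in codes[1:]:
        prev = s
        if code < len(d):
            s = d[code]                      # code already in dictionary
        elif code == len(d):
            s = prev + prev[0]               # special case: code not in D yet
        else:
            raise ValueError("invalid encoding")
        out.append(s)
        d.append(prev + s[0])                # one step behind the encoder
    return ''.join(out)
```

`lzw_decode([65, 78, 128, 130, 78, 129])` recovers "ANANANANNA", exercising the special case at code 130.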
LZW decoding
To save space, store each new dictionary entry as (code of its longest proper prefix, one character) instead of the full string
e.g. instead of 128: ‘AN’, 129: ‘NA’, 130: ‘ANA’, 131: ‘ANAN’ (wasteful storage), store
128: (65, N), 129: (78, A), 130: (128, A), 131: (130, N)
for each codeword, we can still find the corresponding string in O(|S|) time by following the prefix codes
Lempel-Ziv Family
Lempel-Ziv is a family of adaptive compression algorithms
LZ77 Original version (“sliding window”)
Derivatives: LZSS, LZFG, LZRW, LZP, DEFLATE, . . .
DEFLATE used in (pk)zip, gzip, PNG
LZ78 Second (slightly improved) version
Derivatives: LZW, LZMW, LZAP, LZY, . . .
LZW used in compress, GIF
patent issues
LZW Summary
Encoding is O(|S|) time, uses a trie of encoded substrings to store the
dictionary
Decoding is O(|S|) time, uses an array indexed by code numbers to
store the dictionary
Encoding and decoding need to go through the string only one time and
do not need to see the whole string
can do compression while streaming the text
Works badly if no repeated substrings
dictionary gets bigger, but no new useful substrings inserted
In practice, compression rate is around 45% on English text
Overview of bzip2
Text transform changes input text into a different text
not necessarily shorter
but has properties likely to lead to better compression at a later stage
To achieve better compression, bzip2 chains the following text transforms:
text T0
→ (Burrows-Wheeler transform) → T1
  if T0 has repeated longer substrings, then T1 has long runs of characters
→ (Move-to-front transform) → T2
  if T1 has long runs of characters, then T2 has long runs of zeros and skewed frequencies
→ (Modified RLE) → T3
  if T2 has long runs of zeros, then T3 is shorter and the skewed frequencies remain
→ (Huffman encoding) → T4
  compresses well since the frequencies are skewed
Example: T0 = alfeatsalfalfa → T1 = affs$eflllaaata → T2 = 1 3 0 5 3 4 3 5 0 0 6 0 0 6
Move-to-Front transform
Recall the MTF heuristic
after an element is accessed, move it to array front
A B C D E → search(D) → D A B C E
Use this idea for MTF (move to front) text transformation
MTF Encoding
Source alphabet ΣS with size |ΣS| = m
Store the alphabet in an array, e.g. for ΣS = {A, …, Z}:
index: 0 1 2 3 … 25
char:  A B C D … Z
This gives us the encoding dictionary (single-character encoding)
Code of a character = index of the array cell where the character is currently stored
e.g. initially E(‘B’) = 1, E(‘H’) = 7
Coded alphabet is ΣC = {0, 1, …, m − 1}
Change the dictionary dynamically (like LZW), but unlike LZW
no new items are added to the dictionary
the codewords of one or more letters can change at each iteration
MTF Encoding Example
S = MISSISSIPPI, dictionary initially A B C D … Z (indices 0–25)
  M → 12, move M to front: M A B C D E F G H I J K L N … Z
  I → 9, move I to front: I M A B C … Z
  S → 18, move S to front: S I M A B C … Z
  S → 0 (already at the front)
  I → 1, move I to front
  S → 1, move S to front
  S → 0
  I → 1, move I to front
  P → 16, move P to front
  P → 0
  I → 1
C = 12 9 18 0 1 1 0 1 16 0 1
What does a run in S mean for the coded text C?
zeros in C correspond to runs of consecutive identical characters in S
MTF Decoding Example
C = 12 9 18 0 1 1 0 1 16 0 1
Decoding is similar
start with the same dictionary as encoding
apply the same move-to-front transformation at each iteration
the dictionary undergoes exactly the same transformations as during encoding
no delays: the dictionaries are identical at each encoding and decoding iteration, so we can always decode the original letter
  12 → M, move M to front; 9 → I, move I to front; 18 → S, move S to front; …
S = MISSISSIPPI
Move-to-Front Transform: Properties
S = a f f s $ e f l l l a a a t a → (MTF) → C = 1 3 0 5 3 4 3 5 0 0 6 0 0 6
If a character in S repeats k times, then C has a run of k − 1 zeros
C contains a lot of small numbers and only a few big ones
C has the same length as S, but better properties for encoding
Move-to-Front Encoding/Decoding Pseudocode
MTF::encoding(S, C)
L ← array with ΣS in some pre-agreed, fixed order (e.g. ASCII)
while S is non-empty do
  c ← S.pop()
  i ← index such that L[i] = c
  C.append(i)
  for j = i − 1 down to 0
    swap L[j] and L[j + 1]

MTF::decoding(C, S)
L ← array with ΣS in some pre-agreed, fixed order (e.g. ASCII)
while C is non-empty do
  i ← next integer of C
  S.append(L[i])
  for j = i − 1 down to 0
    swap L[j] and L[j + 1]
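Both routines in Python, over the uppercase alphabet of the example (the list operations cost O(m) per character, matching the pseudocode's swap loop):

```python
import string

# A sketch of MTF encoding/decoding over A..Z, mirroring the pseudocode.
def mtf_encode(s, alphabet=string.ascii_uppercase):
    L = list(alphabet)
    out = []
    for ch in s:
        i = L.index(ch)            # codeword = current index of the character
        out.append(i)
        L.insert(0, L.pop(i))      # move the character to the front
    return out

def mtf_decode(codes, alphabet=string.ascii_uppercase):
    L = list(alphabet)
    out = []
    for i in codes:
        out.append(L[i])
        L.insert(0, L.pop(i))      # same transformation as the encoder
    return ''.join(out)
```

On "MISSISSIPPI" this reproduces the example's code sequence 12 9 18 0 1 1 0 1 16 0 1.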
Move-to-Front Transform Summary
MTF text transform
source alphabet is ΣS with size |ΣS| = m
store the alphabet in an array
code of a character = index of the array cell where the character is stored
coded alphabet is ΣC = {0, 1, …, m − 1}
Dictionary is adaptive
nothing new is added, but the meaning of the codewords changes
MTF is an adaptive text-transform algorithm
it does not compress input
the output has the same length as input
but output has better properties for compression
Burrows-Wheeler Transform
Transformation (not compression) algorithm
transforms source text to a coded text with the same letters but
in different order
source and coded alphabets are the same
if original text had frequently occurring substrings, then
transformed text should have many runs of the same character
more suitable for MTF transformation
Example: S = alfeatsalfalfa$ → (Burrows-Wheeler Transform) → C = affs$eflllaaata
Required: the source text S ends with an end-of-word character $
$ occurs nowhere else in S
Based on cyclic shifts of a string
a cyclic shift of a string X of length n is the concatenation of
X[i+1 … n−1] and X[0 … i], for 0 ≤ i < n
e.g. cdeab is a cyclic shift of abcde
example
alfeatsalfalfa$
lfeatsalfalfa$a
featsalfalfa$al
eatsalfalfa$alf
atsalfalfa$alfe
tsalfalfa$alfea
salfalfa$alfeat
alfalfa$alfeats
lfalfa$alfeatsa
falfa$alfeatsal
alfa$alfeatsalf
lfa$alfeatsalfa
fa$alfeatsalfal
a$alfeatsalfalf
$alfeatsalfalfa
BWT Algorithm and Example
S = alfeatsalfalfa$
Write down all consecutive cyclic shifts
this forms an array of shifts
the last letter in any row is the first letter of the previous row
alfeatsalfalfa$
lfeatsalfalfa$a
featsalfalfa$al
eatsalfalfa$alf
atsalfalfa$alfe
tsalfalfa$alfea
salfalfa$alfeat
alfalfa$alfeats
lfalfa$alfeatsa
falfa$alfeatsal
alfa$alfeatsalf
lfa$alfeatsalfa
fa$alfeatsalfal
a$alfeatsalfalf
$alfeatsalfalfa
the first column spells the original S
BWT Algorithm and Example
S = alfeatsalfalfa$
Array of cyclic shifts
S has ‘alf’ repeated 3 times
3 different shifts start with ‘lf’ and
end with ‘a’
$alfeatsalfalfa
a$alfeatsalfalf
alfa$alfeatsalf
alfalfa$alfeats
alfeatsalfalfa$
atsalfalfa$alfe
eatsalfalfa$alf
fa$alfeatsalfal
falfa$alfeatsal
featsalfalfa$al
lfa$alfeatsalfa
lfalfa$alfeatsa
lfeatsalfalfa$a
salfalfa$alfeat
tsalfalfa$alfea
S = alfeatsalfalfa$
BWT Algorithm and Example
Array of cyclic shifts
Sort (lexicographically) cyclic shifts
strict sorting order due to ‘$’
First column (of course) has many
consecutive character runs
But also the last column has many
consecutive character runs
sort groups ‘lf’ lines together, and
they all end with ‘a’
sorted shifts array
$alfeatsalfalfa
a$alfeatsalfalf
alfa$alfeatsalf
alfalfa$alfeats
alfeatsalfalfa$
atsalfalfa$alfe
eatsalfalfa$alf
fa$alfeatsalfal
falfa$alfeatsal
featsalfalfa$al
lfa$alfeatsalfa
lfalfa$alfeatsa
lfeatsalfalfa$a
salfalfa$alfeat
tsalfalfa$alfea
S = alfeatsalfalfa$
BWT Algorithm and Example
it could happen that another pattern
will interfere with the sort
e.g. ‘hlfd’ would be broken into ‘h’ and ‘lfd’
the longer the repeated pattern,
the less chance of interference
Array of cyclic shifts
Sort (lexicographically) cyclic shifts
strict sorting order due to ‘$’
First column (of course) has many
consecutive character runs
But also the last column has many
consecutive character runs
sort groups ‘lf’ lines together, and
they all end with ‘a’
sorted shifts array
$alfeatsalfalfa
a$alfeatsalfalf
alfa$alfeatsalf
alfalfa$alfeats
alfeatsalfalfa$
atsalfalfa$alfe
eatsalfalfa$alf
fa$alfeatsalfal
falfa$alfeatsal
featsalfalfa$al
lfa$alfeatsalfa
lfalfa$alfeatsa
lfeatsalfalfa$a
salfalfa$alfeat
tsalfalfa$alfea
S = alfeatsalfalfa$
BWT Algorithm and Example
Sorted array of cyclic shifts
First column is useless for encoding
cannot decode it
Last column can be decoded
BWT Encoding
last characters from sorted shifts
i.e. the last column
C = affs$eflllaaata
sorted shifts array
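The encoding just described can be sketched naively (quadratic space, for illustration only; the suffix-array approach on the following slides is what makes it efficient):

```python
def bwt_encode(s):
    # s must end with the unique end-of-word character '$'
    # ('$' sorts before letters in ASCII, matching the slides' order).
    n = len(s)
    shifts = sorted(s[i:] + s[:i] for i in range(n))   # sorted cyclic shifts
    return ''.join(row[-1] for row in shifts)          # last column is C
```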
BWT Fast Encoding: Efficient Sorting
S = alfeatsalfalfa$
i cyclic shift
0 a l f e a t s a l f a l f a $
1 l f e a t s a l f a l f a $ a
2 f e a t s a l f a l f a $ a l
3 e a t s a l f a l f a $ a l f
4 a t s a l f a l f a $ a l f e
5 t s a l f a l f a $ a l f e a
6 s a l f a l f a $ a l f e a t
7 a l f a l f a $ a l f e a t s
8 l f a l f a $ a l f e a t s a
9 f a l f a $ a l f e a t s a l
10 a l f a $ a l f e a t s a l f
11 l f a $ a l f e a t s a l f a
12 f a $ a l f e a t s a l f a l
13 a $ a l f e a t s a l f a l f
14 $ a l f e a t s a l f a l f a
Can refer to a cyclic shift by the start
index in the text, no need to write it out
explicitly
For sorting, letters after ‘$’ do not matter
lfa$alfeatsalfa
alfalfa$alfeats
BWT Fast Encoding: Efficient Sorting
S = alfeatsalfalfa$
i cyclic shift
0 a l f e a t s a l f a l f a $
1 l f e a t s a l f a l f a $ a
2 f e a t s a l f a l f a $ a l
3 e a t s a l f a l f a $ a l f
4 a t s a l f a l f a $ a l f e
5 t s a l f a l f a $ a l f e a
6 s a l f a l f a $ a l f e a t
7 a l f a l f a $ a l f e a t s
8 l f a l f a $ a l f e a t s a
9 f a l f a $ a l f e a t s a l
10 a l f a $ a l f e a t s a l f
11 l f a $ a l f e a t s a l f a
12 f a $ a l f e a t s a l f a l
13 a $ a l f e a t s a l f a l f
14 $ a l f e a t s a l f a l f a
Can refer to a cyclic shift by the start
index in the text, no need to write it out
explicitly
For sorting, letters after ‘$’ do not matter
lfa$alfeatsalfa
salfalfa$alfeat
BWT Fast Encoding: Efficient Sorting
S = alfeatsalfalfa$
i cyclic shift
0 a l f e a t s a l f a l f a $
1 l f e a t s a l f a l f a $ a
2 f e a t s a l f a l f a $ a l
3 e a t s a l f a l f a $ a l f
4 a t s a l f a l f a $ a l f e
5 t s a l f a l f a $ a l f e a
6 s a l f a l f a $ a l f e a t
7 a l f a l f a $ a l f e a t s
8 l f a l f a $ a l f e a t s a
9 f a l f a $ a l f e a t s a l
10 a l f a $ a l f e a t s a l f
11 l f a $ a l f e a t s a l f a
12 f a $ a l f e a t s a l f a l
13 a $ a l f e a t s a l f a l f
14 $ a l f e a t s a l f a l f a
Can refer to a cyclic shift by the start
index in the text, no need to write it out
explicitly
For sorting, letters after ‘$’ do not matter
lfalfa$alfeatsa
lfa$alfeatsalfa
BWT Fast Encoding: Efficient Sorting
S = alfeatsalfalfa$
i cyclic shift
0 a l f e a t s a l f a l f a $
1 l f e a t s a l f a l f a $ a
2 f e a t s a l f a l f a $ a l
3 e a t s a l f a l f a $ a l f
4 a t s a l f a l f a $ a l f e
5 t s a l f a l f a $ a l f e a
6 s a l f a l f a $ a l f e a t
7 a l f a l f a $ a l f e a t s
8 l f a l f a $ a l f e a t s a
9 f a l f a $ a l f e a t s a l
10 a l f a $ a l f e a t s a l f
11 l f a $ a l f e a t s a l f a
12 f a $ a l f e a t s a l f a l
13 a $ a l f e a t s a l f a l f
14 $ a l f e a t s a l f a l f a
Can refer to a cyclic shift by the start
index in the text, no need to write it out
explicitly
For sorting, letters after ‘$’ do not matter
This is the same as sorting suffixes of S
We already know how to do it
exactly as for suffix arrays, with
MSD-Radix-Sort
O(n log n) running time
BWT Fast Encoding: Efficient Sorting
S = alfeatsalfalfa$
i cyclic shift
0 a l f e a t s a l f a l f a $
1 l f e a t s a l f a l f a $ a
2 f e a t s a l f a l f a $ a l
3 e a t s a l f a l f a $ a l f
4 a t s a l f a l f a $ a l f e
5 t s a l f a l f a $ a l f e a
6 s a l f a l f a $ a l f e a t
7 a l f a l f a $ a l f e a t s
8 l f a l f a $ a l f e a t s a
9 f a l f a $ a l f e a t s a l
10 a l f a $ a l f e a t s a l f
11 l f a $ a l f e a t s a l f a
12 f a $ a l f e a t s a l f a l
13 a $ a l f e a t s a l f a l f
14 $ a l f e a t s a l f a l f a
j SA[j] sorted cyclic shifts
0 14 $ a l f e a t s a l f a l f a
1 13 a $ a l f e a t s a l f a l f
2 10 a l f a $ a l f e a t s a l f
3 7 a l f a l f a $ a l f e a t s
4 0 a l f e a t s a l f a l f a $
5 4 a t s a l f a l f a $ a l f e
6 3 e a t s a l f a l f a $ a l f
7 12 f a $ a l f e a t s a l f a l
8 9 f a l f a $ a l f e a t s a l
9 2 f e a t s a l f a l f a $ a l
10 11 l f a $ a l f e a t s a l f a
11 8 l f a l f a $ a l f e a t s a
12 1 l f e a t s a l f a l f a $ a
13 6 s a l f a l f a $ a l f e a t
14 5 t s a l f a l f a $ a l f e a
BWT Fast Encoding: Efficient Sorting
S = alfeatsalfalfa$
Can read BWT encoding from suffix array in O(n) time
i:     0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
S[i]:  a l f e a t s a l f a l f a $
j:     0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
SA[j]: 14 13 10 7 0 4 3 12 9 2 11 8 1 6 5
cyclic shift starts at S[SA[0]] = S[14]
we need the last letter of that cyclic shift, it is at S[13]
C = a
BWT Fast Encoding: Efficient Sorting
S = alfeatsalfalfa$
Can read BWT encoding from suffix array in O(n) time
i:     0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
S[i]:  a l f e a t s a l f a l f a $
j:     0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
SA[j]: 14 13 10 7 0 4 3 12 9 2 11 8 1 6 5
cyclic shift starts at S[SA[1]] = S[13]
we need the last letter of that cyclic shift, it is at S[12]
C = af
BWT Fast Encoding: Efficient Sorting
S = alfeatsalfalfa$
Can read BWT encoding from suffix array in O(n) time
i:     0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
S[i]:  a l f e a t s a l f a l f a $
j:     0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
SA[j]: 14 13 10 7 0 4 3 12 9 2 11 8 1 6 5
cyclic shift starts at S[SA[14]] = S[5]
we need the last letter of that cyclic shift, it is at S[4]
C = affs$eflllaaata
BWT Fast Encoding: Efficient Sorting
j SA[j] sorted cyclic shifts
0 14 $ a l f e a t s a l f a l f a
1 13 a $ a l f e a t s a l f a l f
2 10 a l f a $ a l f e a t s a l f
3 7 a l f a l f a $ a l f e a t s
4 0 a l f e a t s a l f a l f a $
5 4 a t s a l f a l f a $ a l f e
6 3 e a t s a l f a l f a $ a l f
7 12 f a $ a l f e a t s a l f a l
8 9 f a l f a $ a l f e a t s a l
9 2 f e a t s a l f a l f a $ a l
10 11 l f a $ a l f e a t s a l f a
11 8 l f a l f a $ a l f e a t s a
12 1 l f e a t s a l f a l f a $ a
13 6 s a l f a l f a $ a l f e a t
14 5 t s a l f a l f a $ a l f e a
C = a f f s $ e f l l l a a a t a
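The suffix-array reading can be sketched as follows (the suffix array here is built naively by sorting suffixes, standing in for the MSD-radix-sort construction on the earlier slides):

```python
def bwt_from_suffix_array(s):
    n = len(s)
    # Naive suffix array: sort suffix start positions lexicographically.
    sa = sorted(range(n), key=lambda i: s[i:])
    # The cyclic shift starting at sa[j] ends with the letter just before
    # it; for sa[j] = 0 that is the last character, s[-1] = '$'.
    return ''.join(s[i - 1] for i in sa)
```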
BWT Decoding
alfeatsalfalfa$
lfeatsalfalfa$a
featsalfalfa$al
eatsalfalfa$alf
atsalfalfa$alfe
tsalfalfa$alfea
salfalfa$alfeat
alfalfa$alfeats
lfalfa$alfeatsa
falfa$alfeatsal
alfa$alfeatsalf
lfa$alfeatsalfa
fa$alfeatsalfal
a$alfeatsalfalf
$alfeatsalfalfa
the first column is the original S
Unsorted array of cyclic shifts
C = affs$eflllaaata
unsorted shifts array
BWT Decoding
C = affs$eflllaaata
. . . . . . . . . . a
. . . . . . . . . . f
. . . . . . . . . . f
. . . . . . . . . . s
. . . . . . . . . . $
. . . . . . . . . . e
. . . . . . . . . . f
. . . . . . . . . . l
. . . . . . . . . . l
. . . . . . . . . . l
. . . . . . . . . . a
. . . . . . . . . . a
. . . . . . . . . . a
. . . . . . . . . . t
. . . . . . . . . . a
Given C, the last column of the sorted shifts array
Can reconstruct the first column of sorted
shifts array by sorting
first column has exactly the same
characters as the last column
and they must be sorted
sorted shifts array
BWT Decoding
C = affs$eflllaaata
. . . . . . . . . . a , 0
. . . . . . . . . . f , 1
. . . . . . . . . . f , 2
. . . . . . . . . . s , 3
. . . . . . . . . . $ , 4
. . . . . . . . . . e , 5
. . . . . . . . . . f , 6
. . . . . . . . . . l , 7
. . . . . . . . . . l , 8
. . . . . . . . . . l , 9
. . . . . . . . . . a , 1 0
. . . . . . . . . . a , 1 1
. . . . . . . . . . a , 1 2
. . . . . . . . . . t , 1 3
. . . . . . . . . . a , 1 4
Given C, the last column of the sorted shifts array
Can reconstruct the first column of sorted
shifts array by sorting
first column has exactly the same
characters as the last column
and they must be sorted
Also need row number for decoding
sorted shifts array
BWT Decoding
Now have the first and the last columns of
sorted shifts array
use stable sort
equal letters stay in the same order
C = affs$eflllaaata
$ , 4 . . . . . . . a , 0
a , 0 . . . . . . . f , 1
a , 1 0 . . . . . . f , 2
a , 1 1 . . . . . . s , 3
a , 1 2 . . . . . . $ , 4
a , 1 4 . . . . . . e , 5
e , 5 . . . . . . . f , 6
f , 1 . . . . . . . l , 7
f , 2 . . . . . . . l , 8
f , 6 . . . . . . . l , 9
l , 7 . . . . . . . a , 1 0
l , 8 . . . . . . . a , 1 1
l , 9 . . . . . . . a , 1 2
s , 3 . . . . . . . t , 1 3
t , 1 3 . . . . . . a , 1 4
sorted shifts array
BWT Decoding
Now have the first and the last columns of
sorted shifts array
Key for decoding is figuring out where in
the sorted shifts array are the unsorted
rows 0, 1, …
Where is row 0 of unsorted shifts array?
must end with ‘$’
C = affs$eflllaaata
$ , 4 . . . . . . . a , 0
a , 0 . . . . . . . f , 1
a , 1 0 . . . . . . f , 2
a , 1 1 . . . . . . s , 3
a , 1 2 . . . . . . $ , 4
a , 1 4 . . . . . . e , 5
e , 5 . . . . . . . f , 6
f , 1 . . . . . . . l , 7
f , 2 . . . . . . . l , 8
f , 6 . . . . . . . l , 9
l , 7 . . . . . . . a , 1 0
l , 8 . . . . . . . a , 1 1
l , 9 . . . . . . . a , 1 2
s , 3 . . . . . . . t , 1 3
t , 1 3 . . . . . . a , 1 4
row 0
sorted shifts array
BWT Decoding
S = a
Therefore
string S starts with a
C = affs$eflllaaata
Row 0 of unsorted shifts starts with a
$ , 4 . . . . . . . a , 0
a , 0 . . . . . . . f , 1
a , 1 0 . . . . . . f , 2
a , 1 1 . . . . . . s , 3
a , 1 2 . . . . . . $ , 4
a , 1 4 . . . . . . e , 5
e , 5 . . . . . . . f , 6
f , 1 . . . . . . . l , 7
f , 2 . . . . . . . l , 8
f , 6 . . . . . . . l , 9
l , 7 . . . . . . . a , 1 0
l , 8 . . . . . . . a , 1 1
l , 9 . . . . . . . a , 1 2
s , 3 . . . . . . . t , 1 3
t , 1 3 . . . . . . a , 1 4
row 0
sorted shifts array
Where is row 1 of the unsorted shifts array?
BWT Decoding
C = affs$eflllaaata
alfeatsalfalfa$
lfeatsalfalfa$a
featsalfalfa$al
eatsalfalfa$alf
atsalfalfa$alfe
tsalfalfa$alfea
salfalfa$alfeat
alfalfa$alfeats
lfalfa$alfeatsa
falfa$alfeatsal
alfa$alfeatsalf
lfa$alfeatsalfa
fa$alfeatsalfal
a$alfeatsalfalf
$alfeatsalfalfa
unsorted shifts array
In the unsorted shifts array, any row ends
with the first letter of the previous row
unsorted row 1 ends with the
same letter that unsorted row 0
begins with
BWT Decoding
S = a
Therefore
string S starts with a
C = affs$eflllaaata
Row 0 of unsorted shifts starts with a
$ , 4 . . . . . . . a , 0
a , 0 . . . . . . . f , 1
a , 1 0 . . . . . . f , 2
a , 1 1 . . . . . . s , 3
a , 1 2 . . . . . . $ , 4
a , 1 4 . . . . . . e , 5
e , 5 . . . . . . . f , 6
f , 1 . . . . . . . l , 7
f , 2 . . . . . . . l , 8
f , 6 . . . . . . . l , 9
l , 7 . . . . . . . a , 1 0
l , 8 . . . . . . . a , 1 1
l , 9 . . . . . . . a , 1 2
s , 3 . . . . . . . t , 1 3
t , 1 3 . . . . . . a , 1 4
row 0
sorted shifts array
Where is row 1 of the unsorted shifts array?
row 1 of the unsorted shifts array ends with a
BWT Decoding
Multiple rows end with a, which
one is row 1 of unsorted shifts?
$ , 4 . . . . . . . a , 0
a , 0 . . . . . . . f , 1
a , 1 0 . . . . . . f , 2
a , 1 1 . . . . . . s , 3
a , 1 2 . . . . . . $ , 4
a , 1 4 . . . . . . e , 5
e , 5 . . . . . . . f , 6
f , 1 . . . . . . . l , 7
f , 2 . . . . . . . l , 8
f , 6 . . . . . . . l , 9
l , 7 . . . . . . . a , 1 0
l , 8 . . . . . . . a , 1 1
l , 9 . . . . . . . a , 1 2
s , 3 . . . . . . . t , 1 3
t , 1 3 . . . . . . a , 1 4
row 0
sorted shifts array
S = a
Therefore
string S starts with a
C = affs$eflllaaata
Row 0 of unsorted shifts starts with a
Where is row 1 of the unsorted shifts array?
row 1 of the unsorted shifts array ends with a
S = alfeatsalfalfa$
BWT Algorithm and Example
$alfeatsalfalfa
a$alfeatsalfalf
alfa$alfeatsalf
alfalfa$alfeats
alfeatsalfalfa$
atsalfalfa$alfe
eatsalfalfa$alf
fa$alfeatsalfal
falfa$alfeatsal
featsalfalfa$al
lfa$alfeatsalfa
lfalfa$alfeatsa
lfeatsalfalfa$a
salfalfa$alfeat
tsalfalfa$alfea
Consider all patterns in sorted
array that start with ‘a’
0
1
2
3
4
sorted shifts array
S = alfeatsalfalfa$
BWT Algorithm and Example
$alfeatsalfalfa
a$alfeatsalfalf
alfa$alfeatsalf
alfalfa$alfeats
alfeatsalfalfa$
atsalfalfa$alfe
eatsalfalfa$alf
fa$alfeatsalfal
falfa$alfeatsal
featsalfalfa$al
lfa$alfeatsalfa
lfalfa$alfeatsa
lfeatsalfalfa$a
salfalfa$alfeat
tsalfalfa$alfea
Consider all patterns in sorted
array that start with ‘a’
Take their cyclic shifts (by one letter)
$alfeatsalfalfa
lfa$alfeatsalfa
lfalfa$alfeatsa
lfeatsalfalfa$a
tsalfalfa$alfea
Find them in sorted array of cyclic shifts
They have ‘a’ at the end, and are the only rows that have ‘a’ at the end
They appear in the same relative order as before cyclic shift
a$alfeatsalfalf
alfa$alfeatsalf
alfalfa$alfeats
alfeatsalfalfa$
atsalfalfa$alfe
for patterns with the same first letter, a cyclic shift by one letter does not
change their relative sorting order
sorted shifts array
S = alfeatsalfalfa$
BWT Algorithm and Example
Consider all patterns in sorted
array that start with ‘a’
Take their cyclic shifts (by one letter)
$alfeatsalfalfa
lfa$alfeatsalfa
lfalfa$alfeatsa
lfeatsalfalfa$a
tsalfalfa$alfea
Unsorted row 1 is a cyclic shift by 1 letter of unsorted row 0
unsorted row 0 is #4 among all rows starting with ‘a’
unsorted row 1 is #4 among all rows ending with ‘a’
a$alfeatsalfalf
alfa$alfeatsalf
alfalfa$alfeats
alfeatsalfalfa$
atsalfalfa$alfe
$alfeatsalfalfa
a$alfeatsalfalf
alfa$alfeatsalf
alfalfa$alfeats
alfeatsalfalfa$
atsalfalfa$alfe
eatsalfalfa$alf
fa$alfeatsalfal
falfa$alfeatsal
featsalfalfa$al
lfa$alfeatsalfa
lfalfa$alfeatsa
lfeatsalfalfa$a
salfalfa$alfeat
tsalfalfa$alfea
0
10
11
12
14
0
10
11
12
14
row 0
sorted shifts array
S = alfeatsalfalfa$
BWT Algorithm and Example
Consider all patterns in sorted
array that start with ‘a’
Take their cyclic shifts (by one letter)
$alfeatsalfalfa
lfa$alfeatsalfa
lfalfa$alfeatsa
lfeatsalfalfa$a
tsalfalfa$alfea
Unsorted row 1 is a cyclic shift by 1 letter of unsorted row 0
unsorted row 0 is #4 among all rows starting with ‘a’
unsorted row 1 is #4 among all rows ending with ‘a’
a$alfeatsalfalf
alfa$alfeatsalf
alfalfa$alfeats
alfeatsalfalfa$
atsalfalfa$alfe
$alfeatsalfalfa
a$alfeatsalfalf
alfa$alfeatsalf
alfalfa$alfeats
alfeatsalfalfa$
atsalfalfa$alfe
eatsalfalfa$alf
fa$alfeatsalfal
falfa$alfeatsal
featsalfalfa$al
lfa$alfeatsalfa
lfalfa$alfeatsa
lfeatsalfalfa$a
salfalfa$alfeat
tsalfalfa$alfea
0
10
11
12
14
0
10
11
12
14
row 0
sorted shifts array
BWT Decoding
S = a
Multiple rows end with a, which
one is row 1 of unsorted shifts?
C = affs$eflllaaata
$ , 4 . . . . . . . a , 0
a , 0 . . . . . . . f , 1
a , 1 0 . . . . . . f , 2
a , 1 1 . . . . . . s , 3
a , 1 2 . . . . . . $ , 4
a , 1 4 . . . . . . e , 5
e , 5 . . . . . . . f , 6
f , 1 . . . . . . . l , 7
f , 2 . . . . . . . l , 8
f , 6 . . . . . . . l , 9
l , 7 . . . . . . . a , 1 0
l , 8 . . . . . . . a , 1 1
l , 9 . . . . . . . a , 1 2
s , 3 . . . . . . . t , 1 3
t , 1 3 . . . . . . a , 1 4
row 0
Unsorted row 1 is located in
row 12 of the sorted shifts
row 1
S[1] = l
sorted shifts array
BWT Decoding
S = al
C = affs$eflllaaata
$ , 4 . . . . . . . a , 0
a , 0 . . . . . . . f , 1
a , 1 0 . . . . . . f , 2
a , 1 1 . . . . . . s , 3
a , 1 2 . . . . . . $ , 4
a , 1 4 . . . . . . e , 5
e , 5 . . . . . . . f , 6
f , 1 . . . . . . . l , 7
f , 2 . . . . . . . l , 8
f , 6 . . . . . . . l , 9
l , 7 . . . . . . . a , 1 0
l , 8 . . . . . . . a , 1 1
l , 9 . . . . . . . a , 1 2
s , 3 . . . . . . . t , 1 3
t , 1 3 . . . . . . a , 1 4
row 0
Unsorted row 2 is located in row
9 of the sorted shifts
row 1
S[2] = f
row 2
sorted shifts array
BWT Decoding
S = alf
C = affs$eflllaaata
$ , 4 . . . . . . . a , 0
a , 0 . . . . . . . f , 1
a , 1 0 . . . . . . f , 2
a , 1 1 . . . . . . s , 3
a , 1 2 . . . . . . $ , 4
a , 1 4 . . . . . . e , 5
e , 5 . . . . . . . f , 6
f , 1 . . . . . . . l , 7
f , 2 . . . . . . . l , 8
f , 6 . . . . . . . l , 9
l , 7 . . . . . . . a , 1 0
l , 8 . . . . . . . a , 1 1
l , 9 . . . . . . . a , 1 2
s , 3 . . . . . . . t , 1 3
t , 1 3 . . . . . . a , 1 4
row 0
Unsorted row 3 is located in row 6
of the sorted shifts
row 1
S[3] = e
row 2
row 3
sorted shifts array
BWT Decoding
S = alfe
$ , 4 . . . . . . . a , 0
a , 0 . . . . . . . f , 1
a , 1 0 . . . . . . f , 2
a , 1 1 . . . . . . s , 3
a , 1 2 . . . . . . $ , 4
a , 1 4 . . . . . . e , 5
e , 5 . . . . . . . f , 6
f , 1 . . . . . . . l , 7
f , 2 . . . . . . . l , 8
f , 6 . . . . . . . l , 9
l , 7 . . . . . . . a , 1 0
l , 8 . . . . . . . a , 1 1
l , 9 . . . . . . . a , 1 2
s , 3 . . . . . . . t , 1 3
t , 1 3 . . . . . . a , 1 4
row 0
Unsorted row 4 is located in
row 5 of the sorted shifts
row 1
S[4] = a
row 2
row 3
row 4
C = affs$eflllaaata
sorted shifts array
BWT Decoding
$ , 4 . . . . . . . a , 0
a , 0 . . . . . . . f , 1
a , 1 0 . . . . . . f , 2
a , 1 1 . . . . . . s , 3
a , 1 2 . . . . . . $ , 4
a , 1 4 . . . . . . e , 5
e , 5 . . . . . . . f , 6
f , 1 . . . . . . . l , 7
f , 2 . . . . . . . l , 8
f , 6 . . . . . . . l , 9
l , 7 . . . . . . . a , 1 0
l , 8 . . . . . . . a , 1 1
l , 9 . . . . . . . a , 1 2
s , 3 . . . . . . . t , 1 3
t , 1 3 . . . . . . a , 1 4
row 0
row 1
row 2
row 3
row 4
S = alfeatsalfalfa$
row 5
row 6
row 7
row 8
row 9
row 10
row 11
row 12
row 14
row 13
sorted shifts array
BWT Decoding
BWT::decoding(C[0..n−1], S)
C: string of n characters over alphabet ΣC , S: output stream
A ← array of size n // leftmost column
for i = 0 to n − 1
    A[i] ← (C[i], i) // store character and index
stably sort A by character
for j = 0 to n − 1 // find $
    if C[j] = $ then break
repeat
    S.append(character stored in A[j])
    j ← index stored in A[j]
until we have appended $
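A runnable version of this pseudocode (Python's sort is stable, as the algorithm requires):

```python
def bwt_decode(c):
    n = len(c)
    # a[j] = row of C holding the j-th smallest character; this is the
    # first column of the sorted shifts array, with each entry
    # remembering which row of C (last column) it came from.
    a = sorted(range(n), key=lambda i: c[i])  # stable sort by character
    j = c.index('$')       # row of the sorted array whose shift ends with '$'
    out = []
    while True:
        ch = c[a[j]]       # character stored at sorted position j
        out.append(ch)
        j = a[j]           # follow the stored index to the next row
        if ch == '$':
            break
    return ''.join(out)
```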
BWT Overview
Encoding cost
O(n log n) with special sorting algorithm
in practice MSD-Radix-Sort is good enough, but its worst case is Θ(n²)
read encoding from the suffix array
Decoding cost
faster than encoding
O(n + |ΣS|)
Encoding and decoding both use O(n) space
They need all of the text (no streaming possible)
can use on blocks of text (block compression method)
BWT tends to be slower than other methods
But combined with MTF, RLE and Huffman leads to better compression
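A toy end-to-end check of why BWT helps MTF (a sketch combining the two transforms from this module; names illustrative):

```python
def bwt(s):
    # Naive BWT: last column of the sorted cyclic shifts.
    n = len(s)
    return ''.join(row[-1] for row in sorted(s[i:] + s[:i] for i in range(n)))

def mtf(s):
    # Move-to-front over the sorted alphabet of s.
    d = sorted(set(s))
    out = []
    for ch in s:
        i = d.index(ch)
        out.append(i)
        d.pop(i)
        d.insert(0, ch)
    return out

c = bwt("alfeatsalfalfa$")   # long character runs appear in c
codes = mtf(c)               # runs become runs of 0s, ideal for RLE/Huffman
```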
Compression Summary
Huffman
variable-length
single-character
2-pass
60% compression on English text
optimal 01-prefix-code
requires uneven frequencies
rarely used directly
part of pkzip, JPEG, MP3

Run-length encoding
variable-length
multi-character
1-pass
bad on text, good on long runs (e.g., pictures)
requires runs
rarely used directly
fax machines, old picture-formats

Lempel-Ziv-Welch
fixed-length
multi-character
1-pass
45% compression on English text
good on English text
requires repeated substrings
frequently used
GIF, some variants of PDF, Unix compress

bzip2 (uses Burrows-Wheeler)
multi-step
not streamable
70% compression on English text
better on English text
requires repeated substrings
used but slow
bzip2 and variants