Implementation #
Bzip2 uses several layers of compression techniques stacked on top of each other, which occur in the following order during compression and the reverse order during decompression:
- Run-length encoding (RLE) of initial data.
- Burrows–Wheeler transform (BWT), or block sorting.
- Move-to-front (MTF) transform.
- Run-length encoding (RLE) of MTF result.
- Huffman coding.
- Selection between multiple Huffman tables.
- Unary base-1 encoding of Huffman table selection.
- Delta encoding (Δ) of Huffman-code bit lengths.
- Sparse bit array showing which symbols are used.
RLE #
12W1B12W3B24W1B14W -> twelve Ws, one B etc.
Burrows-Wheeler Transform #
BWT, block-sorting compression. Rearranges a character string into runs of similar characters. This transformation is reversible.
Input SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES
Output TEXYDST.E.IXIXIXXSSMPPS.B..E.S.EUSFXDIIOIIIT
The output is easier to compress because it has many repeated characters. In this example the transformed string contains six runs of identical characters: XX
, SS
, PP
, ..
, II
, and III
, which together make 13 out of the 44 characters.
The transform is done by sorting all the circular shifts of a text in lexicographic order and by extracting the last column and the index of the original string in the set of sorted permutations of S.
Example #
S = ^BANANA|
, where ^
denotes the start of the string and |
is the EOF
pointer.
BWT: Step 1 #
Rotate N times, N = 8 (S length).
^BANANA|
|^BANANA
A|^BANAN
NA|^BANA
ANA|^BAN
NANA|^BA
ANANA|^B
BANANA|^
BWT: Step 2 #
These rotations, or circular shifts, are then sorted lexicographically.
ANANA|^B
ANA|^BAN
A|^BANAN
BANANA|^
NANA|^BA
NA|^BANA
^BANANA|
|^BANANA
BWT: Step 3 #
The output of the encoding phase is the last column L = BNN^AA|A
, and the index (0-based) I of the row containing the original string S, in this case I = 6.
ANANA|^B
ANA|^BAN
A|^BANAN
BANANA|^
NANA|^BA
NA|^BANA
^BANANA|
|^BANANA
BWT Example: Result #
Input: ^BANANA|
Output: BNN^AA|A
BWT: Pseudocode #
Simple but inefficient.
function BWT (string s)
create a table, where the rows are all possible rotations of s
sort rows alphabetically
return (last column of the table)
function inverseBWT (string s)
create empty table
repeat length(s) times
// first insert creates first column
insert s as a column of table before first column of the table
sort rows of the table alphabetically
return (row that ends with the 'EOF' character)