Bzip2 part 2

· slope000's blog


Part 1
Part 2

Implementation #

Bzip2 uses several layers of compression techniques stacked on top of each other, which occur in the following order during compression and the reverse order during decompression:

  1. Run-length encoding (RLE) of initial data.
  2. Burrows–Wheeler transform (BWT), or block sorting.
  3. Move-to-front (MTF) transform.
  4. Run-length encoding (RLE) of MTF result.
  5. Huffman coding.
  6. Selection between multiple Huffman tables.
  7. Unary base-1 encoding of Huffman table selection.
  8. Delta encoding (Δ) of Huffman-code bit lengths.
  9. Sparse bit array showing which symbols are used.

RLE #

12W1B12W3B24W1B14W -> twelve Ws, one B etc.

Burrows-Wheeler Transform #

BWT, block-sorting compression. Rearranges a character string into runs of similar characters. This transformation is reversible.

   Input  SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES
   Output TEXYDST.E.IXIXIXXSSMPPS.B..E.S.EUSFXDIIOIIIT

The output is easier to compress because it has many repeated characters. In this example the transformed string contains six runs of identical characters: XX, SS, PP, .., II, and III, which together make 13 out of the 44 characters.

The transform is done by sorting all the circular shifts of a text in lexicographic order and by extracting the last column and the index of the original string in the set of sorted permutations of S.

Example #

S = ^BANANA|, where ^ denotes the start of the string and | is the EOF pointer.

BWT: Step 1 #

Rotate N times, N = 8 (S length).

^BANANA|
|^BANANA
A|^BANAN
NA|^BANA
ANA|^BAN
NANA|^BA
ANANA|^B
BANANA|^

BWT: Step 2 #

These rotations, or circular shifts, are then sorted lexicographically.

ANANA|^B
ANA|^BAN
A|^BANAN
BANANA|^
NANA|^BA
NA|^BANA
^BANANA|
|^BANANA

BWT: Step 3 #

The output of the encoding phase is the last column L = BNN^AA|A, and the index (0-based) I of the row containing the original string S, in this case I = 6.

ANANA|^B
ANA|^BAN
A|^BANAN
BANANA|^
NANA|^BA
NA|^BANA
^BANANA|
|^BANANA

BWT Example: Result #

Input: ^BANANA|
Output: BNN^AA|A

BWT: Pseudocode #

Simple but inefficient.

function BWT (string s)
    create a table, where the rows are all possible rotations of s
    sort rows alphabetically
    return (last column of the table)

function inverseBWT (string s)
    create empty table
    repeat length(s) times
        // first insert creates first column
        insert s as a column of table before first column of the table
        sort rows of the table alphabetically
   return (row that ends with the 'EOF' character)