Bzip2 part 1

· slope000's blog

Start day: 09/13/2022
A dependency of: app-admin/keepassxc-2.7.1-r1
Has no dependencies

Part 1
Part 2

# ebuild and meta-info

License: bzip2
Homepage: https://sourceware.org/bzip2/
Source code: https://sourceware.org/pub/${PN}/${P}.tar.gz
Patches (reviewed in the end of the article):

PATCHES=(
	"${FILESDIR}"/${PN}-1.0.4-makefile-CFLAGS.patch
	"${FILESDIR}"/${PN}-1.0.8-saneso.patch
	"${FILESDIR}"/${PN}-1.0.4-man-links.patch #172986
	"${FILESDIR}"/${PN}-1.0.6-progress.patch
	"${FILESDIR}"/${PN}-1.0.3-no-test.patch
	"${FILESDIR}"/${PN}-1.0.8-mingw.patch #393573
	"${FILESDIR}"/${PN}-1.0.8-out-of-tree-build.patch
)

Flags: +split-usr -static -static-libs -verify-sig
Lines of code: ~7336 (including only *.c files)

# Requisites

# Already familiar with

# Description

bzip2 compresses files using the Burrows-Wheeler block sorting text compression algorithm, and Huffman coding. Compression is generally considerably better than that achieved by more conventional LZ77/LZ78-based compressors, and approaches the performance of the PPM family of statistical compressors. It was developed by Julian Seward, and maintained by Mark Wielaard and Micah Snyder.

bzip2 compresses only single files and is not a file archiver.

# History

Seward made the first public release of bzip2, version 0.15, in July 1996. The compressor's stability and popularity grew over the next several years, and Seward released version 1.0 in late 2000. Following a nine-year hiatus of updates for the project since 2010, on 4 June 2019 Federico Mena accepted maintainership of the bzip2 project. Since June 2021, the maintainer is Micah Snyder.

18     July 1996 (version 0.15)
25   August 1996 (version 0.21)
 7   August 1997 (bzip2, version 0.1)
29   August 1997 (bzip2, version 0.1pl2)
23   August 1998 (bzip2, version 0.9.0)
 8     June 1999 (bzip2, version 0.9.5)
 4     Sept 1999 (bzip2, version 0.9.5d)
 5      May 2000 (bzip2, version 1.0pre8)
30 December 2001 (bzip2, version 1.0.2pre1)
15 February 2005 (bzip2, version 1.0.3)
20 December 2006 (bzip2, version 1.0.4)
10 December 2007 (bzip2, version 1.0.5)
 6     Sept 2010 (bzip2, version 1.0.6)
27     June 2019 (bzip2, version 1.0.7)
13     July 2019 (bzip2, version 1.0.8)

Contact: bzip2-devel@sourceware.org.

# Implementation

Bzip2 uses several layers of compression techniques stacked on top of each other, which occur in the following order during compression and the reverse order during decompression:

  1. Run-length encoding (RLE) of initial data.
  2. Burrows–Wheeler transform (BWT), or block sorting.
  3. Move-to-front (MTF) transform.
  4. Run-length encoding (RLE) of MTF result.
  5. Huffman coding.
  6. Selection between multiple Huffman tables.
  7. Unary base-1 encoding of Huffman table selection.
  8. Delta encoding (Δ) of Huffman-code bit lengths.
  9. Sparse bit array showing which symbols are used.

# License

License: bzip2.
[Break: read bzip2].

Julian Seward, jseward@acm.org
bzip2/libbzip2 version 1.0.8 of 13 July 2019

# Makefile

Objects:

OBJS= blocksort.o  \
      huffman.o    \
      crctable.o   \
      randtable.o  \
      compress.o   \
      decompress.o \
      bzlib.o

Options:

all: libbz2.a bzip2 bzip2recover test

# ...

check: test
install:
clean:

# an option for each obj

distclean:
dist:

# rebuilding manual from source
manual:
manual.ps:
manual.pdf:
manual.html:

Default installation paths: /usr/local/bin, /usr/local/lib, /usr/local/man and /usr/local/include.

# Makefile-libbz2_so

This Makefile builds a shared version of the library, libbz2.so.1.0.8, with soname libbz2.so.1.0, at least on x86-Linux (RedHat 7.2), with gcc-2.96 20000731 (Red Hat Linux 7.1 2.96-98).

all: $(OBJS)
	$(CC) -shared -Wl,-soname -Wl,libbz2.so.1.0 -o libbz2.so.1.0.8 $(OBJS)
	$(CC) $(CFLAGS) -o bzip2-shared bzip2.c libbz2.so.1.0.8
	rm -f libbz2.so.1.0
	ln -s libbz2.so.1.0.8 libbz2.so.1.0

# Build advice

I suggest you also build using the normal Makefile, since that conducts a self-test. A second reason to prefer the version statically linked to the library is that, on x86 platforms, building shared objects makes a valuable register (%ebx) unavailable to gcc, resulting in a slowdown of 10%-20%, at least for bzip2.

For Large File (2,147,483,648 (2^31) bytes or above) support, the -D_FILE_OFFSET_BITS=64 option was added to the Makefile. Many older OSs can't handle files above this size, but many newer ones can. if you get compilation errors which you think are related to large file support, try removing the above define from the Makefile.

# Xml stuff

The script xmlproc.sh takes an xml file as input, and processes it to create .pdf, .html or .ps (PostScript) output. It uses format.pl, a perl script to format <pre> blocks nicely, and add CDATA tags so writers do not have to use eg. &lt.

Usage:

# Validate
./xmlproc.sh -v manual.xml
./xmlproc.sh -html manual.xml
./xmlproc.sh -pdf manual.xml
./xmlproc.sh -ps manual.xml

# Manual

The manual is very detailed, thus here I'll list only brief notes regarding it.

# How to use bzip2

# Name

# Synopsis

# Memory Management

Most of the compression comes from the first two or three hundred k of block size, a fact worth bearing in mind when using bzip2 on small machines. The default block size is 900 kb.

# Recovering Data from Damaged Files

The compressed representation of each block is delimited by a 48-bit pattern, which makes it possible to find the block boundaries with reasonable certainty. Each block also carries its own 32-bit CRC, so damaged blocks can be distinguished from undamaged ones.

bzip2recover is a simple program whose purpose is to search for blocks in .bz2 files, and write each block out into its own .bz2 file. You can then use bzip2 -t to test the integrity of the resulting files, and decompress those which are undamaged.

# Performance Notes

The sorting phase of compression gathers together similar strings in the file. Because of this, files containing very long runs of repeated symbols, like "aabaabaabaab ..." (repeated several hundred times) may compress more slowly than normal (average-worst case ratio: ~1:10 after version 0.9.5).

# Programming with libbzip2

All externally visible symbols have names beginning BZ2_. To use any part of the library, you need to #include <bzlib.h> into your sources.

# Low-level usage summary

Compressing/decompressing data in memory. Thread safe and doesn't require stdio, which is convenient for embedded applications. Six routines make up the low level interface: BZ2_bzCompressInit, BZ2_bzCompress, and BZ2_bzCompressEnd for compression, and a corresponding trio BZ2_bzDecompressInit, BZ2_bzDecompress and BZ2_bzDecompressEnd for decompression. The *Init functions allocate memory for compression/decompression and do other initialisations, whilst the *End functions close down operations and release memory. The real work is done by BZ2_bzCompress and BZ2_bzDecompress. These compress and decompress data from a user-supplied input buffer to a user-supplied output buffer. These buffers can be any size.

# High-level usage summary

For reading files, BZ2_bzReadOpen, BZ2_bzRead, BZ2_bzReadClose and BZ2_bzReadGetUnused are supplied. For writing files, BZ2_bzWriteOpen, BZ2_bzWrite and BZ2_bzWriteFinish are available. The high-level library is also thread-safe since no global variables are in use. Unless errno is required, in which case an external thread-safe library will have to be supplied.

# Utility functions summary

For very simple needs, BZ2_bzBuffToBuffCompress and BZ2_bzBuffToBuffDecompress are provided. These compress data in memory from one buffer to another buffer in a single function call.

Yoshioka Tsuneo (tsuneo@rr.iij4u.or.jp) has contributed some functions to give better zlib compatibility. These functions are BZ2_bzopen, BZ2_bzread, BZ2_bzwrite, BZ2_bzflush, BZ2_bzclose, BZ2_bzerror and BZ2_bzlibVersion. They are not documented yet, though.

# Error handling

bzlib.h includes all the necessary definitions. The following definitions indicate success: BZ_OK, BZ_RUN_OK, BZ_FLUSH_OK, BZ_FINISH_OK, BZ_STREAM_END. The following definitions are used to indicate failure: BZ_CONFIG_ERROR (sizeof char,short, int are not 1, 2, 4 respectively), BZ_SEQUENCE_ERROR (functions called in an incorrect sequence), BZ_PARAM_ERROR, BZ_MEM_ERROR, BZ_DATA_ERROR (data integrity error), BZ_DATA_ERROR_MAGIC (incorrect magic bytes), BZ_IO_ERROR, BZ_UNEXPECTED_EOF (returned by BZ2_bzRead when the compressed file finishes before the logical end of stream is detected), BZ_OUTBUFF_FULL.

# Low-level interface

# BZ2_bzCompressInit

typedef struct {
  char *next_in;
  unsigned int avail_in;
  unsigned int total_in_lo32;
  unsigned int total_in_hi32;

  char *next_out;
  unsigned int avail_out;
  unsigned int total_out_lo32;
  unsigned int total_out_hi32;

  void *state;

  void *(*bzalloc)(void *,int,int);
  void (*bzfree)(void *,void *);
  void *opaque;
} bz_stream;

bz_stream holds all data pertaining to the compression activity. The fields of bz_stream comprise the entirety of the user-visible data. state is a pointer to the private data structures required for compression. Custom memory allocators are supported, via fields bzalloc, bzfree, and opaque. The value opaque is passed to as the first argument to all calls to bzalloc and bzfree, but is otherwise ignored by the library. The call bzalloc ( opaque, n, m ) is expected to return a pointer p to n * m bytes of memory, and bzfree ( opaque, p ) should free that memory. The fields should be set to null otherwise. total_in_lo32, total_in_hi32, total_out_lo32 and total_out_hi32 are used to inform the caller of the total amount of data passed.

int BZ2_bzCompressInit ( bz_stream *strm,
                         int blockSize100k,
                         int verbosity,
                         int workFactor );

Parameter blockSize100k specifies block size (1..9). verbosity range is 0..4 (unless the library is compiled with -DBZ_NO_STDIO), 0 is silent. workFactor default value is 30, range is 0..250 (0 is a special case, equals to 30). Lower values of workFactor reduce the amount of effort the standard algorithm will expend before resorting to the fallback in case the input data is highly repetitive, which is slower but always behaves reasonably, both algorithms generate the same output.

# BZ2_bzCompress

int BZ2_bzCompress ( bz_stream *strm, int action );

strm->next_in should point at the data to be compressed, and strm->avail_in should indicate how many bytes the library may read. strm->next_out and strm->avail_out should be defined in a similar way. The action value can be BZ_RUN (start compression), BZ_FINISH (remember current value of next_in, compress from next_in to next_out as much as possible, but do not accept any more input, next state - FINISHING; in FINISHING state, if stream end is reached, the next state is IDLE, otherwise is FINISHING) and BZ_FLUSH (in RUNNING state, same as BZ_FINISH, next state - FLUSHING; in FLUSHING state if all i/o was used, next state is RUNNING, otherwise next step is FLUSHING, other actions in FLUSHING state are illegal).

If the data you want to compress fits into your input buffer all at once, you can skip the calls of BZ2_bzCompress ( ..., BZ_RUN ) and just do the BZ2_bzCompress ( ..., BZ_FINISH ) calls.

# BZ2_bzCompressEnd

int BZ2_bzCompressEnd ( bz_stream *strm );

Releases memory.

# BZ2_bzDecompressInit

int BZ2_bzDecompressInit ( bz_stream *strm, int verbosity, int small );

If small is nonzero, the library will use an alternative decompression algorithm which uses less memory but at the cost of decompressing more slowly.

# BZ2_bzDecompress

int BZ2_bzDecompress ( bz_stream *strm );

You should provide input and remove output as described above, and repeatedly call BZ2_bzDecompress until BZ_STREAM_END is returned.

# BZ2_bzDecompressEnd

int BZ2_bzDecompressEnd ( bz_stream *strm );

Releases memory.

# High-level interface

# BZ2_bzReadOpen

typedef void BZFILE;

BZFILE *BZ2_bzReadOpen( int *bzerror, FILE *f,
                        int verbosity, int small,
                        void *unused, int nUnused );

For reasons explained below, BZ2_bzRead will decompress the nUnused bytes starting at unused, before starting to read from the file f. At most BZ_MAX_UNUSED bytes may be supplied like this. If this facility is not required, you should pass NULL and 0 for unused and nUnused respectively.

# BZ2_bzRead

int BZ2_bzRead ( int *bzerror, BZFILE *b, void *buf, int len );

Reads up to len (uncompressed) bytes from the compressed file b into the buffer buf.

Internally, BZ2_bzRead copies data from the compressed file in chunks of size BZ_MAX_UNUSED bytes before decompressing it. If the file contains more bytes than strictly needed to reach the logical end-of-stream, BZ2_bzRead will almost certainly read some of the trailing data before signalling BZ_SEQUENCE_END. To collect the read but unused data once BZ_SEQUENCE_END has appeared, call BZ2_bzReadGetUnused immediately before BZ2_bzReadClose.

# BZ2_bzReadGetUnused

void BZ2_bzReadGetUnused( int* bzerror, BZFILE *b,
                          void** unused, int* nUnused );

# BZ2_bzReadClose

void BZ2_bzReadClose ( int *bzerror, BZFILE *b );

Doesn't close the b handle, only frees all the allocated resources.

# BZ2_bzWriteOpen

BZFILE *BZ2_bzWriteOpen( int *bzerror, FILE *f,
                         int blockSize100k, int verbosity,
                         int workFactor );

# BZ2_bzWrite

void BZ2_bzWrite ( int *bzerror, BZFILE *b, void *buf, int len );

# BZ2_bzWriteClose

void BZ2_bzWriteClose( int *bzerror, BZFILE* f,
                       int abandon,
                       unsigned int* nbytes_in,
                       unsigned int* nbytes_out );

void BZ2_bzWriteClose64( int *bzerror, BZFILE* f,
                         int abandon,
                         unsigned int* nbytes_in_lo32,
                         unsigned int* nbytes_in_hi32,
                         unsigned int* nbytes_out_lo32,
                         unsigned int* nbytes_out_hi32 );

Compresses and flushes to the compressed file all data so far supplied by BZ2_bzWrite. Subsequent calls to BZ2_bzWrite are illegal. All memory associated with the compressed file b is released. fflush is called on the compressed file, but it is not fclose'd. If BZ2_bzWriteClose is called to clean up after an error, the only action is to release the memory. This behavior can be forced to happen by passing a nonzero value to abandon.

# Utility functions

# BZ2_bzBuffToBuffCompress

int BZ2_bzBuffToBuffCompress( char*         dest,
                              unsigned int* destLen,
                              char*         source,
                              unsigned int  sourceLen,
                              int           blockSize100k,
                              int           verbosity,
                              int           workFactor );

Attempts to compress the data from source to dest.

Allocate an output buffer of size 1% larger than the uncompressed data, plus six hundred extra bytes.

# BZ2_bzBuffToBuffDecompress

int BZ2_bzBuffToBuffDecompress( char*         dest,
                                unsigned int* destLen,
                                char*         source,
                                unsigned int  sourceLen,
                                int           small,
                                int           verbosity );

# zlib compatibility functions

typedef void BZFILE;

const char * BZ2_bzlibVersion ( void );
BZFILE * BZ2_bzopen  ( const char *path, const char *mode );
BZFILE * BZ2_bzdopen ( int        fd,    const char *mode );
int BZ2_bzread  ( BZFILE* b, void* buf, int len );
int BZ2_bzwrite ( BZFILE* b, void* buf, int len );
int  BZ2_bzflush ( BZFILE* b );
void BZ2_bzclose ( BZFILE* b );
const char * BZ2_bzerror ( BZFILE *b, int *errnum );

# Using the library in a stdio-free environment

Compile with the preprocessor BZ_NO_STDIO symbol defined.

# Critical error handling

In a stdio-free environment, assertion fail results in a call to the following function:

extern void bz_internal_error ( int errcode );

Otherwise, stdio is used to indicate the error. The program cannot be recovered after a critical error.

# Windows

For making a DLL: using Visual C++ 5.0, open the project file libbz2.dsp, and build. That's it. Then test it with the sample program dlltest.c, which also has a project file: dlltest.dsp. If you don't use VC++, you might need to define _WIN32.

For a makefile for Visual C, take a look at makefile.msc. If you compile bzip2 on Windows, set BZ_UNIX to 0 and BZ_LCCWIN32 to 1 in the file bzip2.c.

# Miscellanea

# Compressed file format

Didn't change since bzip2-0.1 for compatibility.

Unnecessary complexities of the file format:

# Portability issues

# Reporting bugs

Before reporting a bug,

# Further reading

The following paper gives valuable additional insights into the algorithm, but is not immediately the basis of any code used in bzip2.

Other related documents:

Performance-related papers: