Bzip2 part 1

Start day: 09/13/2022
A dependency of: app-admin/keepassxc-2.7.1-r1
Has no dependencies

ebuild and meta-info #

License: bzip2
Homepage: https://sourceware.org/bzip2/
Source code: https://sourceware.org/pub/${PN}/${P}.tar.gz
Patches (reviewed in the end of the article):

PATCHES=(
	"${FILESDIR}"/${PN}-1.0.4-makefile-CFLAGS.patch
	"${FILESDIR}"/${PN}-1.0.8-saneso.patch
	"${FILESDIR}"/${PN}-1.0.4-man-links.patch #172986
	"${FILESDIR}"/${PN}-1.0.6-progress.patch
	"${FILESDIR}"/${PN}-1.0.3-no-test.patch
	"${FILESDIR}"/${PN}-1.0.8-mingw.patch #393573
	"${FILESDIR}"/${PN}-1.0.8-out-of-tree-build.patch
)

Flags: +split-usr -static -static-libs -verify-sig
Lines of code: ~7336 (including only *.c files)

Requisites #

(x) gentoo wiki bzip2 page
( ) wikipedia
( ) Burrows-Wheeler algorithm
(x) bzip2 man page - https://linux.die.net/man/1/bzip2
(x) bzip2 manual
(x) What is bzip2? - https://sourceware.org/bzip2/

Already familiar with #

Huffman coding

Description #

bzip2 compresses files using the Burrows-Wheeler block sorting text compression algorithm, and Huffman coding. Compression is generally considerably better than that achieved by more conventional LZ77/LZ78-based compressors, and approaches the performance of the PPM family of statistical compressors. It was developed by Julian Seward, and maintained by Mark Wielaard and Micah Snyder.

bzip2 compresses only single files and is not a file archiver.

History #

Seward made the first public release of bzip2, version 0.15, in July 1996. The compressor's stability and popularity grew over the next several years, and Seward released version 1.0 in late 2000. Following a nine-year hiatus of updates for the project since 2010, on 4 June 2019 Federico Mena accepted maintainership of the bzip2 project. Since June 2021, the maintainer is Micah Snyder.

18     July 1996 (version 0.15)
25   August 1996 (version 0.21)
 7   August 1997 (bzip2, version 0.1)
29   August 1997 (bzip2, version 0.1pl2)
23   August 1998 (bzip2, version 0.9.0)
 8     June 1999 (bzip2, version 0.9.5)
 4     Sept 1999 (bzip2, version 0.9.5d)
 5      May 2000 (bzip2, version 1.0pre8)
30 December 2001 (bzip2, version 1.0.2pre1)
15 February 2005 (bzip2, version 1.0.3)
20 December 2006 (bzip2, version 1.0.4)
10 December 2007 (bzip2, version 1.0.5)
 6     Sept 2010 (bzip2, version 1.0.6)
27     June 2019 (bzip2, version 1.0.7)
13     July 2019 (bzip2, version 1.0.8)

Contact: bzip2-devel@sourceware.org.

Implementation #

Bzip2 uses several layers of compression techniques stacked on top of each other, which occur in the following order during compression and the reverse order during decompression:

Run-length encoding (RLE) of initial data.
Burrows–Wheeler transform (BWT), or block sorting.
Move-to-front (MTF) transform.
Run-length encoding (RLE) of MTF result.
Huffman coding.
Selection between multiple Huffman tables.
Unary base-1 encoding of Huffman table selection.
Delta encoding (Δ) of Huffman-code bit lengths.
Sparse bit array showing which symbols are used.

License #

License: bzip2.
[Break: read bzip2].

Julian Seward, jseward@acm.org
bzip2/libbzip2 version 1.0.8 of 13 July 2019

Makefile #

Objects:

OBJS= blocksort.o  \
      huffman.o    \
      crctable.o   \
      randtable.o  \
      compress.o   \
      decompress.o \
      bzlib.o

Options:

all: libbz2.a bzip2 bzip2recover test

# ...

check: test
install:
clean:

# an option for each obj

distclean:
dist:

# rebuilding manual from source
manual:
manual.ps:
manual.pdf:
manual.html:

Default installation paths: /usr/local/bin, /usr/local/lib, /usr/local/man and /usr/local/include.

Makefile-libbz2_so #

This Makefile builds a shared version of the library, libbz2.so.1.0.8, with soname libbz2.so.1.0, at least on x86-Linux (RedHat 7.2), with gcc-2.96 20000731 (Red Hat Linux 7.1 2.96-98).

all: $(OBJS)
	$(CC) -shared -Wl,-soname -Wl,libbz2.so.1.0 -o libbz2.so.1.0.8 $(OBJS)
	$(CC) $(CFLAGS) -o bzip2-shared bzip2.c libbz2.so.1.0.8
	rm -f libbz2.so.1.0
	ln -s libbz2.so.1.0.8 libbz2.so.1.0

Build advice #

I suggest you also build using the normal Makefile, since that conducts a self-test. A second reason to prefer the version statically linked to the library is that, on x86 platforms, building shared objects makes a valuable register (%ebx) unavailable to gcc, resulting in a slowdown of 10%-20%, at least for bzip2.

For Large File (2,147,483,648 (2^31) bytes or above) support, the -D_FILE_OFFSET_BITS=64 option was added to the Makefile. Many older OSs can't handle files above this size, but many newer ones can. if you get compilation errors which you think are related to large file support, try removing the above define from the Makefile.

Xml stuff #

The script xmlproc.sh takes an xml file as input, and processes it to create .pdf, .html or .ps (PostScript) output. It uses format.pl, a perl script to format <pre> blocks nicely, and add CDATA tags so writers do not have to use eg. &lt.

Usage:

# Validate
./xmlproc.sh -v manual.xml
./xmlproc.sh -html manual.xml
./xmlproc.sh -pdf manual.xml
./xmlproc.sh -ps manual.xml

Manual #

The manual is very detailed, thus here I'll list only brief notes regarding it.

How to use bzip2 #

Name #

bzip2, bunzip2 - a block-sorting file compressor, v1.0.8
bzcat - decompresses files to stdout
bzip2recover - recovers data from damaged bzip2 files

Synopsis #

bzip2 [ -cdfkqstvzVL123456789 ] [ filenames ... ]
bunzip2 [ -fkvsVL ] [ filenames ... ]
bzcat [ -s ] [ filenames ... ]
bzip2recover filename
-f - overwrite existing files
-d - decompress, an alternative of bunzip2; recognised endings: .bz2, .bz, .tbz2 or .tbz
-t - integrity testing of concatenated compressed files
-c - compress/decompress from/to stdin/stdout
bzcat = bzip2 -dc
bzip2recover - try to recover data from damaged files.
-z - compress regardless of name
-k - keep input files
-s - modified algorithm for reduction of memory usage (use this if m < ~8 mb)
-q - quiet
-v - verbose
-L, -V - licence, version
-1 (--fast) ... -9 (--best) - sets blocks sizes to 100..900 kb.
-- - treats all subsequent arguments as file names.
-vvvv - monitor the process with a great detail.
Arguments supply priority: BZIP2, BZIP environment variables, command line
Files with size s < ~100b tend to get larger output because of constant overhead, random data is coded at about 8.05 bits per byte, giving an expansion of around 0.5%.
bzip2 uses 32-bit CRCs to make sure that the decompressed version of a file is identical to the original.
Return values - 0:success, 1:environmental problems, 2:corrupt compressed file, 3:internal consistency error (i.e. bug)

Memory Management #

Most of the compression comes from the first two or three hundred k of block size, a fact worth bearing in mind when using bzip2 on small machines. The default block size is 900 kb.

Recovering Data from Damaged Files #

The compressed representation of each block is delimited by a 48-bit pattern, which makes it possible to find the block boundaries with reasonable certainty. Each block also carries its own 32-bit CRC, so damaged blocks can be distinguished from undamaged ones.

bzip2recover is a simple program whose purpose is to search for blocks in .bz2 files, and write each block out into its own .bz2 file. You can then use bzip2 -t to test the integrity of the resulting files, and decompress those which are undamaged.

Performance Notes #

The sorting phase of compression gathers together similar strings in the file. Because of this, files containing very long runs of repeated symbols, like "aabaabaabaab ..." (repeated several hundred times) may compress more slowly than normal (average-worst case ratio: ~1:10 after version 0.9.5).

Programming with libbzip2 #

All externally visible symbols have names beginning BZ2_. To use any part of the library, you need to #include <bzlib.h> into your sources.

Low-level usage summary #

Compressing/decompressing data in memory. Thread safe and doesn't require stdio, which is convenient for embedded applications. Six routines make up the low level interface: BZ2_bzCompressInit, BZ2_bzCompress, and BZ2_bzCompressEnd for compression, and a corresponding trio BZ2_bzDecompressInit, BZ2_bzDecompress and BZ2_bzDecompressEnd for decompression. The *Init functions allocate memory for compression/decompression and do other initialisations, whilst the *End functions close down operations and release memory. The real work is done by BZ2_bzCompress and BZ2_bzDecompress. These compress and decompress data from a user-supplied input buffer to a user-supplied output buffer. These buffers can be any size.

High-level usage summary #

For reading files, BZ2_bzReadOpen, BZ2_bzRead, BZ2_bzReadClose and BZ2_bzReadGetUnused are supplied. For writing files, BZ2_bzWriteOpen, BZ2_bzWrite and BZ2_bzWriteFinish are available. The high-level library is also thread-safe since no global variables are in use. Unless errno is required, in which case an external thread-safe library will have to be supplied.

Utility functions summary #

For very simple needs, BZ2_bzBuffToBuffCompress and BZ2_bzBuffToBuffDecompress are provided. These compress data in memory from one buffer to another buffer in a single function call.

Yoshioka Tsuneo (tsuneo@rr.iij4u.or.jp) has contributed some functions to give better zlib compatibility. These functions are BZ2_bzopen, BZ2_bzread, BZ2_bzwrite, BZ2_bzflush, BZ2_bzclose, BZ2_bzerror and BZ2_bzlibVersion. They are not documented yet, though.

Error handling #

bzlib.h includes all the necessary definitions. The following definitions indicate success: BZ_OK, BZ_RUN_OK, BZ_FLUSH_OK, BZ_FINISH_OK, BZ_STREAM_END. The following definitions are used to indicate failure: BZ_CONFIG_ERROR (sizeof char,short, int are not 1, 2, 4 respectively), BZ_SEQUENCE_ERROR (functions called in an incorrect sequence), BZ_PARAM_ERROR, BZ_MEM_ERROR, BZ_DATA_ERROR (data integrity error), BZ_DATA_ERROR_MAGIC (incorrect magic bytes), BZ_IO_ERROR, BZ_UNEXPECTED_EOF (returned by BZ2_bzRead when the compressed file finishes before the logical end of stream is detected), BZ_OUTBUFF_FULL.

Low-level interface #

`BZ2_bzCompressInit` #

typedef struct {
  char *next_in;
  unsigned int avail_in;
  unsigned int total_in_lo32;
  unsigned int total_in_hi32;

  char *next_out;
  unsigned int avail_out;
  unsigned int total_out_lo32;
  unsigned int total_out_hi32;

  void *state;

  void *(*bzalloc)(void *,int,int);
  void (*bzfree)(void *,void *);
  void *opaque;
} bz_stream;

bz_stream holds all data pertaining to the compression activity. The fields of bz_stream comprise the entirety of the user-visible data. state is a pointer to the private data structures required for compression. Custom memory allocators are supported, via fields bzalloc, bzfree, and opaque. The value opaque is passed to as the first argument to all calls to bzalloc and bzfree, but is otherwise ignored by the library. The call bzalloc ( opaque, n, m ) is expected to return a pointer p to n * m bytes of memory, and bzfree ( opaque, p ) should free that memory. The fields should be set to null otherwise. total_in_lo32, total_in_hi32, total_out_lo32 and total_out_hi32 are used to inform the caller of the total amount of data passed.

int BZ2_bzCompressInit ( bz_stream *strm,
                         int blockSize100k,
                         int verbosity,
                         int workFactor );

Parameter blockSize100k specifies block size (1..9). verbosity range is 0..4 (unless the library is compiled with -DBZ_NO_STDIO), 0 is silent. workFactor default value is 30, range is 0..250 (0 is a special case, equals to 30). Lower values of workFactor reduce the amount of effort the standard algorithm will expend before resorting to the fallback in case the input data is highly repetitive, which is slower but always behaves reasonably, both algorithms generate the same output.

`BZ2_bzCompress` #

int BZ2_bzCompress ( bz_stream *strm, int action );

strm->next_in should point at the data to be compressed, and strm->avail_in should indicate how many bytes the library may read. strm->next_out and strm->avail_out should be defined in a similar way. The action value can be BZ_RUN (start compression), BZ_FINISH (remember current value of next_in, compress from next_in to next_out as much as possible, but do not accept any more input, next state - FINISHING; in FINISHING state, if stream end is reached, the next state is IDLE, otherwise is FINISHING) and BZ_FLUSH (in RUNNING state, same as BZ_FINISH, next state - FLUSHING; in FLUSHING state if all i/o was used, next state is RUNNING, otherwise next step is FLUSHING, other actions in FLUSHING state are illegal).

If the data you want to compress fits into your input buffer all at once, you can skip the calls of BZ2_bzCompress ( ..., BZ_RUN ) and just do the BZ2_bzCompress ( ..., BZ_FINISH ) calls.

`BZ2_bzCompressEnd` #

int BZ2_bzCompressEnd ( bz_stream *strm );

Releases memory.

`BZ2_bzDecompressInit` #

int BZ2_bzDecompressInit ( bz_stream *strm, int verbosity, int small );

If small is nonzero, the library will use an alternative decompression algorithm which uses less memory but at the cost of decompressing more slowly.

`BZ2_bzDecompress` #

int BZ2_bzDecompress ( bz_stream *strm );

You should provide input and remove output as described above, and repeatedly call BZ2_bzDecompress until BZ_STREAM_END is returned.

`BZ2_bzDecompressEnd` #

int BZ2_bzDecompressEnd ( bz_stream *strm );

Releases memory.

High-level interface #

All of the functions take an int* first argument, bzerror. After each call, bzerror should be consulted first, and only then the return value can be consulted.
If bzerror indicates error, immediately call BZ2_bzReadClose/BZ2_bzWriteClose to free up all resources.

`BZ2_bzReadOpen` #

typedef void BZFILE;

BZFILE *BZ2_bzReadOpen( int *bzerror, FILE *f,
                        int verbosity, int small,
                        void *unused, int nUnused );

For reasons explained below, BZ2_bzRead will decompress the nUnused bytes starting at unused, before starting to read from the file f. At most BZ_MAX_UNUSED bytes may be supplied like this. If this facility is not required, you should pass NULL and 0 for unused and nUnused respectively.

`BZ2_bzRead` #

int BZ2_bzRead ( int *bzerror, BZFILE *b, void *buf, int len );

Reads up to len (uncompressed) bytes from the compressed file b into the buffer buf.

Internally, BZ2_bzRead copies data from the compressed file in chunks of size BZ_MAX_UNUSED bytes before decompressing it. If the file contains more bytes than strictly needed to reach the logical end-of-stream, BZ2_bzRead will almost certainly read some of the trailing data before signalling BZ_SEQUENCE_END. To collect the read but unused data once BZ_SEQUENCE_END has appeared, call BZ2_bzReadGetUnused immediately before BZ2_bzReadClose.

`BZ2_bzReadGetUnused` #

void BZ2_bzReadGetUnused( int* bzerror, BZFILE *b,
                          void** unused, int* nUnused );

`BZ2_bzReadClose` #

void BZ2_bzReadClose ( int *bzerror, BZFILE *b );

Doesn't close the b handle, only frees all the allocated resources.

`BZ2_bzWriteOpen` #

BZFILE *BZ2_bzWriteOpen( int *bzerror, FILE *f,
                         int blockSize100k, int verbosity,
                         int workFactor );

`BZ2_bzWrite` #

void BZ2_bzWrite ( int *bzerror, BZFILE *b, void *buf, int len );

`BZ2_bzWriteClose` #

void BZ2_bzWriteClose( int *bzerror, BZFILE* f,
                       int abandon,
                       unsigned int* nbytes_in,
                       unsigned int* nbytes_out );

void BZ2_bzWriteClose64( int *bzerror, BZFILE* f,
                         int abandon,
                         unsigned int* nbytes_in_lo32,
                         unsigned int* nbytes_in_hi32,
                         unsigned int* nbytes_out_lo32,
                         unsigned int* nbytes_out_hi32 );

Compresses and flushes to the compressed file all data so far supplied by BZ2_bzWrite. Subsequent calls to BZ2_bzWrite are illegal. All memory associated with the compressed file b is released. fflush is called on the compressed file, but it is not fclose'd. If BZ2_bzWriteClose is called to clean up after an error, the only action is to release the memory. This behavior can be forced to happen by passing a nonzero value to abandon.

Utility functions #

`BZ2_bzBuffToBuffCompress` #

int BZ2_bzBuffToBuffCompress( char*         dest,
                              unsigned int* destLen,
                              char*         source,
                              unsigned int  sourceLen,
                              int           blockSize100k,
                              int           verbosity,
                              int           workFactor );

Attempts to compress the data from source to dest.

Allocate an output buffer of size 1% larger than the uncompressed data, plus six hundred extra bytes.

`BZ2_bzBuffToBuffDecompress` #

int BZ2_bzBuffToBuffDecompress( char*         dest,
                                unsigned int* destLen,
                                char*         source,
                                unsigned int  sourceLen,
                                int           small,
                                int           verbosity );

zlib compatibility functions #

typedef void BZFILE;

const char * BZ2_bzlibVersion ( void );
BZFILE * BZ2_bzopen  ( const char *path, const char *mode );
BZFILE * BZ2_bzdopen ( int        fd,    const char *mode );
int BZ2_bzread  ( BZFILE* b, void* buf, int len );
int BZ2_bzwrite ( BZFILE* b, void* buf, int len );
int  BZ2_bzflush ( BZFILE* b );
void BZ2_bzclose ( BZFILE* b );
const char * BZ2_bzerror ( BZFILE *b, int *errnum );

Using the library in a stdio-free environment #

Compile with the preprocessor BZ_NO_STDIO symbol defined.

Critical error handling #

In a stdio-free environment, assertion fail results in a call to the following function:

extern void bz_internal_error ( int errcode );

Otherwise, stdio is used to indicate the error. The program cannot be recovered after a critical error.

Windows #

For making a DLL: using Visual C++ 5.0, open the project file libbz2.dsp, and build. That's it. Then test it with the sample program dlltest.c, which also has a project file: dlltest.dsp. If you don't use VC++, you might need to define _WIN32.

For a makefile for Visual C, take a look at makefile.msc. If you compile bzip2 on Windows, set BZ_UNIX to 0 and BZ_LCCWIN32 to 1 in the file bzip2.c.

Miscellanea #

Compressed file format #

Didn't change since bzip2-0.1 for compatibility.

Unnecessary complexities of the file format:

Run-length encoder is irrelevant, originally its purpose was to protect the sorting algorithm from the very worst case input: a string of repeated symbols. But the algorithm steps Q6a and Q6b of Burrows-Wheeler original report show how repeats can be handled without difficulty in block sorting.
The randomisation mechanism doesn't really need to be there. Udi Manber and Gene Myers published a suffix array construction algorithm a few years back, which can be employed to sort any block, no matter how repetitive, in O(N log N) time. There is also a derivative O(N(log N)^2) algorithm by Kunihiko Sadakane, but it didn't prove more effective on practice than Manber-Myers, so bzip2 falls back to this algorithm only if the original one has difficulties.
The compressed file format was never designed to be handled by a library, and I have had to jump though some hoops to produce an efficient implementation of decompression. Much of this trouble could be avoided if the size of each compressed block would be recorded in the datastream.
Adler-32 checksum would be faster than CRC32.
It would be nice to be able to do random access into files. This would require a careful redesign of the format.

Portability issues #

There are a few __inline__ in the code, which should be visible only by gcc. If you use another compiler and they are still visible, compile bzip2 with the flag -D__inline__=
If you still have issues, build with BZ_STRICT_ANSI defined. This is dangerous though, because various fs checks are disabled.
If you create a public binary, consider linking it statically (gcc -static).

Reporting bugs #

Before reporting a bug,

Recompile the program with no optimisation or use another compiler. Or reduce high optimisation to -O2 and enable -fomit-frame-pointer, -fno-strength-reduce. Don't use -funroll-loops.
If bzip2 crashes randomly, and the crashes are not repeatable, you may have a flaky memory subsystem. Try a different machine of the same type.
This isn't really a bug, but ... If bzip2 tells you your file is corrupted on decompression, and you obtained the file via FTP, there is a possibility that you forgot to tell FTP to do a binary mode transfer. That absolutely will cause the file to be non-decompressible. You'll have to transfer it again.
If you've incorporated libbzip2 into your own program and are getting problems, please, please, please, check that the parameters you are passing in calls to the library, are correct.