Start day: 09/13/2022
A dependency of: app-admin/keepassxc-2.7.1-r1
Has no dependencies
ebuild and meta-info #
License: bzip2
Homepage: https://sourceware.org/bzip2/
Source code: https://sourceware.org/pub/${PN}/${P}.tar.gz
Patches (reviewed at the end of the article):
PATCHES=(
"${FILESDIR}"/${PN}-1.0.4-makefile-CFLAGS.patch
"${FILESDIR}"/${PN}-1.0.8-saneso.patch
"${FILESDIR}"/${PN}-1.0.4-man-links.patch #172986
"${FILESDIR}"/${PN}-1.0.6-progress.patch
"${FILESDIR}"/${PN}-1.0.3-no-test.patch
"${FILESDIR}"/${PN}-1.0.8-mingw.patch #393573
"${FILESDIR}"/${PN}-1.0.8-out-of-tree-build.patch
)
Flags: +split-usr -static -static-libs -verify-sig
Lines of code: ~7336 (including only *.c files)
Requisites #
- (x) gentoo wiki bzip2 page
- ( ) wikipedia
- ( ) Burrows-Wheeler algorithm
- (x) bzip2 man page - https://linux.die.net/man/1/bzip2
- (x) bzip2 manual
- (x) What is bzip2? - https://sourceware.org/bzip2/
Already familiar with #
- Huffman coding
Description #
bzip2 compresses files using the Burrows-Wheeler block sorting text compression algorithm, and Huffman coding. Compression is generally considerably better than that achieved by more conventional LZ77/LZ78-based compressors, and approaches the performance of the PPM family of statistical compressors. It was developed by Julian Seward and is maintained by Mark Wielaard and Micah Snyder.
bzip2 compresses only single files and is not a file archiver.
History #
Seward made the first public release of bzip2, version 0.15, in July 1996. The compressor's stability and popularity grew over the next several years, and Seward released version 1.0 in late 2000. Following a nine-year hiatus in updates after the 2010 release, Federico Mena accepted maintainership of the bzip2 project on 4 June 2019. Since June 2021, the maintainer has been Micah Snyder.
18 July 1996 (version 0.15)
25 August 1996 (version 0.21)
7 August 1997 (bzip2, version 0.1)
29 August 1997 (bzip2, version 0.1pl2)
23 August 1998 (bzip2, version 0.9.0)
8 June 1999 (bzip2, version 0.9.5)
4 September 1999 (bzip2, version 0.9.5d)
5 May 2000 (bzip2, version 1.0pre8)
30 December 2001 (bzip2, version 1.0.2pre1)
15 February 2005 (bzip2, version 1.0.3)
20 December 2006 (bzip2, version 1.0.4)
10 December 2007 (bzip2, version 1.0.5)
6 September 2010 (bzip2, version 1.0.6)
27 June 2019 (bzip2, version 1.0.7)
13 July 2019 (bzip2, version 1.0.8)
Contact: bzip2-devel@sourceware.org.
Implementation #
Bzip2 uses several layers of compression techniques stacked on top of each other, which occur in the following order during compression and the reverse order during decompression:
- Run-length encoding (RLE) of initial data.
- Burrows–Wheeler transform (BWT), or block sorting.
- Move-to-front (MTF) transform.
- Run-length encoding (RLE) of MTF result.
- Huffman coding.
- Selection between multiple Huffman tables.
- Unary base-1 encoding of Huffman table selection.
- Delta encoding (Δ) of Huffman-code bit lengths.
- Sparse bit array showing which symbols are used.
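To make one of these stages concrete, here is a minimal sketch of a byte-wise move-to-front encoder. It is illustrative only and is not bzip2's own implementation (bzip2 applies MTF to the BWT output over a reduced symbol set):
#include <string.h>
/* Move-to-front: each input byte is replaced by its current position in a
   recency-ordered table, then moved to the table's front.  Runs of equal
   symbols turn into runs of zeroes, which the following RLE and Huffman
   stages compress well. */
void mtf_encode(const unsigned char *in, unsigned char *out, int n)
{
    unsigned char table[256];
    int i;
    for (i = 0; i < 256; i++) table[i] = (unsigned char)i;

    for (i = 0; i < n; i++) {
        unsigned char c = in[i];
        int j = 0;
        while (table[j] != c) j++;             /* current rank of the symbol   */
        out[i] = (unsigned char)j;             /* emit the rank as the code    */
        memmove(table + 1, table, (size_t)j);  /* shift the prefix down by one */
        table[0] = c;                          /* move the symbol to the front */
    }
}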
License #
License: bzip2.
[Break: read bzip2].
Julian Seward, jseward@acm.org
bzip2/libbzip2 version 1.0.8 of 13 July 2019
Makefile #
Objects:
OBJS= blocksort.o \
huffman.o \
crctable.o \
randtable.o \
compress.o \
decompress.o \
bzlib.o
Options:
all: libbz2.a bzip2 bzip2recover test
# ...
check: test
install:
clean:
# an option for each obj
distclean:
dist:
# rebuilding manual from source
manual:
manual.ps:
manual.pdf:
manual.html:
Default installation paths: /usr/local/bin, /usr/local/lib, /usr/local/man and /usr/local/include.
Makefile-libbz2_so #
This Makefile builds a shared version of the library, libbz2.so.1.0.8, with soname libbz2.so.1.0, at least on x86-Linux (RedHat 7.2), with gcc-2.96 20000731 (Red Hat Linux 7.1 2.96-98).
all: $(OBJS)
$(CC) -shared -Wl,-soname -Wl,libbz2.so.1.0 -o libbz2.so.1.0.8 $(OBJS)
$(CC) $(CFLAGS) -o bzip2-shared bzip2.c libbz2.so.1.0.8
rm -f libbz2.so.1.0
ln -s libbz2.so.1.0.8 libbz2.so.1.0
Build advice #
I suggest you also build using the normal Makefile, since that conducts a self-test. A second reason to prefer the version statically linked to the library is that, on x86 platforms, building shared objects makes a valuable register (%ebx) unavailable to gcc, resulting in a slowdown of 10%-20%, at least for bzip2.
For Large File (2,147,483,648 (2^31) bytes or above) support, the -D_FILE_OFFSET_BITS=64 option was added to the Makefile. Many older OSs can't handle files above this size, but many newer ones can. If you get compilation errors which you think are related to large file support, try removing the above define from the Makefile.
Xml stuff #
The script xmlproc.sh takes an XML file as input and processes it to create .pdf, .html or .ps (PostScript) output. It uses format.pl, a Perl script that formats <pre> blocks nicely and adds CDATA tags so writers do not have to use escapes such as &lt;.
Usage:
# Validate
./xmlproc.sh -v manual.xml
./xmlproc.sh -html manual.xml
./xmlproc.sh -pdf manual.xml
./xmlproc.sh -ps manual.xml
Manual #
The manual is very detailed, so I'll only give brief notes on it here.
How to use bzip2 #
Name #
- bzip2, bunzip2 - a block-sorting file compressor, v1.0.8
- bzcat - decompresses files to stdout
- bzip2recover - recovers data from damaged bzip2 files
Synopsis #
- bzip2 [ -cdfkqstvzVL123456789 ] [ filenames ... ]
- bunzip2 [ -fkvsVL ] [ filenames ... ]
- bzcat [ -s ] [ filenames ... ]
- bzip2recover filename
- -f - overwrite existing files
- -d - decompress, an alternative to bunzip2; recognised endings: .bz2, .bz, .tbz2 or .tbz
- -t - integrity testing of concatenated compressed files
- -c - compress/decompress from/to stdin/stdout
- bzcat = bzip2 -dc
- bzip2recover - try to recover data from damaged files
- -z - compress regardless of name
- -k - keep input files
- -s - modified algorithm for reduction of memory usage (use this if memory < ~8 MB)
- -q - quiet
- -v - verbose
- -L, -V - licence, version
- -1 (--fast) ... -9 (--best) - sets the block size to 100..900 kB
- -- - treats all subsequent arguments as file names
- -vvvv - monitor the process in great detail
- Argument supply priority: the BZIP2 and BZIP environment variables, then the command line
- Files smaller than ~100 bytes tend to get larger output because of constant overhead; random data is coded at about 8.05 bits per byte, giving an expansion of around 0.5%.
- bzip2 uses 32-bit CRCs to make sure that the decompressed version of a file is identical to the original.
- Return values: 0 - success, 1 - environmental problems, 2 - corrupt compressed file, 3 - internal consistency error (i.e. a bug)
Memory Management #
Most of the compression comes from the first two or three hundred k of block size, a fact worth bearing in mind when using bzip2 on small machines. The default block size is 900 kb.
Recovering Data from Damaged Files #
The compressed representation of each block is delimited by a 48-bit pattern, which makes it possible to find the block boundaries with reasonable certainty. Each block also carries its own 32-bit CRC, so damaged blocks can be distinguished from undamaged ones.
bzip2recover is a simple program whose purpose is to search for blocks in .bz2 files, and write each block out into its own .bz2 file. You can then use bzip2 -t to test the integrity of the resulting files, and decompress those which are undamaged.
Performance Notes #
The sorting phase of compression gathers together similar strings in the file. Because of this, files containing very long runs of repeated symbols, like "aabaabaabaab ..." (repeated several hundred times) may compress more slowly than normal (average-worst case ratio: ~1:10 after version 0.9.5).
Programming with libbzip2 #
All externally visible symbols have names beginning with BZ2_. To use any part of the library, you need to #include <bzlib.h> in your sources.
Low-level usage summary #
Compressing/decompressing data in memory. Thread-safe and doesn't require stdio, which is convenient for embedded applications. Six routines make up the low-level interface: BZ2_bzCompressInit, BZ2_bzCompress, and BZ2_bzCompressEnd for compression, and a corresponding trio BZ2_bzDecompressInit, BZ2_bzDecompress and BZ2_bzDecompressEnd for decompression. The *Init functions allocate memory for compression/decompression and do other initialisations, whilst the *End functions close down operations and release memory. The real work is done by BZ2_bzCompress and BZ2_bzDecompress. These compress and decompress data from a user-supplied input buffer to a user-supplied output buffer. These buffers can be any size.
High-level usage summary #
For reading files, BZ2_bzReadOpen, BZ2_bzRead, BZ2_bzReadClose and BZ2_bzReadGetUnused are supplied. For writing files, BZ2_bzWriteOpen, BZ2_bzWrite and BZ2_bzWriteClose are available. The high-level library is also thread-safe since no global variables are in use, unless errno is required, in which case an external thread-safe library will have to be supplied.
Utility functions summary #
For very simple needs, BZ2_bzBuffToBuffCompress and BZ2_bzBuffToBuffDecompress are provided. These compress data in memory from one buffer to another buffer in a single function call.
Yoshioka Tsuneo (tsuneo@rr.iij4u.or.jp) has contributed some functions to give better zlib compatibility: BZ2_bzopen, BZ2_bzread, BZ2_bzwrite, BZ2_bzflush, BZ2_bzclose, BZ2_bzerror and BZ2_bzlibVersion. They are not documented yet, though.
Error handling #
bzlib.h includes all the necessary definitions. The following definitions indicate success: BZ_OK, BZ_RUN_OK, BZ_FLUSH_OK, BZ_FINISH_OK, BZ_STREAM_END. The following definitions are used to indicate failure: BZ_CONFIG_ERROR (the sizes of char, short and int are not 1, 2 and 4 respectively), BZ_SEQUENCE_ERROR (functions called in an incorrect sequence), BZ_PARAM_ERROR, BZ_MEM_ERROR, BZ_DATA_ERROR (data integrity error), BZ_DATA_ERROR_MAGIC (incorrect magic bytes), BZ_IO_ERROR, BZ_UNEXPECTED_EOF (returned by BZ2_bzRead when the compressed file finishes before the logical end of stream is detected), BZ_OUTBUFF_FULL.
Low-level interface #
BZ2_bzCompressInit #
typedef struct {
char *next_in;
unsigned int avail_in;
unsigned int total_in_lo32;
unsigned int total_in_hi32;
char *next_out;
unsigned int avail_out;
unsigned int total_out_lo32;
unsigned int total_out_hi32;
void *state;
void *(*bzalloc)(void *,int,int);
void (*bzfree)(void *,void *);
void *opaque;
} bz_stream;
bz_stream holds all data pertaining to the compression activity; its fields comprise the entirety of the user-visible data. state is a pointer to the private data structures required for compression. Custom memory allocators are supported via the fields bzalloc, bzfree, and opaque. The value opaque is passed as the first argument to all calls to bzalloc and bzfree, but is otherwise ignored by the library. The call bzalloc ( opaque, n, m ) is expected to return a pointer p to n * m bytes of memory, and bzfree ( opaque, p ) should free that memory. If custom allocators are not needed, set these three fields to NULL and the library will use the standard malloc/free. total_in_lo32, total_in_hi32, total_out_lo32 and total_out_hi32 are used to inform the caller of the total amount of data passed.
int BZ2_bzCompressInit ( bz_stream *strm,
int blockSize100k,
int verbosity,
int workFactor );
Parameter blockSize100k specifies the block size (1..9). verbosity ranges over 0..4 (unless the library is compiled with -DBZ_NO_STDIO); 0 is silent. workFactor defaults to 30 and ranges over 0..250 (0 is a special case equal to 30). Lower values of workFactor reduce the amount of effort the standard algorithm will expend before resorting to the fallback algorithm when the input data is highly repetitive; the fallback is slower but always behaves reasonably, and both algorithms generate the same output.
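The custom-allocator hooks described above can be wired in as in the following sketch; my_bzalloc and my_bzfree are hypothetical names used for illustration:
#include <stdlib.h>
#include <bzlib.h>
/* bzalloc(opaque, n, m) must return a pointer to n * m bytes;
   bzfree(opaque, p) must free it.  opaque is passed through untouched. */
static void *my_bzalloc(void *opaque, int n, int m)
{
    (void)opaque;
    return malloc((size_t)n * (size_t)m);
}

static void my_bzfree(void *opaque, void *p)
{
    (void)opaque;
    free(p);
}

int init_with_custom_allocator(bz_stream *strm)
{
    strm->bzalloc = my_bzalloc;
    strm->bzfree  = my_bzfree;
    strm->opaque  = NULL;   /* nothing extra to pass through */
    return BZ2_bzCompressInit(strm, 9 /*blockSize100k*/, 0 /*verbosity*/, 30 /*workFactor*/);
}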
BZ2_bzCompress #
int BZ2_bzCompress ( bz_stream *strm, int action );
strm->next_in should point at the data to be compressed, and strm->avail_in should indicate how many bytes the library may read. strm->next_out and strm->avail_out should be set up in a similar way. The action value can be:
- BZ_RUN - start compression.
- BZ_FINISH - remember the current value of next_in, compress from next_in to next_out as much as possible, but do not accept any more input; the next state is FINISHING. In the FINISHING state, if the stream end is reached the next state is IDLE, otherwise it remains FINISHING.
- BZ_FLUSH - in the RUNNING state, same as BZ_FINISH, but the next state is FLUSHING. In the FLUSHING state, if all the existing input has been used and all compressed output has been removed, the next state is RUNNING, otherwise it remains FLUSHING; other actions are illegal in the FLUSHING state.
If the data you want to compress fits into your input buffer all at once, you can skip the BZ2_bzCompress ( ..., BZ_RUN ) calls and just do the BZ2_bzCompress ( ..., BZ_FINISH ) calls.
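As a concrete illustration of this BZ_FINISH-only pattern, here is a minimal sketch (function and buffer names are mine, not the manual's) that compresses a whole in-memory buffer with the low-level interface:
#include <string.h>
#include <bzlib.h>
int compress_once(char *in, unsigned int inLen, char *out, unsigned int *outLen)
{
    bz_stream strm;
    memset(&strm, 0, sizeof strm);   /* NULL bzalloc/bzfree/opaque: default allocators */

    int rc = BZ2_bzCompressInit(&strm, 9, 0 /*verbosity*/, 30 /*workFactor*/);
    if (rc != BZ_OK) return rc;

    strm.next_in   = in;
    strm.avail_in  = inLen;
    strm.next_out  = out;
    strm.avail_out = *outLen;

    /* All input is already available, so BZ_RUN is skipped and a single
       BZ_FINISH call drives the whole compression. */
    rc = BZ2_bzCompress(&strm, BZ_FINISH);
    if (rc != BZ_STREAM_END) {       /* output buffer too small, or another error */
        BZ2_bzCompressEnd(&strm);
        return (rc == BZ_FINISH_OK) ? BZ_OUTBUFF_FULL : rc;
    }

    *outLen = strm.total_out_lo32;   /* compressed size (low 32 bits) */
    BZ2_bzCompressEnd(&strm);
    return BZ_OK;
}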
BZ2_bzCompressEnd #
int BZ2_bzCompressEnd ( bz_stream *strm );
Releases memory.
BZ2_bzDecompressInit #
int BZ2_bzDecompressInit ( bz_stream *strm, int verbosity, int small );
If small is nonzero, the library will use an alternative decompression algorithm which uses less memory but decompresses more slowly.
BZ2_bzDecompress #
int BZ2_bzDecompress ( bz_stream *strm );
You should provide input and remove output as described above, and repeatedly call BZ2_bzDecompress until BZ_STREAM_END is returned.
BZ2_bzDecompressEnd #
int BZ2_bzDecompressEnd ( bz_stream *strm );
Releases memory.
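A minimal sketch of the decompression loop just described; read_more and consume are hypothetical application-supplied helpers, not bzlib functions:
#include <string.h>
#include <bzlib.h>
extern unsigned int read_more(char *buf, unsigned int cap);  /* 0 means no more input */
extern void consume(const char *buf, unsigned int n);        /* use decompressed bytes */

int decompress_stream(void)
{
    char inbuf[4096], outbuf[4096];
    bz_stream strm;
    memset(&strm, 0, sizeof strm);            /* default allocators */

    int rc = BZ2_bzDecompressInit(&strm, 0 /*verbosity*/, 0 /*small*/);
    if (rc != BZ_OK) return rc;

    do {
        if (strm.avail_in == 0) {             /* provide more input */
            strm.avail_in = read_more(inbuf, sizeof inbuf);
            strm.next_in  = inbuf;
            if (strm.avail_in == 0) { rc = BZ_UNEXPECTED_EOF; break; }  /* truncated */
        }
        strm.next_out  = outbuf;              /* remove output after every call */
        strm.avail_out = sizeof outbuf;

        rc = BZ2_bzDecompress(&strm);
        if (rc != BZ_OK && rc != BZ_STREAM_END) break;

        consume(outbuf, sizeof outbuf - strm.avail_out);
    } while (rc != BZ_STREAM_END);

    BZ2_bzDecompressEnd(&strm);
    return rc == BZ_STREAM_END ? BZ_OK : rc;
}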
High-level interface #
- All of the functions take an int* first argument, bzerror. After each call, bzerror should be consulted first, and only then the return value.
- If bzerror indicates an error, immediately call BZ2_bzReadClose / BZ2_bzWriteClose to free up all resources.
BZ2_bzReadOpen #
typedef void BZFILE;
BZFILE *BZ2_bzReadOpen( int *bzerror, FILE *f,
int verbosity, int small,
void *unused, int nUnused );
For reasons explained below, BZ2_bzRead will decompress the nUnused bytes starting at unused, before starting to read from the file f. At most BZ_MAX_UNUSED bytes may be supplied like this. If this facility is not required, pass NULL and 0 for unused and nUnused respectively.
BZ2_bzRead #
int BZ2_bzRead ( int *bzerror, BZFILE *b, void *buf, int len );
Reads up to len (uncompressed) bytes from the compressed file b into the buffer buf.
Internally, BZ2_bzRead copies data from the compressed file in chunks of size BZ_MAX_UNUSED bytes before decompressing it. If the file contains more bytes than strictly needed to reach the logical end-of-stream, BZ2_bzRead will almost certainly read some of the trailing data before signalling BZ_STREAM_END. To collect the read-but-unused data once BZ_STREAM_END has appeared, call BZ2_bzReadGetUnused immediately before BZ2_bzReadClose.
BZ2_bzReadGetUnused #
void BZ2_bzReadGetUnused( int* bzerror, BZFILE *b,
void** unused, int* nUnused );
BZ2_bzReadClose #
void BZ2_bzReadClose ( int *bzerror, BZFILE *b );
Releases all resources allocated for b; the underlying file handle is not fclose'd.
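Putting the read-side calls together, here is a minimal sketch of decompressing a .bz2 file to stdout (error handling abbreviated; this is not the manual's own sample program):
#include <stdio.h>
#include <bzlib.h>
int cat_bz2_file(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f) return 1;

    int bzerror;
    BZFILE *b = BZ2_bzReadOpen(&bzerror, f, 0 /*verbosity*/, 0 /*small*/, NULL, 0);
    if (bzerror != BZ_OK) {
        BZ2_bzReadClose(&bzerror, b);
        fclose(f);
        return 1;
    }

    char buf[4096];
    while (bzerror == BZ_OK) {
        int n = BZ2_bzRead(&bzerror, b, buf, sizeof buf);
        if (bzerror == BZ_OK || bzerror == BZ_STREAM_END)
            fwrite(buf, 1, (size_t)n, stdout);   /* use the n decompressed bytes */
    }

    int ok = (bzerror == BZ_STREAM_END);
    BZ2_bzReadClose(&bzerror, b);
    fclose(f);                                   /* bzlib does not fclose f */
    return ok ? 0 : 1;
}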
BZ2_bzWriteOpen #
BZFILE *BZ2_bzWriteOpen( int *bzerror, FILE *f,
int blockSize100k, int verbosity,
int workFactor );
BZ2_bzWrite #
void BZ2_bzWrite ( int *bzerror, BZFILE *b, void *buf, int len );
BZ2_bzWriteClose #
void BZ2_bzWriteClose( int *bzerror, BZFILE* f,
int abandon,
unsigned int* nbytes_in,
unsigned int* nbytes_out );
void BZ2_bzWriteClose64( int *bzerror, BZFILE* f,
int abandon,
unsigned int* nbytes_in_lo32,
unsigned int* nbytes_in_hi32,
unsigned int* nbytes_out_lo32,
unsigned int* nbytes_out_hi32 );
Compresses and flushes to the compressed file all data so far supplied by BZ2_bzWrite. Subsequent calls to BZ2_bzWrite are illegal. All memory associated with the compressed file b is released. fflush is called on the compressed file, but it is not fclose'd. If BZ2_bzWriteClose is called to clean up after an error, the only action is to release the memory; this behaviour can also be forced by passing a nonzero value for abandon.
Utility functions #
BZ2_bzBuffToBuffCompress #
int BZ2_bzBuffToBuffCompress( char* dest,
unsigned int* destLen,
char* source,
unsigned int sourceLen,
int blockSize100k,
int verbosity,
int workFactor );
Attempts to compress the data from source to dest. Allocate an output buffer of size 1% larger than the uncompressed data, plus six hundred extra bytes.
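A sketch applying that sizing rule with BZ2_bzBuffToBuffCompress; the wrapper function and its name are mine:
#include <stdlib.h>
#include <bzlib.h>
/* Returns a freshly malloc'd buffer with the compressed data and stores its
   length in *dstLen, or NULL on failure. */
char *compress_buffer(char *src, unsigned int srcLen, unsigned int *dstLen)
{
    *dstLen = srcLen + srcLen / 100 + 600;        /* 1% larger plus 600 bytes */
    char *dst = malloc(*dstLen);
    if (!dst) return NULL;

    int rc = BZ2_bzBuffToBuffCompress(dst, dstLen, src, srcLen,
                                      9 /*blockSize100k*/, 0 /*verbosity*/,
                                      0 /*workFactor: 0 means the default, 30*/);
    if (rc != BZ_OK) {
        free(dst);
        return NULL;
    }
    return dst;   /* *dstLen now holds the compressed size */
}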
BZ2_bzBuffToBuffDecompress #
int BZ2_bzBuffToBuffDecompress( char* dest,
unsigned int* destLen,
char* source,
unsigned int sourceLen,
int small,
int verbosity );
zlib compatibility functions #
typedef void BZFILE;
const char * BZ2_bzlibVersion ( void );
BZFILE * BZ2_bzopen ( const char *path, const char *mode );
BZFILE * BZ2_bzdopen ( int fd, const char *mode );
int BZ2_bzread ( BZFILE* b, void* buf, int len );
int BZ2_bzwrite ( BZFILE* b, void* buf, int len );
int BZ2_bzflush ( BZFILE* b );
void BZ2_bzclose ( BZFILE* b );
const char * BZ2_bzerror ( BZFILE *b, int *errnum );
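Since these wrappers are undocumented, here is only a hedged sketch of how they appear to be meant to be used, following zlib conventions (the "rb" mode string is an assumption based on those conventions):
#include <stdio.h>
#include <bzlib.h>
int zcat_bz2(const char *path)
{
    BZFILE *b = BZ2_bzopen(path, "rb");
    if (!b) return 1;

    char buf[4096];
    int n;
    while ((n = BZ2_bzread(b, buf, sizeof buf)) > 0)
        fwrite(buf, 1, (size_t)n, stdout);   /* n decompressed bytes */

    BZ2_bzclose(b);
    return n < 0 ? 1 : 0;                    /* negative means a read error */
}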
Using the library in a stdio-free environment #
Compile with the preprocessor symbol BZ_NO_STDIO defined.
Critical error handling #
In a stdio-free environment, an assertion failure results in a call to the following function:
extern void bz_internal_error ( int errcode );
Otherwise, stdio is used to report the error. The program cannot recover after a critical error.
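A possible way to supply bz_internal_error in a stdio-free build (a sketch; how the error code is recorded or reported is entirely up to the application):
/* Compiled together with libbz2 built with -DBZ_NO_STDIO. */
static volatile int last_bz_errcode;

void bz_internal_error ( int errcode )
{
    /* Record the code somewhere the application can inspect, then halt:
       the library cannot continue after a critical error. */
    last_bz_errcode = errcode;
    for (;;) { /* hang, reset the device, etc., as appropriate */ }
}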
Windows #
For making a DLL: using Visual C++ 5.0, open the project file libbz2.dsp and build. That's it. Then test it with the sample program dlltest.c, which also has a project file, dlltest.dsp. If you don't use VC++, you might need to define _WIN32.
For a makefile for Visual C, take a look at makefile.msc. If you compile bzip2 on Windows, set BZ_UNIX to 0 and BZ_LCCWIN32 to 1 in the file bzip2.c.
Miscellanea #
Compressed file format #
The format hasn't changed since bzip2-0.1, for compatibility reasons.
Unnecessary complexities of the file format:
- The run-length encoder is irrelevant; originally its purpose was to protect the sorting algorithm from the very worst case input, a string of repeated symbols. But steps Q6a and Q6b of the original Burrows-Wheeler report show how repeats can be handled without difficulty in block sorting.
- The randomisation mechanism doesn't really need to be there. Udi Manber and Gene Myers published a suffix array construction algorithm a few years back, which can be employed to sort any block, no matter how repetitive, in O(N log N) time. There is also a derivative O(N(log N)^2) algorithm by Kunihiko Sadakane, but in practice it didn't prove more effective than Manber-Myers, so bzip2 falls back to such an algorithm only when the original one has difficulties.
- The compressed file format was never designed to be handled by a library, and I have had to jump through some hoops to produce an efficient implementation of decompression. Much of this trouble could have been avoided if the size of each compressed block were recorded in the data stream.
- An Adler-32 checksum would be faster to compute than CRC32.
- It would be nice to be able to do random access into files. This would require a careful redesign of the format.
Portability issues #
- There are a few __inline__ directives in the code, which should be visible only to gcc. If you use another compiler and they cause problems, compile bzip2 with the flag -D__inline__= .
- If you still have issues, build with BZ_STRICT_ANSI defined. This is dangerous though, because various filesystem checks are disabled.
- If you create a public binary, consider linking it statically (gcc -static).
Reporting bugs #
Before reporting a bug:
- Recompile the program with no optimisation, or use another compiler. Alternatively, reduce high optimisation to -O2 and enable -fomit-frame-pointer and -fno-strength-reduce; don't use -funroll-loops.
. - If bzip2 crashes randomly, and the crashes are not repeatable, you may have a flaky memory subsystem. Try a different machine of the same type.
- This isn't really a bug, but ... If bzip2 tells you your file is corrupted on decompression, and you obtained the file via FTP, there is a possibility that you forgot to tell FTP to do a binary mode transfer. That absolutely will cause the file to be non-decompressible. You'll have to transfer it again.
- If you've incorporated libbzip2 into your own program and are getting problems, please, please, please, check that the parameters you are passing in calls to the library, are correct.
Further reading #
- Michael Burrows and D. J. Wheeler: "A block-sorting lossless data compression algorithm" 10th May 1994. ftp://ftp.digital.com/pub/DEC/SRC/research-reports/SRC-124.ps.gz or at http://www.nzdl.org.
- Daniel S. Hirschberg and Debra A. LeLewer "Efficient Decoding of Prefix Codes"
- David J. Wheeler: Program bred3.c and accompanying document bred3.ps. This contains the idea behind the multi-table Huffman coding scheme. ftp://ftp.cl.cam.ac.uk/users/djw3/
- Jon L. Bentley and Robert Sedgewick "Fast Algorithms for Sorting and Searching Strings". www.cs.princeton.edu/~rs
The following paper gives valuable additional insights into the algorithm, but is not immediately the basis of any code used in bzip2.
- Peter Fenwick: Block Sorting Text Compression. ftp://ftp.cs.auckland.ac.nz/pub/peter-f/ACSC96paper.ps
Other related documents:
- Kunihiko Sadakane's sorting algorithm: http://naomi.is.s.u-tokyo.ac.jp/~sada/papers/Sada98b.ps.gz
- The Manber-Myers suffix array construction algorithm: http://www.cs.arizona.edu/people/gene/PAPERS/suffix.ps
Performance-related papers:
- Julian Seward: On the Performance of BWT Sorting Algorithms. 28-30 March 2000
- Julian Seward: Space-time Tradeoffs in the Inverse B-W Transform. 27-29 March 2001