Byron's Blog: cmix

Monday, January 06, 2014

cmix

I spent most of my christmas break working on a data compression project. It is a lossless compression program called "cmix" written in C++. It is aimed entirely at optimizing compression ratio (at the cost of high CPU/memory usage). I have the very ambitious goal of eventually surpassing the compression performance of PAQ8. PAQ8 is currently state of the art in most compression benchmarks. My master's thesis was related to PAQ8. Realistically, it is unlikely that I will be able to surpass PAQ8 (given the amount of work/expertise that has gone into its development).

As I develop the program, I have been tracking my progress on some standard compression benchmarks. You can see my progress at https://sites.google.com/site/cmixbenchmark/. This tracks two compression benchmarks: enwik8 (Wikipedia) and calgary corpus (a mixture of files). The tables show the compression size of some other compression programs for reference. The time line shows the performance of cmix over time, as I add new features.

cmix currently compresses the calgary corpus to 667,558 bytes. The record is 580,170 bytes by a variant of PAQ8 called PAQAR. There have been hundreds of versions of PAQ, but here are some major milestones in its version history:

Version	Calgary	Date
P5	992,902	2000
PAQ1	716,704	2002
PAQ2	702,382	May 2003
PAQ3	696,616	Sep 2003
PAQ4	672,134	Nov 2003
PAQ5	661,811	Dec 2003
PAQ6	648,892	Jan 2004
PAQ7	611,684	Dec 2005
PAQ8L	595,586	Mar 2007

So cmix currently does slightly better than PAQ4. cmix shares a similar architecture to PAQ. It is a streaming algorithm - it compresses data by predicting the probability of each bit given the previous bits in the input. It combines the predictions of a large number of independent models into a single prediction for each bit (i.e. ensemble learning). The main developer of PAQ wrote an online book about data compression: Data Compression Explained. It is an excellent reference - I have been following it closely in the development of cmix.

Monday, January 06, 2014

cmix

No comments:

Links

Blog Archive