Monday, January 06, 2014

cmix

I spent most of my christmas break working on a data compression project. It is a lossless compression program called "cmix" written in C++. It is aimed entirely at optimizing compression ratio (at the cost of high CPU/memory usage). I have the very ambitious goal of eventually surpassing the compression performance of PAQ8. PAQ8 is currently state of the art in most compression benchmarks. My master's thesis was related to PAQ8. Realistically, it is unlikely that I will be able to surpass PAQ8 (given the amount of work/expertise that has gone into its development).

As I develop the program, I have been tracking my progress on some standard compression benchmarks. You can see my progress at https://sites.google.com/site/cmixbenchmark/. This tracks two compression benchmarks: enwik8 (Wikipedia) and calgary corpus (a mixture of files). The tables show the compression size of some other compression programs for reference. The time line shows the performance of cmix over time, as I add new features.

cmix currently compresses the calgary corpus to 667,558 bytes. The record is 580,170 bytes by a variant of PAQ8 called PAQAR. There have been hundreds of versions of PAQ, but here are some major milestones in its version history:

Version Calgary Date
P5 992,902 2000
PAQ1 716,704 2002
PAQ2 702,382 May 2003
PAQ3 696,616 Sep 2003
PAQ4 672,134 Nov 2003
PAQ5 661,811 Dec 2003
PAQ6 648,892 Jan 2004
PAQ7 611,684 Dec 2005
PAQ8L 595,586 Mar 2007

So cmix currently does slightly better than PAQ4. cmix shares a similar architecture to PAQ. It is a streaming algorithm - it compresses data by predicting the probability of each bit given the previous bits in the input. It combines the predictions of a large number of independent models into a single prediction for each bit (i.e. ensemble learning). The main developer of PAQ wrote an online book about data compression: Data Compression Explained. It is an excellent reference - I have been following it closely in the development of cmix.

No comments: