As I develop the program, I have been tracking my progress on some standard compression benchmarks. You can see my progress at https://sites.google.com/site/cmixbenchmark/. This tracks two compression benchmarks: enwik8 (Wikipedia) and calgary corpus (a mixture of files). The tables show the compression size of some other compression programs for reference. The time line shows the performance of cmix over time, as I add new features.
cmix currently compresses the calgary corpus to 667,558 bytes. The record is 580,170 bytes by a variant of PAQ8 called PAQAR. There have been hundreds of versions of PAQ, but here are some major milestones in its version history:
Version | Calgary | Date |
---|---|---|
P5 | 992,902 | 2000 |
PAQ1 | 716,704 | 2002 |
PAQ2 | 702,382 | May 2003 |
PAQ3 | 696,616 | Sep 2003 |
PAQ4 | 672,134 | Nov 2003 |
PAQ5 | 661,811 | Dec 2003 |
PAQ6 | 648,892 | Jan 2004 |
PAQ7 | 611,684 | Dec 2005 |
PAQ8L | 595,586 | Mar 2007 |
So cmix currently does slightly better than PAQ4. cmix shares a similar architecture to PAQ. It is a streaming algorithm - it compresses data by predicting the probability of each bit given the previous bits in the input. It combines the predictions of a large number of independent models into a single prediction for each bit (i.e. ensemble learning). The main developer of PAQ wrote an online book about data compression: Data Compression Explained. It is an excellent reference - I have been following it closely in the development of cmix.
No comments:
Post a Comment