Wednesday, March 07, 2018

cmix updates

There was recently a new winning entry (called phda) for the Hutter Prize. The Hutter Prize is a contest for compressing the first 100 MB of Wikipedia (called enwik8). Here is how all of the previous versions of cmix performed on enwik8. phda is incredible: it uses far less CPU and memory than cmix while achieving a similar compression ratio. Unfortunately it is closed source, so many of its implementation details are hidden. phda has pushed cmix v14 into second place on the Large Text Compression Benchmark.

cmix currently compresses enwik8 to 15,113,248 bytes, so I am guessing the next cmix release should make it back to first place. Unfortunately there is a bug that prevents cmix from compressing enwik9 properly, so I need to fix that before the v15 release. This bug is difficult to fix because I have only seen it happen when compressing enwik9, which takes over one week of CPU time for each debugging pass.

Last year I got funding for cmix through the AI Grant. This helped speed up progress by giving me access to more computational resources: I bought a new desktop and can now run many virtual machines simultaneously on Google Compute Engine. The new resources let me run cmix on the Lossless Photo Compression Benchmark, getting first place using over 6 months of CPU time!