Saturday, February 22, 2014

cmix update

cmix has now surpassed paq8l on both the Calgary corpus and enwik8, making it probably one of the most powerful general-purpose compression programs. It runs at about the same speed as paq8l but uses much more memory, which gives it an advantage on large files such as enwik8.

cmix currently uses an ensemble of 112 independent predictors (paq8l uses 552). To mix the models, cmix uses a more extensive neural network than paq8l: a four-layer network with 528,496 neurons, compared to paq8l's three-layer network with 3,633 neurons.
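To give a sense of what "mixing" means here: cmix's actual mixer is a four-layer network, but the core idea is easier to see in a single mixing neuron. Below is a minimal sketch of PAQ-style logistic mixing (the class, its parameters, and the toy models are hypothetical illustrations, not cmix's actual code): each model's probability is "stretched" into the logit domain, combined with learned weights, "squashed" back into a probability, and the weights are updated online after each observed bit.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// One mixing neuron: combines model predictions in the logit domain and
// is trained online by gradient descent, as in the PAQ family of mixers.
// This is a simplified single-layer sketch, not cmix's multi-layer network.
class Mixer {
 public:
  explicit Mixer(size_t n, double lr = 0.02) : weights_(n, 0.0), lr_(lr) {}

  // Mix per-model probabilities (of the next bit being 1) into one estimate.
  double Mix(const std::vector<double>& probs) {
    inputs_.clear();
    double dot = 0.0;
    for (size_t i = 0; i < probs.size(); ++i) {
      double t = Stretch(probs[i]);  // logit ("stretch") of each prediction
      inputs_.push_back(t);
      dot += weights_[i] * t;
    }
    prediction_ = Squash(dot);  // map the weighted sum back to a probability
    return prediction_;
  }

  // After the actual bit is observed, nudge the weights toward the models
  // that predicted it well (an online logistic-regression update).
  void Update(int bit) {
    double error = bit - prediction_;
    for (size_t i = 0; i < weights_.size(); ++i) {
      weights_[i] += lr_ * error * inputs_[i];
    }
  }

 private:
  static double Stretch(double p) { return std::log(p / (1.0 - p)); }
  static double Squash(double x) { return 1.0 / (1.0 + std::exp(-x)); }

  std::vector<double> weights_, inputs_;
  double prediction_ = 0.5;
  double lr_;
};

int main() {
  // Toy example: three "models" with fixed opinions about the next bit;
  // on an all-ones stream, the mixer learns which model to trust.
  Mixer mixer(3);
  std::vector<double> probs = {0.9, 0.5, 0.2};
  for (int i = 0; i < 100; ++i) {
    double p = mixer.Mix(probs);
    mixer.Update(1);  // the observed bit is always 1
    if (i % 25 == 0) std::printf("step %d: p(bit=1) = %.4f\n", i, p);
  }
  return 0;
}
```

In the toy run, the mixed probability starts at 0.5 (all weights are zero) and climbs toward the most accurate model's prediction as the weights adapt. Stacking layers of these mixing units, selected and gated by context, is what turns this into the kind of network cmix and paq8l use.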

On both the Calgary corpus and enwik8, there is still a significant gap between cmix and the state of the art. PAQAR and decomp8 are both specialized versions of paq8 that use dictionary preprocessing specific to the dataset (i.e. they are not very good at general-purpose compression). I think my first official benchmark submission will probably be to the Large Text Compression Benchmark, where cmix would currently rank 4th out of 183 submissions. However, I still have a number of ideas that should make cmix significantly better, so I'll hold off on submitting until I reach first place!
