Sunday, May 16, 2010

Arithmetic Coding

This week I wrote a program to perform arithmetic coding. Arithmetic coding is useful for data compression. This summer I plan to release an open source lossless data compression algorithm which uses the coder.

The program to do the arithmetic coding is only around 150 lines of Java code. For a while I had a really annoying bug caused by precision errors with floating-point numbers, and the vast majority of my development time went into trying to fix it. Today I finally figured out a great solution to the precision problem which has almost no computational overhead and is theoretically guaranteed to always be correct.
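I won't spell out my exact fix here, but the classic way to sidestep floating-point precision in an arithmetic coder is to keep the current interval as fixed-width integers and renormalize by shifting out bits as soon as they are decided. The sketch below shows that standard idea in Java; the class, the method names, and the BitSink interface are all mine and purely illustrative, not the actual code from my program.

// Minimal sketch of integer-interval arithmetic coding with renormalization.
// All names here are illustrative; this is the textbook technique, not my exact code.
class IntervalEncoder {
    private static final long TOP = 1L << 32;      // full 32-bit interval width
    private static final long HALF = TOP >>> 1;
    private static final long QUARTER = TOP >>> 2;

    private long low = 0, high = TOP - 1;          // current interval [low, high]
    private int pending = 0;                       // underflow bits waiting for a carry decision
    private final BitSink out;

    IntervalEncoder(BitSink out) { this.out = out; }

    // Narrow the interval to the symbol's slice [cumLow, cumHigh) out of `total`.
    // `total` should stay well below 2^30 so range * total cannot overflow a long.
    void encode(long cumLow, long cumHigh, long total) {
        long range = high - low + 1;
        high = low + range * cumHigh / total - 1;
        low  = low + range * cumLow  / total;
        while (true) {                             // renormalize: emit bits that are already fixed
            if (high < HALF) {
                emit(0);                           // top bit settled to 0
            } else if (low >= HALF) {
                emit(1);                           // top bit settled to 1
                low -= HALF; high -= HALF;
            } else if (low >= QUARTER && high < 3 * QUARTER) {
                pending++;                         // underflow: defer the bit until it is known
                low -= QUARTER; high -= QUARTER;
            } else {
                break;
            }
            low = 2 * low;                         // shift the interval left by one bit
            high = 2 * high + 1;
        }
    }

    // Flush enough bits to pick a value inside the final interval.
    void finish() {
        pending++;
        emit(low < QUARTER ? 0 : 1);
    }

    private void emit(int bit) {
        out.writeBit(bit);
        while (pending > 0) { out.writeBit(bit ^ 1); pending--; }
    }

    interface BitSink { void writeBit(int bit); }  // hypothetical bit-level output
}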

Wikipedia claims "every programmatic implementation of arithmetic encoding has a different compression ratio and performance." I find this statement surprising because I think I can construct a proof showing that my program encodes the input in exactly the minimal number of bits for the given probability model. If that number of bits is not divisible by 8, my program uses a few extra bits to pad the output to whole bytes on disk. As far as I know it is impossible for another arithmetic coder to achieve a better compression ratio. I tested the program on several files and it achieved a compression ratio extremely close to the theoretical cross-entropy limit.
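For anyone who wants to reproduce that kind of check: assuming a simple order-0 per-byte model, the entropy limit for a file can be computed from its symbol counts and compared against the compressed size. This little harness is my own illustration, not part of the coder itself.

import java.nio.file.Files;
import java.nio.file.Paths;

public class EntropyLimit {
    // Order-0 entropy in bits: the smallest output any coder driven by
    // these per-byte frequencies can produce.
    static double entropyBits(byte[] data) {
        long[] count = new long[256];
        for (byte b : data) count[b & 0xFF]++;
        double bits = 0;
        for (long c : count)
            if (c > 0) bits += c * -(Math.log((double) c / data.length) / Math.log(2));
        return bits;
    }

    public static void main(String[] args) throws Exception {
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        double bits = entropyBits(data);
        System.out.printf("entropy limit: %.0f bits (about %.0f bytes)%n",
                bits, Math.ceil(bits / 8));
    }
}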

In terms of performance my algorithm also has nice properties. The time and space complexity are both linear in the number of input characters (which is optimal). The code only performs basic arithmetic operations on primitive data types, so it should also be very efficient, and it works with any input alphabet size. I plan to convert the program to optimized C code and compare its performance to other arithmetic coders. Since I am already confident that its compression ratio cannot be beaten, if it also turns out to be faster than other arithmetic coders, then my algorithm could be an important contribution =)
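To make the linear-time claim concrete: with a static frequency table, each input symbol costs one constant-time interval update, so a whole file is encoded in a single pass. A hypothetical driver over the IntervalEncoder sketch above (again, the names are mine, not taken from my program) might look like this.

// Hypothetical driver: build a static byte-frequency model, then encode each
// symbol with one O(1) interval update, giving O(n) time overall.
// For very large inputs the counts would need rescaling so that
// range * total stays within 64 bits.
static void encodeAll(byte[] data, IntervalEncoder enc) {
    long[] freq = new long[256];
    for (byte b : data) freq[b & 0xFF]++;
    long[] cum = new long[257];                    // symbol s occupies [cum[s], cum[s + 1])
    for (int s = 0; s < 256; s++) cum[s + 1] = cum[s] + freq[s];
    for (byte b : data) {
        int s = b & 0xFF;
        enc.encode(cum[s], cum[s + 1], cum[256]);  // one constant-time update per symbol
    }
    enc.finish();
}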
