Monday, February 16, 2009

Data Compression and Machine Learning

There is an interesting relationship between data compression and machine learning. Any data compression algorithm can be used to perform pattern recognition. This can be done by simply comparing the file size of two independently compressed files to the file size of compressing them together. In other words, compare the size of (compression(A)+compression(B)) to compression(A+B). If the two files contain common patterns, then compression(A+B) will be smaller than (compression(A)+compression(B)). By comparing the relative file size difference using solid compression, it is possible to rank the similarity of different data sets.

In my machine learning course last semester we had an optional project to do binary classification on a dataset with thousands of features. We were given data in a training set which were classified into one of two categories (positive/negative). We needed to write an algorithm to classify test data. To compare the quality of our solution, we were given an implementation of AdaBoost with the following error rates:

false negative rate = 0.1429 (over 350 positive data)
false positive rate = 0.1429 (over 350 negative data)
sum: 0.2858

I used a neural network to achieve a lower error rate:

false negative rate = 0.14 (over 350 positive data)
false positive rate = 0.12285714285714286 (over 350 negative data)
sum: 0.2628571428571429

This morning I tried comparing some different data compression algorithms to see how well they are at binary classification. Using 7z I got the following error rates:

false negative rate = 0.31142857142857144 (over 350 positive data)
false positive rate = 0.2542857142857143 (over 350 negative data)

I also tried using gzip and paq8p. gzip was fast but had a terrible error rate. I think paq8p would have done much better than 7z, but it is extremely slow so I couldn't finish running it on the test set. I assume that in general better (slower) compression algorithms will have a lower error rate.

1 comment:

Anonymous said...

You might be interested in this paper (they have the same idea):
“Clustering by Compression” by Rudi Cilibrasi and Paul M.B. Vitanyi, IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 51, NO 4, APRIL 2005, 1523–1545 1