Sunday, November 14, 2010

State of the Art

For my master's thesis I have been looking into compression-based techniques for classification, clustering, and anomaly detection. I have implemented a classification algorithm using PAQ and evaluated it on three datasets. So far the results look very promising: it appears to achieve state-of-the-art results on all three. The first dataset (called 20news) involves categorizing newsgroup articles into one of twenty categories. The other two datasets (ling-spam and PU1) are spam-filtering datasets. For spam filtering, a tradeoff can be made between the rate of spam misclassified as ham and the rate of ham misclassified as spam, so I evaluated my algorithm on these two datasets using ROC curves. Although my algorithm wasn't the best spam classifier on every portion of the ROC curves, on both datasets there was still a significant portion in which it achieved state-of-the-art results.
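To give a rough idea of how compression-based classification can work, here is a minimal sketch. It is not my thesis implementation: it assumes zlib as a stand-in for PAQ (which is not in the Python standard library), and the names `compressed_size` and `classify` are made up for illustration. The idea is to assign a document to the class whose training text it compresses best alongside, measured as the number of extra bytes needed to compress the document after the class corpus.

```python
import zlib
from typing import Dict


def compressed_size(data: bytes) -> int:
    """Return the size in bytes of data after zlib compression (PAQ stand-in)."""
    return len(zlib.compress(data, 9))


def classify(document: str, training: Dict[str, str]) -> str:
    """Pick the label whose training corpus makes the document cheapest to compress.

    training maps each class label to the concatenated training text of that
    class. The cost of a label is the growth in compressed size when the
    document is appended to that class corpus.
    """
    if not training:
        raise ValueError("training must contain at least one class")

    doc = document.encode("utf-8")
    best_label = None
    best_cost = None
    for label, corpus in training.items():
        base = corpus.encode("utf-8")
        # Extra bytes needed to encode the document given the class corpus.
        cost = compressed_size(base + b"\n" + doc) - compressed_size(base)
        if best_cost is None or cost < best_cost:
            best_label, best_cost = label, cost
    return best_label


# Toy usage with invented training text.
training = {
    "spam": "buy cheap pills now, limited offer, click here to win money " * 5,
    "ham": "the meeting is moved to tuesday, please review the attached thesis draft " * 5,
}
print(classify("click here to win money, buy cheap pills now", training))
```

One caveat with this sketch: zlib only looks back 32 KB, so with realistically sized class corpora most of the training text would fall outside its window. That is one reason a stronger compressor with a long-range model is a better fit for this kind of classifier.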
