Wednesday, October 10, 2012

The Top-k Element Problem

I have been working on the problem of efficiently finding the most common items in a large stream of data. This problem shows up in many applications - I am working on it for a data compression project. One example application: given a stream of user queries to a search engine, find the top k queries which occur most often. Some Google employees published a fascinating algorithm for this problem called count sketch.

After reading through papers on the problem, I decided to use an algorithm called space-saving. I implemented it in C++ and have been testing it on the following task:

Problem: find the k most common substrings of length N in the Wikipedia database.
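
To make the algorithm concrete, here is a minimal sketch of the space-saving update rule, built on a plain hash map. This is a simplified illustration rather than the code I'm actually running; in particular, the original paper's stream-summary structure replaces the linear minimum scan below with an O(1) operation.

    #include <algorithm>
    #include <cstdint>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Simplified space-saving counter. offer() processes one stream item.
    class SpaceSaving {
    public:
        explicit SpaceSaving(std::size_t capacity) : capacity_(capacity) {}

        void offer(const std::string& item) {
            auto it = counters_.find(item);
            if (it != counters_.end()) {
                ++it->second.count;                  // already monitored
            } else if (counters_.size() < capacity_) {
                counters_[item] = {1, 0};            // a free counter is available
            } else {
                // Evict the minimum-count item; the newcomer inherits its
                // count plus one and records the inherited amount as an
                // error bound. The linear scan here is O(k) per eviction;
                // the paper's stream-summary structure makes it O(1).
                auto min_it = std::min_element(
                    counters_.begin(), counters_.end(),
                    [](const auto& a, const auto& b) {
                        return a.second.count < b.second.count;
                    });
                std::uint64_t min_count = min_it->second.count;
                counters_.erase(min_it);
                counters_[item] = {min_count + 1, min_count};
            }
        }

        struct Entry { std::string item; std::uint64_t count, error; };

        // All monitored items, most frequent first.
        std::vector<Entry> top() const {
            std::vector<Entry> out;
            for (const auto& [item, c] : counters_)
                out.push_back({item, c.count, c.error});
            std::sort(out.begin(), out.end(),
                      [](const Entry& a, const Entry& b) { return a.count > b.count; });
            return out;
        }

    private:
        struct Counter { std::uint64_t count, error; };
        std::size_t capacity_;
        std::unordered_map<std::string, Counter> counters_;
    };

The key guarantee is that each reported count overestimates the true count by at most the stored error bound, and any item whose true count exceeds the smallest monitored count is guaranteed to be in the table.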

I haven't fully optimized my program yet, so I estimate it would currently take 4-5 days to process the 38GiB Wikipedia file I downloaded. I do have some preliminary results on a subset of Wikipedia, though - here is the output for k=1000 and N=1...30. Each row has two tab-separated columns: the first is the substring and the second is its count.
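
For reference, the substring extraction itself is just a length-N window slid over the input, along these lines. This is again illustrative, and deliberately naive about I/O, which matters a great deal at 38GiB:

    #include <istream>

    // Feed every length-n substring of the input to the sketch. The
    // character-at-a-time loop and string shifting are intentionally
    // simple; buffered reads are essential for real throughput.
    void count_substrings(std::istream& in, std::size_t n, SpaceSaving& sketch) {
        std::string window;
        char c;
        while (in.get(c)) {
            window.push_back(c);
            if (window.size() > n)
                window.erase(window.begin());    // drop the oldest character
            if (window.size() == n)
                sketch.offer(window);
        }
    }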

Since the per-item update time of this algorithm is independent of k, I actually chose k=1 million and then took the top 1000 of the million buckets; a larger k costs only memory while tightening the error bounds on the reported counts. The heuristic I used for choosing k was to pick the largest value such that the program's memory use did not exceed the available RAM on my computer.
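
Putting the pieces together (reusing the sketch and driver above), the run described here looks roughly like this - the filename is hypothetical, and the two constants are the counter budget and the reporting cutoff:

    #include <fstream>
    #include <iostream>

    int main() {
        std::ifstream file("enwiki.xml", std::ios::binary);  // hypothetical filename
        SpaceSaving sketch(1000000);          // k = 1 million monitored substrings
        count_substrings(file, 30, sketch);   // e.g. N = 30, one pass over the data

        auto entries = sketch.top();          // sorted by count, descending
        if (entries.size() > 1000)
            entries.resize(1000);             // keep only the top 1000
        for (const auto& e : entries)
            std::cout << e.item << '\t' << e.count << '\n';  // tab-separated rows
    }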
