Peter Norvig recently posted an interesting article about the frequency of different n-grams in a massive data set. A few months ago I worked on a similar problem.
Inspired by Peter Norvig's post, I decided to release my code for the space-saving algorithm as an open source project: http://code.google.com/p/space-saving/
I optimized the program for speed and tested it on Wikipedia. The input was an XML dump of Wikipedia, 42,544,528,073 bytes (39.6 GiB), and I used k=100,000,000 (100 million internal buckets). Here are the results:
Unlike last time, the counts reported here are exact. After an initial pass with the space-saving algorithm, my program makes a second pass to compute the exact counts for the top buckets.
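For readers who haven't seen it, the space-saving update rule is simple: keep at most k counters, and when an unmonitored item arrives while all counters are in use, evict the item with the smallest count and give the newcomer that count plus one (so its count is an overestimate, bounded by the evicted minimum). The sketch below is just an illustration of that rule, not the code from the project; in particular, a real implementation finds the minimum with a dedicated structure rather than the linear scan shown here.

```cpp
#include <algorithm>
#include <iostream>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Minimal sketch of the space-saving update rule (Metwally et al.).
class SpaceSaving {
public:
    explicit SpaceSaving(std::size_t k) : k_(k) {}

    void offer(const std::string& item) {
        auto it = counts_.find(item);
        if (it != counts_.end()) {
            ++it->second;                  // already monitored: just increment
        } else if (counts_.size() < k_) {
            counts_[item] = 1;             // spare capacity: start a new counter
        } else {
            // Table is full: evict the minimum-count entry and let the new
            // item inherit its count plus one. The O(k) scan here is only for
            // clarity; a real implementation keeps the minimum accessible in
            // O(1) or O(log k).
            auto min_it = std::min_element(
                counts_.begin(), counts_.end(),
                [](const auto& a, const auto& b) { return a.second < b.second; });
            long long new_count = min_it->second + 1;
            counts_.erase(min_it);
            counts_[item] = new_count;
        }
    }

    // Monitored items, most frequent first.
    std::vector<std::pair<std::string, long long>> top() const {
        std::vector<std::pair<std::string, long long>> out(counts_.begin(), counts_.end());
        std::sort(out.begin(), out.end(),
                  [](const auto& a, const auto& b) { return a.second > b.second; });
        return out;
    }

private:
    std::size_t k_;
    std::unordered_map<std::string, long long> counts_;
};

int main() {
    SpaceSaving ss(3);
    for (const std::string& w : {"a", "b", "a", "c", "a", "d", "b", "a"}) ss.offer(w);
    for (const auto& [word, count] : ss.top())
        std::cout << word << ": " << count << "\n";
}
```

Because the approximate counts from this pass can overestimate, the second pass over the data is what turns the candidate set into exact counts.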
I ended up implementing my own hash table from scratch for this project. Since this problem has some unusual properties (such as a constant number of elements in the hash table), I figured it would be easier to optimize my own hash table than to tune something more generic. I used linear probing to resolve hash collisions.
I found a way to improve on one aspect of the standard linear probing approach. Typically, deleted entries are marked with a special label (a tombstone), so a large number of <delete,insert> swaps would eventually fill the table with useless deletion markers and force an expensive resize. My implementation doesn't use deletion labels and never needs to be resized. Instead, when an entry is removed I propagate the empty slot forward until it reaches a free slot. To keep the table consistent, this propagation has to compute the desired position of each subsequent element (to know whether it can be moved into the empty slot). This is cheap in my hash table because the keys are integers: computing the desired position just requires a mod operation.
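To make that concrete, here is a minimal, self-contained sketch of that kind of table: a fixed-capacity, linear-probing hash table for 64-bit integer keys whose erase shifts later entries back into the hole instead of leaving tombstones (essentially Knuth's Algorithm R). It is only an illustration of the idea, not the project's code; the capacity, sentinel value, and hash function are arbitrary choices made for the example.

```cpp
#include <cstdint>
#include <vector>

constexpr std::size_t CAPACITY = 1 << 20;      // fixed size, never resized
constexpr uint64_t EMPTY = UINT64_MAX;         // sentinel marking a free slot

struct Slot { uint64_t key = EMPTY; uint64_t count = 0; };

struct IntTable {
    std::vector<Slot> slots = std::vector<Slot>(CAPACITY);

    // Desired (home) position of a key: hash it, then take it mod the capacity.
    static std::size_t desired(uint64_t key) {
        return (key * 0x9E3779B97F4A7C15ULL) % CAPACITY;
    }

    // Increment the count of `key`, inserting it if absent (linear probing).
    void increment(uint64_t key) {
        std::size_t i = desired(key);
        while (slots[i].key != EMPTY && slots[i].key != key)
            i = (i + 1) % CAPACITY;
        slots[i].key = key;
        ++slots[i].count;
    }

    // Remove `key` without leaving a tombstone: empty its slot, then walk
    // forward and move back any entry that may legally occupy the hole,
    // i.e. any entry whose desired position does not lie cyclically
    // between the hole (exclusive) and its current slot (inclusive).
    void erase(uint64_t key) {
        std::size_t i = desired(key);
        while (slots[i].key != key) {
            if (slots[i].key == EMPTY) return;   // key not present
            i = (i + 1) % CAPACITY;
        }
        std::size_t hole = i;
        std::size_t j = i;
        for (;;) {
            j = (j + 1) % CAPACITY;
            if (slots[j].key == EMPTY) break;    // reached a free slot: done
            std::size_t home = desired(slots[j].key);
            bool stays = (hole < j) ? (hole < home && home <= j)
                                    : (hole < home || home <= j);
            if (!stays) {
                slots[hole] = slots[j];          // shift the entry back
                hole = j;                        // the hole moves forward
            }
        }
        slots[hole] = Slot{};                    // finally clear the hole
    }
};
```

Compared with tombstones, shifting entries back keeps probe sequences short no matter how many <delete,insert> swaps the space-saving pass performs, which is what lets the table run at a fixed size without ever being rebuilt.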