*k*queries which occur most often. Some Google employees published a fascinating algorithm for this problem called count sketch.

After reading through papers on the problem, I decided to use an algorithm called space-saving. I implemented it in C++ and have been testing it on the following task:

**Problem**: find the

*k*most common substrings of length

*N*in the Wikipedia database.

I haven't fully optimized my program yet, so I estimate it would currently take 4-5 days to process the 38GiB Wikipedia file I downloaded. I have some preliminary results on a subset of Wikipedia though - here is the output for

*k*=1000 and

*N*=1...30:

- N=1
- N=2
- N=3
- N=4
- N=5
- N=6
- N=7
- N=8
- N=9
- N=10
- N=11
- N=12
- N=13
- N=14
- N=15
- N=16
- N=17
- N=18
- N=19
- N=20
- N=21
- N=22
- N=23
- N=24
- N=25
- N=26
- N=27
- N=28
- N=29
- N=30

Since the time complexity of this algorithm is completely independent of

*k*, I actually chose

*k*=1 million and then took the top 1000 from the million buckets. The heuristic I used for choosing

*k*was to use the largest value possible such that the memory used by the program did not exceed the available RAM on my computer.

## No comments:

Post a Comment