After reading through papers on the problem, I decided to use an algorithm called space-saving. I implemented it in C++ and have been testing it on the following task:
Problem: find the k most common substrings of length N in the Wikipedia database.
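Space-saving keeps at most k monitored counters: a seen item is incremented, a new item takes a free counter if one exists, and otherwise it evicts the current minimum and inherits that minimum count plus one. Below is a minimal C++ sketch of that update rule applied to fixed-length substrings. The class and function names are mine, not my actual program, and a real implementation would use the stream-summary structure from the space-saving paper so eviction is O(1) instead of the linear minimum scan shown here.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

class SpaceSaving {
public:
    explicit SpaceSaving(std::size_t k) : k_(k) {}

    void offer(const std::string& item) {
        auto it = counters_.find(item);
        if (it != counters_.end()) {        // already monitored: just increment
            it->second.count++;
            return;
        }
        if (counters_.size() < k_) {        // a free counter is still available
            counters_[item] = Counter{1, 0};
            return;
        }
        // Evict the item with the smallest count; the newcomer reuses its
        // counter with estimate min_count + 1 and over-count error min_count.
        auto victim = std::min_element(
            counters_.begin(), counters_.end(),
            [](const auto& a, const auto& b) { return a.second.count < b.second.count; });
        const std::uint64_t min_count = victim->second.count;
        counters_.erase(victim);
        counters_[item] = Counter{min_count + 1, min_count};
    }

    // Monitored items sorted by estimated count, descending.
    std::vector<std::pair<std::string, std::uint64_t>> top() const {
        std::vector<std::pair<std::string, std::uint64_t>> out;
        out.reserve(counters_.size());
        for (const auto& [item, c] : counters_)
            out.emplace_back(item, c.count);
        std::sort(out.begin(), out.end(),
                  [](const auto& a, const auto& b) { return a.second > b.second; });
        return out;
    }

private:
    struct Counter { std::uint64_t count; std::uint64_t error; };

    std::size_t k_;
    std::unordered_map<std::string, Counter> counters_;
};

// Feed every length-N substring of a chunk of text into the sketch
// with a sliding window.
void offer_substrings(SpaceSaving& ss, const std::string& text, std::size_t n) {
    for (std::size_t i = 0; n > 0 && i + n <= text.size(); ++i)
        ss.offer(text.substr(i, n));
}
```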
I haven't fully optimized my program yet, so I estimate it would currently take 4-5 days to process the 38 GiB Wikipedia file I downloaded. I do have some preliminary results on a subset of Wikipedia, though. Here is the output for k=1000 and N=1...30:
[Top-1000 substring lists for each N from 1 to 30.]
Since the time complexity of this algorithm is completely independent of k, I actually chose k=1 million and then took the top 1000 from the million buckets. My heuristic for choosing k was simply to pick the largest value such that the program's memory usage stayed within the available RAM on my computer.
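For that last step, one reasonable way to pull the top 1000 out of the million counters is a partial sort, sketched below. This is only an illustration of the post-processing under my own naming (reusing the pair layout from the sketch above), not my actual code.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Keep only the n largest counters (by estimated count) from the full set.
std::vector<std::pair<std::string, std::uint64_t>>
top_n(std::vector<std::pair<std::string, std::uint64_t>> counts, std::size_t n) {
    n = std::min(n, counts.size());
    // partial_sort orders just the first n elements, which is cheaper than
    // fully sorting a million entries when only the top 1000 are wanted.
    std::partial_sort(counts.begin(), counts.begin() + n, counts.end(),
                      [](const auto& a, const auto& b) { return a.second > b.second; });
    counts.resize(n);
    return counts;
}
```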