String matching is an important problem in several fields of computer science such as bioinformatics, data compression, and search engines. I find it amazing that search engines can find exact string matches in massive datasets in essentially constant time. Having the ability to do this would be very useful for when I create data compression algorithms!
Last night I downloaded the entire English Wikipedia, which turned out to be a 31.6GiB XML file. I spent the day trying to solve the following problem: how to quickly find strings in this massive file. The hardest constraint in solving this problem is the fact that my computer only has about 3GiB of memory. Somehow I needed to preprocess the 32GiB file into 3GiB of memory and then *only* use the 3GiB of memory to find the location of arbitrary strings in constant time. Solving this problem exactly is impossible, so the goal of the project is to maximize the likelihood that the match I find is correct.
I tried doing some research online and was surprised that I couldn't find many helpful resources. This is surprising since this seems like such an important problem. I will describe the algorithm I ended up implementing:
I stored all of the data in two arrays. The first array stored positions in the XML file, type = long, dimension = 20000000x4. The second array stored counts, type = int, dimension = 20000000. The first step was to go through the 32GiB XML file and extract every substring of length 25. These substrings are keys that I used to store in the arrays. I computed a hash for each key to use as an array index. The second array was used to keep track of the number of keys put in each array position, so it is simply incremented for each key. The first array is used to store the position of up to four substrings. I stored the position of a particular key with a probability that depends on the number of colliding keys (the count stored in the second array). More collisions = less likely to store the position.
OK, so the algorithm so far preprocessed the 32GiB file into about 700MiB of memory. Now comes the fun part of using the data to do string matching. Given a string to search for, I first extract every substring of length 25. I hash each substring and collect the locations that are stored at those positions in the first array. Only a small fraction of these matches are likely to actually correspond to the string we are searching for. The essential piece of information that helps find the true match is the fact that the offset of the substrings in our search query should match the corresponding offset of matches returned from the data. It is possible to find the best match in O(n) time (where n is the length of the search query).
To sum it up: I preprocess the data in O(m) time (where m is the length of the data). I can then do string matching in O(n) time (where n is the length of the search query).
So how well did this work on the dataset? It took something like 1 hour to preprocess the XML file (basically the amount of time required to read the entire file from disk). The search queries run remarkably fast - no noticeable delay even for queries that are hundreds of thousands of characters long. The longer the search query, the more likely the algorithm is to return the correct match location. For queries of length 50, it seems to match correctly about half of the time. For queries over length 100, it is almost always correct. Success!
Interesting fact: my algorithm is somewhat similar to how Shazam works. Shazam is a program that can look up the name of a song given a short recorded sample of it. The longer the recording, the more likely the correct match is returned.