Saturday, October 20, 2012

laservex

I have started working on a new project: www.laservex.com. This will be a remake of a Java game I made a few years ago: http://byronknoll.com/lasers.html. This time I will implement it in HTML5 and support user-generated levels.

Wednesday, October 10, 2012

The Top-k Element Problem

I have been working on the problem of efficiently finding the most common items in a large stream of data. This problem shows up in many applications - I am working on it for a data compression project. One example application: given a stream of user queries to a search engine, find the top k queries which occur most often. Some Google employees published a fascinating algorithm for this problem called count sketch.

After reading through papers on the problem, I decided to use an algorithm called space-saving. I implemented it in C++ and have been testing it on the following task:

Problem: find the k most common substrings of length N in the Wikipedia database.

I haven't fully optimized my program yet, so I estimate it would currently take 4-5 days to process the 38GiB Wikipedia file I downloaded. I have some preliminary results on a subset of Wikipedia though - here is the output for k=1000 and N=1...30: Each row has two columns (tab separated). The first column is the substring and the second column is the count.

Since the time complexity of this algorithm is completely independent of k, I actually chose k=1 million and then took the top 1000 from the million buckets. The heuristic I used for choosing k was to use the largest value possible such that the memory used by the program did not exceed the available RAM on my computer.

Tuesday, October 02, 2012

Five Player Go First Dice

In a previous post I proposed a challenge to come up with a set of go first dice for five players. I thought my prize would be safe since this seems like a difficult unsolved problem. Paul Vanetti has successfully won the challenge with the following solution:
  • 0, 1, 2, 3, 4, 5, 6, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 183, 184, 185, 186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203, 204, 205, 206, 207, 208, 209, 210, 211, 212, 213, 214, 259, 260, 261, 262, 263, 264, 265
  • 7, 20, 21, 22, 35, 36, 37, 50, 83, 96, 97, 98, 111, 112, 113, 126, 139, 152, 153, 154, 167, 168, 169, 182, 215, 228, 229, 230, 243, 244, 245, 258
  • 8, 12, 13, 14, 15, 19, 23, 27, 28, 29, 30, 34, 38, 42, 43, 44, 45, 49, 84, 88, 89, 90, 91, 95, 99, 103, 104, 105, 106, 110, 114, 118, 119, 120, 121, 125, 140, 144, 145, 146, 147, 151, 155, 159, 160, 161, 162, 166, 170, 174, 175, 176, 177, 181, 216, 220, 221, 222, 223, 227, 231, 235, 236, 237, 238, 242, 246, 250, 251, 252, 253, 257
  • 9, 11, 16, 18, 24, 26, 31, 33, 39, 41, 46, 48, 85, 87, 92, 94, 100, 102, 107, 109, 115, 117, 122, 124, 141, 143, 148, 150, 156, 158, 163, 165, 171, 173, 178, 180, 217, 219, 224, 226, 232, 234, 239, 241, 247, 249, 254, 256
  • 10, 17, 25, 32, 40, 47, 86, 93, 101, 108, 116, 123, 142, 149, 157, 164, 172, 179, 218, 225, 233, 240, 248, 255
These dice have the following number of faces:
  • 90
  • 32
  • 72
  • 48
  • 24
When I described the problem in my original blog post I did not specify any constraint specifying that the dice need to have the same number of faces. This constraint is unnecessary because:

Conjecture: Given any valid set of N go first dice with number of faces x1, x2, ..., xN, it is always possible to generate a new set of go first dice with exactly LCM(x1, x2, ..., xN) faces per die.

This can be done using the following algorithm:

Suppose you have go first dice with the following labels:
  • a1, a2, ..., aQ
  • b1, b2, ..., bR
  • ...
  • y1, y2, ..., yZ
Let m = LCM(Q, R, ..., Z). The following should be a set of go first dice with m faces:
  • m*a1, m*a1 + 1, ..., m*a1 + (m/Q) - 1, m*a2, m*a2 + 1, ..., m*a2 + (m/Q) - 1, m*aQ, m*aQ + 1, ..., m*aQ + (m/Q) - 1
  • m*b1, m*b1 + 1, ..., m*b1 + (m/R) - 1, m*b2, m*b2 + 1, ..., m*b2 + (m/R) - 1, m*bR, m*bR + 1, ..., m*bR + (m/R) - 1
  • ...
  • m*y1, m*y1 + 1, ..., m*y1 + (m/Z) - 1, m*y2, m*y2 + 1, ..., m*y2 + (m/Z) - 1, m*yZ, m*yZ + 1, ..., m*yZ + (m/Z) - 1
Example of how to apply the algorithm to a simple set of dice (which are not go first dice):
  • 1,2,4
  • 3,5
LCM(3,2) = 6, so the merged dice are:
  • 6*1, 6*1 + 1, 6*2, 6*2 + 1, 6*4, 6*4 + 1
  • 6*3, 6*3 + 1, 6*3 + 2, 6*5, 6*5 + 1, 6*5 + 2
Which becomes:
  • 6, 7, 12, 13, 24, 25
  • 18, 19, 20, 30, 31, 32
Empty regions can be removed to form:
  • 1, 2, 3, 4, 8, 9
  • 5, 6, 7, 10, 11, 12
Since LCM(90, 32, 72, 48, 24) = 1440, Paul's solution can be used to generate a set of 5 go first dice with 1440 faces. Paul says he found a technique to use a solution for N-1 to generate N go first dice (I do not know how this is done). His technique can be used to quickly generate solutions for any N. The solutions generated by his algorithm are not optimal in terms of minimizing the number of faces on the dice.