Sunday, March 25, 2012

A New Kind of CAPTCHA

My goal for today was to create my own CAPTCHA system. I find CAPTCHAs fascinating because of their goal of distinguishing human intelligence from machines. Surprisingly, most major CAPTCHA systems (currently used by companies such as Google, Microsoft, Facebook, etc.) are routinely solved by computers with over 15% success rates (at least according to the brief research I did on the topic today).

Most CAPTCHA systems provide an image of a sequence of characters and ask the user to type in the text. Software that attempts to solve these typically first segments the image into characters and then classifies each character individually. Character classification is a solved problem - so the only security behind current CAPTCHAs relies on the fact that character segmentation is difficult. In my attempt at creating a CAPTCHA system I will try to make both the segmentation and classification tasks as difficult as possible.

One reason English letter classification is so easy is that there is a massive amount of training data available for each letter. In my CAPTCHA system, I will create my own symbols and provide only *one* training example for each symbol class. My hypothesis is that humans are better than computers at extrapolating from tiny training sets. Another benefit of creating new symbols from scratch is that it makes the system internationalizable (i.e. not dependent on the character set of a particular language).

Here is an example output from my CAPTCHA program:

The answer in this case is "1428736". First, I create 10 symbol classes and present one example from each class to the user. I then generate 7 symbols from random classes (no repetition) and arrange them from left to right (allowing overlaps). The symbols are displayed in random orientation. The user has to enter the symbol classes in the right order. There is a 1 in 604,800 chance (10 x 9 x 8 x 7 x 6 x 5 x 4 orderings) of guessing correctly at random, or a 1 in 10,000,000 chance (10^7) if symbol repetition is allowed.
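The guessing odds above are easy to check with a few lines of Python. This is only a sketch for illustration, not the code behind the CAPTCHA program: the names `guess_space` and `random_answer` are made up here, and the only things taken from the post are the 10 classes, the 7 symbols, and the no-repetition rule.

```python
import math
import random

NUM_CLASSES = 10    # symbol classes shown to the user (from the post)
ANSWER_LENGTH = 7   # symbols in the main CAPTCHA (from the post)


def guess_space(num_classes, length, allow_repeats):
    """Number of equally likely answers a blind guesser must choose from."""
    if allow_repeats:
        return num_classes ** length
    # Ordered selections without repetition: n! / (n - k)!
    return math.perm(num_classes, length)


def random_answer(rng, num_classes=NUM_CLASSES, length=ANSWER_LENGTH):
    """Pick distinct classes in a random order (hypothetical helper)."""
    classes = rng.sample(range(num_classes), length)
    return "".join(str(c) for c in classes)


print(guess_space(NUM_CLASSES, ANSWER_LENGTH, allow_repeats=False))  # 604800
print(guess_space(NUM_CLASSES, ANSWER_LENGTH, allow_repeats=True))   # 10000000
print(random_answer(random.Random()))  # a 7-digit string with no repeated digit
```

Note that `math.perm` requires Python 3.8 or later.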

The symbol from a particular class will look different between the training example and the main CAPTCHA. Each class is defined by a small set of random parameters. These parameters determine features such as branching factor, curviness, length, etc. Once I have generated the random class parameters, each symbol is also created via a stochastic process - but its general appearance will remain visually similar to the other symbols in its class due to the class parameters.
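To make the two levels of randomness concrete, here is a rough sketch of one way such a generator could be structured. It is an assumption-heavy illustration and not the actual algorithm: the `SymbolClass` fields, the parameter ranges, and the random-walk drawing in `draw_symbol` are all invented for this example. The only ideas taken from the description above are that class parameters are sampled once, and that each symbol is then sampled again using those parameters.

```python
import math
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SymbolClass:
    """Hypothetical class parameters; names and ranges are assumptions."""
    branching: float  # chance per step that a stroke spawns a side branch
    curviness: float  # maximum heading change per step, in radians
    length: int       # number of steps in the main stroke


def random_class(rng):
    """Sample the parameters that define one symbol class."""
    return SymbolClass(
        branching=rng.uniform(0.0, 0.3),
        curviness=rng.uniform(0.05, 0.8),
        length=rng.randint(10, 30),
    )


def draw_symbol(cls, rng, max_strokes=8):
    """Sample one symbol of the class as a list of polylines (strokes)."""
    strokes = []
    # Each pending stroke is (x, y, heading, steps). The random initial
    # heading gives the symbol a random orientation.
    pending = [(0.0, 0.0, rng.uniform(0.0, 2.0 * math.pi), cls.length)]
    while pending and len(strokes) < max_strokes:
        x, y, heading, steps = pending.pop()
        points = [(x, y)]
        for i in range(steps):
            heading += rng.uniform(-cls.curviness, cls.curviness)
            x += math.cos(heading)
            y += math.sin(heading)
            points.append((x, y))
            if rng.random() < cls.branching:
                # Branch off at 60 degrees with a shorter stroke.
                turn = rng.choice((-1, 1)) * math.pi / 3
                pending.append((x, y, heading + turn, max(1, (steps - i) // 2)))
        strokes.append(points)
    return strokes


rng = random.Random()
cls = random_class(rng)
training_example = draw_symbol(cls, rng)  # shown next to the class label
captcha_symbol = draw_symbol(cls, rng)    # same class, different sample
```

Both symbols share the class parameters, so they should have a similar overall character, but their exact point coordinates differ because each call draws fresh random numbers.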

Well, I suspect that this CAPTCHA system will not work well in practice because it is too tedious for humans to solve. On first impression, though, I think this task is more difficult to solve with a computer than typical CAPTCHA systems. It would be easy to evaluate the time/precision of humans solving these CAPTCHAs with a user study. However, evaluating the performance of software solvers is more difficult. One idea I had for evaluating software solvers is to host a contest (say on TopCoder or Kaggle) and offer a big cash prize as incentive to make competitive AIs (I won't actually do this :P).

Here are a few more random CAPTCHAs from the program. I have posted the answers at the end of this post - how many can you solve without looking at the answers?

The answers are: 0718936, 9467251, 0197283, and 7634290 respectively.

2 comments:

Sandy said...

I spent a couple minutes but I still made 3 mistakes.

I'd prefer these to the usual method of making text almost too crappy to read

Jordan said...

Some of them are difficult, but I'd agree that some of the current CAPTCHAs can be tough too.

It also lends some insight into what types of patterns humans are good at segmenting and classifying. For example, different people have widely varying ability to identify patterns that have been rotated, mirrored, etc.