Down the Rabbit Hole....

by Nancy Sackheim

Fiction writers are an endlessly curious lot. Many of us are also chronic procrastinators. Combine the two, and down the search engine rabbit hole we go, coming up hours later blinking and mumbling, "Huh, really interesting stuff." Useful "research" for that book we're supposed to be writing? Maybe. Nah, but wow, really interesting stuff. Which leads me to my latest rabbit hole: the reCAPTCHA Project.

As Carnegie Mellon University's CyLab tells it, back in 2000 CMU computer science graduate student Luis von Ahn, along with his advisor Manuel Blum, created a new cyber security tool called CAPTCHA. A CAPTCHA (short for Completely Automated Public Turing test to tell Computers and Humans Apart) is a program designed to secure web sites from automated bot attacks and spammers by generating a test that humans can pass but computers cannot.

The original iteration of CAPTCHA generated an image composed of several randomly selected and distorted characters. To gain access to a protected site, users had to prove that they were human and not a computer by correctly deciphering and retyping the characters. Because computers cannot process distorted images and text as well as humans can, CAPTCHAs immediately proved effective at thwarting most automated attacks. The technology quickly caught on and today is used to secure tens of thousands of web sites, although persistent hackers and spammers have made it less effective over time.

Enter Dr. von Ahn's 2006 iteration of CAPTCHA technology, the CyLab-funded reCAPTCHA Project.  The reCAPTCHA Project lies at the intersection of cyber security, technology, and the ongoing effort by several commercial and non-profit organizations to digitize printed information.  In the process of scanning and digitizing books and other print materials, approximately 20% of words are found to be inherently distorted and unreadable by Optical Character Recognition (OCR) programs.  reCAPTCHA takes advantage of this glitch in the digitization process. Each reCAPTCHA image is constructed using one of these inherently distorted words as a starting point.  The image is then further distorted by adding random warping and lines.  Finally, reCAPTCHA pairs this image with another distorted word image also taken from printed material.  The end result is a CAPTCHA that, while still readable and decipherable by humans, is more complex and more difficult for computers to read.

Beyond its obvious use for foiling bot attacks and would-be spammers, the reCAPTCHA Project has another, more altruistic purpose. Several years after introducing the world to CAPTCHA technology, von Ahn realized that, although a single CAPTCHA takes just a few seconds to type, humans were collectively spending hundreds of thousands of hours each day typing in more than 100 million of them. reCAPTCHA technology was developed not merely with an eye toward improving cyber security, but also as a way to harness and reuse the collective human time and mental energy spent solving and typing CAPTCHAs, a concept von Ahn has dubbed “human computation.”

Because reCAPTCHAs are built from words tagged as unreadable during the digitizing of books and other printed material, millions and millions of cyber users play a part every day in the digitization and preservation of human knowledge simply by transcribing words. Tests have shown that reCAPTCHA textual images are deciphered and transcribed with 99.1% accuracy, a rate comparable to the best human professional transcription services. In just the first year after reCAPTCHA launched, humans correctly deciphered and transcribed more than 440 million words, roughly the equivalent of 17,600 books.

In September 2009 Google acquired reCAPTCHA. By 2011 reCAPTCHA had finished digitizing the archives of The New York Times. The archive can be searched from the New York Times Article Archive, where more than 13 million articles in total have been archived, dating from 1851 to the present. Today, reCAPTCHA continues to digitize books that are too illegible to be scanned by computers, as well as translate books to different languages. And you are a part of the project every time you transcribe the correct letters into that little CAPTCHA box on Facebook, TicketMaster, Twitter, StumbleUpon, or Craigslist.

What's your latest rabbit hole?
