Turing Digitalization

Some 60 million CAPTCHAs are solved daily, according to Luis von Ahn (on Wired Science on PBS). His reCAPTCHA project will take the words that OCR software could not read while digitizing books and use them as the challenges, so the people solving CAPTCHAs end up transcribing those words in a quasi-automated sort of way.

I wonder, though. Even if reCAPTCHA a) becomes the default at major sites like Yahoo or Google and b) is solved 100% right every time, how many books would be completed per day? Certainly no one really comments on this blog, so it's almost why bother. (hint, hint)
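To put a rough number on that question, here is a back-of-envelope sketch. Only the 60 million figure comes from above; the book length, the share of words OCR fails on, and the one-answer-per-word assumption are guesses picked purely for illustration, so treat the result as an order of magnitude at best.

```python
# Back-of-envelope estimate. Every constant except the daily CAPTCHA count
# is an assumption made for illustration, not a measured value.
CAPTCHAS_PER_DAY = 60_000_000  # figure quoted by Luis von Ahn
WORDS_PER_BOOK = 100_000       # assumption: length of a typical book
OCR_FAILURE_RATE = 0.10        # assumption: share of words OCR cannot read
SOLVES_PER_WORD = 1            # assumption: one perfect answer settles a word


def books_per_day(captchas_per_day, words_per_book, ocr_failure_rate,
                  solves_per_word=1):
    """Estimate how many books' worth of unknown words get resolved daily."""
    unknown_words_per_book = words_per_book * ocr_failure_rate
    words_resolved = captchas_per_day / solves_per_word
    return words_resolved / unknown_words_per_book


estimate = books_per_day(CAPTCHAS_PER_DAY, WORDS_PER_BOOK,
                         OCR_FAILURE_RATE, SOLVES_PER_WORD)
print(f"{estimate:,.0f} books per day")
```

Under these guesses that works out to roughly 6,000 books a day. If each word needed, say, three independent answers before it counted, the figure would drop to about 2,000.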


UPDATE: Trying to clarify. reCAPTCHA integrates two technologies.

Optical Character Recognition always has questionable results. The worse the quality of the text (age or damage), the less capable the software. It takes a human on average about 10 seconds to recognize and provide the correct spelling of a piece of unknown text.

CAPTCHAs are the little pictures used to verify you are a human and not a spammer at various web sites. The hard part is coming up with distorted letters that OCR software cannot easily recognize.

Luis’ reCAPTCHA idea is that if OCR software has trouble with a piece of text from these scanned books, then that text would make an excellent candidate for confusing the spammer bots trying to defeat CAPTCHAs. At the same time, humans validate the correctness of the unknown words where the OCR was confused.
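One way to picture the "humans validate" half is a simple agreement check: show the same unknown word to several people and accept a spelling only once enough of them agree. The sketch below is my own illustration, not a description of how reCAPTCHA actually decides; the function name `accept_word` and the thresholds `min_votes` and `min_share` are hypothetical.

```python
from collections import Counter


def accept_word(answers, min_votes=3, min_share=0.75):
    """Return the agreed spelling, or None if the answers have not converged.

    Hypothetical rule: need at least `min_votes` non-empty answers, and the
    most common one must make up at least `min_share` of them.
    """
    # Normalize so "Morning " and "morning" count as the same answer.
    # Note this discards capitalization, which a real transcription may need.
    normalized = [a.strip().lower() for a in answers if a.strip()]
    if len(normalized) < min_votes:
        return None
    word, count = Counter(normalized).most_common(1)[0]
    return word if count / len(normalized) >= min_share else None


print(accept_word(["morning", "Morning ", "morning"]))
print(accept_word(["morning", "mourning"]))
```

The first call has three matching answers and returns the agreed word; the second has too few answers to decide and returns None, so the word would simply go back into the pool of challenges.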



Steel, originally uploaded by Ezra F.

…aka John Henry Irons

My Christmas present from Brian

If you ever get a chance to pick up The Death of Superman (there is a graphic novel / aggregation of the comic issues involved), it's a pretty good story arc. It happens I was collecting at the time of its original issue, though I guess those are at my dad’s house.