Turing Digitalization

Some 60 million CAPTCHAs are solved daily, according to Luis von Ahn (on Wired Science on PBS). His reCAPTCHA project will take the words that OCR software could not read while digitizing books and use them as the challenges, so the people solving CAPTCHAs end up deciphering those words in a quasi-automated sort of way.

I wonder, though. Even if reCAPTCHA a) becomes the default at major sites like Yahoo or Google and b) is solved 100% right every time, how many books would be completed per day? Certainly no one really comments on this blog, so it's almost a case of why bother. (hint, hint)
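For a rough feel of the answer, here is a back-of-the-envelope sketch. Only the 60 million figure comes from von Ahn; the number of people who have to agree on a word and the number of OCR-unreadable words per book are guesses of mine, so treat the result as an illustration and not an estimate.

```python
# Back-of-the-envelope: books finished per day if every CAPTCHA were a reCAPTCHA.
CAPTCHAS_PER_DAY = 60_000_000    # figure quoted from Luis von Ahn

# Assumptions (not from the source, chosen only to make the sum concrete):
VOTES_PER_WORD = 3               # people who must agree before a word counts as solved
UNKNOWN_WORDS_PER_BOOK = 20_000  # words per book that the OCR could not read


def books_per_day(captchas_per_day, votes_per_word, unknown_words_per_book):
    """Return how many books' worth of unknown words get resolved in a day."""
    if votes_per_word <= 0 or unknown_words_per_book <= 0:
        raise ValueError("votes_per_word and unknown_words_per_book must be positive")
    words_resolved = captchas_per_day // votes_per_word
    return words_resolved / unknown_words_per_book


estimate = books_per_day(CAPTCHAS_PER_DAY, VOTES_PER_WORD, UNKNOWN_WORDS_PER_BOOK)
print(f"{estimate:,.0f} books per day")
```

With these made-up inputs it comes out to 1,000 books a day. Change either guess and the answer moves in proportion, which is really the point: the result depends almost entirely on how bad the OCR is and how many confirmations each word needs.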


UPDATE: Trying to clarify. reCAPTCHA integrates two technologies.

Optical Character Recognition always has questionable results. The worse the quality of the text (age or damage), the less capable the software. It takes a human on average about 10 seconds to recognize and provide the correct spelling of a piece of unknown text.

CAPTCHAs are the little pictures used to verify you are a human and not a spammer at various web sites. The hard part is coming up with distorted letters that OCR software cannot easily recognize.

Luis’ reCAPTCHA idea is that if OCR software has trouble with a piece of text from these scanned books, then that text would make an excellent candidate for confusing the spammer bots trying to defeat CAPTCHAs. At the same time, humans validate the correctness of the unknown words where the OCR was confused.
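To make the two-jobs-at-once idea concrete, here is a minimal sketch of how the validation could work. It assumes each challenge pairs a control word (answer already known) with one unknown word from a scan, and that an unknown word counts as solved once enough people type the same thing. The names `UnknownWord`, `check_challenge`, and `VOTES_NEEDED` are hypothetical and are not taken from reCAPTCHA itself.

```python
from collections import Counter

VOTES_NEEDED = 3  # assumption: matching answers required before a word is accepted


class UnknownWord:
    """A scanned word the OCR could not read, plus the human guesses so far."""

    def __init__(self, image_id):
        self.image_id = image_id
        self.votes = Counter()

    def add_vote(self, text):
        # Normalize so "Upon " and "upon" count as the same answer.
        self.votes[text.strip().lower()] += 1

    def accepted_text(self):
        """Return the agreed spelling, or None if there is no consensus yet."""
        if not self.votes:
            return None
        text, count = self.votes.most_common(1)[0]
        return text if count >= VOTES_NEEDED else None


def check_challenge(control_answer, typed_control, unknown, typed_unknown):
    """Pass the user if the control word is right; only then record their guess."""
    if typed_control.strip().lower() != control_answer.lower():
        return False  # likely a bot or a typo: ignore the guess for the unknown word
    unknown.add_vote(typed_unknown)
    return True
```

The control word is what keeps the bots out, and the unknown word is the free labor: a guess is only trusted when it comes from someone who just proved they could read the word whose answer was already known.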



2 responses to “Turing Digitalization”

  1. LauraG

    In the case of this post…it may be that no one really understands what you’re trying to say. ; D
    Unknown words to solve unknown words? I thought optical character recognition was trying to recognize real words and display them as such… I’m missing something in a big way.
