- Samuel's Blog - https://samuelgordonstewart.com -

Counter-intuitive

Google are very pleased with themselves at the moment as they recently purchased reCAPTCHA [1], one of the many organisations behind those squiggly sets of letters and numbers which attempt to make you prove that you are human and not an evil spamming robot.

CAPTCHAs, in order to work as intended, rely on the fact that computers have a hard time reading the squiggly text, but Google and reCAPTCHA seem to want to make it easier for computers to recognise the squiggly text.

Since computers have trouble reading squiggly words like these, CAPTCHAs are designed to allow humans in but prevent malicious programs from scalping tickets or obtain millions of email accounts for spamming. But there’s a twist — the words in many of the CAPTCHAs provided by reCAPTCHA come from scanned archival newspapers and old books. Computers find it hard to recognize these words because the ink and paper have degraded over time, but by typing them in as a CAPTCHA, crowds teach computers to read the scanned text.

In this way, reCAPTCHA’s unique technology improves the process that converts scanned images into plain text, known as Optical Character Recognition (OCR). This technology also powers large scale text scanning projects like Google Books and Google News Archive Search.

In this way, reCAPTCHA’s unique technology also improves computers’ ability to read CAPTCHAs, thereby defeating the whole process…although it could already be defeated. If the letters are coming from ancient scanned newspapers, and reCAPTCHA is relying on you, the human, to teach it what the letters are, does that not mean that reCAPTCHA has no idea what the letters are in the first place, and will let you in regardless of what input you provide?

Presumably reCAPTCHA is providing a combination of characters that it does and does not know in each CAPTCHA, requiring you to enter the known characters correctly and hopefully the others correctly as well…but this still defeats the purpose of the CAPTCHA: once computers have been taught how to read the squiggly text, CAPTCHAs would have to grow in complexity over time in order to stay ahead of the reading ability of computers.
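
To make that presumption concrete, here is a rough sketch in Python of how such a scheme might work. It is only my guess at the idea, not reCAPTCHA's actual code: the names check_response and consensus, the pairing of one known "control" word with one unknown word, and the three-vote threshold are all made up for illustration.

```python
from collections import Counter


def check_response(control_answer, control_guess, unknown_guess, votes):
    """Hypothetical check: let the user in only if the known word is right.

    `votes` is a Counter collecting guesses for the unknown word.
    """
    passed = control_guess.strip().lower() == control_answer.lower()
    if passed:
        # Only trust the guess for the unknown word when the known word matched.
        votes[unknown_guess.strip().lower()] += 1
    return passed


def consensus(votes, min_votes=3):
    """Return the most common guess once it has enough votes, else None."""
    if not votes:
        return None
    word, count = votes.most_common(1)[0]
    return word if count >= min_votes else None
```

In this sketch the answer for the unknown word never affects whether you get in, which is exactly the hole described above: only the known word is really being checked.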

And surely it is just a matter of time, if it hasn’t already happened, until malware starts taking note of what you enter for a given CAPTCHA, so that the bad guys have a CAPTCHA-based Optical Character Recognition database of their own.

Samuel