Thursday, August 07, 2008
Arms Races

(Cross-posted to Cliopatria and Digital History Hacks)

Like many people who blog at Blogger, I was recently notified by e-mail that my blog had been identified by their automated classifiers "as a potential spam blog." In order to prove that this was not the case, I had to log in to one of their servers and request that my blog be reviewed by a human being. The e-mail went on to say "Automatic spam detection is inherently fuzzy, and occasionally a blog like yours is flagged incorrectly. We sincerely apologize for this error." The author of the e-mail knew, of course, that if my blog were sending spam then his or her e-mail would fall on deaf ears (as it were)... you don't have to worry about bots' feelings. The politeness was intended for me, a hapless human caught in the crossfire in the war of intelligent machines.

That same week, a lot of my e-mails were also getting bounced. Since I have my blog address in my .sig file, I'm guessing that may have something to do with it. Alternatively, my e-mail address may have been temporarily blocked as the result of a surge in spam being sent from Gmail servers. This to-and-fro, attack against counter-attack, Spy vs. Spy kind of thing can be irritating for the collaterally damaged, but it is good news for digital historians, as paradoxical as that may seem.

One of the side effects of the war on spam has been a lot of sophisticated research on automated classifiers that use Bayesian or other techniques to categorize natural language documents. Historians can use these algorithms to make their own online archival research much more productive, as I argued in a series of posts this summer.
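
To make the idea concrete, here is a minimal sketch of a naive Bayes text classifier of the kind used in spam filtering. It is not the code from the series of posts mentioned above. The function names (tokenize, train, classify), the labels "relevant" and "irrelevant", and the four training snippets are all invented for illustration, and a real archive would need far more training data and better tokenizing.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and keep only purely alphabetic words."""
    return [w for w in text.lower().split() if w.isalpha()]

def train(labeled_docs):
    """Count documents per label and words per label.

    labeled_docs is an iterable of (label, text) pairs.
    """
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, text in labeled_docs:
        doc_counts[label] += 1
        for word in tokenize(text):
            word_counts[label][word] += 1
            vocab.add(word)
    return doc_counts, word_counts, vocab

def classify(model, text):
    """Return the label with the highest log posterior probability."""
    doc_counts, word_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        # Log prior: how common this label is in the training set.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokenize(text):
            if word not in vocab:
                continue  # ignore words never seen in training
            # Add-one (Laplace) smoothing avoids taking log(0).
            count = word_counts[label][word] + 1
            score += math.log(count / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training data, made up for this example.
TRAINING = [
    ("relevant", "fur trade ledger from the hudson bay company archive"),
    ("relevant", "diary of a fur trader describing a canoe route"),
    ("irrelevant", "cheap watches and discount pills buy now"),
    ("irrelevant", "buy cheap fur coats online discount sale"),
]

model = train(TRAINING)
print(classify(model, "archive ledger of a trader"))
print(classify(model, "buy discount pills online"))
```

The same machinery that separates spam from legitimate mail can, given examples of sources a historian does and does not care about, sort search results or archive pages into those two piles.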

In fact, a closely related arms race is being fought at another level, one that also has important implications for the digital humanities. The optical character recognition (OCR) software that is used to digitize paper books and documents is also being used by spammers to try to circumvent software intended to block them. This, in turn, is having a positive effect on the development of OCR algorithms, and is leading to higher quality digital repositories as a collateral benefit. Here's how.

Tags: machine learning | optical character recognition (OCR) | Turing test
Posted by William J. Turkel at 10:15 AM
