History 513 2007-08 17. Machine Learning and Data Mining

6 Feb 2008

Witten and Frank define data mining as “the extraction of implicit, previously unknown, and potentially useful information from data,” and as the process of “finding and describing patterns in data.” Machine learning, a subdiscipline of computer science, goes one step further by attempting to use these patterns to classify previously unseen data. Historians are now beginning to use both kinds of techniques in the research process.

Readings for Discussion

Cohen, Daniel J. “[WWW]Mapping What Americans Did on September 11,” dancohen.org (8 Aug 2006).

Cohen, Daniel J. “[WWW]Intelligence Analysts and Humanities Scholars,” dancohen.org (13 Nov 2006).

Cohen, Daniel J. “[WWW]History and the Second Decade of the Web,” Rethinking History (Jun 2004).

Cohen, Daniel J. “[WWW]From Babel to Knowledge: Data Mining Large Digital Collections,” D-Lib Magazine 12, no. 3 (Mar 2006).

Cohen, Daniel J. and Roy Rosenzweig. “[WWW]Web of Lies? Historical Knowledge on the Internet,” First Monday 10, no. 12 (15 Nov 2005).

Cohen, Daniel J. and Roy Rosenzweig. “[WWW]No Computer Left Behind,” Chronicle of Higher Education (24 Feb 2006).

Garrett, Jeffrey. “[WWW]KWIC and Dirty? Human Cognition and the Claims of Full-Text Searching,” Journal of Electronic Publishing 9, no. 1 (Winter 2006).

Kelly, T. Mills. “[WWW]Analyzing Traffic,” edwired (29 Sep 2006).

Graham-Rowe, Duncan. “[WWW]Google’s Search for Meaning,” New Scientist (29 Jan 2005).

Turkel, W. J. “[WWW]Searching for History,” Digital History Hacks (12 Oct 2006).

UC Irvine, “[WWW]UCI Researchers ‘Text Mine’ the New York Times, Demonstrating Evolution of Potent New Technology,” (26 Jul 2006).

Unsworth, John. “[WWW]The Scholar in the Digital Library,” (6 Apr 2000).

Individual Exercises

Easy. The privacy of search records. In August 2006, AOL released search data for more than half a million of its users, each represented by a random ID number. Within days, the company realized that this was a mistake, withdrew the data and made a public apology. (There are articles about the story and some background information here. See also u500k’s AOL data analyzed.) Read through the articles and answer the following questions: What should historians make of this case now? How might future historians respond to the availability of similar datasets?

Easy. Find relevancy peaks. [WWW]Find Forward lets you determine which years are most closely associated with a particular search term. It does this by repeating your search with an additional search term, first ‘1950,’ then ‘1951,’ and so on. It keeps track of the number of hits for each search and plots them on a bar chart. To get started, you might try ‘depression’, ‘Elvis’ or ‘U2’ for the period 1950-2000. Now try some terms from your own research. Are the results what you expect?
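
The procedure described in this exercise can be sketched in a few lines of Python. This is only an illustration of the general idea, not Find Forward's actual code: the tiny DOCUMENTS list and the count_hits function are hypothetical stand-ins for a real search engine, and the names relevancy_by_year and bar_chart are invented for this sketch.

```python
from typing import Callable, Dict, Iterable

# Hypothetical stand-in for a search engine's index: a tiny in-memory corpus.
DOCUMENTS = [
    "Elvis records his first single in 1954",
    "Elvis appears on television in 1956",
    "Elvis is drafted in 1958",
    "Elvis tops the charts again in 1956",
    "The band U2 forms in Dublin in 1976",
]


def count_hits(query: str, documents: Iterable[str] = DOCUMENTS) -> int:
    """Count documents that contain every word of the query (case-insensitive)."""
    terms = query.lower().split()
    return sum(
        all(term in doc.lower().split() for term in terms) for doc in documents
    )


def relevancy_by_year(
    term: str,
    start: int,
    end: int,
    hit_counter: Callable[[str], int] = count_hits,
) -> Dict[int, int]:
    """Repeat the search once per year, adding the year as an extra search term."""
    return {year: hit_counter(f"{term} {year}") for year in range(start, end + 1)}


def bar_chart(counts: Dict[int, int], width: int = 40) -> str:
    """Render the hit counts as a plain-text bar chart, one line per year."""
    peak = max(counts.values(), default=0)
    lines = []
    for year, hits in counts.items():
        bar = "#" * (round(width * hits / peak) if peak else 0)
        lines.append(f"{year} {bar} {hits}")
    return "\n".join(lines)


counts = relevancy_by_year("Elvis", 1950, 1960)
print(bar_chart(counts))
```

To try this against a real source, you would replace count_hits with a function that returns the number of results a search service reports for a query. Keep in mind that such counts are often rough estimates, which is one reason the peaks may not match what you expect.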

Further Reading

Bradley, John. “[WWW]Text Tools,” in A Companion to Digital Humanities, ed. Susan Schreibman, Ray Siemens, John Unsworth. Oxford: Blackwell, 2004.

Chakrabarti, Soumen. Mining the Web: Discovering Knowledge from Hypertext Data. San Francisco: Morgan Kaufmann, 2003.

Garfinkel, Simson. Database Nation: The Death of Privacy in the 21st Century. Sebastopol, CA: O’Reilly, 2001.

[WWW]H-Bot.

Ide, Nancy. “[WWW]Preparation and Analysis of Linguistic Corpora,” in A Companion to Digital Humanities, ed. Susan Schreibman, Ray Siemens, John Unsworth. Oxford: Blackwell, 2004.

Mitchell, Tom M. Machine Learning. Boston: McGraw-Hill, 1997.

O’Harrow, Robert. No Place to Hide. New York: Free Press, 2006.

Weiss, Sholom M., Nitin Indurkhya, Tong Zhang and Fred J. Damerau. Text Mining: Predictive Methods for Analyzing Unstructured Information. New York: Springer, 2005.

Witten, Ian H. and Eibe Frank. Data Mining: Practical Machine Learning Tools and Techniques, 2nd ed. San Francisco: Morgan Kaufmann, 2005.
