UWO History 9808A Digital History, Fall 2009
Machine Learning and Data Mining (25 Nov 2009)
In Data Mining, Witten and Frank define the subject as “the extraction of implicit, previously unknown, and potentially useful information from data,” and as the process of “finding and describing patterns in data.” Machine learning, a sub-discipline of computer science, goes one step further by attempting to use these patterns to classify previously unseen data. Historians are now beginning to use both kinds of techniques in the research process.
Readings for Discussion
- Cohen, "Mapping What Americans Did on September 11," dancohen.org (8 Aug 2006).
- Cohen, "Intelligence Analysts and Humanities Scholars," dancohen.org (13 Nov 2006).
- Cohen, "History and the Second Decade of the Web," Rethinking History (Jun 2004).
- Cohen and Rosenzweig, "Web of Lies? Historical Knowledge on the Internet," First Monday 10, no. 12 (15 Nov 2005).
- Garrett, "KWIC and Dirty? Human Cognition and the Claims of Full-Text Searching," Journal of Electronic Publishing 9, no. 1 (Winter 2006).
- Gralla, "Ch 33 How Agents Work (Ch 32 in 7th ed.)", "Ch 48 How Internet Sites Can Invade Your Privacy (Ch 46 in 7th ed.)", "Ch 49 The Dangers of Spyware and Phishing (Not in 7th ed.)"
- Kelly, "Analyzing Traffic," edwired (29 Sep 2006).
- Shirky, "A Speculative Post on the Idea of Algorithmic Authority," shirky.com (15 Nov 2009).
- Turkel, "Searching for History," Digital History Hacks (12 Oct 2006).
- Unsworth, "The Scholar in the Digital Library," (6 Apr 2000).
Background Readings
The following set of posts describes how to implement one complete machine learning / data mining project, using trial records from the Old Bailey Online. The links to the source code in my blog are broken, but copies of all of the Python programs can be found here. If you just want to get an idea of what I did, read posts 1 and 14. The naive Bayesian learner is described in post 7.
- Turkel, "A Naive Bayesian in the Old Bailey, parts 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, and 14," Digital History Hacks (24 May - 3 July 2008).
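For orientation before reading the posts, here is a minimal sketch of how a naive Bayesian text classifier works. It is not the code from the Old Bailey posts. The class name, the tokenizer, and the four toy training sentences are all invented for illustration, and the choice of add-one smoothing is an assumption. The learner counts how often each word appears under each label, then scores an unseen text by adding the log probabilities of its words under every label and picking the highest.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing (illustrative sketch)."""

    def __init__(self):
        self.doc_counts = Counter()              # training documents per label
        self.word_counts = defaultdict(Counter)  # word frequencies per label
        self.vocab = set()                       # every word seen in training

    def train(self, text, label):
        self.doc_counts[label] += 1
        for word in tokenize(text):
            self.word_counts[label][word] += 1
            self.vocab.add(word)

    def classify(self, text):
        total_docs = sum(self.doc_counts.values())
        words = tokenize(text)
        best_label, best_score = None, float("-inf")
        for label in self.doc_counts:
            # Start from the log prior: how common this label was in training.
            score = math.log(self.doc_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in words:
                count = self.word_counts[label][word]
                # Add-one smoothing keeps unseen words from zeroing the score.
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy training data, made up for this example (not real trial records).
learner = NaiveBayes()
learner.train("stole a silver watch from the shop", "theft")
learner.train("took a purse and a handkerchief from his pocket", "theft")
learner.train("struck him with a knife and wounded him", "assault")
learner.train("beat the constable with a stick", "assault")

print(learner.classify("stole a handkerchief from the shop"))
```

With so little training data the result means nothing historically, but the mechanics are the same ones a larger project would use on thousands of records.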
Assignment
Do some simple text mining. This week you learned about some sophisticated tools that humanists can use to process large amounts of text and facilitate exploration. Some of these techniques require programming skills, but many do not. The Canadian TAPoR project is a wonderful collection of resources that bring text processing and analysis within the reach of any scholar. Starting at the TAPoR recipes page, try choosing a historical text from Gutenberg and generating a concordance. What kinds of things can you learn about a work this way? Feel free to blog about the assignment if you find something interesting.
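If you would rather see what a concordance tool does under the hood, the following is a minimal keyword-in-context (KWIC) sketch in Python. It is not how TAPoR works internally. The function name, the `width` parameter, and the sample sentence are assumptions made up for illustration. It finds every whole-word occurrence of a keyword, ignoring case, and shows a fixed number of characters on either side.

```python
import re


def concordance(text, keyword, width=30):
    """Return one keyword-in-context line per occurrence of keyword."""
    # Collapse whitespace so line breaks in the source do not split contexts.
    flat = " ".join(text.split())
    # \b restricts matches to whole words; the search ignores case.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(flat):
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        # Right-align the left context so the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group()}] {right}")
    return lines


# Made-up sample text; substitute the contents of any plain-text file.
sample = """The whale rose near the ship. We saw the whale
again at dusk, and then the Whale dived out of sight."""

lines = concordance(sample, "whale")
for line in lines:
    print(line)
```

To try it on a downloaded text, read the file into a string and pass that string in place of `sample`. Scanning the aligned column of keywords makes it easy to spot recurring phrases around a word.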

