The Nora Project aims at putting together a big pool of digital texts in the humanities in order to develop and test data mining techniques specific to this domain. Various collaborations with other institutions have provided them already a testbed of about 10,000 literary texts in English, from the 19th century, or about 5 GB of marked-up text.
Started by the University of Illinois' Graduate School of Library and Information Science, it relies on several years of software development work that has been done at the University of Illinois' National Center for Supercomputing Applications (NCSA), developing the D2K (Data to Knowledge) software, in Michael Welge's Automated Learning Group. As they explain:
[...] the goal of data-mining (including text-mining) is to produce new knowledge by exposing unanticipated similarities or differences, clustering or dispersal, co-occurrence and trends. Over the last decade, many millions of dollars have been invested in creating digital library collections: at this point, terabytes of full-text humanities resources are publicly available on the web. The goal of the nora project is to produce software for discovering, visualizing, and exploring significant patterns across large collections of full-text humanities resources in existing digital libraries.
I tried the online Nora Vis demo, on the letters of Emily Dickinson, through the Java Web start.
Didn't take too long to launch (well.. tx to KMi's super fast connection), and it automatically loads a text and some metadata we don't see initially. From the online guide I gathered that I have to browse the letters, and rank them according to how much they represent a specific content .
The tags associated with the documents are not very 'semantic', are they? Maybe I'm missing something... nonetheless, the idea is to provide a training corpus, through the ranking on the right-bottom part of the page.
The training corpus can then be used as a "pattern" to match other documents. At first sight, the result is a sort of linguistic similarity search on other documents.
Definitely an interesting concept that is worth more exploration!
Cite this blog post:
Annual Conference of the Canadian Society for Digital Humanities / Société pour l'étude des médias interactifs (SDH-SEMI 2010), Montreal, Canada, Jun 2010.
blog The Nora Project