Professor Sukanta Chaudhuri yesterday gave a very interesting talk on the scope, methods and aims of 'Bichitra' (literally, 'the various'), the ongoing project for an online variorum edition of the complete works of Rabindranath Tagore in English and Bengali. The talk (part of this year's DDH research seminar) raised a number of issues I wasn't very familiar with, so in this post I summarise them briefly and then offer a couple of suggestions.
Sukanta Chaudhuri is Professor Emeritus at Jadavpur University, Kolkata (Calcutta), where he was formerly Professor of English and Director of the School of Cultural Texts and Records. His core specializations are Renaissance literature and textual studies: he published The Metaphysics of Text with Cambridge University Press in 2010. He has also translated widely from Bengali into English, and is General Editor of the Oxford Tagore Translations.
Rabindranath Tagore (1861 – 1941), Asia's first Nobel laureate, was arguably the most important icon of the modern Indian Renaissance. This recent project on the electronic collation of Tagore texts, called 'the Bichitra project', is being developed as part of the national commemoration of the 150th anniversary of the poet's birth (here's the official page). This is how the School of Cultural Texts and Records summarizes the project's scope:
The School is carrying out pioneer work in computer collation of Tagore texts and creation of electronic hypertexts incorporating all variant readings [...] we have now undertaken a two-year project entitled "Bichitra" for a complete electronic variorum edition of all Tagore's works in English and Bengali. The project is funded by the Ministry of Culture, Government of India, and is being conducted in collaboration with Rabindra-Bhavana, Santiniketan.
The target is to create a website which will contain (a) images of all significant variant versions, in manuscript and print, of all Tagore's works; (b) text files of the same; and (c) collation of all versions applying the "Pathantar" software. To this end, the software itself is being radically redesigned. Simultaneously, manuscript and print material is being obtained and processed from Rabindra-Bhavana, downloaded from various online databases, and acquired from other sources. Work on the project commenced in March 2011 and is expected to end in March 2013, by which time the entire output will be uploaded onto a freely accessible website.
On the visualization of the results of a collation. Some inspiration could be found in the kind of visualizations normally used in version control systems, where multiple and alternative versions of the same file must be tracked and shown to users. For example, we could think of the visualizations available on GitHub (a popular code-sharing site), which are described in this blog post and demonstrated via an interactive tool on this webpage. Here's a screenshot:
The situation is strikingly similar - or not? Would it be feasible to reuse one of these approaches with textual sources? Another relevant visualization is the one used by popular file-comparison software (e.g. FileMerge on a Mac) for showing differences between two files:
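To make the comparison with file-diffing tools a bit more concrete, here is a minimal sketch of a word-level comparison between two readings of the same line, using Python's standard difflib module. This is only an illustration of the general idea and says nothing about how the "Pathantar" software actually works; the function name `collate` and the two sample readings are invented for the example.

```python
import difflib


def collate(base: str, variant: str) -> list[tuple[str, str, str]]:
    """Compare two readings word by word.

    Returns a list of (operation, base_words, variant_words) tuples,
    one for each stretch where the two readings differ.
    """
    a = base.split()
    b = variant.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":  # keep only replace / delete / insert
            changes.append((tag, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return changes


if __name__ == "__main__":
    # Two invented readings, purely for illustration (not Tagore's text).
    manuscript = "the river runs quietly past the old house"
    print_edition = "the river flows quietly past the house"

    for op, old, new in collate(manuscript, print_edition):
        print(f"{op:8} {old!r} -> {new!r}")
```

For these two sample readings the sketch reports that "runs" was replaced by "flows" and that "old" was deleted. Whitespace splitting is a crude stand-in for tokenization, so a real collation of Bengali text would presumably need something more careful.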
On using language technologies with Bengali. I did a quick tour of what's available online, and (quite unsurprisingly, considering the reputation Indian computer scientists have) found several research papers which seem highly relevant. Here are a few of them:
- Asian language processing: current state-of-the-art [text]
- Research report on Bengali NLP engine for TTS [text]
- The EMILLE corpus, containing fourteen monolingual corpora, including both written and (for some languages) spoken data for fourteen South Asian languages [homepage]
- A complete OCR system for continuous Bengali characters [text]
- Parsing Bengali for Database Interface [text]
- Unsupervised Morphological Parsing of Bengali [text]
On open-source software that appears to be usable with Bengali text. Not a lot of stuff, but more than enough to get started (the second project in particular seems pretty serious):
- Open Bangla OCR - A BDOSDN (Bangladesh Open Source Development Network) project to develop a Bangla OCR
- Bangla OCR project, mainly focused on the research and development of an Optical Character Recognizer for Bangla / Bengali script
Cite this blog post:
Lecture slides from the Course on digital history, part of the master in Digital Humanities at King's College, London, Oct 2011.