Parerga und Paralipomena — http://www.michelepasin.org/blog
"At the core of all well-founded belief lies belief that is unfounded" – Wittgenstein

Notes from the Force11 annual conference
http://www.michelepasin.org/blog/2015/01/17/notes-from-the-force11-annual-conference/ — Sat, 17 Jan 2015

I attended the Force11 conference (https://www.force11.org/) in Oxford over the last couple of days (the conference was previously called 'Beyond the PDF').

Force11 is a community of scholars, librarians, archivists, publishers and research funders that has arisen organically to help facilitate the change toward improved knowledge creation and sharing. Individually and collectively, we aim to bring about a change in modern scholarly communications through the effective use of information technology. [About Force 11]

More than the presentations, I would say that the most valuable aspect of this event is the many conversations you can have with people from different backgrounds: techies, publishers, policy makers, academics, etc.

Nonetheless, here’s a (very short and biased) list of things that seemed to stand out.

  • A talk titled Who’s Sharing with Who? Acknowledgements-driven identification of resources by David Eichmann, University of Iowa. He is working on a (seemingly very effective) method for extracting contributor roles from scientific articles.
  • This presentation describes my recent work in semantic analysis of the acknowledgement section of biomedical research articles, specifically the sharing of resources (instruments, reagents, model organisms, etc.) between the author articles and other non-author investigators. The resulting semantic graph complements the knowledge currently captured by research profiling systems, which primarily focus on investigators, publications and grants. My approach results in much finer-grained information, at the individual author contribution level, and the specific resources shared by external parties. The long-term goal for this work is unification with the VIVO-ISF-based CTSAsearch federated search engine, which currently contains research profiles from 60 institutions worldwide.

     

  • A talk titled Why are we so attached to attachments? Let’s ditch them and improve publishing by Kaveh Bazargan, head of River Valley Technologies. He demoed a prototype manuscript-tracking system that allows editors, authors and reviewers to create new versions of the same document via an online, Google-Docs-like system which has JATS XML in the background.
  • I argue that it is precisely the ubiquitous use of attachments that has held up progress in publishing. We have the technology right now to allow the author to write online and have the file saved automatically as XML. All subsequent work on the “manuscript” (e.g. copy editing, QC, etc) can also be done online. At the end of the process the XML is automatically “rendered” to PDF, Epub, etc, and delivered to the end user, on demand. This system is quicker as there are no emails or attachments to hold it up, cheaper as there is no admin involved, and more accurate as there is only one definitive file (the XML) which is the “format of record”.

     

  • Rebecca Lawrence from F1000 presented and gave me a walkthrough of a new suite of tools they’re working on. That was quite impressive, I must say, especially given the variety of features on offer: tools to organize and store references, annotate and discuss articles and web pages, import them into Word documents, etc. All packed within a nice-looking and user-friendly application. This is due to go into public beta some time in March, but you can try to get access to it sooner by signing up here.

     

  • The best poster award went to 101 Innovations in Scholarly Communication – the Changing Research Workflow. This is a project aiming to chart innovation in scholarly information and communication flows. Very inspiring and definitely worth a look.

  • Finally, I’m proud to say that the best demo award went to my own resquotes.com, a personal quotations-manager online tool which I launched just a couple of weeks ago. Needless to say, it was great to get a vote of confidence from this community!

     

    If you want more, it’s worth taking a look directly at the conference agenda and in particular the demo/poster session agenda. And hopefully see you next year in Portland, Oregon :-)

     

Using Impromptu to visualize RSS feeds
http://www.michelepasin.org/blog/2011/12/21/using-impromptu-to-visualize-rss-feeds/ — Wed, 21 Dec 2011

Some time ago I was experimenting with processing and displaying RSS feeds within Impromptu, and as a result I built a small app that retrieves the news feed from The Guardian online and displays it on a canvas. I’ve had a bit of free time these days, so last night I thought it was time to polish it a little and make it available on this blog (who knows, maybe someone else will use it as a starting point for another project).

    [Image: Visualizing RSS feeds with Impromptu]

    There are still a thousand improvements that could be made, but the core of the application is there: I packaged it as a standalone app that you can download here (use the ‘Show Package Contents’ Finder command to see the source code).

    The application relies on a bunch of XML processing functions that I found within Impromptu’s ‘examples’ folder (specifically, the example named 35_objc_xml_lib). I pruned it a bit to fit my purposes and renamed it xml_lib.scm.

    Using that, I created a function that extracts the title and URL info from the Guardian feed:

    (load "xml_lib.scm")
    (define feedurl "http://feeds.guardian.co.uk/theguardian/world/rss")
    
    ;;
    ;; loads the feed and extracts title and url
    ;;
    
    (define get-articles-online
         (lambda ()
            (let* ((out '())
                   (feed (xml:load-url feedurl))
                   (titles (objc:nsarray->list (xml:xpath (xml:get-root-node feed)
                                                    "channel/item/title/text()")))
                   (urls (objc:nsarray->list (xml:xpath (xml:get-root-node feed)
                                                    "channel/item/link/text()"))))                                                 
               (for-each (lambda (x y)
                            (let ((xx (objc:nsstring->string x))
                                  (yy (objc:nsstring->string y)))
                               (set! out (append out (list (list xx yy))))))
                    titles urls)
               out)))
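
     For a quick sanity check you can call the function on its own (assuming xml_lib.scm has loaded correctly and the feed is reachable); this just prints the first (title url) pair:

     ;; fetch the feed and print the first (title url) pair
     (print (car (get-articles-online)))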
    

    Some feed titles are a bit longish, so I added a utility function formattext that wraps the titles’ text if they exceed a predefined length.

    (define formattext 
       (lambda (maxlength txt posx posy)
          (let ((l (string-length txt)))      
             (if (> l maxlength)
                 (let loop ((i 0)
                            (j maxlength) ;; comparison value: it decreases at each recursion (except the first one) 
                            (topvalue maxlength)) ;; reference value: must be equal to j at the beginning
                    (if (equal? (- topvalue i) j) ;; the first time
                        (loop (+ i 1) j topvalue)
                        (begin   ;(print (substring txt (- topvalue i) j))
                                 (if (string=? (substring txt (- topvalue i) j) " ")
                                     (string-append (substring txt 0 (- topvalue i)) 
                                                    "\n" 
                                                    (substring txt (- topvalue i) (string-length txt)))
                                     (if (< i topvalue) ;;avoid negative indexes in substring
                                         (loop (+ i 1) (- j 1) topvalue))))))
                 txt))))
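
     For example, with a made-up title string (note that the posx and posy arguments are accepted but not actually used inside the function):

     ;; wrap a longish (made-up) title at the last space before position 40
     (print (formattext 40 "Lorem ipsum dolor sit amet consectetur adipiscing elit sed do" 0 0))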
    

    And here’s the main loop: it goes through all the feed items at a predefined speed and displays each one on the canvas, using a cosine oscillator to vary the colours a bit. At the end it also updates three global variables that are used by the mouse-click-capturing routine.

    (define displayloop
       (lambda (beat feeds) 
          (let* ((dur 5)
                 (posx  (random 0 (- *canvas_max_x* 350)))
                 (posy  (random 10 (- *canvas_max_y* 150)))
                 (txt (formattext 40 (car (car feeds)) posx posy))
                 (dim ;(+ (length feeds) 10))                  
                      (if (= (length feeds) 29)
                          60  ;; if it's the first element of the feed list make it bigger
                          (random 25 50)))
                 (fill (if (= (length feeds) 29)
                             (list 1 0 (random) 1)  ;; if it's the first element of the feed list make it reddish
                             (list (random) 1 (random) 1)))
                 (style (gfx:make-text-style "Arial" dim fill)))
             (gfx:clear-canvas (*metro* beat) *canvas* (list (cosr .5 .6 .001) 0 (cosr .5 .6 .001) .5 ))
             (gfx:draw-text (*metro* beat) *canvas* txt style (list posx posy))
             (set! *pos_x* posx)
             (set! *pos_y* posy)
             (set! *current_url* (cadr (car feeds)))
         (callback (*metro* (+ beat (* 1/2 dur))) 'displayloop (+ beat dur)
                   (if-cdr-notnull feeds 
                                   (get-articles-online))))))
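
     The loop can then be kicked off in the usual temporal-recursion style; here is a minimal sketch (it assumes *metro*, *canvas* and the position globals are already defined, as they are in the bundled app):

     ;; start the display loop on the next whole beat,
     ;; seeding it with the articles fetched from the Guardian feed
     (displayloop (*metro* 'get-beat 4) (get-articles-online))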
    

    In order to capture clicks on the feed titles I simply create a rectangle path based on the x,y coordinates randomly assigned when the title is displayed on the canvas. These coordinates are stored in global variables so that they can be updated constantly.

    (io:register-mouse-events *canvas*)
    (define io:mouse-down
       (lambda (x y)
          (print x y)
          (when (gfx:point-in-path? (gfx:make-rectangle *pos_x* *pos_y* 200 200) x y )
                (util:open-url *current_url*))))
    

    Finally, the util:open-url function opens a URL string in your browser (I’ve already talked about it here).

    You can see all of this code in action by downloading the app and taking a look at its contents (all the files are under Contents/Resources/app).

    [Image: Visualizing RSS feeds with Impromptu]

    If I had the time…

    Some other things it’d be nice to do:

  • Creating a routine that makes the transitions among feed items less abrupt, maybe by using canvas layers.
  • Refining the click-event handling: at the moment you can click only on the most recent title; moreover, the click handler is updated so quickly that unless you click on a title as soon as it appears you won’t be able to trigger the open-url action (see the sketch after this list for one possible fix).
  • Refining the XML-tree parsing function, which is currently very minimal. We could extract the news entries’ descriptions and other data that would make the app more informative.
  • Adding some background music to it.
  • Any other ideas?
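
    On the click-handling point, here is a minimal, untested sketch of one possible approach: rather than overwriting *pos_x*, *pos_y* and *current_url* on every iteration, displayloop could push a (rectangle url) pair onto a global list, and the mouse handler could scan the whole list. The names *click-regions* and register-click-region are made up for this sketch; everything else reuses the functions already shown above.

    (define *click-regions* '())

    ;; store a clickable (rectangle url) pair; meant to be called from
    ;; displayloop in place of the three set! calls on the globals
    (define register-click-region
       (lambda (posx posy url)
          (set! *click-regions*
                (cons (list (gfx:make-rectangle posx posy 200 200) url)
                      *click-regions*))))

    ;; check every stored region rather than just the most recent title
    (define io:mouse-down
       (lambda (x y)
          (for-each (lambda (region)
                       (when (gfx:point-in-path? (car region) x y)
                             (util:open-url (cadr region))))
                    *click-regions*)))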

     

Event: THATcamp Kansas and Digital Humanities Forum
http://www.michelepasin.org/blog/2011/09/28/event-thatcamp-kansas-and-digital-humanities-forum/ — Wed, 28 Sep 2011

The THATcamp Kansas and Digital Humanities Forum took place last week at the Institute for Digital Research in the Humanities, which is part of the University of Kansas in beautiful Lawrence. I had the opportunity to be there and give a talk about some recent work of mine on digital prosopography and computer ontologies, so in this blog post I’m summing up some of the things that caught my attention while at the conference.

    The event took place on September 22-24 and consisted of three separate things:

  • Bootcamp Workshops: a set of in-depth workshops on digital tools and other DH topics http://kansas2011.thatcamp.org/bootcamps/.
  • THATCamp: an “unconference” for technologists and humanists http://kansas2011.thatcamp.org/.
  • Representing Knowledge in the DH conference: a one-day program of panels and poster sessions (schedule | abstracts).

    The workshop and THATcamp were both packed with interesting stuff, so I strongly suggest you take a look at the online documentation, which is very comprehensive. In what follows I’ll instead highlight some of the contributed papers which a) I liked and b) I was able to attend (needless to say, this list reflects only my individual preferences and interests). Hope you’ll find something of interest there too!

    A (quite subjective) list of interesting papers

     

  • The Graphic Visualization of XML Documents, by David Birnbaum (abstract): a quite inspiring example of how to employ visualizations in order to support philological research in the humanities. Mostly focused on Russian texts and XML-oriented technologies, but its principles are easily generalizable to other contexts and technologies.
  • Exploring Issues at the Intersection of Humanities and Computing with LADL, by Gregory Aist (abstract): the talk presented LADL, the Learning Activity Description Language, a fascinating software environment that provides a way to “describe both the information structure and the interaction structure of an interactive experience”, with the purpose of “constructing a single interactive Web page that allows for viewing and comparing of multiple source documents together with online tools”.
  • Making the most of free, unrestricted texts–a first look at the promise of the Text Creation Partnership, by Rebecca Welzenbach ( abstract ): an interesting report on the pros and cons of making available a large repository of SGML/XML encoded texts from the Eighteenth Century Collections Online (ECCO) corpus.
  • The hermeneutics of data representation, by Michael Sperberg-McQueen ( abstract ): a speculative and challenging investigation of the assumptions at the root of any machine-readable representation of knowledge – and their cultural implications.
  • Breaking the Historian’s Code: Finding Patterns of Historical Representation, by Ryan Shaw ( abstract ): an investigation on the usage of natural language processing techniques to the purpose of ‘breaking down’ the ‘code’ of historical narrative. In particular, the sets of documents used are related to the civil rights movement, and the specific NLP techniques being employed are named entity recognition, event extraction, and event chain mining.
  • Employing Geospatial Genealogy to Reveal Residential and Kinship Patterns in a Pre-Holocaust Ukrainian Village, by Stephen Egbert (abstract): this paper showed how it is possible to visualize residential and kinship patterns in the mixed-ethnic settlements of pre-Holocaust Eastern Europe by using geographic information systems (GIS), and how these results can provide useful materials for humanists to base their work on.
  • Prosopography and Computer Ontologies: towards a formal representation of the ‘factoid’ model by means of CIDOC-CRM, by me and John Bradley (abstract): this is the paper I presented (shameless self-plug, I know). It’s about the evolution of structured prosopography (= the ‘study of people’ in history) from a mostly single-application and database-oriented scenario towards a more interoperable and linked-data one. In particular, I talked about the recent efforts to represent the notion of ‘factoids’ (a conceptual model normally used in our prosopographies) using the ontological language provided by CIDOC-CRM (a computational ontology commonly used in the museum community).

Python links (and more) 7/2/11
http://www.michelepasin.org/blog/2011/02/03/python-links-and-more-7211/ — Thu, 03 Feb 2011

This post is just a collection of various interesting things I ran into over the last couple of weeks. They’re organized into three categories: pythonic links, events and conferences, and new online tools. Hope you’ll find something of interest!

    Pythonic stuff:

  • Epydoc
    Epydoc is a handy tool for generating API documentation for Python modules, based on their docstrings. For an example of epydoc’s output, see the API documentation for epydoc itself (html, pdf).
  • PyEnchant
    PyEnchant is a spellchecking library for Python, based on the excellent Enchant library.
  • Dexml
    The dexml module takes the mapping between XML tags and Python objects and lets you capture that as cleanly as possible. Loosely inspired by Django’s ORM, you write simple class definitions to define the expected structure of your XML document.
  • SpecGen
    SpecGen v5, ontology specification generator tool. It’s written in Python using Redland RDF library and licensed under the MIT license.
  • PyCloud
    Leverage the power of the cloud with only 3 lines of python code. Run long processes on the cloud directly from your shell!
  • commandlinefu.com
    This is not really pythonic – but nonetheless useful to Pythonistas: a community-driven repository of useful Unix shell commands!
    Events and Conferences:

  • Digital Resources in the Humanities and Arts Conference 2011
    University of Nottingham Ningbo, China. The DRHA 2011 conference theme this year is “Connected Communities: global or local2local?”
  • Narrative and Hypertext Workshop at the ACM Hypertext 2011 conference in Eindhoven.
  • Culture Hack Day, London, January 2011
    This event aimed at bringing cultural organisations together with software developers and creative technologists to make interesting new things.
  • History Hack Day, London, January 2011
    A bunch of hackers with a passion for history getting together and doing experimental stuff
  • Conference.archimuse.com
    The ‘online space for cultural informatics‘: lots of useful info here, about publications, jobs, people etc.
  • Agora project: Scholarly Open Access Research in European Philosophy
    Project looking at building an infrastructure for the semantic interlinking of European philosophy datasets
    Online tools:

  • FactForge
    A web application aiming at showcasing a ‘practical approach for reasoning with the web of linked data’.
  • Semantic Overflow
    A clone of Stack Overflow (collaboratively edited question and answer site for programmers) for questions ‘about semantic web techniques and technologies’.
  • Google Refine
    A tool for “working with messy data, cleaning it up, transforming it from one format into another, extending it with web services, and linking it to databases”.
  • Google Scribe
    A text editor with embedded autocomplete suggestions as you type
  • Books Ngram Viewer
    Tool that displays statistical information regarding the use of user-selected sentences in a corpus of books (e.g., “British English”, “English Fiction”, “French”) over the selected years
Getting back to the ontological work..
http://www.michelepasin.org/blog/2010/02/04/getting-back-to-the-ontological-work/ — Thu, 04 Feb 2010

I’ll be working in Osaka for three months, ontologizing a couple of datasets with the help of Riichiro Mizoguchi. This means that I’ll have enough time to revise various notions about ontology engineering during this period. Here’s a first and fundamental one, regarding the difference between ontologies and data models:

    The difference between ontologies and data models does not lie in the language being used: you can define an ontology in a basic ER language (although you will be hampered in what you can say); similarly, you can write a data model with OWL. Writing something in OWL does not make it an ontology! The key difference is not the language but the intended use. A data model is a model of the information in some restricted well-delimited application domain, whereas an ontology is intended to provide a set of shared concepts for multiple users and applications. To put it simply: data models live in a relatively small closed world; ontologies are meant for an open, distributed world (hence their importance for the Web).

    Schreiber, G. (2007). Knowledge Engineering. In Handbook of Knowledge Representation, pp. 929-946.

     

    OTHER SEMANTIC WEB LINKS THAT CAME UP TODAY:

  • Why use OWL? by Adam Pease (clear presentation of the advantages of OWL over XML)
  • Interdisciplinary Ontology Forum in Japan – InterOntology10
  • The research prototype of Europeana’s semantic search engine.
  • Hozo (a nice ontology editor) online ontology viewer
  • OWLSight – owl ontology browser (online)
