ontology – Parerga und Paralipomena http://www.michelepasin.org/blog
"At the core of all well-founded belief lies belief that is unfounded" – Wittgenstein

OntoSpy v.1.7.4 (Mon, 27 Feb 2017) http://www.michelepasin.org/blog/2017/02/27/ontospy-v-1-7-4/

A new version of OntoSpy (1.7.4) is available online. OntoSpy is a lightweight Python library and command line tool for inspecting and visualising vocabularies encoded in the RDF family of languages.

This version includes a hugely improved API for creating nice-looking HTML or Markdown documentation for an ontology, which takes advantage of frameworks like Bootstrap and Bootswatch.
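For the impatient, here's roughly what loading a model looks like from Python (a minimal sketch: the Ontospy entry point is the documented one, but attribute and method names have shifted a bit between versions, so treat them as indicative):

import ontospy

# load and parse a vocabulary from the web (FOAF in this example)
model = ontospy.Ontospy("http://xmlns.com/foaf/spec/")

# quick sanity check of what was extracted
print(len(model.classes), "classes found")
model.printClassTree()  # indented rdfs:subClassOf hierarchy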

You can take a look at the examples page to see what I'm talking about.

[Screenshots: examples of the HTML documentation generated by OntoSpy]

To find out more about Ontospy:

  • CheeseShop: https://pypi.python.org/pypi/ontospy
  • Github: https://github.com/lambdamusic/ontospy

  • Here's a short video showing a typical session with the OntoSpy repl:

    Coming up next

  • More advanced ontology visualisations using D3 or similar JavaScript libraries;
  • A better separation between the core Python library and OntoSpy's other components. The package has grown a bit too much, in particular from the point of view of people who are only interested in using it to create their own applications, as opposed (for example) to reusing the built-in visualisations.
Of course, any comments or suggestions are welcome as usual – either using the form below or via GitHub. Cheers!

     

Ontospy v.1.6.7 (Sun, 12 Jun 2016) http://www.michelepasin.org/blog/2016/06/12/ontospy-v-1-6-7/

A new and improved version of OntoSpy (1.6.7) is available online. OntoSpy is a lightweight Python library and command line tool for inspecting and visualizing vocabularies encoded in the RDF family of languages.

    This update includes support for Python 3, plus various other improvements that make it easier to query semantic web vocabularies using OntoSpy’s interactive shell module. To find out more about Ontospy:

  • Docs: http://ontospy.readthedocs.org
  • CheeseShop: https://pypi.python.org/pypi/ontospy
  • Github: https://github.com/lambdamusic/ontospy

  • Here's a short video showing a typical session with the OntoSpy repl:

    What’s new in this release

    The main new features of version 1.6.7:

  • added support for Python 3 (thanks to a pull request from https://github.com/T-002)
  • the import [file | uri | repo | starter-pack] command, which makes it easier to load models into the local repository. You can import a local RDF file or a web resource via its URI. The repo option lets you select an ontology from those available in a couple of online public repositories; finally, the starter-pack option downloads a few widely used vocabularies (e.g. FOAF, DC, etc.) into the local repository, which is mostly useful after a fresh installation in order to get started. See the example session after this list.
  • the info [toplayer | parents | children | ancestors | descendants] command, which prints more detailed information about entities
  • added an incremental search mode based on text patterns, e.g. to narrow down the options returned by the ls command
  • calling the serialize command at ontology level now serializes the whole graph
  • made the caching functionality version-dependent
  • added a JSON-LD serialization option (via rdflib-jsonld)
  • Install/update simply by typing pip install ontospy -U in your terminal window (see this page for more info).
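Here's a rough sketch of how the new commands fit together in a shell session (the subcommand names follow the release notes above, but the way the shell is launched and the exact prompts are my assumptions here, so check the docs):

$ pip install ontospy -U                  # install or update

$ ontospy                                 # launch the interactive shell
> import starter-pack                     # download FOAF, DC and other common vocabularies
> import uri http://xmlns.com/foaf/spec/  # import a web resource via its URI
> ls                                      # list the models in the local repository
> info toplayer                           # detailed info about the current model's top classes
> serialize                               # at ontology level, dumps the whole graph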

    Coming up next

I'd really like to add more output visualisations, e.g. using VivaGraphJS or the JavaScript InfoVis Toolkit.

Probably even more interestingly, I'd like to refactor the code that generates visualisations so that people can develop their own via a standard API and then publish them on GitHub.

    Lastly, more support for instance management: querying and creating instances from any loaded ontology.

    Of course, any comments or suggestions are welcome as usual – either using the form below or via GitHub. Cheers!

     

Towards an ontology for philosophy (Wed, 17 Jun 2015) http://www.michelepasin.org/blog/2015/06/17/towards-an-ontology-for-philosophy-really/

I enjoyed watching a recent presentation by Barry Smith about ontology engineering and, in particular, its application to the field of philosophy itself. The presentation was hosted by the InPhO team at Indiana University, whose ongoing work on creating an ontological backbone for the Stanford Encyclopedia of Philosophy has drawn the attention of many.

Barry Smith is a prominent contributor to both theoretical and applied research in ontology, and the author of many publications on the topic. In particular, his Basic Formal Ontology is a top-level model widely used in the scientific community.

I'm a bit surprised that there was no mention whatsoever of the work I did a while back in the context of the PhiloSURFical project. Built as part of my PhD, the PhiloSURFical software tool allowed users to navigate a philosophical text, taking advantage of a map of the concepts relevant to the text. The map, in this case, relied on a rather generic ontology for philosophy, which I instantiated using concepts from Wittgenstein's Tractatus Logico-Philosophicus.

At the time I could not find evidence of any other ontology modelling the philosophical domain, and to this day I haven't seen any that provides the same level of detail in modelling the various nuances of philosophical ideas.

[Screenshots: the PhiloSURFical ontology and text-navigation interface]

    Admittedly, the OWL formalization wasn’t very good (in fact I originally implemented the ontology using a KR language called OCML). Maybe though I should take this as an incentive to revive this work and publish it again using a more modern Linked Data approach!

    A detailed summary of the modeling approach can be found here:

Michele Pasin, Enrico Motta. Ontological Requirements for Annotation and Navigation of Philosophical Resources. Synthese, Volume 182, Number 2, Springer, September 2011.

     

In any case, here are a few interesting links and slides from Barry Smith's presentation:

  • http://philosophyfamilytree.wikispaces.com/
  • http://ontology.buffalo.edu/philosophome/
  • http://kieranhealy.org/blog/archives/2013/06/18/a-co-citation-network-for-philosophy/
  • http://philosophyideas.com/

[Screenshots: slides from Barry Smith's presentation]

Nature.com ontologies portal available online (Thu, 30 Apr 2015) http://www.michelepasin.org/blog/2015/04/30/nature-com-ontologies-portal-available-online/

The Nature ontologies portal is a new section of the nature.com site that describes our involvement with semantic technologies and makes several models and datasets available to the wider public as RDF linked data.

We launched the portal nearly a month ago, with the purpose of sharing our experiences with semantic technologies and, more generally, of contributing our data models and datasets to the wider linked data community.

[Screenshot: the nature.com ontologies portal]

This April 2015 release doubles the number and size of our published data models, which now span more completely the various things our world contains: from publication things (articles, figures, etc.) to classification things (article-types, subjects, etc.) and the additional things used to manage our content publishing operation (assets, events, etc.). Also included are a release page for the latest data release and a separate page for archival data releases.

[Diagram: hierarchy of the NPG data models, v2]

    Background

    Is this the first time you’ve heard about semantic web and ontologies?
     
Then you should know that even though XML remains the main technology used internally at Macmillan Science and Education to represent and store the things we publish, the metadata about these documents (e.g. publication details, subject categories, etc.) is normally also encoded using a more abstract, graph-oriented information model.
     
This model is called RDF and it has two key characteristics:
– it encodes all information in the form of triples, i.e. <subject> <predicate> <object>
– it was built with the web in mind: broadly speaking, each item in a triple can be accessed via the internet, i.e. it is a URI (a generalised notion of a URL).
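To make this concrete, here is a minimal sketch of a single triple built with rdflib, the Python RDF library (the example URIs are invented for illustration):

import rdflib

g = rdflib.Graph()

# one triple: <subject> <predicate> <object>, where each item is a URI
subject = rdflib.URIRef("http://example.org/articles/a123")
predicate = rdflib.URIRef("http://purl.org/dc/terms/subject")
obj = rdflib.URIRef("http://example.org/subjects/ontologies")
g.add((subject, predicate, obj))

print(g.serialize(format="nt"))  # N-Triples output: one line per triple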
     
So why use RDF?

The RDF model makes it easier to maintain a shared yet scalable schema (aka an 'ontology') of the data types in use within our organization. It works a bit like a common language that is spoken by an increasing number of data stores, and thus allows things to be joined up more easily whenever needed.
     
At the same time, since the RDF model is native to the web, it facilitates the 'semantic' integration of our data with that of the increasing number of other organisations that publish their data using compatible models.
     
For example, the BBC, Elsevier and, more recently, Springer are among the many organisations that contribute to the Linked Data Cloud.

    What’s next

We'll continue improving these ontologies and releasing new ones as they are created. But, probably most interestingly for many people, we're working on a new release of the whole NPG articles dataset (~1M articles).

    So stay tuned for more!

     

Nature.com subject pages available online! (Mon, 23 Jun 2014) http://www.michelepasin.org/blog/2014/06/23/nature-com-subject-pages-available-online/

Subject pages aggregate content from across nature.com, based on the tagging of that content with NPG subject ontology terms. After six months of work on this project we've finally launched the first release of the site, which is reachable online at http://www.nature.com/subjects. Hooray!

This has been a particularly challenging experience because I've essentially been wearing two hats for the past six months: product owner, leading the team in day-to-day activities and the prioritization of tasks, and information architect, dealing with the way content is organized and presented to users (my usual role).

    In a nutshell, the goal of the project was to help our readers discover content more easily by using an internally-developed subject ontology to publish a page per term. The ontology is actually a poly-hierarchical taxonomy of scientific topics, which has been used in the last couple of years to tag all articles published on nature.com.

Besides helping users browse the site more easily, subject pages also contribute to making NPG content more discoverable via Google and other external search engines. All of this is powered by a new backend platform which combines the expressiveness of linked data technologies (RDF) with the scalability of more traditional XML data stores (MarkLogic).

The main features are:
– one page per subject term, collating all content tagged with that term across nature.com
– RSS and ATOM feeds for each of the subject terms (~2500)
– dedicated pages that collate content from different journals based on their article types (e.g. news, research, etc.)
– a visual tool to navigate subjects based on the ontology relations
– subject email alerts (to be released in the coming weeks)

    It’s been a lot of work to bring all of this content together within a single application (keep in mind that the content comes from more than 80 different journals!) but this is just the beginning.

In the next months we're looking at extending this work by making the content available in other formats (e.g. RDF), providing more ways to navigate through the data (facets, visualizations) and integrating it with other datasets available online... so stay tuned for more!

[Screenshots: nature.com subject pages]

Towards a conceptual model for the domain of sculpture (Sat, 19 Nov 2011) http://www.michelepasin.org/blog/2011/11/19/towards-a-conceptual-model-for-the-domain-of-sculpture/

For the next two years I'll be collaborating with the Art of Making project. The project investigates the processes involved in the carving of stone during the Roman period; in particular, it aims to analyse them using the insights and understanding that Peter Rockwell (son of Norman Rockwell) developed during his lifelong experience as a sculptor. Eventually we will present these results by means of a freely accessible online digital resource that guides users through examples of stone carving. In this post I just wanted to report on the very first discussions I had with the sculpture and art scholars I'm working with, with the aim of creating a shared model for this domain.

The project started this July; it is based at King's College London and is funded by the Leverhulme Trust. I'm mostly involved with the digital aspects of the project, and as usual one of the first steps in building a digital resource (in particular, a database-backed one) is the construction of a conceptual model that can represent the main types of things being dealt with.

In other words, it is fundamental to identify the things our database and web application should 'talk about'; later on, this model can be refined and extended so as to become an abstract template of the data-manipulation tasks the software application must be capable of supporting (e.g. entering data into the system, searching and visualising it).

    Here’s a nice example of the sculptures (a sarcophagus from Aphrodisias) that constitute our ‘source’ materials:

    What are the key entities in the sculpture domain?

To this end, a few weeks ago we had a very productive brainstorming session aimed at fleshing out the main items of interest in the world of sculpture. This is a very first step towards the construction of a formal model for this domain; nonetheless, I think we have already managed to pin down the key elements we're going to be dealing with over the next two years.

    Here’s a list of the main objects we identified:

• People, such as craftsmen, etc.
• Sculptures (of various kinds)
• Materials
• Tools
• Generic processes that are part of a sculpting project, such as quarrying and transport
• More specific methods used within a particular process, e.g. carving styles, or approaches to quarrying
• Traditions: conceptualisations of the 'way of doing things' that, in turn, can inspire the way methods and processes are carried out nowadays

    We encoded the results of our discussions in a mind map for better readability, and also in order to use a technology that would make it easier to share our findings later on. I added it below.. (in case the interactive image doesn’t work, you can find it here too).

    Fleshing out the model a bit more

    After a few weeks of work we did a reiteration of the conceptual map above. The good news was that it soon became evident to us that we got it quite right on the first round; that is, we didn’t really feel like adding or removing anything from the map.

On the other hand, we thought we should try to add some relations (= links, arcs) among the concepts (= bubbles) previously identified, so as to characterize their semantics a bit more. I had a go at adding some relations first, and here's the result:

I should specify that I have no knowledge whatsoever of the domain of sculpture, so the stuff I added to the map came entirely out of the (little) research I've been doing on the subject (on and off) over the last few weeks.

At the same time, Will and Ben (the art historians I'm collaborating with) worked independently on the task of fleshing out the mind map with more relations. Needless to say, what they came up with is way more dense and intricate than anything I could have imagined! This is probably not surprising, as one would expect a significant difference between a non-expert's representation of a subject domain and one created by experts. Still, it was interesting to see it happen with my own eyes!
Here it is:

    The next step will be trying to reduce the (natural) complexity of the portion of the world we are representing to a more manageable size… then, formalize it, and start building our database based on that.. stay tuned for more!

     

Event: THATCamp Kansas and Digital Humanities Forum (Wed, 28 Sep 2011) http://www.michelepasin.org/blog/2011/09/28/event-thatcamp-kansas-and-digital-humanities-forum/

The THATCamp Kansas and Digital Humanities Forum took place last week at the Institute for Digital Research in the Humanities, which is part of the University of Kansas in beautiful Lawrence. I had the opportunity to be there and give a talk about some recent work on digital prosopography and computer ontologies, so in this blog post I'm summing up the things that caught my attention while at the conference.

The event took place on September 22-24 and consisted of three separate things:

  • Bootcamp Workshops: a set of in-depth workshops on digital tools and other DH topics (http://kansas2011.thatcamp.org/bootcamps/)
  • THATCamp: an "unconference" for technologists and humanists (http://kansas2011.thatcamp.org/)
  • Representing Knowledge in the DH conference: a one-day program of panels and poster sessions (schedule | abstracts)

The workshops and THATCamp were both packed with interesting stuff, so I strongly suggest you take a look at the online documentation, which is very comprehensive. In what follows I'll instead highlight some of the contributed papers which (a) I liked and (b) I was able to attend (needless to say, this list matches only my individual preferences and interests). Hope you'll find something of interest there too!

    A (quite subjective) list of interesting papers

     

  • The Graphic Visualization of XML Documents, by David Birnbaum (abstract): a quite inspiring example of how to employ visualizations to support philological research in the humanities. Mostly focused on Russian texts and XML-oriented technologies, but its principles are easily generalizable to other contexts and technologies.
  • Exploring Issues at the Intersection of Humanities and Computing with LADL, by Gregory Aist (abstract): the talk presented LADL, the Learning Activity Description Language, a fascinating software environment that provides a way to "describe both the information structure and the interaction structure of an interactive experience", with the purpose of "constructing a single interactive Web page that allows for viewing and comparing of multiple source documents together with online tools".
  • Making the most of free, unrestricted texts–a first look at the promise of the Text Creation Partnership, by Rebecca Welzenbach ( abstract ): an interesting report on the pros and cons of making available a large repository of SGML/XML encoded texts from the Eighteenth Century Collections Online (ECCO) corpus.
  • The hermeneutics of data representation, by Michael Sperberg-McQueen ( abstract ): a speculative and challenging investigation of the assumptions at the root of any machine-readable representation of knowledge – and their cultural implications.
  • Breaking the Historian’s Code: Finding Patterns of Historical Representation, by Ryan Shaw ( abstract ): an investigation on the usage of natural language processing techniques to the purpose of ‘breaking down’ the ‘code’ of historical narrative. In particular, the sets of documents used are related to the civil rights movement, and the specific NLP techniques being employed are named entity recognition, event extraction, and event chain mining.
  • Employing Geospatial Genealogy to Reveal Residential and Kinship Patterns in a Pre-Holocaust Ukrainian Village, by Stephen Egbert (abstract): this paper showed how residential and kinship patterns in the mixed-ethnic settlements of pre-Holocaust Eastern Europe can be visualized using geographic information systems (GIS), and how these results can provide useful materials for humanists to base their work on.
  • Prosopography and Computer Ontologies: towards a formal representation of the ‘factoid’ model by means of CIDOC-CRM, by me and John Bradley ( abstract ): this is the paper I presented (shameless self plug, I know). It’s about the evolution of structured prosopography (= the ‘study of people’ in history) from a mostly single-application and database-oriented scenario towards a more interoperable and linked-data one. In particular, I talked about the recent efforts for representing the notion of ‘factoids’ (a conceptual model normally used in our prosopographies) using the ontological language provided by CIDOC-CRM (a computational ontology commonly used in the museum community).

Inspecting an ontology with RDFLib (Mon, 18 Jul 2011) http://www.michelepasin.org/blog/2011/07/18/inspecting-an-ontology-with-rdflib/

RDFLib (homepage) is a pretty solid and comprehensive RDF-programming kit for Python. In a previous post I discussed the pythonic options currently available for doing semantic web programming; after some more in-depth testing I realized that RDFLib is the most accessible and complete of them all (in fact, many of the available libraries are built on RDFLib's APIs). So, here we go: in this post I'm giving an overview of some of the things you can do with this library.

    Update 2014-10-04: the latest version of the Python library described in this post is available on GitHub

The Linked Data world is primarily made up of RDF, many would say, so the most important thing is being able to parse and extract information from this simple but versatile language. A quite well-known mantra in this community is 'a little semantics goes a long way', which expresses succinctly the idea that there's no need to fixate on the construction of large-scale CYC-like knowledge-based systems in order to get something going in an open-world scenario such as the web (of data).

In other words, this idea suggests that (for now) it's enough to make your application spit out structured data using a standard data model (RDF, that is), and possibly connect your RDF dataset to other datasets in the 'cloud' by creating rdf-links. Once you've done that, you can take it easy and stop worrying about the data integration problems your RDF might generate, or the 'big picture'. Others will figure out how to use your data; it's an incremental approach, and there will be some sort of snowball effect at some stage, semantic web enthusiasts seem to suggest. This and other arguments are a bit make-believe, I have to say; but at the same time they do make some sense: unless we have some real stuff to play with out there on the data-web, not much will ever happen!

    Hullo, RDFlib

After quickly ascertaining that it's not a total waste of time to work with RDF, it's now time to get practical and experiment a bit with RDFLib. This is a great Python library, for it lets you process RDF data very easily. Example:

    # open a graph
    >>> import rdflib
    >>> graph = rdflib.Graph()
    
    
    # load some data
    >>> graph.parse('http://dbpedia.org/resource/Semantic_Web')
<Graph identifier=... (<class 'rdflib.graph.Graph'>)>
    >>> len(graph)
    98
    
    
    # query the data
    >>> list(graph)[:10]
    [(rdflib.term.URIRef('http://dbpedia.org/resource/SUPER'), rdflib.term.URIRef('http://dbpedia.org/property/keywords'),
    rdflib.term.URIRef('http://dbpedia.org/resource/Semantic_Web')), (rdflib.term.URIRef('http://dbpedia.org/resource/Semantic_internet'),
    rdflib.term.URIRef('http://dbpedia.org/ontology/wikiPageRedirects'), rdflib.term.URIRef('http://dbpedia.org/resource/Semantic_Web')),
    (rdflib.term.URIRef('http://dbpedia.org/resource/SW'), rdflib.term.URIRef('http://dbpedia.org/ontology/wikiPageDisambiguates'),
    rdflib.term.URIRef('http://dbpedia.org/resource/Semantic_Web')), (rdflib.term.URIRef('http://dbpedia.org/resource/Semantic_integrity'),
    rdflib.term.URIRef('http://dbpedia.org/ontology/wikiPageRedirects'), rdflib.term.URIRef('http://dbpedia.org/resource/Semantic_Web')),
    (rdflib.term.URIRef('http://dbpedia.org/resource/Ontotext'), rdflib.term.URIRef('http://dbpedia.org/ontology/industry'),
    rdflib.term.URIRef('http://dbpedia.org/resource/Semantic_Web')), (rdflib.term.URIRef('http://mpii.de/yago/resource/Semantic_Web'),
    rdflib.term.URIRef('http://www.w3.org/2002/07/owl#sameAs'), rdflib.term.URIRef('http://dbpedia.org/resource/Semantic_Web')),
    (rdflib.term.URIRef('http://dbpedia.org/resource/Deborah_McGuinness'), rdflib.term.URIRef('http://dbpedia.org/ontology/knownFor'),
    rdflib.term.URIRef('http://dbpedia.org/resource/Semantic_Web')), (rdflib.term.URIRef('http://dbpedia.org/resource/The_semantic_web'),
    rdflib.term.URIRef('http://dbpedia.org/ontology/wikiPageRedirects'), rdflib.term.URIRef('http://dbpedia.org/resource/Semantic_Web')),
    (rdflib.term.URIRef('http://dbpedia.org/resource/Access-eGov'), rdflib.term.URIRef('http://dbpedia.org/property/keywords'),
    rdflib.term.URIRef('http://dbpedia.org/resource/Semantic_Web')), (rdflib.term.URIRef('http://dbpedia.org/resource/SOA4All'),
    rdflib.term.URIRef('http://dbpedia.org/property/keywords'), rdflib.term.URIRef('http://dbpedia.org/resource/Semantic_Web'))]
    
    
    # print out some triples
    >>> for s, p, o in graph:
...     print s, "\n--- ", p, "\n------ ", o
    ...
    
    http://dbpedia.org/resource/SUPER
    ---  http://dbpedia.org/property/keywords 
    ------  http://dbpedia.org/resource/Semantic_Web
    http://dbpedia.org/resource/Semantic_internet
    ---  http://dbpedia.org/ontology/wikiPageRedirects 
    ------  http://dbpedia.org/resource/Semantic_Web
    http://dbpedia.org/resource/SW
    ---  http://dbpedia.org/ontology/wikiPageDisambiguates 
    ------  http://dbpedia.org/resource/Semantic_Web
    http://dbpedia.org/resource/Semantic_integrity
    ---  http://dbpedia.org/ontology/wikiPageRedirects 
    ------  http://dbpedia.org/resource/Semantic_Web
    http://dbpedia.org/resource/Semantic_Web 
    ---  http://dbpedia.org/ontology/abstract 
    ------  Con il termine web semantico, termine coniato dal suo ideatore, Tim Berners-Lee, si intende la trasformazione del World Wide Web in un ambiente dove i documenti pubblicati (pagine HTML, file, immagini, e così via) siano associati ad informazioni e dati che ne specifichino il contesto semantico in un formato adatto all'interrogazione, all'interpretazione e, più in generale, all'elaborazione automatica. Con l'interpretazione del contenuto dei documenti che il Web Semantico propugna, saranno possibili ricerche molto più evolute delle attuali, basate sulla presenza nel documento di parole chiave, ed altre operazioni specialistiche come la costruzione di reti di relazioni e connessioni tra documenti secondo logiche più elaborate del semplice link ipertestuale.
    http://dbpedia.org/resource/Semantic_Web 
    --- etc. etc etc................
    

Pretty straightforward, huh? In a nutshell, what we've just done is:

a) load the RDF description of the 'Semantic Web' page on DBpedia (http://dbpedia.org/resource/Semantic_Web);
b) show the first 10 triples in that RDF graph;
c) iterate through all the triples in the graph and print them out in a format that reflects the subject-predicate-object structure of RDF.
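Incidentally, the same exploration can be done declaratively with SPARQL; here's a quick sketch (recent versions of rdflib bundle a SPARQL processor, while the 2.x era required a separate plugin):

# find ten triples pointing at the Semantic_Web resource
results = graph.query("""
    SELECT ?s ?p
    WHERE { ?s ?p <http://dbpedia.org/resource/Semantic_Web> . }
    LIMIT 10
""")
for row in results:
    print row[0], row[1]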

    However we still don’t know much about those data. Meaning: what is the abstract structure used to define them? Do they conform to some sound and thorough data-model or is it just some automatically-generated messy agglomerate of database records? In other words, what I want to know is, what’s the ontology behind these data? How can I see it? Shall I reuse it (and thus endorse it) within my own application, or does my application require something slightly different?

I'm probably biased here, because I personally get much more satisfaction from creating and thinking about ontologies than from fiddling with large quantities of RDF/XML triples. Still, I think that being able to evaluate the ontology a bunch of RDF refers to is of vital importance, in order to judge whether that RDF is what you're looking for or not, and how best to integrate it into your application.

Long story short, I couldn't find anything in RDFLib that would let me print out the hierarchy tree of an ontology and other related information. So I thought, here's a good candidate task for learning how to use the library better.

    Inspecting an ontology using RDFLib

I created a small class called 'OntoInspector' that you can instantiate with an RDFS/OWL ontology and then query to find out basic information about that ontology. I know, all of this could have been done using one of the many (and constantly increasing) ontology editing tools, but hey, this is all about learning, isn't it?
You can find all the source code on GitHub (it was originally on BitBucket). Feel free to get it and modify it as needed. Also, I integrated this Python toolkit into a Django application that lets you browse ontologies online (beware: it's just a hack really). This is called (surprise) OntoView, and it's accessible here.

The first thing to do in our class definition is (obviously) to load up the RDFLib library. I developed this using RDFLib 2.4, but recently tested it with 3.0 (the latest release available) and it all still works fine. By loading the RDF and RDFS modules we'll have access to all the constants needed to query for classes and subclasses. Note that I added an OWL module, as that is not part of RDFLib; you can find it in the source code, it's just a list of all the predicates in the OWL vocabulary.

from rdflib import ConjunctiveGraph, Namespace, exceptions
from rdflib import URIRef, RDFS, RDF, BNode

import OWL

Now let's set up the basic structure of the OntoInspector class. In principle, an OntoInspector object should contain all the information necessary to query an ontology. An ontology is referred to via its URI, so that's all that is needed to create an instance of OntoInspector too:

    class OntoInspector(object):
    
        """Class that includes methods for querying an RDFS/OWL ontology"""        
    
        def __init__(self, uri, language=""):
            super(OntoInspector, self).__init__()
    
            self.rdfGraph = ConjunctiveGraph()
            try:
                self.rdfGraph.parse(uri, format="xml")
            except:
                try:
                    self.rdfGraph.parse(uri, format="n3")
                except:
                    raise exceptions.Error("Could not parse the file! Is it a valid RDF/OWL ontology?")
    
            finally:
                # let's cache some useful info for faster access
                self.baseURI = self.get_OntologyURI() or uri            
            self.allclasses = self.__getAllClasses()  # note: no argument here; classPredicate is not defined in __init__'s scope
                self.toplayer = self.__getTopclasses()
                self.tree = self.__getTree()
    
    
        def get_OntologyURI(self, ....):
            # todo
            pass
    
        def __getAllClasses(self, ....):
            # todo
            pass
           
    
        def __getTopclasses(self, ....):
            pass
    
    
        def __getTree(self, ....):
            # todo
            pass

As you can see, the __init__ method tries to load the ontology file (which can be expressed in either RDF/XML or N3 format) and then sets up four class attributes containing some key information about the ontology: its URI, a list of all the available classes, the classes in the top layer, and the main taxonomical tree of the ontology.
We're now going to implement the methods needed to fill out these four attributes.

    Getting the ontology URI

If we're dealing with an OWL ontology, it may be the case that the URI we have just used to retrieve the ontology file is not the 'official' URI of the ontology. In fact, OWL provides a construct that can be used to state what the base URI of an ontology is (essentially, this is equivalent to stating that an RDF resource has rdf:type http://www.w3.org/2002/07/owl#Ontology).
So in the following method we first check whether a URI of rdf:type owl:Ontology exists, and return it if available (when we return None, the URI value defaults to the URI originally provided when creating the OntoInspector object; see the constructor code above):

    def get_OntologyURI(self, return_as_string=True):
        """ 
        In [15]: [x for x in o.rdfGraph.triples((None, RDF.type, OWL.Ontology))]
        Out[15]: 
        [(rdflib.URIRef('http://purl.com/net/sails'),
          rdflib.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'),
          rdflib.URIRef('http://www.w3.org/2002/07/owl#Ontology'))]
    
        Mind that this will work only for OWL ontologies.
        In other cases we just return None, and use the URI passed at loading time
        """
    
        test = [x for x, y, z in self.rdfGraph.triples((None, RDF.type, OWL.Ontology))]
    
        if test:
            if return_as_string:
                return str(test[0])
            else:
                return test[0]
        else:
            return None

    Extracting all the classes

Essentially, there are only two ways to define a class: you can either specify that an entity has a property rdf:type with value rdfs:Class, or that it has a property rdf:type with value owl:Class. Note that the owl:Class predicate is defined as a subclass of rdfs:Class. The rationale for having a separate OWL class construct lies in the restrictions of OWL DL (and thus also of OWL Lite), which imply that not all RDFS classes are legal OWL DL classes. In OWL Full these restrictions do not exist, and therefore owl:Class and rdfs:Class are equivalent in OWL Full (more info here: http://www.w3.org/TR/owl-ref/, section 3.1).

Thus, in order to retrieve all the classes defined in an ontology we can just query the RDF graph for triples of this form:

<someURI> rdf:type rdfs:Class .    (or owl:Class)

This approach will work in the majority of cases. However, things are complicated by the fact that people are sometimes sloppy when defining ontologies, or use different tools that automatically generate different styles of RDF code. For example, often an entity is defined as being an rdfs:subClassOf another entity, without it being explicitly declared that both of them are (RDFS or OWL) classes; another common case is a class that is mentioned in the domain/range values of properties (via rdfs:domain and rdfs:range) but never declared explicitly.

    Since we want to be as comprehensive as possible when looking for *all* the classes present in an ontology, I added a couple of methods that deal with these borderline cases. If you don’t want to include all of this stuff, you can still bypass these extra checks by using the classPredicate argument.

    def __getAllClasses(self, classPredicate = "", removeBlankNodes = True):
        """  
        Extracts all the classes from a model
        We use the RDFS and OWL predicate by default; also, we extract non explicitly declared classes
        """
    
        rdfGraph = self.rdfGraph
        exit = []       
    
        if not classPredicate:          
            for s, v, o in rdfGraph.triples((None, RDF.type , OWL.Class)): 
                exit.append(s)
            for s, v, o in rdfGraph.triples((None, RDF.type , RDFS.Class)):
                exit.append(s)
    
            # this extra routine makes sure we include classes not declared explicitly
            # eg when importing another onto and subclassing one of its classes...
            for s, v, o in rdfGraph.triples((None, RDFS.subClassOf , None)):
                if s not in exit:
                    exit.append(s)
                if o not in exit:
                    exit.append(o)
    
            # this extra routine includes classes found only in rdfs:domain and rdfs:range definitions
            for s, v, o in rdfGraph.triples((None, RDFS.domain , None)):
                if o not in exit:
                    exit.append(o)
            for s, v, o in rdfGraph.triples((None, RDFS.range , None)):
                if o not in exit:
                    exit.append(o)
    
        else:
            if classPredicate == "rdfs" or classPredicate == "rdf":
                for s, v, o in rdfGraph.triples((None, RDF.type , RDFS.Class)):
                    exit.append(s)
            elif classPredicate == "owl":
                for s, v, o in rdfGraph.triples((None, RDF.type , OWL.Class)): 
                    exit.append(s)
            else:
                raise exceptions.Error("ClassPredicate must be either rdf, rdfs or owl")
    
        exit = remove_duplicates(exit)
    
        if removeBlankNodes:
            exit = [x for x in exit if not self.__isBlankNode(x)]
    
        return sort_uri_list_by_name(exit)

    You probably noticed that there are a couple of other methods mentioned in the snippet above: they are used for checking if a URI is a BlankNode (which we’re normally not interested in, when dealing with ontologies) and for other utility functions, such as sorting and removing duplicates from our list of classes. You’ll find all the details about this stuff in the source code obviously..

    Next, we want to be able to move around the ontology hierarchy. So we need methods to get super and sub classes from a given class. This is easily done by querying the graph for triples containing the rdfs.subClassOf predicate:

# methods for getting ancestors and descendants of classes; by default, we do not include blank nodes
    
    def get_classDirectSupers(self, aClass, excludeBnodes = True):
        returnlist = []
        for s, v, o in self.rdfGraph.triples((aClass, RDFS.subClassOf , None)):
            if excludeBnodes:
                if not self.__isBlankNode(o):
                    returnlist.append(o)
            else:
                returnlist.append(o)
    
        return sort_uri_list_by_name(remove_duplicates(returnlist)) 
    
    
    def get_classDirectSubs(self, aClass, excludeBnodes = True):
        returnlist = []
        for s, v, o in self.rdfGraph.triples((None, RDFS.subClassOf , aClass)):
            if excludeBnodes:
                if not self.__isBlankNode(s):
                    returnlist.append(s)
    
            else:
                returnlist.append(s)
    
        return sort_uri_list_by_name(remove_duplicates(returnlist))
    
    
def get_classAllSubs(self, aClass, returnlist=None, excludeBnodes=True):
    # avoid a mutable default argument: a shared list would persist across calls
    if returnlist is None:
        returnlist = []
    for sub in self.get_classDirectSubs(aClass, excludeBnodes):
        returnlist.append(sub)
        self.get_classAllSubs(sub, returnlist, excludeBnodes)
    return sort_uri_list_by_name(remove_duplicates(returnlist))


def get_classAllSupers(self, aClass, returnlist=None, excludeBnodes=True):
    if returnlist is None:
        returnlist = []
    for ssuper in self.get_classDirectSupers(aClass, excludeBnodes):
        returnlist.append(ssuper)
        self.get_classAllSupers(ssuper, returnlist, excludeBnodes)
    return sort_uri_list_by_name(remove_duplicates(returnlist))
    
    
    
    def get_classSiblings(self, aClass, excludeBnodes = True):
        returnlist = []
        for father in self.get_classDirectSupers(aClass, excludeBnodes):
            for child in self.get_classDirectSubs(father, excludeBnodes):
                if child != aClass:
                    returnlist.append(child)
    
        return sort_uri_list_by_name(remove_duplicates(returnlist))
    

    Getting the top layer

We're now all set for retrieving the classes at the top of the taxonomic hierarchy of our ontology, that is, its 'top layer'. This can be done by reusing the get_classDirectSupers method previously defined, so as to search for all classes that have no superclasses:

    def __getTopclasses(self, classPredicate = ''):
    
        """ Finds the topclass in an ontology (works also when we have more than on superclass)
        """
    
        returnlist = []
    
        # gets all the classes
        for eachclass in self.__getAllClasses(classPredicate):
            x = self.get_classDirectSupers(eachclass)
            if not x:
                returnlist.append(eachclass)
    
        return sort_uri_list_by_name(returnlist)

    Reconstructing the ontology tree

    Now that we know which are the top classes in our taxonomy, we can parse the tree recursively using the get_classDirectSubs method defined above, and reconstruct the whole taxonomical structure of the ontology.

    def __getTree(self, father=None, out=None):
    
        """ Reconstructs the taxonomical tree of an ontology, from the 'topClasses' (= classes with no supers, see below)
            Returns a dictionary in which each class is a key, and its direct subs are the values.
            The top classes have key = 0
    
            Eg.
        {0: [class1, class2], class1: [class1-2, class1-3], class2: [class2-1, class2-2]}
        """
    
        if not father:
            out = {}
            topclasses = self.toplayer
            out[0] = topclasses
    
            for top in topclasses:
                children = self.get_classDirectSubs(top)
                out[top] = children
                for potentialfather in children:
                    self.__getTree(potentialfather, out)
    
            return out
    
        else:
            children = self.get_classDirectSubs(father)
            out[father] = children
            for ch in children:
                self.__getTree(ch, out)

That's it, really. Given this abstract tree representation, it can be printed out in different ways depending on the context (HTML, command line), but the core will remain intact.
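For instance, here is a minimal sketch of a command-line renderer for that dictionary (a standalone function I'm writing here for illustration; the printTree method used in the session below plays the same role inside the class):

def print_tree(tree, node=0, indent=0):
    # 'tree' maps each class to its direct subclasses;
    # the top classes are stored under the key 0 (see __getTree above)
    for child in tree.get(node, []):
        print "----" * indent + str(child)
        print_tree(tree, child, indent + 1)

# e.g.: print_tree(onto.tree)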

    Wrapping up

The source code on GitHub also contains other utilities I added, e.g. for handling class comments and namespaces, for pretty-printing class names, and for outputting the ontology tree as an image using the Graphviz library (which needs to be installed separately).

Here's an example of how OntoInspector can be used in the Python interactive shell to inspect the Friend Of A Friend (FOAF) lightweight ontology:

    In [1]: from onto_inspector import *
    
    In [2]: onto = OntoInspector("http://xmlns.com/foaf/spec/20100809.rdf")         
    
    In [3]: onto.toplayer
    
    Out[3]: 
    [rdflib.URIRef('http://xmlns.com/foaf/0.1/Agent'),
     rdflib.URIRef('http://www.w3.org/2000/01/rdf-schema#Class'),
     rdflib.URIRef('http://www.w3.org/2004/02/skos/core#Concept'),
     rdflib.URIRef('http://xmlns.com/foaf/0.1/Document'),
     rdflib.URIRef('http://xmlns.com/foaf/0.1/LabelProperty'),
     rdflib.URIRef('http://www.w3.org/2000/01/rdf-schema#Literal'),
     rdflib.URIRef('http://www.w3.org/2000/10/swap/pim/contact#Person'),
     rdflib.URIRef('http://xmlns.com/foaf/0.1/Project'),
     rdflib.URIRef('http://www.w3.org/2003/01/geo/wgs84_pos#SpatialThing'),
     rdflib.URIRef('http://www.w3.org/2002/07/owl#Thing')]
    
    In [4]: onto.printTree()
    foaf:Agent
    ----foaf:Group
    ----foaf:Organization
    ----foaf:Person
    rdfs:Class
    http://www.w3.org/2004/02/skos/core#Concept
    foaf:Document
    ----foaf:Image
    ----foaf:PersonalProfileDocument
    foaf:LabelProperty
    rdfs:Literal
    http://www.w3.org/2000/10/swap/pim/contact#Person
    ----foaf:Person
    foaf:Project
    http://www.w3.org/2003/01/geo/wgs84_pos#SpatialThing
    ----foaf:Person
    owl:Thing
    ----foaf:OnlineAccount
    --------foaf:OnlineChatAccount
    --------foaf:OnlineEcommerceAccount
    --------foaf:OnlineGamingAccount
    
    In [5]: document = onto.find_class_byname("document")
    
    In [6]: document
    Out[6]: 
    [rdflib.URIRef('http://xmlns.com/foaf/0.1/Document'),
     rdflib.URIRef('http://xmlns.com/foaf/0.1/PersonalProfileDocument')]
    
    In [7]: document = document[0]
    
    In [8]: document
    Out[8]: rdflib.URIRef('http://xmlns.com/foaf/0.1/Document')
    
    In [9]: onto.get_classAllSubs(document)
    Out[9]: 
    [rdflib.URIRef('http://xmlns.com/foaf/0.1/Image'),
     rdflib.URIRef('http://xmlns.com/foaf/0.1/PersonalProfileDocument')]
    
    In [10]: onto.get_classAllSupers(document)
    Out[10]: []
    
    In [11]: onto.get_classComment(document)
    Out[11]: rdflib.Literal('A document.', language=None, datatype=None)

    Any comments? As I said I’m still learning/improving this… so any feedback is welcome!

     

Python links (and more) 7/2/11 (Thu, 03 Feb 2011) http://www.michelepasin.org/blog/2011/02/03/python-links-and-more-7211/

This post contains a collection of various interesting things I ran into in the last couple of weeks. They're organized into three categories: pythonic links, events and conferences, and new online tools. Hope you'll find something of interest!

    Pythonic stuff:

  • Epydoc
    Epydoc is a handy tool for generating API documentation for Python modules, based on their docstrings. For an example of epydoc’s output, see the API documentation for epydoc itself (html, pdf).
  • PyEnchant
    PyEnchant is a spellchecking library for Python, based on the excellent Enchant library.
  • Dexml
    The dexml module takes the mapping between XML tags and Python objects and lets you capture that as cleanly as possible. Loosely inspired by Django’s ORM, you write simple class definitions to define the expected structure of your XML document.
  • SpecGen
    SpecGen v5, ontology specification generator tool. It’s written in Python using Redland RDF library and licensed under the MIT license.
  • PyCloud
    Leverage the power of the cloud with only 3 lines of python code. Run long processes on the cloud directly from your shell!
  • commandlinefu.com
    This is not really pythonic – but nonetheless useful to pythonists: a community-based repository of useful unix shell scripts!
Events and Conferences:

  • Digital Resources in the Humanities and Arts Conference 2011
    University of Nottingham Ningbo, China. The DRHA 2011 conference theme this year is “Connected Communities: global or local2local?”
  • Narrative and Hypertext Workshop at the ACM Hypertext 2011 conference in Eindhoven.
  • Culture Hack Day, London, January 2011
    This event aimed at bringing cultural organisations together with software developers and creative technologists to make interesting new things.
  • History Hack Day, London, January 2011
    A bunch of hackers with a passion for history getting together and doing experimental stuff
  • Conference.archimuse.com
    The ‘online space for cultural informatics‘: lots of useful info here, about publications, jobs, people etc.
  • Agora project: Scholarly Open Access Research in European Philosophy
    Project looking at building an infrastructure for the semantic interlinking of European philosophy datasets
Online tools:

  • FactForge
    A web application aiming at showcasing a ‘practical approach for reasoning with the web of linked data’.
  • Semantic Overflow
    A clone of Stack Overflow (collaboratively edited question and answer site for programmers) for questions ‘about semantic web techniques and technologies’.
  • Google Refine
    A tool for “working with messy data, cleaning it up, transforming it from one format into another, extending it with web services, and linking it to databases”.
  • Google Scribe
    A text editor with embedded autocomplete suggestions as you type
  • Books Ngram Viewer
    Tool that displays statistical information regarding the use of user-selected sentences in a corpus of books (e.g., “British English”, “English Fiction”, “French”) over the selected years
KR workshop #2: introducing CIDOC-CRM and FRBR-OO (Mon, 30 Aug 2010) http://www.michelepasin.org/blog/2010/08/30/kr-workshop-2-introducing-cidoc-crm-and-frbr-oo/

This is the second appointment with the knowledge representation seminar we're having at CCH (King's College London). If you are in the area and are interested in taking part, please drop me an email. We're looking at these topics from the specific perspective of the digital humanities, but even if your take on things is different, we'd love to hear from you!

Last time we discussed ontologies and other KR technologies quite generally, so this time we decided to start looking more closely at the details of a widely known ontology for the cultural heritage domain: the CIDOC-CRM conceptual model (an ISO standard since 2006).

Doerr, M. (2003). The CIDOC conceptual reference module: an ontological approach to semantic interoperability of metadata. AI Magazine, 24(3), 75-92.

    The CIDOC-CRM is the de facto standard for data integration in the museum community. Its authors claim that it “is intended to be a common language for domain experts and implementers to formulate requirements for information systems and to serve as a guide for good practice of conceptual modelling. In this way, it can provide the “semantic glue” needed to mediate between different sources of cultural heritage information, such as that published by museums, libraries and archives”.

    CIDOC contains a wealth of inspiring ideas and useful approaches; in our first meeting about it we highlighted only a few of them, including:

  • the need for interoperability at the semantic level
  • the difference between data capturing and interpreting data
  • the read-only integration approach
  • the notion of a property-centric ontology
  • the top-level classes in CIDOC
  • the nature of cultural historical knowledge
  • general principles and methodology in CIDOC-CRM

Here are some slides including key passages from the CIDOC paper mentioned above:

    Information objects

One thing that CIDOC doesn't cover much is the domain of information objects, a fancy term ontologists use to refer to anything that can carry information, such as a book, a stone inscription, or a piece of music. Modelling this type of entity using a clear (and possibly formal) language may seem straightforward at first: a book is a simple physical object, isn't it?

However, this is not always the case: things easily get muddled as soon as you start thinking about the fact that a book can have multiple copies, that it can be translated into many languages, or that it can be included in a different edition. In all these cases, what is the 'book' entity we're talking about? Not the physical object, apparently.

How to model information objects is a topic that librarians (among others) have discussed quite extensively over the years. As a result of these discussions, librarians developed a conceptual model called FRBR, which has become the standard to follow when dealing with this type of problem.
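As a rough illustration, here's how FRBR's 'Group 1' chain (Work, Expression, Manifestation, Item) can be sketched as RDF triples with rdflib; the FRBR Core namespace used below is one common RDF rendering of the model, and the example URIs are made up:

import rdflib
from rdflib import Namespace, RDF

FRBR = Namespace("http://purl.org/vocab/frbr/core#")
EX = Namespace("http://example.org/")

g = rdflib.Graph()

# the abstract work, an expression of it (a translation), an embodiment
# of that expression (a printed edition), and a single exemplar (my copy)
g.add((EX.tractatus, RDF.type, FRBR.Work))
g.add((EX.tractatus_en, RDF.type, FRBR.Expression))
g.add((EX.tractatus_en_1961, RDF.type, FRBR.Manifestation))
g.add((EX.my_copy, RDF.type, FRBR.Item))

g.add((EX.tractatus, FRBR.realization, EX.tractatus_en))
g.add((EX.tractatus_en, FRBR.embodiment, EX.tractatus_en_1961))
g.add((EX.tractatus_en_1961, FRBR.exemplar, EX.my_copy))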

[Diagram: FRBR Group 1 entities and basic relations]

FRBR (represented schematically above) summarizes quite well various important features of information objects, especially when considered from the perspective of many library scientists. However, it is also true that FRBR doesn't present its results using the rigorous and unambiguous language of formal ontologies. As a result, people end up interpreting the meaning of its concepts in slightly different ways. For example, if you are a librarian with little knowledge of computer science, you might end up using FRBR in a totally different way from a computer scientist who is designing a software system for librarians.

    To address this limitation, and also in order to open up the CIDOC-CRM model to the librarian community, the CIDOC committee has started ‘ontologizing’ FRBR. The second part of our seminar focused on this (ongoing) enterprise, which is outlined in this article:

    Bekiari, C., Doerr, M., & Boeuf, P. L. (2009). FRBR: Object-Oriented Definition and Mapping to FRBR-ER (version 1.0). International Working Group on FRBR and CIDOC CRM Harmonisation.

Here are the slides I used in the seminar, which contain some of the most salient visual representations of the ontology:

     

    That’s all for now – in the future we’ll be looking at specific situations where using CIDOC and FRBR presents challenges to the digital humanist: stay tuned!

     
