linkeddata – Parerga und Paralipomena http://www.michelepasin.org/blog
At the core of all well-founded belief lies belief that is unfounded - Wittgenstein

SN SciGraph Latest Release: Patents, Clinical Trials and many new features http://www.michelepasin.org/blog/2019/03/22/sn-scigraph-latest-release-patents-clinical-trials-and-many-new-features/ Fri, 22 Mar 2019

We are pleased to announce the third release of SN SciGraph Linked Open Data. SN SciGraph is Springer Nature’s Linked Data platform that collates information from across the research landscape, i.e. the things, documents, people, places and relations of importance to the science and scholarly domain.

This release includes a complete refactoring of the SN SciGraph data model. Following up on user feedback, we have simplified it using Schema.org and JSON-LD, making the data easier to understand and consume, also for non-linked-data specialists.

This release includes two brand new datasets – Patents and Clinical Trials linked to Springer Nature publications – which have been made available by our partner Digital Science, and in particular the Dimensions team.

Highlights:

  • New Datasets. Data about clinical trials and patents connected to Springer Nature publications have been added. This data is sourced from Dimensions.ai.
  • New Ontology. Schema.org is now the main model used to represent SN SciGraph data.
  • References data. Publications data now include references as well (= outgoing citations).
  • Simpler Identifiers. URIs for SciGraph objects have been dramatically simplified, reusing common identifiers whenever possible. In particular, all articles and chapters use the URI format prefix (‘pub.’) + DOI (e.g. pub.10.1007/s11199-007-9209-1; see the retrieval sketch after this list).
  • JSON-LD. JSON-LD is now the primary serialization format used by SN SciGraph.
  • Downloads. Data dumps are now managed externally on FigShare and are referenceable via DOIs.
  • Continuous updates. New publications data is released on a daily basis. All the other datasets are refreshed on a monthly basis.
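As a quick illustration of the new identifiers and the JSON-LD serialization, here is a minimal sketch (in Python) of how one might fetch a single publication record. It assumes the SciGraph resolver serves JSON-LD via content negotiation at scigraph.springernature.com – the exact base URL and headers should be checked against the current documentation.

import requests

# DOI-based identifier taken from the example above. The base URL and the
# content-negotiation behaviour are assumptions - verify them in the SciGraph docs.
uri = "https://scigraph.springernature.com/pub.10.1007/s11199-007-9209-1"

resp = requests.get(uri, headers={"Accept": "application/ld+json"})
resp.raise_for_status()

record = resp.json()           # JSON-LD is plain JSON, so .json() works
print(record.get("@type"))     # schema.org type, e.g. an Article or Chapter
print(record.get("name"))      # title, if exposed via the schema.org 'name' property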

 

Note: crossposted on https://researchdata.springernature.com

 

Exploring scholarly publications using DBPedia concepts: an experiment http://www.michelepasin.org/blog/2018/11/23/exploring-scholarly-publications-via-dbpedia/ Fri, 23 Nov 2018

This post is about a recent prototype I developed, which allows you to explore a sample collection of Springer Nature publications using subject tags automatically extracted from DBPedia.

DBpedia is a crowd-sourced community effort to extract structured content from the information created in various Wikimedia projects. This structured information resembles an open knowledge graph (OKG) which is available for everyone on the Web.

Datasets

The dataset I used is the result of a collaboration with Beyza Yaman, a researcher working with the DBpedia team in Leipzig, who used the SciGraph datasets as input to the DBPedia-Spotlight entity-mining tool.

Using DBPedia-Spotlight, we automatically associated DBpedia subject terms with a subset of the abstracts available in the SciGraph dataset (around 90k abstracts from 2017 publications). A sketch of what this annotation step looks like is included below.
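For readers curious about what the annotation step looks like in practice, here is a rough Python sketch that calls the public DBpedia-Spotlight web service on a single piece of text. The endpoint URL and the confidence parameter are assumptions (and the public demo service is rate-limited); for 90k abstracts you would run your own Spotlight instance.

import requests

# A made-up sentence standing in for a SciGraph abstract.
text = ("We apply artificial intelligence techniques to the design of "
        "clinical trials in China.")

resp = requests.get(
    "https://api.dbpedia-spotlight.org/en/annotate",
    params={"text": text, "confidence": 0.5},
    headers={"Accept": "application/json"},
)
resp.raise_for_status()

# Each resource is a DBpedia URI plus the surface form it was spotted on.
for res in resp.json().get("Resources", []):
    print(res["@URI"], "<-", res["@surfaceForm"])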

The prototype allows you to search the Springer Nature publications using these subject terms.

Also, DBpedia subjects include definitions and semantic relationships (which we are currently not using, but one can imagine how they could be raw material for generating more thematic ‘pathways’).

Results: serendipitous discovery of scientific publications

The results are pretty encouraging: despite the fact that the extracted concepts are sometimes only marginally relevant (or not relevant at all), the breadth and depth of the DBpedia classification makes the interactive exploration quite interesting and serendipitous.

You can judge for yourself: the tool is available here: http://hacks2019.michelepasin.org/dbpedialinks

The purpose of this prototype is to evaluate the quality of the tagging and generate ideas for future applications. So any feedback or ideas are very welcome!

We are working with Beyza to write up the results of this investigation as a research paper. The data and software are already freely available on GitHub.

A couple of screenshots:

E.g. see the topic ‘artificial intelligence’:

[Screenshot: search results for the ‘artificial intelligence’ subject]

One can add more subjects to a search in order to ‘zoom in’ on a results set, e.g. by adding ‘China’ to the search:

[Screenshot: results filtered by both ‘artificial intelligence’ and ‘China’]

SN SciGraph: latest website release makes it easier to discover related content http://www.michelepasin.org/blog/2018/08/01/sn-scigraph-latest-website-release-make-it-easier-to-discover-related-content/ Wed, 01 Aug 2018

The latest release of the SN SciGraph Explorer website includes a number of new features that make it easier to navigate the scholarly knowledge graph and discover items of interest.

Graphs are essentially composed of two kinds of objects: nodes and edges. Nodes are like the stations in a train map, while edges are the links that connect the different stations.

Of course one wants to be able to move from station to station in any direction! Similarly in a graph one wants to be able to jump back and forth from node to node using any of the links provided. That’s the beauty of it!

Although the underlying data allowed for this, the SN SciGraph Explorer website wasn’t fully supporting this kind of navigation. So we’ve now started to add a number of ‘related objects’ sections that reveal these pathways more clearly.

For example, now it’s much easier to get to the organizations and grants an article relates to:

[Screenshot: related organizations and grants for an article]

Or, for a book edition, to see its chapters and related organizations:

[Screenshot: chapters and related organizations for a book edition]

And much more..  Take a look at the site yourself to find out.

Finally, we improved the linked data visualization included in every page by adding distinctive icons to each object type – making it easier to understand the immediate network of an object at a glance. E.g. see this grant:

[Screenshot: linked data visualization for a grant, with type icons]

SN SciGraph is primarily about opening up new opportunities for open data and metadata enthusiasts who want to do more things with our content, so we hope that these additions will make discovering data items easier and more fun.

Any comments? We’d love to hear from you. Otherwise, thanks for reading and stay tuned for more updates.

PS: this blog was posted on the SN Research Data space too.

 

SN SciGraph is part of the Linked Open Data Cloud 2018 http://www.michelepasin.org/blog/2018/05/23/sn-scigraph-is-part-of-the-linked-open-data-cloud-2018/ Wed, 23 May 2018

The latest Linked Open Data (LOD) Cloud has recently been made available by the Insight Centre for Data Analytics. The LOD cloud is a visual representation of the datasets (and the links among them) that have been published according to the Linked Data principles – a web-friendly methodology for data sharing that encourages open schemas and data reuse.

[Image: the 2018 LOD cloud diagram]

 

We’re very glad to say that SN SciGraph is now part of it! (PS: this is its JSON record.) If you look at the picture above, the two red lines departing from our ‘bubble’ indicate that the two main datasets we link to are CrossRef and DBpedia.

Note that this visualisation unfortunately doesn’t do justice to the fact that SN SciGraph is one of the largest datasets out there (1 billion+ triples and counting). In previous versions the size of a bubble would reflect how large a dataset is – but hopefully that’ll change in the future!

The cloud currently contains 1,184 datasets with 15,993 links (as of April 2018) and it’s divided into 9 sub-clouds based on their domain.

[Image: the ‘Publications’ sub-cloud]

SciGraph is part of the ‘Publications’ sub-cloud (depicted above) alongside other important linked data publishers such as the British Library, the German National Library, the Open Library, OCLC and many others.

It’s impressive to see the growing number of datasets being released using this approach! We’ve been told that later this year more discovery tools will be made available that allow searching for data publishers, making it easier for people and projects to collaborate.


Exploring SciGraph data using JSON-LD, Elastic Search and Kibana http://www.michelepasin.org/blog/2017/04/06/exploring-scigraph-data-using-elastic-search-and-kibana/ Thu, 06 Apr 2017

Hello there data lovers! In this post you can find some information on how to download and make some sense of the scholarly dataset recently made available by the Springer Nature SciGraph project, by using the freely available Elasticsearch suite of software.

A few weeks ago the SciGraph dataset was released (full disclosure: I’m part of the team who did that!). This is a high quality dataset containing metadata and abstracts about scientific articles published by Springer Nature, research grants related to them plus other classifications of this content.


This release of the dataset includes the last 5 years of content – that’s already an impressive 32 gigs of data you can get your hands on. So in this post I’m going to show how to do that, in particular by transforming the data from the RDF graph format it comes with into a JSON format which is more suited for application development and analytics.

We will be using two free-to-download products, GraphDB and Elasticsearch, so you’ll have to install them if you haven’t got them already. But no worries, that’s pretty straightforward, as you’ll see below.

1. Hello SciGraph Linked Data

First things first, we want to get hold of the SciGraph RDF datasets of course. That’s pretty easy, just head over to the SciGraph downloads page and get the following datasets:

  • Ontologies: the main schema behind SciGraph.
  • Articles – 2016: all the core articles metadata for one year.
  • Grants: grants metadata related to those articles.
  • Journals: full list of Springer Nature journal catalogue.
  • Subjects: classification of research areas developed by Springer Nature.

That’s pretty much everything; note that we’re getting only one year’s worth of articles, as that’s enough for the purpose of this exercise (~300k articles from 2016).

Next up, we want to get a couple of other datasets SciGraph depends on:

That’s it! Time for a cup of coffee.

2. Python to the help

We will be doing a bit of data manipulation  in the next sections and Python is a great language for that sort of thing. Here’s what we need to get going:

  1. Python. Make sure you have Python installed and also Pip, the Python package manager (any Python version above 2.7 should be ok).
  2. GitHub project. I’ve created a few scripts for this tutorial, so head over to the hello-scigraph project on GitHub and download it to your computer. Note: the project contains all the Python scripts needed to complete this tutorial, but of course you should feel free to modify them or write from scratch if you fancy it!
  3. Libraries. Install all the dependencies for the hello-scigraph project to run. You can do that by cd-ing into the project folder and running pip install -r requirements.txt (ideally within a virtual environment, but that’s up to you).

3. Loading the data into GraphDB

So, you should have by now 8 different files containing data (after step 1 above). Make sure they’re all in the same folder and that all of them have been unzipped (if needed), then head over to the GraphDB website and download the free version of the triplestore (you may have to sign up first).

The online documentation for GraphDB is pretty good, so it should be easy to get it up and running. In essence, you have to do the following steps:

  1. Launch the application: for me, on a mac, I just had to double click the GraphDB icon – nice!
  2. Create a new repository: this is the equivalent of a database within the triplestore. Call this repo “scigraph-2016” so that we’re all synced for the following steps.

Next thing, we want a script to load our RDF files into this empty repository. So cd into the directory containing the GitHub project (from step 2) and run the following command:

python -m hello-scigraph.loadGraphDB ~/scigraph-downloads/

The “loadGraphDB” script goes through all RDF files in the “scigraph-downloads” directory and loads them into the scigraph-2016 repository (note: you must replace “scigraph-downloads” with the actual path to the folder you downloaded content in step 1 above).

So, to recap: this script is now loading more than 35 million triples into your local graph database. Don’t be surprised if it takes some time (in particular the ‘articles-2016’ dataset, by far the biggest), so it’s a good moment to take a break or do something else.

Once the process is finished, you should be able to explore your data via the GraphDB workbench. It’ll look something like this:

[Screenshot: the GraphDB workbench class hierarchy view]
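Before moving on, it can be useful to sanity-check the load from Python too. The sketch below assumes GraphDB is running on its default port (7200) and that the repository is called scigraph-2016 as above; the endpoint path follows the standard RDF4J convention, but double-check it in your own GraphDB workbench.

import requests

# RDF4J-style repository endpoint exposed by GraphDB (default port assumed).
endpoint = "http://localhost:7200/repositories/scigraph-2016"
query = "SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }"

resp = requests.get(
    endpoint,
    params={"query": query},
    headers={"Accept": "application/sparql-results+json"},
)
resp.raise_for_status()

count = resp.json()["results"]["bindings"][0]["triples"]["value"]
print("Triples loaded:", count)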

4. Creating an Elasticsearch index

We’re almost there. Let’s head over to the Elasticsearch website and download it. Elasticsearch is a powerful, distributed, JSON-based search and analytics engine so we’ll be using it to build an analytics dashboard for the SciGraph data.

Make sure Elastic is running (run bin/elasticsearch, or bin\elasticsearch.bat on Windows), then cd into the hello-scigraph Python project (from step 2) in order to run the following script:

python -m hello-scigraph.loadElastic

If you take a look at the source code, you’ll see that the script does the following:

  1. Articles loading: extracts articles references from GraphDB in batches of 200.
  2. Articles metadata extraction: for each article, we pull out all relevant metadata (e.g. title, DOI, authors) plus related information (e.g. author GRID organizations, geo locations, funding info etc..).
  3. Articles metadata simplification: some intermediate nodes coming from the original RDF graph are dropped and replaced with a flatter structure which uses a temporary dummy schema (prefix es: <http://elastic-index.scigraph.com/>). It doesn’t matter what we call that schema; what matters is that we want to simplify the data we put into the Elasticsearch index. That’s because while the graph layer is supposed to facilitate data integration, and hence benefits from a rich semantic representation of information, the search layer is geared towards performance and retrieval, so a leaner information structure can dramatically speed things up there.
  4. JSON-LD transformation: the simplified RDF data structure is serialized as JSON-LD – one of the many serializations available for RDF. JSON-LD is of course valid JSON, meaning that we can put it into Elastic right away. This is a bit of a shortcut actually: for more fine-grained control over what the JSON looks like, it’s probably better to transform the data into JSON using some ad-hoc mechanism. But for the purpose of this tutorial it’s more than enough.
  5. Elastic index creation. Finally, we can load the data into an Elastic index called – guess what – “hello-scigraph” (a stripped-down sketch of this indexing step follows below).
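To make the indexing step a bit more concrete, here is a stripped-down sketch of pushing a couple of (already flattened) documents into Elasticsearch via its bulk API, using plain HTTP calls. The field names are placeholders and the hello-scigraph script does something considerably richer; the payload format below targets recent Elasticsearch versions (older 5.x clusters also expect a _type field in the action line).

import json
import requests

ES = "http://localhost:9200"
INDEX = "hello-scigraph"

# Placeholder documents: in the real script these come out of GraphDB as simplified JSON-LD.
docs = [
    {"@id": "pub.10.1007/s11199-007-9209-1", "title": "Example article", "year": 2016},
    {"@id": "pub.10.1007/another-example", "title": "Another article", "year": 2016},
]

# The _bulk endpoint expects newline-delimited JSON: an action line, then the document.
lines = []
for doc in docs:
    lines.append(json.dumps({"index": {"_index": INDEX, "_id": doc["@id"]}}))
    lines.append(json.dumps(doc))
payload = "\n".join(lines) + "\n"

resp = requests.post(ES + "/_bulk", data=payload,
                     headers={"Content-Type": "application/x-ndjson"})
resp.raise_for_status()
print("Bulk errors:", resp.json()["errors"])   # False means everything was indexed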

Two more things to point out:

  • Long queries. The Python script enforces a 60-second time-out on the GraphDB queries, so that if things go wrong with some articles’ data the script keeps running.
  • Memory issues. The script stops for 10 seconds after each batch of 200 articles (time.sleep(10)). I had to do this to prevent GraphDB on my laptop from running out of memory. Time to catch some breath!

That’s it! Time for another break now. A pretty long one actually – loading all the data took around 10 hours on my (rather average-spec’ed) laptop, so you may want to do that overnight or get hold of a faster machine/server.

Eventually, once the loading script is finished, you can issue this command from the command line to see how much data you’ve loaded into the Elastic index  “hello-scigraph”. Bravo!

curl -XGET 'localhost:9200/_cat/indices/'

5. Analyzing the data with Kibana

Loading the data in Elastic already opens up a number of possibilities – check out the search APIs for some ideas – however there’s an even quicker way to analyze the data: Kibana. Kibana is another free product in the Elastic Search suite, which provides an extensible user interface for configuring and managing all aspects of the Elastic Stack.

So let’s get started with Kibana: download it and set it up using the online instructions, then point your browser at http://localhost:5601 .

You’ll get to the Kibana dashboard which shows the index we just created. Here you can perform any kind of searches and see the raw data as JSON.

What’s even more interesting is the visualization tab. Results of searches can be rendered as line charts, pie charts etc., and more dimensions can be added via ‘buckets’. See below for a quick example, but really, the possibilities are endless!
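As a taste of what Kibana does under the hood, here is a hedged example of running a terms aggregation directly against the search API. The ‘year’ field is an assumption – use whatever fields your flattened documents actually contain after step 4.

import requests

body = {
    "size": 0,   # we only want the aggregation, not the individual hits
    "aggs": {"by_year": {"terms": {"field": "year", "size": 10}}},
}

resp = requests.post("http://localhost:9200/hello-scigraph/_search", json=body)
resp.raise_for_status()

for bucket in resp.json()["aggregations"]["by_year"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])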

Conclusion

This post should have given you enough to realise that:

  1. The SciGraph dataset contains an impressive amount of high-quality scholarly publication metadata which can be used for things like literature search, research statistics etc.
  2. Even if you’re not familiar with Linked Data and the RDF family of languages, it’s not hard to get going with a triplestore and then transform the data into a more widely used format like JSON.
  3. Finally, Elasticsearch and especially Kibana are fantastic tools for data analysis and exploration! Needless to say, in this post I’ve just scratched the surface of what could be done with them.

Hope this was fun, any questions or comments, you know the drill :-)

Recent projects from CrossRef.org http://www.michelepasin.org/blog/2015/06/14/recent-projects-from-crossref-org/ Sun, 14 Jun 2015

We spent the day with the CrossRef team in Oxford last week, talking about our recent work in the linked data space (see the nature ontologies portal) and their recent initiatives in the scholarly publishing area.

So here’s a couple of interesting follow-ups from the meeting.
PS: if you want to know more about CrossRef, make sure you take a look at their website and in particular the labs section: http://labs.crossref.org/.

Opening up article level metrics

http://det.labs.crossref.org/

CrossRef is using the open source Lagotto application (developed by PLOS, https://github.com/articlemetrics/lagotto) to retrieve article metrics data from a variety of sources (e.g. Wikipedia, Twitter etc. – see the full list here).

The model used for storing this data follows an agreed ontology containing for example a classification of ‘mentions’ actions (viewed/saved/discussed/recommended/cited – see this paper for more details).

In a nutshell, CrossRef is planning to collect and make available the (raw) metrics data for all the DOIs they track, in the form of ‘DOI events’.

An interesting demo application shows the stream of DOI citations coming from Wikipedia (one of the top referrers of DOIs, unsurprisingly). More discussion in this blog post.

[Screenshot: the live stream of DOI citations from Wikipedia]

Linking dataset DOIs and publications DOIs

http://www.crosscite.org/

CrossRef has been working with Datacite towards the goal of harmonising their databases. Datacite is the second major register of DOIs (after CrossRef) and it has been focusing on assigning persistent identifiers to datasets.

This work is now gaining more momentum as Datacite is enlarging its team. So in theory it won’t be long before we see a service that allows interlinking publications and datasets, which is great news.

Linking publications and funding sources

http://www.crossref.org/fundref/

FundRef provides a standard way to report funding sources for published scholarly research. This is increasingly becoming a fundamental requirement for all publicly funded research, so several publishers have agreed to help extract funding information and send it to CrossRef.

A recent platform built on top of FundRef is Chorus http://www.chorusaccess.org/, which enables users to discover articles reporting on funded research. Furthermore, it provides dashboards which can be used by funders, institutions, researchers, publishers, and the public for monitoring and tracking public-access compliance for articles reporting on funded research.

For example see http://dashboard.chorusaccess.org/ahrq#/breakdown

[Screenshot: a Chorus dashboard breakdown]

Miscellaneous news & links

– JSON-LD (a JSON-based serialization of RDF) is being considered as a candidate data format for the next generation of the CrossRef REST API (a minimal example of calling the current API follows at the end of this list).

– The prototype http://www.yamz.net/ came up in discussion; a quite interesting stack-overflow meets ontology-engineering kind of tool. Def worth a look, I’d say.

– Wikidata (a queryable, structured-data version of Wikipedia) seems to be gaining a lot of momentum after taking over Freebase from Google. Will it eventually replace its main rival, DBpedia?
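For anyone who hasn’t played with the current CrossRef REST API yet, it’s already easy to consume from Python. A minimal sketch follows; the DOI is just an arbitrary Springer article used as an example, and the response follows the standard api.crossref.org ‘message’ envelope.

import requests

doi = "10.1007/s11199-007-9209-1"   # arbitrary example DOI

resp = requests.get("https://api.crossref.org/works/" + doi)
resp.raise_for_status()

work = resp.json()["message"]
print(work.get("title"))              # list of title strings
print(work.get("container-title"))    # journal / book title
print(len(work.get("reference", [])), "references listed")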

Nature.com ontologies portal available online http://www.michelepasin.org/blog/2015/04/30/nature-com-ontologies-portal-available-online/ Thu, 30 Apr 2015

The Nature ontologies portal is a new section of the nature.com site that describes our involvement with semantic technologies and also makes available to the wider public several models and datasets as RDF linked data.

We launched the portal nearly a month ago, with the purpose of sharing our experiences with semantic technologies and, more generally, contributing our data models and datasets to the wider linked data community.

[Screenshot: the nature.com ontologies portal]

This April 2015 release doubles the number and size of our published data models. These now cover more completely the various things that our world contains, from publication things – articles, figures, etc. – to classification things – article-types, subjects, etc. – and additional things used to manage our content publishing operation – assets, events, etc. Also included is a release page for the latest data release and a separate page for archival data releases.

[Diagram: the NPG data models hierarchy]

Background

Is this the first time you’ve heard about semantic web and ontologies?
 
Then you should know that, even though internally at Macmillan Science and Education XML remains the main technology used to represent and store the things we publish, the metadata about these documents (e.g. publication details, subject categories etc.) is normally also encoded using a more abstract, graph-oriented information model.
 
This is called RDF and has two key characteristics:
– it encodes all information in the form of triples, e.g. <subject> <predicate> <object>
– it was built with the web in mind: broadly speaking, each of the items in a triple can be accessed via the internet, i.e. it is a URI (a generalised notion of a URL). A small hands-on example follows below.
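To make this less abstract, here is a tiny sketch using the rdflib Python library; the URIs are made up purely for the example.

from rdflib import Graph, Literal, Namespace

# Made-up namespace and article identifier, purely for illustration.
EX = Namespace("http://example.org/articles/")
SCHEMA = Namespace("http://schema.org/")

g = Graph()
g.add((EX["article-123"], SCHEMA["name"], Literal("An example article title")))
g.add((EX["article-123"], SCHEMA["datePublished"], Literal("2015-04-30")))

# Serialize the two triples as Turtle (older rdflib versions return bytes here).
print(g.serialize(format="turtle"))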
 
So why use RDF?

The RDF model makes it easier to maintain a shared yet scalable schema (aka an ‘ontology’) of the data types in use within our organization. A bit like a common language which is spoken by more and more data stores, and thus allows joining things up more easily whenever needed.
 
At the same time – since the RDF model is native to the web – it facilitates the ‘semantic’ integration of our data with the increasing number of other organisations that publish their data using compatible models.
 
For example, the BBC, Elsevier and, more recently, Springer are among the many organisations that contribute to the Linked Data Cloud.

What’s next

We’ll continue improving these ontologies and releasing new ones as they are created. But probably most interestingly for many people, we’re working on a new release of the whole NPG articles dataset (~1M articles).

So stay tuned for more!

 

A few useful Linked Data resources http://www.michelepasin.org/blog/2011/03/17/a-few-useful-linked-data-resources/ Thu, 17 Mar 2011

Done a bit of semantic web work in the last couple of weeks, which gave me a chance to explore better the current web-scenario around this topic. I’m working on some example applications myself, but in the meanwhile I thought I’d share here a couple of quite useful links I ran into.

Development Tools:

  • Quick and Dirty RDF browser. It does just what it says: you pass it an RDF file and it helps you make sense of it. For example, check out the RDF graph describing the city of Southampton on DBpedia: http://dbpedia.org/resource/Southampton. Minimal, fast and useful!
  • Namespace lookup service for RDF developers. The intention of this service is to simplify a common task in the work of RDF developers: remembering and looking up URI prefixes. You can look up prefixes from the search box on the homepage, or directly by typing URLs into your browser bar, such as http://prefix.cc/foaf or http://prefix.cc/foaf,dc,owl.ttl.
  • Knoodl. Knoodl is an online tool for creating, managing, and analyzing RDF/OWL descriptions. It has several features that support collaboration in all stages of these activities (e.g. it lets you quite easily create discussion forums around ontological modeling decisions). It’s hosted in the Amazon EC2 cloud and can be used for free.
  • RDF Google Chrome extensions. Just a list of extensions for Google Chrome that make working with RDF much simpler, for example by detecting RDF annotations embedded in HTML.
  • Get the data. Ask and answer questions about getting, using and sharing data! A StackOverflow clone that crowd-sources the task of finding out whether the data you need are available, and where.

    Articles / Tutorials

  • Linked Data Guide for Newbies. It’s primarily aimed at “people who’re tasked with creating RDF and don’t have time to faff around.” It’s a brief and practical introduction to some of the concepts and technical issues behind Linked Data – simple and effective, although it obviously hides all the most difficult aspects.
  • What you need to know about RDF+XML. Again, another gentle and practical intro.
  • Linked Data: design issues. One of the original articles by Berners-Lee. It goes a little deeper into the theoretical issues involved with the Linked Data approach.
  • Linked Data: Evolving the Web into a Global Data Space. Large and thorough resource: this book is freely available online and contains all that you need to become a Linked Data expert – whatever that means!
  • Linked Data/RDF/SPARQL Documentation Challenge. A recent initiative aimed at pushing people to document the ‘path to rdf’ with as many languages and environments as possible. The idea is to move away from some kind of academic-circles-only culture and create something “closer to the Django introduction tutorial or the MongoDB quick start guide than an academic white paper“. This blog post is definitely worth checking out imho, especially because of the wealth of responses it has elicited!
  • Introducing SPARQL: Querying the Semantic Web. An in-depth article at XML.com that introduces SPARQL – the query language and data access protocol for the Semantic Web.
  • A beginner’s guide to SPARQLing linked data. A more hands-on description of what SPARQL can do for you.
  • Linked Data: how to get your dataset in the diagram. So you’ve noticed the Linked Data bubbles growing bigger and bigger. Next step is – how to contribute and get in there? This article gives you all the info you need to know.
  • Semantic Overflow (answers.semanticweb.com). If you run out of ideas, this is the place to ask for help!

Survey of Pythonic tools for RDF and Linked Data programming http://www.michelepasin.org/blog/2011/02/24/survey-of-pythonic-tools-for-rdf-and-linked-data-programming/ Thu, 24 Feb 2011

In this post I’m reporting on a recent survey I made in the context of a Linked Data project I’m working on, SAILS. The Resource Description Framework (RDF) is a data model and language which is quickly gaining momentum in the open-data and data-integration worlds. In SAILS we’re developing a prototype for RDF data manipulation and querying, but since the final application (of which the RDF component is part) will be written in Python and Django, in what follows I’ve tried to gather information about all the existing libraries and frameworks for doing RDF programming in Python.

1. Python libraries for working with RDF

    RdfLib http://www.rdflib.net/

    RdfLib (download) is a pretty solid and extensive rdf-programming kit for python. It contains parsers and serializers for RDF/XML, N3, NTriples, Turtle, TriX and RDFa. The library presents a Graph interface which can be backed by any one of a number of store implementations, including, memory, MySQL, Redland, SQLite, Sleepycat, ZODB and SQLObject.

The latest release is RdfLib 3.0, although I have the feeling that many are still using the previous release, 2.4. One big difference between the two is that in 3.0 some libraries have been separated out into another package (called rdfextras); among these there is also the one you need for processing SPARQL queries (the RDF query language), so it’s likely that you’ll want to install that too.
A short overview of the differences between these two recent releases of RdfLib can be found here. The API documentation for RdfLib 2.4 is available here, while the one for RdfLib 3.0 can be found here. Finally, there are also some other (a bit older, but possibly useful) docs on the wiki.
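To give a first flavour of the library, here is a minimal sketch that loads a remote RDF description and loops over its triples. The DBpedia data URL is an assumption (it mirrors the Southampton example used elsewhere on this blog), and minor API details differ between rdflib 2.4 and 3.0, so treat this as indicative.

from rdflib import Graph

g = Graph()
# DBpedia's RDF/XML description of Southampton; rdflib guesses the format
# from the response, but you can pass format="xml" explicitly if needed.
g.parse("http://dbpedia.org/data/Southampton.rdf")

print(len(g), "triples loaded")

# Print a handful of triples to get a feel for the data.
for s, p, o in list(g)[:5]:
    print(s, p, o)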

    Next thing, you might want to check out these tutorials:

  • Getting data from the Semantic Web: a nice example of how to use RdfLib and python in order to get data from DBPedia, the Semantic Web version of Wikipedia.
  • How can I use the Ordnance Survey Linked Data: shows how to install RdfLib and query the linked data offered by Ordnance Survey.
  • A quick and dirty guide to YOUR first time with RDF: another example of querying Uk government data found on data.gov.uk using RdfLib and Berkely/Sleepycat DB.

RdfAlchemy http://www.openvest.com/trac/wiki/RDFAlchemy

The goal of RDFAlchemy (install | apidocs | usergroup) is to allow anyone who uses Python to have an object-style API for accessing an RDF triplestore. In a nutshell, in the same way that SQLAlchemy is an ORM (Object Relational Mapper) for relational database users, RDFAlchemy is an ORM (Object RDF Mapper) for semantic web users.

    RdfAlchemy can also work in conjunction with other datastores, including rdflib, Sesame, and Jena. Support for SPARQL is present, although it seems less stable than the rest of the library.

    Fuxi http://code.google.com/p/fuxi/

    FuXi is a Python-based, bi-directional logical reasoning system for the semantic web. It requires rdflib 2.4.1 or 2.4.2 and it is not compatible with rdflib 3. FuXi aims to be the ‘engine for contemporary expert systems based on the Semantic Web technologies’. The documentation can be found here; it might be useful also to look at the user-manual and the discussion group.

In general, it looks as if FuXi can offer a complete solution for knowledge representation and reasoning over the semantic web; it is quite sophisticated and well documented (partly via several academic articles). The downside is that, for the purpose of hacking together a linked data application, FuXi is probably just too complex and difficult to learn.

  • About Inferencing: a very short introduction to what Fuxi inferencing capabilities can do in the context of an rdf application.

ORDF ordf.org

    ORDF (download | docs) is the Open Knowledge Foundation‘s library of support infrastructure for RDF. It is based on RDFLib and contains an object-description mapper, support for multiple back-end indices, message passing, revision history and provenance, a namespace library and a variety of helper functions and modules to ease integration with the Pylons framework.

    The current version of this library is 0.35. You can have a peek at some of its key functionalities by checking out the ‘Object Description Mapper‘ – an equivalent to what an Object-Relational Mapper would give you in the context of a relational database. The library seems to be pretty solid; for an example of a system built on top of ORDF you can see Bibliographica, an online open catalogue of cultural works.

  • Why use RDF? The Design Considerations section in the ORDF documentation discusses the reasons that led to the development of this library in a clear and practical fashion.

Django-rdf http://code.google.com/p/django-rdf/

    Django-RDF (download | faq | discussiongroup) is an RDF engine implemented in a generic, reusable Django app, providing complete RDF support to Django projects without requiring any modifications to existing framework or app source code. The philosophy is simple: do your web development using Django just like you’re used to, then turn the knob and – with no additional effort – expose your project on the semantic web.

    Django-RDF can expose models from any other app as RDF data. This makes it easy to write new views that return RDF/XML data, and/or query existing models in terms of RDFS or OWL classes and properties using (a variant of) the SPARQL query language. SPARQL in, RDF/XML out – two basic semantic web necessities. Django-RDF also implements an RDF store using its internal models such as Concept, Predicate, Resource, Statement, Literal, Ontology, Namespace, etc. The SPARQL query engine returns query sets that can freely mix data in the RDF store with data from existing Django models.

    The major downside of this library is that it doesn’t seem to be maintained anymore; the last release is from 2008, and there seem to be various conflicts with recent versions of Django. A real shame!

    Djubby http://code.google.com/p/djubby/

    Djubby (download | docs) is a Linked Data frontend for SPARQL endpoints for the Django Web framework, adding a Linked Data interface to any existing SPARQL-capable triple stores.

    Djubby is quite inspired by Richard Cyganiak’s Pubby (written in Java): it provides a Linked Data interface to local or remote SPARQL protocol servers, it provides dereferenceable URIs by rewriting URIs found in the SPARQL-exposed dataset into the djubby server’s namespace, and it provides a simple HTML interface showing the data available about each resource, taking care of handling 303 redirects and content negotiation.

    Redland http://librdf.org/

    Redland (download | docs | discussiongroup) is an RDF library written in C and including several high-level language APIs providing RDF manipulation and storage. Redland makes available also a Python interface (intro | apidocs) that can be used to manipulate RDF triples.

This library seems to be quite complete and is actively maintained; the only potential downside is the installation process. In order to use the Python bindings you need to install the C library too (which in turn depends on other C libraries), so (depending on your programming experience and operating system) just getting up and running might become a challenge.

    SuRF http://packages.python.org/SuRF/

    SuRF (install | docs) is an Object – RDF Mapper based on the RDFLIB python library. It exposes the RDF triple sets as sets of resources and seamlessly integrates them into the Object Oriented paradigm of python in a similar manner as ActiveRDF does for ruby.

    Other smaller (but possibly useful) python libraries for rdf:

  • Sparql Interface to python: a minimalistic solution for querying sparql endpoints using python (download | apidocs). UPDATE: Ivan Herman pointed out that this library has been discontinued and merged with the ‘SPARQL Endpoint interface to Python’ below.
  • SPARQL Endpoint interface to Python: another little utility for talking to a SPARQL endpoint, including having SELECT results mapped to rdflib terms or returned in JSON format (download). A short usage sketch follows this list.
  • PySparql: again, a minimal library that does SELECT and ASK queries on an endpoint which implements the HTTP (GET or POST) bindings of the SPARQL Protocol (code page)
  • Sparta: Sparta is a simple, resource-centric API for RDF graphs, built on top of RDFLib.
  • Oort: another Python toolkit for accessing RDF graphs as plain objects, based on RDFLIB. The project homepage hasn’t been updated for a while, although there is trace of recent activity on its google project page.
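As promised above, here is a small sketch of the ‘SPARQL Endpoint interface to Python’ (the SPARQLWrapper package) in action, querying the public DBpedia endpoint; the endpoint address and the result shape are outside the library’s control, so adjust as needed.

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?label WHERE {
      <http://dbpedia.org/resource/Southampton> rdfs:label ?label .
      FILTER (lang(?label) = "en")
    }
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["label"]["value"])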

    2. RDF Triplestores that are python-friendly

    An important component of a linked-data application is the triplestore (that is, an RDF database): many commercial and non-commercial triplestores are available, but only a few offer out-of-the-box python interfaces. Here’s a list of them:

    Allegro Graph http://www.franz.com/agraph/allegrograph/

    AllegroGraph RDFStore is a high-performance, persistent RDF graph database. AllegroGraph uses disk-based storage, enabling it to scale to billions of triples while maintaining superior performance. Unfortunately, the official version of AllegroGraph is not free, but it is possible to get a free version of it (it limits the DB to 50 million triples, so although useful for testing or development it doesn’t seem a good solution for a production environment).

    The Allegro Graph Python API (download | docs | reference) offers convenient and efficient access to an AllegroGraph server from a Python-based application. This API provides methods for creating, querying and maintaining RDF data, and for managing the stored triples.

  • A hands-on overview of what’s like to work with AllegroGraph and python can be found here: Getting started with AllegroGraph.

OpenLink Virtuoso http://virtuoso.openlinksw.com/

    Virtuoso Universal Server is a middleware and database engine hybrid that combines the functionality of a traditional RDBMS, ORDBMS, virtual database, RDF, XML, free-text, web application server and file server functionality in a single system. Rather than have dedicated servers for each of the aforementioned functionality realms, Virtuoso is a “universal server”; it enables a single multithreaded server process that implements multiple protocols. The open source edition of Virtuoso Universal Server is also known as OpenLink Virtuoso.

    Virtuoso from Python is intended to be a collection of modules for interacting with OpenLink Virtuoso from python. The goal is to provide drivers for `SQLAlchemy` and `RDFLib`. The package is installable from the Python Package Index and source code for development is available in a mercurial repository on BitBucket.

  • A possibly useful example of using Virtuoso from python: SPARQL Guide for Python Developer.

Sesame http://www.openrdf.org/

Sesame is an open-source framework for querying and analyzing RDF data (download | documentation). Sesame supports two query languages: SeRQL and SPARQL. Sesame’s API differs from comparable solutions in that it offers a (stackable) interface through which functionality can be added, and the storage engine is abstracted from the query interface (many other triplestores can in fact be used through the Sesame API).

It looks as if the best way to interact with Sesame is by using Java; however, there is also a Pythonic API called pySesame. This is essentially a Python wrapper for Sesame’s REST HTTP API, so the range of operations supported (log in, log out, request a list of available repositories, evaluate a SeRQL-select, RQL or RDQL query, extract/upload/remove RDF from a repository) is somewhat limited (for example, there does not seem to be any native SPARQL support).

  • A nice introduction to using Sesame with Python (without pySesame though) can be found in this article: Getting Started with RDF and SPARQL Using Sesame and Python.

Talis platform http://www.talis.com/platform/

The Talis Platform (faq | docs) is an environment for building next-generation applications and services based on Semantic Web technologies. It is a hosted system which provides an efficient, robust storage infrastructure. Both arbitrary documents and RDF-based semantic content are supported, with sophisticated query, indexing and search features. Data uploaded to the Talis Platform is organized into stores: a store is a grouping of related data and metadata. For convenience, each store is assigned one or more owners, who are the people with rights to configure the access controls over that data and metadata. Each store provides a uniform REST interface to the data and metadata it manages.

    Stores don’t come free of charge, but through the Talis Connected Commons scheme it is possible have quite large amounts of store space for free. The scheme is intended to support a wide range of different forms of data publishing. For example scientific researchers seeking to share their research data; dissemination of public domain data from a variety of different charitable, public sector or volunteer organizations; open data enthusiasts compiling data sets to be shared with the web community.

    Good news for pythonistas too: pynappl is a simple client library for the Talis Platform. It relies on rdflib 3.0 and draws inspiration from other similar client libraries. Currently it is focussed mainly on managing data loading and manipulation of Talis Platform stores (this blog post says more about it).

  • Before trying out the Talis platform you might find useful this blog post: Publishing Linked Data on the Talis Platform.

4store http://4store.org/

    4store (download | features | docs) is a database storage and query engine that holds RDF data. It has been used by Garlik as their primary RDF platform for three years, and has proved itself to be robust and secure.
4store’s main strengths are its performance, scalability and stability. It does not provide many features over and above RDF storage and SPARQL queries, but if you are looking for a scalable, secure, fast and efficient RDF store, then 4store should be on your shortlist.

4store offers a number of client libraries; among them there are two for Python. First, HTTP4Store is a client for the 4store httpd service, allowing for easy handling of SPARQL results, and adding, appending and deleting graphs. Second, py4s, although this seems to be a much more experimental library (geared towards multi-process queries).
Furthermore, there is also an application for the Django web framework called django-4store that makes it easier to query and load RDF data into 4store when running Django. The application offers some support for constructing SPARQL-based Django views.

  • This blog post shows how to install 4store: Getting Started with RDF and SPARQL Using 4store and RDF.rb .

End of the survey – have I missed anything? Please let me know if I did, and I’ll try to keep adding stuff to this list as I move on with my project work!

     

Roman Port Networks project http://www.michelepasin.org/blog/2009/07/21/roman-port-networks-project/ Tue, 21 Jul 2009

The Roman Port Networks Project is a collaboration between 30 European partners, examining the connections between Roman ports across the Mediterranean. The project has received financial support from the British Academy (BASIS) and the University of Southampton (School of Humanities, Department of Archaeology and School of Electronics and Computing Science).


    From the website (the bold font is mine):

    The project will use an innovative new approach to data management in order to bring together the many separate sources of information that we have about ports in the Roman Mediterranean. The Semantic Web is a way of linking data by storing it as statements rather than in tables. Because the statements are composed of the same URIs that you use in the address bar of an internet browser, they can be accessed by other computers so different datasets can be connected together more easily. It also means that we can see all the information related to a given concept, whether it’s a thing, a property or a class of objects. [some interesting papers about this approach can be found here]

    We hope that by using this methodology we might soon be able to ask questions such as ‘where are all the known finds of Dressel 20 amphorae on the Mediterranean coast?’, or ‘which other towns have used the same types of marble as those employed in Tarragona?’ It is with this kind of knowledge that we can start building theoretical networks of trade and mobility.

     
