Parerga und Paralipomena – http://www.michelepasin.org/blog
“At the core of all well-founded belief lies belief that is unfounded” – Wittgenstein

Exploring SciGraph data using JSON-LD, Elastic Search and Kibana
http://www.michelepasin.org/blog/2017/04/06/exploring-scigraph-data-using-elastic-search-and-kibana/ – Thu, 06 Apr 2017

Hello there data lovers! In this post you can find some information on how to download and make sense of the scholarly dataset recently made available by the Springer Nature SciGraph project, using the freely available Elasticsearch suite of software.

A few weeks ago the SciGraph dataset was released (full disclosure: I’m part of the team who did that!). This is a high-quality dataset containing metadata and abstracts for scientific articles published by Springer Nature, the research grants related to them, plus other classifications of this content.

This release of the dataset includes the last 5 years of content – that’s already an impressive 32 gigs of data you can get your hands on. So in this post I’m going to show how to do that, in particular by transforming the data from its native RDF graph format into a JSON format which is more suited for application development and analytics.

We will be using two free-to-download products, GraphDB and Elasticsearch, so you’ll have to install them if you haven’t got them already. But no worries, that’s pretty straightforward, as you’ll see below.

1. Hello SciGraph Linked Data

First things first, we want to get hold of the SciGraph RDF datasets of course. That’s pretty easy, just head over to the SciGraph downloads page and get the following datasets:

  • Ontologies: the main schema behind SciGraph.
  • Articles – 2016: all the core article metadata for one year.
  • Grants: grants metadata related to those articles.
  • Journals: the full Springer Nature journal catalogue.
  • Subjects: classification of research areas developed by Springer Nature.

That’s pretty much everything; note that we’re getting only one year’s worth of articles, as that’s enough for the purposes of this exercise (~300k articles from 2016).

Next up, we want to get a couple of other datasets SciGraph depends on:

That’s it! Time for a cup of coffee.

2. Python to the help

We will be doing a bit of data manipulation in the next sections, and Python is a great language for that sort of thing. Here’s what we need to get going:

  1. Python. Make sure you have Python installed and also Pip, the Python package manager (any Python version above 2.7 should be ok).
  2. GitHub project. I’ve created a few scripts for this tutorial, so head over to the hello-scigraph project on GitHub and download it to your computer. Note: the project contains all the Python scripts needed to complete this tutorial, but of course you should feel free to modify them or write from scratch if you fancy it!
  3. Libraries. Install all the dependencies for the hello-scigraph project to run. You can do that by cd-ing into the project folder and running pip install -r requirements.txt (ideally within a virtual environment, but that’s up to you).

3. Loading the data into GraphDB

So, by now you should have 8 different files containing data (from step 1 above). Make sure they’re all in the same folder and that all of them have been unzipped (if needed), then head over to the GraphDB website and download the free version of the triplestore (you may have to sign up first).

The online documentation for GraphDB is pretty good, so it should be easy to get it up and running. In essence, you have to do the following steps:

  1. Launch the application: for me, on a mac, I just had to double click the GraphDB icon – nice!
  2. Create a new repository: this is the equivalent of a database within the triplestore. Call this repo “scigraph-2016” so that we’re all synced for the following steps.

Next thing, we want a script to load our RDF files into this empty repository. So cd into the directory containing the GitHub project (from step 2) and run the following command:

python -m hello-scigraph.loadGraphDB ~/scigraph-downloads/

The “loadGraphDB” script goes through all RDF files in the “scigraph-downloads” directory and loads them into the scigraph-2016 repository (note: you must replace “scigraph-downloads” with the actual path to the folder you downloaded the content into in step 1 above).
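
If you’re curious about what such a loading script boils down to, here’s a minimal sketch (note: this is not the actual hello-scigraph code – it’s an illustration which assumes GraphDB is running locally on its default port 7200 and uses its RDF4J-compatible REST API via the requests library):

import os
import sys
import requests

GRAPHDB_URL = "http://localhost:7200"   # assumption: default local GraphDB instance
REPO = "scigraph-2016"

# map file extensions to RDF MIME types
MIME_TYPES = {
    ".nt": "application/n-triples",
    ".nq": "application/n-quads",
    ".ttl": "text/turtle",
    ".rdf": "application/rdf+xml",
}

def load_folder(folder):
    """POST every RDF file in `folder` to the repository's /statements endpoint."""
    endpoint = "{}/repositories/{}/statements".format(GRAPHDB_URL, REPO)
    for name in sorted(os.listdir(folder)):
        ext = os.path.splitext(name)[1].lower()
        if ext not in MIME_TYPES:
            continue  # skip anything that doesn't look like an RDF file
        print("Loading", name, "...")
        with open(os.path.join(folder, name), "rb") as f:
            r = requests.post(endpoint, data=f,
                              headers={"Content-Type": MIME_TYPES[ext]})
        r.raise_for_status()

if __name__ == "__main__":
    load_folder(sys.argv[1])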

So, to recap: this script is now loading more than 35 million triples into your local graph database. Don’t be surprised if it takes some time (in particular the ‘articles-2016’ dataset, by far the biggest), so it’s a good moment to take a break or do something else.

Once the process is finished, you should be able to explore your data via the GraphDB workbench. It’ll look something like this:

[Screenshot: the GraphDB workbench class hierarchy view]

4. Creating an Elasticsearch index

We’re almost there. Let’s head over to the Elasticsearch website and download it. Elasticsearch is a powerful, distributed, JSON-based search and analytics engine so we’ll be using it to build an analytics dashboard for the SciGraph data.

Make sure Elasticsearch is running (run bin/elasticsearch, or bin\elasticsearch.bat on Windows), then cd into the hello-scigraph Python project (from step 2) and run the following script:

python -m hello-scigraph.loadElastic

If you take a look at the source code, you’ll see that the script does the following:

  1. Articles loading: extracts article references from GraphDB in batches of 200.
  2. Articles metadata extraction: for each article, we pull out all relevant metadata (e.g. title, DOI, authors) plus related information (e.g. author GRID organizations, geo locations, funding info, etc.).
  3. Articles metadata simplification: some intermediate nodes coming from the original RDF graph are dropped and replaced with a flatter structure which uses a temporary dummy schema (prefix es: <http://elastic-index.scigraph.com/>). It doesn’t matter what we call that schema; what matters is that we simplify the data we put into the Elasticsearch index. That’s because while the graph layer is meant to facilitate data integration, and hence benefits from a rich semantic representation of information, the search layer is geared towards performance and retrieval, so a leaner information structure can dramatically speed things up there.
  4. JSON-LD transformation: the simplified RDF data structure is serialized as JSON-LD – one of the many serializations available for RDF. JSON-LD is valid JSON, meaning that we can put it into Elastic right away. This is a bit of a shortcut actually: for more fine-grained control over what the JSON looks like, it’s probably better to transform the data into JSON using some ad-hoc mechanism. But for the purpose of this tutorial it’s more than enough.
  5. Elastic index creation. Finally, we can load the data into an Elastic index called – guess what – “hello-scigraph”.
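
To make those steps a bit more concrete, here’s a heavily simplified sketch of the pipeline (again, not the actual hello-scigraph code: the SPARQL query, the class names and the endpoint URLs are only illustrative, and it assumes the SPARQLWrapper and elasticsearch Python libraries are installed):

import time
from SPARQLWrapper import SPARQLWrapper, JSON
from elasticsearch import Elasticsearch, helpers

GRAPHDB_SPARQL = "http://localhost:7200/repositories/scigraph-2016"  # assumption: default local setup
es = Elasticsearch(["http://localhost:9200"])

def get_article_uris(offset, limit=200):
    """Step 1: pull article URIs out of GraphDB in batches."""
    sparql = SPARQLWrapper(GRAPHDB_SPARQL)
    sparql.setTimeout(60)  # the 60-second time-out mentioned below
    sparql.setQuery("""
        PREFIX sg: <http://scigraph.springernature.com/ontologies/core/>
        SELECT ?article WHERE { ?article a sg:Article }
        OFFSET %d LIMIT %d
    """ % (offset, limit))
    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()
    return [b["article"]["value"] for b in results["results"]["bindings"]]

def article_to_doc(uri):
    """Steps 2-4 (placeholder): extract and simplify the metadata for one article.
    In the real script this is done with queries whose results are serialized
    as JSON-LD using the flat es: schema."""
    return {"uri": uri}

def index_batch(docs):
    """Step 5: bulk-load a batch of JSON documents into the 'hello-scigraph' index."""
    actions = [{"_index": "hello-scigraph", "_source": d} for d in docs]
    helpers.bulk(es, actions)

offset = 0
while True:
    uris = get_article_uris(offset)
    if not uris:
        break
    index_batch([article_to_doc(u) for u in uris])
    offset += 200
    time.sleep(10)  # give GraphDB some breathing room (see the notes below)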

Two more things to point out:

  • Long queries. The Python script enforces a 60-second time-out on the GraphDB queries, so if things go wrong with some of the articles data, the script should keep running.
  • Memory issues. The script pauses for 10 seconds after each batch of 200 articles (time.sleep(10)). I had to do this to prevent GraphDB on my laptop from running out of memory. Time to catch some breath!

That’s it! Time for another break now. A pretty long one actually – loading all the data took around 10 hours on my (rather average-spec’d) laptop, so you may want to do that overnight or get hold of a faster machine/server.

Eventually, once the loading script has finished, you can issue this command from the command line to see how much data you’ve loaded into the Elastic index “hello-scigraph”. Bravo!

curl -XGET 'localhost:9200/_cat/indices/'

5. Analyzing the data with Kibana

Loading the data into Elastic already opens up a number of possibilities – check out the search APIs for some ideas – however there’s an even quicker way to analyze the data: Kibana. Kibana is another free product in the Elastic suite, which provides an extensible user interface for configuring and managing all aspects of the Elastic Stack.

So let’s get started with Kibana: download it and set it up using the online instructions, then point your browser at http://localhost:5601 .

You’ll get to the Kibana dashboard which shows the index we just created. Here you can perform any kind of searches and see the raw data as JSON.

What’s even more interesting is the visualization tab. Results of searches can be rendered as line charts, pie charts, etc., and more dimensions can be added via ‘buckets’. See below for some quick examples – but really, the possibilities are endless!
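
Incidentally, the bucket aggregations that Kibana builds for you can also be run directly against the index. Here’s a hypothetical example using a recent version of the elasticsearch Python client (the field name type is just a placeholder – use whichever fields your documents actually contain):

from elasticsearch import Elasticsearch

es = Elasticsearch(["http://localhost:9200"])

# count documents per value of a (hypothetical) 'type' field,
# i.e. what a Kibana 'terms' bucket computes
resp = es.search(index="hello-scigraph", size=0, aggs={
    "by_type": {"terms": {"field": "type.keyword", "size": 10}}
})
for bucket in resp["aggregations"]["by_type"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])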

Conclusion

This post should have given you enough to realise that:

  1. The SciGraph dataset contains an impressive amount of high-quality scholarly publication metadata which can be used for things like literature search, research statistics, etc.
  2. Even if you’re not familiar with Linked Data and the RDF family of languages, it’s not hard to get going with a triplestore and then transform the data into a more widely used format like JSON.
  3. Finally, Elasticsearch and especially Kibana are fantastic tools for data analysis and exploration! Needless to say, in this post I’ve just scratched the surface of what can be done with them.

Hope this was fun, any questions or comments, you know the drill :-)

Nature.com Subjects Stream Graph
http://www.michelepasin.org/blog/2016/01/03/nature-com-subjects-stream-graph/ – Sun, 03 Jan 2016

The nature.com subjects stream graph displays the distribution of content across the subject areas covered by the nature.com portal.

This is an experimental interactive visualisation based on a freely available dataset from the nature.com linked data platform, which I’ve been working on over the last few months.

[Screenshot: the nature.com subjects stream graph]

The main visualization provides an overview of selected content within the level 2 disciplines of the NPG Subjects Ontology. By clicking on these, it is then possible to explore more specific subdisciplines and their related articles.

For those of you who are not familiar with the Subjects Ontology: this is a categorization of scholarly subject areas which are used for the indexing of content on nature.com. It includes subject terms of varying levels of specificity such as Biological sciences (top level), Cancer (level 2), or B-2 cells (level 7). In total there are more than 2500 subject terms, organized into a polyhierarchical tree.

Starting in 2010, the various journals published on nature.com have adopted the subject ontology to tag their articles (note: different journals have started doing this at different times, hence some variations in the graph starting dates).

The visualization makes use of various d3.js modules, plus some simple customizations here and there. The hardest part of the work was putting the different page components together so as to achieve a more fluent ‘narrative’, gradually zooming into the data.

The back end is a Django web application with a relational database. The original dataset is published as RDF, so in order to use the Django APIs I’ve recreated it as a relational model. That also let me add a few extra data fields containing precomputed indexes (e.g. article counts per month), so as to make the stream graph load faster.
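
Purely as an illustration of that setup (the real schema isn’t shown here, so all model and field names below are made up), the Django side could look roughly like this:

from django.db import models

class Subject(models.Model):
    # one row per term of the NPG Subjects Ontology
    uri = models.URLField(unique=True)
    name = models.CharField(max_length=255)
    level = models.PositiveSmallIntegerField()
    parents = models.ManyToManyField("self", symmetrical=False,
                                     related_name="children")
    # denormalized counts, so the stream graph doesn't have to aggregate
    # over the articles table on every page load
    articles_per_month = models.TextField(
        blank=True, help_text="JSON list of [year-month, count] pairs")

class Article(models.Model):
    doi = models.CharField(max_length=128, unique=True)
    title = models.TextField()
    published = models.DateField()
    subjects = models.ManyToManyField(Subject, related_name="articles")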

Comments or suggestions, as always very welcome.

 

Is wikipedia a valid source of scientific knowledge?
http://www.michelepasin.org/blog/2015/09/02/is-wikipedia-a-valid-source-of-scientific-knowledge/ – Wed, 02 Sep 2015

Is wikipedia a valid source of scientific knowledge? Many would say yes. Others are still quite skeptical, or maybe just cautious about it. What seems to be the case though – and this is what this post is about – is that wikipedians are increasingly including references to scientific literature, and when they do, they do it right.

Based on data we’ve recently extracted from Wikipedia, it looks like the vast majority of citations to nature.com content have been made according to established scientific practice (i.e. using DOIs). Which makes you think that whoever added those citations is either a scientist or has some familiarity with science.

In the context of the nature.com ontologies portal we’ve done some work aimed at surfacing links between our articles and other datasets. Wikipedia and DBpedia (an RDF database version of wikipedia) came to our attention quite early on: how much do wikipedia articles cite scientific content published on nature.com? Also, how well do they cite it?

So here’s an interactive visualization that lets you see all incoming references from Wikipedia to the nature.com archive. The actual dataset is encoded in RDF and can be downloaded here (look for the npg-articles-dbpedia-linkset.2015-08-24.nq.tar.gz file).

About the data

In a nutshell, what we did was simply extract all mentions of either NPG DOIs or nature.com links using the wikipedia APIs (for example, see all references to the DOI “10.1038/ng1285”).

These links have then been validated against the nature.com articles database and encoded in RDF in two ways: a cito:isCitedBy relationship links the article URI to the citing Wikipedia page, and a foaf:topic relationship links the same article URI to the corresponding DBpedia page.
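
If you wanted to replicate the extraction step yourself, a minimal version could look like the sketch below (an illustration only: it uses the public MediaWiki search API and the rdflib library, and the nature.com article URI passed in at the end is a made-up example rather than one taken from the actual linkset):

import requests
from rdflib import Graph, URIRef, Namespace

CITO = Namespace("http://purl.org/spar/cito/")
FOAF = Namespace("http://xmlns.com/foaf/0.1/")
WIKI_API = "https://en.wikipedia.org/w/api.php"

def pages_citing(doi):
    """Return the titles of Wikipedia pages whose text mentions the given DOI."""
    params = {"action": "query", "list": "search",
              "srsearch": '"%s"' % doi, "format": "json"}
    data = requests.get(WIKI_API, params=params).json()
    return [hit["title"] for hit in data["query"]["search"]]

def linkset(article_uri, doi):
    """Build the two kinds of links described above for a single article."""
    g = Graph()
    article = URIRef(article_uri)
    for title in pages_citing(doi):
        slug = title.replace(" ", "_")
        g.add((article, CITO.isCitedBy,
               URIRef("https://en.wikipedia.org/wiki/" + slug)))
        g.add((article, FOAF.topic,
               URIRef("http://dbpedia.org/resource/" + slug)))
    return g

g = linkset("http://www.nature.com/articles/ng1285", "10.1038/ng1285")
print(g.serialize(format="nt"))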

In total there are 51,309 links, spanning 145 years of content.

Quite interestingly, the vast majority of these links are explicit DOI references (only ~900 were links to nature.com without a DOI). So, it seems that people do recognize the importance of DOIs even within a loosely controlled context like wikipedia.

Using the dataset

Considering that for many people Wikipedia has become the de facto largest and most cited encyclopedia out there (see the readings below), this may be an interesting dataset to analyze, e.g. to highlight citation patterns of influential articles.

Also, this could become quite useful as a data source for content enrichment: the wikipedia links could be used to drive subject tagging, or they could even be presented to readers on article pages e.g. as contextual information.

We haven’t really had time to explore any follow up on this work, but hopefully we’ll do that soon.

All of this data is open source and freely available on nature.com/ontologies. So if you’re reading this and have more ideas about potential uses or just want to collaborate, please do get in touch!

Caveats

This dataset is obviously just a snapshot of wikipedia links at a specific moment in time.

If one were to use these data within a real-world application, they’d probably want to come up with some strategy to keep them up to date (e.g. monitoring the Wikipedia IRC recent changes channel).

Good news is, work is already happening in this space:

  • CrossRef is looking at collecting citation events from Wikipedia in real time and releasing these data freely as part of their service, e.g. see http://crosstech.crossref.org/2015/05/coming-to-you-live-from-wikipedia.html
  • Altmetric scans wikipedia for references too, e.g. see http://nature.altmetric.com/details/961190/wikipedia and http://www.altmetric.com/blog/new-source-alert-wikipedia/; however, the source data is not freely available.

Readings

Finally, here are a couple of interesting background readings I’ve found in the nature.com archive:

  • Wikipedia rival calls in the experts (2006) http://www.nature.com/nature/journal/v443/n7111/full/443493a.html
  • Publish in Wikipedia or perish (2008) http://www.nature.com/news/2008/081216/full/news.2008.1312.html
  • Time to underpin Wikipedia wisdom (2010) http://www.nature.com/nature/journal/v468/n7325/full/468765c.html

Enjoy!

     

A sneak peek at Nature.com articles’ archive
http://www.michelepasin.org/blog/2015/06/08/a-sneak-peek-at-nature-com-articles-archive/ – Mon, 08 Jun 2015

We’re getting closer to releasing the full set of metadata covering over one million articles published by Nature Publishing Group since 1845. So here’s a sneak peek at this dataset, in the form of a simple d3.js visual summary of what will soon be available to download and reuse.

Over the last few months I’ve been working with my colleagues at Macmillan Science and Education on an open data portal that makes available to the public many of the taxonomies and ontologies we use internally for organising the content we publish.

This is part of our ongoing involvement with linked data and semantic technologies, aimed both at leveraging these tools to transform the publishing workflow into a more dynamic platform, and at contributing to the evolving web of open data with a rich dataset of scientific article metadata.

The articles dataset includes metadata about all articles published by the Nature journal, of course. But not only: Scientific American, Nature Medicine, Nature Genetics and many other titles are also part of it (note: the full list can be downloaded as raw data here).

The first diagram shows how many articles have been published each year since 1845 (the start year of Scientific American). Nature began only a few years later, in 1869; the curve getting steeper in the 90s corresponds to the exponential increase in publications due to the progressive specialisation of scientific journals (e.g. all the nature-branded titles).

The second diagram shows the increase in publication volume on a cumulative scale. We’ve now reached 1M articles and counting!

In order to create the charts I played around with a nifty example from Mike Bostock (http://bl.ocks.org/mbostock/3902569) and added a couple of extra things to it.

The full source code is on GitHub.

Finally, it’s worth mentioning that this metadata had already been made available a few years ago under the CC0 license: you can still access it here. This upcoming release, though, makes it available in the context of a much more precise and stable set of ontologies, meaning that the semantics of the dataset are more clearly laid out and consistent.

So stay tuned for more! And if you plan or would like to reuse these datasets, please do get in touch, either here or by emailing developers@nature.com.

     

Nature.com subject pages available online!
http://www.michelepasin.org/blog/2014/06/23/nature-com-subject-pages-available-online/ – Mon, 23 Jun 2014

Subject pages are pages that aggregate content from across nature.com based on the tagging of that content by NPG subject ontology terms. After six months of work on this project we’ve finally launched the first release of the site, which is reachable online at http://www.nature.com/subjects. Hooray!

This has been a particularly challenging experience because I’ve essentially been wearing two hats for the past six months: product owner, leading the team in the day-to-day activities and prioritization of tasks, and information architect, dealing with the way content is organized and presented to users (my usual role).

In a nutshell, the goal of the project was to help our readers discover content more easily by using an internally-developed subject ontology to publish a page per term. The ontology is actually a poly-hierarchical taxonomy of scientific topics, which has been used over the last couple of years to tag all articles published on nature.com.

Besides helping users browse the site more easily, subject pages also contribute to making NPG content more discoverable via Google and other external search engines. All of this is powered by a new backend platform which combines the expressiveness of linked data technologies (RDF) with the scalability of more traditional XML data stores (MarkLogic).

The main features are:
– one page per subject term, which collates all content tagged with that term across nature.com
– RSS and ATOM feeds for each of the subject terms (~2500)
– dedicated pages that collate content from different journals based on their article types (e.g. news, research etc.)
– a visual tool to navigate subjects based on the ontology relations
– subject email alerts (to be released in the coming weeks)

It’s been a lot of work to bring all of this content together within a single application (keep in mind that the content comes from more than 80 different journals!) but this is just the beginning.

In the next few months we’re looking at extending this work by making the content available in other formats (e.g. RDF), providing more ways to navigate through the data (facets, visualizations), and integrating it with other datasets available online... so stay tuned for more!

Listen to the RainForest at Kew Gardens
http://www.michelepasin.org/blog/2010/08/09/listen-to-the-rainforest-at-kew-gardens/ – Mon, 09 Aug 2010

If you happen to be going to London’s Kew Gardens, make sure you don’t miss this nice sound installation by Chris Watson. The installation is on till September 5th and it’s called Whispering in the Leaves:

Whispering in the Leaves features two sound pieces – Dawn and Dusk – composed by Chris from memory using his extensive archive of on-location recordings in Central and South American rainforests.

Designed specifically for the Palm House, Whispering in the Leaves is the audio equivalent of 3D cinema. Visitors will be immersed in a dynamic, spatial soundscape of primate calls and birdsong, backed with a shimmering wall of insect sounds. Some of the species heard are currently unknown to humans. Visitors will experience the heard but never seen.

Diffused through 80 speakers, the two compositions will be transmitted at hourly intervals throughout the day – Dawn in the morning and Dusk in the afternoon. Each lasts for 15-20 minutes – the approximate time it takes for the transition between darkness and daylight in the dense tropical vegetation.

The website makes available some of the nature recordings too.
