d3 – Parerga und Paralipomena

Nature.com Subjects Stream Graph

mikele — Sun, 03 Jan 2016 00:28:08 +0000

The nature.com subjects stream graph displays the distribution of content across the subject areas covered by the nature.com portal.

This is an experimental interactive visualisation based on a freely available dataset from the nature.com linked data platform, which I’ve been working on in the last few months.

The main visualization provides an overview of selected content within the level 2 disciplines of the NPG Subjects Ontology. By clicking on these, it is then possible to explore more specific subdisciplines and their related articles.

For those of you who are not familiar with the Subjects Ontology: this is a categorization of scholarly subject areas which is used for the indexing of content on nature.com. It includes subject terms of varying levels of specificity such as Biological sciences (top level), Cancer (level 2), or B-2 cells (level 7). In total there are more than 2500 subject terms, organized into a polyhierarchical tree.

Starting in 2010, the various journals published on nature.com have adopted the subject ontology to tag their articles (note: different journals have started doing this at different times, hence some variations in the graph starting dates).

The visualization makes use of various d3.js modules, plus some simple customizations here and there. The hardest part of the work was putting the different page components together so that gradually zooming into the data produces a more fluent ‘narrative’.

The back end is a Django web application with a relational database. The original dataset is published as RDF, so in order to use the Django APIs I’ve recreated it as a relational model. That also let me add a few extra data fields containing search indexes (e.g. article counts per month), so as to make the stream graph load faster.
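To give an idea of what the per-month counts look like, here is a minimal sketch in plain Python. It is not the actual Django code: the function name, the (subject, date) row shape and the sample rows are all assumptions made for illustration.

```python
from collections import Counter
from datetime import date


def monthly_counts(articles):
    """Count articles per (subject, "YYYY-MM") pair.

    `articles` is an iterable of (subject, publication_date) tuples;
    this row shape is an assumption, not the real database schema.
    """
    counts = Counter()
    for subject, published in articles:
        counts[(subject, published.strftime("%Y-%m"))] += 1
    return counts


# Made-up sample rows, for illustration only.
sample_articles = [
    ("Cancer", date(2014, 3, 2)),
    ("Cancer", date(2014, 3, 18)),
    ("Cancer", date(2014, 4, 1)),
    ("Genetics", date(2014, 3, 9)),
]
counts = monthly_counts(sample_articles)
```

Storing numbers like these alongside each subject means the stream graph only has to read them, rather than count articles on every page load.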

Comments or suggestions are, as always, very welcome.

Is wikipedia a valid source of scientific knowledge?

mikele — Wed, 02 Sep 2015 13:15:25 +0000

Is wikipedia a valid source of scientific knowledge? Many would say yes. Others are still quite skeptical, or maybe just cautious about it. What seems to be the case though – and this is what this post is about – is that wikipedians are increasingly including references to scientific literature, and when they do it they do it right.

Based on data we’ve recently extracted from Wikipedia, it looks like the vast majority of citations to nature.com content have been made according to the established scientific practice (i.e. using DOIs). Which makes you think that whoever added those citations is either a scientist or has some familiarity with science.

In the context of the nature.com ontologies portal we’ve done some work aimed at surfacing links between our articles and other datasets. Wikipedia and DBpedia (an RDF database version of wikipedia) came to our attention quite early: how much do wikipedia articles cite scientific content published on nature.com? Also, how well do they cite it?

So here’s an interactive visualization that lets you see all incoming references from Wikipedia to the nature.com archive. The actual dataset is encoded in RDF and can be downloaded here (look for the npg-articles-dbpedia-linkset.2015-08-24.nq.tar.gz file).

About the data

In a nutshell, what we did was simply extract all mentions of either NPG DOIs or nature.com links using the wikipedia APIs (for example, see all references to the DOI “10.1038/ng1285”).
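As a rough illustration of the DOI-matching step only (no calls to the wikipedia APIs here), this is a small sketch that pulls NPG DOIs out of a chunk of wikitext. The regular expression, the function name and the sample wikitext are assumptions of mine, not the code we actually ran.

```python
import re

# Assumption: NPG DOIs carry the 10.1038 prefix, and the suffix runs until
# whitespace or a common wikitext/URL delimiter.
NPG_DOI = re.compile(r'10\.1038/[^\s|}\]<>"]+')


def extract_npg_dois(wikitext):
    """Return the distinct NPG DOIs found in `wikitext`, in order of appearance."""
    found = []
    for match in NPG_DOI.findall(wikitext):
        doi = match.rstrip(".,;")  # drop trailing sentence punctuation
        if doi not in found:
            found.append(doi)
    return found


# Made-up wikitext citing the DOI mentioned above in two different ways.
sample_wikitext = (
    "{{cite journal |title=Example |journal=Nature Genetics "
    "|doi=10.1038/ng1285}} and again at "
    "[http://dx.doi.org/10.1038/ng1285 link]."
)
dois = extract_npg_dois(sample_wikitext)
```

A pattern like this is deliberately loose, which is one reason the matches still need checking against the articles database afterwards.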

These links have then been validated against the nature.com articles database and encoded in RDF in two ways: a cito:isCitedBy relationship links the article URI to the citing Wikipedia page, and a foaf:topic relationship links the same article URI to the corresponding DBpedia page.
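For the curious, the two relationships can be written out as N-Quads lines roughly like this. It is only a sketch: the cito and foaf property URIs are the standard ones, but the article URI, the graph URI, the page title and the function name are hypothetical placeholders, and real page titles would also need proper URL escaping.

```python
CITO_IS_CITED_BY = "http://purl.org/spar/cito/isCitedBy"
FOAF_TOPIC = "http://xmlns.com/foaf/0.1/topic"


def linkset_quads(article_uri, wikipedia_title, graph_uri):
    """Build the two N-Quads lines linking one article to one Wikipedia page."""
    # Simplification: only spaces are handled; other characters are left as-is.
    slug = wikipedia_title.replace(" ", "_")
    wikipedia_page = "http://en.wikipedia.org/wiki/" + slug
    dbpedia_page = "http://dbpedia.org/resource/" + slug
    triples = [
        (article_uri, CITO_IS_CITED_BY, wikipedia_page),
        (article_uri, FOAF_TOPIC, dbpedia_page),
    ]
    return ["<%s> <%s> <%s> <%s> ." % (s, p, o, graph_uri) for s, p, o in triples]


# Hypothetical URIs and title, for illustration only.
quads = linkset_quads(
    "http://example.org/articles/ng1285",
    "Example page",
    "http://example.org/graphs/wikipedia-links",
)
```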

In total there are 51309 links over 145 years.

Quite interestingly, the vast majority of these links are explicit DOI references (only ~900 were links to nature.com without a DOI). So, it seems that people do recognize the importance of DOIs even within a loosely controlled context like wikipedia.

Using the dataset

Considering that for many wikipedia has become the de facto largest and most cited encyclopedia out there (see the articles below), this may be an interesting dataset to analyze e.g. to highlight citation patterns of influential articles.

Also, this could become quite useful as a data source for content enrichment: the wikipedia links could be used to drive subject tagging, or they could even be presented to readers on article pages e.g. as contextual information.

We haven’t really had time to explore any follow up on this work, but hopefully we’ll do that soon.

All of this data is open source and freely available on nature.com/ontologies. So if you’re reading this and have more ideas about potential uses or just want to collaborate, please do get in touch!

Caveats

This dataset is obviously just a snapshot of wikipedia links at a specific moment in time.

If one were to use these data within a real-world application, they’d probably want to come up with some strategy to keep them up to date (e.g. monitoring the Wikipedia IRC recent changes channel).

Good news is, work is already happening in this space:

CrossRef is looking at collecting citation events from Wikipedia in real time and releasing these data freely as part of their service e.g. see http://crosstech.crossref.org/2015/05/coming-to-you-live-from-wikipedia.html

Altmetric scans wikipedia for references too e.g. see http://nature.altmetric.com/details/961190/wikipedia and http://www.altmetric.com/blog/new-source-alert-wikipedia/. However, the source data is not freely available.

Readings

Finally, here are a couple of interesting background readings I’ve found in the nature.com archive:

Wikipedia rival calls in the experts (2006) http://www.nature.com/nature/journal/v443/n7111/full/443493a.html

Publish in Wikipedia or perish (2008) http://www.nature.com/news/2008/081216/full/news.2008.1312.html

Time to underpin Wikipedia wisdom (2010) http://www.nature.com/nature/journal/v468/n7325/full/468765c.html

Enjoy!

A sneak peek at Nature.com articles’ archive

mikele — Mon, 08 Jun 2015 21:26:58 +0000

We’re getting closer to releasing the full set of metadata covering over one million articles published by Nature Publishing Group since 1845. So here’s a sneak peek at this dataset, in the form of a simple d3.js visual summary of what soon will be available to download and reuse.

Over the last few months I’ve been working with my colleagues at Macmillan Science and Education on an open data portal that makes available to the public many of the taxonomies and ontologies we use internally for organising the content we publish.

This is part of our ongoing involvement with linked data and semantic technologies, aimed both at using these tools to transform the publishing workflow into a more dynamic platform, and at contributing to the evolving web of open data with a rich dataset of scientific article metadata.

The articles dataset includes metadata about all articles published by the Nature journal, of course. But not only that: Scientific American, Nature Medicine, Nature Genetics and many other titles are also part of it (note: the full list can be downloaded as raw data here).

The first diagram shows how many articles have been published each year since 1845 (the start year of Scientific American). Nature began a couple of decades later, in 1869; the curve getting steeper in the 90s corresponds to the exponential increase in publications due to the progressive specialisation of scientific journals (e.g. all the Nature-branded titles).

The second diagram instead shows the growth in publication volumes on a cumulative scale. We’ve now reached 1M articles, and counting!
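The data behind the second diagram is just a running total of the yearly counts used in the first one. A minimal sketch of that step, assuming a plain year-to-count mapping (the function name and the numbers are made up, not taken from the real dataset):

```python
from itertools import accumulate


def cumulative_totals(yearly_counts):
    """Turn {year: articles published that year} into [(year, running total), ...]."""
    years = sorted(yearly_counts)
    totals = accumulate(yearly_counts[year] for year in years)
    return list(zip(years, totals))


# Made-up yearly counts, for illustration only.
sample_counts = {1845: 10, 1846: 12, 1869: 30}
running = cumulative_totals(sample_counts)
```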

In order to create the charts I played around with a nifty example from Mike Bostock (http://bl.ocks.org/mbostock/3902569) and added a couple of extra things to it.

The full source code is on Github.

Finally, it’s worth mentioning that this metadata had already been made available a few years ago under the CC0 license: you can still access it here. This upcoming release, though, makes it available in the context of a much more precise and stable set of ontologies, meaning that the semantics of the dataset is more clearly laid out and consistent.

So stay tuned for more! And if you plan to, or would like to, reuse these datasets please do get in touch, either here or by emailing developers@nature.com.

Messing around with D3.js and hierarchical data

mikele — Fri, 21 Jun 2013 13:23:59 +0000

These days there are a lot of browser-oriented visualization toolkits, such as d3.js or jit.js. They’re great and easy to use, but how well do they scale when used with medium-large or very large datasets?

The subject ontology is quite a large (~2500 entities) taxonomical classification developed at Nature Publishing Group in order to classify scientific publications. The taxonomy is publicly available on data.nature.com, and is encoded using the SKOS RDF vocabulary.

In order to evaluate the scalability of various javascript tree visualizations I extracted a JSON version of the subject taxonomy and tried to render it on a webpage, using some of the available viz approaches out of the box; here are the results (ps: I added the option of selecting how many levels of the tree can be visualized, just to get an idea of when a viz breaks).
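The "how many levels" option boils down to cutting the tree at a given depth before handing it to the viz. Here is a rough sketch of that idea, assuming the nested name/children JSON shape commonly used in d3 hierarchy examples; the function name and the tiny sample tree (including its made-up root and level 3 term) are mine, for illustration only.

```python
def prune(node, max_levels):
    """Return a copy of `node` keeping at most `max_levels` levels, root included."""
    pruned = {"name": node["name"]}
    if max_levels > 1 and node.get("children"):
        pruned["children"] = [
            prune(child, max_levels - 1) for child in node["children"]
        ]
    return pruned


# Tiny made-up tree; the root and the deepest term are placeholders.
tree = {
    "name": "Subjects",
    "children": [
        {
            "name": "Biological sciences",
            "children": [
                {
                    "name": "Cancer",
                    "children": [{"name": "Hypothetical level 3 term"}],
                }
            ],
        }
    ],
}
top_two = prune(tree, 2)
```

Since the function builds a copy, the full tree stays intact and can be pruned again at a different depth.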

Some conclusions:

The subject taxonomy is actually a poly-hierarchy (one term can have more than one parent, so really it’s more like a directed graph). None of the libraries could handle that properly, but maybe that’s not really a limitation, because they are meant to support the visualization of trees (maybe I should play around more with force-directed graph layouts and the like).

The only viz that could handle all of the terms in the taxonomy is D3’s collapsible tree. Still, you don’t want to keep all the branches open at the same time! Click on the image to see it for yourself.

One approach to dealing with large quantities of data is obviously to show them a little bit at a time. The Bar Hierarchy seems a pretty good way to do that: it’s informative and responsive. However, it’d be nice to integrate it with other controls or visual cues that would tell one what level of depth they’re currently looking at, which siblings are available, etc.

Partition tables also look pretty good at providing a visual summary of the categories available; however, they tend to fail quickly when there are too many nodes, and the text is often not readable at all. In the example below I had to include only the first 3 levels of the taxonomy for it to load properly!

Rotating tree. Essentially a tree plotted on a circle: very useful for providing a graphical overview of the data, but it tends to become unresponsive quickly.

Hierarchical pie chart. A pie chart that allows zooming in so as to reveal hierarchical relationships (often also called a Zoomable Sunburst). Quite nice and responsive, even with a large amount of data. The absence of labels could be a limiting feature though; you get a nice overview of the datascape but can’t really understand the meaning of each element unless you mouse over it.
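Going back to the poly-hierarchy point: a quick way to see how far a taxonomy is from a plain tree is to list the terms that have more than one parent. A minimal sketch, assuming the taxonomy is available as (parent, child) pairs; the function name and the sample edges are made up for illustration and are not taken from the real ontology.

```python
from collections import defaultdict


def multi_parent_terms(edges):
    """Map each term with more than one parent to its sorted list of parents."""
    parents = defaultdict(set)
    for parent, child in edges:
        parents[child].add(parent)
    return {
        term: sorted(found) for term, found in parents.items() if len(found) > 1
    }


# Made-up (parent, child) pairs, for illustration only.
sample_edges = [
    ("Parent A", "Shared term"),
    ("Parent B", "Shared term"),
    ("Parent A", "Single-parent term"),
]
shared = multi_parent_terms(sample_edges)
```

Terms that show up here are the ones a tree layout has to either duplicate or attach to just one parent.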

Other stuff out there that could do a better job?