d3 – Parerga und Paralipomena
http://www.michelepasin.org/blog
At the core of all well-founded belief lies belief that is unfounded - Wittgenstein
Mon, 04 Jan 2016 18:44:15 +0000

Nature.com Subjects Stream Graph
http://www.michelepasin.org/blog/2016/01/03/nature-com-subjects-stream-graph/
Sun, 03 Jan 2016 00:28:08 +0000

The nature.com subjects stream graph displays the distribution of content across the subject areas covered by the nature.com portal.

This is an experimental interactive visualisation based on a freely available dataset from the nature.com linked data platform, which I’ve been working on in the last few months.

streamgraph

The main visualization provides an overview of selected content within the level 2 disciplines of the NPG Subjects Ontology. By clicking on these, it is then possible to explore more specific subdisciplines and their related articles.

For those of you who are not familiar with the Subjects Ontology: this is a categorization of scholarly subject areas, used for indexing content on nature.com. It includes subject terms of varying levels of specificity, such as Biological sciences (top level), Cancer (level 2), or B-2 cells (level 7). In total there are more than 2500 subject terms, organized into a polyhierarchical tree.

Starting in 2010, the various journals published on nature.com have adopted the subject ontology to tag their articles (note: different journals have started doing this at different times, hence some variations in the graph starting dates).

streamgraph2

streamgraph3

The visualization makes use of various d3.js modules, plus some simple customizations here and there. The hardest part of the work was putting the different page components together, so that gradually zooming into the data produces a more fluent ‘narrative’.

The back end is a Django web application with a relational database. The original dataset is published as RDF, so in order to use the Django APIs I’ve recreated it as a relational model. That also let me add a few extra data fields containing search indexes (e.g. article counts per month), so that the stream graph loads faster.
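To give an idea of what such a precomputed index could look like, here is a minimal sketch in plain Python (not the actual Django code). The input shape, a list of (subject, publication date) pairs, and the "YYYY-MM" bucket format are assumptions made for illustration only.

```python
from collections import Counter
from datetime import date


def monthly_counts(articles):
    """Count articles per (subject, 'YYYY-MM') bucket.

    `articles` is an iterable of (subject, publication_date) pairs.
    Both the input shape and the bucket format are hypothetical.
    """
    counts = Counter()
    for subject, published in articles:
        counts[(subject, published.strftime("%Y-%m"))] += 1
    return counts


# Made-up sample rows, just to exercise the function.
sample_articles = [
    ("Cancer", date(2014, 1, 5)),
    ("Cancer", date(2014, 1, 20)),
    ("Cancer", date(2014, 2, 3)),
    ("Genetics", date(2014, 1, 9)),
]
counts = monthly_counts(sample_articles)
```

Storing these counts once, rather than aggregating on every request, is what would keep the stream graph quick to load.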

Comments or suggestions are, as always, very welcome.

 

Is Wikipedia a valid source of scientific knowledge?
http://www.michelepasin.org/blog/2015/09/02/is-wikipedia-a-valid-source-of-scientific-knowledge/
Wed, 02 Sep 2015 13:15:25 +0000

Is Wikipedia a valid source of scientific knowledge? Many would say yes. Others are still quite skeptical, or maybe just cautious about it. What seems to be the case though (and this is what this post is about) is that Wikipedians are increasingly including references to scientific literature, and when they do it, they do it right.

Based on data we’ve recently extracted from Wikipedia, it looks like the vast majority of citations to nature.com content follow established scientific practice (i.e. they use DOIs). Which makes you think that whoever added those citations is either a scientist or has some familiarity with science.

In the context of the nature.com ontologies portal we’ve done some work aimed at surfacing links between our articles and other datasets. Wikipedia and DBpedia (an RDF database version of Wikipedia) came to our attention early on: how much do Wikipedia articles cite scientific content published on nature.com? Also, how well do they cite it?

So here’s an interactive visualization that lets you see all incoming references from Wikipedia to the nature.com archive. The actual dataset is encoded in RDF and can be downloaded here (look for the npg-articles-dbpedia-linkset.2015-08-24.nq.tar.gz file).

NewImage

 

About the data

In a nutshell, we simply extracted all mentions of either NPG DOIs or nature.com links using the Wikipedia APIs (for example, see all references to the DOI “10.1038/ng1285”).
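The DOI-matching step can be sketched roughly as follows. This is not the code we actually ran: it only shows how NPG DOIs (which use the 10.1038 prefix) might be pulled out of a chunk of wikitext once it has been fetched. The regular expression for the DOI suffix is a loose assumption, and the sample wikitext is made up around the DOI mentioned above.

```python
import re

# NPG DOIs start with the 10.1038 prefix. The suffix pattern is an
# assumption: it stops at whitespace and common wikitext delimiters.
NPG_DOI = re.compile(r'10\.1038/[^\s|}\]<>"]+')


def extract_npg_dois(wikitext):
    """Return the distinct NPG DOIs found in `wikitext`, in order of appearance."""
    found = []
    for match in NPG_DOI.findall(wikitext):
        # Drop trailing punctuation and normalise case (DOIs are case-insensitive).
        doi = match.rstrip(".,;)").lower()
        if doi not in found:
            found.append(doi)
    return found


# Hypothetical wikitext snippet citing the same DOI twice.
sample_wikitext = (
    "{{cite journal |title=Example |doi=10.1038/ng1285 |journal=Nature Genetics}} "
    "See also http://dx.doi.org/10.1038/NG1285."
)
dois = extract_npg_dois(sample_wikitext)
```

Each extracted DOI would then still need to be checked against the articles database, as described next.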

These links have then been validated against the nature.com articles database and encoded in RDF in two ways: a cito:isCitedBy relationship links the article URI to the citing Wikipedia page, and a foaf:topic relationship links the same article URI to the corresponding DBpedia page.
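As a rough illustration of that encoding, the sketch below builds the two N-Quads statements for a single link. The cito and foaf namespace URIs are the standard ones, but the article URI, the graph URI and the sample page title are placeholders, not the identifiers used in the published linkset. The page title is assumed to be already URL-safe.

```python
CITO_IS_CITED_BY = "http://purl.org/spar/cito/isCitedBy"
FOAF_TOPIC = "http://xmlns.com/foaf/0.1/topic"


def linkset_quads(article_uri, wiki_title, graph_uri):
    """Build two N-Quads lines linking an article to a Wikipedia page
    and to the corresponding DBpedia resource.

    `wiki_title` is assumed to be URL-safe (e.g. 'Gene', not 'Gene (biology)').
    """
    wiki_page = "http://en.wikipedia.org/wiki/" + wiki_title
    dbpedia_page = "http://dbpedia.org/resource/" + wiki_title
    triples = [
        (article_uri, CITO_IS_CITED_BY, wiki_page),
        (article_uri, FOAF_TOPIC, dbpedia_page),
    ]
    return [f"<{s}> <{p}> <{o}> <{graph_uri}> ." for s, p, o in triples]


# Placeholder URIs, for illustration only.
quads = linkset_quads(
    "http://example.org/articles/ng1285",
    "Gene",
    "http://example.org/graphs/wikipedia-linkset",
)
```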

Screen Shot 2015 09 03 at 12 37 33 PM

In total there are 51309 links over 145 years.

Quite interestingly, the vast majority of these links are explicit DOI references (only ~900 were links to nature.com without a DOI). So, it seems that people do recognize the importance of DOIs even within a loosely controlled context like wikipedia.

Using the dataset

Considering that for many Wikipedia has become the de facto largest and most cited encyclopedia out there (see the articles below), this may be an interesting dataset to analyze, e.g. to highlight citation patterns of influential articles.

Also, this could become quite useful as a data source for content enrichment: the wikipedia links could be used to drive subject tagging, or they could even be presented to readers on article pages e.g. as contextual information.

Toparticles

We haven’t really had time to explore any follow up on this work, but hopefully we’ll do that soon.

All of this data is open source and freely available on nature.com/ontologies. So if you’re reading this and have more ideas about potential uses or just want to collaborate, please do get in touch!

Caveats

This dataset is obviously just a snapshot of Wikipedia links at a specific moment in time.

If you were to use this data within a real-world application, you’d probably want some strategy for keeping it up to date (e.g. monitoring the Wikipedia IRC recent changes channel).

Good news is, work is already happening in this space:

  • CrossRef is looking at collecting citation events from Wikipedia in real time and releasing this data freely as part of their service, e.g. see http://crosstech.crossref.org/2015/05/coming-to-you-live-from-wikipedia.html
  • Altmetric scans Wikipedia for references too, e.g. see http://nature.altmetric.com/details/961190/wikipedia and http://www.altmetric.com/blog/new-source-alert-wikipedia/; however, the source data is not freely available.

    Readings

    Finally, here are a couple of interesting background readings I’ve found in the nature.com archive:

  • Wikipedia rival calls in the experts (2006) http://www.nature.com/nature/journal/v443/n7111/full/443493a.html
  • Publish in Wikipedia or perish (2008) http://www.nature.com/news/2008/081216/full/news.2008.1312.html
  • Time to underpin Wikipedia wisdom (2010) http://www.nature.com/nature/journal/v468/n7325/full/468765c.html
    Enjoy!

     

    A sneak peek at Nature.com articles’ archive
    http://www.michelepasin.org/blog/2015/06/08/a-sneak-peek-at-nature-com-articles-archive/
    Mon, 08 Jun 2015 21:26:58 +0000

    We’re getting closer to releasing the full set of metadata covering over one million articles published by Nature Publishing Group since 1845. So here’s a sneak peek at this dataset, in the form of a simple d3.js visual summary of what will soon be available to download and reuse.

    Over the last few months I’ve been working with my colleagues at Macmillan Science and Education on an open data portal that makes available to the public many of the taxonomies and ontologies we use internally for organising the content we publish.

    This is part of our ongoing involvement with linked data and semantic technologies. The aim is twofold: to use these tools to turn the publishing workflow into a more dynamic platform, and to contribute a rich dataset of scientific article metadata to the evolving web of open data.

    The articles dataset includes metadata about all articles published by the Nature journal, of course. But that’s not all: Scientific American, Nature Medicine, Nature Genetics and many other titles are also part of it (note: the full list can be downloaded as raw data here).

    Screen Shot 2015 06 08 at 22 24 15

    The first diagram shows how many articles have been published each year since 1845 (the start year of Scientific American). Nature began later, in 1869; the curve getting steeper from the 90s onwards corresponds to the exponential increase in publications due to the progressive specialisation of scientific journals (e.g. all the Nature-branded titles).

    The second diagram instead shows the increase in publication volumes on a cumulative scale. We’ve now reached 1M articles and counting!
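The relationship between the two diagrams is just a running total: the second one accumulates the yearly counts plotted in the first. A minimal sketch in Python, with made-up numbers rather than the real publication counts:

```python
from itertools import accumulate


def cumulative_totals(yearly):
    """Turn a {year: article_count} mapping into a list of
    (year, running_total) pairs, sorted by year."""
    years = sorted(yearly)
    totals = accumulate(yearly[year] for year in years)
    return list(zip(years, totals))


# Made-up counts, for illustration only.
yearly_counts = {1845: 3, 1846: 5, 1869: 10}
running = cumulative_totals(yearly_counts)
```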

    Screen Shot 2015 06 08 at 22 25 09

    In order to create the charts I played around with a nifty example from Mike Bostock (http://bl.ocks.org/mbostock/3902569) and added a couple of extra things to it.

    The full source code is on Github.

    Finally, it’s worth mentioning that this metadata had already been made available a few years ago under the CC0 license: you can still access it here. This upcoming release, though, makes it available in the context of a much more precise and stable set of ontologies, meaning that the semantics of the dataset are more clearly laid out and consistent.

    So stay tuned for more! And if you plan to (or would like to) reuse these datasets, please do get in touch, either here or by emailing developers@nature.com.

     

    Messing around with D3.js and hierarchical data
    http://www.michelepasin.org/blog/2013/06/21/messing-around-wih-d3-js-and-hierarchical-data/
    Fri, 21 Jun 2013 13:23:59 +0000

    These days there are a lot of browser-oriented visualization toolkits, such as d3.js or jit.js. They’re great and easy to use, but how well do they scale when used with medium-large or very large datasets?

    The subject ontology is a fairly large (~2500 entities) taxonomical classification developed at Nature Publishing Group in order to classify scientific publications. The taxonomy is publicly available on data.nature.com, and is encoded using the SKOS RDF vocabulary.

    In order to evaluate the scalability of various JavaScript tree visualizations, I extracted a JSON version of the subject taxonomy and tried to render it on a webpage using some of the available viz approaches out of the box. Here are the results (PS: I added the option of selecting how many levels of the tree are visualized, just to get an idea of when a viz breaks).
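The "how many levels" option boils down to pruning the tree before handing it to the visualization. Here is a minimal sketch in Python, assuming the nested name/children JSON shape commonly used by d3 hierarchy examples. The sample tree is a made-up fragment, not the real taxonomy export.

```python
def prune(node, max_depth, depth=1):
    """Return a copy of `node` keeping only the first `max_depth` levels.

    `node` is a dict with a 'name' key and an optional 'children' list.
    """
    pruned = {"name": node["name"]}
    if depth < max_depth and node.get("children"):
        pruned["children"] = [
            prune(child, max_depth, depth + 1) for child in node["children"]
        ]
    return pruned


# Made-up fragment; the third-level term is hypothetical.
sample_tree = {
    "name": "Biological sciences",
    "children": [
        {"name": "Cancer", "children": [{"name": "Example subterm"}]},
    ],
}
two_levels = prune(sample_tree, 2)
```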

    Screen Shot 2014 02 13 at 2 07 50 PM

    Some conclusions:

  • The subject taxonomy is actually a poly-hierarchy (i.e. one term can have more than one parent, so really it’s more like a directed graph). None of the libraries could handle that properly, but maybe that’s not really a limitation, because they are meant to support the visualization of trees (maybe I should play around more with force-directed graph layouts and the like).
  • The only viz that could handle all of the terms in the taxonomy is D3’s collapsible tree. Still, you don’t want to keep all the branches open at the same time! Click on the image to see it with your own eyes.

    CollapsibleTree

  • One approach to dealing with large quantities of data is obviously to show it a little bit at a time. The Bar Hierarchy seems a pretty good way to do that: it’s informative and responsive. However, it’d be nice to integrate it with other controls or visual cues that tell you what level of depth you’re currently looking at, which siblings are available, etc.

    BarHiearchy

  • Partition tables also look pretty good at providing a visual summary of the categories available; however, they tend to fail quickly when there are too many nodes, and the text is often not readable at all. In the example below I had to include only the first 3 levels of the taxonomy for it to load properly!

    TreeMapD3

    TreeMap

  • Rotating tree. Essentially a tree plotted on a circle; very useful for providing a graphical overview of the data, but it tends to become unresponsive quickly.

    RotatingTree

  • Hierarchical pie chart. A pie chart that allows zooming in so as to reveal hierarchical relationships (often also called Zoomable Sunburst). Quite nice and responsive, even with a large amount of data. The absence of labels could be a limiting feature though; you get a nice overview of the datascape but can’t really understand the meaning of each element unless you mouse over it.

    PieTree
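On the poly-hierarchy point from the first conclusion above: a quick way to see how far the taxonomy is from being a plain tree is to list the terms that have more than one parent. A minimal sketch in Python, assuming the taxonomy is available as (term, broader term) pairs; the sample pairs are invented for illustration and do not reflect the real ontology.

```python
from collections import defaultdict


def multi_parent_terms(broader_pairs):
    """Return {term: sorted list of parents} for every term that has
    more than one parent in the given (term, broader_term) pairs."""
    parents = defaultdict(set)
    for term, broader in broader_pairs:
        parents[term].add(broader)
    return {
        term: sorted(found)
        for term, found in parents.items()
        if len(found) > 1
    }


# Invented pairs; the parent assignments are hypothetical.
sample_pairs = [
    ("Cancer", "Biological sciences"),
    ("Cancer", "Health sciences"),
    ("Genetics", "Biological sciences"),
]
shared = multi_parent_terms(sample_pairs)
```

Terms reported here are the ones a tree layout has to either duplicate under each parent or attach to just one of them.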

     

    Other stuff out there that could do a better job?

     
