Is Wikipedia a valid source of scientific knowledge?

Is Wikipedia a valid source of scientific knowledge? Many would say yes. Others are still quite skeptical, or maybe just cautious about it. What seems to be the case, though (and this is what this post is about), is that Wikipedians are increasingly including references to scientific literature, and when they do, they do it right.

Based on data we’ve recently extracted from Wikipedia, it looks like the vast majority of citations to our content have been made according to established scientific practice (i.e. using DOIs). This suggests that whoever added those citations is either a scientist or has some familiarity with science.

In the context of the ontologies portal we’ve done some work aimed at surfacing links between our articles and other datasets. Wikipedia and DBpedia (an RDF database version of Wikipedia) came to our attention quite early: how much do Wikipedia articles cite the scientific content we publish? And how well do they cite it?

So here’s an interactive visualization that lets you see all incoming references from Wikipedia to the archive. The actual dataset is encoded in RDF and can be downloaded here (look for the npg-articles-dbpedia-linkset.2015-08-24.nq.tar.gz file).



About the data

In a nutshell, we simply extracted all mentions of either NPG DOIs or links using the Wikipedia APIs (for example, see all references to the DOI “10.1038/ng1285”).
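To give an idea of what spotting a DOI mention involves, here is a minimal sketch that pulls NPG DOIs out of a chunk of wikitext. It is not the code we actually ran: the regular expression, the `extract_npg_dois` name and the sample text are all assumptions made for illustration, and it only relies on NPG DOIs starting with the 10.1038 prefix (as in the example above).

```python
import re

# Assumption: an NPG DOI starts with the 10.1038 prefix and runs until
# whitespace or a character that typically closes a wikitext construct.
NPG_DOI = re.compile(r'10\.1038/[^\s|}\]<>"]+')


def extract_npg_dois(wikitext):
    """Return the distinct NPG DOIs found in wikitext, in order of appearance."""
    found = []
    for match in NPG_DOI.finditer(wikitext):
        # DOIs are case-insensitive, so lower-case them before de-duplicating;
        # also drop punctuation that often trails a DOI in running text.
        doi = match.group(0).rstrip(".,;)").lower()
        if doi not in found:
            found.append(doi)
    return found


# Hypothetical wikitext: one citation template plus a plain external link.
sample = (
    "{{cite journal |title=Example |doi=10.1038/ng1285 |journal=Nat. Genet.}} "
    "See also [https://doi.org/10.1038/NG1285 link]."
)
dois = extract_npg_dois(sample)
```

Both mentions in the sample resolve to the same DOI, so the function reports it once.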

These links have then been validated against the articles database and encoded in RDF in two ways: a cito:isCitedBy relationship links the article URI to the citing Wikipedia page, and a foaf:topic relationship links the same article URI to the corresponding DBpedia page.
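As a rough illustration of that encoding, the sketch below writes the two relationships for one link as N-Quads lines. The predicate URIs are the standard CiTO and FOAF ones; the article URI, the graph URI and the function names are hypothetical placeholders, and the DBpedia URI is derived by assuming the usual mapping from an English Wikipedia page title to `http://dbpedia.org/resource/<title>`.

```python
from urllib.parse import urlparse

CITO_IS_CITED_BY = "http://purl.org/spar/cito/isCitedBy"
FOAF_TOPIC = "http://xmlns.com/foaf/0.1/topic"


def dbpedia_uri(wikipedia_url):
    """Map a Wikipedia page URL to a DBpedia resource URI (assumed convention)."""
    title = urlparse(wikipedia_url).path.split("/wiki/", 1)[-1]
    return "http://dbpedia.org/resource/" + title


def link_quads(article_uri, wikipedia_url, graph_uri):
    """Return the two N-Quads lines describing one Wikipedia citation."""
    triples = [
        (article_uri, CITO_IS_CITED_BY, wikipedia_url),
        (article_uri, FOAF_TOPIC, dbpedia_uri(wikipedia_url)),
    ]
    return [f"<{s}> <{p}> <{o}> <{graph_uri}> ." for s, p, o in triples]


# Hypothetical URIs, for illustration only.
quads = link_quads(
    "http://example.org/articles/ng1285",
    "https://en.wikipedia.org/wiki/Gene",
    "http://example.org/graphs/wikipedia-links",
)
```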


In total there are 51309 links over 145 years.

Quite interestingly, the vast majority of these links are explicit DOI references (only ~900 were plain links without a DOI). So it seems that people do recognize the importance of DOIs even within a loosely controlled context like Wikipedia.

Using the dataset

Considering that for many Wikipedia has become the de facto largest and most cited encyclopedia out there (see the articles below), this may be an interesting dataset to analyze, e.g. to highlight citation patterns of influential articles.
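A first pass at that kind of analysis could be as simple as counting how many distinct Wikipedia pages cite each article. The sketch below does this over N-Quads lines; it assumes every term in a line is a URI in angle brackets (no literals), and the sample URIs and the `citing_page_counts` name are made up for illustration.

```python
import re
from collections import Counter

CITO_IS_CITED_BY = "http://purl.org/spar/cito/isCitedBy"

# Assumption: subject, predicate, object and optional graph are all URIs.
QUAD = re.compile(r"^<([^>]+)> <([^>]+)> <([^>]+)>(?: <([^>]+)>)? \.$")


def citing_page_counts(lines):
    """Count the distinct citing Wikipedia pages per article URI."""
    pages_by_article = {}
    for line in lines:
        match = QUAD.match(line.strip())
        if match and match.group(2) == CITO_IS_CITED_BY:
            article, page = match.group(1), match.group(3)
            pages_by_article.setdefault(article, set()).add(page)
    return Counter({a: len(p) for a, p in pages_by_article.items()})


# Hypothetical sample lines; a real run would read the extracted .nq file.
sample_lines = [
    "<http://example.org/articles/a> <http://purl.org/spar/cito/isCitedBy> <https://en.wikipedia.org/wiki/Gene> <http://example.org/g> .",
    "<http://example.org/articles/a> <http://purl.org/spar/cito/isCitedBy> <https://en.wikipedia.org/wiki/DNA> <http://example.org/g> .",
    "<http://example.org/articles/b> <http://purl.org/spar/cito/isCitedBy> <https://en.wikipedia.org/wiki/DNA> <http://example.org/g> .",
    "<http://example.org/articles/a> <http://xmlns.com/foaf/0.1/topic> <http://dbpedia.org/resource/Gene> <http://example.org/g> .",
]
counts = citing_page_counts(sample_lines)
most_cited = counts.most_common(1)
```

For a file with tens of thousands of links this fits comfortably in memory; a proper RDF library would be the safer choice if the data contained literals or escaped characters.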

Also, this could become quite useful as a data source for content enrichment: the wikipedia links could be used to drive subject tagging, or they could even be presented to readers on article pages e.g. as contextual information.


We haven’t really had time to explore any follow-up to this work, but hopefully we’ll do that soon.

All of this data is open source and freely available. So if you’re reading this and have more ideas about potential uses or just want to collaborate, please do get in touch!


This dataset is obviously just a snapshot of Wikipedia links at a specific moment in time.

If one were to use these data within a real-world application, they’d probably want to come up with some strategy to keep them up to date (e.g. monitoring the Wikipedia IRC recent changes channel).
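Short of following changes live, a simpler strategy is to re-run the extraction periodically and compare the result with the previous snapshot. The sketch below shows that comparison only; it is a swapped-in alternative to channel monitoring, not something we have built, and the `(doi, page)` pair representation and the `diff_snapshots` name are assumptions.

```python
def diff_snapshots(old_links, new_links):
    """Return (added, removed) links between two snapshots.

    Each link is assumed to be a (doi, wikipedia_page_title) tuple.
    """
    old_set, new_set = set(old_links), set(new_links)
    added = sorted(new_set - old_set)
    removed = sorted(old_set - new_set)
    return added, removed


# Hypothetical snapshots taken at two different dates.
earlier = [("10.1038/ng1285", "Gene"), ("10.1038/ng1285", "DNA")]
later = [("10.1038/ng1285", "Gene"), ("10.1038/ng1285", "Genome")]
added, removed = diff_snapshots(earlier, later)
```

Only the added and removed pairs would then need to be validated and re-encoded, rather than the whole dataset.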

Good news is, work is already happening in this space:

  • CrossRef is looking at collecting citation events from Wikipedia in real time and releasing these data freely as part of their service.
  • Altmetric scans Wikipedia for references too; however, the source data is not freely available.


Finally, here are a few interesting background readings I’ve found in the archive:

  • Wikipedia rival calls in the experts (2006)
  • Publish in Wikipedia or perish (2008)
  • Time to underpin Wikipedia wisdom (2010)

Enjoy!