Recent projects from CrossRef.org
http://www.michelepasin.org/blog/2015/06/14/recent-projects-from-crossref-org/
Sun, 14 Jun 2015 22:21:55 +0000

We spent the day with the CrossRef team in Oxford last week, talking about our recent work in the linked data space (see the nature ontologies portal) and their recent initiatives in the scholarly publishing area.

So here are a couple of interesting follow-ups from the meeting.
PS: If you want to know more about CrossRef, make sure you take a look at their website and in particular the Labs section: http://labs.crossref.org/.

Opening up article level metrics

http://det.labs.crossref.org/

CrossRef is using the open source Lagotto application (developed by PLOS https://github.com/articlemetrics/lagotto) to retrieve article metrics data from a variety of sources (e.g. Wikipedia, Twitter, etc. – see the full list here).

The model used for storing this data follows an agreed ontology containing, for example, a classification of ‘mention’ actions (viewed/saved/discussed/recommended/cited – see this paper for more details).

In a nutshell, CrossRef is planning to collect and make available the raw metrics data for all the DOIs they track, in the form of ‘DOI events’.

An interesting demo application shows the stream of DOI citations coming from Wikipedia (unsurprisingly, one of the top referrers of DOIs). More discussion in this blog post.
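As a rough illustration of how such ‘DOI events’ data could be consumed, here is a minimal Python sketch that fetches events for a DOI from a hypothetical JSON endpoint and tallies them by source. The endpoint URL and the response fields are assumptions made up for this example; they are not the actual DET or Lagotto API.

```python
import requests
from collections import Counter

# Hypothetical endpoint and response shape, for illustration only:
# the real DET / Lagotto APIs may expose events differently.
EVENTS_URL = "http://det.labs.crossref.org/api/events"  # assumed, not real

def tally_events_by_source(doi):
    """Count 'DOI events' (tweets, Wikipedia citations, saves...) per source."""
    resp = requests.get(EVENTS_URL, params={"doi": doi})
    resp.raise_for_status()
    events = resp.json().get("events", [])  # assumed field name
    return Counter(event.get("source") for event in events)

if __name__ == "__main__":
    # Placeholder DOI; output might look like:
    # Counter({'wikipedia': 12, 'twitter': 7, 'mendeley': 3})
    print(tally_events_by_source("10.1038/nature12373"))
```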


Linking dataset DOIs and publications DOIs

http://www.crosscite.org/

CrossRef has been working with DataCite towards the goal of harmonising their databases. DataCite is the second major DOI registration agency (after CrossRef) and has been focusing on assigning persistent identifiers to datasets.

This work is now gaining more momentum as DataCite is enlarging its team. So in theory it won’t be long before we see a service that makes it possible to interlink publications and datasets, which is great news.
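In the meantime, one piece of plumbing that already works across both registries is DOI content negotiation (the approach promoted on crosscite.org): the same request pattern resolves a CrossRef publication DOI and a DataCite dataset DOI to structured metadata. A minimal sketch, assuming the Python requests library and using placeholder example DOIs:

```python
import requests

def doi_metadata(doi):
    """Resolve a DOI to CSL JSON metadata via content negotiation on doi.org.

    This works for both CrossRef (publication) and DataCite (dataset) DOIs,
    since doi.org hands the request off to the appropriate registry.
    """
    resp = requests.get(
        "https://doi.org/" + doi,
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
    )
    resp.raise_for_status()
    return resp.json()

# Placeholder examples: a journal article DOI and a dataset DOI.
article = doi_metadata("10.1038/nature12373")
dataset = doi_metadata("10.5061/dryad.8515")
print(article.get("title"), "|", dataset.get("publisher"))
```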

Linking publications and funding sources

http://www.crossref.org/fundref/

FundRef provides a standard way to report funding sources for published scholarly research. This is increasingly becoming a fundamental requirement for all publicly funded research, so several publishers have agreed to help extract funding information and send it to CrossRef.
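FundRef data is exposed alongside the rest of CrossRef’s metadata, so it can be queried programmatically. As a rough sketch, the snippet below looks up whatever funder information is recorded for a given DOI via the CrossRef REST API; the example DOI is a placeholder and the exact shape of the ‘funder’ field can vary between records.

```python
import requests

def funders_for_doi(doi):
    """Return the FundRef funder entries recorded for a work, if any."""
    resp = requests.get("https://api.crossref.org/works/" + doi)
    resp.raise_for_status()
    work = resp.json()["message"]
    return [
        {"name": funder.get("name"), "awards": funder.get("award", [])}
        for funder in work.get("funder", [])
    ]

# Placeholder DOI; if funder data exists the output looks roughly like
# [{'name': 'Some Funder', 'awards': ['GRANT-123']}]
print(funders_for_doi("10.1371/journal.pone.0115253"))
```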

A recent platform built on top of FundRef is Chorus http://www.chorusaccess.org/, which enables users to discover articles reporting on funded research. Furthermore, it provides dashboards which can be used by funders, institutions, researchers, publishers, and the public for monitoring and tracking public-access compliance for articles reporting on funded research.

For example see http://dashboard.chorusaccess.org/ahrq#/breakdown


Miscellaneous news & links

– JSON-LD (a JSON-based serialisation of RDF) is being considered as a candidate data format for the next generation of the CrossRef REST API (a small illustrative sketch follows this list).

– The prototype http://www.yamz.net/ came up in discussion; quite an interesting ‘Stack Overflow meets ontology engineering’ kind of tool. Definitely worth a look, I’d say.

– Wikidata (a queryable, structured-data version of Wikipedia) seems to be gaining a lot of momentum after taking over Freebase from Google. Will it eventually replace its main rival, DBpedia?
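To make the JSON-LD item above a bit more concrete, here is a purely illustrative sketch of how a single article record might be expressed as JSON-LD using schema.org terms. This is a guess made for illustration, not the format CrossRef is actually considering; the DOI, title and author are placeholder values.

```python
import json

# Purely illustrative: a guess at how an article record might look as JSON-LD
# using schema.org vocabulary. NOT the actual CrossRef API format.
article = {
    "@context": "http://schema.org/",
    "@type": "ScholarlyArticle",
    "@id": "https://doi.org/10.5555/12345678",  # placeholder DOI
    "name": "Toward a Unified Theory of High-Energy Metaphysics",
    "author": [{"@type": "Person", "name": "Josiah Carberry"}],
    "datePublished": "2008-08-14",
    "isPartOf": {"@type": "Periodical", "name": "Journal of Psychoceramics"},
}

print(json.dumps(article, indent=2))
```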

A sneak peek at Nature.com articles’ archive
http://www.michelepasin.org/blog/2015/06/08/a-sneak-peek-at-nature-com-articles-archive/
Mon, 08 Jun 2015 21:26:58 +0000

We’re getting closer to releasing the full set of metadata covering over one million articles published by Nature Publishing Group since 1845. So here’s a sneak peek at this dataset, in the form of a simple d3.js visual summary of what will soon be available to download and reuse.

Over the last few months I’ve been working with my colleagues at Macmillan Science and Education on an open data portal that makes available to the public many of the taxonomies and ontologies we use internally for organising the content we publish.

This is part of our ongoing involvement with linked data and semantic technologies, aimed both at leveraging these tools to transform the publishing workflow into a more dynamic platform, and at contributing to the evolving web of open data with a rich dataset of scientific article metadata.

The articles dataset includes metadata about all articles published by the Nature journal, of course. But not only that: Scientific American, Nature Medicine, Nature Genetics and many other titles are also part of it (note: the full list can be downloaded as raw data here).


The first diagram shows how many articles have been published each year since 1845 (the start year of Scientific American). Nature began a couple of decades later, in 1869; the curve getting steeper in the 90s corresponds to the exponential increase in publications due to the progressive specialisation of scientific journals (e.g. all the Nature-branded titles).

The second diagram shows the increase in publication volumes on a cumulative scale: we’ve now reached 1M articles and counting!
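For what it’s worth, the data transformation behind the cumulative chart is just a running total of the per-year counts; a minimal Python sketch with made-up numbers (not the real dataset):

```python
from itertools import accumulate

# Made-up per-year article counts, purely for illustration.
per_year = {1845: 120, 1846: 150, 1869: 300, 1995: 9000, 2014: 60000}

years = sorted(per_year)
cumulative = dict(zip(years, accumulate(per_year[y] for y in years)))

print(cumulative)
# {1845: 120, 1846: 270, 1869: 570, 1995: 9570, 2014: 69570}
```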


In order to create the charts I played around with a nifty example from Mike Bostock (http://bl.ocks.org/mbostock/3902569) and added a couple of extra things to it.

The full source code is on Github.

Finally, it’s worth mentioning that this metadata had already been made available a few years ago under the CC0 license: you can still access it here. This upcoming release, though, makes it available in the context of a much more precise and stable set of ontologies, meaning that the semantics of the dataset are more clearly laid out and consistent.

So stay tuned for more! ..and if you plan/would like to reuse these datasets please do get in touch, either here or by emailing developers@nature.com.

 

Notes from the Force11 annual conference
http://www.michelepasin.org/blog/2015/01/17/notes-from-the-force11-annual-conference/
Sat, 17 Jan 2015 18:04:41 +0000

I attended the https://www.force11.org/ conference in Oxford over the last couple of days (the conference was previously called ‘Beyond the PDF’).

Force11 is a community of scholars, librarians, archivists, publishers and research funders that has arisen organically to help facilitate the change toward improved knowledge creation and sharing. Individually and collectively, we aim to bring about a change in modern scholarly communications through the effective use of information technology. [About Force 11]

Rather than the presentations, I would say that the most valuable aspect of this event is the many conversations you can have with people from different backgrounds: techies, publishers, policy makers, academics, etc.

Nonetheless, here’s a (very short and biased) list of things that seemed to stand out.

  • A talk titled Who’s Sharing with Who? Acknowledgements-driven identification of resources by David Eichmann, University of Iowa. He is working on a (seemingly very effective) method for extracting contributor roles from scientific articles.
  • This presentation describes my recent work in semantic analysis of the acknowledgement section of biomedical research articles, specifically the sharing of resources (instruments, reagents, model organisms, etc.) between the author articles and other non-author investigators. The resulting semantic graph complements the knowledge currently captured by research profiling systems, which primarily focus on investigators, publications and grants. My approach results in much finer-grained information, at the individual author contribution level, and the specific resources shared by external parties. The long-term goal for this work is unification with the VIVO-ISF-based CTSAsearch federated search engine, which currently contains research profiles from 60 institutions worldwide.

     

  • A talk titled Why are we so attached to attachments? Let’s ditch them and improve publishing by Kaveh Bazargan, head of River Valley Technologies. He demoed a prototype manuscript-tracking system that allows editors, authors and reviewers to create new versions of the same document via an online, Google-Docs-like system which has JATS XML in the background.
  • I argue that it is precisely the ubiquitous use of attachments that has held up progress in publishing. We have the technology right now to allow the author to write online and have the file saved automatically as XML. All subsequent work on the “manuscript” (e.g. copy editing, QC, etc) can also be done online. At the end of the process the XML is automatically “rendered” to PDF, Epub, etc, and delivered to the end user, on demand. This system is quicker as there are no emails or attachments to hold it up, cheaper as there is no admin involved, and more accurate as there is only one definitive file (the XML) which is the “format of record”.

     

  • Rebecca Lawrence from F1000 presented and gave me a walkthrough of a new suite of tools they’re working on. That was quite impressive I must say, especially due to the variety of features they offer: tools to organise and store references, annotate and discuss articles and web pages, import them into Word documents, etc. All packed within a nice-looking and user-friendly application. This is due to go into public beta some time in March, but you can try to get access to it sooner by signing up here.

     

  • The best poster award went to 101 Innovations in Scholarly Communication – the Changing Research Workflow. This is a project aiming to chart innovation in scholarly information and communication flows. Very inspiring and definitely worth a look.

  • Finally, I’m proud to say that the best demo award went to my own resquotes.com, an online personal quotations manager which I launched just a couple of weeks ago. Needless to say, it was great to get a vote of confidence from this community!

     

If you want more, it’s worth taking a look directly at the conference agenda and in particular the demo/poster session agenda. And hopefully see you next year in Portland, Oregon :-)

     
