Information Architecture – Parerga und Paralipomena
http://www.michelepasin.org/blog
"At the core of all well-founded belief lies belief that is unfounded" - Wittgenstein

Zero Hunger Hack Day: surfacing research about the Sustainable Development Goals program
http://www.michelepasin.org/blog/2019/02/11/zero-hunger-hack-day/ (Mon, 11 Feb 2019)

This post is about a little dashboard idea that aims to help policy makers discover research relevant to the 'zero hunger' topic, one of the themes of the Sustainable Development Goals program.

The 2030 Agenda for Sustainable Development, adopted by all United Nations Member States in 2015, provides a shared blueprint for peace and prosperity for people and the planet, now and into the future. At its heart are the 17 Sustainable Development Goals (SDGs), which are an urgent call for action by all countries – developed and developing – in a global partnership. They recognize that ending poverty and other deprivations must go hand-in-hand with strategies that improve health and education, reduce inequality, and spur economic growth – all while tackling climate change and working to preserve our oceans and forests.

For more background about this program, see also its Wikipedia page: https://en.wikipedia.org/wiki/Sustainable_Development_Goals


Springer Nature is among the many organizations taking an active role in developing scenarios and solutions to tackle these global challenges. A couple of months ago Springer Nature organized a hack day which brought together people with different backgrounds and expertise in order to come up with ideas and prototypes that could lead to further research. In particular, the focus of the hack day was on the 'zero hunger' theme.

The team I was working with developed a concept for an easy-to-use, dashboard-like tool that busy policy makers could use to quickly gather information about researchers or institutions they'd want to consult with.


In order to make this idea more tangible I ended up building a little prototype, which scans scholarly documents and pulls out information (potentially) related to the 'zero hunger' topic and its sub-topics, essentially following the keyword structure specified in the Sustainable Development Goals document.

The prototype is available here: http://hacks2019.michelepasin.org/zerohunger/
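
Under the hood, the matching is essentially keyword lookup. Here's a toy Python sketch of the idea – the keyword lists are invented placeholders rather than the official SDG terms, and the real prototype of course works on documents fetched via the Dimensions API (more on that below):

    # Toy example: tag documents with 'zero hunger' sub-topics via keyword matching.
    # The keyword lists below are invented placeholders, not the official SDG terms.
    SDG2_KEYWORDS = {
        "food security": ["food security", "food insecurity"],
        "malnutrition": ["malnutrition", "stunting", "undernourishment"],
        "sustainable agriculture": ["sustainable agriculture", "smallholder farmers"],
    }

    def tag_document(text):
        """Return the sub-topics whose keywords appear in the text."""
        text = text.lower()
        return [topic for topic, words in SDG2_KEYWORDS.items()
                if any(w in text for w in words)]

    print(tag_document("Smallholder farmers are key to reducing undernourishment."))
    # ['malnutrition', 'sustainable agriculture']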


This experiment also gave me an opportunity to learn about the Dimensions.ai API, which provides a domain-specific language (DSL) for querying the Dimensions database: a state-of-the-art scholarly platform containing millions of linked metadata records about publications, grants, patents, clinical trials and policy documents (for more background about Dimensions, see this blog post and this white paper).


The API itself is behind a paywall, but if you are curious about it, the documentation is available online.

It's a fantastic resource, intuitive and easy to use yet powerful and feature-rich, so I am pretty sure I'll be writing more about it.
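
To give a flavour of what a DSL query looks like, here's a minimal Python sketch. The endpoint URLs, the token-based auth flow and the field names are assumptions based on the public documentation rather than anything shown in this post, so treat it as illustrative only:

    import requests

    API = "https://app.dimensions.ai/api"  # assumed base URL for the Dimensions API

    # 1. exchange your credentials for a session token (the service is paywalled)
    token = requests.post(f"{API}/auth.json", json={"key": "YOUR-API-KEY"}).json()["token"]

    # 2. run a DSL query: full-text search for 'zero hunger' across publications
    query = 'search publications for "\\"zero hunger\\"" return publications[title + doi + year] limit 20'
    resp = requests.post(f"{API}/dsl.json", data=query.encode(),
                         headers={"Authorization": f"JWT {token}"})

    for pub in resp.json().get("publications", []):
        print(pub.get("year"), pub.get("title"))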

Stay tuned for more!

Ontospy 1.9.8 released
http://www.michelepasin.org/blog/2019/01/03/ontospy-1-9-8-released/ (Thu, 03 Jan 2019)

Ontospy version 1.9.8 has just been released and it contains tons of improvements and new features. Ontospy is a lightweight open-source Python library and command line tool for working with vocabularies encoded in the RDF family of languages.

Over the past month I've been working on a new version of Ontospy, which is now available for download on PyPI.

 

What’s new in this version

  • The library to generate ontology documentation (as HTML or markdown) is now included within the main Ontospy distribution. Previously this library was distributed separately under the name ontodocs. The main problem with that approach was that keeping the two projects in sync was becoming too time-consuming, so I've decided to merge them. Note: you can still choose whether or not to include this extra library when installing.
  • You can print out the raw RDF data being returned via a command line argument.
  • One can decide whether or not to include ‘inferred’ schema definitions extracted from an RDF payload. The inferences are pretty basic for now (e.g. the object of rdf:type statements is taken to be a type), but this allows you, for example, to quickly dereference a DBpedia URI and pull out all the types/predicates being used.
  • The online documentation is now hosted on GitHub Pages and available within the /docs folder of the project.
  • Improved support for JSON-LD and a new utility for quickly sending JSON-LD data to the online playground tool.
  • Several other bug fixes and improvements, in particular to the interactive ontology exploration mode (the shell command) and to the visualization library (new visualizations are available, albeit still in alpha state). A quick usage sketch follows below.
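
For those who prefer the library to the command line, here's a minimal sketch of programmatic usage. Attribute names reflect recent releases and may differ slightly between versions (older ones used model.classes / model.properties, for instance):

    import ontospy

    # load and scan a vocabulary from the web (FOAF here, purely as an example)
    model = ontospy.Ontospy("http://xmlns.com/foaf/0.1/")

    print(len(model.all_classes), "classes found")
    print(len(model.all_properties), "properties found")

    # list the first few classes discovered in the vocabulary
    for c in model.all_classes[:10]:
        print(c.uri)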

Exploring scholarly publications using DBpedia concepts: an experiment
http://www.michelepasin.org/blog/2018/11/23/exploring-scholarly-publications-via-dbpedia/ (Fri, 23 Nov 2018)

This post is about a recent prototype I developed, which lets you explore a sample collection of Springer Nature publications using subject tags automatically extracted from DBpedia.

    DBpedia is a crowd-sourced community effort to extract structured content from the information created in various Wikimedia projects. This structured information resembles an open knowledge graph (OKG) which is available for everyone on the Web.

    Datasets

    The dataset I used is the result of a collaboration with Beyza Yaman, a researcher working with the DBpedia team in Leipzig, who used the SciGraph datasets as input to the DBPedia-Spotlight entity-mining tool.

By using DBpedia Spotlight we automatically associated DBpedia subject terms with a subset of the abstracts available in the SciGraph dataset (around 90k abstracts from 2017 publications).

The prototype lets you search the Springer Nature publications using these subject terms.

Also, DBpedia subjects include definitions and semantic relationships (which we are currently not using, but one can imagine how they could be the raw material for generating more thematic ‘pathways’).

    Results: serendipitous discovery of scientific publications

The results are pretty encouraging: despite the fact that the extracted concepts are sometimes only marginally relevant (or not relevant at all), the breadth and depth of the DBpedia classification makes the interactive exploration quite interesting and serendipitous.

You can judge for yourself: the tool is available here: http://hacks2019.michelepasin.org/dbpedialinks

The purpose of this prototype is to evaluate the quality of the tagging and generate ideas for future applications. So any feedback or ideas are very welcome!

We are working with Beyza to write up the results of this investigation as a research paper. The data and software are already freely available on GitHub.

    A couple of screenshots:

E.g. see the topic ‘artificial intelligence’:

[Screenshot]

One can add more subjects to a search in order to ‘zoom in’ on a results set, e.g. by adding ‘China’ to the search:

[Screenshot]

    Implementation details
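
As a rough illustration of the tagging step described above, here is a minimal sketch of sending an abstract to the public DBpedia Spotlight annotation endpoint (the abstract and the confidence threshold are made up for the example; the actual pipeline was of course more involved):

    import requests

    SPOTLIGHT = "https://api.dbpedia-spotlight.org/en/annotate"

    abstract = ("Deep learning methods are increasingly applied to crop yield "
                "prediction and food security monitoring.")

    resp = requests.get(SPOTLIGHT,
                        params={"text": abstract, "confidence": 0.5},
                        headers={"Accept": "application/json"})

    # each spotted entity comes back as a 'Resource' with its DBpedia URI
    for res in resp.json().get("Resources", []):
        print(res["@surfaceForm"], "->", res["@URI"])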

SN SciGraph: latest website release makes it easier to discover related content
http://www.michelepasin.org/blog/2018/08/01/sn-scigraph-latest-website-release-make-it-easier-to-discover-related-content/ (Wed, 01 Aug 2018)

The latest release of the SN SciGraph Explorer website includes a number of new features that make it easier to navigate the scholarly knowledge graph and discover items of interest.

Graphs are essentially composed of two kinds of objects: nodes and edges. Nodes are like the stations on a train map, while edges are the links that connect the different stations.

    Of course one wants to be able to move from station to station in any direction! Similarly in a graph one wants to be able to jump back and forth from node to node using any of the links provided. That’s the beauty of it!

    Although the underlying data allowed for this, the SN SciGraph Explorer website wasn’t fully supporting this kind of navigation. So we’ve now started to add a number of ‘related objects’ sections that reveal these pathways more clearly.

    For example, now it’s much easier to get to the organizations and grants an article relates to:

[Screenshot]

    Or, for a book edition, to see its chapters and related organizations:

[Screenshot: book edition links]

And much more. Take a look at the site yourself to find out.

Finally, we improved the linked data visualization included in every page by adding distinctive icons for each object type, so as to make it easier to understand the immediate network of an object at a glance. E.g. see this grant:

[Screenshot: grant diagram]

    SN SciGraph is primarily about opening up new opportunities for open data and metadata enthusiasts who want to do more things with our content, so we hope that these additions will make discovering data items easier and more fun.

    Any comments? We’d love to hear from you. Otherwise, thanks for reading and stay tuned for more updates.

PS: this post was published on the SN Research Data space too.

Exploring SciGraph data using JSON-LD, Elastic Search and Kibana
http://www.michelepasin.org/blog/2017/04/06/exploring-scigraph-data-using-elastic-search-and-kibana/ (Thu, 06 Apr 2017)

Hello there data lovers! In this post you can find some information on how to download and make some sense of the scholarly dataset recently made available by the Springer Nature SciGraph project, using the freely available Elasticsearch suite of software.

    A few weeks ago the SciGraph dataset was released (full disclosure: I’m part of the team who did that!). This is a high quality dataset containing metadata and abstracts about scientific articles published by Springer Nature, research grants related to them plus other classifications of this content.


This release of the dataset includes the last 5 years of content – that's already an impressive 32 gigs of data you can get your hands on. So in this post I'm going to show how to do that, in particular by transforming the data from the RDF graph format it is published in into a JSON format which is more suited to application development and analytics.

We will be using two free-to-download products, GraphDB and Elasticsearch, so you'll have to install them if you haven't got them already. But no worries, that's pretty straightforward, as you'll see below.

    1. Hello SciGraph Linked Data

    First things first, we want to get hold of the SciGraph RDF datasets of course. That’s pretty easy, just head over to the SciGraph downloads page and get the following datasets:

    • Ontologies: the main schema behind SciGraph.
    • Articles – 2016: all the core articles metadata for one year.
    • Grants: grants metadata related to those articles.
    • Journals: the full Springer Nature journals catalogue.
    • Subjects: classification of research areas developed by Springer Nature.

That's pretty much everything – the only thing to note is that we're getting just one year's worth of articles, as that's enough for the purpose of this exercise (~300k articles from 2016).

    Next up, we want to get a couple of other datasets SciGraph depends on:

    That’s it! Time for a cup of coffee.

2. Python to the rescue

We will be doing a bit of data manipulation in the next sections and Python is a great language for that sort of thing. Here's what we need to get going:

    1. Python. Make sure you have Python installed and also Pip, the Python package manager (any Python version above 2.7 should be ok).
    2. GitHub project. I’ve created a few scripts for this tutorial, so head over to the hello-scigraph project on GitHub and download it to your computer. Note: the project contains all the Python scripts needed to complete this tutorial, but of course you should feel free to modify them or write from scratch if you fancy it!
    3. Libraries. Install all the dependencies for the hello-scigraph project to run. You can do that by cd-ing into the project folder and running pip install -r requirements.txt (ideally within a virtual environment, but that’s up to you).

    3. Loading the data into GraphDB

    So, you should have by now 8 different files containing data (after step 1 above). Make sure they’re all in the same folder and that all of them have been unzipped (if needed), then head over to the GraphDB website and download the free version of the triplestore (you may have to sign up first).

    The online documentation for GraphDB is pretty good, so it should be easy to get it up and running. In essence, you have to do the following steps:

    1. Launch the application: for me, on a mac, I just had to double click the GraphDB icon – nice!
    2. Create a new repository: this is the equivalent of a database within the triplestore. Call this repo “scigraph-2016” so that we're all synced for the following steps.

Next thing, we want a script to load our RDF files into this empty repository. So cd into the directory containing the GitHub project (from step 2) and run the following command:

    python -m hello-scigraph.loadGraphDB ~/scigraph-downloads/

    The “loadGraphDB” script goes through all RDF files in the “scigraph-downloads” directory and loads them into the scigraph-2016 repository (note: you must replace “scigraph-downloads” with the actual path to the folder you downloaded content in step 1 above).

So, to recap: this script is now loading more than 35 million triples into your local graph database. Don't be surprised if it takes some time (in particular for the ‘articles-2016’ dataset, by far the biggest), so it's a good moment to take a break or do something else.

Once the process is finished, you should be able to explore your data via the GraphDB workbench. It'll look something like this:

[Screenshot: GraphDB class hierarchy]
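
As a quick sanity check, you can also fire a SPARQL query at the repository from Python. The sketch below assumes GraphDB's default port (7200) and the repository name chosen earlier; it simply counts instances per class:

    import requests

    ENDPOINT = "http://localhost:7200/repositories/scigraph-2016"

    query = """
    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    SELECT ?type (COUNT(?s) AS ?n)
    WHERE { ?s rdf:type ?type }
    GROUP BY ?type
    ORDER BY DESC(?n)
    LIMIT 10
    """

    resp = requests.get(ENDPOINT, params={"query": query},
                        headers={"Accept": "application/sparql-results+json"})

    for row in resp.json()["results"]["bindings"]:
        print(row["n"]["value"], row["type"]["value"])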

    4. Creating an Elasticsearch index

    We’re almost there. Let’s head over to the Elasticsearch website and download it. Elasticsearch is a powerful, distributed, JSON-based search and analytics engine so we’ll be using it to build an analytics dashboard for the SciGraph data.

Make sure Elastic is running (run bin/elasticsearch, or bin\elasticsearch.bat on Windows), then cd into the hello-scigraph Python project (from step 2) in order to run the following script:

    python -m hello-scigraph.loadElastic

    If you take a look at the source code, you’ll see that the script does the following:

    1. Articles loading: extracts article references from GraphDB in batches of 200.
    2. Articles metadata extraction: for each article, we pull out all relevant metadata (e.g. title, DOI, authors) plus related information (e.g. author GRID organizations, geo locations, funding info etc..).
    3. Articles metadata simplification: some intermediate nodes coming from the original RDF graph are dropped and replaced with a flatter structure which uses a temporary dummy schema (prefix es: <http://elastic-index.scigraph.com/>). It doesn't matter what we call that schema; what matters is that we simplify the data we put into the Elasticsearch index. That's because while the graph layer is supposed to facilitate data integration, and hence benefits from a rich semantic representation of information, the search layer is more geared towards performance and retrieval, so a leaner information structure can dramatically speed things up there.
    4. JSON-LD transformation: the simplified RDF data structure is serialized as JSON-LD – one of the many serializations available for RDF. JSON-LD is of course valid JSON, meaning that we can put it into Elastic right away. This is a bit of a shortcut actually: for more fine-grained control over what the JSON looks like, it's probably better to transform the data into JSON using some ad-hoc mechanism. But for the purpose of this tutorial it's more than enough (see the sketch after this list).
    5. Elastic index creation. Finally, we can load the data into an Elastic index called – guess what – “hello-scigraph”.
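
Not the actual hello-scigraph code, but a stripped-down sketch of what steps 1-5 boil down to. The CONSTRUCT query and the SciGraph property names are simplified and illustrative, and the Elasticsearch client call signatures vary a bit between client versions:

    import json
    import requests
    from rdflib import Graph
    from elasticsearch import Elasticsearch

    GRAPHDB = "http://localhost:7200/repositories/scigraph-2016"
    es = Elasticsearch()  # assumes a local Elastic node on the default port

    # 1-3. pull a flattened description of a batch of articles out of GraphDB
    #      (illustrative property names; the real script builds a much richer doc)
    construct = """
    PREFIX sg: <http://scigraph.springernature.com/ontologies/core/>
    PREFIX es: <http://elastic-index.scigraph.com/>
    CONSTRUCT { ?article es:title ?title ; es:doi ?doi . }
    WHERE     { ?article a sg:Article ; sg:title ?title ; sg:doi ?doi . }
    LIMIT 200
    """
    turtle = requests.get(GRAPHDB, params={"query": construct},
                          headers={"Accept": "text/turtle"}).text

    # 4. serialize the simplified graph as JSON-LD (built into rdflib >= 6;
    #    older versions need the rdflib-jsonld plugin)
    g = Graph().parse(data=turtle, format="turtle")
    docs = json.loads(g.serialize(format="json-ld"))

    # 5. index each flattened document into the 'hello-scigraph' index
    for doc in docs:
        es.index(index="hello-scigraph", body=doc)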

    Two more things to point out:

    • Long queries. The Python script enforces a 60 seconds time-out on the GraphDB queries, so in case things go wrong with some articles data the script should keep running.
    • Memory issues. The script stops for 10 seconds after each batch of 200 articles (time.sleep(10)). I had to do this to prevent GraphDB on my laptop from running out of memory. Time to catch some breath!

That's it! Time for another break now. A pretty long one actually – loading all the data took around 10 hours on my (rather average-spec'd) laptop, so you may want to do that overnight or get hold of a faster machine/server.

    Eventually, once the loading script is finished, you can issue this command from the command line to see how much data you’ve loaded into the Elastic index  “hello-scigraph”. Bravo!

    curl -XGET 'localhost:9200/_cat/indices/'

    5. Analyzing the data with Kibana

Loading the data into Elastic already opens up a number of possibilities – check out the search APIs for some ideas – however there's an even quicker way to analyze the data: Kibana. Kibana is another free product in the Elastic suite, which provides an extensible user interface for configuring and managing all aspects of the Elastic Stack.

    So let’s get started with Kibana: download it and set it up using the online instructions, then point your browser at http://localhost:5601 .

    You’ll get to the Kibana dashboard which shows the index we just created. Here you can perform any kind of searches and see the raw data as JSON.

What's even more interesting is the visualization tab. Results of searches can be rendered as line charts, pie charts etc., and more dimensions can be added via ‘buckets’. See below for a quick example, but really, the possibilities are endless!
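
If you prefer the raw API to the UI, Kibana's ‘buckets’ map directly onto Elasticsearch aggregations. A small sketch (the field name below is just an assumption – use whatever fields actually ended up in your index):

    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    # count documents per value of a field -- the same thing a Kibana pie chart
    # with a 'terms' bucket would show
    resp = es.search(index="hello-scigraph", body={
        "size": 0,
        "aggs": {"by_type": {"terms": {"field": "@type.keyword", "size": 10}}},
    })

    for bucket in resp["aggregations"]["by_type"]["buckets"]:
        print(bucket["key"], bucket["doc_count"])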

    Conclusion

    This post should have given you enough to realise that:

    1. The SciGraph dataset contains an impressive amount of high-quality scholarly publications metadata which can be used for things like literature search, research statistics etc.
    2. Even if you're not familiar with Linked Data and the RDF family of languages, it's not hard to get going with a triplestore and then transform the data into a more widely used format like JSON.
    3. Finally, Elasticsearch and especially Kibana are fantastic tools for data analysis and exploration! Needless to say, in this post I’ve just scratched the surface of what could be done with it.

    Hope this was fun, any questions or comments, you know the drill :-)

Ontospy v. 1.6.7
http://www.michelepasin.org/blog/2016/06/12/ontospy-v-1-6-7/ (Sun, 12 Jun 2016)

A new and improved version of OntoSpy (1.6.7) is available online. OntoSpy is a lightweight Python library and command line tool for inspecting and visualizing vocabularies encoded in the RDF family of languages.

    This update includes support for Python 3, plus various other improvements that make it easier to query semantic web vocabularies using OntoSpy’s interactive shell module. To find out more about Ontospy:

  • Docs: http://ontospy.readthedocs.org
  • CheeseShop: https://pypi.python.org/pypi/ontospy
  • Github: https://github.com/lambdamusic/ontospy

  • Here's a short video showing a typical session with the OntoSpy repl:

    What’s new in this release

    The main new features of version 1.6.7:

  • added support for Python 3.0 (thanks to a pull request from https://github.com/T-002)
  • the import [file | uri | repo | starter-pack] command, which makes it easier to load models into the local repository. You can import a local RDF file or a web resource via its URI. The repo option lets you select an ontology from those available in a couple of online public repositories; finally, the starter-pack option can be used to automatically download a few widely used vocabularies (e.g. FOAF, DC, etc.) into the local repository – mostly useful after a fresh installation in order to get started (see the example session after this list)
  • the info [toplayer | parents | children | ancestors | descendants] command, which lets you print more detailed info about entities
  • added an incremental search mode based on text patterns e.g. to reduce the options returned by the ls command
  • calling the serialize command at ontology level now serializes the whole graph
  • made the caching functionality version-dependent
  • added json serialization option (via rdflib-jsonld)
  • Install/update simply by typing pip install ontospy -U in your terminal window (see this page for more info).
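
Just to put the commands above in context, an illustrative shell session might look roughly like this (the command names are the ones listed above; the exact launcher and prompts may differ between versions):

    $ pip install ontospy -U            # install or update
    $ ontospy shell                     # enter the interactive shell
    > import starter-pack               # grab a few common vocabularies (FOAF, DC, ...)
    > import uri http://xmlns.com/foaf/0.1/
    > ls                                # list what's now in the local repository
    > info toplayer                     # top-level classes of the active ontology
    > serialize                         # print the whole model back out as RDF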

    Coming up next

I'd really like to add more output visualisations, e.g. VivaGraphJS or one of the visualizations from the JavaScript InfoVis Toolkit.

Probably even more interesting, I'd like to refactor the code generating visualisations so that it allows people to develop their own via a standard API and then publish them on GitHub.

    Lastly, more support for instance management: querying and creating instances from any loaded ontology.

    Of course, any comments or suggestions are welcome as usual – either using the form below or via GitHub. Cheers!

Nature.com Subjects Stream Graph
http://www.michelepasin.org/blog/2016/01/03/nature-com-subjects-stream-graph/ (Sun, 03 Jan 2016)

The nature.com subjects stream graph displays the distribution of content across the subject areas covered by the nature.com portal.

    This is an experimental interactive visualisation based on a freely available dataset from the nature.com linked data platform, which I’ve been working on in the last few months.

[Screenshot: the stream graph]

    The main visualization provides an overview of selected content within the level 2 disciplines of the NPG Subjects Ontology. By clicking on these, it is then possible to explore more specific subdisciplines and their related articles.

    For those of you who are not familiar with the Subjects Ontology: this is a categorization of scholarly subject areas which are used for the indexing of content on nature.com. It includes subject terms of varying levels of specificity such as Biological sciences (top level), Cancer (level 2), or B-2 cells (level 7). In total there are more than 2500 subject terms, organized into a polyhierarchical tree.

Starting in 2010, the various journals published on nature.com have adopted the subject ontology to tag their articles (note: different journals started doing this at different times, hence the variation in the graphs' starting dates).


The visualization makes use of various d3.js modules, plus some simple customizations here and there. The hardest part of the work was putting the different page components together so as to achieve a more fluent ‘narrative’, gradually zooming into the data.

The back end is a Django web application with a relational database. The original dataset is published as RDF, so in order to use the Django APIs I've recreated it as a relational model. That also let me add a few extra data fields containing search indexes (e.g. article counts per month), so as to make the stream graph load faster.
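
For the curious, the relational model behind it is roughly of this shape – a hypothetical sketch, not the actual code, with made-up model and field names:

    from django.db import models

    class Subject(models.Model):
        """A term from the NPG Subjects Ontology (e.g. 'Cancer', level 2)."""
        uri = models.URLField(unique=True)
        name = models.CharField(max_length=200)
        level = models.PositiveSmallIntegerField()
        parents = models.ManyToManyField("self", symmetrical=False,
                                         related_name="children")

    class MonthlyArticleCount(models.Model):
        """Pre-computed counts, one row per subject per month, so the stream
        graph can be drawn without aggregating articles on the fly."""
        subject = models.ForeignKey(Subject, on_delete=models.CASCADE,
                                    related_name="monthly_counts")
        year = models.PositiveSmallIntegerField()
        month = models.PositiveSmallIntegerField()
        count = models.PositiveIntegerField()

        class Meta:
            unique_together = ("subject", "year", "month")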

    Comments or suggestions, as always very welcome.

Another experiment with Wittgenstein's Tractatus
http://www.michelepasin.org/blog/2015/09/21/another-experiment-with-wittgensteins-tractatus/ (Mon, 21 Sep 2015)

Spent some time hacking over the weekend. And here's the result: a minimalist interactive version of Wittgenstein's Tractatus.


    The Tractatus Logico-Philosophicus is a text I’ve worked with already in the past.

    This time I was intrigued by the simple yet super cool typed.js javascript library, which simulates animated typing.


After testing it out a bit I realised that this approach allows you to focus on the text with more attention than having it all displayed at once.

Since the words appear one at a time, it feels more like a verbal dialogue than reading. As a consequence, the way the meaning of the text is perceived changes too.

    Slower, deeper. Almost like meditating. Try it out here.

    Credits

  • the typed.js javascript library.
  • the Tractatus Logico-Philosophicus by Wittgenstein

Is wikipedia a valid source of scientific knowledge?
http://www.michelepasin.org/blog/2015/09/02/is-wikipedia-a-valid-source-of-scientific-knowledge/ (Wed, 02 Sep 2015)

Is Wikipedia a valid source of scientific knowledge? Many would say yes. Others are still quite skeptical, or maybe just cautious about it. What seems to be the case though – and this is what this post is about – is that wikipedians are increasingly including references to scientific literature, and when they do it, they do it right.

Based on data we've recently extracted from Wikipedia, it looks like the vast majority of citations to nature.com content have been made according to established scientific practice (i.e. using DOIs), which makes you think that whoever added those citations is either a scientist or has some familiarity with science.

In the context of the nature.com ontologies portal we've done some work aimed at surfacing links between our articles and other datasets. Wikipedia and DBpedia (an RDF database version of Wikipedia) came to our attention quite soon: how much do Wikipedia articles cite scientific content published on nature.com? Also, how well do they cite it?

    So here’s an interactive visualization that lets you see all incoming references from Wikipedia to the nature.com archive. The actual dataset is encoded in RDF and can be downloaded here (look for the npg-articles-dbpedia-linkset.2015-08-24.nq.tar.gz file).

[Screenshot: the interactive visualization]

     

    About the data

In a nutshell, what we did was simply extract all mentions of either NPG DOIs or nature.com links using the Wikipedia APIs (for example, see all references to the DOI “10.1038/ng1285”).

    These links have then been validated against the nature.com articles database and encoded in RDF in two ways: a cito:isCitedBy relationship links the article URI to the citing Wikipedia page, and a foaf:topic relationship links the same article URI to the corresponding DBpedia page.

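
A minimal sketch of those two steps – finding Wikipedia pages that mention a DOI and encoding the hits as RDF. The Wikipedia search API call is real; the nature.com article URI and the overall shape of the script are illustrative:

    import requests
    from rdflib import Graph, Namespace, URIRef
    from rdflib.namespace import FOAF

    CITO = Namespace("http://purl.org/spar/cito/")
    WIKI_API = "https://en.wikipedia.org/w/api.php"

    doi = "10.1038/ng1285"
    article = URIRef("http://www.nature.com/articles/ng1285")  # illustrative URI

    # 1. find Wikipedia pages whose text mentions the DOI
    resp = requests.get(WIKI_API, params={"action": "query", "list": "search",
                                          "srsearch": f'"{doi}"', "format": "json"})
    pages = [hit["title"] for hit in resp.json()["query"]["search"]]

    # 2. encode each hit: cito:isCitedBy -> the Wikipedia page,
    #    foaf:topic -> the corresponding DBpedia resource
    g = Graph()
    g.bind("cito", CITO)
    for title in pages:
        slug = title.replace(" ", "_")
        g.add((article, CITO.isCitedBy, URIRef(f"https://en.wikipedia.org/wiki/{slug}")))
        g.add((article, FOAF.topic, URIRef(f"http://dbpedia.org/resource/{slug}")))

    print(g.serialize(format="turtle"))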

    In total there are 51309 links over 145 years.

    Quite interestingly, the vast majority of these links are explicit DOI references (only ~900 were links to nature.com without a DOI). So, it seems that people do recognize the importance of DOIs even within a loosely controlled context like wikipedia.

    Using the dataset

Considering that for many Wikipedia has become the de facto largest and most cited encyclopedia out there (see the articles below), this may be an interesting dataset to analyze, e.g. to highlight citation patterns of influential articles.

    Also, this could become quite useful as a data source for content enrichment: the wikipedia links could be used to drive subject tagging, or they could even be presented to readers on article pages e.g. as contextual information.


    We haven’t really had time to explore any follow up on this work, but hopefully we’ll do that soon.

    All of this data is open source and freely available on nature.com/ontologies. So if you’re reading this and have more ideas about potential uses or just want to collaborate, please do get in touch!

    Caveats

This dataset is obviously just a snapshot of Wikipedia links at a specific moment in time.

If one were to use these data within a real-world application, one would probably want to come up with some strategy to keep them up to date (e.g. monitoring the Wikipedia IRC recent changes channel).

    Good news is, work is already happening in this space:

  • CrossRef is looking at collecting citation events from Wikipedia in real time and release these data freely as part of their service e.g. see http://crosstech.crossref.org/2015/05/coming-to-you-live-from-wikipedia.html
  • Altmetric scans wikipedia for references too e.g. see http://nature.altmetric.com/details/961190/wikipedia and http://www.altmetric.com/blog/new-source-alert-wikipedia/, however the source data is not freely available.

    Readings

    Finally, here are a couple of interesting background readings I’ve found in the nature.com archive:

  • Wikipedia rival calls in the experts (2006) http://www.nature.com/nature/journal/v443/n7111/full/443493a.html
  • Publish in Wikipedia or perish (2008) http://www.nature.com/news/2008/081216/full/news.2008.1312.html
  • Time to underpin Wikipedia wisdom (2010) http://www.nature.com/nature/journal/v468/n7325/full/468765c.html
Enjoy!

Italian public spending data: a review
http://www.michelepasin.org/blog/2014/12/22/italian-public-spending-data/ (Mon, 22 Dec 2014)

The Italian government recently announced a new portal containing data on public spending: http://soldipubblici.gov.it. This is obviously great news; the website is still in beta though, so in what follows I'd like to put forward a few (hopefully constructive) comments and wishes for how it could or should be developed further.

Incidentally, I recently ran into Ian Makgill from spendnetwork.com, a London startup funded by the Open Data Institute that uses open public data to create the first comprehensive and publicly available repository for government transaction data.

    We ended up chatting about the situation with open government data in Italy. To be honest I’m no expert on the matter but a couple of names quickly came to mind.

    First, the excellent OpenPolis association. Their mission is to enable free access to public information on political candidates, elected representatives, and legislative activity thus promoting transparency and the democratic participation of Italian citizens.

One of their most successful projects is Open Parliament (similar in scope to theyworkforyou.com in the UK). More recently the Open Bilanci platform was created to let citizens search and compare the budgets and expenses of municipalities (local boroughs) in Italy.


Second, the ongoing work done by the Open Knowledge Foundation, which also has an Italian chapter. For example one of its long-standing projects, openspending.org, contains references to several datasets about Italy's public spending.

    Another useful resource is the Italia open data census, a community driven initiative to compare the progress made by different cities and local areas in releasing Open Data.


     

    Soldipubblici.gov.it: a first look

    It should be clear by now that the people behind soldipubblici.gov.it are not the only ones looking at increasing transparency and democracy by releasing open data.

What's not clear at all, though, is whether these different groups are talking to each other – which would seem the most obvious thing to do before embarking on a new enterprise like this, especially since soldipubblici.gov.it is strikingly similar (in scope) to the aforementioned Open Bilanci portal. I'm sure that both the folks at the OKFN and openpolis.org would be interested in getting their hands on these data so as to integrate them with their existing services.

Nonetheless, it's great to hear that more is happening in this space. Even more so because it's the Italian government who's taking responsibility for this (as it should be).


That's the good news. The bad news is that, from a data perspective, there isn't much you can do with soldipubblici.gov.it in this beta release. If one wants to put these data to use, there are various key elements missing, I believe. Here are a few ideas:

  • There is no way to browse/review the data. Search is good, but if you have no idea what to search for (e.g. simply because you don't know what it's called), then you're fundamentally stuck. The system actually features a more advanced ‘semantic’ search, which essentially augments the scope of the keywords you put in via synonyms and related terms. That's nice, but it's no substitute for a good old yellow-pages-like category browser. You know, just to get the hang of what's in the box before opening it.
  • You can’t download the data. To be fair, the FAQs clearly state that this feature is still being worked on. Fine – I guess they’re talking about some nifty mechanism to select-collect-&-download specific datasets one is interested in. However I do wonder why one cannot download the entire dataset already. At the end of the day, that data is A) already made available via the current user interface; B) public and (in theory) already available on a different website called SIOPE (available is a big word though – I should probably say buried).
  • The visualisation app is nice but very limited. Data doesn’t become information unless you give it some meaningful context. This tool is a great idea but it’d be enormously more useful if you could decide yourself what to plot on the graph (e.g. which years, of which data sets) depending on your research questions i.e. your context. Moreover, you want to be able to make comparisons between different datasets etc etc.. All things that Open Bilanci does already pretty well.
  • There’s no data about the beneficiaries of the public expenses. Not sure what the challenges are here, or whether this is feasible at all. But it’d be great to have this extra piece of information, for transparency’s sake. For example, on spendnetwork.com you can easily see which are the list of suppliers for the London Borough Council of Ealing expenses.

    Conclusion: how serious do you want to be about open data?

    This is an inspiring start and I can’t wait to see it being developed further. Especially if it gets developed with real end-users in mind!
    To that end, it’s useful to bring up what the OKFN has declared to be the key features of data openness:

  • Availability and access: the data must be available as a whole and at no more than a reasonable reproduction cost, preferably by downloading over the internet. The data must also be available in a convenient and modifiable form.
  • Reuse and redistribution: the data must be provided under terms that permit reuse and redistribution including the intermixing with other datasets. The data must be machine-readable.
  • Universal participation: everyone must be able to use, reuse and redistribute — there should be no discrimination against fields of endeavour or against persons or groups. For example, ‘non-commercial’ restrictions that would prevent ‘commercial’ use, or restrictions of use for certain purposes (e.g. only in education), are not allowed.

Personally, I can't stress strongly enough how useful it would be to be able to access the raw data: Excel, CSV, a REST API, or even better a Linked Data API.
A data-level access point would turn this nice-looking but essentially siloed website into an open resource which thousands of data journalists or data scientists (of any kind) could build upon.

Are you looking forward to seeing this happen? I am!
