More Jupyter notebooks: pyvis and networkx
http://www.michelepasin.org/blog/2020/08/06/more-jupyter-notebooks-pyvis-and-networkx/ – Thu, 06 Aug 2020

Lately I’ve been spending more time creating Jupyter notebooks that demonstrate how to use the Dimensions API for research analytics. In this post I’ll talk a little bit about two cool Python technologies I’ve discovered for working with graph data: pyvis and networkx.

pyvis and networkx

The networkx and pyvis libraries are used for generating and visualizing network data, respectively.

Pyvis is fundamentally a Python wrapper around the popular JavaScript vis.js library. Networkx, on the other hand, is a pretty sophisticated package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks.

>>> from pyvis.network import Network
>>> import networkx as nx
# generate a generic networkx graph instance
>>> nx_graph = nx.Graph()
# add some nodes and edges (nodes must exist before we can attach attributes to them)
>>> nx_graph.add_nodes_from([1, 2, 3])
>>> nx_graph.add_edges_from([(1, 2), (2, 3)])
>>> nx_graph.nodes[1]['title'] = 'Number 1'
>>> nx_graph.nodes[1]['group'] = 1
>>> nx_graph.nodes[3]['title'] = 'I belong to a different group!'
>>> nx_graph.nodes[3]['group'] = 10
>>> nx_graph.add_node(20, size=20, title='couple', group=2)
>>> nx_graph.add_node(21, size=15, title='couple', group=2)
>>> nx_graph.add_edge(20, 21, weight=5)
>>> nx_graph.add_node(25, size=25, label='lonely', title='lonely node', group=3)
# instantiate a pyvis network
>>> nt = Network("500px", "500px")
# populate the pyvis network from the networkx instance
>>> nt.from_nx(nx_graph)
>>> nt.show("nx.html")

It took me a little while to familiarise myself with the libraries’ concepts and to generate some basic graphs. So the tutorials linked below are meant to provide some reusable code building blocks for working with these tools.

Once you get the hang of it, though, the fun part begins. Which data variables are best represented in the graph? What color-coding strategy makes it easier to explore the data? How many nodes/edges should we display? Can we add some interactivity to the visualizations? Check out the resulting visualizations below for more ideas.

Dataviz: concepts co-occurrence network

The Building a concepts co-occurrence network notebook shows how to turn document keywords extracted from ‘semantic web’ publications into a simple topic map – by virtue of their co-occurrence within the same documents.

See also the standalone html version of the interactive visualization: concepts_network_2020-08-05.html
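If you’re curious about the mechanics, here is a minimal sketch of the co-occurrence idea using networkx and pyvis – the keyword lists below are made-up placeholders, while the real notebook extracts concepts from Dimensions publication records:

from itertools import combinations

import networkx as nx
from pyvis.network import Network

# toy stand-in for the concept keywords extracted from each publication
documents_keywords = [
    ["semantic web", "ontology", "linked data"],
    ["linked data", "rdf", "sparql"],
    ["semantic web", "linked data", "knowledge graph"],
]

G = nx.Graph()
for keywords in documents_keywords:
    # every pair of keywords appearing in the same document gets an edge;
    # repeated co-occurrences bump the edge weight
    for a, b in combinations(sorted(set(keywords)), 2):
        if G.has_edge(a, b):
            G[a][b]["weight"] += 1
        else:
            G.add_edge(a, b, weight=1)

# size each concept node by how many other concepts it co-occurs with
for node in G.nodes():
    G.nodes[node]["size"] = 5 + 3 * G.degree(node)

nt = Network("500px", "800px")
nt.from_nx(G)
nt.show("concepts_network.html")

Each pair of keywords appearing in the same document gets an edge, and repeated co-occurrences simply increase the edge weight – that is essentially all a ‘topic map’ of this kind is.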


 

Dataviz: Organizations Collaboration Network

The Building an Organizations Collaboration Network Diagram notebook shows how to use publications’ authors and GRID data to generate a network of collaborating research organizations.

See also the standalone html version of the interactive visualization: network_2_levels_grid.412125.1.html
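The gist of that notebook, reduced to a toy example (organization names and collaboration counts here are invented – the notebook derives them from publication author affiliations and GRID identifiers):

import networkx as nx
from pyvis.network import Network

# (org_a, org_b, number of co-authored publications) – dummy values
collaborations = [
    ("University A", "University B", 12),
    ("University A", "Institute C", 5),
    ("University B", "Institute C", 3),
]

G = nx.Graph()
for org1, org2, count in collaborations:
    # the 'title' attribute becomes the hover tooltip in pyvis
    G.add_edge(org1, org2, weight=count, title=f"{count} co-authored publications")

# size each organization by its total number of collaborations
for node in G.nodes():
    G.nodes[node]["size"] = 10 + sum(d["weight"] for _, _, d in G.edges(node, data=True))

nt = Network("500px", "800px")
nt.from_nx(G)
nt.show("collaborations_network.html")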


 

 

Exploring SciGraph data using JSON-LD, Elastic Search and Kibana
http://www.michelepasin.org/blog/2017/04/06/exploring-scigraph-data-using-elastic-search-and-kibana/ – Thu, 06 Apr 2017

Hello there data lovers! In this post you can find some information on how to download and make sense of the scholarly dataset recently made available by the Springer Nature SciGraph project, using the freely available Elasticsearch suite of software.

A few weeks ago the SciGraph dataset was released (full disclosure: I’m part of the team that did it!). This is a high-quality dataset containing metadata and abstracts for scientific articles published by Springer Nature, the research grants related to them, plus other classifications of this content.


This release of the dataset includes the last 5 years of content – that’s already an impressive 32 gigs of data you can get your hands on. So in this post I’m going to show how to do that, in particular by transforming the data from the RDF graph format it comes in into a JSON format that is more suited to application development and analytics.

We will be using two free-to-download products, GraphDB and Elasticsearch, so you’ll have to install them if you haven’t got them already. But no worries, that’s pretty straightforward, as you’ll see below.

1. Hello SciGraph Linked Data

First things first, we want to get hold of the SciGraph RDF datasets of course. That’s pretty easy, just head over to the SciGraph downloads page and get the following datasets:

  • Ontologies: the main schema behind SciGraph.
  • Articles – 2016: all the core articles metadata for one year.
  • Grants: grants metadata related to those articles.
  • Journals: the full Springer Nature journal catalogue.
  • Subjects: classification of research areas developed by Springer Nature.

That’s pretty much everything – note that we’re getting only one year’s worth of articles, as that’s enough for the purpose of this exercise (~300k articles from 2016).

Next up, we want to get a couple of other datasets SciGraph depends on:

That’s it! Time for a cup of coffee.

2. Python to the rescue

We will be doing a bit of data manipulation in the next sections, and Python is a great language for that sort of thing. Here’s what we need to get going:

  1. Python. Make sure you have Python installed and also Pip, the Python package manager (any Python version above 2.7 should be ok).
  2. GitHub project. I’ve created a few scripts for this tutorial, so head over to the hello-scigraph project on GitHub and download it to your computer. Note: the project contains all the Python scripts needed to complete this tutorial, but of course you should feel free to modify them or write from scratch if you fancy it!
  3. Libraries. Install all the dependencies for the hello-scigraph project to run. You can do that by cd-ing into the project folder and running pip install -r requirements.txt (ideally within a virtual environment, but that’s up to you).

3. Loading the data into GraphDB

So, by now you should have 8 different files containing data (after step 1 above). Make sure they’re all in the same folder and that all of them have been unzipped (if needed), then head over to the GraphDB website and download the free version of the triplestore (you may have to sign up first).

The online documentation for GraphDB is pretty good, so it should be easy to get it up and running. In essence, you have to do the following steps:

  1. Launch the application: for me, on a mac, I just had to double click the GraphDB icon – nice!
  2. Create a new repository: this is the equivalent of a database within the triplestore. Call this repo "scigraph-2016" so that we’re all synced for the following steps.

Next thing, we want a script to load our RDF files into this empty repository. So cd into the directory containing the GitHub project (from step 2) and run the following command:

python -m hello-scigraph.loadGraphDB ~/scigraph-downloads/

The “loadGraphDB” script goes through all RDF files in the “scigraph-downloads” directory and loads them into the scigraph-2016 repository (note: you must replace “scigraph-downloads” with the actual path to the folder where you downloaded the content in step 1 above).

So, to recap: this script is now loading more than 35 million triples into your local graph database. Don’t be surprised if it takes some time (the ‘articles-2016’ dataset in particular is by far the biggest), so feel free to take a break or do something else.
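To give a flavour of what a loading step like this involves, here is a hypothetical, stripped-down sketch using the Sesame/RDF4J-style REST endpoint that GraphDB exposes – the port, repository URL, file extension and RDF content type are all assumptions, and the actual loadGraphDB script may well do things differently:

import glob
import requests

# assumptions: GraphDB running locally on its default port, with an empty
# "scigraph-2016" repository, and the downloads saved as N-Triples files
REPO_URL = "http://localhost:7200/repositories/scigraph-2016/statements"

for path in sorted(glob.glob("/path/to/scigraph-downloads/*.nt")):
    with open(path, "rb") as f:
        # POST the file to the repository's statements endpoint; the
        # Content-Type must match the RDF serialization of the file
        resp = requests.post(
            REPO_URL,
            data=f,
            headers={"Content-Type": "application/n-triples"},
        )
    resp.raise_for_status()
    print("Loaded", path)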

Once the process is finished, you should be able to explore your data via the GraphDB workbench. It’ll look something like this:

[screenshot: GraphDB workbench class hierarchy view]

4. Creating an Elasticsearch index

We’re almost there. Let’s head over to the Elasticsearch website and download it. Elasticsearch is a powerful, distributed, JSON-based search and analytics engine so we’ll be using it to build an analytics dashboard for the SciGraph data.

Make sure Elastic is running (run bin/elasticsearch, or bin\elasticsearch.bat on Windows), then cd into the hello-scigraph Python project (from step 2) in order to run the following script:

python -m hello-scigraph.loadElastic

If you take a look at the source code, you’ll see that the script does the following:

  1. Articles loading: extracts article references from GraphDB in batches of 200.
  2. Articles metadata extraction: for each article, we pull out all relevant metadata (e.g. title, DOI, authors) plus related information (e.g. author GRID organizations, geo locations, funding info etc..).
  3. Articles metadata simplification: some intermediate nodes coming from the original RDF graph are dropped and replaced with a flatter structure, which uses a temporary dummy schema (prefix es: <http://elastic-index.scigraph.com/>). It doesn’t matter what we call that schema; what’s important is that we simplify the data we put into the Elasticsearch index. That’s because while the graph layer is supposed to facilitate data integration, and hence benefits from a rich semantic representation of information, the search layer is geared more towards performance and retrieval, hence a leaner information structure can dramatically speed things up there.
  4. JSON-LD transformation: the simplified RDF data structure is serialized as JSON-LD – one of the many serializations available for RDF. JSON-LD is of course valid JSON, meaning that we can put it into Elastic right away. This is a bit of a shortcut actually: for more fine-grained control of what the JSON looks like, it’s probably better to transform the data into JSON using some ad-hoc mechanism. But for the purpose of this tutorial it’s more than enough.
  5. Elastic index creation: finally, we can load the data into an Elastic index called – guess what – “hello-scigraph” (see the sketch right after this list).
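As an illustration of that last step, here is a minimal sketch of bulk-loading JSON documents into a “hello-scigraph” index with the official elasticsearch Python client – the document structure below is invented, and the real loadElastic script builds its documents from the GraphDB extraction described above:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch("http://localhost:9200")

# pretend these came out of the GraphDB extraction + simplification steps above
articles = [
    {"doi": "10.1000/xyz1", "title": "A sample article", "year": 2016},
    {"doi": "10.1000/xyz2", "title": "Another sample article", "year": 2016},
]

# one 'index' action per document, all targeting the same index
actions = ({"_index": "hello-scigraph", "_source": article} for article in articles)

ok, errors = bulk(es, actions)
print("indexed:", ok, "errors:", errors)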

Two more things to point out:

  • Long queries. The Python script enforces a 60-second time-out on the GraphDB queries, so in case things go wrong with some articles’ data the script should keep running.
  • Memory issues. The script pauses for 10 seconds after each batch of 200 articles (time.sleep(10)). I had to do this to prevent GraphDB on my laptop from running out of memory. Time to catch some breath!

That’s it! Time for another break now. A pretty long one actually – loading all the data took around 10 hours on my (rather average-spec’ed) laptop, so you may want to do that overnight or get hold of a faster machine/server.

Eventually, once the loading script is finished, you can issue this command from the command line to see how much data you’ve loaded into the Elastic index “hello-scigraph”. Bravo!

curl -XGET 'localhost:9200/_cat/indices/'

5. Analyzing the data with Kibana

Loading the data into Elastic already opens up a number of possibilities – check out the search APIs for some ideas – however there’s an even quicker way to analyze the data: Kibana. Kibana is another free product in the Elastic suite, which provides an extensible user interface for configuring and managing all aspects of the Elastic Stack.

So let’s get started with Kibana: download it and set it up using the online instructions, then point your browser at http://localhost:5601.

You’ll get to the Kibana dashboard, which shows the index we just created. Here you can perform any kind of search and see the raw data as JSON.

What’s even more interesting is the visualization tab. Results of searches can be rendered as line charts, pie charts, etc., and more dimensions can be added via ‘buckets’. See below for some quick examples, but really, the possibilities are endless!
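If you prefer doing the same thing programmatically, ‘buckets’ map directly onto Elasticsearch aggregations. A quick hypothetical example (the ‘year’ field name is a guess – use whatever fields your index actually contains):

import requests

query = {
    "size": 0,  # we only want the aggregation buckets, not the individual hits
    "aggs": {"articles_per_year": {"terms": {"field": "year"}}},
}

resp = requests.post("http://localhost:9200/hello-scigraph/_search", json=query)
resp.raise_for_status()

# each bucket corresponds to one distinct field value, with a document count
for bucket in resp.json()["aggregations"]["articles_per_year"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])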

Conclusion

This post should have given you enough to realise that:

  1. The SciGraph dataset contains an impressive amount of high-quality scholarly publication metadata, which can be used for things like literature search, research statistics, etc.
  2. Even if you’re not familiar with Linked Data and the RDF family of languages, it’s not hard to get going with a triplestore and then transform the data into a more widely used format like JSON.
  3. Finally, Elasticsearch and especially Kibana are fantastic tools for data analysis and exploration! Needless to say, in this post I’ve just scratched the surface of what can be done with them.

Hope this was fun, any questions or comments, you know the drill :-)

Installing ClioPatria triplestore on mac os
http://www.michelepasin.org/blog/2014/10/27/getting-started-with-a-triplestore-on-mac-os-cliopatria/ – Mon, 27 Oct 2014

ClioPatria is a “SWI-Prolog application that integrates the SWI-Prolog libraries for RDF and HTTP services into a ready to use (semantic) web server”. It is actively developed by the folks at the VU University of Amsterdam and is freely available online.

While at a conference last week I saw a pretty cool demo (DIVE) which, I later learned, is powered by the ClioPatria triplestore. So I thought I’d give it a try and, by doing so, write a follow-up to my recent post on installing OWLIM on Mac OS.

1. Requirements

OSX: Mavericks 10.9.5
XCode: latest version available from Apple
HOMEBREW: ruby -e "$(curl -fsSkL raw.github.com/mxcl/homebrew/go)"
Prolog: build it from source using brew: brew install swi-prolog
ClioPatria: git clone https://github.com/ClioPatria/ClioPatria.git

2. Setting up

After you have downloaded and unpacked the archive, all you need to do is start a new project using the ClioPatria script. In short, this is done by creating a new directory and telling ClioPatria to configure it as a project:

[michele.pasin]:~/Documents/ClioPatriaProjects/firstproject> ../path/to/ClioPatria/configure

A bunch of files are created, including a script run.pl which you can use later to run the server.

3. Running ClioPatria

I tried running run.pl as per the documentation, but that didn’t work:

[michele.pasin]@Tartaruga:~/Documents/ClioPatriaProjects/firstproject>./run.pl 
./run.pl: line 3: :-: command not found
./run.pl: line 5: /Applications: is a directory
./run.pl: line 6: This: command not found
./run.pl: line 8: syntax error near unexpected token `('
./run.pl: line 8: `    % ./configure			(Unix)'

According to a thread on Stack Overflow, the Prolog shebang line isn’t interpreted correctly on OS X, meaning that Mac OS doesn’t recognise the script as a Prolog program.

That can be easily solved by calling the Prolog interpreter (swipl) explicitly:

[michele.pasin]@Tartaruga:~/Documents/ClioPatriaProjects/firstproject>swipl run.pl 
ERROR: /Applications/-Other-Apps/8-Languages-IDEs/ClioPatria/rdfql/sparql_runtime.pl:1246:14: Syntax error: Operator expected
% run.pl compiled 1.64 sec, 25,789 clauses
% Started ClioPatria server at port 3020
% You may access the server at http://tartaruga.local:3020/
% Loaded 0 graphs (0 triples) in 0.00 sec. (0% CPU = 0.00 sec.)
Welcome to SWI-Prolog (Multi-threaded, 64 bits, Version 6.6.6)
Copyright (c) 1990-2013 University of Amsterdam, VU Amsterdam
SWI-Prolog comes with ABSOLUTELY NO WARRANTY. This is free software,
and you are welcome to redistribute it under certain conditions.
Please visit http://www.swi-prolog.org for details. 

You should be able to access the server with your browser on port 3020 (ps: the previous command caused a syntax error too, but luckily that isn’t a game stopper).

[screenshot: ClioPatria web interface]

First impression:

Super-easy to install, with a clean and intuitive user interface. I subsequently added a couple of RDF datasets and it all went very, very smoothly.

One cool feature is that ClioPatria has a built-in package management system, which allows you to easily install extensions to the application. For example, what follows quickly extends the UI with a couple of ‘intelligent’ SPARQL query interfaces (YASQE/YASR and Flint):

[michele.pasin]@Tartaruga:/Applications/ClioPatria>sudo git submodule update --init web/yasqe web/yasr
Password:


[michele.pasin]@Tartaruga:/Applications/ClioPatria>sudo git submodule update --init web/FlintSparqlEditor

 

4. Loading a big dataset

As in my previous post, I’ve tried loading the NPG Articles dataset available at nature.com’s legacy linked data site, data.nature.com. The dataset contains around 40M triples describing (at the metadata level) all that’s been published by NPG and Scientific American from 1845 to the present day. The file size is ~6 gigs, so it’s not a huge dataset – still, big enough to pose a challenge to my MacBook Pro (8 gigs of RAM).

I used the web UI (‘load local file’) to load the dataset, but I quickly ran into a ‘not enough memory’ error. I tried fiddling with the settings accessible via the web interface (Stack limit, Time limit), but that didn’t seem to do much.
So I increased the memory allocated to the Prolog process (more info here); however, this wasn’t enough, since after around 20 minutes the whole thing crashed again with an out-of-memory error:

[michele.pasin]@Tartaruga:~/Documents/ClioPatriaProjects/firstproject>swipl -G6g run.pl

In the end I got in touch with the ClioPatria creators via the mailing list: in their (incredibly fast) reply they suggested loading the dataset manually via the server’s Prolog console. You do that simply by calling the rdf_load predicate after starting the ClioPatria server (as shown above):

?- rdf_load('/Users/michele.pasin/Downloads/NPGcitationsGraph/articles.2012-07-16/articles.nq')
|    .
% Parsed "articles.nq" in 1149.71 sec; 0 triples

That worked: the dataset was loaded in around 20 mins. Job done!

However, when I tried to run some queries the application became very slow and ultimately unresponsive (especially with queries like retrieving all the named classes in the graph). I tried restarting the triplestore, and realised that once you do that, ClioPatria begins by re-loading all the repositories previously created – which, in the case of my 40M-triple repo, takes around 10–15 minutes.

After restarting the server, queries were a bit faster, but in many cases still pretty slow on my 8GB-RAM laptop.
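As an aside, this is roughly how you can fire that kind of query (e.g. listing the named classes) at ClioPatria from Python – assuming the default SPARQL endpoint at /sparql/ on port 3020:

import requests

ENDPOINT = "http://localhost:3020/sparql/"  # assumption: default ClioPatria endpoint

query = """
SELECT DISTINCT ?class WHERE { ?s a ?class . } LIMIT 100
"""

# standard SPARQL protocol: GET with a 'query' parameter, asking for JSON results
resp = requests.get(
    ENDPOINT,
    params={"query": query},
    headers={"Accept": "application/sparql-results+json"},
)
resp.raise_for_status()

for binding in resp.json()["results"]["bindings"]:
    print(binding["class"]["value"])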

 

Conclusion:

I am sure there are many more things that could be optimised; however, I’m no Prolog expert, nor could I figure out where to start based on the online documentation alone. So I have kind of given up on using it for large datasets on my MacBook for now.

On the other hand, I really liked ClioPatria’s intuitive and simple UI, its ease of installation, and the fact that you can perform operations transparently and interactively via the Prolog console (assuming you know how to do that).

All in all, ClioPatria seems to me a really good option if you want to get up and running quickly, e.g. to prototype linked data applications or explore small to medium-sized RDF datasets (10M triples or so, I guess). For bigger datasets, you’d better equip your Mac with a few gigs of extra RAM!

5. Useful resources

> Whitepaper with technical analysis

  • http://cliopatria.swi-prolog.org/help/whitepaper.html

> Mailing list

  • http://mailman.few.vu.nl/mailman/listinfo/cliopatria-list

    Installing GraphDB (aka OWLIM) triplestore on mac os
    http://www.michelepasin.org/blog/2014/10/16/getting-started-with-a-triplestore-on-mac-os-graphdb-aka-owlim/ – Thu, 16 Oct 2014

    GraphDB (formerly called OWLIM) is an RDF triplestore which is used – among others – by large organisations like the BBC or the British Museum. I’ve recently installed the LITE release of this graph database on my mac, so what follows is a simple write-up of the steps that worked for me.

    I haven’t played much with the database yet, but all in all the installation was much simpler than expected (ps: this old recipe on Google Code was very helpful in steering me in the right direction with the whole Tomcat/Java setup).

    1. Requirements

    OSX: Mavericks 10.9.5
    XCode: latest version available from Apple
    HOMEBREW: ruby -e "$(curl -fsSkL raw.github.com/mxcl/homebrew/go)"
    Tomcat7: brew install tomcat
    JAVA: available from Apple

    Finally – we obviously want to get a copy of OWLIM-Lite too: http://www.ontotext.com/owlim/downloads

    2. Setting up

    After you have downloaded and unpacked the archive, you must simply copy these two files:

    owlim-lite/sesame_owlim/openrdf-sesame.war
    owlim-lite/sesame_owlim/openrdf-workbench.war

    ..to the Tomcat webapps folder:

    /usr/local/Cellar/tomcat/7.0.29/libexec/webapps/

    Essentially that’s because OWLIM-Lite is packaged as a storage and inference layer for the Sesame RDF framework, which runs here as a component within the Tomcat server (note: there are other ways to run OWLIM, but this one seemed the quickest).

    3. Starting Tomcat

    First I created a symbolic link in my ~/Library folder, so as to better manage new versions (as suggested here).

    sudo ln -s /usr/local/Cellar/tomcat/7.0.39 ~/Library/Tomcat

    Then in order to start/stop Tomcat it’s enough to use the catalina command:

    [michele.pasin]@here:~/Library/Tomcat/bin>./catalina start
    Using CATALINA_BASE:   /usr/local/Cellar/tomcat/7.0.39/libexec
    Using CATALINA_HOME:   /usr/local/Cellar/tomcat/7.0.39/libexec
    Using CATALINA_TMPDIR: /usr/local/Cellar/tomcat/7.0.39/libexec/temp
    Using JRE_HOME:        /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
    Using CLASSPATH:       /usr/local/Cellar/tomcat/7.0.39/libexec/bin/bootstrap.jar:/usr/local/Cellar/tomcat/7.0.39/libexec/bin/tomcat-juli.jar
    
    [michele.pasin]@here:~/Library/Tomcat/bin>./catalina stop
    Using CATALINA_BASE:   /usr/local/Cellar/tomcat/7.0.39/libexec
    Using CATALINA_HOME:   /usr/local/Cellar/tomcat/7.0.39/libexec
    Using CATALINA_TMPDIR: /usr/local/Cellar/tomcat/7.0.39/libexec/temp
    Using JRE_HOME:        /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
    Using CLASSPATH:       /usr/local/Cellar/tomcat/7.0.39/libexec/bin/bootstrap.jar:/usr/local/Cellar/tomcat/7.0.39/libexec/bin/tomcat-juli.jar
    

    Tip: Tomcat runs by default on port 8080. That can be changed pretty easily by modifying a parameter in server.xml in {Tomcat installation folder}/libexec/conf/ (more details here).

     

    4. Testing the Graph database

    Start a browser and go to the Workbench Web application using a URL of this form: http://localhost:8080/openrdf-workbench/ (substituting localhost and the 8080 port number as appropriate). You should see something like this:

    [screenshot: OpenRDF Sesame Workbench]

    After selecting a server, click ‘New repository’.

    Select ‘OWLIM-Lite’ from the drop-down and enter the repository ID and description. Then click ‘next’.

    Fill in the fields as required and click ‘create’.

    That’s it! A message should be displayed that gives details of the new repository and this should also appear in the repository list (click ‘repositories’ to see this).

    5. Loading a big dataset

    I set out to load the NPG Articles dataset available at nature.com’s legacy linked data site, data.nature.com.

    The dataset contains around 40M triples describing (at the metadata level) all that’s been published by NPG and Scientific American from 1845 to the present day. The file size is ~6 gigs, so it’s not a huge dataset – still, big enough to pose a challenge to my MacBook Pro (8 gigs of RAM).

    First, I increased the memory allocated to the Tomcat application to 5G. It was enough to create a setenv.sh file in the ${tomcat-folder}/bin/ folder, containing this line:

    CATALINA_OPTS="$CATALINA_OPTS -server -Xms5g -Xmx5g"

    More details on Tomcat’s and Java memory issues are available here.

    Then I used OWLIM’s web interface to create a new graph repository and upload the dataset file into it (I previously downloaded a copy of the dataset to my computer so as to work with local files only).

    It took around 10 minutes for the application to upload the file into the triplestore, and 2–3 minutes for OWLIM to process it. Much, much faster than I expected. The only minor issue was the lack of notifications (in the UI) about what was going on. Not a big deal in my case, but with larger dataset uploads it might be a downer.

    Note: I used the web form to upload the dataset, but there are also ways to do that from the command line (which will probably result in even faster uploads).
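For instance, one such route is to POST the file to the repository’s statements endpoint exposed by the openrdf-sesame webapp – a hedged sketch, where the repository name, file name and content type are assumptions you would need to adapt:

import requests

# assumptions: a repository called "test1", and an N-Quads file to upload
url = "http://localhost:8080/openrdf-sesame/repositories/test1/statements"

with open("articles.nq", "rb") as f:
    resp = requests.post(
        url,
        data=f,
        # the content type must match the file's serialization
        # ("text/x-nquads" on older Sesame versions)
        headers={"Content-Type": "application/n-quads"},
    )
resp.raise_for_status()
print("Upload finished with status", resp.status_code)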

    6. Useful information

    > SPARQL endpoints

    All of your repositories come also with a handy SPARQL endpoint, which is available at this url: http://localhost:8080/openrdf-sesame/repositories/test1 (just change the last bit so that it matches your repository name).
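A quick way to test it, e.g. counting the triples in the repository from Python (again, adapt the repository name as needed):

import requests

endpoint = "http://localhost:8080/openrdf-sesame/repositories/test1"

resp = requests.get(
    endpoint,
    params={"query": "SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }"},
    headers={"Accept": "application/sparql-results+json"},
)
resp.raise_for_status()
print(resp.json()["results"]["bindings"][0]["triples"]["value"])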

    > Official documentation

  • https://confluence.ontotext.com/display/GraphDB6

    > Ontotext’s Q&A forum

  • http://answers.ontotext.com

    An introduction to Neo4j
    http://www.michelepasin.org/blog/2013/04/10/an-introduction-to-neo4j/ – Wed, 10 Apr 2013

    Neo4j is a recent graph database that is rapidly accumulating success stories, especially in areas such as “social applications, recommendation engines, fraud detection, resource authorization, network & data center management and much more“. Here’s an interesting introductory lecture about it by Ian Robinson at JavaZone 2013.

    Tip: Databasetube offers various other interesting articles about Neo4j

    A few notes from the presentation:

    Premises: 
    	- Data today is more connected than ever before
    	- Complexity = f(size, semi-structure, connectedness)
    	- Graphs are the best abstractions we have to model connectedness
    
    The data model in neo4j: "property graph model"
    	- nodes have properties (eg key-value pairs)
    	- relationships have a direction, and can have properties too (eg weighted associations)
    
    Neo4j server has a built in UI (web-based)
    
    When to consider using a graph database:
    	- lots of join tables [connectedness]
    	- lots of sparse tables [semi-structure]
    
    Neo4j fully supports ACID transactions
    	- durable, consistent data
    	- uses a try/success syntax
    
    Performance
    	- millions of 'joins' per second [connections are pre-calculated at insert time!]
    	- consistent query times as dataset grows
    
    Cypher query language
    	- syntax mirrors the graphic representation of a graph 
    	- one dimensional, left-to-right
    	
    

    For a comparison of various graph databases (including Neo4j) check out this tutorial from the ESWC’13 conference
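To make the property-graph and Cypher notes above a bit more concrete, here is a tiny hedged example using the official Neo4j Python driver – the connection details, labels and credentials are made up, and the Bolt-based driver shown here post-dates the talk:

from neo4j import GraphDatabase

# assumptions: a local Neo4j instance, default Bolt port, made-up credentials
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

with driver.session() as session:
    # nodes and relationships both carry properties (the 'property graph model')
    session.run(
        "CREATE (a:Person {name: $a})-[:KNOWS {since: 2013}]->(b:Person {name: $b})",
        a="Alice", b="Bob",
    )
    # Cypher's ASCII-art patterns mirror how you would draw the graph: ()-[]->()
    result = session.run(
        "MATCH (a:Person)-[k:KNOWS]->(b:Person) RETURN a.name, k.since, b.name"
    )
    for record in result:
        print(record["a.name"], "knows", record["b.name"], "since", record["k.since"])

driver.close()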

     
