python – Parerga und Paralipomena
http://www.michelepasin.org/blog
"At the core of all well-founded belief lies belief that is unfounded" – Wittgenstein

More Jupyter notebooks: pyvis and networkx
http://www.michelepasin.org/blog/2020/08/06/more-jupyter-notebooks-pyvis-and-networkx/
Thu, 06 Aug 2020

Lately I've been spending more time creating Jupyter notebooks that demonstrate how to use the Dimensions API for research analytics. In this post I'll talk a little bit about two cool Python technologies I've discovered for working with graph data: pyvis and networkx.

pyvis and networkx

The networkx and pyvis libraries are used for generating and visualizing network data, respectively.

Pyvis is fundamentally a python wrapper around the popular JavaScript vis.js library. Networkx, on the other hand, is a pretty sophisticated package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks.

>>> from pyvis.network import Network
>>> import networkx as nx
# generate a generic networkx graph instance
>>> nx_graph = nx.Graph()
# add some nodes and edges (nodes must exist before attributes can be set on them)
>>> nx_graph.add_nodes_from(range(5))
>>> nx_graph.add_edges_from([(0, 1), (1, 2), (2, 3), (3, 4)])
>>> nx_graph.nodes[1]['title'] = 'Number 1'
>>> nx_graph.nodes[1]['group'] = 1
>>> nx_graph.nodes[3]['title'] = 'I belong to a different group!'
>>> nx_graph.nodes[3]['group'] = 10
>>> nx_graph.add_node(20, size=20, title='couple', group=2)
>>> nx_graph.add_node(21, size=15, title='couple', group=2)
>>> nx_graph.add_edge(20, 21, weight=5)
>>> nx_graph.add_node(25, size=25, label='lonely', title='lonely node', group=3)
# instantiate a pyvis network
>>> nt = Network("500px", "500px")
# populate the pyvis network from the networkx instance
>>> nt.from_nx(nx_graph)
>>> nt.show("nx.html")

It took me a little while to familiarise myself with the libraries' concepts and to generate some basic graphs. So the tutorials linked below are meant to provide some reusable code building blocks for working with these tools.

Once you get the hang of it, though, the fun part begins. What are the best data variables to represent in the graph? What color-coding strategy makes it easier to explore the data? How many nodes/edges should be displayed? Can we add some interactivity to the visualizations? Check out the resulting visualizations below for more ideas.

Dataviz: concepts co-occurrence network

The Building a concepts co-occurrence network notebook shows how to turn document keywords extracted from 'semantic web' publications into a simple topic map – by virtue of their co-occurrence within the same documents.
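To make the co-occurrence idea concrete, here is a minimal sketch (illustrative code with made-up data, not the notebook's actual implementation): every pair of concepts appearing in the same document gets an edge, and the edge weight counts how many documents they share.

from itertools import combinations

import networkx as nx

# toy input: one list of extracted concepts per document (made-up data)
documents = [
    ["semantic web", "ontology", "rdf"],
    ["semantic web", "linked data"],
    ["rdf", "linked data", "ontology"],
]

g = nx.Graph()
for concepts in documents:
    # each unordered pair of concepts in a document is one co-occurrence
    for a, b in combinations(sorted(set(concepts)), 2):
        if g.has_edge(a, b):
            g[a][b]["weight"] += 1
        else:
            g.add_edge(a, b, weight=1)

print(g.edges(data=True))

The resulting weighted graph can then be handed to pyvis via from_nx(), exactly as in the snippet above.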

See also the standalone html version of the interactive visualization: concepts_network_2020-08-05.html


Dataviz: Organizations Collaboration Network

The Building an Organizations Collaboration Network Diagram notebook shows how to use publications’ authors and GRID data to generate a network of collaborating research organizations.

See also the standalone html version of the interactive visualization: network_2_levels_grid.412125.1.html

Pypapers: a bare-bones, command line, PDF manager
http://www.michelepasin.org/blog/2019/06/30/pypapers-a-bare-bones-command-line-pdf-manager/
Sun, 30 Jun 2019

Ever felt like software like Mendeley or Papers is great, but somehow slows you down? Ever felt like none of the many reference manager applications out there will ever cut it for you, cause you need something R E A L L Y SIMPLE? I did. Many times. So I've finally crossed the line and tried building a simple command-line PDF manager. It's called PyPapers.

Yes – that’s right – command line. So not for everyone. Also: this is bare bones and pre-alpha. So don’t expect wonders. It basically provides a simple interface for searching a folder full of PDFs. That’s all for now!

 

Key features (or lack thereof)

  • Mac only, I’m afraid. I’m sitting on the shoulders of a giant. That is, mdfind.
  • No fuss search in file names only or full text
  • Shows all results and relies on Preview for reading
  • Highlighting on Preview works pretty damn fine and it’s the ultimate compatibility solution (any other software kinds of locks you in eventually, imho)
  • Open source. If you can code Python you can customise it to your needs. If you can’t, open an issue in github and I may end up doing it.
  • It recognises sub-folders, so that can be leveraged to become a simple, filesystem level, categorization structure for your PDFs (eg I have different folders for articles, books, news etc..)
  • Your PDFs live in the Mac filesystem ultimately. So you can always search them using Finder in case you get bored of the command line.
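To give an idea of the mechanics, here is a minimal sketch of how mdfind can be wrapped from Python (hypothetical code, not necessarily what PyPapers actually does; the folder path is a placeholder):

import subprocess

def search_pdfs(folder, query, full_text=True):
    """Search PDFs under `folder` via Spotlight's mdfind."""
    attr = "kMDItemTextContent" if full_text else "kMDItemFSName"
    raw_query = "%s == '*%s*'cd" % (attr, query)  # c/d = case/diacritic insensitive
    out = subprocess.check_output(["mdfind", "-onlyin", folder, raw_query])
    return [p for p in out.decode("utf-8").splitlines()
            if p.lower().endswith(".pdf")]

for path in search_pdfs("/Users/me/Documents/papers", "wittgenstein"):
    print(path)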
First impressions

    Pretty good. Was concerned I was gonna miss things like collections or tags. But I found a workaround: first, identify the papers I am interested in. Then, create a folder in the same directory and symlink them in there (= create an alias).

    It’s not quite like uncarved wood, but it definitely feels simple enough.

     

     

Introducing DimCli: a Python CLI for the Dimensions API
http://www.michelepasin.org/blog/2019/05/24/introducing-dimcli-a-python-cli-for-dimensions-api/
Fri, 24 May 2019

For the last couple of months I've been working on a new open source Python project. It's called DimCli and it's a library aimed at making it simpler to work with the Dimensions Analytics API.

The project is available on Github. In a nutshell, DimCli helps people become productive with the powerful scholarly analytics API from Dimensions. See the video below for a quick taster of the functionalities available.

    Background

I recently joined the Dimensions team, so I needed a way to get to grips with their feature-rich API (official docs). Building DimCli has been a fun way for me to dig into the logic of the Dimensions Search Language (DSL).

    Plus, this project gave me a chance to learn more about two awesome Python technologies: JupyterLab and its magic commands, as well as the Python Prompt Toolkit library.


    Features

    In a nutshell, this is what DimCli has to offer:

  • It’s an interactive query console for the Dimensions Analytics API (ps: Dimensions is a world-class research-data platform including information about millions of documents like publications, patents, grants, clinical trials and policy documents.
  • It helps learning the Dimensions Search Language (DSL) thanks to a built-in autocomplete and documentation search mechanism.
  • It handles authentication transparently either via a global user-specific credentials file, or by passing credentials manually (e.g. when used within shared environments).
  • It allows to export results to CSV, JSON and pandas dataframes, hence making it easier to integrate with other data analysis tools.
  • It is compatible with Jupyter, e.g. it includes various magic commands that make it super simple to interrogate Dimensions (various examples here).
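For a quick taste in code, here is a minimal sketch of a DimCli session (indicative only – check the official docs for the exact method names, and note that the API key below is a placeholder):

import dimcli

# authenticate; credentials can also be stored in a global dimcli config file
dimcli.login(key="my-secret-key", endpoint="https://app.dimensions.ai")

dsl = dimcli.Dsl()
result = dsl.query('search publications for "semantic web" return publications limit 10')

print(result.count_total)      # total number of matches on the server
df = result.as_dataframe()     # hand the records over to pandas
df.to_csv("publications.csv")  # ...or export to CSV from there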
Feedback

    DimCli lives on Github, so for any feedback or bug reports, feel free to open an issue there.

Ontospy 1.9.8 released
http://www.michelepasin.org/blog/2019/01/03/ontospy-1-9-8-released/
Thu, 03 Jan 2019

Ontospy version 1.9.8 has just been released and it contains tons of improvements and new features. Ontospy is a lightweight open-source Python library and command line tool for working with vocabularies encoded in the RDF family of languages.

    Over the past month I’ve been working on a new version of Ontospy, which is now available for download on Pypi.

     

    What’s new in this version

  • The library to generate ontology documentation (as html or markdown) is now included within the main Ontospy distribution. Previously this library was distributed separately under the name ontodocs. The main problem with that approach was that keeping the two projects in sync was becoming too time-consuming for me, so I've decided to merge them. NOTE: one can still choose whether or not to include this extra library when installing.
  • You can print out the raw RDF data being returned, via a command line argument.
  • One can decide whether or not to include 'inferred' schema definitions extracted from an RDF payload. The inferences are pretty basic for now (e.g. the object of rdf:type statements is taken to be a type), but this makes it possible, for example, to quickly dereference a DBpedia URI and pull out all the types/predicates being used (see the sketch after this list).
  • The online documentation is now hosted on github pages and available within the /docs folder of the project.
  • Improved support for JSON-LD, and a new utility for quickly sending JSON-LD data to the online playground tool.
  • Several other bug fixes and improvements, in particular to the interactive ontology exploration mode (shell command) and the visualization library (new visualizations are available – albeit still in alpha state).
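As an illustration of the dereferencing use case, here is a minimal sketch using Ontospy's Python API (attribute names as in recent releases – treat as indicative):

import ontospy

# parse a remote vocabulary (FOAF) and inspect the schema definitions found
model = ontospy.Ontospy("http://xmlns.com/foaf/0.1/")

for c in model.all_classes:
    print("Class:", c.uri)

for p in model.all_properties:
    print("Property:", p.uri)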

Ontospy v. 1.6.7
http://www.michelepasin.org/blog/2016/06/12/ontospy-v-1-6-7/
Sun, 12 Jun 2016

A new and improved version of OntoSpy (1.6.7) is available online. OntoSpy is a lightweight Python library and command line tool for inspecting and visualizing vocabularies encoded in the RDF family of languages.

    This update includes support for Python 3, plus various other improvements that make it easier to query semantic web vocabularies using OntoSpy’s interactive shell module. To find out more about Ontospy:

  • Docs: http://ontospy.readthedocs.org
  • CheeseShop: https://pypi.python.org/pypi/ontospy
  • Github: https://github.com/lambdamusic/ontospy

  • Here’s a short video showing a typical sessions with the OntoSpy repl:

    What’s new in this release

    The main new features of version 1.6.7:

  • added support for Python 3.0 (thanks to a pull request from https://github.com/T-002)
  • the import [file | uri | repo | starter-pack] command, which makes it easier to load models into the local repository. You can import a local RDF file or a web resource via its URI. The repo option lets you select an ontology from those available in a couple of online public repositories; finally, the starter-pack option can be used to automatically download a few widely used vocabularies (e.g. FOAF, DC etc.) into the local repository – mostly useful after a fresh installation in order to get started
  • the info [toplayer | parents | children | ancestors | descendants] command, which allows printing more detailed info about entities
  • added an incremental search mode based on text patterns, e.g. to reduce the options returned by the ls command
  • calling the serialize command at ontology level now serializes the whole graph
  • made the caching functionality version-dependent
  • added a json serialization option (via rdflib-jsonld)

Install/update simply by typing pip install ontospy -U in your terminal window (see this page for more info).

    Coming up next

    I’d really like to add more output visualisations e.g. VivaGraphJS or one of the JavaScript InfoVis Toolkit.

Probably even more interesting, I'd like to refactor the code generating the visualisations so that people can develop their own via a standard API and then publish them on GitHub.

    Lastly, more support for instance management: querying and creating instances from any loaded ontology.

    Of course, any comments or suggestions are welcome as usual – either using the form below or via GitHub. Cheers!

     

Accessing OS X dictionary with Python
http://www.michelepasin.org/blog/2015/11/28/accessing-os-x-dictionary-with-python/
Sat, 28 Nov 2015

A little script that allows you to access the OS X Dictionary app using Python.

    Tip: make the script executable and add an alias for it in order to be able to call it from the command line easily.
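The script itself was embedded in the original post; a minimal sketch of the same idea looks like this (illustrative code, using the DictionaryServices bridge that ships with the OS X system Python):

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys

import DictionaryServices  # PyObjC bridge, preinstalled on OS X

def lookup(word):
    # DCSCopyTextDefinition queries the same dictionaries used by Dictionary.app
    definition = DictionaryServices.DCSCopyTextDefinition(None, word, (0, len(word)))
    return definition or "No definition found for %r" % word

if __name__ == "__main__":
    print(lookup(" ".join(sys.argv[1:])))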

     

Dereference a DOI using python
http://www.michelepasin.org/blog/2014/12/03/dereference-a-doi-using-python/
Wed, 03 Dec 2014

A little python script that lets you pass an article DOI and obtain all the metadata related to that article.

    The script relies on the handy crosscite.org API, which is one of the wonderful services provided by CrossRef.
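The script itself was embedded in the original post; here is a minimal sketch of the same idea (illustrative code), using the content-negotiation mechanism behind the crosscite service:

import json
import urllib.request

def resolve_doi(doi):
    # ask the DOI resolver for citation metadata instead of the landing page
    req = urllib.request.Request(
        "https://doi.org/" + doi,
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))

metadata = resolve_doi("10.1038/171737a0")  # any DOI works here
print(metadata.get("title"))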

     

Teaching programming concepts visually with the Online Python Tutor
http://www.michelepasin.org/blog/2012/10/12/teaching-programming-concepts-visually-with-the-online-python-tutor/
Fri, 12 Oct 2012

The Online Python Tutor is a Web-based program visualization tool for CS education, developed in collaboration with Google. It provides an easy-to-use online environment for writing code and testing it interactively. A great resource for teaching computer science concepts!

    As part of his CS education work at Google, Philip Guo has been developing an open-source educational tool called Online Python Tutor. This tool enables teachers and students to write Python programs directly in the web browser and then single-step forwards and backwards to visualize what the computer is doing as it executes those programs. The tool has already been used by over 100,000 people but has a lot of potential for advancement. Philip is actively seeking partnerships with educators at all grade levels to deploy and improve the Online Python Tutor tool. Visit the URL for more information on using the tool and how to get involved.

    Create, Test, Share

    Once you’ve created a program, you can also share it online via a url, or get a snippet of code that will let you embed it in your site. Which is pretty neat! For example:

     

Getting hold of your Flickr collections with Python
http://www.michelepasin.org/blog/2012/05/07/getting-hold-of-your-flickr-collections-with-python/
Mon, 07 May 2012

Recently I've been a little disappointed with Flickr, the popular online photo-sharing service. Photos gone missing, entire albums disappeared. Not really what you'd like to happen to your photo collection, especially when it's very large and it's therefore difficult to stay on top of what's there and what's not.

I emailed the customer service people at Flickr; they promptly replied that it wasn't their fault, but most likely a bug with other apps I had previously authorised to edit my Flickr collection (e.g. iPhoto or Aperture). Bad news: apparently, whatever happened, what's lost is lost forever. Not much to my consolation, the same has happened to other people – for example, check this post or this post to see alternative versions of the problem from 2010 and 2007.

    So I’ve suddenly realised the cloud isn’t that secure a place, as yet. It’s time to change strategy: use flickr for sharing and my local HD for backup!

    The good news is that if you know a little programming you can download your entire Flickr collection without having to pay a cent, for example by using Python. There are a few free libraries out there for accessing the Flickr APIs, such as flickrpy and FlickrAPI. They both require you to fiddle a little with the code (at the very least, get a personalised passkey from Flickr and add it to the python program) in order to get what you want.
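For instance, here is a minimal sketch with the FlickrAPI library (illustrative code – the key and secret are placeholders you'd obtain from Flickr's developer pages):

import flickrapi  # pip install flickrapi

API_KEY = "your-api-key"        # placeholder
API_SECRET = "your-api-secret"  # placeholder

flickr = flickrapi.FlickrAPI(API_KEY, API_SECRET)

# a simple public search; results come back as an ElementTree by default
rsp = flickr.photos.search(tags="python", per_page=5)
for photo in rsp.find("photos"):
    print(photo.get("id"), photo.get("title"))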

    The one I’ve gone for instead is a little package called flickrtouchr, which is even easier to use. After downloading you just have to run it from the command line and it’ll begin browsing your whole Flickr collection and download pictures at the highest resolution available. I have more than 8000 photos, and it worked like a charm – beware though – it took more than 10 hours on my TalkTalk connection.

Thanks Dan@hivelogic.com for writing this code – couldn't ask for more!

    [mac]@mike:~/Dropbox/code/python/_libs/dan-hivelogic-flickrtouchr-9ba645b>python flickrtouchr.py ~/Desktop/FlickrBackupFolder
    
    In order to allow FlickrTouchr to read your photos and favourites
    you need to allow the application. Please press return when you've
    granted access at the following url (which should have opened
    automatically).
    
    http://api.flickr.com/services/auth/?api_key=e2245325378b5675b4af4e8cdb0564716fa9bd&perms=read&frob=8856734hhgbbhsksd19443-caa77e89367asbbhfa2ba-600258&api_sig=a4aasdbbnb345c7fb46bdd33cfa65ec17bb32a
    
    Waiting for you to press return
    
    Egypt 1 ... in set ... Sharm el Sheik, Dec 2011
    Egypt 2 ... in set ... Sharm el Sheik, Dec 2011
    Egypt 3 ... in set ... Sharm el Sheik, Dec 2011
    Egypt 4 ... in set ... Sharm el Sheik, Dec 2011
    
    ..... etc….
    
Survey of Pythonic tools for RDF and Linked Data programming
http://www.michelepasin.org/blog/2011/02/24/survey-of-pythonic-tools-for-rdf-and-linked-data-programming/
Thu, 24 Feb 2011

In this post I'm reporting on a recent survey I made in the context of a Linked Data project I'm working on, SAILS. The Resource Description Framework (RDF) is a data model and language which is quickly gaining momentum in the open-data and data-integration worlds. In SAILS we're developing a prototype for rdf-data manipulation and querying, but since the final application (of which the rdf-component is part) will be written in Python and Django, in what follows I have tried to gather information about all the existing libraries and frameworks for rdf-programming in python.

1. Python libraries for working with RDF

    RdfLib http://www.rdflib.net/

    RdfLib (download) is a pretty solid and extensive rdf-programming kit for python. It contains parsers and serializers for RDF/XML, N3, NTriples, Turtle, TriX and RDFa. The library presents a Graph interface which can be backed by any one of a number of store implementations, including, memory, MySQL, Redland, SQLite, Sleepycat, ZODB and SQLObject.

    The latest release is RdfLib 3.0, although I have the feeling that many are still using the previous release, 2.4. One big difference between the two is that in 3.0 some libraries have been separated into another package (called rdfextras); among these libraries there’s also the one you need for processing sparql queries (the rdf query language), so it’s likely that you want to install that too.
    A short overview of the difference between these two recent releases of RdfLib can be found here. The APIs documentation for RdfLib 2.4 is available here, while the one for RdfLib 3.0 can be found here. Finally, there are also some other (a bit older, but possibly useful) docs on the wiki.

    Next thing, you might want to check out these tutorials:

  • Getting data from the Semantic Web: a nice example of how to use RdfLib and python in order to get data from DBPedia, the Semantic Web version of Wikipedia.
  • How can I use the Ordnance Survey Linked Data: shows how to install RdfLib and query the linked data offered by Ordnance Survey.
  • A quick and dirty guide to YOUR first time with RDF: another example of querying UK government data found on data.gov.uk using RdfLib and Berkeley/Sleepycat DB.
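To give a flavour of the Graph interface mentioned above, here is a minimal RdfLib sketch (illustrative code) that loads a remote RDF document and iterates over its triples:

import rdflib

g = rdflib.Graph()
# DBpedia serves RDF for its resource URIs via content negotiation
g.parse("http://dbpedia.org/resource/Semantic_Web")

print(len(g), "triples loaded")
for s, p, o in list(g)[:10]:  # print a small sample
    print(s, p, o)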
RdfAlchemy http://www.openvest.com/trac/wiki/RDFAlchemy

The goal of RDFAlchemy (install | apidocs | usergroup) is to allow anyone who uses python to have object-style API access to an RDF Triplestore. In a nutshell, in the same way that SQLAlchemy is an ORM (Object Relational Mapper) for relational database users, RDFAlchemy is an ORM (Object RDF Mapper) for semantic web users.

    RdfAlchemy can also work in conjunction with other datastores, including rdflib, Sesame, and Jena. Support for SPARQL is present, although it seems less stable than the rest of the library.

    Fuxi http://code.google.com/p/fuxi/

FuXi is a Python-based, bi-directional logical reasoning system for the semantic web. It requires rdflib 2.4.1 or 2.4.2 and is not compatible with rdflib 3. FuXi aims to be the 'engine for contemporary expert systems based on the Semantic Web technologies'. The documentation can be found here; it might also be useful to look at the user-manual and the discussion group.

In general, it looks as if Fuxi can offer a complete solution for knowledge representation and reasoning over the semantic web; it is quite sophisticated and well documented (partly via several academic articles). The downside is that, for the purpose of hacking together a linked data application, Fuxi is probably just too complex and difficult to learn.

  • About Inferencing: a very short introduction to what Fuxi inferencing capabilities can do in the context of an rdf application.
ORDF ordf.org

    ORDF (download | docs) is the Open Knowledge Foundation‘s library of support infrastructure for RDF. It is based on RDFLib and contains an object-description mapper, support for multiple back-end indices, message passing, revision history and provenance, a namespace library and a variety of helper functions and modules to ease integration with the Pylons framework.

    The current version of this library is 0.35. You can have a peek at some of its key functionalities by checking out the ‘Object Description Mapper‘ – an equivalent to what an Object-Relational Mapper would give you in the context of a relational database. The library seems to be pretty solid; for an example of a system built on top of ORDF you can see Bibliographica, an online open catalogue of cultural works.

  • Why use RDF? The Design Considerations section in the ORDF documentation discusses the reasons that led to the development of this library in a clear and practical fashion.
Django-rdf http://code.google.com/p/django-rdf/

    Django-RDF (download | faq | discussiongroup) is an RDF engine implemented in a generic, reusable Django app, providing complete RDF support to Django projects without requiring any modifications to existing framework or app source code. The philosophy is simple: do your web development using Django just like you’re used to, then turn the knob and – with no additional effort – expose your project on the semantic web.

    Django-RDF can expose models from any other app as RDF data. This makes it easy to write new views that return RDF/XML data, and/or query existing models in terms of RDFS or OWL classes and properties using (a variant of) the SPARQL query language. SPARQL in, RDF/XML out – two basic semantic web necessities. Django-RDF also implements an RDF store using its internal models such as Concept, Predicate, Resource, Statement, Literal, Ontology, Namespace, etc. The SPARQL query engine returns query sets that can freely mix data in the RDF store with data from existing Django models.

    The major downside of this library is that it doesn’t seem to be maintained anymore; the last release is from 2008, and there seem to be various conflicts with recent versions of Django. A real shame!

    Djubby http://code.google.com/p/djubby/

    Djubby (download | docs) is a Linked Data frontend for SPARQL endpoints for the Django Web framework, adding a Linked Data interface to any existing SPARQL-capable triple stores.

    Djubby is quite inspired by Richard Cyganiak’s Pubby (written in Java): it provides a Linked Data interface to local or remote SPARQL protocol servers, it provides dereferenceable URIs by rewriting URIs found in the SPARQL-exposed dataset into the djubby server’s namespace, and it provides a simple HTML interface showing the data available about each resource, taking care of handling 303 redirects and content negotiation.

    Redland http://librdf.org/

    Redland (download | docs | discussiongroup) is an RDF library written in C and including several high-level language APIs providing RDF manipulation and storage. Redland makes available also a Python interface (intro | apidocs) that can be used to manipulate RDF triples.

This library seems to be quite complete and is actively maintained; the only potential downside is the installation process. In order to use the python bindings you need to install the C library too (which in turn depends on other C libraries), so (depending on your programming experience and operating system) just getting up and running might become a challenge.

    SuRF http://packages.python.org/SuRF/

    SuRF (install | docs) is an Object – RDF Mapper based on the RDFLIB python library. It exposes the RDF triple sets as sets of resources and seamlessly integrates them into the Object Oriented paradigm of python in a similar manner as ActiveRDF does for ruby.

    Other smaller (but possibly useful) python libraries for rdf:

  • Sparql Interface to python: a minimalistic solution for querying sparql endpoints using python (download | apidocs). UPDATE: Ivan Herman pointed out that this library has been discontinued and merged with the 'SPARQL Endpoint interface to Python' below.
  • SPARQL Endpoint interface to Python: another little utility for talking to a SPARQL endpoint, including having select-results mapped to rdflib terms or returned in JSON format (download) – see the sketch after this list
  • PySparql: again, a minimal library that does SELECT and ASK queries on an endpoint which implements the HTTP (GET or POST) bindings of the SPARQL Protocol (code page)
  • Sparta: a simple, resource-centric API for RDF graphs, built on top of RDFLIB.
  • Oort: another Python toolkit for accessing RDF graphs as plain objects, based on RDFLIB. The project homepage hasn't been updated for a while, although there is trace of recent activity on its google project page.
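As an example of what these endpoint libraries enable, here is a minimal sketch using the 'SPARQL Endpoint interface to Python' (the SPARQLWrapper package; illustrative code):

from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    SELECT ?label WHERE {
        <http://dbpedia.org/resource/Semantic_Web> rdfs:label ?label .
    } LIMIT 5
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["label"]["value"])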

    2. RDF Triplestores that are python-friendly

    An important component of a linked-data application is the triplestore (that is, an RDF database): many commercial and non-commercial triplestores are available, but only a few offer out-of-the-box python interfaces. Here’s a list of them:

    Allegro Graph http://www.franz.com/agraph/allegrograph/

    AllegroGraph RDFStore is a high-performance, persistent RDF graph database. AllegroGraph uses disk-based storage, enabling it to scale to billions of triples while maintaining superior performance. Unfortunately, the official version of AllegroGraph is not free, but it is possible to get a free version of it (it limits the DB to 50 million triples, so although useful for testing or development it doesn’t seem a good solution for a production environment).

    The Allegro Graph Python API (download | docs | reference) offers convenient and efficient access to an AllegroGraph server from a Python-based application. This API provides methods for creating, querying and maintaining RDF data, and for managing the stored triples.

  • A hands-on overview of what’s like to work with AllegroGraph and python can be found here: Getting started with AllegroGraph.
Open Link Virtuoso http://virtuoso.openlinksw.com/

    Virtuoso Universal Server is a middleware and database engine hybrid that combines the functionality of a traditional RDBMS, ORDBMS, virtual database, RDF, XML, free-text, web application server and file server functionality in a single system. Rather than have dedicated servers for each of the aforementioned functionality realms, Virtuoso is a “universal server”; it enables a single multithreaded server process that implements multiple protocols. The open source edition of Virtuoso Universal Server is also known as OpenLink Virtuoso.

    Virtuoso from Python is intended to be a collection of modules for interacting with OpenLink Virtuoso from python. The goal is to provide drivers for `SQLAlchemy` and `RDFLib`. The package is installable from the Python Package Index and source code for development is available in a mercurial repository on BitBucket.

  • A possibly useful example of using Virtuoso from python: SPARQL Guide for Python Developer.
Sesame http://www.openrdf.org/

Sesame is an open-source framework for querying and analyzing RDF data (download | documentation). Sesame supports two query languages: SeRQL and Sparql. Sesame's API differs from comparable solutions in that it offers a (stackable) interface through which functionality can be added, and the storage engine is abstracted from the query interface (many other Triplestores can in fact be used through the Sesame API).

It looks as if the best way to interact with Sesame is by using Java; however there is also a pythonic API called pySesame. This is essentially a python wrapper for Sesame's REST HTTP API, so the range of operations supported (log in, log out, request a list of available repositories, evaluate a SeRQL-select, RQL or RDQL query, extract/upload/remove RDF from a repository) is somewhat limited (for example, there does not seem to be any native SPARQL support).

  • A nice introduction to using Sesame with Python (without pySesame though) can be found in this article: Getting Started with RDF and SPARQL Using Sesame and Python.
Talis platform http://www.talis.com/platform/

The Talis Platform (faq | docs) is an environment for building next generation applications and services based on Semantic Web technologies. It is a hosted system which provides an efficient, robust storage infrastructure. Both arbitrary documents and RDF-based semantic content are supported, with sophisticated query, indexing and search features. Data uploaded to the Talis platform are organized into stores: a store is a grouping of related data and metadata. For convenience each store is assigned one or more owners, who are the people with the rights to configure the access controls over that data and metadata. Each store provides a uniform REST interface to the data and metadata it manages.

    Stores don’t come free of charge, but through the Talis Connected Commons scheme it is possible have quite large amounts of store space for free. The scheme is intended to support a wide range of different forms of data publishing. For example scientific researchers seeking to share their research data; dissemination of public domain data from a variety of different charitable, public sector or volunteer organizations; open data enthusiasts compiling data sets to be shared with the web community.

    Good news for pythonistas too: pynappl is a simple client library for the Talis Platform. It relies on rdflib 3.0 and draws inspiration from other similar client libraries. Currently it is focussed mainly on managing data loading and manipulation of Talis Platform stores (this blog post says more about it).

  • Before trying out the Talis platform you might find useful this blog post: Publishing Linked Data on the Talis Platform.
4store http://4store.org/

4store (download | features | docs) is a database storage and query engine that holds RDF data. It has been used by Garlik as their primary RDF platform for three years, and has proved itself to be robust and secure.
4store's main strengths are its performance, scalability and stability. It does not provide many features over and above RDF storage and SPARQL queries, but if you are looking for a scalable, secure, fast and efficient RDF store, then 4store should be on your shortlist.

4store offers a number of client libraries, among which there are two for python: first, HTTP4Store, a client for the 4store httpd service allowing for easy handling of sparql results and adding, appending and deleting graphs; second, py4s, although this seems to be a much more experimental library (geared towards multi-process queries).
Furthermore, there is also an application for the Django web framework, called django-4store, that makes it easier to query and load rdf data into 4store when running Django. The application offers some support for constructing sparql-based Django views.

  • This blog post shows how to install 4store: Getting Started with RDF and SPARQL Using 4store and RDF.rb.

    End of the survey.. have I missed out on something? Please let me know if I did – I’ll try to keep adding stuff to this list as I move on with my project work!

     
