Data science – Parerga und Paralipomena
http://www.michelepasin.org/blog
"At the core of all well-founded belief lies belief that is unfounded" – Wittgenstein

More Jupyter notebooks: pyvis and networkx
http://www.michelepasin.org/blog/2020/08/06/more-jupyter-notebooks-pyvis-and-networkx/
Thu, 06 Aug 2020

Lately I've been spending more time creating Jupyter notebooks that demonstrate how to use the Dimensions API for research analytics. In this post I'll talk a little bit about two cool Python technologies I've discovered for working with graph data: pyvis and networkx.

pyvis and networkx

The networkx and pyvis libraries are used for generating and visualizing network data, respectively.

Pyvis is fundamentally a Python wrapper around the popular JavaScript vis.js library. Networkx, on the other hand, is a pretty sophisticated package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks.

>>> from pyvis.network import Network
>>> import networkx as nx
# generate a generic networkx graph instance
>>> nx_graph = nx.Graph()
# add some nodes first, then attach titles and groups to them
>>> nx_graph.add_nodes_from([1, 2, 3])
>>> nx_graph.nodes[1]['title'] = 'Number 1'
>>> nx_graph.nodes[1]['group'] = 1
>>> nx_graph.nodes[3]['title'] = 'I belong to a different group!'
>>> nx_graph.nodes[3]['group'] = 10
# add more nodes and edges, with attributes set at creation time
>>> nx_graph.add_node(20, size=20, title='couple', group=2)
>>> nx_graph.add_node(21, size=15, title='couple', group=2)
>>> nx_graph.add_edge(20, 21, weight=5)
>>> nx_graph.add_node(25, size=25, label='lonely', title='lonely node', group=3)
# instantiate a pyvis network
>>> nt = Network("500px", "500px")
# populate the pyvis network from the networkx instance
>>> nt.from_nx(nx_graph)
>>> nt.show("nx.html")

It took me a little while to familiarise myself with the libraries' concepts and to generate some basic graphs. So the tutorials linked below are meant to provide some reusable code building blocks for working with these tools.

Once you get the hang of it, though, the fun part begins. What are the best data variables to represent in the graph? Which color-coding strategy makes it easier to explore the data? How many nodes/edges should be displayed? Can we add some interactivity to the visualizations? Check out the resulting visualizations below for more ideas.

Dataviz: concepts co-occurrence network

The Building a concepts co-occurrence network notebook shows how to turn document keywords extracted from 'semantic web' publications into a simple topic map – by virtue of their co-occurrence within the same documents.
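
The gist of the approach can be sketched in a few lines of networkx code. This is a minimal, hypothetical example – the concept lists below are made up, whereas the notebook extracts them from Dimensions publication records:

from itertools import combinations
import networkx as nx

# hypothetical input: one list of extracted concepts per publication
docs_concepts = [
    ["semantic web", "ontology", "linked data"],
    ["semantic web", "linked data", "sparql"],
    ["ontology", "linked data"],
]

g = nx.Graph()
for concepts in docs_concepts:
    # each pair of concepts appearing in the same document gets an edge;
    # the edge weight counts how many documents they co-occur in
    for a, b in combinations(sorted(set(concepts)), 2):
        if g.has_edge(a, b):
            g[a][b]['weight'] += 1
        else:
            g.add_edge(a, b, weight=1)

print(g.edges(data=True))

The resulting weighted graph can then be handed to pyvis (via from_nx, as in the snippet above) to obtain an interactive visualization.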

See also the standalone html version of the interactive visualization: concepts_network_2020-08-05.html


Dataviz: Organizations Collaboration Network

The Building an Organizations Collaboration Network Diagram notebook shows how to use publications’ authors and GRID data to generate a network of collaborating research organizations.

See also the standalone html version of the interactive visualization: network_2_levels_grid.412125.1.html


Getting to grips with Google Colab
http://www.michelepasin.org/blog/2020/01/30/getting-to-grips-with-google-colab/
Thu, 30 Jan 2020

I've been using Google Colab on a regular basis during the last few months, as I was curious to see whether I could make the switch to it (from a more traditional Jupyter/JupyterLab environment). As it turns out, Colab is pretty amazing in many respects, but there are still situations where a local Jupyter notebook is my first choice. Keep reading to discover why!

Google Colab VS Jupyter

Google Colaboratory (also known as Colab, see the FAQs) is a free Jupyter notebook environment that runs in the cloud and stores its notebooks on Google Drive.

Colab has become extremely popular with data scientists, in particular people doing some kind of machine learning work. Partly, I guess, that's because Colab has deep integration with Google's ML tools (e.g. TensorFlow); in fact Colab lets you switch to a Tensor Processing Unit (TPU) runtime when running your notebook. For FREE. Which, by itself, is pretty remarkable already.

There are tons of videos on YouTube and tutorials on Medium, so I'm not gonna describe it any further: there is definitely no shortage of learning materials if you want to find out more about it.

How I’m using Colab

I normally turn to notebooks because I need to demonstrate real-world applications of APIs to a (sometimes not-so-technical) audience. A lot of the work I've been doing lately has crystallized into the 'Dimensions API Labs' portal. This is essentially a collection of notebooks aimed at making it easier for people to extract and process the many kinds of data my company's APIs can deliver, and to turn them into actionable insights.

My usual workflow (sketched in code right after the list):

  • Getting some data by calling APIs, sometimes using custom-built Python packages;
  • Processing the data using pandas or built-in Python libraries;
  • Building visualizations and summaries e.g. using Plotly.
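
As a concrete illustration, a bare-bones version of that workflow might look like the snippet below. Everything here is hypothetical – the endpoint, parameters and field names are made up, and the real notebooks use the Dimensions API and purpose-built packages instead:

import requests
import pandas as pd
import plotly.express as px

# 1. get some data by calling an API (hypothetical endpoint and fields)
resp = requests.get("https://api.example.org/publications",
                    params={"q": "semantic web", "limit": 1000})
records = resp.json()["results"]

# 2. process the data with pandas
df = pd.DataFrame(records)
pubs_per_year = df.groupby("year").size().reset_index(name="publications")

# 3. build a quick visualization with Plotly
fig = px.bar(pubs_per_year, x="year", y="publications",
             title="Publications per year")
fig.show()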

My target audience:

  • Data scientists and developers who want to become proficient with our APIs.
  • Analysts and domain experts who are less technically advanced, but have the capacity to turn interesting research questions into queries and API-based workflows.

Read on to find out how Colab ticked a lot of the boxes for this kind of work.

Pros of Colab

In general, Jupyter notebooks are an ideal tool for showcasing API functionalities and data features. The ability to pack together code, images and text within a single runnable file makes the end result intuitive yet powerful.

Google Colab brings a number of extra benefits to the table:

  1. No install or setup. That was a massive selling point for me. If I have to share an API recipe with just about anyone, Colab allows me to do that very, very quickly, even with non-technical users. They just have to open a webpage, hit 'play' and run the notebook. Moreover, Colab includes many popular Python libraries by default and, if you need to, you can pip-install your own favorite ones too. Neat!
  2. It scales well. I ran a couple of workshops recently with 30+ users, without any performance issues. Compared to setting up a JupyterHub server, for example, it's much easier, and cheaper too, of course. Plus, people can go home and re-run the same notebooks in virtually the same exact environment. No need to fiddle with Python, Docker or Jupyter packages.
  3. Sharing and commenting. The collaborative features of Colab need no introduction. Just think of how easy it is to share a Google Doc with your colleagues, only in this case you’d do it with a notebook!
  4. Playground mode. Colab introduced the notion of playground mode, which essentially allows you to open a notebook in read-only mode (trying to save throws the error “This notebook is in playground mode. Changes will not be saved unless you make a copy of the notebook.”). I find this feature extremely handy for demos, or in situations where one needs to mess about with a notebook without the risk of overwriting its ‘stable’ state.
  5. Snippets. Colab includes a sidebar with many useful code snippets by default. You can extend it easily by creating your own 'snippets' notebook, going to Tools > Preferences, pasting the snippets notebook URL in 'Custom snippet notebook URL' and saving. Simple and effective. And the new snippets can be shared with teammates too!
  6. Extra UI components. The Colab folks developed a syntax for generating form components from special code comments. This is very cool because it lets you generate simple input boxes, which can be used, for example, by non-technical people to enter data into a notebook (see the example after this list). It's also worth pointing out that forms are created using comment-like code (e.g. #@param {type:"string"}), so they don't interfere with the notebook if you open it within a traditional Jupyter environment.
  7. The Google ecosystem. The integration with the rest of the G-Suite is unsurprisingly amazing, so pulling/putting data in and out of Drive, Sheets or BigQuery is quick, easy and well-documented.
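
For instance, a form-enabled cell looks roughly like this (the variable names are made up for illustration; in Colab the #@param annotations render as input widgets, while in plain Jupyter they are simply ignored and the default values are used):

#@title A simple input form
search_term = "semantic web"  #@param {type:"string"}
max_results = 100  #@param {type:"integer"}

print(f"Searching for '{search_term}' (up to {max_results} results)")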

Cons of Colab

  1. Performance limitations. Of course the performance will never be as good as running things locally (having said that, you can even use GPUs for free, although I haven't tried that yet). So for bigger projects, e.g. involving complex algorithms or very large datasets, other data science platforms are probably better, e.g. Gigantum.
  2. Interface learning curve. You have to get used to the Colab interface. It somehow still feels a bit more fiddly than JupyterLab to me. Keyboard shortcuts can be a problem too: you can customize them in Colab, but I couldn't replicate all of my (rather heavily customized) JupyterLab ones, due to conflicts with other default ones in Colab. So some muscle-memory pain there.
  3. Exporting to HTML is not that good. Being able to turn Jupyter notebooks into a simple HTML file is pretty handy, but Colab can't do that. You can of course download the .ipynb file and then export locally (via nbconvert, as sketched after this list), but that doesn't always produce the results you'd expect either. For example, Plotly visualizations (like this one) don't render properly unless I run the whole notebook locally in JupyterLab before exporting.
  4. Some Python libraries won't work out of the box. For example, I have a Python library called dimcli that builds on the latest prompt-toolkit. It turns out that Colab, by default, runs IPython 5.5.0 (the latest version is 7), which is incompatible with prompt-toolkit. You can of course upgrade everything on Colab (e.g. pip install --upgrade --force-reinstall library-name), which is great, however that may lead to further dependency errors... and so on.
  5. Project versioning. Colab includes a built-in revision history tool, and it can integrate with Github too. Yet I often end up creating multiple copies/versions of a notebook instead of relying on the revisions system. I wish there was a better way to do this...
  6. The Google ecosystem. As much as this can be a massive plus for some people (see above), it can also be a massive problem for others. Some customers I work with don’t have access to G-Suite, full stop. That’s not so uncommon, especially with large enterprises that are concerned about data privacy.
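
As a reference for the local export route mentioned in point 3, here is a minimal sketch using nbconvert's Python API (the filename is a placeholder; the equivalent command line is jupyter nbconvert --to html my_notebook.ipynb):

from nbconvert import HTMLExporter

# convert a locally saved notebook into a standalone HTML file
exporter = HTMLExporter()
body, resources = exporter.from_filename("my_notebook.ipynb")

with open("my_notebook.html", "w", encoding="utf-8") as f:
    f.write(body)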

Conclusions

Google Colab is simply great for small/medium data projects. Hats off to the developers who built it. Some features are totally neat, and especially when I intend to share whatever I'm doing with more than one person, I hit my New Colab Document shortcut right away.

Nonetheless, I still use JupyterLab a lot, for a variety of projects. E.g. for quick personal data investigations. Or for projects that I know will be shared only with other data scientists (who need no guidance in order to run them). Or for projects with long-running processes and heavy memory consumption.

So the two things need to coexist. Which opens up a new problem: how to seamlessly move from one environment to the other? That's still an open question for me, but you'll find what I've learned so far in the following section.

Appendix: Colab and JupyterLab happily co-existing

Is this too much to ask? Here's what I've worked out so far:

  • I try to put all of my notebooks in Google Drive, so that they are accessible by Colab.
  • I sync my Google Drive to my laptop, so I've got everything locally as well (ps: I sync Drive to one computer only, so as to avoid double/out-of-sync issues).
  • I have several folders containing notebooks. Some of these folders are actually Github repositories too. They seem to sync over Drive without issues (so far!)
  • This setup means that I can either work on Colab (thanks to Drive) or local Jupyter (via the local sync) depending on my needs. I can even start working on something locally and then complete it on Colab. The .ipynb files are 100% compatible (almost – see above the exception about rendered visualizations)
  • Any Colab-specific code does not break Jupyter. There is some redundancy on occasion (e.g. pip-installing libraries on Colab which I already have on my laptop) but that's fine. It's also possible to use expressions like `if not 'google.colab' in sys.modules` to run code selectively based on the platform, as sketched below (e.g. see here).
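
Here is a minimal sketch of that pattern; the branches just print a message, but in practice they would contain whatever setup is platform-specific (pip installs, Drive mounting, local paths, etc.):

import sys

IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    # Colab-specific setup goes here, e.g. pip installs or mounting Google Drive
    print("Running on Google Colab")
else:
    # local Jupyter / JupyterLab: libraries are assumed to be installed already
    print("Running in a local Jupyter environment")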

 

Comments?

As usual I’d love to hear them :-)

 

 

Calculating Industry Collaborations via GRID
http://www.michelepasin.org/blog/2020/01/08/calculating-industry-collaborations-via-grid/
Wed, 08 Jan 2020

A new tutorial demonstrating how to extract and visualize data about industry collaborations, by combining Dimensions data with GRID.

Dimensions uses GRID (the Global Research Identifiers Database) to unambiguously identify research organizations. GRID includes a wealth of data, for example whether an organization has type ‘Education’ or ‘Industry’. So it’s pretty easy to take advantage of these metadata in order to highlight collaboration patterns between a selected university and other organizations from the industry sector.
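
In pandas terms, the filtering step boils down to something like this. The dataframe below is entirely made up for illustration; the notebook builds the real one from Dimensions publication records enriched with GRID metadata:

import pandas as pd

# hypothetical dataframe of organizations co-authoring with a chosen university
collaborators = pd.DataFrame([
    {"grid_id": "grid.000001.1", "name": "Some University", "type": "Education", "joint_pubs": 120},
    {"grid_id": "grid.000002.2", "name": "Acme Pharma Ltd", "type": "Industry", "joint_pubs": 15},
    {"grid_id": "grid.000003.3", "name": "Example Robotics Inc", "type": "Industry", "joint_pubs": 8},
])

# keep only the industry collaborators, ranked by number of joint publications
industry = (collaborators[collaborators["type"] == "Industry"]
            .sort_values("joint_pubs", ascending=False))
print(industry[["name", "joint_pubs"]])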

The open source Jupyter notebook can be adapted so as to focus on any research organization: many of us are linked to some university, hence it's quite interesting to explore which non-academic organizations are related to it.

For example, see above a Plotly visualization of the industry collaborators for the University of Trento, Italy (you can also open it in a new tab).

The Dimensions API can be accessed for free for non-commercial research projects.

 

Introducing DimCli: a Python CLI for the Dimensions API
http://www.michelepasin.org/blog/2019/05/24/introducing-dimcli-a-python-cli-for-dimensions-api/
Fri, 24 May 2019

For the last couple of months I've been working on a new open source Python project. It's called DimCli and it's a library aimed at making it simpler to work with the Dimensions Analytics API.

The project is available on Github. In a nutshell, DimCli helps people become productive with the powerful scholarly analytics API from Dimensions. See the video below for a quick taster of the functionalities available.

Background

I recently joined the Dimensions team, so I needed a way to get to grips with their feature-rich API (official docs). Building DimCli has been a fun way for me to dig into the logic of the Dimensions Search Language (DSL).

Plus, this project gave me a chance to learn more about two awesome Python technologies: JupyterLab and its magic commands, as well as the Python Prompt Toolkit library.


Features

In a nutshell, this is what DimCli has to offer:

  • It's an interactive query console for the Dimensions Analytics API (ps: Dimensions is a world-class research-data platform including information about millions of documents like publications, patents, grants, clinical trials and policy documents).
  • It helps you learn the Dimensions Search Language (DSL) thanks to a built-in autocomplete and documentation search mechanism.
  • It handles authentication transparently, either via a global user-specific credentials file or by passing credentials manually (e.g. when used within shared environments).
  • It lets you export results to CSV, JSON and pandas dataframes, hence making it easier to integrate with other data analysis tools.
  • It is compatible with Jupyter, e.g. it includes various magic commands that make it super simple to interrogate Dimensions (various examples here, and a rough usage sketch below).
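
To give a flavour of what that looks like in practice, here is a rough sketch of querying the API with DimCli from a notebook or script (the query is only an example; check the official documentation for the up-to-date API and DSL syntax):

import dimcli

# authenticate (credentials are read from the dimcli configuration file,
# or can be passed explicitly to dimcli.login)
dimcli.login()

dsl = dimcli.Dsl()
result = dsl.query('search publications for "semantic web" return publications limit 10')

# results can be exported, e.g. as a pandas dataframe
df = result.as_dataframe()
print(df.head())
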
Feedback

DimCli lives on Github, so for any feedback or bug reports, feel free to open an issue there.
Running interactive Jupyter demos with mybinder.org
http://www.michelepasin.org/blog/2019/05/03/running-jupyter-demos-via-mybinder-org/
Fri, 03 May 2019

The online tool mybinder.org allows you to turn a Git repo into a collection of interactive notebooks with one click.


I played with it a little today and was pretty impressed! It's a very useful tool, e.g. if you have a repository of Jupyter notebooks and want to showcase them to someone with no access to a Jupyter environment.


See the official docs for more info.

I was able to run many of the Dimensions API notebooks with little or no changes (follow this link to try them out yourself). Dependencies can be loaded on the fly, and new files (e.g. local settings) can be created just as if you were working within a normal Jupyter notebook.

Worth keeping in mind these limitations (from the official FAQ):

    How much memory am I given when using Binder?

    If you or another Binder user clicks on a Binder link, the mybinder.org deployment will run the linked repository. While running, users are guaranteed at least 1GB of RAM, with a maximum of 2GB. This means you will always have 1GB, you may occasionally have between 1 and 2GB, and if you go over 2GB your kernel will be restarted.

    How long will my Binder session last?

    Binder is meant for interactive and ephemeral interactive coding, meaning that it is ideally suited for relatively short sessions. Binder will automatically shut down user sessions that have more than 10 minutes of inactivity (if you leave your window open, this will be counted as “activity”).

    Binder aims to provide at least 12 hours of session time per user session. Beyond that, we cannot guarantee that the session will remain running.

    Can I use mybinder.org for a live demo or workshop?

    For sure! We hope the demo gods are with you. Please do make sure you have a backup plan in case there is a problem with mybinder.org during your workshop or demo. Occasionally, service on mybinder.org can be degraded, usually because the server is getting a lot of attention somewhere on the internet, because we are deploying new versions of software, or the team can’t quickly respond to an outage.

Absolutely a big thank you to the mybinder community!
