TechLife – Parerga und Paralipomena
http://www.michelepasin.org/blog
At the core of all well-founded belief lies belief that is unfounded - Wittgenstein

Getting to grips with Google Colab
http://www.michelepasin.org/blog/2020/01/30/getting-to-grips-with-google-colab/ – Thu, 30 Jan 2020 13:28:27 +0000

I’ve been using Google Colab on a regular basis during the last few months, as I was curious to see whether I could make the switch to it (from a more traditional Jupyter/JupyterLab environment). As it turns out, Colab is pretty amazing in many respects, but there are still situations where a local Jupyter notebook is my first choice. Keep reading to discover why!

Google Colab VS Jupyter

Google Colaboratory (also known as Colab, see the FAQs) is a free Jupyter notebook environment that runs in the cloud and stores its notebooks on Google Drive.

Colab has become extremely popular with data scientists, and in particular with people doing some kind of machine learning. Partly, I guess, that’s because Colab has deep integration with Google’s ML tools (eg Tensorflow); in fact Colab even lets you switch to a Tensor Processing Unit (TPU) when running your notebook. For FREE. Which, by itself, is pretty remarkable already.

There are tons of videos on YouTube and tutorials on Medium, so I’m not gonna describe Colab any further here – there is definitely no shortage of learning materials if you want to find out more about it.

How I’m using Colab

I normally turn to notebooks because I need to demonstrate real-world applications of APIs to a (sometimes not-so-technical) audience. A lot of the work I’ve been doing lately has crystallized into the ‘Dimensions API Labs‘ portal. This is essentially a collection of notebooks aimed at making it easier for people to extract, process and turn into actionable insights the many kinds of data my company’s APIs can deliver.

My usual workflow:

  • Getting some data by calling APIs, sometimes using custom-built Python packages;
  • Processing the data using pandas or built-in Python libraries;
  • Building visualizations and summaries e.g. using Plotly.

My target audience:

  • Data scientists and developers who want to become proficient with our APIs.
  • Analysts and domain experts who are less technically advanced, but have the capacity to turn interesting research questions into queries and API-based workflows.

Read on to find out how Colab ticked a lot of the boxes for this kind of work.

Pros of Colab

In general, Jupyter notebooks are an ideal tool for showcasing API functionalities and data features. The ability to pack together code, images and text within a single runnable file makes the end result intuitive yet powerful.

Google Colab brings a number of extra benefits to the table:

  1. No install, no set up. That was a massive selling point for me. If I have to share an API recipe with just about anyone, Colab allows me to do that very quickly, even with non-technical users. They just have to open up a webpage, hit ‘play’ and run the notebook. Moreover, Colab includes many popular Python libraries by default and, if you need to, you can pip-install your own favorite ones too. Neat!
  2. It scales well. I ran a couple of workshops recently with 30+ users, without any performance issues. Compared to setting up a JupyterHub server, for example, it’s much easier, and cheaper too, of course. Plus, people can go home and re-run the same notebooks within virtually the same environment. No need to fiddle with Python, Docker or Jupyter packages.
  3. Sharing and commenting. The collaborative features of Colab need no introduction. Just think of how easy it is to share a Google Doc with your colleagues, only in this case you’d do it with a notebook!
  4. Playground mode. Colab introduced the notion of playground mode, which essentially allows you to open a notebook in read-only mode (trying to save throws the error “This notebook is in playground mode. Changes will not be saved unless you make a copy of the notebook.”). I find this feature extremely handy for demos, or in situations where one needs to mess about with a notebook without the risk of overwriting its ‘stable’ state.
  5. Snippets. Colab includes a sidebar with many useful code snippets by default. You can extend that easily by creating your own ‘snippets’ notebook, going to Tools > Preferences, pasting the snippets notebook URL into Custom snippet notebook URL and saving. Simple and effective. And the new snippets can be shared with teammates too!
  6. Extra UI components. The Colab folks developed a syntax for generating Forms components using markdown. This is very cool because it lets you generate simple input boxes, which can be used, for example, by non-technical people to enter data into a notebook. It is also worth pointing out that forms are created using comment-like code (eg #@param {type:"string"}), so they don’t interfere with the notebook if you open it within a traditional Jupyter environment (see the short example after this list).
  7. The Google ecosystem. The integration with the rest of the G-Suite is unsurprisingly amazing so pulling/putting data in and out of Drive, Sheets or BigQuery  is quick, easy and well-documented.
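
For instance, here is a minimal sketch of the forms syntax (the parameter names and values are just illustrative):

    #@title A tiny Colab form example { run: "auto" }
    query = "malaria" #@param {type:"string"}
    max_results = 100 #@param {type:"integer"}
    include_preprints = True #@param {type:"boolean"}

    # In plain Jupyter these are just comments, so the cell still runs unchanged.
    print(query, max_results, include_preprints)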

Cons of Colab

  1. Performance limitations. Of course the performance will never be as good as running things locally (having said that, you can even use GPUs for free, though I haven’t tried that yet). So for bigger projects, e.g. those involving complex algorithms or very large datasets, other data science platforms are probably better, e.g. Gigantum.
  2. Interface learning. You have to get used to the Colab interface. It somehow still feels a bit more fiddly than JupyterLab, to me. Keyboard shortcuts can be a problem too: you can customize them in Colab, but I couldn’t replicate all of my (rather heavily customized) JupyterLab ones, due to conflicts with other default ones in Colab. So some muscle-memory pain there.
  3. Exporting to HTML is not that good. Being able to turn Jupyter notebooks into a simple HTML file is pretty handy, but Colab can’t do that. You can of course download the .ipynb file and then export locally (via nbconvert), but that doesn’t always produce the results you’d expect either. For example, Plotly visualizations (like this one) do not render properly unless I run the whole notebook locally in JupyterLab before exporting.
  4. Some Python libraries won’t work out of the box. For example, I have a Python library called dimcli that builds on the latest prompt-toolkit. It turns out that Colab, by default, runs IPython 5.5.0 (the latest version is 7), which is incompatible with prompt-toolkit. You can of course upgrade everything on Colab (eg pip install --upgrade --force-reinstall library-name) – which is great – however that may lead to further dependency errors, and so on (see the example cell after this list).
  5. Project versioning. Colab includes a built-in revision history tool, and it can integrate with Github too. Yet I often end up creating multiple copies/versions of a notebook, instead of relying on the revisions system. I wish there were a better way to do this.
  6. The Google ecosystem. As much as this can be a massive plus for some people (see above), it can also be a massive problem for others. Some customers I work with don’t have access to G-Suite, full stop. That’s not so uncommon, especially with large enterprises that are concerned about data privacy.
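
As a concrete illustration, this is roughly what the upgrade dance looks like in a Colab cell (a sketch only: the exact packages to pin, and whether a runtime restart is needed afterwards, may vary):

    # Check what Colab currently ships
    import IPython
    print(IPython.__version__)

    # Force-upgrade the conflicting packages (then Runtime > Restart runtime)
    !pip install --upgrade --force-reinstall prompt-toolkit ipython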

Conclusions

Google Colab is simply great for small/medium data projects. Hats off to the developers who built it. Some features are totally neat, and especially when I intend to share whatever I’m doing with more than one person, I hit my New Colab Document shortcut right away.

Nonetheless, I still use JupyterLab a lot, for a variety of projects. E.g. for quick personal data investigations. Or for projects that I know will be shared only with other data scientists (who need no guidance in order to run them). Or for projects with long-running processes and heavy memory consumption.

So the two things need to coexist. Which opens up a new problem: how to seamlessly move from one environment to the other? That’s still an open question for me, but you’ll find what I’ve learned so far in the following section.

Appendix: Colab and JupyterLab happily co-existing

Is this too much to ask? This is what I worked out so far:

  • I try to put all of my notebooks in Google Drive, so that they are accessible by Colab.
  • I sync my Google Drive to my laptop, so I’ve got everything locally as well (ps: I sync Drive to one computer only, so as to avoid duplicate/out-of-sync issues).
  • I have several folders containing notebooks. Some of these folders are actually Github repositories too. They seem to sync over Drive without issues (so far!)
  • This setup means that I can either work on Colab (thanks to Drive) or on local Jupyter (via the local sync), depending on my needs. I can even start working on something locally and then complete it on Colab. The .ipynb files are 100% compatible (almost – see the exception above about rendered visualizations).
  • Any Colab-specific code does not break Jupyter. There is some redundancy on occasion (eg pip-installing libraries on Colab which I already have on my laptop) but that’s fine. It’s also possible to use an expression like `if 'google.colab' not in sys.modules` to run code selectively based on the platform (eg see here, and the sketch below).
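
For the record, this is the kind of guard I mean. A minimal sketch (the Drive-mounting branch is just an example of something you would only do on Colab):

    import sys

    IN_COLAB = 'google.colab' in sys.modules

    if IN_COLAB:
        # Colab-only setup, e.g. mounting Google Drive
        from google.colab import drive
        drive.mount('/content/drive')
    else:
        # Local JupyterLab: nothing to do, the files are already on disk
        print("Running in a local Jupyter environment")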

 

Comments?

As usual I’d love to hear them :-)

 

 

Pypapers: a bare-bones, command line, PDF manager
http://www.michelepasin.org/blog/2019/06/30/pypapers-a-bare-bones-command-line-pdf-manager/ – Sun, 30 Jun 2019 22:48:40 +0000

Ever felt like software like Mendeley or Papers is great, but somehow slows you down? Ever felt like none of the many reference manager applications out there will ever cut it for you, cause you need something R E A L L Y SIMPLE? I did. Many times. So I’ve finally crossed the line and tried building a simple command-line PDF manager. It’s called PyPapers.

Yes – that’s right – command line. So not for everyone. Also: this is bare bones and pre-alpha. So don’t expect wonders. It basically provides a simple interface for searching a folder full of PDFs. That’s all for now!

 

Key features (or lack thereof)

  • Mac only, I’m afraid. I’m standing on the shoulders of a giant, that is, mdfind (see the sketch after this list).
  • No-fuss search, either in file names only or in the full text
  • Shows all results and relies on Preview for reading
  • Highlighting in Preview works pretty damn fine and it’s the ultimate compatibility solution (any other software kind of locks you in eventually, imho)
  • Open source. If you can code Python you can customise it to your needs. If you can’t, open an issue on Github and I may end up doing it.
  • It recognises sub-folders, so these can be leveraged as a simple, filesystem-level categorization structure for your PDFs (eg I have different folders for articles, books, news etc.)
  • Your PDFs ultimately live in the Mac filesystem, so you can always search them using Finder in case you get bored of the command line.
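
To give an idea of the approach, here is a rough sketch of the underlying idea (not the actual PyPapers source; the function name is made up):

    import subprocess

    def search_pdfs(query, folder, names_only=False):
        """Search a folder of PDFs via Spotlight's mdfind (macOS only)."""
        cmd = ["mdfind", "-onlyin", folder]
        if names_only:
            cmd += ["-name", query]      # match file names only
        else:
            cmd.append(query)            # full-text search via the Spotlight index
        out = subprocess.run(cmd, capture_output=True, text=True).stdout
        return [p for p in out.splitlines() if p.lower().endswith(".pdf")]

    # e.g. search_pdfs("semantic web", "/Users/me/Papers")
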
First impressions

Pretty good. I was concerned I was gonna miss things like collections or tags, but I found a workaround: first, identify the papers I am interested in; then create a folder in the same directory and symlink them in there (i.e. create an alias).

    It’s not quite like uncarved wood, but it definitely feels simple enough.

     

     

Introducing DimCli: a Python CLI for the Dimensions API
http://www.michelepasin.org/blog/2019/05/24/introducing-dimcli-a-python-cli-for-dimensions-api/ – Fri, 24 May 2019 11:10:15 +0000

For the last couple of months I’ve been working on a new open source Python project. It’s called DimCli and it’s a library aimed at making it simpler to work with the Dimensions Analytics API.

The project is available on Github. In a nutshell, DimCli helps people become productive with the powerful scholarly analytics API from Dimensions. See the video below for a quick taster of the functionalities available.

    Background

I recently joined the Dimensions team, so I needed a way to get to grips with their feature-rich API (official docs). Building DimCli has been a fun way for me to dig into the logic of the Dimensions Search Language (DSL).

    Plus, this project gave me a chance to learn more about two awesome Python technologies: JupyterLab and its magic commands, as well as the Python Prompt Toolkit library.


    Features

    In a nutshell, this is what DimCli has to offer:

  • It’s an interactive query console for the Dimensions Analytics API (ps: Dimensions is a world-class research-data platform including information about millions of documents like publications, patents, grants, clinical trials and policy documents).
  • It helps learning the Dimensions Search Language (DSL) thanks to a built-in autocomplete and documentation search mechanism.
  • It handles authentication transparently either via a global user-specific credentials file, or by passing credentials manually (e.g. when used within shared environments).
  • It allows you to export results to CSV, JSON and pandas dataframes, hence making it easier to integrate with other data analysis tools.
  • It is compatible with Jupyter, e.g. it includes various magic commands that make it super simple to interrogate Dimensions (various examples here, and a small taster right after this list).
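
To give a flavour of the Python API, usage looks roughly like this (written from memory of the Github README, so treat the method names and the query text as indicative rather than authoritative):

    import dimcli

    # Credentials can also be picked up from the global credentials file mentioned above
    dimcli.login(key="my-secret-key", endpoint="https://app.dimensions.ai")

    dsl = dimcli.Dsl()
    res = dsl.query('search publications for "malaria" return publications limit 5')
    df = res.as_dataframe()   # plays nicely with pandas
    print(df.head())
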
Feedback

    DimCli lives on Github, so for any feedback or bug reports, feel free to open an issue there.


     

OntoSpy v.1.7.4
http://www.michelepasin.org/blog/2017/02/27/ontospy-v-1-7-4/ – Mon, 27 Feb 2017 07:59:52 +0000

A new version of OntoSpy (1.7.4) is available online. OntoSpy is a lightweight Python library and command line tool for inspecting and visualising vocabularies encoded in the RDF family of languages.

    This version includes a hugely improved API for creating nice-looking HTML or Markdown documentation for an ontology, which takes advantage of frameworks like Bootstrap and Bootswatch.

You can take a look at the examples page to see what I’m talking about.


     

    To find out more about Ontospy:

  • CheeseShop: https://pypi.python.org/pypi/ontospy
  • Github: https://github.com/lambdamusic/ontospy

Here’s a short video showing a typical session with the OntoSpy repl:

    Coming up next

  • More advanced ontology visualisations using d3 or similar javascript libraries;
  • A better separation between the core Python library in OntoSpy and the other components. This partly addresses the fact that the OntoSpy package has grown a bit too much, in particular from the point of view of people who are only interested in using it to create their own applications, as opposed (for example) to reusing the built-in visualisations.

Of course, any comments or suggestions are welcome as usual – either using the form below or via GitHub. Cheers!

     

How to copy snippets from Github Gists to Dash
http://www.michelepasin.org/blog/2016/12/24/extracting-snippets-from-github-gists-to-dash/ – Sat, 24 Dec 2016 18:15:37 +0000

If you’re a Dash for macOS user, here’s a little script to copy existing code snippets saved as Github Gists into the Dash snippets database.

Dash for macOS is an application that lets you keep a local library of documentation for a multitude of programming frameworks and libraries, so that you can search it quickly using an intuitive offline interface.


Dash has a feature for creating and managing code snippets – there are many alternatives out there for this – but the fact that you can store snippets alongside other documentation is probably what makes it a winner in this case.

Anyhow, since I’ve been collecting lots of snippets as Github Gists, I thought it’d be nice to load them up into Dash so as to test it out a bit more!

Notes:

  • the solution below does not extract tags information at the moment;
  • it’s always a good idea to make a backup copy of the Dash database before messing with it – then just update the value of the DASH_DATABASE parameter in the script and start the extraction;
  • inspired by another gist: https://raw.github.com/gist/5466075/gist-backup.py

     

    So here we go:
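
In outline, the script does something like this (a simplified sketch: the Github API calls are standard, while the actual INSERT into the Dash snippets database depends on its SQLite schema, which you should inspect in your own backup copy; the paths below are placeholders):

    import requests, sqlite3

    GITHUB_USER = "lambdamusic"                 # your Github username
    DASH_DATABASE = "/path/to/Snippets.dash"    # back this up first!

    def fetch_gists(user):
        """Yield (title, filename, code) for each file in the user's public gists."""
        gists = requests.get("https://api.github.com/users/%s/gists" % user).json()
        for g in gists:
            for fname, finfo in g["files"].items():
                code = requests.get(finfo["raw_url"]).text
                yield g["description"] or fname, fname, code

    conn = sqlite3.connect(DASH_DATABASE)
    for title, fname, code in fetch_gists(GITHUB_USER):
        # The INSERT is schema-dependent: check the snippets table in your
        # Dash database and adapt the column names accordingly, e.g.
        # conn.execute("INSERT INTO snippets (title, body, syntax) VALUES (?, ?, ?)",
        #              (title, code, "python"))
        print("Importing:", title)
    conn.commit()
    conn.close()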

     

Apple Keynote: extracting presenter notes to Markdown
http://www.michelepasin.org/blog/2016/11/03/apple-keynote-extracting-presenter-notes-to-markdown/ – Thu, 03 Nov 2016 08:05:52 +0000

Here’s a simple AppleScript that makes it easier to extract presenter notes from a Keynote presentation and save them to a nice Markdown document.

    If you’re using Apple Keynote, you may have noticed that there isn’t an easy way to extract the presenter notes attached to your slides (of course you could do it manually, one slide at a time, but that’s pretty tedious!).

The following AppleScript code allows you to automate this action by pulling out all presenter notes from an open presentation and saving them to a Markdown file.


    Inspired by:

  • http://apple.stackexchange.com/questions/136118/how-to-print-full-presenter-notes-without-slides-in-keynote
  • https://gist.github.com/benwaldie/9955151

     

SpotiSci: finding science concepts on Spotify
http://www.michelepasin.org/blog/2016/04/29/spotisci-finding-science-concepts-on-spotify/ – Fri, 29 Apr 2016 22:43:12 +0000

Ever wondered how many musical albums focus on topics like the moon landing, artificial intelligence or DNA replication? Probably not to everyone’s taste, but if you give it a shot you’ll be surprised by the results.

    When I ran into the excellent Spotipy library (a small yet nifty Python client for the Spotify Web API) I couldn’t wait to try it out with some fun project.

So that’s how the SpotiSci experiment came about: essentially a search tool that lets you query Nature.com‘s one-million-article archive while at the same time browsing the vast selection of music available on Spotify.
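
The Spotify side boils down to very little code thanks to Spotipy. A minimal sketch of an album search (not the actual SpotiSci code; it assumes the standard SPOTIPY_CLIENT_ID / SPOTIPY_CLIENT_SECRET environment variables are set):

    import spotipy
    from spotipy.oauth2 import SpotifyClientCredentials

    sp = spotipy.Spotify(client_credentials_manager=SpotifyClientCredentials())

    results = sp.search(q="moon landing", type="album", limit=10)
    for album in results["albums"]["items"]:
        artists = ", ".join(a["name"] for a in album["artists"])
        print("%s - %s" % (artists, album["name"]))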

    Have a good listen. You may find the right soundtrack for your science.


     

Accessing OS X dictionary with Python
http://www.michelepasin.org/blog/2015/11/28/accessing-os-x-dictionary-with-python/ – Sat, 28 Nov 2015 15:57:06 +0000

A little script that lets you access the OS X Dictionary app using Python.

    Tip: make the script executable and add an alias for it in order to be able to call it from the command line easily.
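
The core of such a script is just a couple of lines via the PyObjC bridge (a sketch from memory: DCSCopyTextDefinition comes from the CoreServices DictionaryServices framework, and the exact PyObjC package providing that module may vary with your Python setup):

    #!/usr/bin/env python
    import sys
    from DictionaryServices import DCSCopyTextDefinition

    def define(word):
        """Look a word up in the default OS X Dictionary."""
        return DCSCopyTextDefinition(None, word, (0, len(word))) or "No definition found."

    if __name__ == "__main__":
        print(define(" ".join(sys.argv[1:]) or "serendipity"))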

     

Dereference a DOI using python
http://www.michelepasin.org/blog/2014/12/03/dereference-a-doi-using-python/ – Wed, 03 Dec 2014 16:49:44 +0000

A little python script that lets you pass an article DOI and obtain all the metadata related to that article.

    The script relies on the handy crosscite.org API, which is one of the wonderful services provided by CrossRef.
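
The essence of the script is a single HTTP request using DOI content negotiation, as documented on crosscite.org. A minimal sketch (the DOI below is just a placeholder):

    import requests

    def doi_metadata(doi):
        """Resolve a DOI to CSL-JSON metadata via content negotiation."""
        r = requests.get("https://doi.org/" + doi,
                         headers={"Accept": "application/vnd.citationstyles.csl+json"},
                         timeout=10)
        r.raise_for_status()
        return r.json()

    # print(doi_metadata("10.xxxx/your-doi-here")["title"])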

     

Installing Stardog triplestore on mac os
http://www.michelepasin.org/blog/2014/11/06/installing-stardog-triplestore-on-mac-os/ – Thu, 06 Nov 2014 16:02:49 +0000

Stardog is an enterprise-level triplestore developed by clarkparsia.com. It combines tools to store and query RDF data with more advanced features for inference and data analytics, in particular via the built-in Pellet Java reasoner. All of this, combined with a user experience which is arguably the best you can currently find on the market.

    1. Requirements

    OSX: Mavericks 10.9.5 (that’s what I used, but it’ll work on older versions too).
    JAVA: available from Apple.
    Stardog: grab the free community edition at http://www.stardog.com/ (you can also get the ‘developer’ version for a 30-days trial, which is actually what I did).

    2. Setting up

    Good news, it can’t get any simpler than this. Just unpack the Stardog installer, and you’re pretty much done (see the online docs for more info).

    Stardog needs to know where to store its databases, so you do that by adding a couple of lines to your .bash_profile file:

    
    export STARDOG_HOME="/Users/michele.pasin/Data/Stardog"  # databases will be stored here
    export PATH="/Applications/stardog-2.2.2/bin:$PATH"  # add stardog commands to the path
    alias cdstardog="cd /Applications/stardog-2.2.2"  # just a handy shortcut
    
    

    Finally, copy the license key file (which should have come together with the installer) into the data folder:

    $ cp stardog-license-key.bin $STARDOG_HOME
    

    3. Running Stardog

    The stardog-admin server start command is used to start and stop the server. Then you can use the stardog-admin db create command to create a DB and load some data. For example:

    
    [michele.pasin]@Tartaruga:~>cdstardog 
    
    [michele.pasin]@l5611:/Applications/stardog-2.2.2>stardog-admin server start
    
    ************************************************************
    This copy of Stardog is licensed to MIk (michele.pasin@gmail.com), michelepasin.org
    This is a Community license
    This license does not expire.
    ************************************************************
    
                                                                 :;   
                                          ;;                   `;`:   
      `'+',    ::                        `++                    `;:`  
     +###++,  ,#+                        `++                    .     
     ##+.,',  '#+                         ++                     +    
    ,##      ####++  ####+:   ##,++` .###+++   .####+    ####++++#    
    `##+     ####+'  ##+#++   ###++``###'+++  `###'+++  ###`,++,:     
     ####+    ##+        ++.  ##:   ###  `++  ###  `++` ##`  ++:      
      ###++,  ##+        ++,  ##`   ##;  `++  ##:   ++; ##,  ++:      
        ;+++  ##+    ####++,  ##`   ##:  `++  ##:   ++' ;##'#++       
         ;++  ##+   ###  ++,  ##`   ##'  `++  ##;   ++:  ####+        
    ,.   +++  ##+   ##:  ++,  ##`   ###  `++  ###  .++  '#;           
    ,####++'  +##++ ###+#+++` ##`   :####+++  `####++'  ;####++`      
    `####+;    ##++  ###+,++` ##`    ;###:++   `###+;   `###++++      
                                                        ##   `++      
                                                       .##   ;++      
                                                        #####++`      
                                                         `;;;.        
    
    ************************************************************
    Stardog server 2.2.2 started on Thu Nov 06 16:41:23 GMT 2014.
    
    Stardog server is listening on all network interfaces.
    SNARL server available at snarl://localhost:5820.
    HTTP server available at http://localhost:5820.
    
    STARDOG_HOME=/Users/michele.pasin/Data/Stardog 
    
    LOG_FILE=/Users/michele.pasin/Data/Stardog/stardog.log
    
    
    [michele.pasin]@l5611:/Applications/stardog-2.2.2>stardog-admin db create -n myDB examples/data/University0_0.owl
    Bulk loading data to new database.
    Parsing triples: 100% complete in 00:00:00 (8.6K triples - 13.2K triples/sec)
    Parsing triples finished in 00:00:00.646
    Creating index: 100% complete in 00:00:00 (93.0K triples/sec)
    Creating index finished in 00:00:00.092
    Computing statistics: 100% complete in 00:00:00 (60.9K triples/sec)
    Computing statistics finished in 00:00:00.140
    Loading complete.
    Inserted 8,521 unique triples from 8,555 read triples in 00:00:01.050 at 8.1K triples/sec
    Bulk load complete.  Loaded 8,521 triples from 1 file(s) in 00:00:01 @ 8.4K triples/sec.
    
    Successfully created database 'myDB'.
    
    [michele.pasin]@Tartaruga:/Applications/stardog-2.2.2>stardog query myDB "SELECT DISTINCT ?s WHERE { ?s ?p ?o } LIMIT 10"
    +--------------------------------------------------------+
    |                           s                            |
    +--------------------------------------------------------+
    | tag:stardog:api:                                       |
    | http://www.University0.edu                             |
    | http://www.Department0.University0.edu                 |
    | http://www.Department0.University0.edu/FullProfessor0  |
    | http://www.Department0.University0.edu/Course0         |
    | http://www.Department0.University0.edu/GraduateCourse0 |
    | http://www.Department0.University0.edu/GraduateCourse1 |
    | http://www.University84.edu                            |
    | http://www.University875.edu                           |
    | http://www.University241.edu                           |
    +--------------------------------------------------------+
    
    Query returned 10 results in 00:00:00.061
    

    In the snippet above, I’ve just loaded the test dataset that comes with Stardog into the myDB database, then queried it using the stardog query command.

There’s a fancy user interface too, which can be accessed by going to http://localhost:5820 (note: by default you can log in with username/password = admin/admin).
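
If you would rather query Stardog from Python than from the command line, a quick sketch using SPARQLWrapper against Stardog’s HTTP interface works too (the /myDB/query endpoint path and the default admin credentials are assumptions based on the setup above; check the Stardog docs for your version):

    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("http://localhost:5820/myDB/query")
    sparql.setCredentials("admin", "admin")
    sparql.setQuery("SELECT DISTINCT ?s WHERE { ?s ?p ?o } LIMIT 10")
    sparql.setReturnFormat(JSON)

    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["s"]["value"])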


    4. Loading a big dataset

As in my previous post, I’ve tried loading the NPG Articles dataset available at nature.com’s legacy linked data site data.nature.com. The dataset contains around 40M triples describing (at the metadata level) everything that has been published by NPG and Scientific American from 1845 to the present day. The file size is ~6 GB, so it’s not a huge dataset – still, big enough to pose a challenge to my MacBook Pro (8 GB RAM).

    First off, I tried loading the dataset via the command line by passing an extra argument when creating a new database:

    [michele.pasin]@Tartaruga:~/Downloads/NPGcitationsGraph/articles.2012-07-16>stardog-admin db create -n npgArticles articles.nq 
    Bulk loading data to new database.
    Parsing triples: 100% complete in 00:01:48 (10.1M triples - 93.3K triples/sec)
    Parsing triples finished in 00:01:48.678
    Creating index: 100% complete in 00:00:19 (525.1K triples/sec)
    Creating index finished in 00:00:19.311
    Computing statistics: 100% complete in 00:00:05 (1748.1K triples/sec)
    Computing statistics finished in 00:00:05.782
    Loading complete.
    Inserted 10,107,653 unique triples from 10,140,000 read triples in 00:02:16.178 at 74.5K triples/sec
    Bulk load complete.  Loaded 10,107,653 triples from 1 file(s) in 00:02:16 @ 74.3K triples/sec.
    
    Errors were encountered during loading:
    File: /Users/michele.pasin/Downloads/NPGcitationsGraph/articles.2012-07-16/articles.nq Message: '2000-13-01' is not a valid value for datatype http://www.w3.org/2001/XMLSchema#date [line 10144786]
    Successfully created database 'npgArticles'.
    
    

As you can see, that didn’t work as expected: only 10M out of the 40M triples were loaded, because of a parsing error Stardog encountered.

After some googling and pinging the mailing list, I discovered that Stardog is actually right: the parsing error derives from the fact that valid values for XMLSchema#date are ISO 8601 dates. My data contained the date 2000-13-01, which is wrong – it should have been 2000-01-13 instead.

    What’s interesting is that I’ve previously managed to load the same dataset with other triple stores without any problems. How was that possible?

    The online documentation provides the answer:

RDF parsing in Stardog is strict: it requires typed RDF literals to match their explicit datatypes, URIs to be well-formed, etc. In some cases, strict parsing isn’t ideal, so it may be disabled using --strict-parsing=FALSE.

    Also, from the mailing list:

    By default, if you say “1.5”^^xsd:int or “twelve point four”^^xsd:float, Stardog is going to complain.  While it’s perfectly legal to have that in RDF, you can run into trouble later on, particularly when doing query evaluation with filters that would handle those literal values where you will hit the dark corners of the SPARQL spec.

    So, the way to load a (partially or potentially broken) dataset without having to worry about it too much is to use the strict.parsing=false flag:

    >stardog-admin db create -o strict.parsing=false -n articlesNPG2 articles.nq
    Bulk loading data to new database.
    Parsing triples: 100% complete in 00:05:55 (39.4M triples - 110.7K triples/sec)
    Parsing triples finished in 00:05:55.643
    Creating index: 100% complete in 00:01:17 (510.7K triples/sec)
    Creating index finished in 00:01:17.122
    Computing statistics: 100% complete in 00:00:21 (1789.2K triples/sec)
    Computing statistics finished in 00:00:21.944
    Loading complete.
    Inserted 39,262,620 unique triples from 39,384,548 read triples in 00:07:51.402 at 83.5K triples/sec
    Bulk load complete.  Loaded 39,262,620 triples from 1 file(s) in 00:07:51 @ 83.3K triples/sec.
    
    Successfully created database 'articlesNPG2'.
    

    Job done in around 7 minutes!

     

    Conclusion:

Extremely easy to install, efficient and packed with advanced features (inferencing and data-checking being among the most useful ones imho). Also, as far as the UX and web interface go, I doubt you can get any better than this with triplestores.

It’s a commercial product, of course, so you would expect nothing less. However the community edition (which is free) allows for 10 databases & 25M triples per db – which may be just fine for many projects.

If we had more tools as accessible as this one, I do think RDF triplestores would have a much higher uptake by now!

     

    5. Useful resources

    > Documentation

  • http://docs.stardog.com/

> Mailing list

  • https://groups.google.com/a/clarkparsia.com/forum/#!forum/stardog

> Python API

  • If you’re a pythonista, this small library can be useful: https://github.com/knorex/pystardog/wiki
