visualization – Parerga und Paralipomena
http://www.michelepasin.org/blog
"At the core of all well-founded belief lies belief that is unfounded" – Wittgenstein

More Jupyter notebooks: pyvis and networkx
http://www.michelepasin.org/blog/2020/08/06/more-jupyter-notebooks-pyvis-and-networkx/ – Thu, 06 Aug 2020

Lately I’ve been spending more time creating Jupyter notebooks that demonstrate how to use the Dimensions API for research analytics. In this post I’ll talk a little bit about two cool Python technologies I’ve discovered for working with graph data: pyvis and networkx.

pyvis and networkx

The networkx and pyvis libraries are used for generating and visualizing network data, respectively.

Pyvis is fundamentally a Python wrapper around the popular JavaScript vis.js library. Networkx, instead, is a pretty sophisticated package for the creation, manipulation, and study of the structure, dynamics, and functions of complex networks.

>>> from pyvis.network import Network
>>> import networkx as nx
# create an empty networkx graph instance
>>> nx_graph = nx.Graph()
# add some nodes and edges (a node must exist before we can set attributes on it)
>>> nx_graph.add_node(1)
>>> nx_graph.nodes[1]['title'] = 'Number 1'
>>> nx_graph.nodes[1]['group'] = 1
>>> nx_graph.add_node(3)
>>> nx_graph.nodes[3]['title'] = 'I belong to a different group!'
>>> nx_graph.nodes[3]['group'] = 10
>>> nx_graph.add_node(20, size=20, title='couple', group=2)
>>> nx_graph.add_node(21, size=15, title='couple', group=2)
>>> nx_graph.add_edge(20, 21, weight=5)
>>> nx_graph.add_node(25, size=25, label='lonely', title='lonely node', group=3)
# instantiate a pyvis network
>>> nt = Network("500px", "500px")
# populate the pyvis network from the networkx instance
>>> nt.from_nx(nx_graph)
>>> nt.show("nx.html")

It took me a little while to familiarise myself with the libraries’ concepts and to generate some basic graphs. So, the tutorials linked below are meant to provide some reusable code building blocks for working with these tools.

Once you get the hang of it though, the fun part begins. What are the best data variables to represent in the graph? Which color-coding strategy makes it easier to explore the data? How many nodes/edges should be displayed? Can we add some interactivity to the visualizations? Check out the resulting visualizations below for more ideas.

Dataviz: concepts co-occurrence network

The Building a concepts co-occurrence network notebook shows how to turn document keywords extracted from ‘semantic web’ publications into a simple topic map – by virtue of their co-occurrence within the same documents.
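
Just to make the recipe concrete, here is a minimal sketch (my own toy example, not the notebook’s actual code) of how keyword co-occurrence can be turned into a networkx graph; the documents variable below is a made-up list of keyword lists, one per publication.

from itertools import combinations
import networkx as nx

# hypothetical input: one list of extracted keywords per publication
documents = [
    ["semantic web", "ontology", "rdf"],
    ["ontology", "linked data"],
    ["rdf", "linked data", "sparql"],
]

G = nx.Graph()
for keywords in documents:
    # every pair of keywords appearing in the same document gets an edge
    for a, b in combinations(sorted(set(keywords)), 2):
        if G.has_edge(a, b):
            G[a][b]["weight"] += 1   # seen together again: strengthen the link
        else:
            G.add_edge(a, b, weight=1)

print(G.number_of_nodes(), G.number_of_edges())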

See also the standalone html version of the interactive visualization: concepts_network_2020-08-05.html


 

Dataviz: Organizations Collaboration Network

The Building an Organizations Collaboration Network Diagram notebook shows how to use publications’ authors and GRID data to generate a network of collaborating research organizations.

See also the standalone html version of the interactive visualization: network_2_levels_grid.412125.1.html


 

 

Exploring SciGraph data using JSON-LD, Elastic Search and Kibana
http://www.michelepasin.org/blog/2017/04/06/exploring-scigraph-data-using-elastic-search-and-kibana/ – Thu, 06 Apr 2017

Hello there data lovers! In this post you can find some information on how to download and make some sense of the scholarly dataset recently made available by the Springer Nature SciGraph project, by using the freely available Elasticsearch suite of software.

A few weeks ago the SciGraph dataset was released (full disclosure: I’m part of the team who did that!). This is a high quality dataset containing metadata and abstracts about scientific articles published by Springer Nature, research grants related to them plus other classifications of this content.


This release of the dataset includes the last 5 years of content – that’s already an impressive 32 gigs of data you can get your hands on. So in this post I’m going to show how to do that, in particular by transforming the data from the RDF graph format it comes in, into a JSON format which is more suited for application development and analytics.

We will be using two free-to-download products, GraphDB and Elasticsearch, so you’ll have to install them if you haven’t got them already. But no worries, that’s pretty straightforward, as you’ll see below.

1. Hello SciGraph Linked Data

First things first, we want to get hold of the SciGraph RDF datasets of course. That’s pretty easy, just head over to the SciGraph downloads page and get the following datasets:

  • Ontologies: the main schema behind SciGraph.
  • Articles – 2016: all the core articles metadata for one year.
  • Grants: grants metadata related to those articles.
  • Journals: full list of Springer Nature journal catalogue.
  • Subjects: classification of research areas developed by Springer Nature.

That’s pretty much everything; note that we’re getting only one year’s worth of articles, as that’s enough for the purpose of this exercise (~300k articles from 2016).

Next up, we want to get a couple of other datasets SciGraph depends on:

That’s it! Time for a cup of coffee.

2. Python to the rescue

We will be doing a bit of data manipulation  in the next sections and Python is a great language for that sort of thing. Here’s what we need to get going:

  1. Python. Make sure you have Python installed and also Pip, the Python package manager (any Python version above 2.7 should be ok).
  2. GitHub project. I’ve created a few scripts for this tutorial, so head over to the hello-scigraph project on GitHub and download it to your computer. Note: the project contains all the Python scripts needed to complete this tutorial, but of course you should feel free to modify them or write from scratch if you fancy it!
  3. Libraries. Install all the dependencies for the hello-scigraph project to run. You can do that by cd-ing into the project folder and running pip install -r requirements.txt (ideally within a virtual environment, but that’s up to you).

3. Loading the data into GraphDB

So, you should have by now 8 different files containing data (after step 1 above). Make sure they’re all in the same folder and that all of them have been unzipped (if needed), then head over to the GraphDB website and download the free version of the triplestore (you may have to sign up first).

The online documentation for GraphDB is pretty good, so it should be easy to get it up and running. In essence, you have to do the following steps:

  1. Launch the application: for me, on a mac, I just had to double click the GraphDB icon – nice!
  2. Create a new repository: this is the equivalent of a database within the triplestore. Call this repo “scigraph-2016” so that we’re all synced for the following steps.

Next thing, we want a script to load our RDF files into this empty repository. So cd into the directory containing the GitHub project (from step 2) and run the following command:

python -m hello-scigraph.loadGraphDB ~/scigraph-downloads/

The “loadGraphDB” script goes through all RDF files in the “scigraph-downloads” directory and loads them into the scigraph-2016 repository (note: you must replace “scigraph-downloads” with the actual path to the folder you downloaded content in step 1 above).
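
To give an idea of what a loader like this does under the hood, here is a rough sketch (an illustration of the approach, not the actual hello-scigraph code – the endpoint, port and content type are assumptions based on GraphDB’s RDF4J-style REST API, so double-check them against your installation):

import os
import requests

# assumption: GraphDB running locally with a repository called 'scigraph-2016'
STATEMENTS_ENDPOINT = "http://localhost:7200/repositories/scigraph-2016/statements"
DOWNLOADS_FOLDER = "scigraph-downloads"

for fname in sorted(os.listdir(DOWNLOADS_FOLDER)):
    if not fname.endswith(".nt"):   # assumption: the dumps are N-Triples files
        continue
    path = os.path.join(DOWNLOADS_FOLDER, fname)
    with open(path, "rb") as f:
        # POST the raw RDF to the repository; GraphDB parses and stores it
        resp = requests.post(
            STATEMENTS_ENDPOINT,
            data=f,
            headers={"Content-Type": "application/n-triples"},
        )
    resp.raise_for_status()
    print("loaded", fname)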

So, to recap: this script is now loading more than 35 million triples into your local graph database. Don’t be surprised if it takes some time (in particular the ‘articles-2016’ dataset, by far the biggest), so it’s a good time to take a break or do something else.

Once the process is finished, you should be able to explore your data via the GraphDB workbench. It’ll look something like this:

GraphDB-class-hierarchy

4. Creating an Elasticsearch index

We’re almost there. Let’s head over to the Elasticsearch website and download it. Elasticsearch is a powerful, distributed, JSON-based search and analytics engine so we’ll be using it to build an analytics dashboard for the SciGraph data.

Make sure Elastic is running (run bin/elasticsearch, or bin\elasticsearch.bat on Windows), then cd into the hello-scigraph Python project (from step 2) in order to run the following script:

python -m hello-scigraph.loadElastic

If you take a look at the source code, you’ll see that the script does the following:

  1. Articles loading: extracts articles references from GraphDB in batches of 200.
  2. Articles metadata extraction: for each article, we pull out all relevant metadata (e.g. title, DOI, authors) plus related information (e.g. author GRID organizations, geo locations, funding info etc..).
  3. Articles metadata simplification: some intermediate nodes coming from the original RDF graph are dropped and replaced with a flatter structure which uses a temporary dummy schema (prefix es: <http://elastic-index.scigraph.com/>). It doesn’t matter what we call that schema; what’s important is that we simplify the data we put into the Elasticsearch index. That’s because while the graph layer is supposed to facilitate data integration, and hence benefits from a rich semantic representation of information, the search layer is more geared towards performance and retrieval, hence a leaner information structure can dramatically speed things up there.
  4. JSON-LD transformation: the simplified RDF data structure is serialized as JSON-LD – one of the many serializations available for RDF. JSON-LD is of course valid JSON, meaning that we can put it into Elastic right away. This is a bit of a shortcut actually; for more fine-grained control of what the JSON looks like, it’s probably better to transform the data into JSON using some ad-hoc mechanism. But for the purpose of this tutorial it’s more than enough.
  5. Elastic index creation. Finally, we can load the data into an Elastic index called – guess what – “hello-scigraph”.
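
Here is a heavily condensed sketch of the overall shape of such a pipeline (the SPARQL query, the property names and the index structure are illustrative assumptions, not the real hello-scigraph code):

import time
import requests
from elasticsearch import Elasticsearch, helpers

SPARQL_ENDPOINT = "http://localhost:7200/repositories/scigraph-2016"
es = Elasticsearch("http://localhost:9200")

def get_articles_batch(offset, limit=200):
    """Pull a batch of article URIs and titles out of GraphDB via SPARQL."""
    query = """
        PREFIX sg: <http://scigraph.springernature.com/ontologies/core/>
        SELECT ?article ?title WHERE {
            ?article a sg:Article ; sg:title ?title .
        } LIMIT %d OFFSET %d""" % (limit, offset)
    resp = requests.get(
        SPARQL_ENDPOINT,
        params={"query": query},
        headers={"Accept": "application/sparql-results+json"},
        timeout=60,   # long queries get cut off rather than hanging forever
    )
    resp.raise_for_status()
    return resp.json()["results"]["bindings"]

offset = 0
while True:
    batch = get_articles_batch(offset)
    if not batch:
        break
    # flatten each SPARQL binding into a simple JSON document
    actions = [
        {"_index": "hello-scigraph",
         "_source": {"uri": row["article"]["value"], "title": row["title"]["value"]}}
        for row in batch
    ]
    helpers.bulk(es, actions)   # bulk-index the batch into Elasticsearch
    offset += 200
    time.sleep(10)              # give GraphDB a breather between batches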

Two more things to point out:

  • Long queries. The Python script enforces a 60-second time-out on the GraphDB queries, so in case things go wrong with some articles’ data the script should keep running.
  • Memory issues. The script stops for 10 seconds after each batch of 200 articles (time.sleep(10)). I had to do this to prevent GraphDB on my laptop from running out of memory. Time to catch some breath!

That’s it! Time for another break now. A pretty long one actually – loading all the data took around 10 hours on my (rather average-spec’ed) laptop, so you may want to do that overnight or get hold of a faster machine/server.

Eventually, once the loading script is finished, you can issue this command from the command line to see how much data you’ve loaded into the Elastic index  “hello-scigraph”. Bravo!

curl -XGET 'localhost:9200/_cat/indices/'

5. Analyzing the data with Kibana

Loading the data into Elastic already opens up a number of possibilities – check out the search APIs for some ideas – however there’s an even quicker way to analyze the data: Kibana. Kibana is another free product in the Elastic Search suite, which provides an extensible user interface for configuring and managing all aspects of the Elastic Stack.

So let’s get started with Kibana: download it and set it up using the online instructions, then point your browser at http://localhost:5601 .

You’ll get to the Kibana dashboard which shows the index we just created. Here you can perform any kind of searches and see the raw data as JSON.

What’s even more interesting is the visualization tab. Results of searches can be rendered as line charts, pie charts etc., and more dimensions can be added via ‘buckets’. See below for some quick examples, but really, the possibilities are endless!

Conclusion

This post should have given you enough to realise that:

  1. The SciGraph dataset contains an impressive amount of high-quality scholarly publications metadata which can be used for things like literature search, research statistics etc.
  2. Even if you’re not familiar with Linked Data and the RDF family of languages, it’s not hard to get going with a triplestore and then transform the data into a more widely used format like JSON.
  3. Finally, Elasticsearch and especially Kibana are fantastic tools for data analysis and exploration! Needless to say, in this post I’ve just scratched the surface of what can be done with them.

Hope this was fun, any questions or comments, you know the drill :-)

Nature.com Subjects Stream Graph
http://www.michelepasin.org/blog/2016/01/03/nature-com-subjects-stream-graph/ – Sun, 03 Jan 2016

The nature.com subjects stream graph displays the distribution of content across the subject areas covered by the nature.com portal.

This is an experimental interactive visualisation based on a freely available dataset from the nature.com linked data platform, which I’ve been working on in the last few months.

streamgraph

The main visualization provides an overview of selected content within the level 2 disciplines of the NPG Subjects Ontology. By clicking on these, it is then possible to explore more specific subdisciplines and their related articles.

For those of you who are not familiar with the Subjects Ontology: this is a categorization of scholarly subject areas which are used for the indexing of content on nature.com. It includes subject terms of varying levels of specificity such as Biological sciences (top level), Cancer (level 2), or B-2 cells (level 7). In total there are more than 2500 subject terms, organized into a polyhierarchical tree.

Starting in 2010, the various journals published on nature.com have adopted the subject ontology to tag their articles (note: different journals have started doing this at different times, hence some variations in the graph starting dates).

streamgraph2

streamgraph3

The visualization makes use of various d3.js modules, plus some simple customizations here and there. The hardest part of the work was putting the different page components together, so as to achieve a more fluent ‘narrative’ by gradually zooming into the data.

The back end is a Django web application with a relational database. The original dataset is published as RDF, so in order to use the Django APIs I’ve recreated it as a relational model. That also let me add a few extra data fields with precomputed values (e.g. article counts per month), so as to make the stream graph load faster.
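
As a rough illustration, a relational model of this kind could look something like the sketch below (model and field names are hypothetical – the post doesn’t show the actual schema):

from django.db import models

class Subject(models.Model):
    uri = models.URLField(unique=True)      # identifier carried over from the RDF dataset
    label = models.CharField(max_length=255)
    level = models.IntegerField()           # depth in the subject poly-hierarchy
    parents = models.ManyToManyField(
        "self", symmetrical=False, related_name="children", blank=True)

class MonthlyArticleCount(models.Model):
    """Precomputed counts, so the stream graph doesn't aggregate on the fly."""
    subject = models.ForeignKey(Subject, on_delete=models.CASCADE)
    month = models.DateField()
    count = models.IntegerField()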

Comments or suggestions, as always very welcome.

 

A sneak peek at Nature.com articles’ archive
http://www.michelepasin.org/blog/2015/06/08/a-sneak-peek-at-nature-com-articles-archive/ – Mon, 08 Jun 2015

We’re getting closer to releasing the full set of metadata covering over one million articles published by Nature Publishing Group since 1845. So here’s a sneak peek at this dataset, in the form of a simple d3.js visual summary of what soon will be available to download and reuse.

In the last months I’ve been working with my colleagues at Macmillan Science and Education on an open data portal that makes available to the public many of the taxonomies and ontologies we use internally for organising the content we publish.

This is part of our ongoing involvement with linked data and semantic technologies, aimed both at leveraging these tools to the end of transforming the publishing workflow into a more dynamic platform, and at contributing to the evolving web of open data with a rich dataset of scientific articles metadata.

The articles dataset includes metadata about all articles published by the Nature journal, of course. But not only: Scientific American, Nature Medicine, Nature Genetics and many other titles are also part of it (note: the full list can be downloaded as raw data here).


The first diagram shows how many articles have been published each year since 1845 (the start year of Scientific American). Nature began only a few years later, in 1869; the curve getting steeper in the ’90s corresponds to the exponential increase in publications due to the progressive specialisation of scientific journals (e.g. all the Nature-branded titles).

The second diagram instead shows the increase in publication volumes on a cumulative scale. We’ve now reached 1M articles and counting!


In order to create the charts I played around with a nifty example from Mike Bostock (http://bl.ocks.org/mbostock/3902569) and added a couple of extra things to it.

The full source code is on Github.

Finally, it’s worth mentioning that this metadata had already been made available a few years ago under the CC0 license: you can still access it here. This upcoming release though makes it available in the context of a much more precise and stable set of ontologies, meaning that the semantics of the dataset are more clearly laid out and consistent.

So stay tuned for more! ..and if you plan or would like to reuse these datasets please do get in touch, either here or by emailing developers@nature.com.

 

How to visualize a big taxonomy within a single webpage?
http://www.michelepasin.org/blog/2014/08/22/visualising-a-big-taxonomy-in-a-small-screen-in-pure-html/ – Fri, 22 Aug 2014

Here’s a couple more experiments aimed at representing visually a large taxonomy.

Some time ago I looked at ways to visualise a medium-large taxonomy (circa 3000 terms) using one of the many visualisation kits out there. It turned out that pretty much all of them can’t handle that many terms, but there are other strategies that do come in handy for that, e.g. hiding/revealing terms in the taxonomy based on what level you are looking at.

Why can’t I see the whole damn thing in one single page? Because there are too many things to display – you’d think.

So, step 1.

Here’s the entire set of elements on a page (well sort of).

layout1

Can’t we do better than that, though?

At the end of the day, if you assume a (quite modest these days) resolution of 800×600 pixels, you should be able to fit more than 300 9-point characters in there (assuming 9 points equal 12 pixels).

Step 2.

Here’s another way: a font-size: 7px; and IDs instead of taxon labels make the visualisation much more compact.

And it does fit in a single window – hurray!

layout2

One problem though. This is not very useful with all those meaningless numbers.

Step 3.

So I tried to reduce the size a bit more, so as to fit the entire taxon label in there.

Also, I added a bit of interactivity so as to reveal the hierarchy. The simple mechanism is this: when you click on an element of the taxonomy, all of its ancestors get highlighted too. Just to remind us that this is not a plain list of things, but a tree.

layout3

Kind of like this one :-)

Possible next steps:

a) adding arrows to make the hierarchical relationships more evident
b) some sort of summary below the subject term in focus
c) sorting the terms by hierarchy-level rather than alphabetical order (will it make the taxonomy more intelligible?)

..to be continued..

 

Messing around with D3.js and hierarchical data
http://www.michelepasin.org/blog/2013/06/21/messing-around-wih-d3-js-and-hierarchical-data/ – Fri, 21 Jun 2013

These days there are a lot of browser-oriented visualization toolkits, such as d3.js or jit.js. They’re great and easy to use, but how much do they scale when used with medium-large or very large datasets?

The subject ontology is a quite large (~2500 entities) taxonomical classification developed at Nature Publishing Group in order to classify scientific publications. The taxonomy is publicly available on data.nature.com, and is being encoded using the SKOS RDF vocabulary.

In order to evaluate the scalability of various javascript tree visualizations I extracted a JSON version of the subject taxonomy and tried to render it on a webpage, using some of the viz approaches made available out of the box; here are the results (ps: I added the option of selecting how many levels of the tree are visualized, just to get an idea of when a viz breaks).
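
For reference, here is a small sketch of the kind of extraction involved, assuming the taxonomy is available locally as a SKOS file and that rdflib is used to read it (both are my assumptions); note that a term with several broader terms simply gets duplicated under each parent:

import json
from rdflib import Graph, Namespace, RDF

SKOS = Namespace("http://www.w3.org/2004/02/skos/core#")

g = Graph()
g.parse("subjects.ttl", format="turtle")   # hypothetical local copy of the taxonomy

def children_of(parent):
    """Recursively collect narrower terms as {name, children} dicts (d3-style)."""
    kids = []
    for child in g.subjects(SKOS.broader, parent):
        kids.append({"name": str(g.value(child, SKOS.prefLabel)),
                     "children": children_of(child)})
    return kids

# roots = concepts that have no broader term at all
roots = [c for c in g.subjects(RDF.type, SKOS.Concept)
         if (c, SKOS.broader, None) not in g]

tree = {"name": "subjects",
        "children": [{"name": str(g.value(r, SKOS.prefLabel)),
                      "children": children_of(r)} for r in roots]}

with open("subjects.json", "w") as out:
    json.dump(tree, out, indent=2)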


Some conclusions:

  • The subject taxonomy is actually a poly-hierarchy (= one term can have more than one parent, so really it’s more like a directed graph). None of the libraries could handle that properly, but maybe that’s not really a limitation because they are meant to support the visualization of trees (maybe I should play around more with force-directed graph layouts and the like..)
  • The only viz that could handle all of the terms in the taxonomy is D3’s collapsible tree. Still, you don’t want to keep all the branches open at the same time! Click on the image to see it with your own eyes.
  • CollapsibleTree

  • An approach to dealing with large quantities of data is obviously to show them a little bit at a time. The Bar Hierarchy seems a pretty good way to do that: it’s informative and responsive. However it’d be nice to integrate it with other controls/visual cues that would tell you what level of depth you’re currently looking at, which siblings are available, etc.
  • BarHiearchy

  • Partition tables also look pretty good at providing a visual summary of the categories available; however they tend to fail quickly when there are too many nodes, and the text is often not readable at all.. in the example below I had to include only the first 3 levels of the taxonomy for it to load properly!
  • TreeMapD3

    TreeMap

  • Rotating tree. Essentially a tree plotted on a circle, very useful for providing a graphical overview of the data, but it tends to become unresponsive quickly.
  • RotatingTree

  • Hierarchical pie chart. A pie chart that allows zooming in so as to reveal hierarchical relationships (often also called a Zoomable Sunburst). Quite nice and responsive, also with a large amount of data. The absence of labels could be a limiting feature though; you get a nice overview of the datascape but can’t really understand the meaning of each element unless you mouse over it.
  • PieTree

     

    Other stuff out there that could do a better job?

     

    Infographics Course, Week 2
    http://www.michelepasin.org/blog/2012/11/14/infographics-course-week-2/ – Wed, 14 Nov 2012

    Here are the materials related to the second week of the Introduction to Infographics and Data Visualization online course. This week we talked about two topics: a) Visual Perception and Graphic Design Principles and b) Planning for Infographics and Visualizations. The exercise focused on an interactive visualisation available on the New York Times website.

    Key Concepts from Lesson 2

     

    
    A) Visual Perception and Graphic Design Principles
    -----------------------------------------
    
    - The principles of design are grounded on how perception works
    - Visual perception works differently from what we think 
    	- we don't have photographs in our heads!
    	- the eye passes on signals to the brain, which elaborates and makes assumptions about the perception
    		- eg think of the classic geometrical visual illusions 
    - Visual perception is an active process. The brain is not a passive receptor of information, but it completes, organizes and creates priorities (or hierarchies) and relationships to extract meaning
    - If we know that, the goal of the designer should be to arrange compositions anticipating what the user's brain will most likely do
    
    - The key to any visual design is the presentation of a cohesive, structured, readable and understandable composition.
    
    ==> check online examples by John Grimwade
    
    First goal when arranging elements on a page:
    	- think about a composition
    	- Structure / Order / Hierarchy / Harmony / Balance
    	
    Main principles of Graphic Design
    	Unity: presentation of a composition as an integrated whole
    	Variety: is the opposite of unity, but also its complement. With too much variety, a composition will look random; with too much unity, it will look boring
    	Hierarchy: the balance between unity and variety can lead to a good hierarchy
    		- answers the question: where should I start reading the infographics
    		
    Strategies for achieving Unity, Variety and Hierarchy
    	Grids, Color, and Type
    	
    Grids
    	- they can support unity thanks to a sense of alignment
    	- help keep the consistency
    	- key to using grids: think of the composition as a set of rectangles
    	- first step in building an infographics: divide up the space into functional rectangles
    		- headline / intro / map / chart / timeline etc...
    		- tip: things that are stacked on top of each other should have the same width
    		- tip: if objects are side by side, they should have the same height
    		
    Fonts, Colors
    	- different fonts can be used to achieve variety, and support the creation of a hierarchy
    	- same with colors: eg using a Copy Color (eg Black), a highlight color (eg yellow) and a series of Neutral Tones (grey shades)
    	
    	
    
    B) Planning for Infographics and Visualizations
    -----------------------------------------
    
    Creating a chart consists of making a data set adopt a visual shape
    
    ==> seminal paper by William Cleveland and Robert McGill on infovis
    	- scale of charts that allow more accurate/generic judgements
    		eg barcharts are accurate and facilitate comparison
    		eg color gradation graphs or bubbles are generic (=show big trends)
    	
    - correlation coefficient
    	- formula that represents the correlation between two series (as a value between -1 and 1)
    	- scatter plot: good at representing correlation between two variables
    	- slope graph: another way to represent correlation (although it's usually employed to represent change over time)
    
    
    - Types of charts
    	- line charts: display variation of one or more magnitudes over a time period by means of rising and falling lines
    	- Comparison Charts: visualization of amounts, each represented by a bar (or other objects)
    	- Distribution Chart: division of a whole into its components. It can be represented by a circle ('pie') or by other objects, such as a divided bar
    	- Correlation chart: shows the correlation between two (or more) variables. Also called scatter plots
    
    - Common components of a graphic
    	- headline: clearly stated what the graphic is about, or makes a point
    	- values
    	- axes
    	- sources/attribution info
    	- byline: who made the graphic
    	- legend
    	
    	
    Styling a graphic with colors and fonts makes the graphic more readable (= create the hierarchy)
    
    Things to avoid
    - Don't distort charts (eg 3d effects), especially with pie charts, as it makes it more difficult to compare areas
    - Avoid vertical labels
    - Avoid backgrounds that detract attention from the main graphics (eg photographs etc)
    - Avoid creating overloaded compositions
    
    The Design Process
    - Learn as much as you can about the topic
    - Identify goals and challenges
    - Prototype and sketch
    - Test and tweak
    - Turn the project in
    
    

     

    The exercise


    See the following graphic, by The New York Times, an interesting project that allows you to compare the words that were used in the National conventions. Imagine that you are hired by Steve Duenes, infographics director at the Times, to make a constructive critique of that piece. What would you say about it?

    Here are my comments about the NYT visualization:

    – The infographic allows you to compare the two parties’ usage of a certain word rather intuitively, so in that sense it is functional. The visualisation based on bubble areas is usually clear; in cases where the two areas are very similar you can still get the ratio right thanks to the percentage numbers displayed.

    – The fact that you can move the bubbles around is eye-catching and fun to use, but it fails to provide any added value to the tool (= no extra functionality other than maybe organising the words some other way). This is detrimental to the understanding of the structure and purpose of the visualisation: the bubbles’ locations seem to imply a semantic correlation of some sort, but unfortunately they don’t have any.

    – I’d expect to be able to filter the quotes by speaker, so as to compare the usage of a certain word only between two specified people (e.g. Obama and Romney). Unfortunately that is not possible. Also, it’d be nice to be able to order or re-organise the citations on a timeline, so as to explore potential patterns in the increase/decrease of use of a word. All of this could be easily achievable by introducing a ‘filter by’ panel right on top of the quotations columns. The types of filters could be others too, eg geographical ones (by state or region), or by the importance of the ‘roles’ of the speakers (mayors, governors).

    – The main issue from the interactivity point of view is that when you click on a bubble (assuming a user understands that’s what he/she has to do – btw no tooltips at all!) it’s not immediately obvious that the bottom part of the screen gets updated. I’d add some mechanism, such as a partial screen refresh, or a ‘loading’ icon, that would make this process more transparent. 

    – There is no way to remove a bubble once you’ve added it. So if you’re trying to compose your own ‘view’ of the tool by selecting only words you are interested in, once you get something wrong you’re stuck with it (you can only restart from scratch by reloading the page)

    – The 4 static captions at the bottom (AUTO, WOMEN etc..) are ok at the beginning of the visualisation, but once you start moving things around they don’t update at all which is not really the expected behaviour. 

    – If the full transcripts the quotations derive from are available online, it’d be nice to be able to link directly to them, e.g. by clicking on the quotations themselves. This would allow further investigation of the original context of use of a word.

    – Having small photos on the side of a speaker’s name would make it easier to identify these people; also, it shouldn’t be too difficult to include links to the person’s home pages or wikipedia entries

     

    Infographics Course, Week 1
    http://www.michelepasin.org/blog/2012/11/06/infographics-course-week-1/ – Tue, 06 Nov 2012

    This is a short summary of the activities in week 1 of the Introduction to Infographics and Data Visualization massive online course offered by the Knight Center for Journalism at Texas University. I’ll be posting the course materials and exercises here on the blog, so stay tuned if you want more.

    The course is hosted by Alberto Cairo, author of the book ‘The Functional Art’. It’s been only a week since we started, but I can definitely tell that the quality of both the teaching materials and the overall e-learning platform is very high. So I’d highly recommend it to anyone interested in deepening their knowledge of such topics. It’s too late to sign up for it now, but there will be another class running in early 2013, so keep an eye on their site if you don’t want to miss the next enrollment.

    Key Concepts from Lesson 1

     

    - Infographics is a piece of functional art (different from pure art)
    - the stuff in the world is shapeless and useless, it requires people to give it a form (to model it) according to some specific purpose
    - our world is not about ideology anymore, it's about complexity (Matt Taibbi in Griftopia)
    - the role of information designers is to model that raw material and make sense of it
    - infographics is not just about summarizing, organizing data, but it's also about letting reader explore those data
    - a graphic is a tool: it extends our skills and capacities
    - any good infographics is functional as a hammer - the design predetermines the function the tool should have.
    - any good infographics is multilayered as an onion - eg as in a summary of main points, plus more in depth examinations
    - any good infographics is beautiful and true as a mathematical equation
    
    - function doesn't dictate form, but it restricts the variety of forms that are acceptable to use for each set of data
    
    - classic distinction:
    	- infographics: presents information in a way that becomes meaningful; it's an edited story based on data
    	- information visualization: fine-tuned so to support exploration; doesn't tell a particular story, but it allows users to create their own story (it's unedited).
    - for Cairo, the difference is very fuzzy, often the two things are mixed together
    
    - Definition: a good infographics lets you answer questions more efficiently 
    
    - considering infographics only as art is wrong: infographics are tools. Often graphics with no structure or no context are presented as infographics, but they are not a visual representation of the data, just a simple page layout exercise with a bunch of unrelated numbers
    
    - numbers have little meaning if they can't be compared with other data (eg summaries) or if I can't relate them to my life (eg contextual information)
    
    - infographics should be constructed in many layers so that data could be cross-compared in many ways
    
    - choosing the correct 'visual metaphor' is essential for an infographics
    
    - if your goal is to let users compare numbers, it's better not to use bubble charts but bar charts! The human brain is not good at comparing sizes of bubbles. Bubbles are good for presenting overall patterns, but not good for precise comparisons.
    
    - the 'onion' approach allows you to represent data in different ways to facilitate different kinds of tasks
    
    - Infographics and visualization must be considered as visual tools for communication, understanding and analysis
    
    - Charles Joseph Minard 1869: considered by many (eg Tufte) the best visualization ever!
    
    

    The exercise


    See the following graphic (socialwebinvolvement.jpg) and try to answer the questions

    1) Is this infographic really “functional” in the sense of facilitating basic, predictable tasks (comparing, relating variables, etc.)?
    Not really. If we think of an infographic as a ‘technology’, this is certainly a very poor one. Apart from the fact that China and USA have the biggest audiences (in absolute terms), it’s extremely difficult (if not impossible) to use the graphic as a tool in any other way. Several variables are presented, but we can’t compare, organize or correlate the data because they’re all expressed in a way that doesn’t support those actions. Moreover the choice of using bubble charts is not appropriate, as they make it harder for people to make comparisons among surface sizes.

    2) If not, how could it be improved?
    Similarly to what is discussed in chapter two of the handouts (see the armed forces employees example), I’d do the following things:
    a) eliminate the bubble charts, and replace them with bar charts
    b) create bar charts for both absolute and relative values using a derived variable
    c) improve the rendering of the labels, since now they’re a bit too small and hard to read. This could be done for example by using a specialized type of bar chart where the ‘bar’ is divided into 5 continuous coloured segments corresponding to the types of social web involvement.
    d) keep the geographical map in order to provide some context; however, it could be shrunk, positioned at the bottom and used primarily to highlight which countries are being examined in the experiment (eg by having their areas in a different color)

    3) What kind of headlines, intro copy, and labels could it include to make it meaningful for a broad audience?
    I think that the correlation between the label colors and their meaning should be made more explicit.

    4) What other variables (if any) should be gathered/analyzed if we want to give an accurate portrait of Internet users across countries? Could we go beyond what is currently presented? Can we provide a better context for the data?
    It’d be nice to have a sense of the total number of users per country, versus people that admittedly don’t use the internet (or social media). Also, it’s not clear whether the 32000 users interviewed have been split proportionally to the total population of the countries taken into consideration, or not.
    Other variables that it’d be nice to investigate are
    – means of access to the internet: eg mobile phone, computer, tablet
    – age distribution
    – overall context of internet usage: eg leisure, work, education

    Other Approaches

    Here’s the work of other students (not me) who tried (with impressive results) to redesign the graphic above:

    http://www.flickr.com/photos/89317425@N05/8150814858/
    http://dl.dropbox.com/u/43885573/Draft2.jpg
    http://www.flickr.com/photos/aaugur/8144159956/sizes/k/in/photostream/
    http://www.flickr.com/photos/rubenvalero/8139950164/
    http://n79.org/infographics/asg1/
    http://public.tableausoftware.com/views/SocialWebInvolvement_1/Dashboard1?:embed=y

     

    Navigating through the people of medieval Scotland… one step at a time
    http://www.michelepasin.org/blog/2012/09/10/navigating-through-the-people-of-medieval-scotland/ – Mon, 10 Sep 2012

    Navigating through the people of medieval Scotland… one step at a time! This is, in a nutshell, what users can do via the Dynamic Connections Cloud application, a prototype tool I’ve been working on recently, in the context of the People of Medieval Scotland project (PoMS), which was launched last week at the University of Glasgow.

    Traditionally, digital humanities projects that produce historical databases tend to present their data using a classic tabular format, which is roughly the equivalent of a bibliographic record (e.g. as used in library software), only for historical data (e.g. to present information about persons, documents, facts).

    This approach has the advantage of offering a wealth of information within a clean and well organised interface, thus simplifying the task of finding what we are looking for during a search. However, by combining all the data in a single view, this approach also hides some of the key dimensions used by historians in order to make sense of the materials at hand. For example, such dimensions could be deriving from a higher-level analysis that focuses on spatio-historical, genealogical or socio-political patterns.

    The limitations of the tabular format become even more evident when we consider that the PoMS database contains more than 80000 facts about 20000 people/institutions active in medieval Scotland. How were these people connected? Can we explore this network in a more interactive, game-like manner than the classic database-like structures? In other words, how can we help users see the ‘big picture’?

    PoMS Laboratories

    PoMS researchers have sifted through more than 8000 charters and have extracted a pretty amazing amount of information from them. Now that the database is online and can be searched via the usual mechanisms (keywords, facets) historians can investigate aspects of the making of Scotland in a small fraction of the time it would have taken them otherwise.
    However, almost paradoxically, by making such a large quantity of data available in structured format, new problems arise too. Information overload is one of them: how can this wealth of data be compared, correlated and organized into more meaningful units? How can we present the same data in a more piecemeal fashion, according to predefined pathways or views on the dataset that aim at making explicit some of the coherence principles of the historical discourse?

    In order to investigate these questions further, in the last months I developed the PoMS Labs, a section of the PoMS website that contains a number of prototypes that can be used to interact with PoMS data in innovative ways. In general, with these tools we aimed at addressing the needs of both non-expert users (e.g., learners) – who could simultaneously access the data and get a feeling for the meaningful relations among them – and experts (e.g., academic scholars) – who could be facilitated in the process of analysing data within predefined dimensions, so as to highlight patterns of interest that would otherwise be hard to spot.

    What follows contains more information about three of these prototype tools, which I think will give you a pretty good idea of what the concept of highlighting pathways in the data means (by clicking on launch you can try out the tools for yourself – which is probably the best way to discover what this is all about!).

    Note: currently the only platforms we tested the Labs on are desktop computers running the latest versions of Mozilla Firefox, Google Chrome or Apple Safari.

     

    1. Dynamic Connections Cloud (launch)

    This experimental app lets you browse incrementally the network of relationships linking persons/institutions to other persons/institutions.
    Since each of them is normally participating in more than one event (e.g., a transaction or a relationship factoid), we can attempt to reconstruct the network of interconnections by examining the appearance of individuals within the same event or situation.

    The software lets you choose an individual and start building a ‘chain of connections‘ departing from him/her/it. Each name in the resulting connections-cloud is rendered using a different font and color, depending on the sex and on the number of common factoids being shared with the previously selected items.
    At any time it is possible to go back to the main PoMS database pages in order to find out more about the individuals or factoids emerging from the connections-cloud exploration. Just click on the individual icons, or move the mouse over the links provided in order to discover more options.
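
    To illustrate the underlying idea with a toy example (hypothetical data structures, not the PoMS code): two people are connected whenever they appear in the same factoid, and the strength of the connection is the number of factoids they share.

    from collections import Counter
    from itertools import combinations

    # hypothetical factoid data: event id -> people appearing in it
    factoids = {
        "charter_1": ["Alwin, abbot of Holyrood", "Arnold, abbot of Kelso"],
        "charter_2": ["Arnold, abbot of Kelso", "William Fraser"],
    }

    shared = Counter()
    for participants in factoids.values():
        for a, b in combinations(sorted(set(participants)), 2):
            shared[(a, b)] += 1   # one more factoid shared by this pair

    def connections(person):
        """People co-appearing with `person`, strongest links first."""
        links = []
        for (a, b), n in shared.items():
            if person == a:
                links.append((b, n))
            elif person == b:
                links.append((a, n))
        return sorted(links, key=lambda pair: -pair[1])

    print(connections("Arnold, abbot of Kelso"))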

    The screenshot below illustrates the main functionalities of the software, and is based on a sample connection chain that starts from a rather unknown person (‘A. wife of Normam son of Malcolm‘) and arrives at a more famous institution (‘Arbroath Abbey‘).

    PomsLabs - ConnectionsCloud

    Note: You can see a live version of the connection chain displayed above by following this link.

     

    2. Relationships explorer (launch)

    The individuals and institutions in the PoMS database are often interconnected by participating in the same events (e.g. transactions or relationships). In particular, the database contains detailed information about the varying roles agents play within such events. Can we discover any interesting patterns by examining these roles? For example, do agents tend to always appear in the same role, or are there exceptions to this rule?

    This visualization tool allows you to compare the different roles played by two agents in the context of their common events. The software makes use of the D3 Sankey diagram plugin, kindly made available by Mike Bostock. In general, Sankey diagrams are designed to show flows through a network (and are sometimes called flow diagrams).
    In our case the network is always composed of three steps (person-role, event, person-role) and is relatively simple, so the Sankey diagram is mainly used to group nodes of the same type (e.g. roles) and provide an overview of relationships between persons and events (i.e. the ‘flow’).

    The screenshot below illustrates the main functionalities of the software; in particular, it represents all existing relationships between Edward I, king of England (d.1307) and William Fraser, bishop of St Andrews (d.1297) (obviously, based on the information PoMS makes available).

    PomsLabs - Relationships Explorer

    Note: you can play with a live version of the specific visualisation displayed above by following this link.

     

    3. Transactions and Witnesses (launch)

    In PoMS witnesses are very important, as they are the persons who have witnessed a charter and are given in the witness list. Charters usually describe some form of transaction, which is the most important type of event (‘factoid’) represented in the database. This interactive visualization lets you browse iteratively through transactions and the witnesses associated with them.

    Each graph starts from a transaction of choice (the ‘focus point’), and displays two levels of information: (1) all the witnesses of the transaction (normally persons or institutions), and (2) for each of these agents, all the other transactions they have witnessed.
    The new transactions emerging from this network can be selected and brought to the center of the visualization (which is recalculated), thus facilitating a process of interactive exploration of the interconnections and commonalities among PoMS’s recorded transactions.

    The visualization has been created thanks to the freely available JavaScript InfoVis Toolkit.

    The screenshot below illustrates the main functionalities of the software; the graph is centered around a transaction (‘Agreement between Alwin, abbot of Holyrood, and Arnold, abbot of Kelso, over the Crag of Duddingston in Edinburgh‘) that has five witnesses in total.

    PomsLabs - Witnesses Networks

    Note: click here to see a live version of this graph.

     

    Any feedback?

    Then please do get in touch, either through this blog or the official PoMS contact page! This is all very much a work in progress, so we’re eager to hear from you.

     

    Wittgenstein Tractatus and the JavaScript InfoVis Toolkit
    http://www.michelepasin.org/blog/2012/07/08/wittgenstein-and-the-javascript-infovis-toolkit/ – Sun, 08 Jul 2012

    What do the JavaScript InfoVis Toolkit and the Austrian philosopher Ludwig Wittgenstein have in common? Definitely not much, at first sight. But the moment you realise that Wittgenstein was so fascinated with logic that he wanted to organise his masterwork in the form of a tree structure, well, you may change your mind.

    The JavaScript InfoVis Toolkit includes a number of pretty cool libraries that work in the browser and can be customised to your own needs. Some of these visualisations are specifically designed for trees and graphs, so I always wondered what a dynamic tree-rendering of Wittgenstein’s Tractatus would look like.

    The Tractatus Logico-Philosophicus (Latin for “Logical-Philosophical Treatise”) is the only book-length philosophical work published by the Austrian philosopher Ludwig Wittgenstein in his lifetime. It was an ambitious project: to identify the relationship between language and reality and to define the limits of science. It is recognized as a significant philosophical work of the twentieth century.
    […] The Tractatus employs a notoriously austere and succinct literary style. The work contains almost no arguments as such, but rather declarative statements which are meant to be self-evident. The statements are hierarchically numbered, with seven basic propositions at the primary level (numbered 1–7), with each sub-level being a comment on or elaboration of the statement at the next higher level (e.g., 1, 1.1, 1.11, 1.12).

    The final result is available here (warning: it’s been tested only on Chrome and Firefox): http://hacks.michelepasin.org/witt/spacetree

    SpaceTree Tractatus app

    Some more details

    I’ve played around a little with one of the visualisation libraries the JavaScript InfoVis Toolkit makes available, the Radial Graph, with the purpose of transforming the Tractatus text into a more interactive platform. The Radial Graph is essentially a tree-rendering library built over a circular area (hence it is also called a space-tree).
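
    As a side note, the tree structure itself can be derived directly from the proposition numbers – here is a small sketch of the idea (toy data, not the app’s actual code): a proposition’s parent is its number minus the last digit (dropping any trailing dot).

    # toy sample of numbered propositions
    propositions = {
        "1": "The world is all that is the case.",
        "1.1": "The world is the totality of facts, not of things.",
        "1.11": "The world is determined by the facts...",
    }

    def parent_of(number):
        trimmed = number[:-1].rstrip(".")   # drop the last digit, then any trailing dot
        return trimmed or None              # the seven top-level propositions have no parent

    tree = {}
    for num in propositions:
        tree.setdefault(parent_of(num), []).append(num)

    print(tree)   # {None: ['1'], '1': ['1.1'], '1.1': ['1.11']}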

    I liked the idea of making the tree-like structure of the text explorable one step at a time, within a framework that suggests a predefined order of the text-units but also allows for lateral steps or quick jumps to other sections. However I’m still trying to figure out what the advantages of looking at the text this way can be, once you go past the initial excitement of playing with it as if it was some sort of toy!

    Some of the pros seem to be:

  • By zooming in and out of the tree, you can see immediately where one sentence is located and how it (structurally) relates to the other ones
  • The tree visualisation makes more transparent the importance of some sentences, and thus implicitly conveys some aspects of the argument Wittgenstein is making.
    On the other hand, here are some cons:

  • We lose the diachronic, linear sense of the text (assuming the Tractatus has one – which is something not all scholars would agree with)
  • The animations may become distracting..

    I wonder how all of this could be developed further and/or transformed into a useful tool.. if you have any comment or suggestion please do get in touch!
    I’m also planning to release the source code for the whole app as soon as I clean it up a little; for the moment, here is the javascript bit that renders the graph:

     
