database – Parerga und Paralipomena

Installing GraphDB (aka OWLIM) triplestore on mac os

mikele — Thu, 16 Oct 2014 19:05:38 +0000

GraphDB (formerly called OWLIM) is an RDF triplestore which is used – among others – by large organisations like the BBC or the British Museum. I’ve recently installed the LITE release of this graph database on my Mac, so what follows is a simple write-up of the steps that worked for me.

Haven’t played much with the database yet, but all in all, the installation was much simpler than expected (ps: this old recipe on google code was very helpful in steering me in the right direction with the whole Tomcat/Java setup).

1. Requirements

OSX: Mavericks 10.9.5
XCode: latest version available from Apple
HOMEBREW: ruby -e "$(curl -fsSkL raw.github.com/mxcl/homebrew/go)"
Tomcat7: brew install tomcat
JAVA: available from Apple

Finally – we obviously want to get a copy of OWLIM-Lite too: http://www.ontotext.com/owlim/downloads

2. Setting up

After you have downloaded and unpacked the archive, you must simply copy these two files:

owlim-lite/sesame_owlim/openrdf-sesame.war
owlim-lite/sesame_owlim/openrdf-workbench.war

..to the Tomcat webapps folder:

/usr/local/Cellar/tomcat/7.0.39/libexec/webapps/

Essentially that’s because OWLIM-Lite is packaged as a storage and inference layer for the Sesame RDF framework, which runs here as a component within the Tomcat server (note: there are other ways to run OWLIM, but this one seemed the quickest).

3. Starting Tomcat

First I created a symbolic link in my ~/Library folder, so as to better manage new versions (as suggested here).

sudo ln -s /usr/local/Cellar/tomcat/7.0.39 ~/Library/Tomcat

Then in order to start/stop Tomcat it’s enough to use the catalina command:

[michele.pasin]@here:~/Library/Tomcat/bin>./catalina start
Using CATALINA_BASE:   /usr/local/Cellar/tomcat/7.0.39/libexec
Using CATALINA_HOME:   /usr/local/Cellar/tomcat/7.0.39/libexec
Using CATALINA_TMPDIR: /usr/local/Cellar/tomcat/7.0.39/libexec/temp
Using JRE_HOME:        /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
Using CLASSPATH:       /usr/local/Cellar/tomcat/7.0.39/libexec/bin/bootstrap.jar:/usr/local/Cellar/tomcat/7.0.39/libexec/bin/tomcat-juli.jar

[michele.pasin]@here:~/Library/Tomcat/bin>./catalina stop
Using CATALINA_BASE:   /usr/local/Cellar/tomcat/7.0.39/libexec
Using CATALINA_HOME:   /usr/local/Cellar/tomcat/7.0.39/libexec
Using CATALINA_TMPDIR: /usr/local/Cellar/tomcat/7.0.39/libexec/temp
Using JRE_HOME:        /System/Library/Java/JavaVirtualMachines/1.6.0.jdk/Contents/Home
Using CLASSPATH:       /usr/local/Cellar/tomcat/7.0.39/libexec/bin/bootstrap.jar:/usr/local/Cellar/tomcat/7.0.39/libexec/bin/tomcat-juli.jar

Tip: Tomcat runs by default on port 8080. That can be changed pretty easily by modifying a parameter in server.xml, in {Tomcat installation folder}/libexec/conf/ (more details here).
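To make the tip a bit more concrete: in a stock Tomcat 7 server.xml the HTTP port is the port attribute of a Connector element. Here is a minimal Python sketch of changing it. The sample XML is a trimmed-down illustration (not the full file), and set_http_port is a hypothetical helper name, not part of Tomcat.

```python
import xml.etree.ElementTree as ET

# Trimmed-down, illustrative server.xml; a real file has many more elements.
SAMPLE_SERVER_XML = """<Server port="8005" shutdown="SHUTDOWN">
  <Service name="Catalina">
    <Connector port="8080" protocol="HTTP/1.1" connectionTimeout="20000" redirectPort="8443" />
    <Connector port="8009" protocol="AJP/1.3" redirectPort="8443" />
  </Service>
</Server>"""


def set_http_port(server_xml: str, new_port: int) -> str:
    """Return server_xml with the HTTP connector's port changed (hypothetical helper)."""
    root = ET.fromstring(server_xml)
    for connector in root.iter("Connector"):
        # Only touch the HTTP connector, not the AJP one.
        if connector.get("protocol", "").startswith("HTTP"):
            connector.set("port", str(new_port))
    return ET.tostring(root, encoding="unicode")


updated = set_http_port(SAMPLE_SERVER_XML, 9090)
```

Note that ElementTree drops XML comments when parsing, so for a real server.xml (which is heavily commented) editing the attribute by hand is probably the simpler route.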

4. Testing the Graph database

Start a browser and go to the Workbench Web application using a URL of this form: http://localhost:8080/openrdf-workbench/ (substituting localhost and the 8080 port number as appropriate). You should see something like this:

After selecting a server, click ‘New repository’.

Select ‘OWLIM-Lite’ from the drop-down and enter the repository ID and description. Then click ‘next’.

Fill in the fields as required and click ‘create’.

That’s it! A message should be displayed that gives details of the new repository and this should also appear in the repository list (click ‘repositories’ to see this).

5. Loading a big dataset

I’ve set out to load the NPG Articles dataset available at nature.com’s legacy linked data site data.nature.com.

The dataset contains around 40M triples describing (at the metadata level) all that’s been published by NPG and Scientific American from 1845 to the present. The file size is ~6 gigs so it’s not a huge dataset. Still, it’s big enough to pose a challenge to my MacBook Pro (8 gigs of RAM).

First, I increased the memory allocated to the Tomcat application to 5G. It was enough to create a setenv.sh file in the ${tomcat-folder}/bin/ folder. The file contains this line:

CATALINA_OPTS="$CATALINA_OPTS -server -Xms5g -Xmx5g"

More details on Tomcat’s and Java memory issues are available here.

Then I used OWLIM’s web interface to create a new graph repository and upload the dataset file into it (I had previously downloaded a copy of the dataset to my computer so as to work with local files only).

It took around 10 minutes for the application to upload the file into the triplestore, and 2-3 minutes for OWLIM to process it. Much, much faster than I expected. The only minor issue was the lack of notifications (in the UI) of what was going on. Not a big deal in my case, but with larger dataset uploads it might be a potential downer.

Note: I used the web form to upload the dataset, but there are also ways to do that from the command line (which will probably result in even faster uploads).
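One scriptable route is the Sesame HTTP protocol, where a POST to a repository's /statements resource adds the posted RDF to it. The following is a rough Python sketch only: the repository ID ("test1"), the host and port, and the RDF/XML content type are assumptions for illustration, and build_upload_request is a hypothetical helper. The request is built but not sent.

```python
import urllib.request

# Assumed base URL of the Sesame server webapp deployed in Tomcat.
SESAME_BASE = "http://localhost:8080/openrdf-sesame"


def build_upload_request(rdf_bytes, repo_id="test1",
                         content_type="application/rdf+xml"):
    """Build (but do not send) a POST that adds RDF data to a repository."""
    url = f"{SESAME_BASE}/repositories/{repo_id}/statements"
    return urllib.request.Request(
        url,
        data=rdf_bytes,
        headers={"Content-Type": content_type},
        method="POST",
    )


# A tiny, empty RDF/XML document stands in for the real dataset here.
sample = b'<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"/>'
req = build_upload_request(sample)
# To actually send it: urllib.request.urlopen(req)
```

For a multi-gigabyte file you would not want to read everything into memory as done here; streaming the file, or using a dedicated bulk-loading tool, would be the more sensible choice.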

6. Useful information

> Sparql endpoints

All of your repositories also come with a handy SPARQL endpoint, which is available at this url: http://localhost:8080/openrdf-sesame/repositories/test1 (just change the last bit so that it matches your repository name).
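As a quick illustration of how such an endpoint can be queried, here is a minimal Python sketch that follows the standard SPARQL protocol (the query goes in a query URL parameter, and the result format is requested via the Accept header). The repository name test1 and the helper name build_sparql_request are placeholders; the request is built but not sent.

```python
import urllib.parse
import urllib.request

# Endpoint URL from the text; change "test1" to your repository name.
ENDPOINT = "http://localhost:8080/openrdf-sesame/repositories/test1"


def build_sparql_request(query, endpoint=ENDPOINT):
    """Build (but do not send) a SPARQL protocol GET request."""
    params = urllib.parse.urlencode({"query": query})
    return urllib.request.Request(
        f"{endpoint}?{params}",
        headers={"Accept": "application/sparql-results+json"},
    )


QUERY = "SELECT * WHERE { ?s ?p ?o } LIMIT 10"
req = build_sparql_request(QUERY)
# To actually run it: json.load(urllib.request.urlopen(req))
```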

> Official documentation

https://confluence.ontotext.com/display/GraphDB6

> Ontotext’s Q&A forum

http://answers.ontotext.com

An introduction to Neo4j

mikele — Wed, 10 Apr 2013 11:04:48 +0000

Neo4j is a recent graph-database that is rapidly accumulating success stories, especially in areas such as “social applications, recommendation engines, fraud detection, resource authorization, network & data center management and much more”. Here’s an interesting introductory lecture about it by Ian Robinson at JavaZone 2013.

Tip: Databasetube offers various other interesting articles about neo4j

A few notes from the presentation:

Premises: 
	- Data today is more connected than ever before
	- Complexity = f(size, semi-structure, connectedness)
	- Graphs are the best abstractions we have to model connectedness

The data model in neo4j: "property graph model"
	- nodes have properties (eg key-value pairs)
	- relationships have a direction, and can have properties too (eg weighted associations)
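
To give a feel for the property graph model, here is a toy Python sketch. It only illustrates the data model described in these notes and is not Neo4j's API; all class, function and property names are made up for the example.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    node_id: int
    properties: dict = field(default_factory=dict)  # key-value pairs
    outgoing: list = field(default_factory=list)    # relationships starting here


@dataclass(eq=False)
class Relationship:
    rel_type: str
    start: Node   # relationships are directed: start -> end
    end: Node
    properties: dict = field(default_factory=dict)  # e.g. a weight


def relate(start, rel_type, end, **properties):
    """Create a directed relationship and record it on its start node."""
    rel = Relationship(rel_type, start, end, properties)
    start.outgoing.append(rel)
    return rel


alice = Node(1, {"name": "Alice"})
bob = Node(2, {"name": "Bob"})
knows = relate(alice, "KNOWS", bob, weight=0.8)

# Following a relationship is a direct reference lookup, with no join involved.
friends = [rel.end.properties["name"] for rel in alice.outgoing]
```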

Neo4j server has a built in UI (web-based)

When to consider using a graph database:
	- lots of join tables [connectedness]
	- lots of sparse tables [semi-structure]

Neo4j fully supports ACID transactions
	- durable, consistent data
	- uses a try/success syntax

Performance
	- millions of 'joins' per second [connections are pre-calculated at insert time!]
	- consistent query times as dataset grows

Cypher query language
	- syntax mirrors the graphic representation of a graph 
	- one dimensional, left-to-right

For a comparison of various graph databases (including Neo4j) check out this tutorial from the ESWC’13 conference

Navigating through the people of medieval Scotland… one step at a time

mikele — Mon, 10 Sep 2012 18:48:47 +0000

Navigating through the people of medieval Scotland… one step at a time! This is, in a nutshell, what users can do via the Dynamic Connections Cloud application, a prototype tool I’ve been working on recently, in the context of the People of Medieval Scotland project (PoMS), which was launched last week at the University of Glasgow.

Traditionally, digital humanities projects that produce historical databases tend to present their data using a classic tabular format, which is roughly the equivalent of a bibliographic record (e.g. as used in library software) only for historical data (e.g. so as to present information about persons, documents, facts).

This approach has the advantage of offering a wealth of information within a clean and well organised interface, thus simplifying the task of finding what we are looking for during a search. However, by combining all the data in a single view, this approach also hides some of the key dimensions used by historians in order to make sense of the materials at hand. For example, such dimensions could derive from a higher-level analysis that focuses on spatio-historical, genealogical or socio-political patterns.

The limitations of the tabular format become even more evident when we consider that the PoMS database contains more than 80000 facts about 20000 people/institutions active in medieval Scotland. How were these people connected? Can we explore this network in a more interactive, game-like manner than the classic database-like structures? In other words, how can we help users see the ‘big picture’?

PoMS Laboratories

PoMS researchers have sifted through more than 8000 charters and have extracted a pretty amazing amount of information from them. Now that the database is online and can be searched via the usual mechanisms (keywords, facets) historians can investigate aspects of the making of Scotland in a small fraction of the time it would have taken them otherwise.
However, almost paradoxically, by making available such a large quantity of data in structured format new problems are arising too. Information overload is one of them: how can this wealth of data be compared, correlated and organized into more meaningful units? How can we present the same data in a more piecemeal fashion, according to predefined pathways or views on the dataset that aim at making explicit some of the coherence principles of the historical discourse?

To investigate these questions further, over the last few months I developed the PoMS Labs, a section of the PoMS website that contains a number of prototypes that can be used to interact with PoMS data in innovative ways. In general, with these tools we aimed to address the needs of both non-expert users (e.g., learners) – who could simultaneously access the data and get a feeling for the meaningful relations among them – and experts (e.g., academic scholars) – who could be helped in analysing data within predefined dimensions, so as to highlight patterns of interest that would otherwise be hard to spot.

What follows contains more information about three of these prototype tools, which I think will give you a pretty good idea of what the concept of highlighting pathways in the data means (by clicking on launch you can try out the tools for yourself – which is probably the best way to discover what this is all about!).

Note: currently the only platforms we tested the Labs on are desktop computers running the latest versions of Mozilla Firefox, Google Chrome or Apple Safari.

1. Dynamic Connections Cloud (launch)

This experimental app lets you incrementally browse the network of relationships linking persons/institutions to other persons/institutions.
Since each of them normally participates in more than one event (e.g., a transaction or a relationship factoid), we can attempt to reconstruct the network of interconnections by examining the appearance of individuals within the same event or situation.

The software lets you choose an individual and start building a ‘chain of connections’ departing from him/her/it. Each name in the resulting connections-cloud is rendered using a different font and color, depending on the sex and on the number of common factoids being shared with the previously selected items.
At any time it is possible to go back to the main PoMS database pages in order to find out more about the individuals or factoids emerging from the connections-cloud exploration. Just click on the individual icons, or move the mouse over the links provided in order to discover more options.
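
The core idea (connecting individuals through the factoids they share) can be sketched in a few lines of Python. This is a simplified illustration on made-up toy data, not the actual PoMS code: it handles a single selected individual rather than a whole chain, and the agent and factoid identifiers are hypothetical.

```python
from collections import Counter

# Hypothetical toy data: factoid ID -> agents (persons/institutions) appearing in it.
FACTOIDS = {
    "f1": {"A", "B", "C"},
    "f2": {"A", "B"},
    "f3": {"B", "D"},
}


def connections(selected, factoids):
    """Count how many factoids each other agent shares with `selected`."""
    shared = Counter()
    for agents in factoids.values():
        if selected in agents:
            for other in agents - {selected}:
                shared[other] += 1
    return shared


cloud = connections("A", FACTOIDS)
# The counts could then drive the font size of each name in the cloud.
```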

The screenshot below illustrates the main functionalities of the software, and is based on a sample connection chain that starts from a rather obscure person (‘A. wife of Normam son of Malcolm’) and arrives at a more famous institution (‘Arbroath Abbey’).

Note: You can see a live version of the connection chain displayed above by following this link.

2. Relationships explorer (launch)

The individuals and institutions in the PoMS database are often interconnected by participating in the same events (e.g. transactions or relationships). In particular, the database contains detailed information about the varying roles agents play within such events. Can we discover any interesting pattern by examining these roles? For example, do agents tend to appear always in the same role, or are there exceptions to this rule?

This visualization tool allows you to compare the different roles played by two agents in the context of their common events. The software makes use of the D3 Sankey diagrams plugin, kindly made available by Mike Bostock. In general, Sankey diagrams are designed to show flows through a network (and are sometimes called flow diagrams).
In our case the network is always composed of three steps (person-role, event, person-role) and is relatively simple, so the Sankey diagram is mainly used in order to group nodes of the same type (e.g. roles) and provide an overview of relationships between persons and events (i.e. the ‘flow’).
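
As a rough idea of the data preparation involved, the D3 Sankey plugin consumes a list of nodes plus a list of links that refer to nodes by index and carry a value. Below is a minimal Python sketch that turns three-step records into that shape. The records, role names and agent labels are invented for illustration, and to_sankey is a hypothetical helper rather than the code used in the tool.

```python
import json

# Hypothetical records: (role of agent A, event, role of agent B).
SHARED_EVENTS = [
    ("Grantor", "Transaction 1", "Witness"),
    ("Grantor", "Transaction 2", "Witness"),
    ("Addressor", "Transaction 3", "Beneficiary"),
]


def to_sankey(records, agent_a, agent_b):
    """Build a {nodes, links} structure from person-role / event / person-role records."""
    nodes, index, link_values = [], {}, {}

    def node_id(name):
        # Reuse the node if we have seen this label before.
        if name not in index:
            index[name] = len(nodes)
            nodes.append({"name": name})
        return index[name]

    def add_link(source, target):
        key = (node_id(source), node_id(target))
        link_values[key] = link_values.get(key, 0) + 1

    for role_a, event, role_b in records:
        add_link(f"{agent_a}: {role_a}", event)
        add_link(event, f"{agent_b}: {role_b}")

    links = [{"source": s, "target": t, "value": v}
             for (s, t), v in link_values.items()]
    return {"nodes": nodes, "links": links}


graph = to_sankey(SHARED_EVENTS, "Agent A", "Agent B")
payload = json.dumps(graph)  # what would be handed to the browser
```

Because role nodes are reused across records, an agent who appears repeatedly in the same role ends up as a single node with several links, which is what makes the grouping visible in the diagram.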

The screenshot below illustrates the main functionalities of the software; in particular, it represents all existing relationships between Edward I, king of England (d.1307) and William Fraser, bishop of St Andrews (d.1297) (obviously, based on the information PoMS makes available).

Note: you can play with a live version of the specific visualisation displayed above by following this link.

3. Transactions and Witnesses (launch)

In PoMS witnesses are very important as they are the persons who have witnessed a charter and are given in the witness list. Charters usually describe some form of transaction, which is the most important type of event (‘factoid’) represented in the database. This interactive visualization lets you iteratively browse transactions and the witnesses associated with them.

Each graph starts from a transaction of choice (the ‘focus point’), and displays two levels of information: (1) all the witnesses of the transaction (normally persons or institutions), and (2) for each of these agents, all the other transactions they have witnessed.
The new transactions emerging from this network can be selected and brought to the center of the visualization (which is recalculated), thus facilitating a process of interactive exploration of the interconnections and commonalities among PoMS’s recorded transactions.
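
The two-level structure described above can be sketched in Python as a small nested tree. The data and identifiers are made up, and the name/children layout is just an assumption for illustration, not necessarily the exact format the visualization library expects.

```python
# Hypothetical toy data: transaction -> list of its witnesses.
WITNESSES = {
    "T1": ["W1", "W2"],
    "T2": ["W1"],
    "T3": ["W2", "W3"],
}


def witness_graph(focus, witnesses):
    """Level 1: witnesses of `focus`; level 2: the other transactions each one witnessed."""
    graph = {"name": focus, "children": []}
    for person in witnesses[focus]:
        others = [t for t, people in witnesses.items()
                  if person in people and t != focus]
        graph["children"].append({
            "name": person,
            "children": [{"name": t, "children": []} for t in others],
        })
    return graph


graph = witness_graph("T1", WITNESSES)
# Selecting a second-level transaction simply recalculates the graph around it.
recentered = witness_graph("T3", WITNESSES)
```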

The visualization has been created thanks to the freely available JavaScript InfoVis Toolkit.

The screenshot below illustrates the main functionalities of the software; the graph is centered around a transaction (‘Agreement between Alwin, abbot of Holyrood, and Arnold, abbot of Kelso, over the Crag of Duddingston in Edinburgh’) that has five witnesses in total.

Note: click here to see a live version of this graph.

Any feedback?

Then please do get in touch, either through this blog or the official PoMS contact page! This is all very much a work in progress, so we’re eager to hear from you.

Towards a conceptual model for the domain of sculpture

mikele — Sat, 19 Nov 2011 14:44:09 +0000

For the next two years I’ll be collaborating with the Art of Making project. The project investigates the processes involved in the carving of stone during the Roman period; in particular, it aims to analyse them using the insights and understanding Peter Rockwell (son of Norman Rockwell) developed during his lifelong experience as a sculptor. Eventually we will present these results by means of a freely accessible online digital resource that guides users through examples of stone carving. In this post I just wanted to report on the very first discussions I had with the sculpture and art scholars I’m working with, with the aim of creating a shared model for this domain.

The project started this July; it is based at King’s College London and is funded by the Leverhulme Trust. I’m more involved with the digital aspects of the project, and as usual one of the first steps involved in the building of a digital resource (in particular, a database-backed digital resource) is the construction of a conceptual model that can represent the main types of things being dealt with.

In other words, it is fundamental to identify the things our database and web-application should ‘talk about’; later on, this model can be refined and extended so as to become an abstract template of the data-manipulation tasks the software application must be capable of supporting (e.g. entering data into the system, searching and visualising them).

Here’s a nice example of the sculptures (a sarcophagus from Aphrodisias) that constitute our ‘source’ materials:

What are the key entities in the sculpture domain?

To this purpose, a few weeks ago we had a very productive brainstorming session aimed at fleshing out the main items of interest in the world of sculpture. This is a very first step towards the construction of a formal model for this domain; nonetheless, I think that we have already managed to pin down the key elements we’re going to be dealing with in the next two years.

Here’s a list of the main objects we identified:

– People, such as craftsmen, etc.
– Sculptures (of various kinds)
– Materials
– Tools
– Generic processes that are part of a sculpting project, such as quarrying and transport.
– More specific methods being used within a particular process, e.g. carving styles, or approaches to quarrying.
– Traditions, conceptualisations of the ‘way of doing things’ that, in turn, can inspire the way methods and processes are carried out nowadays.

We encoded the results of our discussions in a mind map for better readability, and also in order to use a technology that would make it easier to share our findings later on. I added it below (in case the interactive image doesn’t work, you can find it here too).

Fleshing out the model a bit more

After a few weeks of work we did another iteration of the conceptual map above. The good news was that it soon became evident to us that we got it quite right on the first round; that is, we didn’t really feel like adding or removing anything from the map.

On the other hand, we thought we should try to add some relations (= links, arcs) among the concepts (= bubbles) previously identified, so as to characterize their semantics a bit more. I had a go at adding some relations first, and here’s the result:

I should specify that I have no knowledge whatsoever of the domain of sculpture, so the stuff I added to the map came out entirely from the (little) research I’ve been doing on the subject (on and off) during the last weeks.

At the same time, Will and Ben (the art historians I’m collaborating with) also worked independently on the task of fleshing out the mind map with more relations. Needless to say, what they came up with is way more dense and intricate than what I could have ever imagined! This is probably not surprising, as one would expect to see a significant difference between a non-expert’s representation of a subject domain and another one which is instead created by experts. Still, it was interesting to see it happening with my own eyes!
Here it is:

The next step will be trying to reduce the (natural) complexity of the portion of the world we are representing to a more manageable size… then formalize it, and start building our database based on that. Stay tuned for more!

DbVisualizer: The Universal Database Tool

mikele — Tue, 10 Mar 2009 15:42:51 +0000

I’ve been searching for something similar for a long time. Just hook up a database, and et voilà, you can see it, modify it and export it with a clean and fast interface. Among the many (really a lot) features you have:

Database Browser: Tree based navigation through all database objects. Browse object details and invoke management features
Database Object Management: Visual support to create, alter and modify characteristics for database objects such as tables. Edit and compile support for procedures, functions, packages and triggers. Extensive database specific support.
Table Data Management: Support for editing table data including binary/BLOB and CLOB data types, import from file
SQL Tools: SQL editor with support for auto completion, parameterized SQLs, SQL formatter, visual query builder, explain plan, export large result sets
Database Server Management: DBA features for managing database instance, storage and security parameters in the database server
Tools: Export objects in a database/schema as DDL and table data
Comprehensive Database & OS Support: Oracle, Sybase, SQL Server, PostgreSQL, DB2, Mimer, Neoview, MySQL, Informix, JavaDB/Derby, Windows, Mac OS X, Linux/UNIX

However, to me the most important thing is that the visualization algorithm works really well! You can also choose how to lay out the tables in your DB (hierarchic, organic, orthogonal, circular).
Oh yes and there’s a *free* version of it!!! (Installing DbVisualizer and running it out of the box automatically launches the DbVisualizer Free edition, not as many features as the full version, but still very useful!)