Leipzig Semantics 2016 conference
http://www.michelepasin.org/blog/2016/10/25/leipzig-semantics-2016-conference/ Tue, 25 Oct 2016 16:01:51 +0000

A few weeks ago I attended the SEMANTiCS conference in Leipzig, so here's a short report about the event.

SEMANTiCS 2016 (#semanticsconf) continues a long tradition of bringing together colleagues from around the world to present best practices, papers and posters, and to discuss semantic systems in panels, birds-of-a-feather sessions and informal settings.

What I really liked about this event is that it is primarily industry-focused, meaning that most (if not all) of the talks dealt with pragmatic aspects of real-world applications of semantic technologies. You can take a look at the online proceedings for more details; alternatively, there are some nice videos and pictures pages too.

I meant to share some notes a few weeks ago already but never got round to doing it… so here are a few highlights:

  • Springer Nature's SciGraph project got quite a bit of publicity, as I was one of the invited keynote speakers. Overall, the feedback was extremely positive and it seems that many people are waiting to see more from us in the coming months. We also chatted to representatives from other publishers (Elsevier, Wolters Kluwer, Oxford University Press) about areas where we could collaborate more, e.g. constructing shared datasets (such as conference identifiers, coordinated by CrossRef the same way they do it for Funders).

  • Cathy Dolbear from Oxford University Press gave an interesting keynote describing the work they've been doing with Linked Data, mostly focusing on the Oxford Global Languages project, which links lexical information from multiple global and digitally under-represented languages in a semantic graph. She also talked about creating rich schema.org snippets so as to better interface with Google's knowledge graph and thus increase their ranking in search results (see the JSON-LD sketch after this list). That was really good to hear, as we're investing in this area too!

  • David Kuilman from Elsevier talked about their approach to content management based on semantic technologies. David's team has been focusing on tracking document production metadata mainly before publication (e.g. submission and production workflow metadata), which is quite interesting because it's the exact opposite of what we've been doing at Springer Nature.

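As an aside, the kind of schema.org markup Cathy described is easy to sketch. Below is a minimal JSON-LD snippet for a scholarly article, generated with plain Python; all the values are hypothetical placeholders, not OUP's actual markup.

```python
import json

# A minimal schema.org/ScholarlyArticle description, serialized as JSON-LD.
# Every value here is a hypothetical placeholder, not OUP's real markup.
snippet = {
    "@context": "http://schema.org",
    "@type": "ScholarlyArticle",
    "name": "An Example Article on Under-Represented Languages",
    "author": {"@type": "Person", "name": "Jane Doe"},
    "publisher": {"@type": "Organization", "name": "Oxford University Press"},
    "datePublished": "2016-10-25",
}

# Embedded in a page within <script type="application/ld+json">, this is the
# kind of structured snippet that search engine crawlers can pick up.
print(json.dumps(snippet, indent=2))
```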
Open Data Summit 2016
http://www.michelepasin.org/blog/2016/10/21/open-data-summit-2016/ Fri, 21 Oct 2016 16:21:17 +0000

On November 1st we were invited to present the SciGraph project at the London ODI Summit, the annual event organized by the Open Data Institute to review and discuss the social and economic impact of open data in both the public and commercial sectors.


If data infrastructure is as important to our society as roads, then the Open Data Institute is helping to lay the concrete. Join us on 1 November to hear inspiring stories from around the world on how people are innovating with the web of data, with presentations from diverse innovators – from startups to high-profile speakers such as Sir Tim Berners-Lee (creator of the World Wide Web), Sir Nigel Shadbolt (AI expert) and Martha Lane Fox (Lastminute.com founder).

Our presentation was part of a session titled How to design for open government and enterprise, which included two speakers from industry (me and Tharindi Hapuarachchi from Thomson Reuters Labs) and two from the public sector (Clare Moriarty from the Department for Environment, Food and Rural Affairs and Jamie Whyte from Trafford Council).


Feedback was very positive; in particular, the audience seemed to appreciate Springer Nature's long-standing commitment to making science more open.


    Other bits and pieces:

  • this year's Open Data Awards include various interesting projects and are worth taking a look at;
  • Tim Berners-Lee hinted at the potential of recent technical advances like blockchain technology and the Solid project;
  • The ODINE (Open Data Incubator Europe) session was very interesting: among other things, I learnt that there's a search engine for the Internet of Things too!

Notes from the Force11 annual conference
http://www.michelepasin.org/blog/2015/01/17/notes-from-the-force11-annual-conference/ Sat, 17 Jan 2015 18:04:41 +0000

I attended the Force11 conference (https://www.force11.org/) in Oxford over the last couple of days (the conference was previously called 'Beyond the PDF').

    Force11 is a community of scholars, librarians, archivists, publishers and research funders that has arisen organically to help facilitate the change toward improved knowledge creation and sharing. Individually and collectively, we aim to bring about a change in modern scholarly communications through the effective use of information technology. [About Force 11]

More than the presentations, I would say that the most valuable aspect of this event is the many conversations you can have with people from different backgrounds: techies, publishers, policy makers, academics, etc.

    Nonetheless, here’s a (very short and biased) list of things that seemed to stand out.

  • A talk titled Who's Sharing with Who? Acknowledgements-driven identification of resources by David Eichmann, University of Iowa. He is working on a (seemingly very effective) method for extracting contributor roles from scientific articles:
  • This presentation describes my recent work in semantic analysis of the acknowledgement section of biomedical research articles, specifically the sharing of resources (instruments, reagents, model organisms, etc.) between the author articles and other non-author investigators. The resulting semantic graph complements the knowledge currently captured by research profiling systems, which primarily focus on investigators, publications and grants. My approach results in much finer-grained information, at the individual author contribution level, and the specific resources shared by external parties. The long-term goal for this work is unification with the VIVO-ISF-based CTSAsearch federated search engine, which currently contains research profiles from 60 institutions worldwide.

     

  • A talk titled Why are we so attached to attachments? Let's ditch them and improve publishing by Kaveh Bazargan, head of River Valley Technologies. He demoed a prototype manuscript tracking system that allows editors, authors and reviewers to create new versions of the same document via an online Google-Docs-like system backed by JATS XML (see the sketch after this list):
  • I argue that it is precisely the ubiquitous use of attachments that has held up progress in publishing. We have the technology right now to allow the author to write online and have the file saved automatically as XML. All subsequent work on the “manuscript” (e.g. copy editing, QC, etc) can also be done online. At the end of the process the XML is automatically “rendered” to PDF, Epub, etc, and delivered to the end user, on demand. This system is quicker as there are no emails or attachments to hold it up, cheaper as there is no admin involved, and more accurate as there is only one definitive file (the XML) which is the “format of record”.

     

  • Rebecca Lawrence from F1000 presented and gave me a walkthrough of a new suite of tools they're working on. That was quite impressive, I must say, especially due to the variety of features they offer: tools to organize and store references, annotate and discuss articles and web pages, import them into Word documents, etc. All packed within a nice-looking and user-friendly application. This is due to go into public beta some time in March, but you can try to get access to it sooner by signing up here.

  • The best poster award went to 101 Innovations in Scholarly Communication – the Changing Research Workflow. This is a project aiming to chart innovation in scholarly information and communication flows. Very inspiring and definitely worth a look.

  • Finally, I'm proud to say that the best demo award went to my own resquotes.com, a personal quotations-manager online tool which I launched just a couple of weeks ago. Needless to say, it was great to get a vote of confidence from this community!

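To make Bazargan's 'XML as the format of record' idea a bit more concrete, here is a rough sketch, using only Python's standard library, of the kind of minimal JATS-like document such a system would keep as its single definitive file. The element subset is illustrative, not his actual schema.

```python
import xml.etree.ElementTree as ET

# Build a minimal JATS-like article: one definitive XML file from which
# PDF, EPUB, etc. would be rendered on demand. Element subset illustrative.
article = ET.Element("article")
front = ET.SubElement(article, "front")
meta = ET.SubElement(front, "article-meta")
title_group = ET.SubElement(meta, "title-group")
ET.SubElement(title_group, "article-title").text = "Why ditch attachments?"

body = ET.SubElement(article, "body")
sec = ET.SubElement(body, "sec")
ET.SubElement(sec, "title").text = "Introduction"
ET.SubElement(sec, "p").text = ("All editing happens online; every save "
                                "simply produces a new version of this XML.")

print(ET.tostring(article, encoding="unicode"))
```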
    If you want more, it’s worth taking a look directly at the conference agenda and in particular the demo/poster session agenda. And hopefully see you next year in Portland, Oregon :-)

     

ISWC14 paper: a hybrid semantic publishing architecture combining XML and RDF
http://www.michelepasin.org/blog/2014/11/25/iswc14-paper-a-hybrid-semantic-publishing-architecture-combining-xml-and-rdf/ Tue, 25 Nov 2014 08:55:21 +0000

I'm posting here a short summary of the paper I gave at the latest International Semantic Web Conference in Riva del Garda (ISWC14), together with my colleague Tony Hammond.

The presentation focused on a hybrid data architecture (XML for storage and querying, RDF for modeling and integration) which emerged as the most practical solution during the re-engineering of the publishing platform that has taken place within our company (Macmillan S&E) over the last few years.

    This is the abstract:

This paper presents recent work carried out at Macmillan Science and Education in evolving a traditional XML-based, document-centric enterprise publishing platform into a scalable, thing-centric and RDF-based semantic architecture. Performance and robustness guarantees required by our online products on the one hand, and the need to support legacy architectures on the other, led us to develop a hybrid infrastructure in which the data is modelled throughout in RDF but is replicated and distributed between RDF and XML data stores for efficient retrieval. A recently launched product – dynamic pages for scientific subject terms – is briefly introduced as a result of this semantic publishing architecture.

    The paper is available online; slides from the presentation can be found below.

    The ISWC industry track was packed with interesting papers so I think it’s worth taking a look at the online proceedings. The uptake of tech outside academia is always revealing of the many real-world difficulties involved in making something fit within pre-existing work practices and legacy technologies. This is especially true of larger companies, where investment in older technologies (and in people who know about them) can be considerable, hence upgrades are costly and need to be evaluated more carefully.

This is the sort of background that led me and my colleagues at Macmillan to opt for a hybrid solution that combines the power of an established enterprise MarkLogic installation with more cutting-edge data integration approaches based on RDF.

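To give a flavour of what 'modelled in RDF, replicated as XML' means in practice, here is a toy sketch (not our production code: the namespace and element names are invented) that models a record once with rdflib and then derives a simple XML rendition that a document store could index for retrieval.

```python
import xml.etree.ElementTree as ET
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

# Hypothetical vocabulary; not the actual nature.com ontologies.
NS = Namespace("http://example.org/ontology/")
article = URIRef("http://example.org/articles/123")

# 1. Model the thing once, in RDF (the modelling/integration layer).
g = Graph()
g.add((article, RDF.type, NS.Article))
g.add((article, NS.title, Literal("A hybrid publishing architecture")))
g.add((article, NS.subject, URIRef("http://example.org/subjects/semantic-web")))

# 2. Replicate it as XML (the storage/querying layer).
doc = ET.Element("article", id=str(article))
for p, o in g.predicate_objects(article):
    if p.startswith(NS):  # map only our own vocabulary into XML elements
        ET.SubElement(doc, p.split("/")[-1]).text = str(o)

print(ET.tostring(doc, encoding="unicode"))
```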

Nature.com subject pages were one of the first products built on top of this architecture, and many more will come: we're still heavily involved in this work, so stay tuned for more in this space.

    Soon, we will also be releasing our public ontologies online and making available a new and improved version of the nature.com datasets.

     

ESWC 2013 – report from the conference
http://www.michelepasin.org/blog/2013/06/05/eswc-2013-report-from-the-conference/ Wed, 05 Jun 2013 17:06:43 +0000

Last week I attended the European Semantic Web Conference (ESWC'13) in Montpellier and had a really good time meeting old friends and catching up with the latest research in this area. In this post I'll collect a few pointers to papers and ideas that caught my attention.

    For a high level summary of the talks, you can check out the pdf program, the workshops page or the tutorials page.

    In particular the semantic publishing workshop SEPublica13 was very relevant for my current work, as its stated purpose is to discuss and review “accessing and reusing the computable data that the literature represents and describes” – something that all digital publishers are thinking about these days.

    As for the rest of the conference, here’s a more lengthy summary of (some of) the presentations I managed to attend, organised by topic.

    Keynote: less semantics and more web

The keynote from MIT's David Karger was quite remarkable. In a talk titled "The Semantic Web for End Users" he challenged several widespread assumptions about the SW (maybe most intriguingly the 'if it's using RDF/OWL then it's SW' principle). Karger argued for a less AI-oriented, more user-centric and web-centric view of semantic web research, according to which one of the key opportunities for SW practitioners is to "make it easier for end users to produce, share, and consume structured data", irrespective of whether these are encoded in any of the RDF family of languages. Rather, SW tools should be measured in terms of how much they allow people to deal effectively with 'applications whose schema is expected to change'.
    In general, the semantic web (like the web) should not be making ‘new things possible’ but rather ‘old things simpler’.

    Semantic Science

Gon, B., Porto, F., & Moura, A. M. C. On the semantic engineering of scientific hypotheses as linked data.

The paper addresses the engineering of hypotheses as linked data, building upon the Linked Science Core vocabulary and extending it in order to allow the definition of scientific hypotheses as assumptions that constrain the interpretation of observed phenomena for computer simulation. A prototype application, built by eliciting and linking hypotheses in published research in Computational Hemodynamics (the study of the human cardiovascular system), is discussed to illustrate the notion of 'conceptual traceability' of research statements.

Gil, Y., Ratnakar, V., & Hanson, P. C. Organic Data Publishing: A Novel Approach to Scientific Data Sharing.

The paper introduces an approach called 'organic data sharing' that 1) links dataset contributions directly to science questions, 2) reduces the burden of data sharing by enabling any scientist to contribute metadata, and 3) tracks and exposes credit for all contributors. An initial prototype is presented, built as an extension of a semantic wiki; it can import Linked Data and publish as Linked Data any new content created by users.

Zhao, J., & Klyne, G. (2013). How Reliable is Your Workflow: Monitoring Decay in Scholarly Publications.

The paper addresses the notion of workflow 'decay'. Increasingly, scientific workflows are being treated as first-class artifacts for exchanging and transferring actual scholarly findings, either as part of scholarly articles or as stand-alone objects. However, scientific workflows are commonly subject to a decayed or reduced ability to be executed or repeated, largely due to the volatility of the external resources required for their execution. Based on this hypothesis, the authors present a minimal set of information to be associated with a workflow so as to reduce its decay and allow it to be effectively exchanged as a reproducible research object.

Callahan, A., Cruz-Toledo, J., Ansell, P., & Dumontier, M. (2013). Bio2RDF Release 2: Improved coverage, interoperability and provenance of Life Science Linked Data.

    Bio2RDF is an open-source project that provides linked data for the life sciences using Semantic Web technologies. Bio2RDF scripts (available on github) convert heterogeneously formatted data (e.g. flat-files, tab-delimited files, dataset specific formats, SQL, XML etc.) into a common format, RDF. The paper describes the new features of the latest Bio2RDF release, which provides a federated network of SPARQL endpoints over 19 datasets. Other new features include provenance information via PROV, mapping of dataset-specific vocabulary to the Semanticscience Integrated Ontology (SIO), context-sensitive SPARQL query formulation using SparQLed and a central registry of datasets in order to normalize generated IRIs.
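As a sketch of what querying this federated network looks like from Python (via the SPARQLWrapper library; the endpoint URL and query shape are illustrative and may have changed since):

```python
from SPARQLWrapper import SPARQLWrapper, JSON  # pip install SPARQLWrapper

# Ask a Bio2RDF SPARQL endpoint for a handful of labelled resources.
# The endpoint URL and query are illustrative and may no longer be live.
sparql = SPARQLWrapper("http://bio2rdf.org/sparql")
sparql.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?resource ?label
    WHERE { ?resource rdfs:label ?label }
    LIMIT 5
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["resource"]["value"], "->", row["label"]["value"])
```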

    Semantic Publishing

    T. Kuhn, P. E. Barbano, M. L. Nagy, and M. Krauthammer, Broadening the Scope of Nanopublications.

    Traditionally, nanopublications are described as an approach to (1) subdivide scientific results into minimal pieces, (2) to represent these results — called assertions — in an RDF-based formal notation, (3) to attach RDF-based provenance information on this “atomic” level, and (4) to treat each of these tiny entities as a separate publication. The authors of this paper challenge assumption (2) as unrealistic, essentially due to the proven difficulties in acquiring structured, logic-based assertions from people, and propose a new system (nanobrowser) that allows authors and curators to attach English sentences to nanopublications, thus allowing for informal representations of scientific claims.
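The four-part anatomy of a nanopublication is easy to see in code. Here is a toy example assembled with rdflib named graphs; the nanopub schema namespace is the published one, but the URIs and the assertion itself are invented.

```python
from rdflib import Dataset, Namespace
from rdflib.namespace import RDF

NP = Namespace("http://www.nanopub.org/nschema#")  # published nanopub schema
PROV = Namespace("http://www.w3.org/ns/prov#")
EX = Namespace("http://example.org/np/")           # invented URIs

ds = Dataset()

# Head graph: declares the nanopublication and links its three parts.
head = ds.graph(EX.head)
head.add((EX.pub1, RDF.type, NP.Nanopublication))
head.add((EX.pub1, NP.hasAssertion, EX.assertion))
head.add((EX.pub1, NP.hasProvenance, EX.provenance))
head.add((EX.pub1, NP.hasPublicationInfo, EX.pubinfo))

# Assertion graph: the minimal scientific claim itself (invented).
ds.graph(EX.assertion).add((EX.geneX, EX.isAssociatedWith, EX.diseaseY))

# Provenance graph: where the assertion comes from.
ds.graph(EX.provenance).add((EX.assertion, PROV.wasDerivedFrom, EX.somePaper))

# Publication info graph: metadata about the nanopub as a publication.
ds.graph(EX.pubinfo).add((EX.pub1, PROV.wasAttributedTo, EX.someAuthor))

print(ds.serialize(format="trig"))
```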

Lord, P., & Marshall, L. (2013). Twenty-Five Shades of Greycite: Semantics for referencing and preservation.

The paper describes two new systems: Greycite and kblog-metadata. The former addresses the problem of bibliographic metadata without resorting to a single central authority, extracting the metadata directly from URI endpoints. The latter provides more specialised support for generating appropriate metadata within the popular WordPress blogging platform. The underlying rationale for both systems, the authors claim, is that semantic metadata must be of value to all participants in the publishing process, most importantly the authors.

Mavergames, C., Oliver, S., & Becker, L. (2013). Systematic Reviews as an interface to the web of (trial) data: Using PICO as an ontology for knowledge synthesis in evidence-based healthcare research. The Cochrane Collaboration.

    The paper describes a prototype application that makes use of linked data technologies to improve discovery of information stored in the Cochrane Database of Systematic Reviews, a resource in the domain of healthcare research (in particular the area of evidence-based medicine). The approach described relies on the PICO framework (Population, Intervention, Comparison, Outcome) as an ontology to aid in better discoverability, presentation, and synthesis of the knowledge available in the documents offered by the database. A prototype web application based on Drupal’s SW module is presented.

Wiljes, C., Jahn, N., Lier, F., Paul-Stueve, T., Vompras, J., Pietsch, C., & Cimiano, P. (2013). Towards Linked Research Data: An Institutional Approach.

The paper describes an infrastructure that enables researchers to manage their publications and the underlying research data in an easy and efficient way within an academic institution, Bielefeld University, and the associated Center of Excellence Cognitive Interaction Technology. The platform follows a Linked Data approach and uses Virtuoso to store data from sources inside the university as well as outside sources like DBpedia.

    NLP, knowledge extraction

Di Iorio, A., Nuzzolese, A. G., & Peroni, S. (2013). Towards the automatic identification of the nature of citations.

The paper presents an algorithm, called CiTalO, to automatically infer the function of citations by means of Semantic Web technologies and NLP techniques. CiTalO combines techniques of ontology learning from natural language, sentiment analysis, word-sense disambiguation, and ontology mapping. These techniques are applied in a pipeline whose input is the textual context containing the citation and whose output is one or more properties of the CiTO ontology.

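To illustrate just the input/output contract of such a pipeline (and nothing of CiTalO's actual internals, which rely on ontology learning and word-sense disambiguation rather than keyword matching), here is a deliberately naive sketch that maps a citation's textual context to CiTO properties:

```python
# Toy sketch of a citation-function classifier's contract: textual context
# in, CiTO property out. The keyword lists are invented for illustration.
CITO = "http://purl.org/spar/cito/"

RULES = [
    (("extends", "builds on"), CITO + "extends"),
    (("disagrees", "refutes", "contradicts"), CITO + "disputes"),
    (("uses the method", "following the approach"), CITO + "usesMethodIn"),
]

def classify_citation(context: str) -> str:
    text = context.lower()
    for keywords, cito_property in RULES:
        if any(k in text for k in keywords):
            return cito_property
    return CITO + "cites"  # generic fallback

print(classify_citation("Our work extends the approach of Smith et al."))
```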
García Castro, L. J., Berlanga, R., Rebholz-Schuhmann, D., & Garcia, A. (2013). Connections across scientific publications based on semantic annotations.

The paper presents an experiment aimed at evaluating different concept annotation solutions on full-text documents, to determine to what extent relatedness can be inferred from such annotations. Eleven full-text articles from the open-access subset of PubMed Central were extracted and annotated semantically using MeSH, UMLS, and other ontologies. The authors show that connections across articles derived from annotations automatically identified with entity recognition tools, e.g. Whatizit, NCBO Annotator, and CMA, are similar to the connections exhibited by the PubMed MeSH terms, thus validating their approach.

    A. Gangemi, A Comparison of Knowledge Extraction Tools for the Semantic Web.

This article reviews a number of Natural Language Processing tools (for various purposes, such as named-entity recognition or word-sense disambiguation) that have been configured for Semantic Web tasks including ontology learning, linked data population, entity resolution, NL querying of linked data and others. The tools have been compared using a sample taken from an online article of The New York Times, and the results are available online. The tools reviewed are: AIDA, AlchemyAPI, Apache Stanbol, DBpedia Spotlight, CiceroLite, FOX, FRED, NERD, Open Calais, PoolParty Knowledge Discoverer, ReVerb, Semiosearch Wikifier, Wikimeta, Zemanta.

E. Cabrio, S. Villata, and F. Gandon (INRIA Sophia Antipolis), A Support Framework for Argumentative Discussions Management in the Web.

The paper presents an NLP-based approach for automatically extracting argumentative relationships from highly active wiki pages. The overall purpose is to support community managers in managing discussions and getting an overall view of the ongoing debates, so as to detect the winning arguments. Argumentative discussions are formalized using an extension of the SIOC Argumentation vocabulary.

    O. Medelyan, S. Manion, J. Broekstra, A. Divoli, A. Huang, and I. H. Witten, Constructing a Focused Taxonomy from a Document Collection

The paper describes a new method for constructing custom taxonomies from document collections, called F-STEP. It involves identifying relevant concepts and entities in text; linking them to knowledge sources like Wikipedia, DBpedia, Freebase, and any supplied taxonomies from related domains; disambiguating conflicting concept mappings; and selecting semantic relations that best group them hierarchically. Using this approach the authors constructed a custom taxonomy with 10,000 concepts and 12,700 relations from 2,000 news articles. An evaluation with human judges showed high rates of precision (90%) and recall (75%).

    SW tech in real world systems

    P. Szekely, C. A. Knoblock, F. Yang, X. Zhu, E. E. Fink, R. Allen, and G. Goodlander, Connecting the Smithsonian American Art Museum to the Linked Data Cloud.

    This paper describes the process and lessons learned in publishing the data from the Smithsonian American Art Museum. The paper contains detailed descriptions of a) how relational data have been mapped to RDF (a system called Karma was used), b) how links to other linked data URIs have been created, and c) the process of curation to ensure that both the published information and its links to other sources within the LOD are accurate. The dataset uses an extended version of the Europeana Data Model, which is the metamodel used in the Europeana project to represent data from Europe’s cultural heritage institutions, plus other standards like PROV and Schema.org.

    L. M. Garshol and A. Borge, Hafslund Sesam – an archive on semantics.

    The paper describes an architecture based on RDF and Virtuoso, constructed to facilitate data integration and reuse within Hafslund, a Norwegian energy company. Documents are tagged with URIs from the triple store, and these URIs connect the document metadata with enterprise data extracted from backend systems. All source systems are integrated using a custom-built client-server solution based on SDShare – a specification for synchronizing RDF data using Atom feeds.
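A minimal consumer of the SDShare idea can be sketched in a few lines of Python: poll an Atom feed of changed resources, then fetch the RDF fragment behind each entry. The feed URL and entry layout below are invented; the real SDShare spec defines collection and fragment feeds precisely.

```python
import feedparser  # pip install feedparser

# Poll an (invented) SDShare-style fragments feed listing changed resources.
feed = feedparser.parse("http://example.org/sdshare/fragments")

for entry in feed.entries:
    # Each entry points at an RDF fragment describing one changed resource;
    # a consumer would GET entry.link and merge the triples into its store,
    # e.g. with rdflib's Graph().parse(entry.link).
    print(entry.updated, entry.link)
```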

    Random notes

  • SparQLed is an open source app that gives you an interactive SPARQL editor with context-aware recommendations (via autocompletion and other tricks). Definitely worth taking a look at.
  • I missed the excellent Semantic Data Management Techniques in Graph Databases tutorial, but luckily the slides are available online. If you're interested in graph databases, check them out: they include a detailed analysis and comparison of various graph databases including Neo4j, Hypergraph and many others.
  • David Karger pointed out a web app called If This Then That: rule-based reasoning on the web, without any fancy AI. Pretty cool!
  • identifiers.org is yet another service that aims at providing resolvable persistent URIs to identify data for the scientific community

Open Knowledge Festival – Helsinki, 18-22 September 2012
http://www.michelepasin.org/blog/2012/09/21/open-knowledge-festival-helsinki-18-22-september-2012/ Fri, 21 Sep 2012 09:12:40 +0000

I'm at the OKFestival this week. Tons of inspiring talks. I'm writing this post incrementally because I've got lots of notes floating around, so sorry for the mess!

I tried to build a panorama of the beautiful lecture hall (not that successfully, but it would give you an idea).

    Day 1, keynote by Martin Tisne

Martin Tisne is director of policy at Omidyar Network; he recently worked on the Transparency and Accountability Initiative, a collaborative of leading funders committed to strengthening democracy by empowering citizens to hold their governing institutions to account.

    Day 1, keynote by Farida Vis

    Farida Vis led the social media analysis on an academic team that examined 2.6 million riot tweets, analysing the role Twitter played in the 2011 UK riots, as part of The Guardian newspaper’s groundbreaking Reading the Riots project.

    Day 2, keynote by Philip Thigo

Philip Thigo is part of a dynamic team at the Social Development Network (SODNET) that works on developing mobile and web-based technologies aimed at strengthening the role of citizens and civil society in the strategic use of technology, especially in developing countries. Philip is a co-founder of INFONET, an initiative rooted in SODNET that is credited with empowering African civil society, governments and citizens to better engage in enforcing budget transparency, service delivery demands and election monitoring.

    Day 2, keynote by Ville Peltola

Ville Peltola looks into the horizon at IBM as Director of Innovation in the Chief Technology Officer's team at IBM Europe. During the past few years Peltola has been focusing on smart cities and emerging civic innovation with open public data. En passant, Peltola mentioned a nice Finnish experiment: Restaurant Day.

    Day 2, open democracy session, Yannick Assogba

Yannick Assogba talked about IBM Many Bills, a visualizer for US congressional bills. Other interesting links:
    http://www.research.ibm.com/social/
    http://researcher.watson.ibm.com/researcher/view_project.php?id=3419
    http://historio.researchlabs.ibm.com/histories/98

Day 2, open democracy session, Miriam Reitenbach and Ivonne Jansen-Dings

Miriam Reitenbach and Ivonne Jansen-Dings from Waag Society in Amsterdam. Waag Society, an institute for art, science and technology, develops creative technology for social innovation. The foundation researches, develops concepts, pilots and prototypes, and acts as an intermediary between the arts, science and the media. Waag Society cooperates with cultural, public and private parties.
    Keywords: “technology and citizen engagement”, ‘tapping into citizenship’

    Day 2, open democracy session, Tangui Morlier

“Let’s reverse Lessig’s metaphor and pretend that Law is Code! Do we make the law more understandable if we use developer’s tools? […] Our project, realized with people from Sciences Po university, aims to transform all these steps into open legislative data, in order to track the evolution of a law through a version-control system (such as Git) where each amendment will be an individual commit.”
    Keywords: “lawrence lessig: code is law”, “towards a ‘gitlaw’”

    Day 2, commons for Europe launch

    http://commonsforeurope.net/, based on http://codeforamerica.org/
    The actual site: check codeforeurope.net and the call for fellows

    Day 3, plenary with James Cameron

    James Cameron, very inspiring talk (video anywhere?)
    “build more cooperative enterprises”
    “human beings are not good at dealing with risks that come slowly from afar”
    “break things up into manageable bits”
    links:
    http://grist.org/climate-change/on-titanic-anniversary-james-cameron-says-climate-change-is-our-menacing-iceberg/
    http://thehill.com/blogs/e2-wire/e2-wire/115433-report-director-james-cameron-calls-climate-change-skeptics-swine
    http://www.theccc.org.uk/news/features/1107-profile-on-james-cameron-vice-chairman-of-climate-change-capital
    https://twitter.com/Jamesogradycam

    Day 3, plenary with Tiago Peixoto

    ‘mobilisation of citizens’
    ‘participatory budgeting projects’
    ‘reduce participation costs’
    – high participation costs => low participation
    http://www.worldbank.org/
    http://en.wikipedia.org/wiki/World_Bank
    http://theconnectedrepublic.org/users/Tiago%20Peixoto
    https://twitter.com/participatory

    http://www.allourideas.org/

    Day 3, Open democracy panel, Finnur Magnusson

    Finnur Magnusson was the CTO for two large scale crowdsourcing events in Iceland as well as the Icelandic Constitution Council.
    – http://www.gommit.com/
    – http://chamber.com/finnur-magnusson
    – http://sociable.co/social-media/how-iceland-crowdsourced-the-creation-of-its-new-constitution/

    “Using Twitter (@Stjornlagarad), Facebook, Flickr, and YouTube, the group asked for opinions and suggestions about what should be included in the document and how they would like their country run. In total, over 16,000 user-submitted comments and proposals were sent to the Council through their website. ”

In the Icelandic parliament, computers are banned.

    – lessons learned: open participation => ++ quality in democracy
    – wrote a new constitution in 4 months / no problems with online mobs etc.

    Day 3, Open democracy panel, Tanja Aitamurto

    Tanja Aitamurto
    http://cci.mit.edu/ MIT Center for Collective Intelligence
    Seeclickfix:
    http://seeclickfix.com/
    http://www.knightdigitalmediacenter.org/blogs/agahran/2012/09/seeclickfix-crowdsourced-local-problem-reporting-community-news

    Day 3, the European citizens initiative

    http://ec.europa.eu/citizens-initiative/public/welcome
    http://www.citizens-initiative.eu/
    http://en.wikipedia.org/wiki/European_Citizens’_Initiative

    Carsten Berg, General Coordinator of the ECI Campaign; Democracy International
    http://www.citizens-initiative.eu/?attachment_id=257

    Day 4, Open Fablab session, Tomas Diez

    Tomas Diez: http://fab.cba.mit.edu/classes/MIT/863.08/people/Tomas/

    Barcelona: http://www.smartcitizen.me/en/
    http://fablabbcn.org/

‘distributed personal manufacturing’

    http://en.wikipedia.org/wiki/The_Third_Industrial_Revolution
    http://www.thethirdindustrialrevolution.com/
    http://www.economist.com/node/21552901

    Day 4, Open Fablab session, Peter Troxler

    Peter Troxler
    http://petertroxler.org/
    http://opendesignnow.org/

Day 4, grand finale with Hans Rosling

    Gapminder.org
    Keywords:
    “People have a completely wrong idea about the world”
    “First we used the most stupid argument we had: rational argument”
    “Companies are more serious than the public sector: companies go down”
“The old west has a toxic combination of ignorance and arrogance about the world”
    “Don’t talk about what you want to do, just prototype fast”
    “Climate is too serious for letting environmental activists deal with it”
“don’t do only small apps for your garden or bicycle… It’s the big thing”

Rosling presented the television documentary The Joy of Stats, which was broadcast in the United Kingdom by BBC Four in December 2010.
    http://blogs.elpais.com/periodismo-con-futuro/2011/05/hansrosling.html

Conference: Computer Applications and Quantitative Methods in Archeology
http://www.michelepasin.org/blog/2012/03/28/conference-computer-applications-and-quantitative-methods-in-archeology/ Wed, 28 Mar 2012 13:59:02 +0000

Yesterday I went to the CAA 2012 conference in Southampton, one of the top conferences in the world in the field of computational archaeology. I couldn't stay for longer than a day, but I've seen enough to say that archaeologists definitely know their way around when it comes to combining IT with their discipline.

I presented a poster about the Art of Making project (which deals with categorising and making available online a collection of images of ancient Roman sculpture). In particular I was there for the Data Modelling and Sharing session: the formal ontology we're working on in the Art of Making (and the accompanying dataset) is likely going to become one of the first of its kind. So I was quite interested in finding out who's doing what when it comes to sharing data about the ancient world.

The answer is, there are a lot of people doing very interesting things (by the way, please get in touch if you know of other related datasets). Here's a brief report on some of the papers that struck me (for the full list of the talks I would have liked to attend, check out my interactive schedule).

  • A paper on the Pelagios project by Leif Isaksen. Pelagios is a consortium that brings together an impressive number of other datasets on the ancient world. I’d say each of them is worth taking a look at: Arachne; CLAROS; Fasti Online; Google Ancient Places; Nomisma; Open Context; Perseus; Pleiades; Ptolemy Machine, SPQR; Ure museum
  • A paper titled “When, What, Where, How and Who?” by Sarah May. She reported on a user study aimed at understanding how archaeologists search for information online, and whether a more integrated web of data would match their current information-seeking behaviours.
  • The paper “Exploring Semantic Web-based research questions for the spatio-temporal relationships at Çatalhöyük”, by Holly Wright. She presented an archaeological data-modeling scenario that calls for more powerful knowledge representation approaches to time and events. There are two broad approaches to this problem, she said: temporal reification (apparently mostly done using SWRL rules, e.g. here and here) and temporal fluents (some info here, and also in the context of the SOWL project) – a toy sketch contrasting the two follows this list. I don't know much about this topic, but this paper surely got me interested in it!
  • A paper presenting the SAWS project, which looks at defining and linking related units of text in original manuscripts using semantic web technologies.
  • The Archaeology Data Service, a York-based organisation that aims at 'preserving digital data in the long term, and by promoting and disseminating a broad range of data in archaeology'. In particular, one of their projects, STELLAR, has produced a number of software tools that facilitate the manipulation of archaeological data and their transformation into RDF-compliant formats.

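Here is the toy sketch promised above, contrasting the two approaches with rdflib. The vocabulary and dates are invented for illustration; real implementations would add OWL/SWRL machinery on top of such bare triples.

```python
from rdflib import BNode, Graph, Literal, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/catalhoyuk/")  # invented vocabulary
g = Graph()

# Approach 1 - temporal reification: the time-bound statement becomes a
# node in its own right, to which a time span can then be attached.
stmt = BNode()
g.add((stmt, RDF.type, RDF.Statement))
g.add((stmt, RDF.subject, EX.building1))
g.add((stmt, RDF.predicate, EX.usedAs))
g.add((stmt, RDF.object, EX.dwelling))
g.add((stmt, EX.during, Literal("7100-7000 BCE")))

# Approach 2 - temporal fluents (a 4D view): the relation holds between
# time-slices of the entities rather than the entities themselves.
slice1 = EX.building1_phase1
g.add((slice1, EX.timeSliceOf, EX.building1))
g.add((slice1, EX.usedAs, EX.dwelling))
g.add((slice1, EX.during, Literal("7100-7000 BCE")))

print(g.serialize(format="turtle"))
```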
Finally, the schedule for the whole conference is available online (powered by a slick new service, sched.org).

     

Event: THATcamp Kansas and Digital Humanities Forum
http://www.michelepasin.org/blog/2011/09/28/event-thatcamp-kansas-and-digital-humanities-forum/ Wed, 28 Sep 2011 16:56:55 +0000

The THATcamp Kansas and Digital Humanities Forum happened last week at the Institute for Digital Research in the Humanities, part of the University of Kansas in beautiful Lawrence. I had the opportunity to be there and give a talk about some recent work on digital prosopography and computer ontologies, so in this blog post I'm summing up the things that caught my attention at the conference.

The event happened on September 22-24 and consisted of three separate things:

  • Bootcamp Workshops: a set of in-depth workshops on digital tools and other DH topics http://kansas2011.thatcamp.org/bootcamps/.
  • THATCamp: an “unconference” for technologists and humanists http://kansas2011.thatcamp.org/.
  • Representing Knowledge in the DH conference: a one-day program of panels and poster sessions (schedule | abstracts )
The workshops and the THATCamp were both packed with interesting stuff, so I strongly suggest you take a look at the online documentation, which is very comprehensive. In what follows I'll instead highlight some of the contributed papers which a) I liked and b) I was able to attend (needless to say, this list reflects only my individual preferences and interests). Hope you'll find something of interest there too!

    A (quite subjective) list of interesting papers

     

  • The Graphic Visualization of XML Documents, by David Birnbaum (abstract): a quite inspiring example of how to employ visualizations to support philological research in the humanities. Mostly focused on Russian texts and XML-oriented technologies, but its principles are easily generalizable to other contexts and technologies.
  • Exploring Issues at the Intersection of Humanities and Computing with LADL, by Gregory Aist (abstract): the talk presented LADL, the Learning Activity Description Language, a fascinating software environment that provides a way to “describe both the information structure and the interaction structure of an interactive experience”, with the purpose of “constructing a single interactive Web page that allows for viewing and comparing of multiple source documents together with online tools”.
  • Making the most of free, unrestricted texts – a first look at the promise of the Text Creation Partnership, by Rebecca Welzenbach (abstract): an interesting report on the pros and cons of making available a large repository of SGML/XML encoded texts from the Eighteenth Century Collections Online (ECCO) corpus.
  • The hermeneutics of data representation, by Michael Sperberg-McQueen (abstract): a speculative and challenging investigation of the assumptions at the root of any machine-readable representation of knowledge – and their cultural implications.
  • Breaking the Historian's Code: Finding Patterns of Historical Representation, by Ryan Shaw (abstract): an investigation of the usage of natural language processing techniques for the purpose of 'breaking down' the 'code' of historical narrative. In particular, the sets of documents used are related to the civil rights movement, and the specific NLP techniques employed are named entity recognition, event extraction, and event chain mining.
  • Employing Geospatial Genealogy to Reveal Residential and Kinship Patterns in a Pre-Holocaust Ukrainian Village, by Stephen Egbert (abstract): this paper showed how it is possible to visualize residential and kinship patterns in the mixed-ethnic settlements of pre-Holocaust Eastern Europe by using geographic information systems (GIS), and how these results can provide useful materials for humanists to base their work on.
  • Prosopography and Computer Ontologies: towards a formal representation of the 'factoid' model by means of CIDOC-CRM, by me and John Bradley (abstract): this is the paper I presented (shameless self-plug, I know). It's about the evolution of structured prosopography (= the 'study of people' in history) from a mostly single-application and database-oriented scenario towards a more interoperable and linked-data one. In particular, I talked about the recent efforts for representing the notion of 'factoids' (a conceptual model normally used in our prosopographies) using the ontological language provided by CIDOC-CRM (a computational ontology commonly used in the museum community).

Event: Digital Humanities conference 2011
http://www.michelepasin.org/blog/2011/06/30/digital-humanities-conference-2011-a-short-review/ Thu, 30 Jun 2011 17:50:01 +0000

Last week I went to Stanford for the Digital Humanities 2011 international conference. This is arguably the most important event for researchers and academics who employ digital methods to tackle questions and problems normally associated with the 'humanities' disciplines. In this blog post I will start by summarising the things I was invited to talk about; in the near future I'll try to integrate this article with other reflections and pointers to interesting materials I ran into at the conference.

    I had two papers, one by myself and one with Matteo Romanello, a bright PhD student at DDH whom I’m co-supervising (well sort of.. since the college hasn’t formally recognized me in that role yet).

The first paper is about DJFacet, a faceted search engine I created and have already discussed elsewhere, so I won't bore you with the technical details here. The aspect I discussed at the conference is the use of DJFacet with complex humanities databases.

DJFacet lets you easily build a search interface consisting of many entry points (facets); in particular, thanks to an advanced functionality called 'pivoting' it is possible to switch the main perspective of a search dynamically (that is, the main result type you're searching for) while still keeping the search parameters previously chosen. This means, for example, that when searching for 'people' using facets such as 'surname' or 'age', you could change the result type to 'documents' and keep using the same 'surname' or 'age' facets. This is made possible by the implicit connection existing in the database between, for example, objects of type 'people' and objects of type 'document' (e.g. 'authors of').

However, despite the fact that this approach proved to be completely feasible from the logical and computational point of view, it also opened up a number of research questions regarding the meaning of these multi-faceted searches across different result types. In other words, we realized that the accumulation of filters ontologically distant from each other could often hardly be translated by the end user into real-world questions; analogously, the opposite may happen, insofar as simple types of searches may be impeded by the highly structured architecture of a faceted browser. The paper attempted to address these problems and provide some initial solutions. Here're the slides:

     
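To make the pivoting mechanism more concrete, here is a toy sketch over an in-memory dataset. DJFacet itself works on top of Django querysets; the data and helper functions below are invented for illustration.

```python
# Toy 'pivoting': filters chosen while searching for people are carried
# over when the result type switches to documents, via the authorship link.
people = [
    {"id": 1, "surname": "Smith", "age": 34},
    {"id": 2, "surname": "Jones", "age": 58},
]
documents = [
    {"id": 10, "title": "On Facets", "author_id": 1},
    {"id": 11, "title": "On Pivots", "author_id": 2},
]

def search_people(filters):
    return [p for p in people
            if all(p[key] == value for key, value in filters.items())]

def pivot_to_documents(filters):
    # Keep the person-level filters, but return documents by those people.
    author_ids = {p["id"] for p in search_people(filters)}
    return [d for d in documents if d["author_id"] in author_ids]

print(search_people({"surname": "Smith"}))       # people view
print(pivot_to_documents({"surname": "Smith"}))  # pivoted: their documents
```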

The second paper (headlined by Matteo) is about HuCit (available at www.purl.com/net/hucit), a formal ontology we're developing together, aimed at the formal representation of humanities citation structures.

The key idea here derives from the fact that while in the sciences a citation is normally represented in the form of a relation between two publications (and often that's all that is needed, e.g. to generate all sorts of interesting citation network analysis algorithms), in the humanities (and especially in classics) citations are normally analyzed by scholars with much greater attention to detail.

For example, citations may exhibit a particular style which scholars want to study and classify (for example when faced with ancient citations) for the purpose of better understanding and contextualizing the meaning of a citation. Secondly, in classics we have interesting 'phenomena' like canonical citations: these are citations that do not point at any publication in particular, but at an idealized version of a classic text (e.g. Homer's Iliad) which is used as a reference system for all subsequent editions of that text. Canonical citations fundamentally act as a reference to a point in a (textual) coordinate system which is agreed upon by the scholarly community – and thus needs to be followed so as to facilitate discussion in that community. So, in a nutshell, the HuCit ontology provides the representational primitives needed to support computational reasoning about the 'humanistic' way of working with citations. Here're the slides of the talk:

     
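As a back-of-the-envelope illustration of the 'coordinate system' intuition (the class and attribute names are invented here, not HuCit's actual terms):

```python
from dataclasses import dataclass

# A canonical citation points into the coordinate system of an idealized
# text, independently of any concrete edition. Names invented, not HuCit's.
@dataclass
class CanonicalCitation:
    author: str  # e.g. "Homer"
    work: str    # e.g. "Iliad"
    book: int
    line: int

    def resolve(self, edition: dict) -> str:
        """Look the coordinates up in one concrete edition's text."""
        return edition[(self.book, self.line)]

# Two 'editions' sharing the same reference system (invented snippets).
edition_a = {(1, 1): "Sing, O goddess, the anger of Achilles..."}
edition_b = {(1, 1): "The wrath sing, goddess, of Peleus' son..."}

cite = CanonicalCitation("Homer", "Iliad", book=1, line=1)
print(cite.resolve(edition_a))
print(cite.resolve(edition_b))
```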

All in all, I had a really good time at the conference. The campus and weather at Stanford are just amazing, and the DH community is really down-to-earth and approachable. I'll try to update this post in the next weeks with more information about people and projects that stimulated my imagination. Stay tuned!

     

Python links (and more) 7/2/11
http://www.michelepasin.org/blog/2011/02/03/python-links-and-more-7211/ Thu, 03 Feb 2011 15:23:21 +0000

This post contains just a collection of various interesting things I ran into in the last couple of weeks. They're organized into three categories: pythonic links, events and conferences, and new online tools. Hope you'll find something of interest!

    Pythonic stuff:

  • Epydoc
    Epydoc is a handy tool for generating API documentation for Python modules, based on their docstrings. For an example of epydoc’s output, see the API documentation for epydoc itself (html, pdf).
  • PyEnchant
    PyEnchant is a spellchecking library for Python, based on the excellent Enchant library.
  • Dexml
    The dexml module takes the mapping between XML tags and Python objects and lets you capture that as cleanly as possible. Loosely inspired by Django’s ORM, you write simple class definitions to define the expected structure of your XML document.
  • SpecGen
    SpecGen v5, ontology specification generator tool. It’s written in Python using Redland RDF library and licensed under the MIT license.
  • PyCloud
    Leverage the power of the cloud with only 3 lines of python code. Run long processes on the cloud directly from your shell!
  • commandlinefu.com
    This is not really pythonic – but nonetheless useful to pythonists: a community-based repository of useful unix shell scripts!
Events and Conferences:

  • Digital Resources in the Humanities and Arts Conference 2011
    University of Nottingham Ningbo, China. The DRHA 2011 conference theme this year is “Connected Communities: global or local2local?”
  • Narrative and Hypertext Workshop at the ACM Hypertext 2011 conference in Eindhoven.
  • Culture Hack Day, London, January 2011
    This event aimed at bringing cultural organisations together with software developers and creative technologists to make interesting new things.
  • History Hack Day, London, January 2011
    A bunch of hackers with a passion for history getting together and doing experimental stuff
  • Conference.archimuse.com
    The ‘online space for cultural informatics‘: lots of useful info here, about publications, jobs, people etc.
  • Agora project: Scholarly Open Access Research in European Philosophy
    Project looking at building an infrastructure for the semantic interlinking of European philosophy datasets
Online tools:

  • FactForge
    A web application aiming at showcasing a ‘practical approach for reasoning with the web of linked data’.
  • Semantic Overflow
    A clone of Stack Overflow (collaboratively edited question and answer site for programmers) for questions ‘about semantic web techniques and technologies’.
  • Google Refine
    A tool for “working with messy data, cleaning it up, transforming it from one format into another, extending it with web services, and linking it to databases”.
  • Google Scribe
    A text editor with embedded autocomplete suggestions as you type
  • Books Ngram Viewer
    Tool that displays statistical information regarding the use of user-selected sentences in a corpus of books (e.g., “British English”, “English Fiction”, “French”) over the selected years