Wednesday, June 13, 2012

Visualising differences between classifications using cluster maps

As part of a project to build a tool to navigate through taxonomic names and classifications I've become interested in quick ways to compare classifications. For example, EOL has multiple classifications for the same taxon, and I'd like to quickly discover what the similarities and differences are.

One promising approach is to use "cluster maps", a technique described by Fluit et al. (see Aduna Cluster Map for an implementation):

Fluit, C., Sabou, M., & Harmelen, F. (2006). Visualizing the Semantic Web. (V. Geroimenko & C. Chen, Eds.) (pp. 45–58). Springer Science + Business Media. doi:10.1007/1-84628-290-X_3 (see also http://www.cs.vu.nl/~frankh/abstracts/VSW05.html)

Cluster map details

Cluster maps can be thought of as fancy Venn Diagrams, in that they can be used to depict the overlap between sets of objects. The diagram is a graph with two kinds of nodes. One represents categories (in the example above, file formats and search terms), the other represents sets of objects that occur in one or more categories (in the example above, these are files that match the search terms "rdf" and "aperture").

I've cobbled together a crude version of cluster maps. For a given taxon (e.g., a genus) I list all the immediate sub-taxa (e.g., species) in each classification in EOL, and then find the sets of sub-taxa that are shared across the classification sources (e.g., ITIS, NCBI, etc.) and those that are unique to one source. I then create the cluster map using Graphviz. Inspired by the hexagonal packing used by Aduna, I've done something similar to display the taxa in each set. Adding these to the output of Graphviz required a little fussing with. First I get Graphviz to output the graph in SVG, then I load the SVG into a program that locates each node in the graph and inserts SVG for the packed circles (given that SVG is XML this is fairly straightforward).

As an example, consider the genus Demansia (http://eol.org/pages/34967/overview). EOL reports four classifications for this genus. Below is a cluster map for this genus:

34967

This diagram show that, for example, the Catalogue of Life (CoL) and Reptile databases share 4 names, these databases share three other names with ITIS. All databases have names unique to themselves, one database (NCBI) is completely disconnected from the other three databases.

One important caveat here is that I'm mapping the scientific names as returned by EOL, and in many cases these contain the taxonomic authority. This is a major headache, prompting this outburst:


If we clean the names by removing the taxonomic authority the clusters overlap rather more:
Demansia
Now we see that only ITIS and the Reptile Database have unique names. This is one reason why I get stroppy when taxonomists start saying databases shouldn't have to supply cleaned "canonical" names. If the names have authorities then I have to clean them, because in many cases the authorities (while useful to know) are inconsistent across databases. For example:

  • Demansia olivacea GRAY 1842 versus Demansia olivacea (Gray, 1842)
  • Demansia torquata GÜNTHER 1862 versus Demansia torquata (Günther, 1862)

Taxonomic authorities are frequently misspelt, and people seem confused about when to use parentheses or not. Databases should spare the user some pain and provide clean names (and authority strings separately where they have them).

The visualisation is still incomplete (I need to make it interactive), but it shows promise. The names that are unique to one database are usually worth investigating. In some cases they are names other databases regard as synonyms, in other cases they represent spelling variations. The goal of this visualisation is to highlight the names that the user might want to investigate further.