Thursday, November 21, 2013

GBIF, GitHub, and taxonomy (again)

Quick notes on yet another attempt to marry the task of editing a taxonomic classification with versioning it in GitHub.

The idea of dumping the whole GBIF classification into GitHub as a series of nested folders looks untenable. So, maybe there's another way to tackle the problem.

Let's imagine that we dump, say, the GBIF classification down to family-level as a series of nested folders (i.e., we recreate the classification on disk). For each family we then create a bunch of files and store them in that folder. For example, we could have the classification in Darwin Core Archive format (basically, delimited text). Let's also create a graph that corresponds to that classification, using a format for which we have tools available for visualising and editing.
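As a rough sketch of the "recreate the classification on disk" step, something like the following would do. The rank list, the row layout, and the function names are my own assumptions for illustration, not anything GBIF provides, and the single row is sample data rather than GBIF's actual higher classification.

```python
import tempfile
from pathlib import Path

# Assumed ranks to nest by, highest first (hypothetical choice).
RANKS = ("kingdom", "phylum", "class", "order", "family")

def family_path(root, row):
    """Return the nested folder path for one family-level row."""
    return Path(root).joinpath(*(row[rank] for rank in RANKS))

def dump_classification(root, rows):
    """Create one nested folder per family and return the created paths."""
    created = []
    for row in rows:
        path = family_path(root, row)
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created

# Illustrative row only.
rows = [
    {"kingdom": "Animalia", "phylum": "Arthropoda", "class": "Malacostraca",
     "order": "Decapoda", "family": "Pinnotheridae"},
]

with tempfile.TemporaryDirectory() as tmp:
    paths = dump_classification(tmp, rows)
    relative = paths[0].relative_to(tmp).as_posix()
    exists = paths[0].is_dir()
```

Each family folder would then hold the files for that family (the delimited text, the graph, and so on).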

For example, I've created a Graph Modelling Language (GML) file for the Pinnotheridae here. Using software such as yEd I can load this file, display it, and edit it. For example, below is a compact tree layout of the graph:

[Image: compact tree layout of the Pinnotheridae graph]

This image is a bitmap. If you opened the GML file in yEd it would be interactive, and you could zoom in, alter the layout, edit the graph, etc.
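GML itself is a simple text format, so producing a file from parent-child pairs doesn't need much. Here is a minimal sketch, assuming the classification is available as (parent, child) name pairs; the function name and the sample edges are mine, and a real export would probably carry more attributes per node.

```python
def to_gml(edges):
    """Serialise (parent, child) name pairs as a directed GML graph."""
    ids = {}
    for parent, child in edges:
        for name in (parent, child):
            # Assign node ids in order of first appearance.
            ids.setdefault(name, len(ids))
    lines = ["graph [", "  directed 1"]
    for name, node_id in ids.items():
        lines.append(f'  node [ id {node_id} label "{name}" ]')
    for parent, child in edges:
        lines.append(f"  edge [ source {ids[parent]} target {ids[child]} ]")
    lines.append("]")
    return "\n".join(lines)

# Small excerpt of the family, for illustration only.
edges = [
    ("Pinnotheridae", "Glassella"),
    ("Pinnotheridae", "Glassellia"),
    ("Glassellia", "Glassellia costaricana"),
]
gml = to_gml(edges)
```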

Looking at the graph there are a few oddities, such as "orphan" genera that lack any species, and some names that appear very similar. For example, there is an orphan genus Glassella, and a similar genus Glassellia (note the "i") with a single species Glassellia costaricana. A little digging in BioNames shows that Glassellia is a misspelling of Glassella. The original description appears in:

E Campos, M K Wicksten (1997) A New Genus For The Central American Crab Pinnixa costaricana Wicksten, 1982 (Crustacea: Brachyura: Pinnotheridae). Proceedings of the Biological Society of Washington 110(1): 69–73. http://biostor.org/reference/81137

So, we have one genus that appears twice due to a typo. Furthermore, there are nodes in the graph for the taxa Glassellia costaricana and Pinnixa costaricana, but these are the same thing (the names are synonyms, albeit Glassellia costaricana has the genus misspelt). So, we could delete Pinnixa costaricana, delete the misspelling Glassellia, fix the misspelling in Glassellia costaricana, and move it to the correctly spelt Glassella. There are other problems with this classification, but let's leave them for the moment.
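Oddities like these could be flagged automatically rather than spotted by eye. The sketch below assumes the graph has been reduced to child-to-parent links; the 0.9 similarity threshold is an arbitrary guess on my part, and difflib is just a convenient standard-library stand-in for a proper name-matching tool.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical excerpt of the graph as child -> parent links.
parent_of = {
    "Glassella": "Pinnotheridae",
    "Glassellia": "Pinnotheridae",
    "Pinnixa": "Pinnotheridae",
    "Glassellia costaricana": "Glassellia",
    "Pinnixa costaricana": "Pinnixa",
}
genera = [name for name, parent in parent_of.items() if parent == "Pinnotheridae"]

def orphan_genera(genera, parent_of):
    """Genera that are not the parent of any node."""
    parents = set(parent_of.values())
    return sorted(g for g in genera if g not in parents)

def similar_pairs(names, threshold=0.9):
    """Pairs of names whose similarity ratio meets the threshold."""
    pairs = []
    for a, b in combinations(sorted(names), 2):
        if SequenceMatcher(None, a, b).ratio() >= threshold:
            pairs.append((a, b))
    return pairs

orphans = orphan_genera(genera, parent_of)
suspects = similar_pairs(genera)
```

On this excerpt the orphan list and the suspect pair both point at Glassella, which is the case a human would then go and check in BioNames.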

Now, imagine that after editing I use the graph to regenerate the DWCA file, which now has the edited classification. I then commit the changes to GitHub, and anyone else (including GBIF) could grab the DWCA and, for example, replace their Pinnotheridae classification with the edited version.
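Going back from the graph to delimited text might look something like this. It only produces the tab-delimited taxon table, not a complete archive (a real DWCA also needs a meta.xml descriptor and is zipped). The column names are Darwin Core terms, but the sequential identifiers and the function names are assumptions made for the example.

```python
import csv
import io

FIELDS = ["taxonID", "parentNameUsageID", "scientificName"]

def to_taxon_rows(parent_of):
    """Turn child -> parent links into Darwin Core style taxon rows."""
    names = sorted(set(parent_of) | set(parent_of.values()))
    # Made-up sequential ids; real data would keep its existing ids.
    ids = {name: str(i) for i, name in enumerate(names, start=1)}
    rows = []
    for name in names:
        parent = parent_of.get(name)
        rows.append({
            "taxonID": ids[name],
            "parentNameUsageID": ids[parent] if parent else "",
            "scientificName": name,
        })
    return rows

def write_taxon_tsv(rows):
    """Serialise taxon rows as tab-delimited text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, delimiter="\t",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

# The edited excerpt: misspelt genus gone, species moved to Glassella.
edited = {
    "Glassella": "Pinnotheridae",
    "Glassella costaricana": "Glassella",
}
tsv = write_taxon_tsv(to_taxon_rows(edited))
```

Because the output is plain text, the commit to GitHub would show the edit as an ordinary line-level diff.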

We could also go further, and add what I think is a missing component of the GBIF classification, namely a link to the nomenclators. For example, in an ideal world we would have each name in the classification linked to a stable identifier for the name provided by a nomenclator, and that nomenclator would know, for example, that Pinnixa costaricana and Glassella costaricana were objective synonyms. If we had those links then we could automatically detect cases such as this, where logically you can have either Pinnixa costaricana or Glassella costaricana in the same classification, but not both.
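To make the check concrete, here is a sketch of what the detection could look like if those links existed. The lookup table and its identifiers are entirely made up; I'm assuming the nomenclator gives objective synonyms the same group identifier, which is one possible design rather than how any existing nomenclator works.

```python
from collections import defaultdict

# Hypothetical nomenclator lookup: name -> objective-synonym group id.
objective_group = {
    "Glassella": "name:1",
    "Pinnixa": "name:2",
    "Pinnixa costaricana": "name:3",
    "Glassella costaricana": "name:3",
}

def synonym_conflicts(names, objective_group):
    """Return synonym groups with more than one name in the classification."""
    groups = defaultdict(list)
    for name in names:
        group_id = objective_group.get(name)
        if group_id is not None:
            groups[group_id].append(name)
    return {gid: sorted(ns) for gid, ns in groups.items() if len(ns) > 1}

classification = ["Glassella", "Pinnixa",
                  "Pinnixa costaricana", "Glassella costaricana"]
conflicts = synonym_conflicts(classification, objective_group)
```

Names the nomenclator doesn't know about are simply skipped here, which is itself a useful list to have.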

There are some wrinkles to figure out. For example, it would be nice to compute the difference between the original and edited graphs in terms of graph operations (not simply the difference as text files), so we could do things like list nodes that have been moved or deleted. I did some work on this a while back (Page, R. D., & Valiente, G. (2005). BMC Bioinformatics, 6(1), 208. doi:10.1186/1471-2105-6-208); something like that tool might do the trick.
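A very naive version of such a diff is easy if every node keeps a stable identifier across the edit. To be clear, this is not the approach from the paper cited above, just a set comparison; the node ids and the (label, parent id) layout are assumptions for the example.

```python
def graph_diff(original, edited):
    """Compare two {node_id: (label, parent_id)} maps by node id."""
    shared = original.keys() & edited.keys()
    return {
        "deleted": sorted(original.keys() - edited.keys()),
        "added": sorted(edited.keys() - original.keys()),
        # Same node, different parent.
        "moved": sorted(n for n in shared if original[n][1] != edited[n][1]),
        # Same node, different label.
        "renamed": sorted(n for n in shared if original[n][0] != edited[n][0]),
    }

# Hypothetical node ids for the excerpt discussed above.
original = {
    "n1": ("Pinnotheridae", None),
    "n2": ("Glassella", "n1"),
    "n3": ("Glassellia", "n1"),
    "n4": ("Glassellia costaricana", "n3"),
    "n5": ("Pinnixa", "n1"),
    "n6": ("Pinnixa costaricana", "n5"),
}
edited = {
    "n1": ("Pinnotheridae", None),
    "n2": ("Glassella", "n1"),
    "n4": ("Glassella costaricana", "n2"),
    "n5": ("Pinnixa", "n1"),
}
diff = graph_diff(original, edited)
```

Without stable ids the rename and move of the species would show up as a deletion plus an addition, which is exactly the kind of thing a text diff gets wrong.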

There is an element here of trying to coerce a problem into a form that existing tools can solve, but in a way that's what makes it attractive. If we can use things that already exist then we can move from talking about it to actually doing it.