Wednesday, June 19, 2013

A new way to view taxonomic publications

One of the goals of BioNames is to be more than simply another taxonomic database. In particular, I'm interested in the idea of having a platform for viewing taxonomic publications. One way to think about this is to consider the experience of viewing Wikipedia. For any given page in Wikipedia there will be links to other, related content in Wikipedia. Reading an article about a city, you can go and read about the country the city occurs in. Reading about a battle, you can discover more about the generals who fought it. The ability to discover all this interconnected information in one place is compelling.

I'd like something similar for taxonomy. Given that a taxonomic database is in essence a collection of taxonomic names and publications, and a taxonomic publication is in essence a collection of names and citations of taxonomic publications, why not embed the publication within the database and have the names and citations link to the corresponding entries in the database?

Based on some earlier efforts (e.g., Towards an interactive taxonomic article: displaying an article from ZooKeys) and inspired by the eLife Lens project, I've created a live demo of a way to view articles from the journal ZooKeys. Below is a screencast:

If you want to try this out, here are some live examples:

Note the pattern in the URL, just append the DOI for an article to

Everything is a bit rough, but it's working well enough for you to get the basic idea. Code is in github. Essentially the viewer grabs the ZooKeys HTML, extracts the URL for the XML file, fetches that, then uses some XSLT style sheets to convert the XML into something viewable. There's a sprinkling of JavaScript to call the BioNames API. Much of the code could be tweaked to accept other NLM XML-based articles, such as content from PLoS and the BMC journals.
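
The "extract the URL for the XML file" step can be sketched roughly as below. This is a minimal illustration in Python, not the viewer's actual code: the assumption that the XML is exposed as an ordinary link ending in `.xml`, the sample HTML, the `example.org` URL, and the names `XmlLinkFinder` and `find_xml_url` are all hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class XmlLinkFinder(HTMLParser):
    """Collect href values of <a> tags that point at .xml files."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore any query string when checking the file extension.
        if href and href.lower().split("?")[0].endswith(".xml"):
            self.links.append(href)


def find_xml_url(html, base_url):
    """Return the absolute URL of the first XML link in the page, or None."""
    finder = XmlLinkFinder()
    finder.feed(html)
    if not finder.links:
        return None
    return urljoin(base_url, finder.links[0])


# Hypothetical landing-page fragment, for illustration only.
sample_html = '<p><a href="/article.pdf">PDF</a> <a href="/files/article.xml">XML</a></p>'
xml_url = find_xml_url(sample_html, "https://example.org/articles/1")
print(xml_url)
```

Once the XML has been fetched, the XSLT transformation would be a separate step; it is left out here because it depends on the style sheets themselves.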

One direction this could go in is to make a viewer like this the default viewer in BioNames for ZooKeys articles, so that instead of being restricted to a PDF you can interactively navigate between the article and the cited literature. Indeed, the very action of locating cited references in BioNames builds citation links. We could imagine extending the approach to content that isn't in NLM XML, such as Zootaxa PDFs, or content from BHL. Eventually I'd like to have the taxonomic literature fully embedded in the database, not as PDF or image silos, but as documents linked to names and literature. The journal becomes a database.

More GBIF taxonomy fail

In browsing the GBIF classification in BioNames I keep coming across cases of wholesale duplication of taxa. I recently blogged about a single example, the White-browed Gibbon, but here's a larger example involving frogs.

Consider the frog genera Philautus and Raorchestes. The latter was described in 2010:

A ground-dwelling rhacophorid frog from the highest mountain peak of the Western Ghats of India (2010) Current Science (Bangalore) 98(8): 1119–1125.
and contains a number of species previously in Philautus. The GBIF classification for Philautus still has these species, which means that these taxa appear twice in the GBIF data portal (associated with different occurrences).

To gauge the scale of the problem I've done a crude pairwise plot of species names in the two genera. In the diagram below a dot (●) appears if the species name in the corresponding row and column is identical. The diagonal corresponds to comparisons of a species name with itself.

Note the ●'s that appear off the diagonal. These are species in Philautus and Raorchestes that have the same species name (e.g., Philautus glandulosus and Raorchestes glandulosus). The off-diagonal dots indicate taxa that are duplicated.

● ● Raorchestes anili
● ● Raorchestes annandalii
● ● Raorchestes beddomii
● ● Raorchestes bobingeri
● ● Raorchestes bombayensis
● ● Raorchestes chalazodes
● ● Raorchestes charius
● ● Raorchestes dubois
● ● Raorchestes flaviventris
● ● Raorchestes glandulosus
● ● Raorchestes graminirupes
● ● Raorchestes griet
● ● Raorchestes gryllus
● ● Raorchestes longchuanensis
● ● Raorchestes luteolus
● ● Raorchestes menglaensis
● ● Raorchestes munnarensis
● ● Raorchestes nerostagona
● ● Raorchestes ochlandrae
● ● Raorchestes parvulus
● ● Raorchestes ponmudi
● ● Raorchestes sahai
● ● Raorchestes shillongensis
● ● Raorchestes signatus
● ● Raorchestes terebrans
● ● Raorchestes tinniens
● ● Raorchestes travancoricus
● ● Raorchestes tuberohumerus
● ● Raorchestes viridis
● Philautus abditus
● Philautus abundus
● Philautus acutirostris
● Philautus acutus
● Philautus adspersus
● Philautus albopunctatus
● Philautus alto
● Philautus amboli
● Philautus amoenus
● Philautus andersoni
● ● Philautus anili
● ● Philautus annandalii
● Philautus asankai
● Philautus aurantium
● Philautus auratus
● Philautus aurifasciatus
● Philautus banaensis
● Philautus basilanensis
● ● Philautus beddomii
● ● Philautus bobingeri
● ● Philautus bombayensis
● Philautus bunitus
● Philautus caeruleus
● Philautus cardamonus
● Philautus carinensis
● Philautus cavirostris
● ● Philautus chalazodes
● ● Philautus charius
● Philautus cinerascens
● Philautus cornutus
● Philautus crnri
● Philautus cuspis
● Philautus decoris
● Philautus dimbullae
● Philautus disgregus
● Philautus dubius
● ● Philautus dubois
● Philautus duboisi
● Philautus erythrophthalmus
● Philautus eximius
● Philautus extirpo
● Philautus femoralis
● Philautus fergusonianus
● ● Philautus flaviventris
● Philautus folicola
● Philautus frankenbergi
● Philautus fulvus
● Philautus garo
● ● Philautus glandulosus
● Philautus gracilipes
● ● Philautus graminirupes
● ● Philautus griet
● ● Philautus gryllus
● Philautus gunungensis
● Philautus hainanus
● Philautus hallidayi
● Philautus halyi
● Philautus hazelae
● Philautus hoffmanni
● Philautus hoipolloi
● Philautus hosii
● Philautus hypomelas
● Philautus ingeri
● Philautus jacobsoni
● Philautus jerdonii
● Philautus jinxiuensis
● Philautus kempiae
● Philautus kempii
● Philautus kerangae
● Philautus leitensis
● Philautus leucorhinus
● Philautus limbus
● ● Philautus longchuanensis
● Philautus longicrus
● Philautus lunatus
● ● Philautus luteolus
● Philautus macropus
● Philautus maia
● Philautus malcolmsmithi
● Philautus maosonensis
● Philautus medogensis
● ● Philautus menglaensis
● Philautus microdiscus
● Philautus microtympanum
● Philautus mittermeieri
● Philautus mjobergi
● Philautus mooreorum
● ● Philautus munnarensis
● Philautus namdaphaensis
● Philautus nanus
● Philautus narainensis
● Philautus nasutus
● Philautus neelanethrus
● Philautus nemus
● ● Philautus nerostagona
● Philautus ocellatus
● ● Philautus ochlandrae
● Philautus ocularis
● Philautus odontotarsus
● Philautus oxyrhynchus
● Philautus pallidipes
● Philautus papillosus
● Philautus pardus
● Philautus parkeri
● ● Philautus parvulus
● Philautus petersi
● Philautus petilus
● Philautus pleurotaenia
● Philautus poecilius
● Philautus polillensis
● ● Philautus ponmudi
● Philautus poppiae
● Philautus popularis
● Philautus procax
● Philautus quyeti
● Philautus refugii
● Philautus regius
● Philautus reticulatus
● Philautus rhododiscus
● Philautus romeri
● Philautus rugatus
● Philautus rus
● ● Philautus sahai
● Philautus sanctipalustris
● Philautus sanctisilvaticus
● Philautus sarasinorum
● Philautus saueri
● Philautus schmackeri
● Philautus schmarda
● Philautus semiruber
● ● Philautus shillongensis
● ● Philautus signatus
● Philautus silus
● Philautus silvaticus
● Philautus simba
● Philautus similipalensis
● Philautus similis
● Philautus sordidus
● Philautus steineri
● Philautus stellatus
● Philautus stictomerus
● Philautus stuarti
● Philautus supercornutus
● Philautus surdus
● Philautus surrufus
● Philautus tectus
● Philautus temporalis
● ● Philautus terebrans
● ● Philautus tinniens
● ● Philautus travancoricus
● Philautus truongsonensis
● ● Philautus tuberohumerus
● Philautus tytthus
● Philautus umbra
● Philautus variabilis
● Philautus vermiculatus
● ● Philautus viridis
● Philautus vittiger
● Philautus williamsi
● Philautus woodi
● Philautus worcesteri
● Philautus wynaadensis
● Philautus zal
● Philautus zamboangensis
● Philautus zimmeri
● Philautus zorro
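
The off-diagonal dots amount to a set intersection of species epithets between the two genera. Here is a minimal Python sketch of that comparison, using a handful of names copied from the list above; the function name `epithets` is made up for the example, and this is not the code used to draw the plot. A shared epithet is a strong hint of duplication rather than proof, so each hit still needs checking against the literature.

```python
def epithets(names, genus):
    """Return the set of species epithets for binomials in the given genus."""
    result = set()
    for name in names:
        parts = name.split()
        if len(parts) == 2 and parts[0] == genus:
            result.add(parts[1])
    return result


# A small sample of names taken from the list above.
names = [
    "Raorchestes anili",
    "Raorchestes glandulosus",
    "Raorchestes viridis",
    "Philautus abditus",
    "Philautus anili",
    "Philautus glandulosus",
    "Philautus viridis",
    "Philautus zorro",
]

# Epithets present in both genera are the candidate duplicates.
shared = sorted(epithets(names, "Philautus") & epithets(names, "Raorchestes"))
for epithet in shared:
    print(f"Philautus {epithet} / Raorchestes {epithet}")
```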

Why does GBIF have duplicate frogs? As with the gibbon example, the names come from different sources, and GBIF doesn't have access to (or doesn't use) data that tells it that the names are synonyms. In this case there is a clash between the Catalogue of Life, which doesn't recognise Raorchestes, and the IUCN Red List, which does. The end result is a mess.

We clearly need better tools for catching these problems. We also need a decent database of taxonomic names and synonyms. The Catalogue of Life is, frankly, grossly inadequate in this respect, especially for vertebrate taxa. Increasingly it's becoming clear that the classification underlying the GBIF portal needs some serious work.

Thursday, June 13, 2013

BioNames - colourful phylogenies and downloadable SVG

My latest tweak to BioNames is to add colour to the phylogenies. Terminal nodes with the same name are labelled with the same background colour. For example, here is a tree for fiddler and ghost crabs:
The colours make it easier to see that this tree has a mixture of a few sequences from divergent taxa, and a lot of sequences from the same taxa.
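
One way to give identical names identical colours is to derive the colour from the name itself. The sketch below is only an illustration of that idea, not how BioNames does it: the hashing approach, the function name `colour_for_name`, and the placeholder leaf labels are all assumptions.

```python
import colorsys
import hashlib


def colour_for_name(name):
    """Map a taxon name to a stable pastel colour as a hex string."""
    digest = hashlib.sha256(name.encode("utf-8")).digest()
    hue = digest[0] / 256.0  # the first byte of the hash picks the hue
    # High lightness and moderate saturation keep the label text readable.
    red, green, blue = colorsys.hls_to_rgb(hue, 0.8, 0.6)
    return "#{:02x}{:02x}{:02x}".format(
        round(red * 255), round(green * 255), round(blue * 255)
    )


# Placeholder leaf labels; a real tree would supply these from its terminal nodes.
leaf_labels = ["Taxon A", "Taxon B", "Taxon A", "Taxon C", "Taxon A"]
leaf_colours = [colour_for_name(label) for label in leaf_labels]
for label, colour in zip(leaf_labels, leaf_colours):
    print(label, colour)
```

Because only 256 hues are possible here, two different names can occasionally land on the same colour; assigning evenly spaced hues to the distinct names in each tree would avoid that.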

Note that you can now also download the SVG drawing of the tree. Click on the Download button and (in at least some browsers, such as Chrome) the SVG will download. Other browsers may open the SVG in a separate window, in which case simply save it to your computer.

Wednesday, June 12, 2013

The first five minutes are free - renting articles on DeepDyve

In 2011 I wrote a short post about DeepDyve, a service where you could rent access to an article. DeepDyve has launched a "5-Minute Freemium" service where you can view an article online for 5 minutes, for free. You have to log in, either with DeepDyve or using Facebook, but no actual money changes hands. If you want to read for longer, or download an article then you have to get out your credit card.

I've added support for DeepDyve to BioNames. If an article is available in DeepDyve, BioNames displays a link (see for an example). DeepDyve makes it possible to quickly check a fact (for example, the spelling of a taxonomic name). It obviously doesn't tackle bigger issues such as access to text for data mining, but if you just need to check something, or follow a lead, then it's an interesting and useful wrinkle on publishing models.

Gibbons and GBIF: good grief what a mess

One reason I built BioNames (and the related digital archive BioStor) was to create tools to help make sense of taxonomic names. In exploring databases such as GBIF and the NCBI taxonomy every so often you come across cases where things have gone horribly wrong, and to make sense of them you have to drill down into the taxonomic literature.

It's becoming increasingly clear to me that large parts of the GBIF classification that underpins their data portal are, well, a mess. There are duplicate taxa, homonyms, orphan genera, and so on. Now, building a global taxonomy on the scale of GBIF is a tough problem. They are merging a lot of individual classifications into an overall synthesis. That would be a challenging problem in itself, but it's compounded by inconsistent use of names for the same taxon. In other words, synonymy. This is the greatest self-inflicted wound in taxonomy: the desire to have names be meaningful in terms of relationships (i.e., species in the same genus should be related). If you require that, then the consequence is a mess (unless you have a really good taxonomic database in place to track name changes, and we don't).

As an example, consider the White-browed Gibbon (shown here in an image from EOL). In GBIF this taxon occurs in at least three different places in the GBIF classification (each name has occurrence data associated with it):

GBIF id  | Name                               | Source                                    | Occurrences
5219549  | Hylobates hoolock (Harlan, 1834)   | The Catalogue of Life, 3rd January 2011   | 141
4267262  | Bunopithecus hoolock Harlan, 1834  | Mammal Species of the World, 3rd edition  | 2
5786121  | Hoolock hoolock (Harlan, 1834)     | IUCN Red List of Threatened Species       | 3

To keep things simple I've omitted the subspecies (such as Bunopithecus hoolock hoolock). Note that three key resources for names (the Catalogue of Life, Mammal Species of the World, and the IUCN) can't agree on what to call this ape. The names are also not entirely consistent. For example, as written, Bunopithecus hoolock Harlan, 1834 (from Mammal Species of the World, 3rd edition) would imply that this was the original name for this gibbon (because the authority [Harlan, 1834] is not in parentheses). This is incorrect: the original name of the White-browed Gibbon is Simia hoolock, and you can see the original description in BioStor:

Harlan R (1834) Description of a Species of Orang, from the north-eastern province of British East India, lately the kingdom of Assam. Transactions of the American Philosophical Society 4: 52–59.
Since then it has been shuffled around various genera, including a genus (Hoolock) for which it is the type species:
Mootnick A, Groves C (2005) A new generic name for the hoolock gibbon (Hylobatidae). International Journal of Primatology 26(4): 971–976. doi: 10.1007/s10764-005-5332-4.
GBIF regards all three names as being different taxa, despite all being names for the same gibbon. The practical consequence of this is that anyone seeking a comprehensive summary of what GBIF knows about the White-browed Gibbon is going to get different data depending on which name they use. In my experience this is not an uncommon occurrence (bats are another case where the GBIF classification is a terrible hodgepodge).

My goal here is not to berate GBIF; they are trying to aggregate messy, inconsistent data on a massive scale. But we need tools to flag cases like this poor gibbon, and ways to ensure that once we've found a problem it is fixed once and for all.
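
As a rough idea of what such a flagging tool might look like, here is a minimal Python sketch that parses the three GBIF names from the table above and groups them by species epithet, author and year. Names that share all three but sit in different genera are candidates for being the same taxon. The regular expression, the function names `parse_name` and `find_candidate_synonyms`, and the grouping rule are my own assumptions for illustration; the sketch handles only simple "Genus epithet Author, year" strings and ignores complications such as epithets whose endings change with the gender of the genus.

```python
import re
from collections import defaultdict

# Matches "Genus epithet Author, 1834" with or without parentheses around the authority.
NAME_PATTERN = re.compile(
    r"^(?P<genus>[A-Z][a-z]+) (?P<epithet>[a-z]+) "
    r"(?P<open>\()?(?P<author>[^,()]+), (?P<year>\d{4})\)?$"
)


def parse_name(name):
    """Split a name string into its parts, or return None if it doesn't fit the pattern."""
    match = NAME_PATTERN.match(name)
    if match is None:
        return None
    return {
        "genus": match.group("genus"),
        "epithet": match.group("epithet"),
        "author": match.group("author"),
        "year": match.group("year"),
        # Parentheses signal that the species was originally described in another genus.
        "parenthesised": match.group("open") is not None,
    }


def find_candidate_synonyms(names):
    """Group names sharing epithet, author and year; keep groups with more than one name."""
    groups = defaultdict(list)
    for name in names:
        parsed = parse_name(name)
        if parsed is not None:
            key = (parsed["epithet"], parsed["author"], parsed["year"])
            groups[key].append(name)
    return {key: members for key, members in groups.items() if len(members) > 1}


gbif_names = [
    "Hylobates hoolock (Harlan, 1834)",
    "Bunopithecus hoolock Harlan, 1834",
    "Hoolock hoolock (Harlan, 1834)",
]
candidates = find_candidate_synonyms(gbif_names)
for key, members in candidates.items():
    print(key, members)
```

The `parenthesised` flag also makes the inconsistency noted above easy to spot: within one group, at most one name (the original combination) should have an authority without parentheses.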

Saturday, June 08, 2013

BioNames phylogenies screencast

I've created a short screencast showing some of the phylogenies in BioNames. If you want to see these for yourself here are the links:

Friday, June 07, 2013

BioNames - Phylogenies? Yes, phylogenies

One of the things that didn't make last week's deadline for launching BioNames was the inclusion of phylogenies. This was disappointing as one of the reasons I built BioNames was to help span what I see as the gulf between classical biodiversity informatics and its emphasis on taxonomic names and classification, and modern phylogenetics where the tree is the primary focus, not some arbitrary way to partition it up.

So, where to get lots of phylogenies? I use the wonderful PhyLoTA database built by Mike Sanderson and colleagues:
Sanderson, M., Boss, D., Chen, D., Cranston, K., & Wehe, A. (2008). The PhyLoTA Browser: Processing GenBank for Molecular Phylogenetics Research. Systematic Biology, 57(3), 335-346. doi:10.1080/10635150802158688

I grabbed a dump of the trees, matched them to sequences in GenBank (more accurately, the European version, EMBL), did some post-processing of those sequences, threw them into CouchDB, built an SVG viewer, and voilà.

Here is a tree for the fig wasp family Agaonidae, showing the interactive zoomable tree viewer, and thumbnails for other trees for this taxon:


Here is a phylogeny for a genus of deep-sea mussels (Bathymodiolus), showing a map based on those sequences that are georeferenced in GenBank:


Lastly here is the page for the bat family Vespertilionidae in the NCBI classification. Click on the "Data" tab to see this view.


There's still lots to do on this, but the key parts are in place. Personally I can happily while away the day just browsing through the trees, looking for cases where taxa lack scientific names, obvious cases of synonymy (take a look at this tree for fiddler and ghost crabs, for example), and evidence that "species" have considerable internal phylogenetic structure.

Tuesday, June 04, 2013

BioNames and where taxonomy is published

I've added a simple "dashboard" to BioNames to display some basic data about what is in the database. Apart from a table of the number of bibliographic identifiers in the database (currently there are 54,422 publications with DOIs, for example), there are some graphic summaries. These are a bit slow to load as they are created on the fly.


The first summarises the relative frequency of articles from different publishers (broadly defined to include digital repositories such as DSpace and JSTOR). For most of this information I'm using data returned when I resolve a DOI at CrossRef. The data is incomplete and likely to change as I add more articles, and CouchDB finally catches up and indexes all the data.


The biggest blob is BioStor, which is my project to extract articles from BHL. Magnolia Press publish Zootaxa, then there are some well-known mainstream publishers such as Springer, Wiley, and Taylor & Francis (Informa UK). These publishers have digitised the back catalogues of a number of society journals, so their prominence here doesn't mean that they are actively publishing new taxonomic content. One use for a diagram like this is to think about what content to data mine. BioStor content is open access (via BHL) and so can be readily mined. Some articles in Zootaxa are open access and so could also be downloaded and processed. Then we have the big commercial publishers, who have a significant fraction of taxonomic content behind their paywalls. If the community was to think about mining this data, then this diagram suggests which publishers to start asking first.


The next diagram shows articles grouped by journal (using the journal's ISSN).


These circles are too small to be labelled usefully. A couple of things strike me. The first is the sheer number of journals! The taxonomic literature is widely scattered across numerous different outlets, which is part of the challenge of indexing the literature (and this diagram includes only those journals that have ISSNs; many smaller or older ones don't). There is no one journal which dominates the landscape (the largest circle on the top right is Zootaxa). But this diagram spans the complete history of taxonomic publication, so includes large journals (such as Annals and Magazine of Natural History) that no longer exist (at least in their present form). It might be useful to slice this diagram by, say, decade to get a clearer picture of patterns of publication.
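
Slicing by decade is a small change to the counting step. The Python sketch below assumes each article is available as a record with an `issn` and a `year` field; those field names, the placeholder ISSN values, and the sample records are hypothetical and are not the BioNames data model.

```python
from collections import Counter


def decade(year):
    """Return the first year of the decade containing `year`, e.g. 1987 -> 1980."""
    return (year // 10) * 10


def articles_per_journal_by_decade(records):
    """Count articles per (decade, ISSN), skipping records missing either field."""
    counts = Counter()
    for record in records:
        issn = record.get("issn")
        year = record.get("year")
        if issn is None or year is None:
            continue
        counts[(decade(year), issn)] += 1
    return counts


# Hypothetical records with placeholder ISSNs, for illustration only.
records = [
    {"issn": "ISSN-A", "year": 2003},
    {"issn": "ISSN-A", "year": 2009},
    {"issn": "ISSN-A", "year": 2012},
    {"issn": "ISSN-B", "year": 1897},
    {"issn": None, "year": 1950},  # journal without an ISSN is skipped
]
counts = articles_per_journal_by_decade(records)
for (start, issn), n in sorted(counts.items()):
    print(start, issn, n)
```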

As the database builds I'll post some more summaries at BioNames.

Monday, June 03, 2013

BioNames and altmetrics

One consequence of having a database of literature with external identifiers such as DOIs is that we can plug into a bunch of external services to get additional information about a reference. For example, altmetric can take a DOI and display some article-level metrics. As an experiment I've added code for altmetric badges to the web page in BioNames that displays publications. For example, here is the ZooKeys paper "An extraordinary new family of spiders from caves in the Pacific Northwest (Araneae, Trogloraptoridae, new family)" in BioNames:


The "About" tab displays the altmetric badge with a bunch of metrics of engagement with this paper. If you click on the badge you get more details about what people have been saying about this paper.

It would be great to explore this across the complete set of taxonomic papers so that we could get a sense of the degree of engagement people have with the latest taxonomic literature.