Tuesday, May 25, 2010

TreeBASE II makes me pull my hair out

I've been playing a little with TreeBASE II, and the more I do the more I want to pull my hair out.

Broken URLs
The old TreeBASE had a URL API, which databases such as NCBI made use of. For example, the NCBI page for Amphibolurus nobbi has a link to this taxon in TreeBASE. The link is http://www.treebase.org/cgi-bin/treebase.pl?TaxonID=T31183&Submit=Taxon+ID. Now, this is a fragile looking link to a Perl CGI script, and sure enough, it's broken. Click on it and you get a 404. In moving to the new TreeBASE II, all these inward links have been severed. At a stroke TreeBASE has cut itself off from an obvious source of traffic from probably the most important database in biology. Please, please, throw in some mod_rewrite and redirect these CGI calls to TreeBASE II.
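A minimal sketch of what such a redirect needs to do, written in Python rather than as an actual mod_rewrite rule. The target URL template is purely hypothetical (the TreeBASE II URL scheme isn't given here), and the sketch assumes the new site can look up a legacy TaxonID:

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical target: the real TreeBASE II URL scheme is not given in this post.
NEW_TAXON_URL = "http://example.org/treebase2/taxon/legacy/{taxon_id}"


def redirect_target(legacy_url):
    """Return the URL a legacy treebase.pl link could redirect to, or None."""
    parts = urlparse(legacy_url)
    if not parts.path.endswith("/cgi-bin/treebase.pl"):
        return None
    # parse_qs maps each query parameter to a list of values.
    taxon_ids = parse_qs(parts.query).get("TaxonID")
    if not taxon_ids:
        return None
    return NEW_TAXON_URL.format(taxon_id=taxon_ids[0])


print(redirect_target(
    "http://www.treebase.org/cgi-bin/treebase.pl?TaxonID=T31183&Submit=Taxon+ID"
))
```

The only real work is pulling the old identifier out of the query string; everything after that depends on the new site accepting it.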

New identifiers
All the TreeBASE studies and taxa have new identifiers. Why? Imagine if GenBank decided to trash all the accession numbers and start again from scratch. TreeBASE II does support "legacy" StudyIDs, so you can find a study using the old identifier (you know, the one people have cited in their papers). But there's no support for legacy TaxonIDs (such as T31183 for Amphibolurus nobbi). I have to search by taxon name. Why no support for legacy taxon IDs?

Dumb search
Which brings me to search. The search interface for taxa in TreeBASE is gloriously awful:


So, I have to tell the computer what I'm looking for. I have to tell it whether I'm looking for an identifier or doing a text search, then within those categories I need to be more specific: do I want a TreeBASE taxon ID (new ones of course, because the old ones have gone), NCBI id, or uBio? And this is just the "simple" search, because there's an option for "Advanced search" below.

Maybe it's just me, but I get really annoyed when I'm asked to do something that a computer can figure out. I shouldn't have to tell a computer that I'm searching for a number or some text, nor should I have to tell it what that number or text means. Computers are pretty good at figuring that stuff out. I want one search box, into which I can type "Amphibolurus nobbi", or "Tx1294" or "T31183" or "206552" or "6457215" or "urn:lsid:ubio.org:namebank:6457215" (or a DOI, or a text string, or pretty much anything) and the computer does the rest. I don't ever want to see this:


Computers are dumb, but they're not so dumb that they can't figure out if something is a number or not. What I want is something close to this:


Is this really too much to ask? Can we have a search interface that figures out what the user is searching for?

Note to self: Given that TreeBASE has an API, I wonder how hard it would be to knock up a tool that took a search query, ran some regular expressions to figure out what the user might be interested in, then hit the API with that search, and returned the results?
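As a rough sketch of the regular-expression step, here is one way to classify a query string in Python. The patterns and the category names are my assumptions, based only on the example queries above (for instance, that a "Tx" prefix marks a new TreeBASE II taxon ID and a bare "T" prefix a legacy one); the call to the TreeBASE API is left out entirely:

```python
import re

# Assumed patterns, ordered from most to least specific.
PATTERNS = [
    ("ubio_lsid", re.compile(r"^urn:lsid:ubio\.org:namebank:(\d+)$", re.I)),
    ("doi", re.compile(r"^(?:doi:)?(10\.\d{4,9}/\S+)$", re.I)),
    ("treebase2_taxon_id", re.compile(r"^Tx(\d+)$")),
    ("legacy_treebase_taxon_id", re.compile(r"^T(\d+)$")),
    # A bare number is ambiguous (NCBI taxon id? uBio namebank id?),
    # so a real tool would have to try each candidate source in turn.
    ("numeric_id", re.compile(r"^(\d+)$")),
]


def classify(query):
    """Return (kind, value) for a search string; fall back to a text search."""
    q = query.strip()
    for kind, pattern in PATTERNS:
        match = pattern.match(q)
        if match:
            return kind, match.group(1)
    return "text", q


for example in ["Amphibolurus nobbi", "Tx1294", "T31183", "206552",
                "urn:lsid:ubio.org:namebank:6457215"]:
    print(example, "->", classify(example))
```

The point of returning a category rather than running the search directly is that the ambiguous cases (a bare number) can be sent to more than one lookup.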

My concern here is that TreeBASE II is important, very important. Which means it's important to make it usable, which means don't break existing URLs, don't make old identifiers disappear, and don't have a search interface that makes me want to pull my hair out.

Friday, May 21, 2010

Linked data part 2

Continuing the Friday folly theme, below is a screencast of a linked data browser that uses the same ideas as last week's screencast, but uses a custom browser I've written to display the results in a more user-friendly way.

Linking the data together from Roderic Page on Vimeo.

The demo is live, you can view it at http://iphylo.org/~rpage/browser/www/uri/http://bioguid.info/doi:10.1371/journal.pone.0001787. Under the hood the browser uses bioGUID as the primary linked data provider (although it should consume any valid linked data source, for example Dbpedia). The data is stored in a local triple store (ARC), and the web interface is created by transforming SPARQL queries into HTML using XSLT. You can add data to it by editing the URL in the browser location bar and reloading the page, or entering a URL on the page. Linked data URLs can be entered next to the Browse button as is, e.g. http://dbpedia.org/resource/Euphausia, or appended to http://iphylo.org/~rpage/browser/www, e.g. http://iphylo.org/~rpage/browser/www/http://dbpedia.org/resource/Euphausi. Other identifiers, such as DOIs, PubMed ids, and specimens need to be resolved via bioGUID, e.g. http://iphylo.org/~rpage/browser/www/uri/http://bioguid.info/gi:86161637.

All still very crude, but I hope you get the idea.

Thursday, May 20, 2010

NCBI Taxonomy IDs and Wikipedia


I've written a note on the Wikipedia Taxobox page making the case for adding NCBI taxonomy IDs to the standard Taxobox used to summarise information about a taxon. Here is what I wrote:

Wikipedia's taxon pages have a huge web presence (see my blog post Google and Wikipedia revisited and Page, R. D. M. (2010). "Wikipedia as an encyclopaedia of life". Nature Precedings hdl:10101/npre.2010.4242.1). If a taxon is in Wikipedia it is almost always the first search result in Google. Researchers in other areas of biology are making use of Wikipedia as a tool to annotate genes (Gene Wiki) and RNA families (Wikipedia:WikiProject_RNA). Pages for genes, such as Cytochrome_b, have numerous external identifiers in their equivalent of the Taxobox (the Pfam_box). I think we are missing a huge opportunity by not including NCBI taxonomy ids. The advantages would be:

  • It would provide a valuable service to Wikipedia readers by enabling them to go to NCBI to discover more about a taxon

  • It would help Wikipedia contributors by providing a standardised way to refer to NCBI (and enable bots to add missing NCBI taxonomy ids). Putting them in an External links section makes it harder to be consistent (there are various ways to write a URL linking to the NCBI taxonomy)

  • It would facilitate linking from NCBI to Wikipedia. A mapping of Wikipedia pages to NCBI taxonomy ids could be added to NCBI Linkout, generating more traffic to the Wikipedia pages

  • Projects that are trying to integrate information from different sources would be able to combine genomic information from NCBI with other information much more readily

Note that I am not arguing that Wikipedia should "follow" NCBI taxonomy, merely that where the potential to link exists, the links would create value, both within and outside the Wikipedia community.

Some discussion has ensued on the Taxobox page, all positive. I'm blogging this here to encourage anyone who has any more thoughts on the matter to contribute to the discussion.

Friday, May 14, 2010

Viewing a BioStor reference in Cooliris

Cooliris is a web browser plugin that can display a large number of images as a moving "infinite" wall. It's Friday, so for fun I added a media RSS feed to BioStor to make the BHL page scans available to Cooliris. The result is easier to show than describe, so take a peek at the video I made of A review of the Centrolenid frogs of Ecuador, with descriptions of new species (http://biostor.org/reference/20844):

Cooliris view of BioStor from Roderic Page on Vimeo.

Cooliris is a little flaky under Snow Leopard, but still works (the plug-in is cross platform). It is also available for the iPhone (and I'm assuming the iPad), which means you can get the experience on a mobile device.

Linking biodiversity data

Time for a Friday folly. I've made a clunky screencast showing an example of linking biodiversity data together, using bioGUID as the universal wrapper around various data sources. I started with GenBank sequence EF013683, added another, EF013555, then explored some links (specimen, publication, taxon, journal), using the OpenLink RDF Browser:

You can try the URIs I used in the linked data browser of your choice:

The demo is a bit clunky, partly because the linked data browser is generic. What we really need is a browser that is tailored to displaying the kind of data we're interested in, and hides the gory details under the hood. But the goal is to show that, once everything we care about has a resolvable URI that provides data in a consistent form, and we re-use identifiers, then we can glue stuff together with relative ease. In principle we can simply crawl this web of data (you can append other DOIs, ISSNs, and GenBank accession numbers to http://bioguid.info and get RDF to your heart's content).

None of this is particularly new. We've had RDF in biodiversity informatics for at least five years, there are various linked data-style projects, such as GeoSpecies and the first iteration of bioGUID, and some people (such as Roger Hyam) have been pushing HTTP URIs + RDF for a while, but we seem remarkably unable to get traction on this. Notably, no major biodiversity provider provides RDF (by major I mean GenBank or GBIF size). We make diagrams like the one I drew for GBIF last year, we make the case that linking is a Good Thing™, and yet nothing much happens. This suggests that the idea is still not being presented in a compelling enough fashion. Certainly, clunky demos like the one above probably won't help much. Linked Data clients are generally pretty awful things to use. I think we're going to need some compelling applications that really grab people's attention.

Wednesday, May 12, 2010

Drawing a phylogeny in a web browser using the canvas element

Some serious displacement activity. I'm toying with adding phylogenies to iSpecies, probably sourced from the PhyLoTA browser. This raises the issue of how to display trees on a web page. PhyLoTA itself uses bitmap images, such as this one:
but I'd like to avoid bitmaps. I toyed with using SVG, but that has its own series of issues (it basically has to be served as a separate file). So, I've spent a couple of hours playing with the <canvas> element. This enables some quite nice drawing to be done in a browser window, without plugins, SVG, or Flash. I wrote a quick PHP script to parse a Newick tree and draw it using <canvas>. It's really pretty simple, and the results are quite nice:
One minor gotcha is interacting with the diagram (this is one advantage of SVG). Turns out we need a hack, so I've used the trick of a blank, transparent GIF and a usemap (see Greg Houston's Canvas Pie Chart with Tooltips). The picture above is a screen shot, you can see a live example here.
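To give a flavour of the parsing and layout involved, here is a minimal sketch in Python (not the PHP script itself). It assumes plain Newick with no quoted labels or comments, treats a missing branch length as 1.0, and only computes coordinates; turning those into <canvas> drawing calls is left out. All function names are made up for illustration:

```python
def parse_newick(text):
    """Parse a simple Newick string into nested dicts (name, length, children)."""
    text = text.strip().rstrip(";") + ";"  # guarantee exactly one terminator
    pos = 0

    def parse_node():
        nonlocal pos
        node = {"name": "", "length": None, "children": []}
        if text[pos] == "(":
            pos += 1  # step over "("
            while True:
                node["children"].append(parse_node())
                if text[pos] == ",":
                    pos += 1
                elif text[pos] == ")":
                    pos += 1
                    break
                else:
                    raise ValueError("unbalanced parentheses in Newick string")
        start = pos  # optional node label
        while text[pos] not in ",():;":
            pos += 1
        node["name"] = text[start:pos].strip()
        if text[pos] == ":":  # optional branch length
            pos += 1
            start = pos
            while text[pos] not in ",();":
                pos += 1
            node["length"] = float(text[start:pos])
        return node

    return parse_node()


def layout(root):
    """Give every node an x (distance from root) and y (row); return leaf count."""
    leaf_count = 0

    def place(node, x):
        nonlocal leaf_count
        node["x"] = x
        if not node["children"]:
            node["y"] = float(leaf_count)  # leaves get consecutive rows
            leaf_count += 1
            return
        for child in node["children"]:
            length = child["length"] if child["length"] is not None else 1.0
            place(child, x + length)
        # Centre an internal node between its first and last child.
        node["y"] = (node["children"][0]["y"] + node["children"][-1]["y"]) / 2

    place(root, 0.0)
    return leaf_count


tree = parse_newick("((A:1,B:2):0.5,C:3);")
print(layout(tree), tree["y"])
```

Each node then needs one horizontal line (from its parent's x to its own x, at its own y), and each internal node one vertical line spanning its children, which maps directly onto moveTo/lineTo calls.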

Tuesday, May 11, 2010

Why we need wikis

I've just spent a frustrating few minutes trying to find a reference in BioStor. The reference in question is
Heller, Edmund 1901. Papers from the Hopkins Stanford Galapagos Expedition, 1898-1899. WIV. Reptiles. Proc. Biol. Soc. Washington 14: 39-98

and comes from the Reptile Database page for the gecko Phyllodactylus gilberti HELLER, 1903. This is the primary database for reptile taxonomy, and supplies the Catalogue of Life, which repeats this reference verbatim.
Thing is, this reference doesn't exist! Page 39 of Proc. Biol. Soc. Washington volume 14 is the start of Gerrit S Miller (1901) A new dormouse from Italy. Proc Biol Soc Washington 14: 39-40.
After much fussing with trying different volumes and dates for Proc. Biol. Soc. Washington, I searched BHL for Phyllodactylus gilberti, and discovered that this name was published in Proceedings of the Washington Academy of Sciences:
Edmund Heller (1903) Papers from the Hopkins Stanford Galapagos Expedition, 1898-1899. XIV. Reptiles. Proceedings of the Washington Academy of Sciences 5: 39-98

(see http://biostor.org/reference/20322). Three errors (wrong journal, wrong date, minor typo in title), but enough to break the link between a name and the primary source for that name.

Anybody who demands authoritative, expert-vetted resources, and thinks the Catalogue of Life is a shining example of this, needs to think again. Our databases are riddled with errors, which are repackaged over and over again, yet these would be so easy to fix if they were opened up and made easy to edit. It's time to get serious about wikis.

Monday, May 10, 2010

Referring to a one-degree square in RDF using c-squares

I'm in the midst of rebuilding iSpecies (my mash-up of Wikipedia, NCBI, GBIF, Yahoo, and Google search results) with the aim of outputting the results in RDF. The goal is to convert iSpecies from a pretty crude "on-the-fly" mash-up to a triple store where results are cached and can be queried in interesting ways. Why? Partly because I think such a triple store is an obvious way to underpin a "biodiversity hub" of the kind envisaged by PLoS (see my earlier post).

As ever, once one embarks down the RDF route (and I've been here before), one hits all the classic stumbling blocks, such as "what URI do I use for a thing?", and "what vocabulary do I use to express relationships between things?". For example, I'd like to represent the geographic distribution of a taxon as depicted on a GBIF map. How do I describe this in an RDF document?

To make this concrete, take one of my favourite animals, the New Zealand mud crab Helice crassa. Here's the GBIF map for this taxon:

This map has the URL (I kid you not):


(or http://bit.ly/cuTFW9, if you prefer). Now, there's no way I'm using this URL! Plus, the URL identifies an image, not the distribution.

But, if we look at the map we see that it is made of 1° × 1° squares. If each of those had a URI then I could simply list those squares as the distribution of the crab. This seems straightforward as GBIF has a service that provides these squares. For example, the URL http://data.gbif.org/species/17462693 (where 17462693 corresponds to Helice crassa) returns:

167.0 -45.0 168.0 -44.0 5
174.0 -42.0 175.0 -41.0 20
174.0 -38.0 175.0 -37.0 17
174.0 -37.0 175.0 -36.0 4

These are the 1° × 1° squares for which there are records of Helice crassa. Now, what I'd like to do is have a URI for each square, and I'd like to do this without reinventing the wheel. I've come across a URI space for points of the globe (the WGS 84 Geographic Point URI Space), but not one for polygons. Then it dawned on me that perhaps c-squares, developed by Tony Rees at the CSIRO in Australia, would do the trick [1]. To quote Tony:
C-squares is a system for storage, querying, display, and exchange of "spatial data" locations and extents in a simple, text-based, human- and machine- readable format. It uses numbered (coded) squares on the earth's surface measured in degrees (or fractions of degrees) of latitude and longitude as fundamental units of spatial information, which can then be quoted as single squares (similar to a "global postcode") in which one or more data points are located, or be built up into strings of codes to represent a wide variety of shapes and sizes of spatial data "footprints".

C-squares appeal partly (and this says nothing good about me) because they have a slightly Byzantine syntax. However, they are short, and quite easy to calculate. I'll let the reader find out the gory details. To give an example, my home town, Auckland, has latitude -36.84, longitude 174.74, which corresponds to the 1° × 1° c-square with the code 3317:364.
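For the impatient, here is a minimal Python sketch of the calculation at 1° resolution, as I understand the c-squares notation. The handling of the poles and the 180° meridian is a simplification of mine, and the helper for GBIF cells assumes the columns in the listing above are min longitude, min latitude, max longitude, max latitude (followed by a record count):

```python
def csquare(lat, lon):
    """Return the 1-degree c-square code for a latitude/longitude in decimal degrees."""
    # Global quadrant digit: 1 = NE, 3 = SE, 5 = SW, 7 = NW.
    if lat >= 0:
        quadrant = 1 if lon >= 0 else 7
    else:
        quadrant = 3 if lon >= 0 else 5

    # Work with absolute values; nudge the poles and the 180 meridian into
    # the last valid square (a simplification for this sketch).
    lat_abs = min(abs(lat), 89.999999)
    lon_abs = min(abs(lon), 179.999999)

    lat_tens, lat_units = divmod(int(lat_abs), 10)
    lon_tens, lon_units = divmod(int(lon_abs), 10)

    # Intermediate quadrant (1-4): which 5-degree quarter of the
    # 10-degree square the point falls in.
    intermediate = 2 * (lat_units >= 5) + (lon_units >= 5) + 1

    return f"{quadrant}{lat_tens}{lon_tens:02d}:{intermediate}{lat_units}{lon_units}"


def csquare_for_cell(min_lon, min_lat, max_lon, max_lat):
    """C-square for a 1-degree cell, using its centre to avoid edge ambiguity."""
    return csquare((min_lat + max_lat) / 2, (min_lon + max_lon) / 2)


print(csquare(-36.84, 174.74))                       # Auckland
print(csquare_for_cell(174.0, -37.0, 175.0, -36.0))  # a cell from the GBIF listing
```

Both calls land in the same square, 3317:364, which is what you'd expect given that the last cell in the GBIF listing covers Auckland.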

Now, all I need to do is convert c-squares into URIs. If you append the c-square to http://bioguid.info/csquare:, like this, http://bioguid.info/csquare:3317:364, you get a linked data-friendly URI for the c-square. In a web browser you get a simple web page like this:

A linked data client will get RDF, like this:

<?xml version="1.0" encoding="utf-8"?>
<dcterms:Location rdf:about="http://bioguid.info/csquare:3307:364">
<dwc:footprintWKT>POLYGON((-37 75,-37 74,-36 74,-36 75,-37 75))</dwc:footprintWKT>

Now, I can refer to each square by its own URI. This will also enable me to query a triple store by c-square (e.g., what other taxa occur within this 1° × 1° square?).
  1. Tony Rees had emailed me about this in response to a tweet about URIs for co-ordinates, but it took me a while to realise how useful c-square notation could be.

Next steps for BioStor: citation matching

Thinking about next steps for my BioStor project, one thing I keep coming back to is the problem of how to dramatically scale up the task of finding taxonomic literature online. While I personally find it oddly therapeutic to spend a little time copying and pasting citations into BioStor's OpenURL resolver and trying to find these references in BHL, we need something a little more powerful.

One approach is to harvest as many bibliographies as possible, and extract citations. These citations can come from online bibliographies, as well as lists of literature cited extracted from published papers. By default, these would be treated as strings. If we can parse them to extract metadata (such as title, journal, author, year), that's great, but this is often unreliable. We'd then cluster strings into sets that are similar. If any one of these strings was associated with an identifier (such as a DOI), or if one of the strings in the cluster had been successfully parsed into its component metadata so we could find it using an OpenURL resolver, then we've identified the reference the strings correspond to. Of course, we can seed the clusters with "known" citation strings. For citations for which we have DOIs/handles/PMIDs/BHL/BioStor URIs, we generate some standard citation strings and add these to the set of strings to be clustered.

We could then provide a simple tool for users to find a reference online: paste in a citation string, the tool would find the cluster of strings the user's string most closely resembles, then return the identifier (if any) for that cluster (and, of course, we could make this a web service to automate processing entire bibliographies at a time).
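A minimal sketch of the lookup step in Python, assuming the clusters already exist. The similarity measure (difflib's SequenceMatcher on normalised strings) and the 0.6 threshold are stand-ins chosen for illustration, not a recommendation, and the example clusters are just two BioStor references mentioned elsewhere on this blog:

```python
import re
from difflib import SequenceMatcher


def normalise(citation):
    """Lower-case and drop punctuation so trivial differences do not matter."""
    return " ".join(re.findall(r"[a-z0-9]+", citation.lower()))


def best_cluster(query, clusters, threshold=0.6):
    """Return the identifier of the best-matching cluster, or None if nothing is close."""
    q = normalise(query)
    best_id, best_score = None, threshold
    for cluster in clusters:
        for string in cluster["strings"]:
            score = SequenceMatcher(None, q, normalise(string)).ratio()
            if score > best_score:
                best_id, best_score = cluster["identifier"], score
    return best_id


# Illustrative clusters only.
clusters = [
    {"identifier": "http://biostor.org/reference/20322",
     "strings": ["Edmund Heller (1903) Papers from the Hopkins Stanford Galapagos "
                 "Expedition, 1898-1899. XIV. Reptiles. Proceedings of the "
                 "Washington Academy of Sciences 5: 39-98"]},
    {"identifier": "http://biostor.org/reference/20844",
     "strings": ["A review of the Centrolenid frogs of Ecuador, with "
                 "descriptions of new species"]},
]

query = ("Heller, Edmund 1901. Papers from the Hopkins Stanford Galapagos "
         "Expedition, 1898-1899. WIV. Reptiles. Proc. Biol. Soc. Washington 14: 39-98")
print(best_cluster(query, clusters))
```

Comparing the query against every string in every cluster won't scale, of course, which is where a cheaper first pass comes in.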

I've been collecting some references on citation matching (bookmarked on Connotea using the tag "matching") related to this problem. One I'd like to highlight is "Efficient clustering of high-dimensional data sets with application to reference matching" (doi:10.1145/347090.347123, PDF here). The idea is that a large set of citation strings (or, indeed, any strings) can first be quickly clustered into subsets ("canopies"), within which we search more thoroughly:
When I get the chance I need to explore some clustering methods in more detail. One that appeals is the MCL algorithm, which I came across a while ago by reading PG Tips: developments at Postgenomic (where it is used to cluster blog posts about the same article). Much to do...
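The canopy idea can be sketched in a few lines of Python. This is a simplified reading of the approach, not the paper's implementation: the cheap distance here is Jaccard distance on word tokens, and the two thresholds are arbitrary values picked to suit the toy example:

```python
import re


def tokens(citation):
    """Cheap representation of a citation: its set of lower-cased word tokens."""
    return set(re.findall(r"[a-z0-9]+", citation.lower()))


def jaccard_distance(a, b):
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def canopies(strings, loose=0.6, tight=0.5):
    """Group strings into (possibly overlapping) canopies using a cheap distance."""
    token_sets = [tokens(s) for s in strings]
    remaining = list(range(len(strings)))  # indices still eligible as centres
    result = []
    while remaining:
        centre = remaining[0]
        # Everything within the loose threshold joins this canopy.
        members = [i for i in range(len(strings))
                   if jaccard_distance(token_sets[centre], token_sets[i]) <= loose]
        result.append([strings[i] for i in members])
        # Anything within the tight threshold can no longer start a canopy.
        remaining = [i for i in remaining
                     if jaccard_distance(token_sets[centre], token_sets[i]) > tight]
    return result


citations = [
    "Heller, Edmund 1901. Papers from the Hopkins Stanford Galapagos Expedition, "
    "1898-1899. WIV. Reptiles. Proc. Biol. Soc. Washington 14: 39-98",
    "Edmund Heller (1903) Papers from the Hopkins Stanford Galapagos Expedition, "
    "1898-1899. XIV. Reptiles. Proceedings of the Washington Academy of Sciences 5: 39-98",
    "Gerrit S Miller (1901) A new dormouse from Italy. Proc Biol Soc Washington 14: 39-40",
]
for canopy in canopies(citations):
    print(len(canopy), canopy)
```

The more expensive comparison (edit distance, or whatever the final clustering uses) then only needs to run within each canopy rather than across all pairs of strings.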

Thursday, May 06, 2010

Linnaeus meets the Internet: PLoS + Botany = #fail

To much fanfare (e.g., Nature News, "Linnaeus meets the Internet" doi:10.1038/news.2010.221), on May 5th PLoS ONE published Sandy Knapp's "Four New Vining Species of Solanum (Dulcamaroid Clade) from Montane Habitats in Tropical America" doi:10.1371/journal.pone.0010502. To quote the Nature News piece:
The paper represents the culmination of a campaign to institute the electronic publication of scientific names, a case Knapp and others have made in journals including Nature[doi:10.1038/446261a]. Allowing electronic publication should make accessing information easier for scientists worldwide — especially those in developing countries who may not have access to fully stocked libraries. This, in turn, will aid conservation efforts, Knapp says.

Given the profile of this paper, "...the first time new plant names have been published in a purely electronic journal and still complied with ICBN rules", you'd think the participants would ensure the electronic aspects of the publication worked. Sadly, this is not the case.

The four names in question have apparently been deposited in IPNI with the following LSIDs:

  • Solanum aspersum: urn:lsid:ipni.org:names:77103633-1

  • Solanum luculentum: urn:lsid:ipni.org:names:77103634-1

  • Solanum sanchez-vegae: urn:lsid:ipni.org:names:77103635-1

  • Solanum sousae: urn:lsid:ipni.org:names:77103636-1

Today is May 6th. None of these names are returned by a search of IPNI, for example http://www.ipni.org/ipni/simplePlantNameSearch.do?find_wholeName= returns this:


Resolving the LSID returns this:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
<tn:TaxonName rdf:about="urn:lsid:ipni.org:names:77103633-1">
<tcom:versionedAs rdf:resource="urn:lsid:ipni.org:names:77103633-1:1.2"/>

Hmmm, so apparently this record has been "deleted"?

The paper also states that:
The IPNI LSIDs (Life Science Identifiers) can be resolved and the associated information viewed through any standard web browser by appending the LSID contained in this publication to the prefix http://ipni.org/.

This sentence mirrors similar ones in other PLoS ONE papers saying we can resolve ZooBank LSIDs by appending the LSID to http://zoobank.org (e.g., see doi:10.1371/journal.pone.0001787).

Thing is, URLs such as http://ipni.org/urn:lsid:ipni.org:names:77103633-1 return a 404 from Kew (any IPNI LSID I've tried does this).

Update: As per Alan Paton's comment below, the http://ipni.org prefix now works.
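For reference, the resolution recipe described in the paper amounts to simple string concatenation. The Python sketch below splits an LSID into its parts (urn:lsid:authority:namespace:object, with an optional revision) and builds the URL; the function names are mine, and no request is actually sent:

```python
from typing import NamedTuple, Optional


class LSID(NamedTuple):
    authority: str
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(lsid):
    """Split urn:lsid:authority:namespace:object[:revision] into its parts."""
    parts = lsid.split(":")
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError(f"not an LSID: {lsid!r}")
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)


def ipni_url(lsid):
    """Build the browser URL the paper describes: prefix + LSID."""
    parse_lsid(lsid)  # raises ValueError if the string is not an LSID
    return "http://ipni.org/" + lsid


print(parse_lsid("urn:lsid:ipni.org:names:77103633-1:1.2"))
print(ipni_url("urn:lsid:ipni.org:names:77103633-1"))
```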

So, to recap:

  1. The names aren't in IPNI

  2. The LSIDs state the record has been deleted

  3. The LSIDs can't be resolved by the means stated in the paper

Now, I don't know what happened (perhaps IPNI wanted to hold off until the paper actually appeared before releasing the names), but the paper is out, the buzz in Nature is out, and IPNI doesn't have the resolver in place, let alone the names.

Given the milestone this paper represents, and the fuss over the publication of the name Darwinius, you'd expect the bioinformatics side of it to be, you know, actually working. In these circumstances, how on Earth do we make the case that the LSID and name databasing side of taxonomic publication is useful?