Monday, September 18, 2017

Guest post: Our taxonomy is not your taxonomy

Bob mesibov The following is a guest post by Bob Mesibov.

Do you know the party game "Telephone", also known as "Chinese Whispers"? The first player whispers a message in the ear of the next player, who passes the message in the same way to a third player, and so on. When the last player has heard the whispered message, the starting and finishing versions of the message are spoken out loud. The two versions are rarely the same. Information is usually lost, added or modified as the message is passed from player to player, and the changes are often pretty funny.

I recently compared ca 100 000 beetle records as they appear in the Museums Victoria (NMV) database and in DarwinCore downloads from the Atlas of Living Australia (ALA) and the Global Biodiversity Information Facility (GBIF). NMV has its records aggregated by ALA, and ALA passes its records to GBIF. The "Telephone" effect in the NMV to ALA to GBIF comparison was large and not particularly funny.

Many of the data changes occur in beetle names. ALA checks the NMV-supplied names against a look-up table called the National Species List, which in this case derives from the Australian Faunal Directory (AFD). If no match is found, ALA generalises the record to the next higher supplied taxon, which it also checks against the AFD. ALA also replaces supplied names if they are synonyms of an accepted name in the AFD.

GBIF does the same in turn with the names it gets from ALA. I'm not 100% sure what GBIF uses as beetle look-up table or tables, but in many other cases their GBIF Backbone Taxonomy mirrors the Catalogue of Life.

To give you some idea of the magnitude of the changes, of ca 85000 NMV records supplied with a genus+species combination, about one in five finished up in GBIF with a different combination. The "taxonRank" changes are summarised in the overview below, and note that replacement ALA and GBIF taxon names at the same rank are often different:

Generalised

Of the species that escaped generalisation to a higher taxon, there are 42 names with genus triples: three different genus names for the same taxon in NMV, ALA and GBIF.

Just one example: a paratype of the staphylinid Schaufussia mona Wilson, 1926 is held in NMV. The record is listed under Rytus howittii (King, 1866) in the ALA Darwin Core download, because AFD lists Schaufussia mona as a junior subjective synonym of Tyrus howitti King, 1866, and Tyrus howittii in AFD is in turn listed as a synonym of Rytus howittii (King, 1866). The record appears in GBIF under Tyraphus howitti (King, 1865), with Rytus howittii (King, 1866) listed as a synonym. In AFD, Rytus howittii is in the tribe Tyrini, while Tyraphus howitti is a different species in the tribe Pselaphini.

ALA gives "typeStatus" as "paratype" for this record, but the specimen is not a paratype of Rytus howittii. In the GBIF download, the "typeStatus" field is blank for all records. I understand this may change in future. If it does, I hope the specimen doesn't become a paratype of Tyraphus howitti through copying from ALA.

There are lots of "Telephone" changes in non-taxonomic fields as well, including some geographical howlers. ALA says that a Kakadu National Park record is from Zambia and another Northern Territory record is from Mozambique, because ALA trusts the incorrect longitude provided by NMV more than it does the NMV-supplied locality text. GBIF blanks this locality text field, leaving the GBIF user with two African records for Australian specimens and no internal contradictions.

ALA trusts latitude/longitude to the extent of changing the "stateProvince" field for localities near Australian State borders, if a low-precision latitude/longitude places the occurrence a short distance away in an adjoining State.

Manglings are particularly numerous in the "recordedBy" field, where name strings are reformatted, not always successfully. Complex NMV strings suffer worst, e.g. "C Oke; Charles John Gabriel" in NMV becomes "Oke, C.|null" in ALA, and "Ms Deb Malseed - Winda-Mara Aboriginal Corporation WMAC; Ms Simone Sailor - Winda-Mara Aboriginal Corporation WMAC" is reformatted as in ALA "null|null|null|null"

Most of the "Telephone" effect in the NMV-ALA-GBIF comparison appears in the NMV-ALA stage. I contacted ALA by email and posted some of the issues on the ALA GitHub site; I haven't had a response and the issues are still open. I also contacted Tim Robertson at GBIF, who tells me that GBIF is working on the ALA-GBIF stage.

Can you get data as originally supplied by NMV to ALA, through ALA? Well, that's easy enough record-by-record on the ALA website, but not so easy (or not possible) for a multi-record download. Same with GBIF, but in this case the "original" data are the ALA versions.

Monday, August 28, 2017

Let’s rise up to unite taxonomy and technology

Holly Bik (@hollybik) has an opinion piece in PLoS Biology entitled "Let’s rise up to unite taxonomy and technology" https://doi.org/10.1371/journal.pbio.2002231 (thanks to @sjurdur for bringing this to my attention).

Journal pbio 2002231 g001

It's a passionate plea for integrating taxonomic knowledge and "omics" data. In her article Bik includes a mockup of the kind of tool she'd like to see (based in part on Phinch), and writes:

Step 2: Clicking on a specific data point (e.g., an OTU) will pull up any online information associated with that species ID or taxonomic group, such as Wikipedia entries, photos, DNA sequences, peer-reviewed articles, and geolocated species observations displayed on a map.

This sort of plea has been made any times, and reminds me very much of PLoS's own efforts when they wanted to build a "Biodiversity Hub" and biodiversity informatics basically failed them. The hub itself later closed down.. There's clearly a need for a simply way to summarise what we know about a species, but we've yet to really tackle this (on the face of it) fairly simple task.

Quickly summarising the available information about a species was the motivation behind my little tool iSpecies, which I recently reworked to use DBpedia, GBIF, CrossRef, EOL, TreeBASE and OpenTreeofLife as sources. For the nematode featured in Bik's figure (Desmoscolex) there's not a great deal of easily available information (see http://ispecies.org/?q=Desmoscolex). We can get a little more form other sources not queried by iSpecies, such as BioNames, which aggregates the primary taxonomic literature, see http://bionames.org/search/Desmoscolex.

Part of the problem is that taxonomy is fundamentally a "long tail" field, both in terms of the subject matter (a few very well know species, then millions of poorly known species) and our knowledge of those species (a large, scattered taxonomic literature, much of it not yet digitised, although progress is being made). Furthermore, the names of species (and our conception of them) can change, adding an additional challenge.

But I think we can do a lot better. Simple web-based tools like iSpecies can assemble reasonable information from multiple sources (and in multiple languages) on the fly. It would be nice to expand those sources (the more primary sources the better). The current iSpecies tool searches on species name. This works well if the sources being queried mention that name (e.g., in the title of a paper that has a DOI and is indexed by CrossRef). Given that many of the "omics" datasets Bik works with are likely to have dark taxa, what we'll also need is the ability to search, say, using NCBI taxon ids, and retrieve literature linked to sequences for those taxa

It would also be useful to package those up in a simple API that other tools could consume. For example, if I wanted to improve the utility of iSpecies, one approach would be to package up the results in a JSON object. Perhaps even use JSON-LD (with global identifiers for taxa, documents, etc.) to make it possible for consumers to easily integrate that data with their own.

Taxonomy could be on the brink of another golden age—if we play our cards right. As it is reinvented and reborn in the 21st century, taxonomy needs to retain its traditional organismal-focused approaches while simultaneously building bridges with phylogenetics, ecology, genomics, and the computational sciences.

Taxonomy is, of course, doing just this, albeit not nearly fast enough. There are some pretty serious obstacles, some of them cultural, but some of them due to the nature of the problem. Taxonomic knowledge is massively decentralised, mostly non-digital, and many of the key sources and aggregations are behind paywalls. There is also a fairly large "technical debt" to deal with. Ian Mulvany was recently interviewed by PLoS and he emphasised that because academic publishers had been online from early on they were pioneers, but at the same time this left them with a legacy of older technologies and approaches that can sometimes get in the way of new idea. I think taxonomy suffers from some of the same problems. Because taxonomy has long been involved with computers, sometime we needed up betting on the "wrong" solutions. For example, at one time XML was the new hotness, and people invested a lot of effort in developing XML schema, and then ontologies and RDF vocabularies. Meantime much of the web has moved to simple data formats such as JSON, many specialist vocabularies are gathering dust as schema.org takes off, and projects like Wikidata force us to rethink the need to topic-specific databases.

But these are technical details. For me the key point of "Let’s rise up to unite taxonomy and technology" is that it's a symptom of the continued failure of biodiversity informatics to actually address the needs of its users. People keep asking for fairly simple things, and we keep ignoring them (or explaining why it's MUCH harder than people think, which is another way of ignoring them).

Sunday, August 20, 2017

Notes on displaying big trees using Google Maps/Leaflet

Notes to self on web map-style tree viewers. The basic idea is to use Google Maps or Leaflet to display a tree. Hence we need to compute tiles. One approach is to use a database that supports spatial queries to store the x,y coordinates of the tree. When we draw a tile we compute the coordinates of that tile, based on position and zoom level, do a spatial query to extract all lines that intersect with the rectangle for that tile, and draw those.

A nice example of this is Lifemap (see also De Vienne, D. M. (2016). Lifemap: Exploring the Entire Tree of Life. PLOS Biology, 14(12), e2001624. doi:10.1371/journal.pbio.2001624).

It occurs to me that for trees that aren't too big we could do this without an external database. For example, what if we used a Javascript implementation of an R-tree, such as imbcmdth/RTree or its fork leaflet-extras/RTree. So, we could compute the coordinates of the nodes in the tree in "geographic" space, store the bounding box for each line/arc in an R-tree, then query that R-tree for lines that intersect with the bounding box of the relevant tile. We could use a clipping algorithm to only draw the bits of the lines that cross the tile itself.

Web maps, at least in my experience, make trips to a tile server to fetch a tile, we would want instead to call a routine within our web page, because all the data would be loaded into that page. So we'd need to modify the tile creating code.

The ultimate goal would be to have a single page web app that accepts a Newick-style tree and converts into a browsable, zoomable visualisation.

Tuesday, July 04, 2017

Apple's Knowledge Navigator concept video (1987): we are still a long way from this vision

I've been viewing Apple's Knowledge Navigator concept video from 1987 and it's striking how much of this we have today, and yet how far away we are from the complete vision. For some background on this promotional video see The Making of Knowledge Navigator. The computer scientist Alan Kay provided some advice to the makers (who put the video together for a presentation by then Apple CEO John Sculley). Kay is a true visionary, he's currently working on children, computers, and education, motivated by the realisation that, like the printing press before it, computing will change the way people think, and how children learn using computers could have a profound impact on our future.

The Knowledge Navigator video looked futuristic when it came out, but now we have ubiquitous touch interfaces, video chat, and can talk to computers (albeit not with the level of sophistication shown in the video). But there are a couple of things in the concept video that are in many ways even more impressive.

Early on, our professor is trying to track down a paper, and he can't quite remember the name of the author. His visual assistant (a more sophisticated version of Siri) finds it, which of itself isn't too exciting (Google supports searching for things when you don't quite know what it is you're looking for). What is more impressive is that the professor can access and play with the data in the paper, and compare the predictions made in that paper with more recent data.

This requires that we have access to the data and models from a published paper, and a way to easily add new data and redo the analyses. This is related to "reproducible science" doi:10.1038/s41562-016-0021 and the notion of "executable papers" doi:10.1016/j.procs.2011.04.074, but goes beyond that because we don't just reproduce the results in the original paper, we can add to them. And it's all seamless and effortless. Anyone who has tried to get adta from a paper and do something with it will recognise that we are a long way from this.

The second interesting example is when our professor is chatting online with a colleague about deforestation in South America, and she sends him her graphical model of the spread of the Sahara. They then view these side-by-side. Note that this is not two separate videos, the simulations merge together and their timeline syncs so that they play together simultaneously. The parameters of the simulations can also be changed on the fly.

This ability to collaborate in real time in the same space with both data and analysis is something that we don't really have, at least I'm not aware of it. Yes, we can work together on editing a Google Document, but throwing together two data sets or visualisations and have them align themselves automatically is pretty cool.

While some aspects of the Knowledge Navigator video look quaint, it's striking that the actual core of the video - a researcher redoing an analysis published by another researcher, or collaborating with a colleague with different but related data is still something we haven't been able to achieve yet (for some related work on collaboratively viewing evolutionary trees see "Interactive Tree Comparison for Co-located Collaborative Information Visualization" doi:10.1109/TVCG.2007.70568). In this respect the Knowledge Navigator is still a vision of the future.

Friday, June 30, 2017

Response to To Increase Trust, Change the Social Design Behind Aggregated Biodiversity Data

Nico Franz and Beckett W. Sterner recently published a preprint entitled "To Increase Trust, Change the Social Design Behind Aggregated Biodiversity Data" on bioRxiv http://dx.doi.org/10.1101/157214 Below is the abstract:

Growing concerns about the quality of aggregated biodiversity data are lowering trust in large-scale data networks. Aggregators frequently respond to quality concerns by recommending that biologists work with original data providers to correct errors "at the source". We show that this strategy falls systematically short of a full diagnosis of the underlying causes of distrust. In particular, trust in an aggregator is not just a feature of the data signal quality provided by the aggregator, but also a consequence of the social design of the aggregation process and the resulting power balance between data contributors and aggregators. The latter have created an accountability gap by downplaying the authorship and significance of the taxonomic hierarchies ≠ frequently called "backbones" ≠ they generate, and which are in effect novel classification theories that operate at the core of data-structuring process. The Darwin Core standard for sharing occurrence records plays an underappreciated role in maintaining the accountability gap, because this standard lacks the syntactic structure needed to preserve the taxonomic coherence of data packages submitted for aggregation, leading to inferences that no individual source would support. Since high-quality data packages can mirror competing and conflicting classifications, i.e., unsettled systematic research, this plurality must be accommodated in the design of biodiversity data integration. Looking forward, a key directive is to develop new technical pathways and social incentives for experts to contribute directly to the validation of taxonomically coherent data packages as part of a greater, trustworthy aggregation process.

Below I respond to some specific points that annoyed me about this article, at the end I try and sketch out a more constructive response. Let me stress that although I am the current Chair of the GBIF Science Committee, the views expressed here are entirely my own.

Trust and social relations

Trust is a complex and context-sensitive concept...First, trust is a dependence relation between a person or organization and another person or organization. The first agent depends on the second one to do something important for it. An individual molecular phylogeneticist, for example, may rely on GenBank (Clark et al. 2016) to maintain an up-to-date collection of DNA sequences, because developing such a resource on her own would be cost prohibitive and redundant. Second, a relation of dependence is elevated to being one of trust when the first agent cannot control or validate the second agent's actions. This might be because the first agent lacks the knowledge or skills to perform the relevant task, or because it would be too costly to check.

Trust is indeed complex. I found this part of the article to be fascinating, but incomplete. The social network GBIF operates in is much larger than simply taxonomic experts and GBIF, there are relationships with data providers, other initiatives, a broad user community, government agencies that approve it's continued funding, and so on. Some of the decisions GBIF makes need to be seen in this broader context.

For example, the article challenges GBIF for responding to errors in the data by saying that these should be "corrected at source". This a political statement, given that data providers are anxious not to ceed complete control of their data to aggregators. Hence the model that GBIF users see errors, those errors get passed back to source (the mechanisms for tis is mostly non-existent), the source fixes it, then the aggregator re-harvests. This model makes assumptions about whether sources are either willing or able to fix these errors that I think are not really true. But the point is this is less about not taking responsibility, but instead avoiding treading on toes by taking too much responsibility. Personally I think should take responsibility for fixing a lot of these errors, because it is GBIF whose reputation suffers (as demonstrated by Franz and Sterner's article).

Scalability

A third step is to refrain from defending backbones as the only pragmatic option for aggregators (Franz 2016). The default argument points to the vast scale of global aggregation while suggesting that only backbones can operate at that scale now. The argument appears valid on the surface, i.e., the scale is immense and resources are limited. Yet using scale as an obstacle it is only effective if experts were immediately (and unreasonably) demanding a fully functional, all data-encompassing alternative. If on the other hand experts are looking for token actions towards changing the social model, then an aggregator's pursuit of smaller-scale solutions is more important than succeeding with the 'moonshot'.

Scalability is everything. GBIF is heading towards a billion occurrence records and several million taxa (particularly as more and more taxa from DNA-barcoding taxa are added). I'm not saying that tractability trounces trust, but it is a major consideration. Anybody advocating a change has got to think about how these changes will work at scale.

I'm conscious that this argument could easily be used to swat away any suggestion ("nice idea, but won't scale") and hence be a reason to avoid change. I myself often wish GBIF would do things differently, and run into this problem. One way around it is to make use of the fact that GBIF has some really good APIs, so if you want GBIF to do something different you can build a proof of concept to show what could be done. If that is sufficiently compelling, then the case for trying to scale it up is going to be much easier to make.

Multiple classifications

As a social model, the notion of backbones (Bisby 2000) was misguided from the beginning. They disenfranchise systematists who are by necessity consensus-breakers, and distort the coherence of biodiversity data packages that reflect regionally endorsed taxonomic views. Henceforth, backbone-based designs should be regarded as an impediment to trustworthy aggregation, to be replaced as quickly and comprehensively as possible. We realize that just saying this will not make backbones disappear. However, accepting this conclusion counts as a step towards regaining accountability.

This strikes me as hyperbole. "They disenfranchise systematists who are by necessity consensus-breakers". Really? Having backbones in no way prevents people doing systematic research, challenging existing classifications, or developing new ones (which, if they are any good, will become the new consensus).

We suggest that aggregators must either author these classification theories in the same ways that experts author systematic monographs, or stop generating and imposing them onto incoming data sources. The former strategy is likely more viable in the short term, but the latter is the best long-term model for accrediting individual expert contributions. Instead of creating hierarchies they would rather not 'own' anyway, aggregators would merely provide services and incentives for ingesting, citing, and aligning expert-sourced taxonomies (Franz et al. 2016a).

Backbones are authored in the sense that they are the product of people and code. GBIF's is pretty transparent (code and some data on github, complete with a list of problems). Playing Devil's advocate, maybe the problem here is the notion of authorship. If you read a paper with 100's of authors, why does that give you any greater sense of accountabily? Is each author going to accept responsibility for (or being to talk cogently about) every aspect of that paper? If aggregators such as GBIF and Genbank didn't provide a single, simple way to taxonomically browse the data I'd expect it would be the first thing users would complain about. There are multiple communities GBIF must support, including users who care not at all about the details of classification and phylogeny.

Having said that, obviously these backbone classifications are often problematic and typically lag behind current phylogenetic research. And I accept that they can impose a certain view on how you can query data. GenBank for a long time did not recognise the Ecdysozoa (nematodes plus arthropods) despite the evidence for that group being almost entirely molecular. Some of my research has been inspired by the problem of customising a backbone classification to better more modern views (doi:10.1186/1471-2105-6-208).

If handling multiple classifications is an obstacle to people using or contributing data to GBIF, then that is clearly something that deserves attention. I'm a little sceptical, in that I think this is similar to the issue of being able to look at multiple versions of a document or GenBank sequence. Everyone says it's important to have, I suspect very few people ever use that functionality. But a way forward might be to construct a meaningful example (in other words an live demo, not a diagram with a few plant varieties).

Ways forward

We view this diagnosis as a call to action for both the systematics and the aggregator communities to reengage with each other. For instance, the leadership constellation and informatics research agenda of entities such as GBIF or Biodiversity Information Standards (TDWG 2017) should strongly coincide with the mission to promote early-stage systematist careers. That this is not the case now is unfortunate for aggregators, who are thereby losing credibility. It is also a failure of the systematics community to advocate effectively for its role in the biodiversity informatics domain. Shifting the power balance back to experts is therefore a shared interest.

Having vented, let me step back a little and try and extract what I think the key issue is here. Issues such as error correction, backbones, multiple classifications are important, but I guess the real issue here is the relationship between experts such as taxonomists and systematists, and large-scale aggregators (note that GBIF serves a community that is bigger than just these researchers). Franz and Sterner write:

...aggregators also systematically compromise established conventions of sharing and recognizing taxonomic work. Taxonomic experts play a critical role in licensing the formation of high-quality biodiversity data packages. Systems of accountability that undermine or downplay this role are bound to lower both expert participation and trust in the aggregation process.

I think this is perhaps the key point. Currently aggregation tends to aggregate data and not provenance. Pretty much every taxonomic name has at one point or other been published by somebody. For various reasons (including the crappy way most nomenclature databases cite the scientific literature) by the time these names are assembled into a classification by GBIF the names have virtually no connection to the primary literature, which also means that who contributed the research that led to that name being minted (and the research itself) is lost. Arguably GBIF is missing an opportunity to make taxonomic and phylogenetic research more visible and discoverable (I'd argue this is a better approach than Quixotic efforts to get all biologists to always cite the primary taxonomic literature).

Franz and Sterner's article is a well-argued and sophisticated assessment of a relationship that isn't working the way it could. But to talk in terms of "power balance" strikes me as miscasting the debate. Would it not be better to try and think about aligning goals (assuming that is possible). What do experts want to achieve? What do they need to achieve those goals? Is it things such as access to specimens, data, literature, sequences? Visibility for their research? Demonstrable impact? Credit? What are the impediments? What, if anything, can GBIF and other aggregators do to help? In what way can facilitating the work of experts help GBIF?

In my own "early-stage systematist career" I had a conversation with Mark Hafner about the Louisiana State University Museum providing tissue samples for molecular sequencing, essentially a "project in a box". Although Mark was complaining about the lack credit for this (a familiar theme) the thing which struck me was how wonderful it would be to have such a service - here's everything you need to do your work, go do some science. What if GBIF could do the same? Are you interested in this taxonomic group, well here's the complete sum of what we know so far. Specimens, literature, DNA sequences, taxonomic names, the works. Wouldn't that be useful?

Franz and Sterner call for "both the systematics and the aggregator communities to reengage with each other". I would echo this. I think that the sometimes dysfunctional relationship between experts and aggregators is partly due to the failure to build a community of researchers around GBIF and its activities. The focus of GBIF's relationship with the scientific community has been to have a committee of advisers, which is a rather traditional and limited approach ("you're a scientist, tell us what scientists want"). It might be better served if it provided a forum for researchers to interact with GBIF, data providers, and each other.

I stated this blog (iPhylo) years ago to vent my frustrations about TreeBASE. At the time I was fond of a quote from a philosopher of science that I was reading, to the effect that we only criticise those things that we care about. I take Franz and Sterner's article to indicate that they care about GBIF quite a bit ;). I'm looking forward to more critical discussion about how we can reconcile the needs of experts and aggregators as we seek to make global biodiversity data both open and useful.

Friday, June 16, 2017

GBIF Challenge 2017: Liberating species records from open data repositories for scientific discovery and reuse

Ebbe v5 300

GBIF is running its Ebbe Nielsen Challenge for the third successive year. This year the title is Liberating species records from open data repositories for scientific discovery and reuse. To quote from the Challenge background on Devpost:

This year's Challenge will seek to leverage the growth of open data policies among scientific journals and research funders, which require researchers to make the data underlying their findings publicly available. Adoption of these policies represents an important first step toward increasing openness, transparency and reproducibility across all scientific domains, including biodiversity-related research.

To abide by these requirements, researchers often deposit datasets in public open-access repositories. Potential users are then able to find and access the data through repositories as well as data aggregators like OpenAIRE and DataONE. Many of these datasets are already structured in tables that contain the basic elements of biodiversity information needed to build species occurrence records: scientific names, dates, and geographic locations, among others.

However, the practices adopted by most repositories, funders and journals do not yet encourage the use of standardized formats. This approach significantly limits the interoperability and reuse of these datasets. As a result, the wider reuse of data implied if not stated by many open data policies falls short, even in cases where open licensing designations (like those provided through Creative Commons) seem to encourage it.

In essence, the 2017 Challenge is to develop tools to discover these biodiversity-relevant datasets, and make them available to GBIF. In other words, we want tools to enable us to do this:

Goal

As an example of the impact that external data can have on GBIF, last year I wrote a blog post (The Zika virus, GBIF, and the missing mosquitoes) describing how I took published data (doi:10.1038/sdata.2015.35) from the Dryad repository and added it to GBIF. The effect was dramatic:

Before

1651430

After

1651430 updated

This is just one example. I suspect that there is a lot of biodiversity data gathering digital dust sitting in repositories that could be more widely reused if we just had the tools to discover it, and convert it into a form that GBIF can use. Prove me right, and win cash prizes! Details at https://gbif2017.devpost.com.

Wednesday, May 31, 2017

Programming with Glitch: microservices and serverless computing

LgbNpkq 400x400Yes, this post is indeed an attempt to fit as many buzzwords that I don't really understand into the title. I've been playing around with Glitch, which is a delightful project from Fog Creek (makers of Trello and co-creators of Stack Overflow).

On first glance Glitch looks weirdly retro, and it took a little while for me to get the hang of things. Bit it's fun and very powerful. Basically it's a place where you can start creating web apps in your browser, and each app is automatically hosted online. If you see an app that you like you can see the source code (just like you can see HTML using "view source" in your browser). if you want to hack on the code you can simply create a copy and it's yours to play with (this is called "remixing", like forking on GitHub). Your copy gets a cute name (possibly annoyingly cute) and away you go.

If you're a developer, then at this point you're probably wondering what is actually happening under the hood. Each Glitch app is a node.js app, which means you're programming in Javascript (you can just use HTML and client side Javascript if you want to avoid node.js). I'm very new to node.js, so Glitch has been a fun way to experiment.

There are two things which make Glitch very powerful. The first is the "remix" feature. Don't know where to start? Find an app that looks like it might do something you want to do, remix it, and hack away. The code is edited online, and the editor works very well. It also checks your code for Javascript errors as you type, which is helpful (usually).

The second great feature is that you get built in hosting for free. As soon as you remix an app you have a functioning web site. Remixing is very like forking in GitHub, and if you're running node.js on your local machine then the benefits of Glitch might not seem obvious. But hosting is often a pain, either you need to set up your own servers, or use a hosting service. Glitch takes care of this for you, so your app is instantly available for others to use.

So, what can you do with Glitch? There's some great examples on the Glitch site, but I want to show an almost trivial example. I've created an app called "enchanting-bongo" https://enchanting-bongo.glitch.me (yes, the name is a bit irritating) that does one simple thing. You give it a DOI for an article and enchanting-bongo tells you whether any of the authors of that work have an ORCID. For example, try the DOI 10.3897/zookeys.555.6173. Why did I write this? I'm interested in ways to link people to the work that they've done, especially work that ends up being aggregated in large-scale biodiversity databases like GBIF (see Possible project: #itaxonomist, combining taxonomic names, DOIs, and ORCID to measure taxonomic impact).

Screenshot 2017 05 31 17 09 32

The app does one thing. It takes the DOI and calls the ORCID API to see if anyone has claimed authorship of the paper with that DOI. You can use the app with a web browser, or you can use an HTTP client and call the API (e.g., https://enchanting-bongo.glitch.me/search?q=10.3897%2Fzookeys.555.6173).

Glitch is an example of servers computing, where you don't have to worry about physical servers or the software infrastructure that runs on them (e.g., the web server itself), you just write code. Like any buzzword, there is some pushback, see for example What Is “Serverless”? An Alternative Take, but for a fascinating essay I recommend Why the fuss about serverless?. But the notion that I can simply hack away on some code and have an instantly available web app is very attractive.

The other buzzword is "microservices". I'm forever needing to do tasks such as find a DOI for a paper, match a "microcitation" to the enclosing article, locate a specimen in GBIF based on catalogue number in a paper, parse some text into structured data, such as a reference, geographic coordinates, etc. These are tools that I need in lots of contexts, and I've written software to do this on my machine, often as part of larger projects. "Microservices" is the idea that instead of large, monolithic apps we write a series of minimal tools that typically do one thing, and do it well. We then chain the together to do various tasks. Having small tools means that we can treat each problem independently, and if the tools communicate over the web (HTTP) then it doesn't matter what programming language we use. I've started thinking more and more about adopting this model and developing a bunch of small services to perform many of the tasks I need. Hosting these services then becomes in issue, I have web servers in my office but they are a pain to maintain (my university is forever insisting that I upgrade their software), so cloud-based hosting seems the obvious way forward. Free-hosting looks ideal, so Glitch is looking very attractive.

So, I'm hoping to experiment more with this approach. One thing I might do is create a series of services very like enchanting-bongo, have a simple web interface and an API that the web interface calls. That way users can play with it in their web browser, then call the service via the API if it does something useful. As a more sophisticated example of a service, I'm working on tools to parse Wikispecies reference strings, and link specimen codes to records in GBIF.

One reason I'm enthusiastic about Glitch is that it is fun!. Some of the best shifts in technology that I've made have been because a tool made something easy and fun to do. For example, CouchDB made working with structured data fun, and that was a revelation (databases, fun, surely not). Fun is a much neglected characteristic of the tools we use.

Querying Wikidata

For my own use more than anything else I've started creating a list of Wikidata SPARQL queries here. I personally don't find Wikidata's data model particularly easy to grasp, so one way to learn is to take the example queries on the Wikidata Query site and mess about with them.

For those interested in taxonomic data Wikidata is quite rich in content. For example, you can find the author of a taxonomic names, or find taxon names an author is responsible for creating.

It is also fairly straightforward to search for content by identifier, e.g.

SELECT *
WHERE
{
  ?work wdt:P356 "10.2476/ASJAA.62.33" .
}
will find the article with the DOI 10.2476/ASJAA.62.33. One minor gotcha is that Wikidata has all DOIs in UPPERCASE, so you either need to sera for uppercase version of the DOI, or use a filter to convert the case, which is slow.

As I come across interesting or useful queries I'll add them to the list in GitHub.

Wikidata, WikiCite, and the "bibliography of life"

3hhZSGOn 400x400Last week I was at WikiCite 2017, a fascinating three day event in Vienna. Wikicite is "a proposal to build a bibliographic database in Wikidata to serve all Wikimedia projects", and is attracting increasing attention from academics, librarians, publishers, data geeks, and others. You can get a sense of the project by following @WikiCite on Twitter.

I went to the meeting in part to learn more about WikiCite, and also to spend some time hacking on Wikispecies. I'd been to only one Wiki event before (a Wiki Science Conference) so I'm still finding my way around this community. I spent the first two days listening to talks while coding away (more on this below), but on Wednesday put my own coding aside to join a bunch of people hacking the CrossRef event API in a great session led by Joe Wass. I've put some notes and code in GitHub. The event API tracks what people do with DOIs, including adding them to Wikipedia pages when citing a source in support of an assertion. A significant fraction of DOI resolutions are from Wikipedia pages, which is one reason why CrossRef was present at WikiCite.

Wikidata

In practice WikiCite's goal of building a bibliographic database to serve all Wikimedia projects means that articles, books, and other bibliographic items that are cited by Wikimedia projects will each be added to Wikidata. For example, the ZooKeys paper "Diversity of manota williston (Diptera, mycetophilidae) in ulu temburong national park, brunei" is item Q21188431 in Wikidata. Wikidata stores the key bibliographic metadata, including identifiers such as the DOI (which many at the WikiCite meeting pronounced "doy" much to my initial confusion). Screenshot 2017 05 31 12 46 43

This article was published in ZooKeys, which itself has a Wikidata item (Q219980), so in Wikidata the article is linked to the journal (i.e., "ZooKeys" isn't just a dumb string but a link to another Wikidata item). The article is also linked to two articles that it cites, and each of these is also a Wikidata item.

These citation links are one reason people are interested in WikiCite - it could be the basis of a free and open citation graph (for the benefits of such a graph see this piece by David Shotton doi:10.1038/502295a, a participant at the meeting in Vienna). Already some cool tools are being built on top of citation data in Wikidata, such as Scholia by Finn Årup Nielsen, Daniel Mietchen and Egon Willighagen. Here, for example, is my academic profile based on information in Wikidata. It's woefully incomplete, but intriguing. For a more complete example view Egon Willighagen's profile.

To some extent the utility of tools like Scholia will depend on how complete Wikidata's coverage is of the academic literature, which in turn raises the inevitable question of scope. Does Wikicite want to include just the literature cited in the various Wikimedia projects, or does it want to expand to include the total sum of academic literature?

Wikispecies, Wikidata and the bibliography of life

Wikispecies is one of the Wikimedia projects, and the only one that is topic-specific (the others are typically global in scope but have content in different languages, or host different data types such as images, scanned books, or structured data). As I've sketched out in an earlier post (Thoughts on Wikipedia, Wikidata, and the Biodiversity Heritage Library) I think Wikicite and Wikidata are potentially very important to projects such as BHL and the "bibliography of life". Much of our knowledge about the world's biodiversity is contained in the academic literature, and much of this is poorly known with no central database where we can find it, and much of it is still not digitised. It is tempting to think that Wikidata might be a platform around which the biodiversity community could focus its efforts on assembling a global database of biodiversity literature. Already major taxonomic journals such as ZooKeys are being fed into Wikidata, so it has a significant corpus of biodiversity literature already.

One way to grow this corpus is to focus on Wikispecies. In a post before the Wikicite meeting (Notes for WikiCite 2017: Wikispecies reference parsing) I elaborated on this idea. There are two stumbing blocks, one specific to Wikispecies, one a more general Wikidata issue.

The first issue is that Wikispecies bibliographic data is relatively unstructured, which makes converting it into structured data something of a challenge. I spent much of Wikicite hacking some code to do this on Glitch (more on Glitch later), you can see the results here: https://acoustic-bandana.glitch.me. This web site takes a Wikispecies reference and tries to convert it into CSL-JSON. Still very much a work in progress, but I've started building tools that use this web site as a service and process larger numbers of Wikispecies citations.

The second issue is how you get data into Wikidata, and this is something that's never been entirely clear to me. There are tools for adding an article using its DOI (sourcemd) but this isn't scalable, and doesn't handle the case of articles that don't have DOIs. This is still a "How do you Snapchat? You just Snapchat" moment. Wikidata desparately needs tools and a clear procedure whereby people like me with lots of bibliographic data can contribute.

Wikispecies

Another reason for my interest in Wikispecies (and other sources of bibliographic data such as the listed of cited literature being made available by CrossRef, see The Initiative for Open Citations) is that this data can be fed into BHL to locate more articles in that archive. Once these articles have been located they are stored in BioStor and BHL itself, but it makes sense to have them more accessible, and Wikidata looks to be an obvious candidate. Given that Wikispecies is essentially a crowd-source taxonomic database there is considerable overlap in content between Wikispecies and BHL. The Wikidata data model also allows for some of things that taxonomists care about, such as linking dates of publication to evidence relative to those dates (in older publications determining the publication date often requires quite extensive research).

Summary

Leaving aside the specific issues about how to get bibliographic data into Wikidata, I guess the question to ask is whether it makes sense to be developing large databases of bibliographic data without either using Wikidata as the platform to hold that data, or at least linking to Wikidata. Projects such as Gene Wiki are migrating from Wikipedia to Wikidata (see "Wikidata as a semantic framework for the Gene Wiki initiative" doi:10.1093/database/baw015), perhaps those of us interested in biodiversity literature could use projects like Gene Wiki as role models for how we could both contribute and benefit from Wikidata and Wikicite.

I've barely scratched the surface of what was discussed at Wikicite, for more details see the program. It is a very different sort of meeting in that the participants come from pretty diverse backgrounds, which helps shake up your own assumptions about what matters and how things should be done. It's also great that it's a meeting at which people write code or otherwise hack stuff together, so things actually get done. I've come away with lots to think about, and renewed enthusiasm about the role Wikimedia is playing in structuring our knowledge about the world.