Friday, September 30, 2016

Guest post: It's 2016 and your data aren't UTF-8 encoded?

The following is a guest post by Bob Mesibov.

According to w3techs, seven out of every eight websites in the Alexa top 10 million are UTF-8 encoded. This is good news for us screenscrapers, because it means that when we scrape data into a UTF-8 encoded document, the chances are good that all the characters will be correctly encoded and displayed.

It's not entirely good news, for two reasons.

In the first place, one out of eight websites is encoded with some feeble default like ISO-8859-1, which supports even fewer characters than the closely related windows-1252. Those sites will lose some widely-used punctuation when read as UTF-8, unless the webpage has been carefully composed with the HTML equivalents of those characters. You're usually safe (but see below) with big online sources like Atlas of Living Australia (ALA), APNI, CoL, EoL, GBIF, IPNI, IRMNG, NCBI Taxonomy, The Plant List and WoRMS, because these declare a UTF-8 charset in a meta tag in webpage heads. (IPNI's home page is actually in ISO-8859-1, but its search results are served as UTF-8 encoded XML.)
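A minimal Python sketch of that punctuation loss. The sample string is made up for illustration; `cp1252` is Python's codec name for windows-1252.

```python
# Curly quotes and an en dash, as they might appear in a scraped title
# (hypothetical sample text).
text = "\u201cAtlas\u201d \u2013 data"

# In windows-1252 each of these punctuation marks is a single byte
# (0x93, 0x94 and 0x96).
cp1252_bytes = text.encode("cp1252")

# Read as UTF-8, those lone bytes are invalid. With errors="replace"
# each one becomes U+FFFD, the Unicode replacement character.
as_utf8 = cp1252_bytes.decode("utf-8", errors="replace")

# ISO-8859-1 cannot represent curly quotes or en dashes at all.
try:
    text.encode("iso-8859-1")
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False
```

Here `as_utf8` keeps the letters but has lost all three punctuation marks, and `latin1_ok` ends up `False`.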

But a second problem is that just because a webpage declares itself to be UTF-8, that doesn't mean every character on the page sings from the Unicode songbook. Very odd characters may have been pulled from a database and written onto the page as-is. In ALA I recently found an ancient rune — the High Octet Preset control character (HOP, hex 81):

http://biocache.ala.org.au/occurrences/6191ca90-873b-44f8-848d-befc29ad7513

http://biocache.ala.org.au/occurrences/5077df1f-b70a-465b-b22b-c8587a9fb626

HOP replaces ü on these pages and is invisible in your browser, but a screenscrape will capture the HOP and put SchHOPrhoff in your UTF-8 document.
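Invisible characters like HOP are easy to find once the text has been scraped. The following is a rough sketch rather than a complete checker: it flags any character in the C1 control range (U+0080 to U+009F), which includes HOP at U+0081. The function name `find_c1_controls` is my own invention.

```python
import re

# C1 control characters occupy U+0080 to U+009F; HOP is U+0081.
C1_CONTROLS = re.compile(r"[\u0080-\u009f]")

def find_c1_controls(text):
    """Return (offset, hex code point) for each C1 control character in text."""
    return [(m.start(), f"{ord(m.group()):02x}")
            for m in C1_CONTROLS.finditer(text)]

# The scraped name, with HOP sitting where the u-umlaut should be.
scraped = "Sch\u0081rhoff"
hits = find_c1_controls(scraped)
```

For this string `hits` holds a single entry, offset 3 with code point 81.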

Another example of ALA's fidelity to its sources is its coding of the degree symbol, which is a single-byte character (hex b0) in windows-1252, e.g. in Excel spreadsheets, but a two-byte character (hex c2 b0) in UTF-8. In this record, for example:

http://biocache.ala.org.au/occurrences/5e3a2e05-1e80-4e1c-9394-ed6b37441b20

the lat/lon was supplied (says ALA) as 37Â°56'9.10"S 145Â° 0'43.74"E. Or was it? The lat/lon could have started out as 37°56'9.10"S 145°0'43.74"E in UTF-8. Somewhere along the line the UTF-8 bytes were read as windows-1252 and the Â characters were generated, resulting in geospatial gibberish.
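The byte-level mechanics can be reproduced in a few lines of Python. This is only a sketch of one plausible route to the garbled value (UTF-8 bytes decoded as windows-1252); I don't know what actually happened to this record.

```python
# The lat/lon as it may have started out (U+00B0 is the degree symbol).
original = "37\u00b056'9.10\"S 145\u00b00'43.74\"E"

# In UTF-8 each degree symbol is the two bytes c2 b0.
utf8_bytes = original.encode("utf-8")

# Decoding those bytes as windows-1252 turns each c2 b0 pair into two
# characters: c2 becomes A-circumflex and b0 becomes the degree symbol.
garbled = utf8_bytes.decode("cp1252")

# When nothing else has been lost, the damage can be reversed by
# retracing the same steps backwards.
repaired = garbled.encode("cp1252").decode("utf-8")
```

The reverse step only works while every garbled character still maps back to a windows-1252 byte; once a character has been dropped or replaced along the way, the round trip fails.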

When a program fails to understand a character's encoding, it usually replaces the mystery character with a ?. A question mark is a perfectly valid character in commonly used encodings, which means the interpretation failure gets propagated through all future re-uses of the text, both on the Web and in data dumps. For example,

http://biocache.ala.org.au/occurrences/dfbbc42d-a422-47a2-9c1d-3d8e137687e4

gives N?crophores for Nécrophores. The history of that particular character failure has been lost downstream, as is the case for myriads of other question marks in online biodiversity data.
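One way such a question mark can arise, sketched in Python: encoding text into a character set that lacks the character, with replacement turned on. Plain ASCII stands in here for whatever limited encoding was actually involved.

```python
name = "N\u00e9crophores"  # U+00E9 is e-acute

# ASCII has no e-acute; errors="replace" substitutes "?" for it.
lossy = name.encode("ascii", errors="replace").decode("ascii")

# The "?" is now an ordinary character, so later conversions keep it
# and nothing downstream can tell it was once an e-acute.
round_trip = lossy.encode("utf-8").decode("utf-8")
```

After this, both `lossy` and `round_trip` read N?crophores, and no later re-encoding will bring the é back.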

In my experience, the situation is much worse in data dumps from online sources. It's a challenge to find a dump without question marks acting as replacement characters. Many of these question marks appear in author and place names. Taxonomists with eastern European names seem to fare particularly badly, sometimes with more than one character variant appearing in the same record, as in the Australian Faunal Directory (AFD) offering of Wêgrzynowicz, W?grzynowicz and Węgrzynowicz for the same coleopterist. Question marks also frequently replace punctuation, such as en dashes, smart quotes and apostrophes (e.g. O?Brien (CoL) and L?Échange and d?Urville (AFD)).

Character encoding issues create major headaches for data users. It would be a great service to biodiversity informatics if data managers compiled their data in UTF-8 encoding or took the time to convert to UTF-8 and fix any resulting errors before publishing to the Web or uploading to aggregators.
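As a rough illustration of the conversion step, here is a small Python sketch. The function name `to_utf8` and the choice of windows-1252 as the fallback are assumptions for the example; a data manager would need to substitute whatever encoding the source database really uses.

```python
def to_utf8(raw: bytes, fallback_encoding: str = "cp1252") -> bytes:
    """Return raw unchanged if it is valid UTF-8, else transcode it.

    Both decodes are strict, so bytes that fit neither UTF-8 nor the
    fallback encoding raise UnicodeDecodeError instead of being
    silently replaced with question marks.
    """
    try:
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        return raw.decode(fallback_encoding).encode("utf-8")

# A name exported from a windows-1252 source (curly apostrophe = byte 0x92).
legacy = "O\u2019Brien".encode("cp1252")
converted = to_utf8(legacy)
```

This is a heuristic, not a guarantee: some windows-1252 byte sequences happen to be valid UTF-8 too, so the output still needs checking by eye for errors of the kind described above.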

This may be a big ask, given that at least one data manager I've talked to had no idea how characters were encoded in the institution's database. But as ALA's Miles Nicholls wrote back in 2009, "Note that data should always be shared using UTF-8 as the character encoding". Biodiversity informatics is a global discipline and UTF-8 is the global standard for encoding.

Readers needing some background on character encoding will find this and especially this helpful, and a very useful tool to check for encoding problems in small blocks of text is here.