Ed's Blog: March 2013

Monday, 25 March 2013

John Cummings begins work as Wikimedian in Residence at Natural History Museum and Science Museum

John Cummings radio interview

Reposted from the Wikimedia UK Blog: John Cummings begins work as Wikimedian in Residence

Wikimedia UK is very happy to report that John Cummings, a long-standing and well known Wikimedian, has begun his work as Wikimedian in Residence at the Science Museum and Natural History Museum.

This is a ground-breaking partnership between two of the UK’s most prestigious cultural institutions and the charity that promotes and supports Wikipedia and Wikimedia projects in the UK. His role with the museums will last for four months.

John said: “It’s a real privilege to work with institutions with such important places in the history and public understanding of science. I hope I will be able to help the museums in their goals.”

John is the co-founder and project leader for MonmouthpediA and Gibraltarpedia, the world’s first Wikipedia town and city, and he is a Wikimedia UK accredited trainer for communities and institutions.

He is also technical lead for Leaderwiki, a collaborative education resource for emerging leaders from all over the world who want to make a positive contribution in their communities.

John will be working with me and the rest of the Biodiversity Informatics team at the NHM, as well as other staff from across the museum. You can see what's happening here.

Saturday, 23 March 2013

Re-inventing the wheel: do we need a common infrastructure for museum digital?

Over the last few years (since around the eBiosphere conference) I have several times put together slides detailing the 'Informatics Landscape' of biological collections (there's an example here) and the ecosystem of projects that it, in some way, supports. Over the years projects have come and gone, and the informatics community has coalesced around a number of projects and initiatives: Biodiversity Heritage Library for legacy literature, GBIF for specimen and observational records, Encyclopedia of Life as an aggregator for the public, and Scratchpads as a platform for virtual research and data sharing.

In a recent Guardian piece (Digital pro bono: time for cultural giants to offer their services) and an earlier blog post (Wouldn’t it be cool if … ) Oonagh Murphy suggests that big cultural institutions could give some of their time to help smaller cultural institutions with their web presence. This idea would no doubt have a positive impact on the sector as a whole, but should we be looking more towards the biodiversity informatics community? Would it not be better to spend this time developing a shared, open infrastructure of online tools that smaller museums, and perhaps even larger ones, could use?

If this were the case then we could create an environment for shared development. The cost of developing some piece of functionality could be spread amongst the museums that need it, at a reduced cost to each, and then freely shared with the rest of the community. Other institutions might realise they can tweak it for a different purpose, or develop it further to meet their own needs. It would be possible to create a new ecosystem of collaboration.

This could potentially be a similar model to the Scratchpads - take an existing project (in that case Drupal) which deals with much of the basics - and build on top of it a more specific set of tools that are of use to the cultural community. Some of these enhancements, if they are generic enough, can be released back to the Drupal community for other people to use in their many and diverse projects.

The advantage of this model is that things only need to be done once: develop mobile support and everybody using the platform has mobile support. Individual projects (sites) can brand their content as they wish and still make use of pooled resources and development.

Tuesday, 19 March 2013

Visualising an archive: Walter Rothschild's correspondence

Rothschild with his famed zebra (Equus burchelli) carriage, which he drove to Buckingham Palace to demonstrate the tame character of Zebras to the public

As part of exploring possibilities for the Wikipedian in Residence project (more on this very soon) we were given some example data from the NHM archive catalogue relating to the correspondence of Walter Rothschild to see what potential there might be for digitisation and semantic linking of content. Having data with locality and time information means only one thing: time to dig out CartoDB and Torque!

Some background to the correspondence from the Tring Museum:

Tring Museum was a natural history museum owned by Walter, later 2nd Baron Rothschild, which was donated to the Natural History Museum in 1936. It had been open to the public since 1892. A large number of papers, particularly letters from Walter, were destroyed, so this correspondence is largely all that remains of the history of that museum.

This series is mostly letters to Walter and / or his curators Ernst Hartert and Karl Jordan, and is a fascinating collection with a wealth of information. Not only scientifically and historically from the ornithologists and entomologists who wrote to Tring, but historically from the various institutions around the world, and also the economic history of the business of natural history, from the dealers, publishers and booksellers. There is also important social history to be studied about Tring Museum's relationship with local people and businesses, who visited, and were employed by the Museum. The largest part relates to collectors, writing from all over the world about the expeditions they’re on and the specimens they are collecting for the museum – writing sometimes from war zones, during revolutions and uprisings, and from jungles and deserts.

Daisy Cunynghame - NHM Library and Archives

Example data:

Date	Title	Description
18 Feb - 25 Jun 1903	The Acetylene Supply Co	"6 letters from The Acetylene Supply Co., 48 Cranbourne Street and 1 Bear Street, Leicester Square, London, England, United Kingdom. 3 of the letters addressed to Karl Jordan, 3 addressed to Ernst Hartert [Was previously reference number TM/1/69/1]"
1 Apr - 17 Dec 1903	André & Sleigh Limited	"2 letters to Ernst Hartert, 9 letters to Karl Jordan from André & Sleigh Limited, Photo-Engravers, Bushey, Hertfordshire, England, United Kingdom [Was previously reference number TM/1/69/5]"

The problem with this data, from a visualisation point of view, is that the addresses are not geolocated. Manually geolocating a large number of addresses would be a substantial task, perhaps best undertaken via a crowd-sourcing approach. Making a quick demonstrator to see what is potentially possible precludes the use of such an approach in this case.

Instead the online GeoCoder tool from gpsvisualizer.com was used to process the first 1,000 records of this dataset. This failed for a large number of the locations provided, but again, as this is only a demonstrator I just ignored the rows that failed.
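The "ignore the rows that failed" step can be sketched roughly as follows. This is only an illustration: it assumes the geocoder output was saved as tab-separated rows with hypothetical `latitude` and `longitude` columns that are left blank when a lookup fails, which may not match the real GPS Visualizer output format. The sample rows are made up.

```python
import csv
import io


def keep_geocoded(rows):
    """Return only the rows that carry usable latitude and longitude values."""
    kept = []
    for row in rows:
        try:
            lat = float(row["latitude"])
            lon = float(row["longitude"])
        except (KeyError, TypeError, ValueError):
            continue  # geocoding failed for this row, so skip it
        if -90 <= lat <= 90 and -180 <= lon <= 180:
            kept.append({**row, "latitude": lat, "longitude": lon})
    return kept


# Hypothetical sample data: column names and coordinates are illustrative only
SAMPLE = (
    "title\tlatitude\tlongitude\n"
    "Example correspondent A\t51.5\t-0.13\n"
    "Example correspondent B\t\t\n"
)

rows = list(csv.DictReader(io.StringIO(SAMPLE), delimiter="\t"))
located = keep_geocoded(rows)
print(f"{len(located)} of {len(rows)} rows geocoded")
```

Printing the ratio of kept to total rows gives a quick sense of how much of the archive the map actually represents.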

The following map shows the results after the geocoding.

The geocoding of a few points (many of those shown as being in North America) is clearly wrong; however, the vast majority have been correctly placed, as far as is possible.

Of course, geolocating only gives us a way of visualising the archive in its spatial dimensions. We also have temporal data available, so this seemed like an obvious use for Torque on top of CartoDB. The video below (best viewed at 720p and fullscreen) shows both the spatial and temporal extent of communication.

Obviously to be a truly useful and accurate tool the data would need more rigorous processing, which would take considerably longer than creating this demonstration (which took less than a couple of hours). It does however show that visualisation tools can be useful in developing a deeper understanding of archive catalogue data.
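One small piece of that more rigorous processing would be turning the catalogue's date ranges into machine-readable start and end dates for the time animation. The sketch below handles only the pattern seen in the example data ("18 Feb - 25 Jun 1903") and assumes both dates fall in the same year; the function name is hypothetical and other catalogue date formats would need their own handling.

```python
import re
from datetime import date

MONTHS = {
    name: number
    for number, name in enumerate(
        ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
         "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"],
        start=1,
    )
}

# Matches e.g. "18 Feb - 25 Jun 1903": day, month, day, month, shared year
RANGE = re.compile(r"^(\d{1,2}) ([A-Z][a-z]{2}) - (\d{1,2}) ([A-Z][a-z]{2}) (\d{4})$")


def parse_date_range(text):
    """Return (start, end) dates for 'D Mon - D Mon YYYY', or None if unparseable."""
    match = RANGE.match(text.strip())
    if match is None:
        return None
    day1, month1, day2, month2, year = match.groups()
    if month1 not in MONTHS or month2 not in MONTHS:
        return None
    try:
        # Assumption: the first date takes the year given after the second date
        start = date(int(year), MONTHS[month1], int(day1))
        end = date(int(year), MONTHS[month2], int(day2))
    except ValueError:
        return None  # e.g. "31 Feb"
    return start, end


print(parse_date_range("18 Feb - 25 Jun 1903"))
```

Returning `None` rather than raising makes it easy to set unparseable rows aside, in the same spirit as skipping addresses that could not be geocoded.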

On a (slightly) related note...
Daisy (who provided the summary of Tring Museum and Walter Rothschild above) has also written a piece about a namesake of mine that used to work for the museum as a collector: Item of the Month (July 2012) Edward Baker - One of Tring Museum's Daring Explorers.

Measuring the Impact of Wikipedia for organisations (Part 3)

Previous posts in this series:

As mentioned in a previous post in this series I have downloaded all of the Wikipedia pages that make a direct link to the Natural History Museum website. While this is useful in attempting to measure the impact of the NHM and Wikipedia on each other this post is a little bit more for fun at this stage (although the data was collected for an upcoming project).

An obvious thing to do with these downloaded pages is to scan them for links, then build a graph of the interconnections between them. The script I set to this task is taking a while, so I decided to see what I could summarise about a topic (Wikipedia page) based on the articles that page links to. In all of these examples the numbers are the number of links from the 'subject' page to the other page.
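A minimal sketch of this kind of per-page link count is shown here. It is not the script used for this post: it assumes the saved pages are rendered HTML in which internal links look like `/wiki/Page_title`, and the class name, function name and sample markup are all hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser


class LinkCounter(HTMLParser):
    """Count how often each link target appears in one page."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Assumption: internal article links start with /wiki/; keep only the title
        if href.startswith("/wiki/"):
            href = href[len("/wiki/"):]
        self.counts[href] += 1


def count_links(html):
    """Return a Counter mapping link target to number of occurrences."""
    parser = LinkCounter()
    parser.feed(html)
    return parser.counts


# Made-up sample markup, for illustration only
SAMPLE = (
    '<p><a href="/wiki/Sauropod">sauropod</a> '
    '<a href="/wiki/Sauropod">sauropods</a> '
    '<a href="/wiki/Jurassic">Jurassic</a> '
    '<a href="http://www.nhm.ac.uk/">NHM</a></p>'
)

for target, n in count_links(SAMPLE).most_common():
    print(f"{n} | {target}")
```

The loop prints one line per target in the same `count | target` layout as the lists in this post, most frequent first.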

First up is the iconic Dippy (Diplodocus):

4 | Othniel_Charles_Marsh
3 | Carnegie_Museum_of_Natural_History
3 | Sauropod
3 | Walking_with_Dinosaurs
2 | Jurassic
2 | Diplodocidae
2 | Type_species
2 | John_Bell_Hatcher
2 | William_Jacob_Holland
2 | Diplodocid
2 | Fossil

These as a set seem to be a reasonable, high-level, summary of the Diplodocus. There is a mixture of information that is technical (type species, Diplodocid), cultural (Walking with Dinosaurs) and about the discovery, description and display of the fossil (Marsh, Hatcher, etc).

Let's go for another species, the Holly Blue:

3 | Lycaenidae
2 | Eurasia
2 | North_America
2 | India
2 | http://www.itis.gov/servlet/SingleRpt/SingleRpt?search_topic=TSN&search_value=188523
2 | Holly_Blue
2 | Main_Page
2 | Wikipedia:About
1 | Biological_classification
1 | Animal
1 | Arthropod

This time the information is more about the biogeography and higher taxonomy, but nevertheless can be seen as a reasonable, if subjectively limited, summary of the species.

Time for something different: first up a member of NHM staff, Chris Stringer

2 | Archaeology
2 | Biological_anthropology
2 | Social_anthropology
2 | Cultural_anthropology
2 | Feminist_anthropology
2 | Fellow_of_the_Royal_Society
2 | http://www.ahobproject.org/
2 | http://books.google.com.au/books?id=wTnWJGnBwgUC&printsec=frontcover&dq=Giacobini+Hominidae&hl=en&ei=jRvcS6rVJZLg7AO9_sC_Bg&sa=X&oi=book_result&ct=result&resnum=1&ved=0CDMQ6AEwAA#v=onepage&q&f=false
2 | http://books.google.com.au/books?id=Ke7_cl6tQ1EC&printsec=frontcover&dq=%22Chris+Stringer%22&hl=en&ei=JhDcS4WCF43u7APBsoiuBg&sa=X&oi=book_result&ct=result&resnum=5&ved=0CEUQ6AEwBA#v=onepage&q&f=false
2 | http://www.nhm.ac.uk/business-centre/publishing/det_humevol.html
2 | http://www.nhm.ac.uk/about-us/news/2008/march/stringer-wins-kistler-book-award.html

In short, a Fellow of the Royal Society who is an anthropologist and has written a number of books. In a purely professional sense: pretty much spot on.

So what does this kind of summary allow us to do? In a limited sense it allows us to make brief summaries of people, species and institutions that have a Wikipedia presence. But the real use comes when a large number of these analyses can be aggregated, queried and visualised. More of this another time, however here is a quick visualisation made from hacking the demos that come with arbor.js.

Full Screen Version

Measuring the Impact of Wikipedia for organisations (Part 3) by Edward Baker is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License.
Based on a work at http://pblog.ebaker.me.uk/2013/03/measuring-impact-of-wikipedia-for_19.html.

Sunday, 17 March 2013

Some links from Science Hackday London #shdl

A list of project pitches

EpiCollect (GitHub)
EpiCollect.net provides a web application for the generation of forms and freely hosted project websites (using Google's AppEngine) for many kinds of mobile data collection projects. Data can be collected using multiple mobile phones running either the Android Operating system or the iPhone (using the EpiCollect mobile app) and all data can be synchronised from the phones and viewed centrally (using Google Maps) via the Project website or directly on the phones.

Crowdcrafting
Online assistance in performing tasks that require human cognition, knowledge or intelligence such as image classification, transcription, geocoding and more!

Help advance research
Everything is open and freely usable
Things computers can't do

Yellowhammer Dialects (Czech Site)
What happens with birdsong during invasion of a new territory? To answer this question a citizen science project looks for volunteers to record yellowhammers in New Zealand and Great Britain to evaluate distribution of their dialects.

Konekta (GitHub)
Geolocate community services and make them available through a mobile site.

WAX Science
The WAX project’s goal is to launch an online collaborative platform with two main objectives:

To give a space to raise young people’s curiosity in the sciences area. With several participatory and fun approaches, we want to support young people in letting their curiosity and natural motivations win back over. There will be contests, small experiments, videos, in the spirit of a science for everyone, revalued, but that points out the stereotypes. Because to fight something, one must be aware of it.
To give the possibility to the existing associations/initiatives/collectivities to get in touch with each other and to know where to turn by drawing a map of what already exists, both in the field of popular science, but also on the theme of gender balance. By linking those initiatives on our website, we hope to raise the visibility of every one of them and to catalyze the interactions!

Friday, 15 March 2013

Senior Developer at the Extreme Citizen Science group

An interesting position!

Apply at https://atsv7.wcn.co.uk/search_engine/jobs.cgi?owner=5041178&ownertype=fair&jcode=1320261

Job title: Senior Developer at the Extreme Citizen Science group - Ref: 1320261

UCL Department / Division: Civil, Environmental & Geomatic Engineering

Grade 8

Hours Full Time

Salary (inclusive of London allowance) £40,216 - £47,441 per annum

_Duties and Responsibilities_

We are looking for an experienced and talented Senior Programmer with knowledge of systems architecture and management to fill a 2-year vacancy to help our various research projects achieve the aims they set out to accomplish with bespoke and innovative technologies.

The main duties and responsibilities of the ExCiteS Senior Developer will include, but not be limited to the redevelopment of the Community Maps platform (www.communitymaps.org.uk ) using open source and current technologies, administration of IT systems and server management, and providing assistance to the group in making decisions about technologies that will be used on various projects. The appointee will also be required to manage Linux servers, and advise on and be involved in development projects that aim to include people in the scientific process from the Inuit in Canada to the Pygmies of the Congo. The job includes guidance with the development team which includes MSc and PhD students and postdoctoral fellows.

The post is available for immediate start and is for 2 years in the 1st instance

_Key Requirements_

The candidate will have extensive experience working as a developer, ideally within standards-based projects and using Open Source technologies with project management. They will have to have extensive knowledge of up-to-date, open source, spatial and non-spatially enabled technologies, such as Linux, PostgreSQL/PostGIS, and OpenLayers/Leaflet and quickly pick up and adapt to new development environments, particularly as we wish to move into further HTML5 and mobile development, and basing some of our technologies on open APIs. The ideal candidate should be able to use object-oriented methodologies and tools to analyse, design and implement software tools, as well as experience in designing and implementing API architectures to further extend the current software systems. It is imperative that they are able to communicate technically complex information in an understandable way. They will also need to have a solid foundation in structures and standards, properly utilising code management systems (such as GitHub), designing robust code in an easily extensible way, and ensuring that the viability of solutions extends far beyond the lifetime of the research projects themselves.

Further Details: A job description and person specification can be accessed at http://bit.ly/YbdB7n

To apply for the vacancy please follow https://atsv7.wcn.co.uk/search_engine/jobs.cgi?owner=5041178&ownertype=fair&jcode=1320261

If you have any queries regarding the vacancy or the application process, please contact Prof. Muki Haklay, m.haklay@ucl.ac.uk , +44 (0)20 7679 2745.

We particularly welcome applications from black and minority ethnic candidates as they are under-represented within UCL at this level.

Closing Date: 14 Apr 2013

This appointment is subject to UCL Terms and Conditions of Service for Research and Support Staff.

Wednesday, 6 March 2013

Who owns biodiversity informatics? The Patents

I find it surprising how close some of these come to the core business of many biodiversity informaticians, and I suspect that there might be prior art in some cases. If you know of any I've missed put them in the comments and I'll add them.

Managing Taxonomic Information (US 7,650,327 B2)
Remsen, D.; Norton, C.
In a management of taxonomic information, a name that specifies an organism is identified. Based on the name and a database of organism names or classifications a link between pieces of biological identification information in the database, or a classification for the organism, is determined. Based on the other name or the classification, information associated with the organism is identified.

Information System for Biological and Life Sciences Research (Pending: US 2005/0038776 A1)
Cyrus, R.; Di Tommaso, M.; Kerlavage, A.R.; Lawrence, C.B.
An online life science research environment and virtual community with a focus on design and analysis of biological experiments includes a life sciences laboratory system employing at least one networked computer system that defines a virtual research environment. Users access the system through a portal associated with the networked computer system(s). The virtual research environment has a data coupling mechanism by which the user designates a set of user-specified data for bioinformatics processing. A processor(s) associated with the networked computer system(s) performs bioinformatics services upon the user-specified data. In one embodiment, the data coupling mechanism enables transfer of user-specified data to a memory space that is mediated or accessed by the processor performing the bioinformatics processing. Users may thus exploit bioinformatics processing resources that are not deployed on users' local computer environments, and to store and organize information relating to life sciences research in a secure, online workspace.

Systems and Methods for Resolving Ambiguity Between Names and Entities (US 7,925,444 B2)
Garrity, G.; Lyons, C.
The present invention provides systems and methods that utilize an information architecture for disambiguating scientific names and other classification labels and the entities to which those names are applied, as well as a means of accessing data on those entities in a networked environment using persistent, unique identifiers.

Systems and methods for automatically identifying and linking names in digital resources (Pending: US 2010/0198841 A1)
Parker, C; Lyons, C.; Roston, G.; Garrity, G.
The present invention provides systems and methods for automatically identifying name-like-strings in digital resources, matching these name-like-strings against a set of names held in an expertly curated database, and for those name-like-strings found in said database, enhancing the content by associating additional matter with the name, wherein said matter includes information about the names that is held within said database and pointers to other digital resources which include the same name and its synonyms.

Saturday, 2 March 2013

Measuring the Impact of Wikipedia for organisations (Part 2)

This post continues from Measuring the Impact of Wikipedia for organisations (Part 1) which looked at a number of statistics relating to page views and links using linkypedia (well - a slightly customised version of linkypedia).

Part of my reason for doing this might have become clear from a subsequent post on this blog: Wikimedian in Residence at NHM.

This post uses a feature I added to linkypedia to save a copy of pages that link to the NHM website into a database. This allows for some quick queries to identify both the type of pages, and the content they contain.
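To give a flavour of what such quick queries might look like, here is a sketch using SQLite. The schema is an assumption for illustration (a single hypothetical `pages` table with `title` and `content` columns) and is not the actual linkypedia database layout; the sample rows are made up.

```python
import sqlite3

# Hypothetical schema: one row per saved Wikipedia page
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pages (title TEXT PRIMARY KEY, content TEXT)")
conn.executemany(
    "INSERT INTO pages VALUES (?, ?)",
    [
        ("Example moth", "Placeholder text mentioning Lepidoptera"),
        ("User:Example", "Placeholder text linking to http://www.nhm.ac.uk/"),
        ("User talk:Example", "Placeholder text mentioning a type specimen"),
    ],
)


def count_titles_with_prefix(conn, prefix):
    """Count pages by type, using the namespace prefix of the title."""
    row = conn.execute(
        "SELECT COUNT(*) FROM pages WHERE title LIKE ? || '%'", (prefix,)
    ).fetchone()
    return row[0]


def count_content_containing(conn, text):
    """Count pages whose saved content contains the given text (case-sensitive)."""
    row = conn.execute(
        "SELECT COUNT(*) FROM pages WHERE instr(content, ?) > 0", (text,)
    ).fetchone()
    return row[0]


print("User pages", count_titles_with_prefix(conn, "User:"))
print("Lepidoptera", count_content_containing(conn, "Lepidoptera"))
```

Title prefixes separate the page types, while plain text matches on the content give rough topic counts. Note that a prefix containing `%` or `_` would be treated as a wildcard by `LIKE`.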

13580 pages have links to the domain www.nhm.ac.uk

This includes (type of page, number of pages):

User pages 44
User talk pages 39
WikiProjects 2
WikiProjects pages 6
WikiProjects talk pages 20
Wikipedia Signpost 3
Village Pump 1
Reference Desk 9
Graphics Lab 1
Copyright Problems 3
Suspected Copyright Violations 2
Possibly unfree files 2
Media copyright questions 1
Articles for creation 2
Featured article candidates 4

Examples of other queries that can be run:

Biota InfoBox 12768 (can be assumed to be good indicator of pages about a taxon)
Type specimen 52
Lepidoptera 12773
Stub 12412
Lepidoptera stub 12190

This looks like the NHM has quite a sizeable Wikipedia footprint, however a huge majority of these are stub lepidoptera pages with very little content besides a link back to a project on the NHM website.

Sample stub lepidoptera page (Accessed 02 March 2013)

Considering the number of type specimens the museum holds (20,000 mosses alone) the figure of 52 is one that is definitely open to some improvement.

Measuring the Impact of Wikipedia for organisations (Part 2) by Ed Baker is licensed under a Creative Commons Attribution 3.0 Unported License.