Capturing process: In silico, in laboratorio and all the messy in-betweens – Cameron Neylon @ the Unilever Centre

I am not very good at live-blogging, but Cameron Neylon is at the Unilever Centre and giving a talk about capturing the scientific process. This is important stuff and so I shall give it a go.

He starts off by making the point that to capture the scientific process, we need to capture information about the objects we are investigating as well as the process by which we get there.

Journals are not enough – the journal article is static, but knowledge is dynamic. Can solutions come from software development? Yes, to a certain extent….

e.g. source control/versioning systems – capture snapshots of development over time, date stamping etc.
Unit testing – continuous tests as part of the science/knowledge testing.
Solid replication…distributed version control.

Branching and merging: data integration. However, commits are free text…unstructured knowledge…no relationships between objects – what Cameron really wants to say is NO ONTOLOGIES, NO LINKED DATA.

Need linked data, need ontologies: towards a linked web of data.

Data is all well and good…but what about the stuff that goes on in the lab? Objects and data are spread over multiple silos – recording is much harder: we need to worry about the lab notebook.

“A lab notebook is pretty much an episodic journal” – which is not too dissimilar to a blog. The similarities are striking: descriptions of stuff happening, date stamping, categorisation, tagging, accessibility…and not of much interest to most people…;-). But the problem with blogs is still information retrieval – the same as with the lab notebook…

Now showing a blog of one of his students recording lab work…software built by Jeremy Frey’s group…the blog IS the primary record: the blog is a production system…2GB of data. At first glance the lab-log looks similar to a conventional blog: dates, tags etc….BUT the fundamental difference is that the data is marked up and linked to other relevant resources…now showing a video demo of capturing provenance, dates, linking of resources, versioning, etc.: data is linked to experiment/procedure, procedure is linked to sample, sample is linked to material…etc….
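
To make the linking concrete, a record like the one just described might, in spirit, look something like the RDF/XML sketch below. This is purely my own illustration, not the actual markup of the labblog system – the namespace and the property names (producedBy, usedSample, madeFrom) are invented for the example:

    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
             xmlns:lab="http://example.org/labblog/terms#">
      <!-- a data file is linked to the procedure that produced it... -->
      <rdf:Description rdf:about="http://example.org/data/nmr-0042">
        <lab:producedBy rdf:resource="http://example.org/procedure/synthesis-17"/>
      </rdf:Description>
      <!-- ...the procedure is linked to the sample it was run on... -->
      <rdf:Description rdf:about="http://example.org/procedure/synthesis-17">
        <lab:usedSample rdf:resource="http://example.org/sample/s-0301"/>
      </rdf:Description>
      <!-- ...and the sample is linked to the material it was made from -->
      <rdf:Description rdf:about="http://example.org/sample/s-0301">
        <lab:madeFrom rdf:resource="http://example.org/material/styrene"/>
      </rdf:Description>
    </rdf:RDF>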

He proposes that his blog system is a system for capturing both objects and processes…a web of objects…now showing a visualisation of the resources in the notebook and demonstrating that visualising the connectedness of the resources can indicate problems in the science, or in the recording of the science, etc….and he says it is only the linking/networking effect that allows you to do this. BUT…there are no semantics in the system yet (tags yes…no PROPER semantics).

The initial labblog used hand-coded markup: scientists needed to know how to hand-code markup…and hated it….this led to a desire for templates….templates create posts, associate a controlled vocabulary and specify the metadata that needs to be recorded for a given procedure….in effect they are metadata frameworks….templates can be preconfigured for procedures and experiments….and metadata frameworks map onto ontologies quite well….
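
Purely as an illustration of the idea (my own sketch, not the actual template format of the labblog system), such a template could be as simple as an XML fragment declaring which fields a post for a given procedure must record and which controlled vocabulary each field draws on:

    <template name="rotary-evaporation">
      <!-- each field declares a piece of metadata that a post created
           from this template must capture; vocab names a controlled list -->
      <field name="sample"      type="link"   target="sample"/>
      <field name="solvent"     type="term"   vocab="solvents"/>
      <field name="bathTemp"    type="number" unit="degC"/>
      <field name="endPressure" type="number" unit="mbar"/>
    </template>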

Bio-ontologies…sometimes conflate process and object….he says there is no particularly good ontology of experiments….I think the OBI and EXPO people might disagree….

So how about the future?

• The important thing is: capture at source, IN CONTEXT.
• Capture as much as possible automatically. Try to take the human out of the equation as much as possible.
• In the lab, capture each object as it is created; capture the plan and track the execution step by step.
• Data repositories as easy as Flickr – repositories specific to a data type, with artefacts linked together across repositories, e.g. the Periodic Table of Videos on YouTube, or the embedding of chemical structures into pages from ChemSpider.
• More natural interfaces for interacting with these records, better visualisation etc.
• Trust and provenance and cutting through the noise: which objects/people/literature will I trust and pay attention to? Managing the reputation of the people creating the objects: the SEMANTIC SOCIAL WEB (he now shows FriendFeed as an example: subscription as a measure of trust in people, but people discussing objects). “Data finds the data, then people find the people”…a social network with objects at the centre…
• Connecting with people only works if the objects are OPEN.
• Connected research changes the playing field – again, resources are key.
• OUCH, controversy: communicate first, standardize second…but at least he acknowledges that it will be messy….
• UPDATE: Cameron’s slides of the talk are here:


Polymer Markup Language Paper

Now, I started this blog with the intention of writing about polymers, informatics etc. Somewhere along the way, some advocacy, some ranting and a general critique of the scholarly publication process also crept in and, of course, there were long breaks. However, we have recently published Polymer Markup Language, which has been in the making for a while, and I am pleased to announce the paper, published in the Journal of Chemical Information and Modeling:

Chemical Markup, XML and the World-Wide Web. 8. Polymer Markup Language

Nico Adams, Jerry Winter, Peter Murray-Rust and Henry S. Rzepa

Polymers are among the most important classes of materials but are only inadequately supported by modern informatics. The paper discusses the reasons why polymer informatics is considerably more challenging than small-molecule informatics and develops a vision for the computer-aided design of polymers, based on modern semantic web technologies. The paper then discusses the development of Polymer Markup Language (PML). PML is an extensible language, designed to support the (structural) representation of polymers and polymer-related information. PML closely interoperates with Chemical Markup Language (CML) and overcomes a number of the previously identified challenges.

Many thanks are due to everybody who worked on this and everybody in the Unilever Centre who was available for discussions, comments and critique.

The paper can be found here.


Polymer Informatics and The Semantic Web – The Solution, Part 2: Adding Structure: Chemical Markup Language

In my last post concerning our work on polymer informatics, I started to discuss how one can add structure to documents in the form of metadata in order to aid accurate information retrieval. In particular, I introduced the notion of markup languages to structure information and used the example of a bread recipe to discuss some general features of XML. So, having been through all of that, how can we hold chemical information in a marked-up way?

Being chemists, one of the assumptions fundamentally ingrained in all of our thinking is that the structure of a molecule is related to the physical properties of that molecule. Therefore, the most important information a chemist might wish to hold in a marked-up way is probably structural information about a molecule. Fortunately, over the past decade or so, Peter Murray-Rust, Henry Rzepa and others have worked on an XML dialect called CML – Chemical Markup Language. Let’s have a look at a small molecule, styrene in our case, and see what some basic CML looks like.

Here’s a representation of styrene (InChI=1/C8H8/c1-2-8-6-4-3-5-7-8/h2-7H,1H2) that every chemist will be familiar with:

[Image: Styrene – the familiar 2D structure diagram]

and here’s how the same molecule would be represented in CML:

[Image: StyreneInCML – styrene represented in Chemical Markup Language]

As was the case for our bread recipe, you can see that we have three containers here, namely “atomArray” and “bondArray”, enclosed by the container “molecule”. The two arrays are essentially lists of atoms (with attributes specifying which element we are talking about, what id that particular atom has and what its 2D coordinates are) and bonds (with attributes telling us between which atom ids the bond is formed, what the id of the bond is and also what the bond order is). All of this taken together is what computational chemists call a “connection table”.
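
In case the image does not display, here is a minimal hand-written sketch of what such a CML document can look like. The ids and 2D coordinates are illustrative rather than those of the original figure, and hydrogens are left implicit:

    <molecule id="styrene" xmlns="http://www.xml-cml.org/schema">
      <atomArray>
        <!-- the vinyl group -->
        <atom id="a1" elementType="C" x2="-2.3" y2="1.3"/>
        <atom id="a2" elementType="C" x2="-1.5" y2="0.8"/>
        <!-- the benzene ring -->
        <atom id="a3" elementType="C" x2="-0.7" y2="1.3"/>
        <atom id="a4" elementType="C" x2="0.2" y2="0.8"/>
        <atom id="a5" elementType="C" x2="1.0" y2="1.3"/>
        <atom id="a6" elementType="C" x2="1.0" y2="2.3"/>
        <atom id="a7" elementType="C" x2="0.2" y2="2.8"/>
        <atom id="a8" elementType="C" x2="-0.7" y2="2.3"/>
      </atomArray>
      <bondArray>
        <bond id="b1" atomRefs2="a1 a2" order="2"/>
        <bond id="b2" atomRefs2="a2 a3" order="1"/>
        <!-- Kekulé representation of the aromatic ring -->
        <bond id="b3" atomRefs2="a3 a4" order="2"/>
        <bond id="b4" atomRefs2="a4 a5" order="1"/>
        <bond id="b5" atomRefs2="a5 a6" order="2"/>
        <bond id="b6" atomRefs2="a6 a7" order="1"/>
        <bond id="b7" atomRefs2="a7 a8" order="2"/>
        <bond id="b8" atomRefs2="a8 a3" order="1"/>
      </bondArray>
    </molecule>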

Neither hard nor scary, is it? And it is the simplest way of holding chemical information in a semantically rich format. In future posts I will delve somewhat deeper into the bowels of CML and show you what else it is capable of.

Polymer Informatics and The Semantic Web – The Solution, Part 1: Adding Structure

In one of my last posts I mentioned that one of the problems we encounter in current knowledge bases is the fact that polymer information is quite often present as free text. It is therefore very hard to extract information from these sources (although it can be done – see Peter Corbett’s OSCAR system) and, even when it is accomplished, one is quite often faced with the problem of what the extracted information means. Take your favourite search engine and look for the term “cook”, for example. The search engine will most likely retrieve information about people called “Cook”, about “cook” the profession, the Cook Islands or Cook County, Illinois.

One way around this is to add more descriptive data to the data contained in web pages and other documents – in other words, data about data. If we could mark up the term “cook” as a person, a profession or a place name according to the context in which we use it, a machine would have a much better time of finding the bits of information we are really interested in. Now, data about data is also called “metadata”, and one way of adding metadata to documents is through the use of markup languages – in our case, through the use of Extensible Markup Language (XML) and its dialects.
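
As a toy illustration (the element names are invented for this example, not taken from any standard vocabulary), the three senses of “cook” could be distinguished like this:

    <sentence>
      <person surname="Cook">Captain Cook</person> hired a ship's
      <profession>cook</profession> before sailing past the
      <place type="islands">Cook Islands</place>.
    </sentence>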

Now, the concept of a markup language should not be unfamiliar. Every internet user will have heard of HTML, the Hypertext Markup Language, which can be used to structure text into headings, tables, paragraphs etc. XML, just like HTML, belongs to the class of descriptive markup languages.

[Image: markup-languages.gif – classes of markup languages]

If you use wikis at all, then you will have come across and used another type of markup, which is used for purely presentational purposes. And maybe you write your papers in LaTeX and deal with PostScript files a lot, in which case you will have had exposure to procedural markup languages too.

Now, according to the Wikipedia entry on XML, the latter “provides a text-based means to describe and apply a tree-based structure to information. At its base level, all information manifests as text, interspersed with markup that indicates the information’s separation into a hierarchy of character data, container-like elements, and attributes of those elements.” In an XML document, metadata is enclosed in angle brackets (“<” and “>”), which, in turn, enclose the data to be described. This is what is meant by a container. Let’s look at a simple XML document – a recipe for baking bread (also taken from the Wikipedia article):

[Image: bread.gif – a bread recipe marked up in XML]

We see that there are a number of containers with labels (known as “elements”), such as “recipe”, “title”, “ingredient”, “instructions” and “step”. Some of these carry a number of attributes, such as “name”, “prep_time”, “unit” and “state”, which specify further information concerning that element.
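
For readers who cannot see the image, the marked-up recipe runs along the following lines (reconstructed from the Wikipedia example of the time, so the exact quantities and steps may differ from the original figure):

    <recipe name="bread" prep_time="5 mins" cook_time="3 hours">
      <title>Basic bread</title>
      <ingredient amount="8" unit="dL">Flour</ingredient>
      <ingredient amount="10" unit="grams">Yeast</ingredient>
      <ingredient amount="4" unit="dL" state="warm">Water</ingredient>
      <ingredient amount="1" unit="teaspoon">Salt</ingredient>
      <instructions>
        <step>Mix all ingredients together and knead thoroughly.</step>
        <step>Cover with a cloth and leave for one hour in a warm room.</step>
        <step>Knead again, place in a tin and bake for 30 minutes.</step>
      </instructions>
    </recipe>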

Looking at this example, you will hopefully have realized that XML is eminently human-readable and that you don’t have to be a computer genius to figure out what is going on in the document. And you will hopefully also realize that this markup should now make it easy for a computer to, for example, extract all the ingredients from the text, as they are now explicitly labelled as such.
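
To see just how easy, here is a minimal XSLT stylesheet – my own sketch, assuming the recipe document shown above – that pulls out nothing but the ingredients:

    <xsl:stylesheet version="1.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <xsl:output method="text"/>
      <!-- for each ingredient element, print amount, unit and name -->
      <xsl:template match="/recipe">
        <xsl:for-each select="ingredient">
          <xsl:value-of select="@amount"/>
          <xsl:text> </xsl:text>
          <xsl:value-of select="@unit"/>
          <xsl:text> </xsl:text>
          <xsl:value-of select="."/>
          <xsl:text>&#10;</xsl:text>
        </xsl:for-each>
      </xsl:template>
    </xsl:stylesheet>

Run over the recipe, it prints one line per ingredient, e.g. “8 dL Flour” – something that would be decidedly non-trivial against free text.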

In my next post, I’ll discuss how to mark up chemistry and molecules…but maybe you can begin to see now how this structuring of information could already be useful for polymers.