Capturing process: In silico, in laboratorio and all the messy in-betweens – Cameron Neylon @ the Unilever Centre

I am not very good at live-blogging, but Cameron Neylon is at the Unilever Centre and giving a talk about capturing the scientific process. This is important stuff and so I shall give it a go.

He starts off by making the point that to capture the scientific process we need to capture information about the objects we are investigating as well as the process by which we get there.

Journals not enough – the journal article is static but knowledge is dynamic. Can solutions come from software development? Yes to a certain extent….

e.g. source control/versioning systems – captures snapshots of development over time, date stamping etc.
Unit testing – continuous tests as part of the science/knowledge testing
Solid-replication…distributed version control

Branching and merging: data integration. However, commits are free text..unstructured knowledge…no relationships between objects – what Cameron really wants to say is NO ONTOLOGIES, NO LINKED DATA.

Need linked data, need ontologies: towards a linked web of data.

Data is all well and good…but what about the stuff that goes on in the lab? Objects and data spread over multiple silos – recording is much harder: we need to worry about the lab notebook.

“The lab notebook is pretty much an episodic journal” – which is not too dissimilar to a blog. The similarities are striking: descriptions of stuff happening, date stamping, categorisation, tagging, accessibility…and not of much interest to most people…;-). But the problem with blogs is still information retrieval – same as with the lab notebook…

Now showing a blog of one of his students recording lab work…software built by Jeremy Frey’s group….the blog IS the primary record: the blog is a production system…2GB of data. At first glance the lab-log is similar to a conventional blog: dates, tags etc….BUT the fundamental difference is that the data is marked up and linked to other relevant resources…now showing a video demo of capturing provenance, dates, linking of resources, versioning, etc.: data is linked to experiment/procedure, procedure is linked to sample, sample is linked to material….etc….

Proposes that his blog system is a system for capturing both objects and processes….a web of objects…now showing a visualisation of resources in the notebook and demonstrating that the visualisation of the connectedness of the resources can indicate problems in the science, or in the recording of the science, etc….and says it is only the linking/networking effect that allows you to do this. BUT…no semantics in the system yet (tags yes…no PROPER semantics).
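
To make the “web of objects” idea concrete, here is a minimal sketch of what those links might look like as RDF, using Python and rdflib. The URIs and property names are invented for illustration – the talk did not specify a vocabulary:

```python
# A minimal sketch of the "web of objects" idea from the talk.
# The namespace, URIs and property names below are invented for
# illustration only; the real lab-log system may use different ones.
from rdflib import Graph, Namespace, URIRef

LAB = Namespace("http://example.org/lab/")   # hypothetical vocabulary
g = Graph()

data       = URIRef("http://example.org/notebook/hplc-trace-42")
experiment = URIRef("http://example.org/notebook/experiment-17")
procedure  = URIRef("http://example.org/notebook/procedure-extraction")
sample     = URIRef("http://example.org/notebook/sample-A3")
material   = URIRef("http://example.org/notebook/material-polystyrene-batch-1")

g.add((data, LAB.producedBy, experiment))    # data is linked to the experiment
g.add((experiment, LAB.follows, procedure))  # experiment is linked to the procedure
g.add((procedure, LAB.usesSample, sample))   # procedure is linked to the sample
g.add((sample, LAB.madeOf, material))        # sample is linked to the material

print(g.serialize(format="turtle"))
```

Once the objects are linked like this, the connectedness visualisation shown in the talk is essentially just a traversal of the graph.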

The initial lab blog used hand-coded markup: scientists needed to know how to hand-code markup…and hated it…..this led to a desire for templates….templates create posts, associate a controlled vocabulary and specify the metadata that needs to be recorded for a given procedure….in effect they are metadata frameworks….templates can be preconfigured for procedures and experiments….and metadata frameworks map onto ontologies quite well….
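
As a rough sketch of how a template might act as a metadata framework – the field names, controlled vocabulary and ontology URIs below are all invented for illustration, not taken from the actual system – one could imagine something like this:

```python
# Hypothetical sketch of a template as a metadata framework: the template
# names the metadata that must be recorded for a procedure, constrains some
# fields to a controlled vocabulary, and maps each field to an ontology term.
# All field names, vocabularies and ontology URIs here are invented.

HPLC_TEMPLATE = {
    "procedure": "HPLC analysis",
    "fields": {
        "sample_id":        {"required": True,  "maps_to": "http://example.org/onto#Sample"},
        "column":           {"required": True,  "maps_to": "http://example.org/onto#Column",
                             "vocabulary": ["C18", "C8", "HILIC"]},
        "flow_rate_ml_min": {"required": True,  "maps_to": "http://example.org/onto#flowRate"},
        "comments":         {"required": False, "maps_to": "http://example.org/onto#comment"},
    },
}

def make_post(template, values):
    """Create a lab-log post from a template, enforcing the metadata framework."""
    for name, spec in template["fields"].items():
        if spec["required"] and name not in values:
            raise ValueError(f"missing required metadata field: {name}")
        vocab = spec.get("vocabulary")
        if vocab and name in values and values[name] not in vocab:
            raise ValueError(f"{name} must be one of {vocab}")
    return {"procedure": template["procedure"], "metadata": values}

post = make_post(HPLC_TEMPLATE, {"sample_id": "A3", "column": "C18",
                                 "flow_rate_ml_min": 1.0})
print(post)
```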

Bio-ontologies…sometimes conflate process and object….says there is no particularly good ontology of experiments….I think the OBI and EXPO people might disagree….

So how about the future?

• The important thing is: capture at source, IN CONTEXT.
• Capture as much as possible automatically. Try to take the human out of the equation as much as possible.
• In the lab, capture each object as it is created; capture the plan and track its execution step by step.
• Data repositories as easy as Flickr – repositories specific for a data type, with artefacts linked together across repositories…e.g. the Periodic Table of Videos on YouTube, embedding of chemical structures into pages from ChemSpider.
• More natural interfaces for interacting with these records…better visualisation etc…
• Trust and provenance and cutting through the noise: which objects/people/literature will I trust and pay attention to? Managing people and the reputation of the people creating the objects: the SEMANTIC SOCIAL WEB (now shows FriendFeed as an example: subscription as a measure of trust in people, but people discussing objects). “Data finds the data, then people find the people”…a social network with objects at the centre…
• Connecting with people only works if the objects are OPEN.
• Connected research changes the playing field – again, resources are key.
• OUCH, controversy: communicate first, standardize second….but at least he acknowledges that it will be messy….
• UPDATE: Cameron’s slides of the talk are here:


    ChemAxiom: An Ontology for Chemistry – 1. The Motivation

    I announced some time ago that we are working on ontologies in the polymer domain, though I realise that, so far, I have yet to produce the proof of that: the actual ontology (or ontologies).

    So today I am happy to announce that the time of vapourware is over and that we have released ChemAxiom – a modular set of ontologies which forms the first ontological framework for chemistry (or at least so we believe). The development of these ontologies has taken us a while: I started on a hunch and as a nice intellectual exercise, not entirely sure where to go with them or what to use them for, and therefore not working on them full time. As the work progressed, however, we understood just how inordinately useful they would be for what we are trying to accomplish in both polymer informatics and chemical informatics at large. I will introduce and discuss the ontologies in a succession of blog posts, of which this is the first.

    So what, though perhaps somewhat retrospectively, was the motivation for the preparation of the ontologies? In short: the breakdown of many common chemistry information systems when confronted with real chemical phenomena rather than small subsections of idealised abstractions. Let me explain.

    Chemistry and chemical information systems positively thrive on the use of a connection table as a chemical identifier and determinant of uniqueness. The reasons for this are fairly clear: chemistry, for the past 100 years or so, has elevated the (potential) correlation between the chemical structure of a molecule and its physicochemical and biological properties to be its “central dogma.” The application of this dogma has served subsections of the community – notably organic/medicinal/biological chemists – incredibly well, while causing major headaches for other parts of the chemistry community and giving an outright migraine to information scientists and researchers. There are several reasons for the pain:

    The use of a connection table as an identifier for chemical objects leads to significant ontological confusion. Often, chemists and their information systems do not realise that there is a fundamental distinction between (a) the Platonic idea of a molecule, (b) the idea of a bulk substance and (c) an instance of that substance (“the real bulk substance”) in a flask or bottle on the researcher’s lab bench. An example of this is the association of a physicochemical property of a chemical entity with a structure representation of a molecule: while it would, for example, make sense to do this for a HOMO energy, it does NOT make sense to speak of a melting point or a boiling point in terms of a molecule. The point here is simply that many physicochemical properties are the mereological sums of the properties of many molecules in an ensemble. If this is true for simple properties of pure small molecules, it is even more true for the properties of complex systems such as polymers, which are ensembles of many different molecules with many different architectures. A similar argument can be made for identifiers: in most chemical information systems, it is often not clear whether an identifier (such as a CAS number) refers to a molecule or to a substance composed of these molecules.
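
    To illustrate the distinction, here is a small sketch in Python/rdflib. The class and property names (and the numerical values) are invented for this post and are not the actual ChemAxiom terms; the point is simply where each assertion is attached:

```python
# Illustrative sketch only: the classes, properties and values below are
# invented and are not the actual ChemAxiom terms. The point is where each
# assertion attaches: electronic properties to the molecule, the melting
# point to the bulk substance, and the bottle on the bench is a portion
# (an instance) of that substance.
from rdflib import Graph, Namespace, URIRef, Literal
from rdflib.namespace import RDF, XSD

EX = Namespace("http://example.org/chem#")
g = Graph()

molecule  = URIRef("http://example.org/chem/benzene-molecule")
substance = URIRef("http://example.org/chem/benzene-substance")
bottle    = URIRef("http://example.org/chem/benzene-bottle-shelf-3")

g.add((molecule,  RDF.type, EX.Molecule))
g.add((substance, RDF.type, EX.BulkSubstance))
g.add((bottle,    RDF.type, EX.PortionOfSubstance))

# A HOMO energy is a property of the (idealised) molecule...
g.add((molecule, EX.homoEnergyEV, Literal(-9.24, datatype=XSD.double)))
# ...whereas a melting point belongs to the bulk substance, an ensemble of molecules.
g.add((substance, EX.meltingPointCelsius, Literal(5.5, datatype=XSD.double)))

# Relate the three levels to each other.
g.add((substance, EX.composedOfMoleculesOfType, molecule))
g.add((bottle, EX.portionOf, substance))

print(g.serialize(format="turtle"))
```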

    Many chemical objects have temporal characteristics. Often, chemical objects have temporal characteristics which influence and determine their connection table. A typical example is rapidly interconverting isomers: glucose, when dissolved in water, can be described by several rapidly interconverting structures – a single connection table is not enough to describe the concept “glucose in water”, and there exists a parthood relationship between the concept and several possible connection tables. Ontologies can help with specifying and defining these parthood relationships.

    There is another aspect of time dependence we also need to consider. For many materials, their existence in time – or, put another way, their history – often holds more meaningful information about an observed physical property of that substance than the chemical structure of one of the components of the mixture. For an observable property of a polymer, such as the glass transition temperature, it matters a great deal whether the polymer was synthesized in the solid phase in a pressure autoclave or in solution at ambient pressure. Furthermore, it matters whether and how a polymer was processed – how it was extruded, grafted etc. All of these processes have a significant influence on the observable physical properties of a bulk sample of the polymer, while leaving the chemical description of the material essentially unchanged (in current practice, polyethylene is often represented either by the structure of the corresponding monomer (ethene, for example) or by the structure of a repeat-unit fragment (–CH2–CH2–)). Ontologies will help us to describe and define these histories. Ultimately, we envisage that this will result in a “semantic fingerprint” of a material, which – one might speculate – will be much more appropriate for the development of design rules for materials than the dumb structure representations in use today.
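
    Again as an invented sketch (not the actual ChemAxiom vocabulary, and with purely illustrative values), one way of attaching the history and the observed property to the sample, rather than to the bare structure, might look like this:

```python
# Invented vocabulary and values, for illustration only: two polyethylene
# samples share the same repeat-unit description, but their synthesis
# histories differ, and the observed property is attached to the sample
# together with that history, not to the structure.
from rdflib import Graph, Namespace, URIRef, Literal

EX = Namespace("http://example.org/polymer#")
g = Graph()

repeat_unit = URIRef("http://example.org/polymer/repeat-unit-CH2CH2")

for sample_id, process, tg in [("sample-1", "high-pressure-autoclave", -120.0),
                               ("sample-2", "solution-ambient-pressure", -115.0)]:
    sample = URIRef(f"http://example.org/polymer/{sample_id}")
    g.add((sample, EX.hasRepeatUnit, repeat_unit))           # same chemical description
    g.add((sample, EX.synthesisedBy, EX[process]))           # different history
    g.add((sample, EX.glassTransitionCelsius, Literal(tg)))  # illustrative values only

print(g.serialize(format="turtle"))
```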

    Many chemical objects are mixtures….and mixtures simply do not lend themselves to being described using the connection table of a single constituent entity of that mixture. If this is true for glucose in water, it is even truer for things such as polymers: polymers are mixtures of many different macromolecules, all of which have slightly different architectures etc. An observed physical property, and therefore a data object, is the mereological sum of the contributions made by all the constituent macromolecules and therefore, such a data object cannot simply be associated with a single connection table.

    This, in my view, is a short summary of the case for ontology in chemistry. Please feel free to violently (dis-)agree and if you want to do so, I am looking forward to a discussion in the comments section.

    There’s one more thing:

    AN INVITATION

    The ChemAxiom ontologies are far from perfect and far from finished. We hope that they show what an ontological framework for chemistry could look like. In developing these ontologies we can contribute our particular point of view, but we would like to hear yours. Even more, we would like to invite the community to get involved in the development of these ontologies in order to make them a general and valuable resource. If you would like to become involved, then please send an email to chemaxiom at googlemail dot com or leave comments/questions etc. in the ChemAxiom Google Group.

    In the next several blog posts, I will dive into some of the technical details of the ontologies.



    Semantic Universe Website with Chemical/Polymer Informatics Contributions now Live

    Over on Twitter, Semantic Universe has just announced the relaunch of their website. The purpose of the site is “to educate the world about semantic technologies and applications.”

    To quote from the website:

    “Semantic Universe and Cerebra today announced the launch of the “Semantic Universe Network”, a vibrant educational and networking hub for the global semantic technology marketplace. Semantic Universe Network will be the educational and information resource for the people and companies within the high-growth semantics sector, covering the latest news, opinions, events, announcements, products, solutions, promotions and research in the industry.”

    As part of the re-launch, both Lezan Hawizy and I have written two short contributions reviewing the state of semantic chemistry and showcasing our work on how the semantification of chemistry can happen. The contributions were intended to be short “how-tos” and as such are written in a somewhat chatty style. Here are the links:

    Semantic Chemistry

    The Semantification of Chemistry

    Feedback is welcome.


    (More) Triples for the World

    I have taken a long hiatus from blogging for a number of reasons and still don’t have time to blog much, but something has just happened that has really excited me.
    During this year’s International Semantic Web Conference in Karlsruhe (which I am still angry about not being able to attend due to time constraints), it was announced that Freebase now produces RDF!

    Now just in case you are wondering what Freebase is, here’s a description from their website:

    Freebase, created by Metaweb Technologies, is an open database of the world’s information. It’s built by the community and for the community – free for anyone to query, contribute to, build applications on top of, or integrate into their websites.

    Already, Freebase covers millions of topics in hundreds of categories. Drawing from large open data sets like Wikipedia, MusicBrainz, and the SEC archives, it contains structured information on many popular topics, including movies, music, people and locations – all reconciled and freely available via an open API. This information is supplemented by the efforts of a passionate global community of users who are working together to add structured information on everything from philosophy to European railway stations to the chemical properties of common food ingredients.

    By structuring the world’s data in this manner, the Freebase community is creating a global resource that will one day allow people and machines everywhere to access information far more easily and quickly than they can today.

    And all of this data, they are making available as RDF triples, which you can get via a simple service:

    Welcome to the Freebase RDF service.

    This service generates views of Freebase Topics following the principles of Linked Data. You can obtain an RDF representation of a Topic by sending a simple GET request to http://rdf.freebase.com/ns/thetopicid, where the “thetopicid” is a Freebase identifier with the slashes replaced by dots. For instance to see “/en/blade_runner” represented in RDF request http://rdf.freebase.com/ns/en.blade_runner

    The /ns end-point will perform content negotiation, redirecting your client to the HTML view of the Topic if HTML is preferred (as it is in standard browsers) or redirecting you to http://rdf.freebase.com/rdf to obtain an RDF representation in N3, RDF/XML or Turtle depending on the preferences expressed in your client’s HTTP Accept header.

    This service will display content in Firefox if you use the Tabulator extension.

    If you have questions or comments about the service please join the Freebase developer mailing list.
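
    As a minimal sketch, assuming the service behaves exactly as described in the quote above, fetching the RDF for a Topic from Python might look like this:

```python
# Minimal sketch of fetching a Freebase Topic as RDF, following the service
# description quoted above. We ask for an RDF serialisation explicitly via
# the Accept header so that content negotiation does not hand us back HTML.
import requests

# "/en/blade_runner" becomes "en.blade_runner" (slashes replaced by dots)
url = "http://rdf.freebase.com/ns/en.blade_runner"

resp = requests.get(url, headers={"Accept": "application/rdf+xml"})
resp.raise_for_status()

print(resp.headers.get("Content-Type"))
print(resp.text[:500])  # first few hundred characters of the RDF/XML
```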

    So now there’s DBPedia and Freebase. More triples for the world, more data, more opportunity to move ahead. In chemistry, it’s sometimes so difficult to convince people of the value of open and linked data. This sort of stuff makes me feel that we are making progress. Slowly, but inexorably. And that is exciting.

    Twine as a model for repository functionality?

    Although I have not blogged anything for a long time (again), I did not mean to write this blog post, as I have some more pressing academic concerns to deal with at the moment. However, the discussion as to what a repository should be, or should do for its users, is flaring up again here, http://blog.openwetware.org/scienceintheopen/2008/06/10/the-trouble-with-institutional-repositories/ and here. In particular, Chris Rusbridge asked for ideas about repository functionality, and so I thought I should chime in.

    When reading through all of the posts referenced above, the theme of automated metadata discovery is high up on everybody’s agenda, and for good reason: while I DO use our DSpace implementation here in Cambridge and try to submit posters and manuscripts, I feel considerable frustration every time I do so. Having to enter the metadata first (and huge amounts of it) costs me anything from 5 to 10 minutes a pop. Now, (ex-)DSpacers tell me that the interface and functionality that make me do this are a consequence of user interaction studies. If that is true, then the mind boggles…but anyway, back to the point.

    I have been wondering for a while now whether the darling of the semantic web community, namely Radar Networks’ Twine, could not be a good model for at least some of the functionality that an institutional repo should/could have. It takes a little effort to explain what Twine is, but if you were to press me for the elevator pitch, then it would probably be fair to say that Twine is to interests and content what Facebook is to social relationships and LinkedIn is to professional relationships.

    In short, when logging into Twine, I am provided with a sort of workspace, which allows me to reposit all sorts of stuff: text documents, pdf documents, bookmarks, videos etc. The Twine Workspace:

    [Screenshot: the Twine workspace]

    I can furthermore organize this content into collections (“Twines”), which can be either public or private:

    [Screenshot: organising content into public or private Twines]

    Once uploaded, all resources get pushed through a natural language processing workflow, which aims to extract metadata from these and subsequently marks the metadata up in a semantically rich form (RDF) using Twine’s own ontologies. Here, for example, is a bookmark for a book on Amazon’s site:

    [Screenshot: a bookmarked Amazon book page in Twine, with the extracted metadata shown on the right]

    The extracted metadata is shown in a user-friendly way on the right. And here is the RDF that Twine produces as a consequence of metadata extraction from the Amazon page:

    [Screenshot: the RDF produced by Twine’s metadata extraction]

    So far, the NLP functionality extracts people, places, organisations, events etc. However, Radar Networks have announced that users will be allowed to use their own ontologies by the end of the year. Now, I have no idea how this will work technically, but assuming that they can come up with a reasonable implementation, things get exciting, as it is then up to users to “customize” their workspaces around their interests and to decide on the information they want to see.

    On the basis of the extracted metadata, the system will suggest other documents in my own collection or in other public Twines, which might be of interest to me, and I, for one, have already been alerted to a number of interesting documents this way. Again, if Radar’s plans go well, Twine will offer document similarity analyses on the basis of clustering around autumn time.

    It doesn’t end here: there is also a social component to the system. On the basis of the metadata extracted from my documents, other users with a similar metadata profile, and therefore presumably similar interests, will be recommended to me, and I have the opportunity to link up with them.

    As I said above, Twine is in private beta at the moment, and so the stuff is hidden behind a password for now. However, if everything goes to plan, Radar plans to take the passwords off the public Twines so that their content will be exposed on the web, indexed by Google etc. And once that happens, of course, there are more triples for the world too…..which can only be a good thing.

    Personally, I am excited about all of this, simply because the potential is huge. Some of my colleagues are less enthusiastic – for all sorts of reasons. For one, the user interface is far from intuitive at the moment and it actually takes a little while to “get” Twine. But once you do, it is very exciting….and I think that a great deal of this functionality could be/should be implemented by institutional repos as well. Oh and what would it mean for data portability/data integration etc. if institutional repos started to expose RDF to the world….?

    By the way, I have quite a few Twine invites left – so should anybody want to have a look and play with the system, leave a comment on the blog and I’ll send you an invite!

    Yahoo! has! announced! support! for! semantic! web! standards!

    Well, this blog has remained dormant for far too long as I got distracted by the “real world” (i.e. papers, presentations and grant proposals – not the real real world) after Christmas.

    But I can’t think of a better way to start blogging again than to report that Yahoo! has just announced their support of semantic web technologies. To quote from their search blog:

    The Data Web in Action
    While there has been remarkable progress made toward understanding the semantics of web content, the benefits of a data web have not reached the mainstream consumer. Without a killer semantic web app for consumers, site owners have been reluctant to support standards like RDF, or even microformats. We believe that app can be web search.

    By supporting semantic web standards, Yahoo! Search and site owners can bring a far richer and more useful search experience to consumers. For example, by marking up its profile pages with microformats, LinkedIn can allow Yahoo! Search and others to understand the semantic content and the relationships of the many components of its site. With a richer understanding of LinkedIn’s structured data included in our index, we will be able to present users with more compelling and useful search results for their site. The benefit to LinkedIn is, of course, increased traffic quality and quantity from sites like Yahoo! Search that utilize its structured data.

    In the coming weeks, we’ll be releasing more detailed specifications that will describe our support of semantic web standards. Initially, we plan to support a number of microformats, including hCard, hCalendar, hReview, hAtom, and XFN. Yahoo! Search will work with the web community to evolve the vocabulary framework for embedding structured data. For starters, we plan to support vocabulary components from Dublin Core, Creative Commons, FOAF, GeoRSS, MediaRSS, and others based on feedback. And, we will support RDFa and eRDF markup to embed these into existing HTML pages. Finally, we are announcing support for the OpenSearch specification, with extensions for structured queries to deep web data sources.

    We believe that our open approach will let each of these formats evolve within their own passionate communities, while providing the necessary incentive to site owners (increased traffic from search) for more widespread adoption. Site owners interested in learning more about the open search platform can sign up here.

    I have had many discussions with people over the past year or so concerning the value of the semantic web approach, and some of the people I talked to have been very vocal sceptics. However, the results of some of our work in Cambridge, together with the fact that no matter where I look the semantic web is becoming more prominent, have convinced me that we are doing the right thing. It was pleasing to see that the Semantikers were out in force recently, even at events such as CeBIT, which has just ended. Twine seems to be inching towards a public beta now, and the first reviews of the closed beta are being written, even if they are mixed at the moment. Reuters recently released Calais as a web service, which uses natural language processing to identify people, places, organizations and facts and makes the metadata available as RDF constructs. So despite all the scepticism, semantic web products are starting to be shipped, and even the mainstream media are picking up the idea of the semantic web – with the usual insouciance, but it is nevertheless judged newsworthy. Academic efforts are coming to fruition too.

    Maybe I am gushing a little now. But it seems to me that Yahoo! now lending its support could be a significant step forward for all of us. And sometimes it is just nice to know that one is doing the right thing.

    Running Joseki

    In previous posts I have already alluded to the fact that we are interested in the use of OWL ontologies and RDF for our polymer informatics project. As part of this, we are investigating Joseki, an RDF publishing server that provides a web interface for running SPARQL queries over RDF graphs.

    The Joseki documentation is sparse, and I always like a take-you-by-the-hand-and-walk-you-through-in-baby-steps type of tutorial for getting things up and running. So I have attempted to supply one, and the link is here. It seems to work on my fruit machine (MacBook Pro), so something must have gone right…….
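
    By way of a quick illustration, and assuming a local Joseki instance exposing a SPARQL endpoint (the URL below is a placeholder – use whatever your own Joseki configuration defines), a query can be sent over HTTP using the standard SPARQL protocol:

```python
# Rough sketch of querying a SPARQL endpoint exposed by Joseki over HTTP.
# The endpoint URL below is an assumption (a local service); change it to
# match the service name defined in your own Joseki configuration.
import requests

ENDPOINT = "http://localhost:2020/sparql"   # hypothetical local Joseki service

query = """
SELECT ?s ?p ?o
WHERE { ?s ?p ?o }
LIMIT 10
"""

resp = requests.get(ENDPOINT,
                    params={"query": query},
                    headers={"Accept": "application/sparql-results+xml"})
resp.raise_for_status()
print(resp.text)
```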

    Polymer Theses, Polymer Data and a Common Language.

    I am currently at the European Science Foundation’s first summer school on Nanomedicine in Cardiff, where I was invited to present some of the work in polymer informatics which we are doing in Cambridge. The summer school is a wonderful event, with approximately 180 attendees, the majority of whom are PhD students, plus a few undergraduates as well as a significant number of tenured faculty. The attendees come from a number of scientific disciplines, such as chemistry, biology, physics, medicine and ethics. And bringing people together in this way to talk about a field of research which is completely interfacial is the only sustainable way forward.
    An awful lot of people were very impressed by the work we do and by our approach to data and knowledge management, and many of the PhD students I spoke to were enthused by the potential power that informatics can bring to their research. They also appreciated the need for well-curated data that is freely available and not copyrighted by publishers etc. With so many PhD students here talking to each other freely about their research, getting to know each other and appreciating each other’s science, it seemed to me that there is a real chance to build a community that exchanges data and information in order to communally advance a field of research.

    While the summer school was very multidisciplinary, there was a predominance of people interested in the use of polymers for all sorts of different applications – not least for applications in drug and gene delivery.
    People working in polymer therapeutics are quite often “jacks of all trades;” not only are they chemists who know how to synthesize and purify polymers, but, to a certain extent at least, they also have to be physical chemists, biologists, formulators etc. So the polymer pharmaceuticals community produces very rich and diverse datasets. The data they create is usually of general importance:
    An important property of polymers in medical applications, for example, is solubility. So quite often, people working in polymer pharmaceuticals will engage in the determination of phase diagrams for polymers. And as there is a lot of interest in stimulus-responsive polymers, these diagrams are not just measured in pure water, but also in the presence of different ions and at different pH values. Researchers might also be interested in the dimensions of the polymer chain under all of those conditions, so light or X-ray scattering studies are carried out. And that is just on the pure polymer! Conjugation of a drug or gene to the pure material changes the game completely, and so all of these measurements potentially get carried out again.

    Once we are done with the physicochemical characterisation, we then go on to try and characterize the polymers we have synthesized w.r.t. their biological properties: we are interested in their toxicology, their biodistribution, their specificity etc. That, too, generates an awful lot of data which is potentially related to the structure of the polymers we are dealing with.

    And as I said before, it is not only other pharmaceutical people that are interested in this sort of data. A lot of polymer chemists in general, as well as companies, should in principle be very interested in this type of data: polymers are present in most modern household and cleaning products (check the labels of your shampoo and washing-powder bottles).

    Therefore it seems to me that we have a rich source of polymer-related data here, which we should attempt to harvest. The initial enthusiasm that I have encountered at the summer school leads me to think that maybe we have an opportunity to work with the polymer pharmaceutics/nanomedicine research community to build up, at least in the long term, a valuable polymer knowledge base. Now, I am aware of the fact that this community in particular is very conscious of patents and intellectual property, and we have mechanisms to ensure that these considerations can be taken into account and accommodated. How could we get hold of this data?
    Over on his blog, Peter has pointed out that a viable way would be to capture digital theses in repositories, which would not only allow the theses to be preserved, but would undoubtedly also help with dissemination and intelligent data mining. Furthermore, it would be a way to prevent publishers from copyrighting scientific data.

    All of this said, the potential goes much further than this. I have already mentioned the strongly interdisciplinary nature of the summer school. Now, in our work here in Cambridge, we use semantic web technologies to hold information about polymers….we have developed an XML-based polymer markup language and are working on ontologies which codify polymer knowledge. One of the conclusions of my talk was that biologists and medics use exactly the same technologies to communicate their data and knowledge, and so here, for the first time, we have an opportunity to bring knowledge from disparate disciplines together and map it onto one another. In that way, we should be able to develop a joint language in which we and our information systems can understand each other, and that should allow us to ask new questions – Peter has already demonstrated what is possible when a thesis can be turned into RDF.
    And theses originating in a strongly interdisciplinary field of research could be a wonderful starting point.

    So, dear polymer science/polymer pharmaceuticals community, how about it? If you are interested not only in preserving and disseminating your data (after patenting etc.), but also in being able to ask new questions of it and in bringing multiple disciplines together, then give us your theses and let us work with you to show you how all this can be achieved. Here’s an offer – please take us up on it.