Post-doc Position for the CheTA project at the National Centre for Text Mining Available

I am re-posting this from Jim Downing’s blog:

CheTA PostDoc at NaCTeM

There’s still a little time to apply for one of the two Postdoc positions on the CheTA project, our JISC-funded collaboration with NaCTeM at the University of Manchester, the Royal Society of Chemistry and Thomson Reuters. The first position is at NaCTeM. An advertisement for the second position, here in the Unilever Centre for Molecular Science Informatics, will be out shortly.


Data-Rich Publishing

I have been insanely busy recently with trips and papers and corrections and…etc…and only now have a bit of time to catch up with some of my feeds and people’s blog posts. One post which caught my eye was Egon’s recent blog post about data-rich or data-centric publishing, in which he argues strongly for a new kind of publishing: a publishing in which data is treated as a first class citizen and which allows/requires an author to not just publish the words of a paper, but his research data too and to publish it in such a way that the barrier to access by machines is low.

This reminded me of what I thought was a particularly tragic case, which I blogged about a while ago here. In this particular case, industrious researchers had synthesized an incredible 630 polystyrene copolymers and recorded their Raman spectra. Now this is more than a crying shame: a lot of work has gone into producing the polymers and recording the data. And I ask you (provided you are a materials scientist and have an interest in such things), when was the last time that YOU came across such a large and rich library of polymers together with their spectral data? And through no fault of their own, the only way these authors saw to publish their data was in the form of a PDF archive in the supplemental information.

Now Egon’s point was that newly formed journals – and in particular newly formed Journals of Chemoinformatics – have the opportunity to do something fundamentally good and wholesome: namely to change the way in which data publication is being accomplished and to give scientists BETTER tools to deal with and disseminate their data. This long and rambly blogpost is my way of violently agreeing with Egon: I believe that THIS is where an awful lot of the added value of the journal of the future will lie. This will be even more true, as successive generations of scientists will start to become more data savvy: last week I talked to a collaborator of ours who had just put in for some funding to train chemistry students in both chemistry and informatics: a whole dedicated course. Now once these students start their own scientific careers, they will both care and know about science and scientific data. And if I were a publisher, I would want to have something to offer them….


ChemAxiom: An Ontology for Chemistry 2. The Set-Up

Now that I have introduced at least some of the motivation behind ChemAxiom, let me outline some of the mechanics.

ChemAxiom is a collective term for a set of ontologies, all of which make a start at describing subdomains within chemistry. The ontology modules are independent and self-contained and can (largely) be developed separately and concurrently. Although they are independent, they are interoperable and integrated via a common upper ontology – in the case of ChemAxiom, we have chosen the Basic Formal Ontology (BFO). I will blog the reasons for this choice in the next post.


The ontologies are currently in various stages of axiomatisation, depending on how long we have been working on them and how much we have had a chance to play. So if there are axioms missing that you think should be there, or if you agree/disagree with some of our design decisions, please let us know. In any case, the discussion has already started with some helpful comments over on the Google Group. Let me describe the various modules in greater detail:

The Reasons for Modularity: When developing ontologies, it is always tempting to develop the ueber-McDaddy-ontology-of-everything, because, of course, ontology development is, by definition, never done: we always need more than we have – more terms, more axioms etc. Very quickly, this can result in monstrously large and virtually unmaintainable constructs. Modularisation has, from our perspective, the advantage of (a) smaller and more manageable ontologies, (b) ontologies which are easier to maintain, (c) ontologies which can be developed in parallel or orthogonally and subsequently integrated using either a common upper ontology or mapping/rules etc. Furthermore, if refactoring of ontologies is necessary during the development process, this is also facilitated by modularity: changes in one module are less likely to force changes in another module.

The General Use Case: One of the things we are particularly interested in here in Cambridge is the extraction of chemical entities and data from text, and Peter Corbett’s OSCAR is now fairly well established within the chemical informatics community. Our text sources vary widely, and can range from standard chemical papers to theses, blogs and Wikipedia pages. To give you an impression of the types of data we are talking about, here is an example, Wikipedia’s infobox for benzene (somewhat truncated):

 

[Figure: Wikipedia infobox for benzene (truncated)]

So we have to deal with names, identifiers of various types, physico-chemical property data as well as the corresponding metadata (e.g. measurement pressures, measurement temperatures etc.), and chemical structure (InChI, SMILES). Our ontologies should enable us to generate RDF that allows us to hold this data – the ontology here serves as a schema. While we are interested in reasoning/using reasoners for the purposes of (retrospective) typing (again, I will explain what I mean by that in subsequent blog posts), applying ontologies to the description of chemical data is our first use-case.
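To make the "ontology as schema" idea a little more concrete, here is a minimal sketch in plain Python that writes a few infobox-style facts about benzene as N-Triples-like statements. The namespace `http://example.org/chem#` and every property name (`hasSMILES`, `measuredAtPressureInKPa` and so on) are placeholders invented for this illustration; they are not the actual ChemAxiom terms.

```python
# Hypothetical namespace and property names, for illustration only;
# they are NOT the actual ChemAxiom vocabulary.
EX = "http://example.org/chem#"


def iri(local_name):
    """Wrap a local name in the placeholder namespace as an IRI."""
    return f"<{EX}{local_name}>"


def literal(value):
    """Render a Python value as a plain, escaped string literal."""
    escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def benzene_triples():
    """Return (subject, predicate, object) triples for a few benzene facts."""
    benzene = iri("benzene")
    boiling_point = iri("benzene_boiling_point")
    return [
        (benzene, iri("hasName"), literal("benzene")),
        (benzene, iri("hasSMILES"), literal("c1ccccc1")),
        (benzene, iri("hasInChI"),
         literal("InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H")),
        # The property value gets its own node so that measurement
        # metadata (here: the pressure) can be attached to it.
        (benzene, iri("hasProperty"), boiling_point),
        (boiling_point, iri("propertyType"), iri("BoilingPoint")),
        (boiling_point, iri("valueInCelsius"), literal(80.1)),
        (boiling_point, iri("measuredAtPressureInKPa"), literal(101.325)),
    ]


def to_ntriples(triples):
    """Serialise triples, one statement per line, each ending in ' .'."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


if __name__ == "__main__":
    print(to_ntriples(benzene_triples()))
```

The one design point worth noting is that the boiling point is a node in its own right rather than a bare number on the molecule, which is what lets the metadata travel with the value.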

With all of that said, let me provide a quick summary of the modules:

Chemistry Domain Ontology – ChemAxiomDomain. ChemAxiomDomain is the first module in the set. It is currently a small ontology, which clarifies some fundamental relationships in the chemistry domain. Key concepts in this ontology are “ChemicalElement”, “ChemicalSpecies” and “MolecularEntity” as well as “Role”. ChemAxiomDomain clarifies the relationships between these terms (see my previous blog post) and also deals with identifiers etc. Chemical roles too are important: while chemical entities may be or act as nucleophiles, acids, solvents etc. some of the time, they do not have these roles all of the time – roles are realisable entities and ChemAxiomDomain provides a mechanism for dealing with that. There are few other high-level domain concepts in there at the moment, though obviously we are looking to expand as and when the need arises and use-cases are provided. I will blog some details in a subsequent blog post.

Properties Ontology – ChemAxiomProp. ChemAxiomProp is an ontology of over 150 chemical and materials properties, together with a first set of definitions and symbols (where available and appropriate) and some axioms for typing of properties. Again, details will follow in a subsequent blog post.

Measurement Techniques – ChemAxiomMetrology. This is an ontology of over 200 measurement techniques and also contains a list of instrument parts and axioms for typing of measurement techniques. It does not currently include information about minimum information requirements for measurement techniques (e.g. the measurement of a boiling point also requires a measurement of pressure) and other metadata, but this will be added at a later stage. Again, a detailed blog-post will follow.
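As a rough illustration of what such minimum information requirements could look like once they exist, here is a small Python sketch that checks a measurement record for required metadata. The table below is entirely hypothetical: only the boiling point/pressure pairing comes from the text above, the density entry is an assumption added for the example, and none of it reflects what ChemAxiomMetrology currently contains.

```python
# Hypothetical table of required metadata per measured property.
# Only "boiling_point requires pressure" comes from the post; the
# density entry is an assumption made for the sake of the example.
REQUIRED_METADATA = {
    "boiling_point": {"pressure"},
    "density": {"temperature"},
}


def missing_metadata(property_name, metadata):
    """Return the sorted required metadata keys absent from a record."""
    required = REQUIRED_METADATA.get(property_name, set())
    return sorted(required - set(metadata))


if __name__ == "__main__":
    complete = {"pressure": "101.325 kPa"}
    print(missing_metadata("boiling_point", complete))
    print(missing_metadata("boiling_point", {}))
```

A record that carries the pressure passes with an empty list, while one without it reports `pressure` as missing; properties with no entry in the table are treated as having no requirements.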

ChemAxiomPoly and ChemAxiomPolyClass – These two ontologies contain terms which are in common use across polymer science as well as a taxonomy of polymers based on the composition of their backbone (though the latter is not axiomatised yet). Details will follow in a further blog post.

ChemAxiomMeta – ChemAxiomMeta is a developing ontology that will allow the specification of provenance of data (e.g. data derived from wiki pages etc.) and will also define what a journal, journal article, thesis, thesis chapter etc. is and what the relationships between these entities are. We have not released this yet. Details will follow in a further blog post.

ChemAxiomContinuants – ChemAxiomContinuants represents an integration of all the above sub-ontologies into an ontological framework for chemical continuants (with some occurrents mixed in when we need to talk about measurement techniques). Details will follow in a further blog post.

We have also started to work on ontologies of chemical reactions, actions and, as mentioned above, minimum information requirements – however, these are at a relatively early stage of development and hence not released yet.

So much for a short overview of the mechanics of the ontologies. I am sure there are a thousand other things I should have said, but that will have to do for now. Comments and suggestions via the usual channels. Automatic links and tags, as always, by Zemanta.


The Unilever Centre @ Semantic Technology 2009

In a previous blogpost, I had already announced that both Jim and I had been accepted to speak at Semantic Technology 2009 in San Jose.

Well, the programme for the conference is out now and looks even more mind-blowing (in a very good way) than last year. Jim and I will be speaking on Tuesday, 16th June at 14:00. Here are our talk abstracts:

PART I | Lensfield – The Working Scientist’s Linked Data Space Elevator (Jim Downing)

The vision of Open Linked Data in long-tail science (as opposed to Big Science, high energy physics, genomics etc) is an attractive one, with the possibility of delivering abundant data without the need for massive centralization. In achieving that vision we face a number of practical challenges. The principal challenge is the steep learning curve that scientists face in dealing with URIs, web deployment, RDF, SPARQL etc. Additionally, most software that could generate Linked Data runs off-web, on workstations and internal systems. The result of this is that the desktop filesystem is likely to remain the arena for the production of data in the near to medium term. Lensfield is a data repository system that works with the filesystem model and abstracts semantic web complexities away from scientists who are unable to deal with them. Lensfield makes it easy for researchers to publish linked data without leaving their familiar working environment. The presentation of this system will include a demonstration of how we have extended Lensfield to produce a Linked Data publication system for small molecule data.

PART II | The Semantic Chemical World Wide Web (Nico Adams)

The development of modern new drugs, new materials and new personal care products requires the confluence of data and ideas from many different scientific disciplines, and enabling scientists to ask questions of heterogeneous data sources is crucial for future innovation and progress. The central science in much of this is chemistry and therefore the development of a “semantic infrastructure” for this very important vertical is essential and of direct relevance to large industries such as the pharmaceuticals and life sciences, home and personal care and, of course, the classical chemical industry. Such an infrastructure should include a range of technological capabilities, from the representation of molecules and data in semantically rich form to the availability of chemistry domain ontologies and the ability to extract data from unstructured sources.

The talk will discuss the development of markup languages and ontologies for chemicals and materials (data). It will illustrate how ontologies can be used for indexing, faceted search and retrieval of chemical information and for the “axiomatisation” of chemical entities and materials beyond simple notions of chemical structure. The talk will discuss the use of linked data to generate new chemical insight and will provide a brief discussion of the use of entity extraction and natural language processing for the “semantification” of chemical information.

But that’s not all. Lezan has been accepted to present a poster and so she will be there too, showing off her great work on the extraction and semantification of chemical reaction data from the literature. Here is her abstract:

The domain of chemistry is central to a large number of significant industries such as the pharmaceuticals and life sciences industry, the home and personal care industry as well as the “classical” chemical industry. All of these are research-intensive and any innovation is crucially dependent on the ability to connect data from heterogeneous sources: in the pharmaceutical industry, for example, the ability to link data about chemical compounds, with toxicology data, genomic and proteomic data, pathway data etc. is crucial. The availability of a semantic infrastructure for chemistry will be a significant factor for the future success of this industry. Unfortunately, virtually all current chemical knowledge and data is generated in non-semantic form and in many silos, which makes such data integration immensely difficult.

In order to address these issues, the talk will discuss several distinct, but related areas, namely chemical information extraction, information/data integration, ontology-aided information retrieval and information visualization. In particular, we demonstrate how chemical data can be retrieved from a range of unstructured sources such as reports, scientific theses and papers or patents. We will discuss how these sources can be processed using ontologies, natural language processing techniques and named-entity recognisers to produce chemical data and knowledge expressed in RDF. We will furthermore show, how this information can be searched and indexed. Particular attention will also be paid to data representation and visualisation using topic/topology maps and information lenses. At the end of the talk, attendees should have a detailed awareness of how chemical entities and data can be extracted from unstructured sources and visualised for rapid information discovery and knowledge generation.

It promises to be a great conference and I am sure our minds will go into overdrive when there….can’t wait to go! See you there!?


ChemAxiom: An Ontology for Chemistry 3 – Choosing an Upper Ontology

I have to start this blogpost with a big mea culpa. In the mists of time, I proclaimed loudly and visibly on this blog, that I thought that Upper Ontologies were, well, a bit……ya’ know…..

I have realised that this view is wrong and entirely misguided and that Upper Ontologies are needed for the construction of modular and integratable ontological systems. But before delving into this discussion, what are Upper Ontologies?

Upper Ontologies are concerned with describing modes of existence and being, and within computer science they are used for preparing computable representations of the modes of existence of things. Wikipedia actually does a fairly good job of explaining further:

The aim is very broad semantic interoperability between a large number of ontologies accessible “under” this upper ontology. As the metaphor suggests, it is usually a hierarchy of entities and associated rules (both theorems and regulations) that attempts to describe those general entities that do not belong to a specific problem domain.

The seemingly conflicting use of metaphors implying a solid rigorous bottom-up “foundation” or a top-down imposition of somewhat arbitrary and possibly political decisions is no accident – the field is characterized by controversy, politics, competing approaches and academic rivalry.[citation needed]

Debates notwithstanding, it can be said that a very important branch of an upper ontology can be considered (as continuation and development of natural philosophy) to be the physical ontology.

Now the Wikipedia article goes on to point out several upper ontologies in common use in information science today. The most prominent are (a) the General Formal Ontology (GFO), (b) the Basic Formal Ontology (BFO), (c) the Suggested Upper Merged Ontology (SUMO) and (d) the DOLCE ontology. Again, the Wikipedia article on upper ontology does a fairly good job of summarizing the differences between and the peculiarities of these various upper ontologies. Now with so much on offer, the question remains what the criteria are for choosing an upper ontology that would be appropriate for the tasks which ChemAxiom is trying to accomplish.

First and foremost, any upper ontology used in the ChemAxiom project must have sufficient scope to describe the phenomena associated with chemical objects, which I have laid out in a previous blog post: chemical objects have both synchronic and diachronic qualities, which, in turn, have very different associated ontologies. Secondly, any upper ontology which we use should facilitate the broadest possible uptake of ChemAxiom by the community, either because (a) related ontologies in other domains use the same upper ontology and the upper ontology has good acceptance or (b) because relatively straightforward mapping between ontologies is possible (for example, the mapping between many BFO and SUMO concepts is relatively straightforward).

Taking the above criteria into account, we have chosen to use the Basic Formal Ontology for the purposes of developing ChemAxiom. The BFO has many things going for it. First of all, it has a notion of both syn- and diachronicity, each of which is codified in a “sub-ontology” but integrated via the notion of an Entity:

bfo:Entity
      a       owl:Class ;
      rdfs:label "entity"^^xsd:string ;
      rdfs:subClassOf owl:Thing ;
      owl:unionOf (snap:Continuant span:Occurrent) .

Typical examples of continuants are molecules, chemical identifiers, substances etc., whereas occurrents can be chemical reactions, measurements, interconversion phenomena etc. The BFO is therefore nicely set up for talking about these various qualities (in the non-BFO sense) of chemical objects and thus fulfills the requirement of sufficient scope outlined above. This is further reinforced by the BFO’s notion of “Role”. In the words of the BFO, a role is a

“realizable entity [snap:RealizableEntity] the manifestation of which brings about some result or end that is not essential to a continuant [snap:Continuant] in virtue of the kind of thing that it is but that can be served or participated in by that kind of continuant [snap:Continuant] in some kinds of natural, social or institutional contexts. […] Examples: the role of a person as a surgeon, the role of a chemical compound in an experiment, the role of a patient relative as defined by a hospital administrative form, the role of a woman as a legal mother in the context of system of laws, the role of a biological grandfather as legal guardian in the context of a system of laws, the role of ingested matter in digestion, the role of a student in a university.”

The tool of a “Role” is incredibly useful when trying to describe certain generic types of molecules or substances, such as acids, catalysts, nucleophiles, solvents, etc. A solvent, for example, can be modelled as a chemical substance which has a role “SolventRole” (which, in turn, is a subclass of role). The BFO provides useful mechanisms for these things, and biological roles of chemical compounds can potentially be dealt with in the same way.
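The following Python sketch is one way to picture the idea that a substance bears a role only in some contexts rather than by virtue of what it is. It is an analogy in ordinary code, not the BFO or ChemAxiom axiomatisation: the class names, the `realise`/`has_role` methods and the context strings are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Role:
    """Hypothetical stand-in for a role class such as SolventRole."""
    name: str


SOLVENT_ROLE = Role("SolventRole")
ACID_ROLE = Role("AcidRole")


@dataclass
class Substance:
    """A chemical substance; roles are attached per context, not per type."""
    name: str
    # Maps a context label to the set of roles realised in that context.
    roles_by_context: dict = field(default_factory=dict)

    def realise(self, role, context):
        """Record that this substance plays `role` in `context`."""
        self.roles_by_context.setdefault(context, set()).add(role)

    def has_role(self, role, context):
        """True only if the role was realised in this particular context."""
        return role in self.roles_by_context.get(context, set())


if __name__ == "__main__":
    acetic_acid = Substance("acetic acid")
    acetic_acid.realise(SOLVENT_ROLE, "reaction run in glacial acetic acid")
    acetic_acid.realise(ACID_ROLE, "titration against sodium hydroxide")
    print(acetic_acid.has_role(SOLVENT_ROLE, "titration against sodium hydroxide"))
```

Acetic acid is the same substance in both contexts; what changes is which role is realised, which is why the role lives on the context mapping and not on the `Substance` type itself.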

Measurements, by comparison, are not continuants, but processes – occurrents. The same is true for interconverting isomers and other fluxional processes associated with molecules and substances, as well as phase changes, transitions etc., and the BFO provides a mechanism to deal with those phenomena.
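As a loose analogy for the continuant/occurrent split, the sketch below mirrors the "Entity is the union of Continuant and Occurrent" statement as a small Python class hierarchy. The classes are hypothetical stand-ins named after the BFO terms; the code does no OWL reasoning and is only meant to show a measurement (an occurrent) referring to the molecule (a continuant) that takes part in it.

```python
class Entity:
    """Stand-in for bfo:Entity; not meant to be instantiated directly."""

    def __init__(self, label):
        self.label = label


class Continuant(Entity):
    """Stand-in for snap:Continuant, e.g. a molecule or a substance."""


class Occurrent(Entity):
    """Stand-in for span:Occurrent, e.g. a measurement or a reaction."""

    def __init__(self, label, participants=()):
        super().__init__(label)
        # The continuants that take part in this process.
        self.participants = tuple(participants)


def classify(entity):
    """Say which branch an entity falls under.

    Because Entity is modelled as the union of the two branches, anything
    that is neither a Continuant nor an Occurrent is rejected.
    """
    if isinstance(entity, Continuant):
        return "continuant"
    if isinstance(entity, Occurrent):
        return "occurrent"
    raise TypeError(f"{entity.label!r} is neither continuant nor occurrent")


if __name__ == "__main__":
    benzene = Continuant("benzene")
    measurement = Occurrent("boiling point measurement", [benzene])
    print(classify(benzene), classify(measurement))
```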

Finally, the Basic Formal Ontology is the ontology adopted by the OBO family of ontologies and thus fulfills the criterion of wide acceptance within the community of practice or a related community – in this case bioscience and medicine. This should help to facilitate the integration of ChemAxiom with those ontologies, should this be a desirable thing for the community. I think the blogposts until now set the scene nicely for some more detailed discussions of the individual ontologies, which will follow in subsequent posts either during or after the holidays. Until then, Happy Easter everyone. Comments and suggestions via the usual channels please…autogenerated links and tags – as always – by Zemanta.
