2011 – The International Year of Chemistry

In their editorial for the January issue (you will need a Nature subscription to access this; alternatively, see the Sceptical Chymist post here), the good folks at Nature Chemistry have reminded us that 2011 is the International Year of Chemistry:

“The United Nations has proclaimed 2011 to be the International Year of Chemistry. Under this banner, chemists should seize the opportunity to highlight the rich history and successes of our subject to a much broader audience — and explain how it can help to solve the global challenges we face today and in the future.”

The year even has a website. The UN also singles out two important areas of chemistry – neither of which has chemistry in the name – on the front page of the site: namely the development of advanced materials and molecular medicine. I am extremely happy to see this – materials, and in particular polymers, have been a long-standing interest of mine, and some of the immunology work I am currently doing has implications for molecular medicine too.

There are several ways to participate in the Year of Chemistry – one of them is through an essay and video competition: “A World Without Polymers”. Students are asked to make short videos or write essays, trying to imagine what the world would be like without polymers. Furthermore there are networking events, conferences and more all across the world. So go and check out the UN’s site, participate and contribute!

Exploring Chemical Space with GDB – Jean-Louis Reymond (University of Bern)

(These are live notes from a talk Prof Reymond gave at EBI today)

The GDB Database

GDB = Generated Database (of Molecules)

The Chemical Universe Project – how many small molecules are possible?

GDB was put together by starting from graphs – in this case hydrocarbon graphs – using the GENG software to enumerate all possible graphs (after predefining which graphs are chemically reasonable and incorporating bonding information etc.). Then place atoms, enumerate, get a combinatorial explosion of compounds and apply filters to remove chemical impossibilities: the result is a couple of billion compounds.

 

Some choices restricting diversity: no allenes, no double bonds at bridgeheads etc., no problematic heteroatom constellations (did not consider peroxides), no hydrolytically labile functional groups.

In general – number of possible molecules increases exponentially with increasing number of nodes.

Showing that the molecular diversity increases with linear open carbon skeletons – cyclic graphs have fewer substitution possibilities. Chiral compounds offer more diversity than non-chiral ones.

 

GDB Website

 

Now talking about GDB13:

removed fluorine, introduced sulphur, filtered for molecules with “too many” heteroatoms – due to synthetic difficulties and the fact they may be of lesser interest to medchem.

Now showing statistical analysis of molecular types in GDB. 95% of all marketed drugs violate at least two Lipinski Rules. All molecules in the GDB13 are Lipinski conformant.
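For readers unfamiliar with the term, "Lipinski conformant" refers to the rule of five as it is usually stated: molecular weight of at most 500, logP of at most 5, no more than 5 hydrogen-bond donors and no more than 10 hydrogen-bond acceptors. The following is only a minimal sketch of such a filter and is not taken from the talk; the `Descriptors` class, the function name and the example values (approximate figures for aspirin and a made-up large molecule) are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptors:
    """Precomputed descriptors for one molecule (hypothetical container)."""
    mol_weight: float       # g/mol
    log_p: float            # octanol/water partition coefficient
    h_bond_donors: int
    h_bond_acceptors: int


def lipinski_violations(d: Descriptors) -> list:
    """Return the rule-of-five criteria that the molecule violates."""
    violations = []
    if d.mol_weight > 500:
        violations.append("molecular weight > 500")
    if d.log_p > 5:
        violations.append("logP > 5")
    if d.h_bond_donors > 5:
        violations.append("more than 5 H-bond donors")
    if d.h_bond_acceptors > 10:
        violations.append("more than 10 H-bond acceptors")
    return violations


# Approximate descriptor values for aspirin, for illustration only.
aspirin = Descriptors(mol_weight=180.16, log_p=1.2,
                      h_bond_donors=1, h_bond_acceptors=4)

# A made-up large molecule that breaks two of the criteria.
large_molecule = Descriptors(mol_weight=950.0, log_p=3.0,
                             h_bond_donors=4, h_bond_acceptors=14)

print(lipinski_violations(aspirin))
print(lipinski_violations(large_molecule))
```

In a real pipeline the descriptors would come from a cheminformatics toolkit rather than being typed in by hand.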

Use case: take a known drug and find isomers. For aspirin there are approx 180 compounds with a Tanimoto similarity > 0.7. Points out that these molecules may never have been imagined by chemists.
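The Tanimoto score mentioned here is the size of the intersection of two fingerprints divided by the size of their union. The sketch below shows the idea on toy fingerprints represented as sets of "on" bit positions; the fingerprint values, the library names and the `similar_to` helper are invented for illustration, and the talk did not say which fingerprint was actually used.

```python
def tanimoto(a: set, b: set) -> float:
    """Tanimoto (Jaccard) coefficient of two fingerprints given as sets of on-bits."""
    union = a | b
    if not union:
        # Convention chosen for this sketch: two empty fingerprints score 0.
        return 0.0
    return len(a & b) / len(union)


def similar_to(query: set, library: dict, threshold: float = 0.7) -> list:
    """Names of library entries whose similarity to the query exceeds the threshold."""
    return sorted(name for name, fp in library.items()
                  if tanimoto(query, fp) > threshold)


# Toy fingerprints: sets of bit positions, not real chemical fingerprints.
query_fp = {1, 2, 3, 4}
toy_library = {
    "candidate_a": {1, 2, 3, 4, 5},   # 4 shared / 5 total = 0.8
    "candidate_b": {2, 3, 4, 5},      # 3 shared / 5 total = 0.6
    "candidate_c": {7, 8, 9},         # nothing shared   = 0.0
}

print(similar_to(query_fp, toy_library))
```

Against a database the size of GDB, one would of course use packed bit vectors and a toolkit's bulk similarity routines rather than Python sets.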

 

GDB15 is just out – corrected some bugs, eliminated enol ethers (due to quick hydrolysis), optimized CPU usage… (approx 26 billion molecules, 1.4 TB; counting them takes a day)

 

Applications of the Database – mainly GDB 11

Use case: Glutamatergic Synapse Binding

Used a Bayesian classifier trained on known actives to retrieve about 11000 molecules from GDB11. This was followed by high-throughput docking – selected 22 compounds for lab testing. Enrichment of glycine-containing compounds. Now showing some activity data for selected compounds.

Use case: Glutamate Transporter: applied certain structural selection criteria to database molecules to obtain a subset of approx 250 k compounds. Again followed by HT docking. Now showing syntheses of some selected candidate structures together with screening data.

 

“Molecular Quantum Numbers”

Classification system for large compound databases. Draws analogy to the periodic table: a classification system for elements. We do not have something like this for molecules. Define features for molecules: atom types, bond types, polarity, topology… 42 categories in total. Now examines the ZINC database against these features: can show that there are common features for molecules occupying similar categories. PCA: first 2 PCs cover 70% of diversity space: first PC includes molecular weight… 2D representations considered to be acceptable. PCA also shows nice grouping of molecules by number of cycles.

Same analysis for GDB 11: first PCs now mainly account for molecular flexibility, polarity (doesn’t contain many rings due to atom limitation).

Analysis for PubChem – difficult to discover information at the moment.

Was on the cover of ChemMedChem this November.

Shows examples of fishing out structural motif analogues for given molecular motifs.

ChemAxiom: An Ontology for Chemistry 4. ChemAxiomChemDomain

Obligations to our funders and some publishers have delayed me for a few days in continuing this series of blog posts and in taking part in the discussion on the Google Group, but I hope I can catch up on both now. In my previous blog post, I briefly summarised all of the ChemAxiom modules: now is the time to delve into some more detail. First up then: ChemAxiomChemDomain.

ChemAxiomChemDomain is, at the moment, a rather small, but nevertheless important ontology, which clarifies some fundamental domain concepts in chemistry, namely the relationship between platonic molecules, platonic bulk substances, instances of either and roles.  

First of all, let’s turn to some fundamental concepts. The classes “ChemicalElement”, “MolecularEntity”, and “ChemicalSpecies” are all subclasses of “snap:Object”. The class “Object” in the BFO is defined as a “material entity [snap:MaterialEntity] that is spatially extended, maximally self-connected and self-contained (the parts of a substance are not separated from each other by spatial gaps) and possesses an internal unity. The identity of substantial object [snap:Object] entities is independent of that of other entities and can be maintained through time.” Various disjointness axioms specify the fact that “MolecularEntities” are not the same as “ChemicalSpecies”, thus addressing some of the fundamental issues about the relationship between molecules and substances etc.

Further axioms on these classes specify other necessary parthood relationships: “ChemicalSpecies” are composed of molecules, of other ChemicalSpecies (thus giving recursion and allowing the modelling of formulations) or of BulkChemicalElements:

ChemistryOntology:ChemicalSpecies
      a       owl:Class ;
      rdfs:comment "An ensemble of chemically identical molecular entities that can explore the same set of molecular energy levels on the time scale of the experiment."@en ;
      rdfs:subClassOf snap:Object ;
      rdfs:subClassOf
              [ a       owl:Class ;
                owl:unionOf ([ a       owl:Restriction ;
                            owl:onProperty ChemistryOntology:hasPart ;
                            owl:someValuesFrom ChemistryOntology:MolecularEntity
                          ] [ a       owl:Restriction ;
                            owl:onProperty ChemistryOntology:hasPart ;
                            owl:someValuesFrom ChemistryOntology:ChemicalSpecies
                          ] [ a       owl:Restriction ;
                            owl:hasValue ChemistryOntology:BulkChemicalElement ;
                            owl:onProperty ChemistryOntology:hasPart
                          ])
              ] ;
      rdfs:subClassOf
              [ a       owl:Restriction ;
                owl:onProperty ChemistryOntology:preseentInAmount ;
                owl:someValuesFrom xsd:string
              ] ;
      rdfs:subClassOf
              [ a       owl:Restriction ;
                owl:onProperty ChemAxiomProp:hasProperty ;
                owl:someValuesFrom ChemAxiomProp:Property
              ] ;
      owl:disjointWith ChemistryOntology:ChemicalElement , ChemistryOntology:MolecularEntity .

When integrated with ChemAxiomProp (as has been done in ChemAxiomContinuants), ChemicalSpecies can be connected up to their properties and other statements which one might wish to make about chemical species.
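The recursive parthood axiom on ChemicalSpecies (a species has parts that are molecular entities, other species or bulk elements) can be mimicked in ordinary code to see why it allows formulations to be modelled. What follows is only a rough Python analogy under a closed-world reading, not a substitute for the OWL semantics; the class layout, the `molecular_entities` helper and the example mixture are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Set, Union


@dataclass(frozen=True)
class MolecularEntity:
    name: str


@dataclass(frozen=True)
class BulkChemicalElement:
    symbol: str


@dataclass
class ChemicalSpecies:
    name: str
    # Parts may be molecules, bulk elements or other species (the recursion).
    parts: List[Union["ChemicalSpecies", MolecularEntity, BulkChemicalElement]] = field(
        default_factory=list
    )

    def __post_init__(self):
        # Loosely mirrors the someValuesFrom restrictions: at least one part is required.
        if not self.parts:
            raise ValueError(f"{self.name}: a chemical species needs at least one part")


def molecular_entities(species: ChemicalSpecies) -> Set[str]:
    """Collect the names of all molecular entities, descending into nested species."""
    found = set()
    for part in species.parts:
        if isinstance(part, MolecularEntity):
            found.add(part.name)
        elif isinstance(part, ChemicalSpecies):
            found |= molecular_entities(part)
    return found


water = ChemicalSpecies("water", [MolecularEntity("H2O")])
ethanol = ChemicalSpecies("ethanol", [MolecularEntity("C2H5OH")])
mixture = ChemicalSpecies("ethanol in water", [water, ethanol])

print(sorted(molecular_entities(mixture)))
```

Note that an OWL reasoner works under the open-world assumption, so a missing part would not be an error there; the `ValueError` is a simplification of this sketch.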

Another part of ChemAxiomChemDomain is the definition of roles: generic types of ChemicalSpecies, such as solvents, acids and catalysts, can be defined in terms of roles, because no molecule is ever only a solvent, an acid or a catalyst. Rather, these categories are realisable entities; a molecular species or a chemical entity behaves as a catalyst, a nucleophile or a solvent under certain circumstances:

ChemistryOntology:NucleophileMolecule
      a       owl:Class ;
      rdfs:subClassOf ChemistryOntology:MolecularEntity ;
      owl:disjointWith ChemistryOntology:ElectrophileMolecule ;
      owl:equivalentClass
              [ a       owl:Class ;
                owl:intersectionOf (ChemistryOntology:MolecularEntity [ a       owl:Restriction ;
                            owl:onProperty ChemistryOntology:hasRole ;
                            owl:someValuesFrom ChemistryOntology:NucleophileRole
                          ])
              ] .
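
The equivalent-class axiom above says that anything which is a MolecularEntity and has some NucleophileRole is a NucleophileMolecule. A reasoner would infer that membership automatically; the Python fragment below is only a hand-written, closed-world imitation of that inference, with the class layout and the role strings chosen purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class MolecularEntity:
    name: str
    # Roles the entity bears in a given context (hypothetical string labels).
    roles: Set[str] = field(default_factory=set)


def is_nucleophile_molecule(entity: object) -> bool:
    """Mimics: NucleophileMolecule == MolecularEntity and (hasRole some NucleophileRole)."""
    return isinstance(entity, MolecularEntity) and "NucleophileRole" in entity.roles


# The same molecule, described in two different contexts.
ammonia_in_substitution = MolecularEntity("ammonia", {"NucleophileRole"})
ammonia_as_such = MolecularEntity("ammonia")

print(is_nucleophile_molecule(ammonia_in_substitution))
print(is_nucleophile_molecule(ammonia_as_such))
```

The point of the design is visible even in this toy version: being a nucleophile is not baked into the molecule's type but follows from the role it bears.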

Furthermore, roles in combination with MolecularEntity or ChemicalSpecies allow the definition of generic molecules or substances, such as acids (hydrochloric acid) and acids (proton donor), catalysts, solvents etc. At the moment, the number of axioms is small; however, as the body of axioms grows in the future, it can be expected that ChemAxiom will become more and more useful for the disambiguation of concepts: while it would make sense for a chemical species, which is an acid, to talk about a pH value, it would not make sense to speak of “molecular acids” in the same terms.

Finally, OWL’s model of classes as collections of instances models the things we need to model really well: the classes “ChemicalSpecies” and “MolecularEntity” and their respective subclasses can be thought of as representing the platonic ideals of molecules or substances, whereas instances of these classes can be thought of as representing “real” samples of both molecules (e.g. a single molecule in, for example, matrix isolation) and substances (100 ml of HCl in a flask).

So much for ChemAxiomChemDomain for now. It is the beginning of a domain model and very much driven by the use-case I outlined in a previous blog post. Obviously, we would like to expand the scope of this particular ontology to be more universally useful in the future. However, I believe that rather than doing this via random ontological engineering, this should be driven by use-cases. So if you have use-cases in mind, please get in touch and let’s discuss how we can collaborate.

Tags and automatic links, as always, by Zemanta.

Data-Rich Publishing

I have been insanely busy recently with trips and papers and corrections and…etc…and only now have a bit of time to catch up with some of my feeds and people’s blog posts. One post which caught my eye was Egon’s recent blog post about data-rich or data-centric publishing, in which he argues strongly for a new kind of publishing: a publishing in which data is treated as a first class citizen and which allows/requires an author to not just publish the words of a paper, but his research data too and to publish it in such a way that the barrier to access by machines is low.

This reminded me of what I thought was a particularly tragic case, which I blogged about a while ago here. In this particular case, industrious researchers had synthesized an incredible 630 polystyrene copolymers and recorded their Raman spectra. Now this is more than a crying shame: a lot of work has gone into producing the polymers and recording the data. And I ask you (provided you are a materials scientist and have an interest in such things), when was the last time that YOU came across such a large and rich library of polymers together with their spectral data? And through no fault of their own, the only way these authors saw to publish their data was in the form of a PDF archive in the supplemental information.

Now Egon’s point was that newly formed journals – and in particular newly formed Journals of Chemoinformatics – have the opportunity to do something fundamentally good and wholesome: namely to change the way in which data publication is being accomplished and to give scientists BETTER tools to deal with and disseminate their data. This long and rambly blogpost is my way of violently agreeing with Egon: I believe that THIS is where an awful lot of the added value of the journal of the future will lie. This will be even more true, as successive generations of scientists will start to become more data savvy: last week I talked to a collaborator of ours who had just put in for some funding to train chemistry students in both chemistry and informatics: a whole dedicated course. Now once these students start their own scientific careers, they will both care and know about science and scientific data. And if I were a publisher, I would want to have something to offer them….

ChemAxiom: An Ontology for Chemistry 2. The Set-Up

Now that I have introduced at least some of the motivation behind ChemAxiom, let me outline some of the mechanics.

ChemAxiom is a collective term for a set of ontologies, all of which make a start at describing subdomains within chemistry. The ontology modules are independent and self-contained and can (largely) be developed separately and concurrently. Although they are independent, they are interoperable and integrated via a common upper ontology – in the case of ChemAxiom, we have chosen the Basic Formal Ontology (BFO). I will blog the reasons for this choice in the next post.

The ontologies are currently in various stages of axiomatisation, depending on how long we have been working on them and how much we have had a chance to play. So if there are axioms missing that you think should be there, or if you agree/disagree with some of our design decisions, please let us know. In any case, the discussion has already started with some helpful comments over on the Google Group. Let me describe the various modules in greater detail:

The Reasons for Modularity: When developing ontologies, it is always tempting to develop the ueber-McDaddy-ontology-of-everything, because, of course, ontology development is, by definition, never done: we always need more than we have – more terms, more axioms etc. Very quickly, this can result in monstrously large and virtually unmaintainable constructs. Modularisation has, from our perspective, the advantage of (a) smaller and more manageable ontologies, (b) ontologies which are easier to maintain, (c) ontologies which can be developed in parallel or orthogonally and subsequently integrated using either a common upper ontology or mapping/rules etc. Furthermore, if refactoring of ontologies is necessary during the development process, this is also facilitated by modularity: changes in one module have less chance of forcing changes in another module.

The General Use Case: One of the things we are particularly interested in here in Cambridge is the extraction of chemical entities and data from text, and Peter Corbett’s OSCAR is now fairly well established within the chemical informatics community. Our text sources vary widely, and can range from standard chemical papers to theses, blogs and Wikipedia pages. To give you an impression of the types of data we are talking about, here is an example: Wikipedia’s infobox for benzene (somewhat truncated):

 

[Image: Wikipedia infobox for benzene]

So we have to deal with names, identifiers of various types, physico-chemical property data as well as the corresponding metadata (e.g. measurement pressures, measurement temperatures etc.), and chemical structure (InChI, SMILES). Our ontologies should enable us to generate RDF that allows us to hold this data – the ontology here serves as a schema. While we are interested in reasoning/using reasoners for the purposes of (retrospective) typing (again, I will explain what I mean by that in subsequent blog posts), applying ontologies to the description of chemical data is our first use-case.
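To make "the ontology serves as a schema" a little more concrete, here is a minimal sketch that turns an infobox-like record into N-Triples lines using only the Python standard library. The namespace `http://example.org/chemaxiom-sketch#` and every property and class name in it (`hasProperty`, `value`, `unit`, `measuredAtPressure_kPa`, `MeltingPoint`, `BoilingPoint`) are placeholders invented for this illustration and are not the actual ChemAxiom terms; the benzene values are the commonly quoted ones.

```python
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
EX = "http://example.org/chemaxiom-sketch#"  # hypothetical namespace


def iri(local_name: str) -> str:
    return f"<{EX}{local_name}>"


def literal(value) -> str:
    # Minimal escaping for N-Triples string literals.
    text = str(value).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{text}"'


def to_ntriples(subject: str, record: dict) -> list:
    """Serialise names, identifiers and measurements (with metadata) as triples."""
    s = iri(subject)
    lines = []
    for key in ("name", "smiles", "inchi"):
        if key in record:
            lines.append(f"{s} {iri(key)} {literal(record[key])} .")
    for i, m in enumerate(record.get("measurements", [])):
        node = iri(f"{subject}_measurement{i}")
        lines.append(f"{s} {iri('hasProperty')} {node} .")
        lines.append(f"{node} {RDF_TYPE} {iri(m['property'])} .")
        lines.append(f"{node} {iri('value')} {literal(m['value'])} .")
        lines.append(f"{node} {iri('unit')} {literal(m['unit'])} .")
        if "pressure_kPa" in m:  # measurement metadata travels with the value
            lines.append(
                f"{node} {iri('measuredAtPressure_kPa')} {literal(m['pressure_kPa'])} ."
            )
    return lines


benzene = {
    "name": "benzene",
    "smiles": "c1ccccc1",
    "inchi": "InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H",
    "measurements": [
        {"property": "MeltingPoint", "value": 5.5, "unit": "degC"},
        {"property": "BoilingPoint", "value": 80.1, "unit": "degC",
         "pressure_kPa": 101.325},
    ],
}

for line in to_ntriples("benzene", benzene):
    print(line)
```

Each measurement gets its own node so that metadata such as the pressure can be attached to the measured value rather than to the substance itself.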

With all of that said, let me provide a quick summary of the modules:

Chemistry Domain Ontology – ChemAxiomDomain. ChemAxiomDomain is the first module in the set. It is currently a small ontology, which clarifies some fundamental relationships in the chemistry domain. Key concepts in this ontology are “ChemicalElement”, “ChemicalSpecies” and “MolecularEntity” as well as “Role”. ChemAxiomDomain clarifies the relationships between these terms (see my previous blog post) and also deals with identifiers etc. Chemical roles too are important: while chemical entities may be or act as nucleophiles, acids, solvents etc. some of the time, they do not have these roles all of the time – roles are realisable entities and ChemAxiomDomain provides a mechanism for dealing with that. There are few other high-level domain concepts in there at the moment, though obviously we are looking to expand as and when the need arises and use-cases are provided. I will blog some details in a subsequent blog post.

Properties Ontology – ChemAxiomProp. ChemAxiomProp is an ontology of over 150 chemical and materials properties, together with a first set of definitions and symbols (where available and appropriate) and some axioms for typing of properties. Again, details will follow in a subsequent blog post.

Measurement Techniques – ChemAxiomMetrology. This is an ontology of over 200 measurement techniques and also contains a list of instrument parts and axioms for typing of measurement techniques. It does not currently include information about minimum information requirements for measurement techniques (e.g. the measurement of a boiling point also requires a measurement of pressure) and other metadata, but this will be added at a later stage. Again, a detailed blog-post will follow.

ChemAxiomPoly and ChemAxiomPolyClass – These two ontologies contain terms which are in common use across polymer science as well as a taxonomy of polymers based on the composition of their backbone (though the latter is not axiomatised yet). Details will follow in a further blog post.

ChemAxiomMeta – ChemAxiomMeta is a developing ontology that will allow the specification of provenance of data (e.g. data derived from wiki pages etc.) and will also define what a journal, journal article, thesis, thesis chapter etc. is and what the relationships between these entities are. We have not released this yet. Details will follow in a further blog post.

ChemAxiomContinuants – ChemAxiomContinuants represents an integration of all the above sub-ontologies into an ontological framework for chemical continuants (with some occurrents mixed in when we need to talk about measurement techniques). Details will follow in a further blog post.

We have also started to work on ontologies of chemical reactions, actions and, as mentioned above, minimum information requirements – however, these are at a relatively early stage of development and hence not released yet.

So much for a short overview of the mechanics of the ontologies. I am sure there are a thousand other things I should have said, but that will have to do for now. Comments and suggestions via the usual channels. Automatic links and tags, as always, by Zemanta.

ChemAxiom: An Ontology for Chemistry – 1. The Motivation

I have already announced the fact that we are working on ontologies in the polymer domain some time ago, though I realise that so far, I have yet to produce the proof of that: the actual ontology/ontologies.

So today I am happy to announce that the time of vapourware is over and that we have released ChemAxiom – a modular set of ontologies, which form the first ontological framework for chemistry (or at least so we believe). The development of these ontologies has taken us a while: I started this on a hunch and as a nice intellectual exercise, not entirely sure where to go with them and what to use them for, and therefore not working on them full time. As the work progressed, however, we understood just how inordinately useful they would be for doing what we are trying to accomplish in both polymer informatics and chemical informatics at large. I will introduce and discuss the ontologies in a succession of blog posts, of which this is the first one.

So what, though maybe somewhat retrospectively, was the motivation for the preparation of the ontologies? In short – the breakdown of many common chemistry information systems when confronted with real chemical phenomena rather than small subsections of idealised abstractions. Let me explain.

Chemistry and chemical information systems positively thrive on the use of a connection table as a chemical identifier and determinant of uniqueness. The reasons for this are fairly clear: chemistry, for the past 100 years or so, has elevated the (potential) correlation between the chemical structure of a molecule and its physicochemical and biological properties to be its “central dogma.” The application of this dogma has served subsections of the community – notably organic/medicinal/biological chemists – incredibly well, while causing major headaches for other parts of the chemistry community and giving an outright migraine to information scientists and researchers. There are several reasons for the pain:

The use of a connection table as an identifier for chemical objects leads to significant ontological confusion. Often, chemists and their information systems do not realise that there is a fundamental distinction between (a) the platonic idea of a molecule, (b) the idea of a bulk substance and (c) an instance of a bulk substance (“the real bulk substance”) in a flask or bottle on the researcher’s lab bench. An example of this is the association of a physicochemical property of a chemical entity with a structure representation of a molecule: while it would, for example, make sense to do this for a HOMO energy, it does NOT make sense to speak of a melting point or a boiling point in terms of a molecule. The point here simply is that many physicochemical properties are the mereological sums of the properties of many molecules in an ensemble. If this is true for simple properties of pure small molecules, it is even more true for properties of complex systems such as polymers, which are ensembles of many different molecules of many different architectures. A similar argument can also be made for identifiers: in most chemical information systems, it is often not clear whether the identifier (such as a CAS number etc.) refers to a molecule or a substance composed of these molecules.

Many chemical objects have temporal characteristics. Often, chemical objects have temporal characteristics, which influence and determine their connection table. A typical example for this are rapidly interconverting isomers: glucose, when dissolved in water, for example, can be described by several rapidly interconverting structures – a single connection table is not enough to describe the concept “glucose in water” and there exists a parthood relationship between the concept and several possible connection tables. Ontologies can help with specifying and defining these parthood relationships.

There is another aspect to time dependence we also need to consider. For many materials, their existence in time, or, put another way, their history, often holds more meaningful information about an observed physical property of that substance than the chemical structure of one of the components of the mixture. For an observable property of a polymer, such as the glass transition temperature, for example, it matters a great deal whether the polymer was synthesized on the solid phase in a pressure autoclave or in solution at ambient pressure. Furthermore, it matters whether and how a polymer was processed – how it was extruded, grafted etc. All of these processes have a significant influence on the observable physical properties of a bulk sample of this polymer, while leaving the chemical description of the material essentially unchanged (in current practice, polyethylene is often represented either by using the structure of the corresponding repeat unit (ethene, for example) or the structure of a repeat unit fragment (-CH2-CH2-)). Ontologies will help us to describe and define these histories. Ultimately, we envisage that this will result in a “semantic fingerprint” of a material, which – one might speculate – will be much more appropriate for the development of design rules for materials than the dumb structure representations in use today.

Many chemical objects are mixtures….and mixtures simply do not lend themselves to being described using the connection table of a single constituent entity of that mixture. If this is true for glucose in water, it is even truer for things such as polymers: polymers are mixtures of many different macromolecules, all of which have slightly different architectures etc. An observed physical property, and therefore a data object, is the mereological sum of the contributions made by all the constituent macromolecules and therefore, such a data object cannot simply be associated with a single connection table.
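A standard textbook illustration of this point is the molar mass of a polymer sample, which is only defined as an average over the whole ensemble of chains: the number average Mn = ΣNᵢMᵢ / ΣNᵢ and the weight average Mw = ΣNᵢMᵢ² / ΣNᵢMᵢ. The short sketch below computes both for a made-up sample of two chain populations; the function names and the numbers are assumptions for illustration only.

```python
def number_average(chains):
    """Mn = sum(N_i * M_i) / sum(N_i) for (molar_mass, count) pairs."""
    total_count = sum(count for _, count in chains)
    return sum(mass * count for mass, count in chains) / total_count


def weight_average(chains):
    """Mw = sum(N_i * M_i**2) / sum(N_i * M_i) for (molar_mass, count) pairs."""
    total_mass = sum(mass * count for mass, count in chains)
    return sum(mass * mass * count for mass, count in chains) / total_mass


# Made-up sample: equal numbers of chains of 10 000 and 20 000 g/mol.
sample = [(10_000, 1), (20_000, 1)]

mn = number_average(sample)
mw = weight_average(sample)
print(mn, mw, mw / mn)  # the ratio Mw/Mn is the dispersity of the sample
```

Neither average is a property of any single chain in the sample, which is why attaching such a value to one connection table is misleading.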

This, in my view, is a short summary of the case for ontology in chemistry. Please feel free to violently (dis-)agree and if you want to do so, I am looking forward to a discussion in the comments section.

There’s one more thing:

AN INVITATION

The ChemAxiom ontologies are far from perfect and far from finished. We hope that they show what an ontological framework for chemistry could look like. In developing these ontologies, we can contribute our particular point of view, but we would like to hear yours. Even more, we would like to invite the community to get involved in the development of these ontologies in order to make them a general and valuable resource. If you would like to become involved, then please send an email to chemaxiom at googlemail dot com or leave comments/questions etc. in the ChemAxiom Google Group.

In the next several blog posts, I will dive into some of the technical details of the ontologies.

(Automatic Links etc., as always, by Zemanta)

Semantic Universe Website with Chemical/Polymer Informatics Contributions now Live

Over on Twitter, Semantic Universe has just announced the relaunch of their website. The purpose of the site is “to educate the world about semantic technologies and applications.”

To quote from the website:

“Semantic Universe and Cerebra today announced the launch of the “Semantic Universe Network”, a vibrant educational and networking hub for the global semantic technology marketplace. Semantic Universe Network will be the educational and information resource for the people and companies within the high-growth semantics sector, covering the latest news, opinions, events, announcements, products, solutions, promotions and research in the industry.”

As part of the re-launch, both Lezan Hawizy and I have written two short contributions reviewing the state of Semantic Chemistry and showcasing our work on how semantification of chemistry can happen. The contributions were intended to be short “how to..s” and as such are written in a somewhat chatty style. Here are the links:

Semantic Chemistry

The Semantification of Chemistry

Feedback is welcome.
