Reading the Tea Leaves of 2011 – Data and Technology Predictions for the Year Ahead

The beginning of a new year usually affords the opportunity to join in the prediction game and to think about which topics will not only be on our radar screens in the coming year, but may dominate it. I couldn’t help but attempt to do the same in my particular line of work – if for no other reason than to see how wrong I was when I look at this again at the beginning of 2012. Here is what I think will be at least some of the big technology and data topics in 2011:

1. Big, big, big Data
2010 has been an extraordinary year when it comes to data availability. Traditional big data producers such as biology continue to generate vast amounts of sequencing and other data. Government data is pouring in from all over the world, be it here in the United Kingdom or in the United States, and efforts to liberate and obtain government data are also starting in other countries. The Linked Open Data Cloud is growing steadily:

Linked Open Data October 2007 – Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/

Linked Open Data September 2010 – Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/

The current Linked Data cloud contains about 20 billion triples. Britain now has, thanks to the Open Knowledge Foundation, an open bibliography. The Guardian’s Datastore is a wonderful example of a commercial company making data available. The New York Times is making an annotated corpus available. Twitter and other user-generated content also provide significant data firehoses from which one can drink and build interesting mashups and applications, such as Ben Marsh’s UK Snow Map. These are just some examples of big data, and there are several issues associated with it that will occupy us in 2011.

2. Curation and Scalability
A lot of the big data we are talking about is “real-world” and messy. There is no nice underlying ontological model (the stuff that I am so fond of) and by necessity it is exceptionally noisy. Extracting a signal out of clean data is hard enough, but getting one out of messy data requires a great deal of effort and an even greater deal of care. The development of curation tools and methodologies – both automated and social – will therefore continue to be high up on the agenda of the data scientist. And yes, I do believe that this effort is going to become a lot more social – there are signs of this starting to happen everywhere.
However, we are now generating so much data that the sheer amount is starting to outstrip our ability to compute it – and therefore scalability will become an issue. The fact that service providers such as Amazon are offering Cluster GPU Instances as part of the EC2 offering is highly significant in this respect. MapReduce technologies seem to be extremely popular in “Web 2.0” companies and the Hadoop ecosystem is growing extremely fast – and the ability to “make Hadoop your bitch”, as an acquaintance of mine recently put it, seems to be an in-demand skill at the moment and, I think, for the foreseeable future. And – needless to say – successful automated curation of big data, too, requires scalable computing.
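To make the pattern concrete, here is a toy, single-process sketch of the MapReduce idea (the classic word count). This is my own illustration, not tied to Hadoop or any particular cluster – a real job would run the same map and reduce logic over HDFS splits, e.g. via Hadoop Streaming.

```python
# Toy illustration of the MapReduce pattern (word count), run in one process.
from collections import defaultdict
from itertools import chain

def map_phase(line):
    """Map step: emit one (word, 1) pair per word in the line."""
    for word in line.lower().split():
        yield word, 1

def reduce_phase(key, values):
    """Reduce step: combine all values seen for a key."""
    return key, sum(values)

lines = ["big data needs big infrastructure", "curation needs scalable compute"]

# "Shuffle" step: group mapper output by key before reducing.
grouped = defaultdict(list)
for key, value in chain.from_iterable(map_phase(line) for line in lines):
    grouped[key].append(value)

counts = dict(reduce_phase(k, v) for k, v in grouped.items())
print(counts)   # e.g. {'big': 2, 'data': 1, 'needs': 2, ...}
```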

3. Discovery
Having a lot of datasets available to play with is wonderful, but what if nobody knows they are there? Even in science, it is still much, much harder to discover datasets than it ought to be. And even once you have found what you may have been looking for, it is hard to decide whether it really was what you were looking for – descriptive metadata is often extremely poor or not available at all. There is currently little collaboration between information and data providers. Data marketplaces such as Infochimps, Factual, Public Datasets on Amazon AWS or the Talis Connected Commons (to name but a few) are springing up, but there is still a lot of work to do. And is it just me, or is science – the very enterprise whose primary product is data and knowledge – lagging far behind in developing these marketplaces? Maybe they will develop as part of a change in the scholarly publication landscape (journals such as Open Research Computation have a chance of leading the way here), but it is too early to tell. The increasing availability of data will push this topic further onto the agenda in 2011.

4. An Impassioned Plea for Small Data
One thing that will unfortunately not be on the agenda much is small data. Of course it won’t matter to you if you do stuff at web scale or if you work in genomics. However, looking back at my past existence as a laboratory-based chemist in an academic lab, a significant amount of valuable data is produced by the lone research student who is the only one working on his project, or by a small research group in a much larger department. Although there is a trend in academia towards large-scale projects and away from individual small grants, small-scale data production in small research projects is still the reality in a significant number of laboratories the world over. And the only time this data gets published is as a mangled PDF document in some journal supplementary – and as such it is dead. Sometimes perfectly good data never gets published at all: in my previous workplace we found that our in-house crystallographer was sitting on several thousand structures which were perfectly good and publishable, but had, for various reasons, never been published. And usually it is data that has been produced at great cost to both the funder and the student. Now, small data like this is not sexy per se. But if you manage to collect lots of small data from lots of small laboratories, it becomes big data. So my plea would simply be not to forget small data: to build systems which collect, curate and publish it and make it available to the world. It’ll be harder to convince funders, institutions and often the researchers themselves to engage with it. But please let’s not forget it – it’s valuable.

Enough soothsaying for one blog post. But let’s get the discussion going – what are your data and technology predictions for 2011?


Visualisation of Ontologies and Large Scale Graphs

[Image via Wikipedia: a phylogenetic tree of life]

For a whole number of reasons, I am currently looking into the visualisation of large-scale graphs and ontologies and to that end, I have made some notes concerning tools and concepts which might be useful for others. Here they are:

Visualisation by Node-Link and Tree

jOWL: jQuery Plugin for the navigation and visualisation of OWL ontologies and RDFS documents. Visualisations mainly as trees, navigation bars.

OntoViz: Plugin for Protege…at the moment supports Protege 3.4 and doesn’t seem to work with Protege 4.

IsaViz: Much the same as OntoViz really. Last stable version 2004 and does not seem to see active development.

NeOn Toolkit: The NeOn Toolkit also has some visualisation capability, but not independently of the editor. Under active development with a growing user base.

OntoTrack: OntoTrack is a graphical OWL editor and as such has visualisation capabilities. These are meagre, though, and it does not seem to be supported or developed anymore either…the current version is about five years old.

Cone Trees: Cone trees are three-dimensional extensions of 2D tree structures and have been designed to allow a greater amount of information to be visualised and navigated. I have not found any software for download at the moment, but the idea is so interesting that we should bear it in mind. Examples are here and here, and the key reference is Robertson, George G., Mackinlay, Jock D. and Card, Stuart K., Cone Trees: animated 3D visualizations of hierarchical information, CHI ’91: Proceedings of the SIGCHI conference on Human factors in computing systems, 1991, ISBN 0-89791-383-3, pp. 189-194. (DOI here)

PhyloWidget: PhyloWidget is software for the visualisation of phylogenetic trees, but should be repurposable for ontology trees. Javascript – so appropriate for websites. Student project as part of the Phyloinformatics Summer of Code 2007.

The JavaScript Information Visualization Toolkit: Extremely pretty JS toolkit for the visualisation of graphs etc…..Dynamic and interactive visualisations too…just pretty. Have spent some time hacking with it and I am becoming a fan.

Welkin: Standalone application for the visualisation of RDF graphs. Allows dynamic filtering, colour coding of resources etc…
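For quick experiments, the same node-link idea can be sketched in a few lines of Python with rdflib, networkx and matplotlib. This is not Welkin, just an illustrative alternative of my own; the input file name is made up.

```python
# Minimal sketch: load an RDF file and draw it as a node-link diagram.
import networkx as nx
import matplotlib.pyplot as plt
from rdflib import Graph

g = Graph()
g.parse("ontology.rdf", format="xml")   # hypothetical input, assumed RDF/XML

G = nx.DiGraph()
for s, p, o in g:                       # one edge per triple, predicate as label
    G.add_edge(str(s), str(o), label=str(p))

nx.draw(G, with_labels=True, node_size=300, font_size=6)
plt.savefig("graph.png", dpi=150)
```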

Three-Dimensional Visualisation

Ontosphere3D: Visualisation of ontologies on 3D spheres. Does not seem to be supported anymore and requires Java 3D, which is just a bad nightmare in itself.

Cone Trees (see above) with their extension of Disc Trees (for an example of disc trees, see here).

3D Hyperbolic Tree as exemplified by the Walrus software. Originally developed for website visualisation, it results in stunning images. Not under active development anymore, but the source code is available for download.

Cytoscape: The 1000 pound gorilla in the room of large-scale graph visualization. There are several plugins available for interaction with the Gene Ontology, such as BiNGO and ClueGO. Both tools treat the ontologies as annotation rather than as a knowledge base in their own right and can be used to identify GO terms which are overrepresented in a cluster/network. For the visualisation of ontologies themselves, there is the RDFScape plugin, which can visualize ontologies.

Zoomable Visualisations

Jambalaya – Protege plugin, but can also run as a browser applet. Uses SHriMP to visualise class hierarchies in ontologies, with arrows between boxes representing relationships.

CropCircles (link is to the paper describing it): CropCircles have been implemented in the SWOOP ontology editor, which is no longer under active development, although the source code is available.

Information Landscapes – again, no software, just papers.


Exploring Chemical Space with GDB – Jean-Louis Reymond (University of Bern)

[Image via Wikipedia: three molecules]

(These are live notes from a talk Prof Reymond gave at EBI today)

The GDB Database

GDB = Generated Database (of Molecules)

The Chemical Universe Project – how many small molecules are possible?

GDB was put together by starting from graphs – in this case the graphs were hydrocarbon skeletons – and using the GENG software to enumerate all possible graphs (after predefining which graphs are chemically reasonable and incorporating bonding information etc.). Then place atoms, enumerate, get a combinatorial explosion of compounds and apply filters to remove chemical impossibilities: the result is a couple of billion compounds.
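To give a flavour of the graph-first approach, here is a toy reconstruction of my own – not the actual GDB/GENG pipeline – that takes one small skeleton, enumerates element assignments over it, and throws away chemically impossible or heteroatom-heavy combinations using RDKit.

```python
# Toy graph-based enumeration in the spirit of GDB (illustrative only).
from itertools import product
from rdkit import Chem

# Hypothetical skeleton: a branched five-node graph given as an edge list.
edges = [(0, 1), (1, 2), (2, 3), (1, 4)]
n_atoms = 5
elements = ["C", "N", "O"]            # allowed atom types for this toy example

seen = set()
for assignment in product(elements, repeat=n_atoms):
    mol = Chem.RWMol()
    for symbol in assignment:
        mol.AddAtom(Chem.Atom(symbol))
    for i, j in edges:
        mol.AddBond(i, j, Chem.BondType.SINGLE)
    try:
        Chem.SanitizeMol(mol)          # valence check removes impossible atoms
    except Exception:
        continue
    # Crude stand-in for GDB-style filters, e.g. "too many heteroatoms":
    if sum(s != "C" for s in assignment) > 2:
        continue
    seen.add(Chem.MolToSmiles(mol))    # canonical SMILES for deduplication

print(len(seen), "unique molecules for this skeleton")
```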


Some choices restricting diversity: no allenes, no double bonds at bridgeheads, exclusion of problematic heteroatom constellations (peroxides were not considered) and of hydrolytically labile functional groups.

In general – number of possible molecules increases exponentially with increasing number of nodes.

Showing that the molecular diversity increases with linear open carbon skeletons – cyclic graphs have fewer substitution possibilities. Chiral compounds offer more diversity than non-chiral ones.


GDB Website


Now talking about GDB13:

removed fluorine, introduced sulphur, filtered for molecules with “too many” heteroatoms – due to synthetic difficulties and the fact they may be of lesser interest to medchem.

Now showing statistical analysis of molecular types in GDB. 95% of all marketed drugs violate at least two Lipinski Rules. All molecules in the GDB13 are Lipinski conformant.

Use case: take a known drug and find isomers. There are approx. 180 compounds in GDB similar to aspirin with a Tanimoto score > 0.7. Points out that many of these molecules may never have been imagined by chemists.
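Here is a rough sketch (mine, not from the talk) of that kind of similarity comparison using RDKit; the candidate SMILES are made up, and the talk may well have used a different fingerprint than the Morgan fingerprints below.

```python
# Sketch: Tanimoto similarity of candidate molecules against aspirin.
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem

aspirin = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")
candidates = ["OC(=O)c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)N"]   # hypothetical

fp_ref = AllChem.GetMorganFingerprintAsBitVect(aspirin, 2, nBits=2048)
for smi in candidates:
    mol = Chem.MolFromSmiles(smi)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)
    sim = DataStructs.TanimotoSimilarity(fp_ref, fp)
    if sim > 0.7:                      # threshold mentioned in the talk
        print(smi, round(sim, 2))
```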


GDB15 is just out – corrected some bugs, eliminated enol ethers (due to quick hydrolysis), optimized CPU usage… approx. 26 billion molecules, 1.4 TB – counting them takes a day.


Applications of the Database – mainly GDB 11

Use case: Glutamatergic Synapse Binding

Used a Bayesian classifier trained with known actives and then used it to retrieve about 11,000 molecules from GDB11. This was followed by high-throughput docking – 22 compounds were selected for lab testing. Enrichment of glycine-containing compounds. Now showing some activity data for selected compounds.
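As an illustration of this kind of workflow – my own sketch, not the group’s actual pipeline – a Bayesian classifier over fingerprints can be trained on known actives and decoys and then used to rank a library; all SMILES below are placeholders.

```python
# Sketch: rank a library with a Bayesian classifier over fingerprints.
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.naive_bayes import BernoulliNB

def fingerprint(smiles, n_bits=1024):
    """Morgan fingerprint as a numpy bit vector."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)
    arr = np.zeros((n_bits,), dtype=np.int8)
    DataStructs.ConvertToNumpyArray(fp, arr)
    return arr

actives   = ["NCC(=O)O", "NC(CC(=O)O)C(=O)O"]     # placeholder known actives
inactives = ["c1ccccc1", "CCCCCC"]                 # placeholder decoys
library   = ["NCC(=O)N", "CCO", "NC(C)C(=O)O"]     # stand-in for a GDB subset

X = np.array([fingerprint(s) for s in actives + inactives])
y = np.array([1] * len(actives) + [0] * len(inactives))

clf = BernoulliNB().fit(X, y)
scores = clf.predict_proba(np.array([fingerprint(s) for s in library]))[:, 1]
for smi, p in sorted(zip(library, scores), key=lambda t: -t[1]):
    print(f"{smi}\t{p:.2f}")   # top-ranked molecules would go on to docking
```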

Use case: Glutamate Transporter: applied certain structural selection criteria to database molecules to obtain a subset of approx 250 k compounds. Again followed by HT docking. Now showing syntheses of some selected candidate structures together with screening data.


“Molecular Quantum Numbers”

Classification system for large compound databases. Draws an analogy to the periodic table, which is a classification system for the elements; we do not have something like this for molecules. Define features for molecules: atom types, bond types, polarity, topology… 42 categories in total. Now examines the ZINC database against these features: can show that there are common features for molecules occupying similar categories. PCA analysis: the first 2 PCs cover 70% of diversity space; the first PC includes molecular weight. 2D representations are considered acceptable. PCA also shows nice grouping of molecules by number of cycles.
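A toy version of this kind of analysis – using only a handful of simple descriptors, not the actual 42 MQN categories – can be put together with RDKit and scikit-learn.

```python
# Sketch: count-type descriptors projected with PCA (not the real MQN set).
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

smiles = ["CC(=O)Oc1ccccc1C(=O)O", "CCO", "c1ccccc1", "NC(CC(=O)O)C(=O)O"]

def descriptors(mol):
    return [
        mol.GetNumHeavyAtoms(),
        Descriptors.MolWt(mol),
        Lipinski.NumHDonors(mol),
        Lipinski.NumHAcceptors(mol),
        Descriptors.NumRotatableBonds(mol),
        Descriptors.RingCount(mol),
    ]

X = np.array([descriptors(Chem.MolFromSmiles(s)) for s in smiles])
X = StandardScaler().fit_transform(X)      # put descriptors on the same scale
pca = PCA(n_components=2)
coords = pca.fit_transform(X)
print("variance explained by first two PCs:",
      pca.explained_variance_ratio_.sum())  # the talk reported ~70% for ZINC
```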

Same analysis for GDB 11: the first PCs now mainly account for molecular flexibility and polarity (GDB 11 doesn’t contain many rings due to the atom-count limitation).

Analysis for PubChem – difficult to discover information at the moment.

Was on the cover of ChemMedChem this November.

Shows examples of fishing out structural motif analogies for given molecular motifs.


SWAT4LS2009 – Keynote Alan Ruttenberg: Semantic Web Technology to Support Studying the Relation of HLA Structure Variation to Disease

(These are live-blogging notes from Alan’s keynote…so don’t expect any coherent text….use them as bullet points to follow the gist of the argument.)

The Science Commons:

  • a project of the Creative Commons
  • 6 people
  • Science Commons specializes CC for science
  • information discovery and re-use
  • establish legal clarity around data sharing and encourage automated attribution and provenance

Semantic Web for biologists: because it maximizes the value of scientific work by removing repeat experimentation.

ImmPort Semantic Integration Feasibility Project

  • Immport is an immunology database and analysis portal
  • Goals: meta-analysis
  • Question: how can ontology help data integration for data from many sources?

Using semantics to help integrate sequence features of HLA with disorders
Challenges:

  • Curation of sequence features
  • Linking to disorders
  • Associating allele sequences with peptide structures with nomenclature with secondary structure with human phenotype etc etc etc…

Talks about elements of representation

  • pdb structures translated into ontology-based representations
  • canonical MHC molecule instances constructed from IMGT
  • relate each residue in pdb to the canonical residue, if it exists
  • use existing ontologies
  • contact points between peptide and other chains computed using JMOL following IMGT. Represented as relation between residue instances.
  • Structural features have fiat parts

Connecting Allele Names to Disease Names

  • use papers as join factors: papers mention both disease and allele – noisy
  • use regexes and rewrites applied to titles and abstracts to fish out links between diseases and alleles (a toy sketch of the idea follows below)
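Something like the following toy script captures the idea – my own illustration, not Alan’s actual rules; the allele regex, disease list and abstracts are all made up, and the output is exactly the kind of noisy co-occurrence link acknowledged in the talk.

```python
# Toy regex-over-abstracts extraction of allele/disease co-occurrences.
import re

abstracts = [
    "Association of HLA-B*2705 with ankylosing spondylitis in a cohort study.",
    "HLA-DRB1*0401 and rheumatoid arthritis: a meta-analysis.",
]

# Old-style four-digit allele names, e.g. HLA-B*2705; real nomenclature is messier.
allele_re  = re.compile(r"HLA-[A-Z]+\d*\*\d{4}")
disease_re = re.compile(r"(ankylosing spondylitis|rheumatoid arthritis|type 1 diabetes)",
                        re.IGNORECASE)

pairs = set()
for text in abstracts:
    for allele in allele_re.findall(text):
        for disease in disease_re.findall(text):
            pairs.add((allele, disease.lower()))

print(pairs)   # noisy candidate links between alleles and diseases
```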

Correspondence of molecules with allele structures is difficult.

  • use BLAST to find the closest allele match between pdb and allele sequence
  • every pdb and allele residue has a URI
  • relate matching molecules
  • relate each allele residue to the canonical allele
  • annotate various residues with various coordinate systems

This creates a massive map that can be navigated and queried (a sketch of one such query follows after the list). Example queries:

  • What autoimmune diseases can be indexed against a given allele?
  • What are the variant residues at a position?
  • Classification of amino acids
  • Show alleles perturbed at contacts of 1AGB
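The first of these queries might look roughly like the sketch below. The endpoint URL and all resource/predicate names are placeholders of mine, not the real Neurocommons schema.

```python
# Sketch of a SPARQL query against a triple store of allele/disease links.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://sparql.neurocommons.org/sparql")  # assumed endpoint
sparql.setQuery("""
SELECT DISTINCT ?disease
WHERE {
  # hypothetical predicates and resource URIs, for illustration only
  ?paper <http://example.org/hla#mentionsAllele> <http://example.org/hla#HLA-B_2705> .
  ?paper <http://example.org/hla#mentionsDisease> ?disease .
}
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["disease"]["value"])
```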

Summary of Progress to Date:
Elements of Approach in Place: Structure, Variation, transfer of annotation via alignment, information extraction from literature etc…

Nuts and Bolts:

  • Primary source
  • Local copy of source
  • Scripts transform the local copy to RDF (see the sketch after this list)
  • Export RDF bundles
  • Get selected RDF bundles and load into triple store
  • Parsers generate in-memory structures (Python, Java)
  • Template files are instructions to format these into OWL
  • Modelling is iteratively refined by editing templates
  • RDF loaded into Neurocommons, some amount of reasoning
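The “local copy → script → RDF bundle” step might look something like this minimal sketch; the file names, column names and namespace are invented for illustration and are not the actual ImmPort or Neurocommons layout.

```python
# Sketch: turn a local tab-separated copy of a source into an RDF "bundle".
import csv
from rdflib import Graph, Namespace, Literal

EX = Namespace("http://example.org/immport/")   # placeholder namespace
g = Graph()
g.bind("ex", EX)

with open("allele_disease_links.tsv") as fh:    # hypothetical local copy
    for row in csv.DictReader(fh, delimiter="\t"):
        allele  = EX[row["allele_id"]]
        disease = EX[row["disease_id"]]
        g.add((allele, EX.associatedWith, disease))
        g.add((allele, EX.label, Literal(row["allele_name"])))

# The serialized file plays the role of the "RDF bundle" loaded downstream.
g.serialize(destination="allele_disease_bundle.ttl", format="turtle")
```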

RDFHerd package management for data

neurocommons.org/bundles

Can we reduce the burden of data integration?

  • Too many people are doing data integration – wasting effort
  • Use web as platform
  • Too many ontologies…here’s the social pressure again

Challenges

  • have lawyers bless every bit of data integration
  • reasoning over triple stores
  • SPARQL over HTTP
  • Understand and exploit ontology and reasoning
  • Grow a software ecosystem like Firefox

Tomorrow’s Giants 2 – Dataset Comparison, Data Sharing and Future Literatures

Following my first post from last week, here are more questions that the Royal Society wanted us Cambridge researchers to discuss during the preparatory Tomorrow’s Giants meeting in Cambridge.

How can – and is it appropriate to – facilitate inter-laboratory dataset comparison?
Great that the question was asked. And the answer is yes, of course it is. Not only is it appropriate, it is the very essence of scientific endeavour. What else could be called science? That said, the fact that the question even had to be asked and that the answer is not self-evident is disappointing. What have science and scientists lost by way of attitude and ethics that makes us even ask that question? Yes, admittedly, there may be commercial reasons why this sort of comparison is not desirable. One of the participants in the session was at great pains to point out that there is often commercial interest tied up in data which prevents sharing and re-use, and that is a fair point. However, over the past couple of years I have sat through far too many presentations where the presenter got up and talked about the development of a proprietary model or machine-learning tool using a proprietary dataset and proprietary software. Now that is NOT science – at best it is a piece of local engineering which solves a particular problem for the presenter, but it does not advance human knowledge at all. I, as a fellow scientist, could not pick up any aspect of this work and build upon it, as it is all proprietary. Local engineering at best.

Does the type of data have an impact on the ways it can be shared?
Flippantly speaking: “you betcha”. Again, great that the question was even asked. And the answer is multifaceted, because the question can be read in a number of different ways. It could be read as “does the provenance of the data and the context in which it was generated have an impact on the ways in which it can be shared?” It can also be read as “does the (technical) format the data is in have an impact on the way in which it can be shared?” The answer in both cases is yes. Let’s tackle these two in turn. One of the participants of the workshop worked at the faculty of education and her primary research data consisted of a large collection of interviews she had conducted with children over the course of her work. She believes that this data is valuable to other researchers in her field and would dearly love to share it – but finds herself in a mire of legal and ethical concerns with respect to, for example, the children’s privacy, which effectively prevent her from sharing. So yes, the context in which data is produced and the type of data that is generated can be an obstacle to sharing. If “type of data” is understood to mean “format”, then the answer is also yes. A number of my colleagues have pointed out (see here, for example) the data loss that occurs when documents containing scientific data are converted from the format in which they were produced to pdf (examples are here, here and here). The production of data in vernacular or lossy data formats obviously also has an impact on data sharing – particularly when the sharing and exchange format is lossy.
However, the fact that the question had to be asked at all, and that it went straight over the heads of most scientists at the meeting who do not work in the data business, is intensely disappointing. Laboratory researchers often have no appreciation of what they are doing when they convert their Word documents to pdf. Data science and informatics are not part of the standard curriculum in the education of scientists – something that desperately needs to change if data loss due to ignorance in data handling is to be avoided in the future.

Future literatures in the wider sense i.e. not just how findings are published in journals, but how can interim findings be shared and accessed?
That is a great question and one, as it turns out, that many of the people present at the meeting had already pondered in one form or another. Scientists should not only be assessed on the basis of the journal articles they write, but, for example, also on the (raw) data they publish. However, science has so far not evolved a technical solution to the data publication problem (of course, there isn’t just one solution – there are many, depending on the type of data as well as the specific subject/sub-subject/sub-sub-subject that is producing it). Interim findings are part of this, and systems like Nature Precedings could point the way (although even Nature Precedings does not allow us to deal with data). Obviously, one has to be careful that these do not just become dumping grounds for lower-quality science. Once we have evolved technical solutions for publishing data, the next step will be to develop an ecosystem of metrics. Those metrics should only extend to things like data quality, trust and data provenance. Data “usefulness” – things like citation indices for data – should, I think, not be part of the mix: it is impossible to predict what data will be useful when and under which circumstances (and incidentally, the same is true for papers). In that sense, data usefulness can be as flighty as fashion and should not be a criterion.

There were a few more questions – and I will blog about these in a future post.


Tomorrow’s Giants 1 – Big Data

I recently spent an afternoon at a meeting entitled “Tomorrow’s Giants”, which was jointly organized by the Royal Society and Nature and took place here in Cambridge. The meeting was in preparation for a larger meeting, also entitled “Tomorrow’s Giants”, which is to be held on 1st July 2010 as part of the Royal Society’s 350th anniversary celebrations. The purpose of the larger event will be to bring together scientists and politicians in an effort to gather scientists’ visions for the next five decades and to ask questions such as

  • What will be required to enable academic achievement in the future?
  • What are the main goals and challenges facing science in the future?

In discussing this, funding considerations were to be left to one side. This is interesting, considering that the current fashion and move towards larger and larger platform grants has profound implications for some of the questions the Royal Society and Nature wanted to debate.

As part of the preparatory Cambridge meeting, the Royal Society and Nature had singled out four questions they wished us to debate:

  • “Database Management”
  • “Science Organisation”
  • “Metrics”
  • “Career Security and Support”

For historical and other reasons, readers of this blog will not be surprised to know that my personal interests are centered on scientific data, and I shall therefore spend a few blog posts on the questions about scientific data that we were asked to debate. In this context, “Database Management” was a very unfortunate name for a vastly important topic which had everything to do with how science handles its data in the future. The questions that were asked were: (a) managing big data – what is the right infrastructure for data sharing? (b) is “big data” more of a concern for some disciplines than for others (e.g. biologists)? (c) how can – and is it appropriate to – facilitate inter-laboratory dataset comparison? (d) does the type of data have an impact on the ways it can be shared? (e) future literatures in the wider sense, i.e. not just how findings are published in journals, but how can interim findings be shared and accessed? (f) what about the tension between transparency and data protection? (g) what are the implications of the growing use of web 2.0 as a resource for sharing research findings? and (h) how well organised is the current use of web 2.0 and how does this impact accessibility?

These were all wonderful questions which must be asked in order to “future-proof” science, and to which we were expected to provide answers in 20 minutes (!). While I was and am glad that we were to debate these issues, the devil is – as always – in the detail, and the undifferentiated way the questions were asked made my heart sink again.

In this post, I would like to address the first two questions:

Managing big data – what is the right infrastructure for sharing
The Good: What is exciting here is the recognition by the RS that data needs infrastructure, and that infrastructure is both a technical and a sociocultural problem. Some components of that infrastructure (by far not all) that are direly needed are:

  • Data repositories (departmental, university-level, subject-specific and trans-institutional)
  • Open, non-proprietary and standards-based markup (exchange formats)
  • Computable metadata (e.g. ontologies which can be used to give data COMPUTABLE meaning)
  • University librarians who think that preservation of the data generated by one’s own institution falls WITHIN the remit of the library
  • Scholarly societies who remember that they were founded in response to a scaling problem – namely the increasing availability of scientific data and the need to distribute it – and who start taking this reason for their existence seriously again rather than trying to lock up data in inaccessible and copyrighted/DRM’ed/pdf’ed publications
  • Academics who believe that data science should be a compulsory part of every undergraduate’s course
  • Funding agencies who mandate open access publishing and data sharing as a condition of the award of a grant
  • The availability and use of appropriate data licences, such as Creative Commons licences or Open Knowledge Foundation licences

etc. etc.… I am sure there are many more things that I should mention here and that I have forgotten. Come to think of it: funding bodies and universities – don’t forget about or squeeze out the infrastructure guys. Don’t tell the infrastructure guys that the development of institutional repositories/markup languages/models/eScience tools is not science but engineering and has no place in a research university that “does science”. Do you detect bitterness? Yes you do – some of my colleagues – even those that call themselves “chemoinformaticians” – tell me just this on a regular basis. Only thing is – without the infrastructure guys and the engineers who develop all of this stuff, and develop it in a scientific manner using scientific methods, NO science will get done, because there will be no infrastructure to support it. And which buttons will you push then to calculate your transition states, dock your molecules etc.? Yes – data needs infrastructure… now universities, senior academics and funding bodies: put your money and your recognition where your mouth is.
The Bad: The focus of the question on BIG data perturbs me immensely. Because BIG data is, well, BIG data, one of the first things that people who produce/manage/exchange BIG data have to do – almost by the very nature of the thing – is to worry about infrastructure for BIG data. And while we may not have all the technical answers just yet (it is sad in a way that the fastest bandwidth we have for shuffling really BIG data, such as that produced by astronomers around the world, is to load it onto hard disks, load those onto trucks and send the trucks on their way), people who deal in BIG data are very aware that it needs infrastructure and hardly need convincing. It is not BIG data that is the problem. The problem is data that is produced in the “bog-standard” long-tail research group of between 3 and 20 people. It is these groups who usually DO NOT (unless they happen to be blessed and are biologists) have the infrastructure to make data available in such a way that it can be stored, exchanged and re-used. It is the biology/chemistry/physics PhD student who has slaved for three years to assemble data and keeps it in an Excel spreadsheet that we need to worry about – how do we make it possible for them to publish their data and make it reusable? How about the departmental crystallographer who sits on thousands of publication-quality but unpublished crystal structures just because the compound never quite made it into a paper? We need to develop mechanisms and infrastructure for the small “long-tail” laboratory scientists… the big data guys have this figured out anyway.

Is Big Data more of a concern for some disciplines rather than others (e.g. biologists)?
The Good: Yes, of course it is. High-throughput screening, gene sequencing and radioastronomy produce huge amounts of data. Yes, it is a concern for them – but they are thinking about it already.
The Bad: Big data again. See above – it is not about big data… let’s talk about the synthetic organic chemist and the data associated with the 20 compounds he makes over three years too, please.

I’ll continue to address some of the other data related questions in other blog posts.


Capturing process: In silico, in laboratorio and all the messy in-betweens – Cameron Neylon @ the Unilever Centre

I am not very good at live-blogging, but Cameron Neylon is at the Unilever Centre and giving a talk about capturing the scientific process. This is important stuff and so I shall give it a go.

He starts off by making the point that to capture the scientific process we need to capture the information about the objects we are investigating as well as the process by which we get there.

Journals not enough – the journal article is static but knowledge is dynamic. Can solutions come from software development? Yes to a certain extent….

e.g. source control/versioning systems – captures snapshots of development over time, date stamping etc.
Unit testing – continuous tests as part of the science/knowledge testing
Solid-replication…distributed version control

Branching and merging: data integration. However, commits are free text..unstructured knowledge…no relationships between objects – what Cameron really wants to say is NO ONTOLOGIES, NO LINKED DATA.

Need linked data, need ontologies: towards a linked web of data.

Data is nice and well…but how about the stuff that goes on in the lab? Objects, data spread over multiple silos – recording much harder: we need to worry about the lab notebook.

“The lab notebook is pretty much an episodic journal” – which is not too dissimilar to a blog. The similarities are striking: descriptions of stuff happening, date stamping, categorisation, tagging, accessibility… and not of much interest to most people…;-). But the problem with blogs is still information retrieval – same as with the lab notebook…

Now showing a blog of one of his students recording lab work… software built by Jeremy Frey’s group… the blog IS the primary record: the blog is a production system… 2 GB of data. At first glance the lab-log is similar to a conventional blog: dates, tags etc.… BUT the fundamental difference is that data is marked up and linked to other relevant resources… now showing a video demo of capturing provenance, dates, linking of resources, versioning, etc.: data is linked to experiment/procedure, procedure is linked to sample, sample is linked to material… etc.…

Proposes that his blog system is a system for capturing both objects and processes… a web of objects… now showing a visualisation of resources in the notebook and demonstrates that the visualisation of the connectedness of the resources can indicate problems in the science, or in the recording of the science, etc.… and says it is only the linking/networking effect that allows you to do this. BUT… no semantics in the system yet (tags yes… no PROPER semantics).

Initial labblog used hand-coded markup: scientists needed to know how to hand code markup…and hated it…..this led to a desire for templates….templates create posts and associate controlled vocab and specify the metadata that needs to be recorded for a given procedure….in effect they are metadata frameworks….templates can be preconfigured for procedures and experiments….metadata frameworks map onto ontologies quite well….
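My reading of what such a template boils down to – this is my own sketch, not the LaBLog implementation, and the field names and ontology terms are placeholders – is a declared set of metadata fields mapped to controlled-vocabulary terms, against which a post can then be validated.

```python
# Sketch: a post "template" as required metadata fields mapped to ontology terms.
BOILING_POINT_TEMPLATE = {
    "procedure": "boiling point measurement",
    "fields": {
        # field name   -> (ontology term it maps to, required?)   (hypothetical terms)
        "sample":         ("ex:ChemicalSample", True),
        "temperature_K":  ("ex:BoilingPoint", True),
        "pressure_kPa":   ("ex:MeasurementPressure", True),
        "instrument":     ("ex:Instrument", False),
    },
}

def validate_post(post, template):
    """Return the required template fields missing from a blog-post-as-record."""
    return [name for name, (_, required) in template["fields"].items()
            if required and name not in post]

post = {"sample": "sample-42", "temperature_K": 353.2}
print(validate_post(post, BOILING_POINT_TEMPLATE))   # -> ['pressure_kPa']
```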

Bio-ontologies… sometimes conflate process and object… says there is no particularly good ontology of experiments… I think the OBI and EXPO people might disagree….

So how about the future?

  • Important thing is: capture at source IN CONTEXT
  • Capture as much as possible automatically. Try and take the human out of the equation as much as possible.
  • In the lab, capture each object as it is created, capture the plan and track the execution step by step
  • Data repositories as easy as Flickr – repos specific for a data type, and then link artefacts together across repos… e.g. the Periodic Table of Videos on YouTube, embedding of chemical structures into pages from ChemSpider
  • More natural interfaces to interact with these records… better visualisation etc.…
  • Trust and provenance and cutting through the noise: which objects/people/literature will I trust and pay attention to? Managing people and the reputation of the people creating the objects: SEMANTIC SOCIAL WEB (now shows FriendFeed as an example: subscription as a measure of trust in people, but people discussing objects) “Data finds the data, then people find the people”… a social network with objects at the centre…
  • Connecting with people only works if the objects are OPEN
  • Connected research changes the playing field – again, resources are key
  • OUCH controversy: communicate first, standardize second… but at least he acknowledges that it will be messy…
  • UPDATE: Cameron’s slides of the talk are here:


Data-Rich Publishing

I have been insanely busy recently with trips and papers and corrections and… etc.… and only now have a bit of time to catch up with some of my feeds and people’s blog posts. One post which caught my eye was Egon’s recent blog post about data-rich or data-centric publishing, in which he argues strongly for a new kind of publishing: a publishing in which data is treated as a first-class citizen and which allows/requires an author to publish not just the words of a paper, but the research data too, and to publish it in such a way that the barrier to access by machines is low.

This reminded me of what I thought was a particularly tragic case, which I blogged about a while ago here. In this particular case, industrious researchers had synthesized an incredible 630 polystyrene copolymers and recorded their Raman spectra. Now this is more than a crying shame: a lot of work has gone into producing the polymers and recording the data. And I ask you (provided you are a materials scientist and have an interest in such things): when was the last time that YOU came across such a large and rich library of polymers together with their spectral data? And through no fault of their own, the only way these authors saw to publish their data was in the form of a pdf archive in the supplemental information.

Now Egon’s point was that newly formed journals – and in particular newly formed Journals of Chemoinformatics – have the opportunity to do something fundamentally good and wholesome: namely to change the way in which data publication is being accomplished and to give scientists BETTER tools to deal with and disseminate their data. This long and rambly blogpost is my way of violently agreeing with Egon: I believe that THIS is where an awful lot of the added value of the journal of the future will lie. This will be even more true, as successive generations of scientists will start to become more data savvy: last week I talked to a collaborator of ours who had just put in for some funding to train chemistry students in both chemistry and informatics: a whole dedicated course. Now once these students start their own scientific careers, they will both care and know about science and scientific data. And if I were a publisher, I would want to have something to offer them….


ChemAxiom: An Ontology for Chemistry 2. The Set-Up

Now that I have introduced at least some of the motivation behind ChemAxiom, let me outline some of the mechanics.

ChemAxiom is a collective term for a set of ontologies, all of which make a start at describing subdomains within chemistry. The ontology modules are independent and self-contained and can (largely) be developed separately and concurrently. Although they are independent, they are interoperable and integrated via a common upper ontology – in the case of ChemAxiom, we have chosen the Basic Formal Ontology (BFO). I will blog the reasons for this choice in the next post.


The ontologies are currently in various stages of axiomatisation, depending on how long we have been working on them and how much we have had a chance to play – so if there are axioms missing that you think should be there, or if you agree/disagree with some of our design decisions, please let us know. In any case, the discussion has already started with some helpful comments over on the Google Group. Let me describe the various modules in greater detail:

The Reasons for Modularity: When developing ontologies, it is always tempting to develop the ueber-McDaddy-ontology-of-everything because, of course, ontology development is, by definition, never done: we always need more than we have – more terms, more axioms etc. Very quickly, this can result in monstrously large and virtually unmaintainable constructs. Modularisation has, from our perspective, the advantage of (a) smaller and more manageable ontologies, (b) ontologies which are easier to maintain, and (c) ontologies which can be developed in parallel or orthogonally and subsequently integrated using either a common upper ontology or mappings/rules etc. Furthermore, if refactoring of ontologies is necessary during the development process, this is also facilitated by modularity: changes in one module have less chance of forcing changes in another module.

The General Use Case: One of the things we are particularly interested in here in Cambridge is the extraction of chemical entities and data from text, and Peter Corbett’s OSCAR is now fairly well established within the chemical informatics community. Our text sources vary widely and can range from standard chemical papers to theses, blogs and Wikipedia pages. To give you an impression of the types of data we are talking about, here is an example: Wikipedia’s infobox for benzene (somewhat truncated):


[Image: Wikipedia infobox for benzene]

So we have to deal with names, identifiers of various types, physico-chemical property data as well as the corresponding metadata (e.g. measurement pressures, measurement temperatures etc.), and chemical structure (InChI, SMILES). Our ontologies should enable us to generate RDF that allows us to hold this data – the ontology here serves as a schema. While we are interested in reasoning/using reasoners for the purposes of (retrospective) typing (again, I will explain what I mean by that in subsequent blog posts), applying ontologies to the description of chemical data is our first use-case.
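To make the “ontology as schema” idea concrete, here is a small sketch of a few benzene infobox fields expressed as RDF with rdflib. The chemaxiom namespace URI and the property names below are placeholders of mine, not the released ChemAxiom terms.

```python
# Sketch: a few benzene infobox fields as RDF, using a placeholder schema.
from rdflib import Graph, Namespace, Literal, RDF

CHEM = Namespace("http://example.org/chemaxiom#")   # placeholder namespace
g = Graph()
g.bind("chem", CHEM)

benzene = CHEM.benzene
g.add((benzene, RDF.type, CHEM.ChemicalSpecies))
g.add((benzene, CHEM.hasInChI,
       Literal("InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H")))
g.add((benzene, CHEM.hasSMILES, Literal("c1ccccc1")))

# A property value together with its measurement metadata (pressure).
bp = CHEM.benzene_boiling_point
g.add((benzene, CHEM.hasProperty, bp))
g.add((bp, RDF.type, CHEM.BoilingPoint))
g.add((bp, CHEM.hasValue, Literal(353.2)))              # K, approx.
g.add((bp, CHEM.measuredAtPressure, Literal(101.325)))  # kPa

print(g.serialize(format="turtle"))
```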

With all of that said, let me provide a quick summary of the modules:

Chemistry Domain Ontology – ChemAxiomDomain: ChemAxiomDomain is the first module in the set. It is currently a small ontology which clarifies some fundamental relationships in the chemistry domain. Key concepts in this ontology are “ChemicalElement”, “ChemicalSpecies” and “MolecularEntity” as well as “Role”. ChemAxiomDomain clarifies the relationships between these terms (see my previous blog post) and also deals with identifiers etc. Chemical roles, too, are important: while chemical entities may be or act as nucleophiles, acids, solvents etc. some of the time, they do not have these roles all of the time – roles are realisable entities, and ChemAxiomDomain provides a mechanism for dealing with that. There are a few other high-level domain concepts in there at the moment, though obviously we are looking to expand as and when the need arises and use-cases are provided. I will blog some details in a subsequent blog post.
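As a toy rendering of the role pattern – not the released ChemAxiomDomain file, and with a made-up IRI and class names – something like the following owlready2 snippet captures the idea that a species bears a role only some of the time:

```python
# Sketch: species and realisable roles, in the spirit of the pattern above.
from owlready2 import get_ontology, Thing, ObjectProperty

onto = get_ontology("http://example.org/chemaxiom-domain-sketch.owl")  # placeholder IRI

with onto:
    class ChemicalSpecies(Thing): pass
    class Role(Thing): pass
    class Nucleophile(Role): pass
    class Solvent(Role): pass

    class has_role(ObjectProperty):
        domain = [ChemicalSpecies]
        range  = [Role]

# A species can be given a role in one context without that role being
# part of what the species is.
water = onto.ChemicalSpecies("water")
water.has_role.append(onto.Solvent("solvent_role_1"))
print(water.has_role)
```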

Properties Ontology – ChemAxiomProp. ChemAxiomProp is an ontology of over 150 chemical and materials properties, together with a first set of definitions and symbols (where available and appropriate) and some axioms for typing of properties. Again, details will follow in a subsequent blog post.

Measurement Techniques – ChemAxiomMetrology. This is an ontology of over 200 measurement techniques and also contains a list of instrument parts and axioms for typing of measurement techniques. It does not currently include information about minimum information requirements for measurement techniques (e.g. the measurement of a boiling point also requires a measurement of pressure) and other metadata, but this will be added at a later stage. Again, a detailed blog post will follow.

ChemAxiomPoly and ChemAxiomPolyClass – These two ontologies contain terms which are in common use across polymer science as well as a taxonomy of polymers based on the composition of their backbone (though the latter is not axiomatised yet). Details will follow in a further blog post.

ChemAxiomMeta – ChemAxiomMeta is a developing ontology that will allow the specification of the provenance of data (e.g. data derived from wiki pages etc.) and will also define what a journal, journal article, thesis, thesis chapter etc. are, and what the relationships between these entities are. We have not released this yet. Details will follow in a further blog post.

ChemAxiomContinuants – ChemAxiomContinuants represents an integration of all the above sub-ontologies into an ontological framework for chemical continuants (with some occurrents mixed in where we need to talk about measurement techniques). Details will follow in a further blog post.

We have also started to work on ontologies of chemical reactions, actions and, as mentioned above, minimum information requirements – however, these are at a relatively early stage of development and hence not released yet.

So much for a short overview of the mechanics of the ontologies. I am sure there are a thousand other things I should have said, but that will have to do for now. Comments and suggestions via the usual channels. Automatic links and tags, as always, by Zemanta.
