Semantic Web Tools and Applications for Life Sciences 2009 – A Personal Summary

So another SWAT4LS is behind us, this time wonderfully organised by Andrea Splendiani, Scott Marshall, Albert Burger, Adrian Paschke and Paolo Romano.

I have been back home in Cambridge for a couple of days now and have been asking myself whether there was an overall conclusion from the day – some overarching bottom line that one could take away and against which one could measure the talks at SWAT4LS2010 to see whether there has been progress or not. The programme consisted of a great mixture of longer keynotes, papers, “highlight posters” and highlight demonstrations illustrating a wide range of activities at the intersection of semantic web technology, computer science and biomedical research.

Topics at the workshop covered diverse areas, ranging from the analysis of the relationship between HLA structure variation and disease, applications for maintaining patient records in clinical information systems, and patient classification on the basis of semantic image annotations, to the use of semantics in chemo- and proteoinformatics, the prediction of drug-target interactions on the basis of sophisticated text mining, and games such as Onto-Frogger (though I must confess that I somehow missed the point of what that was all about).

So what were the take-home messages of the day? Here are a few points that stood out to me:

  • During his keynote, Alan Ruttenberg coined the dictum of “far too many smart people doing data integration”, which was subsequently taken up by a lot of the other speakers – an indication that most people seemed to agree with the notion that we still spend far too much time dealing with the “mechanics” of data – mashing it up and integrating it, rather than analysing and interpreting it.
  • During last year’s conference, it already became evident that a lot of scientific data is now coming online in a semantic form. The data avalanche has certainly continued and the feeling of an increased amount of data availability, at least in the biosciences, has intensified. While chemistry has been lagging behind, data is becoming available here too. On the one hand, there are Egon’s sterling efforts with openmolecules.net and the data solubility project, on the other, there are big commercial entities like the RSC and ChemSpider. During the meeting, Barend Mons also announced that he had struck an agreement with the RSC/ChemSpider to integrate the content of ChemSpider into his Concept Wiki system. I will reserve judgement as to the usefulness and openness of this until it is further along. In any case, data is trickling out – even in chemistry.
  • Another thing that stood out to me – and I could be quite wrong in this interpretation, given that this was very much a research conference – was the fact that there were many proof-of-principle applications and demonstrators on show, but very few production systems that made use of semantic technologies at scale. A notable exception to this was the GoPubMed (and related) system demonstrated by Michael Schroeder, who showed how sophisticated text mining can be used not only to find links between seemingly unrelated concepts in the literature, but can also assist in ontology creation and the prediction of drug-target interactions.

Overall, many good ideas, but, as seems to be the case with all of the semantic web, no killer application as yet – and at every semweb conference I go to we seem to be scrabbling around for one of those. I wonder if there will be one and what it will be.

Thanks to everybody for a good day. It was nice to see some old friends again and make some new ones. Duncan Hull has also written up some notes on the day – so go and read his perspective. I, for one, am looking forward to SWAT4LS2010.

SWAT4LS2009 – Keynote Alan Ruttenberg: Semantic Web Technology to Support Studying the Relation of HLA Structure Variation to Disease

(These are live-blogging notes from Alan’s keynote…so don’t expect any coherent text….use them as bullet points to follow the gist of the argument.)

The Science Commons:

  • a project of the Creative Commons
  • 6 people
  • Science Commons specializes CC for science
  • information discovery and re-use
  • establish legal clarity around data sharing and encourage automated attribution and provenance

Semantic Web for Biologists because it maximizes the value of scientific work by removing repeat experimentation.

ImmPort Semantic Integration Feasibility Project

  • ImmPort is an immunology database and analysis portal
  • Goals: meta-analysis
  • Question: how can ontology help data integration for data from many sources?

Using semantics to help integrate sequence features of HLA with disorders
Challenges:

  • Curation of sequence features
  • Linking to disorders
  • Associating allele sequences with peptide structures with nomenclature with secondary structure with human phenotype etc etc etc…

Talks about elements of representation

  • PDB structures translated into ontology-based representations
  • canonical MHC molecule instances constructed from IMGT
  • relate each residue in PDB to the canonical residue, if it exists
  • use existing ontologies
  • contact points between peptide and other chains computed using Jmol following IMGT. Represented as relations between residue instances.
  • Structural features have fiat parts

Connecting Allele Names to Disease Names

  • use papers as join factors: papers mention both disease and allele – noisy
  • use regex and rewrites applied to titles and abstracts to fish out links between diseases and alleles
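
A minimal sketch of what such regex-based co-mention mining might look like – the allele pattern, the disease gazetteer and the example abstract below are all hypothetical stand-ins, not Alan’s actual rules:

```python
import re

# Hypothetical pattern for HLA allele names and a tiny disease gazetteer.
ALLELE_RE = re.compile(r"\bHLA-[A-Z]+\d*\*\d{2}(?::\d{2})?\b")
DISEASES = ["ankylosing spondylitis", "rheumatoid arthritis", "type 1 diabetes"]
DISEASE_RE = re.compile("|".join(re.escape(d) for d in DISEASES), re.IGNORECASE)

def extract_links(abstract: str):
    """Return (allele, disease) pairs co-mentioned in one abstract - noisy by design."""
    alleles = ALLELE_RE.findall(abstract)
    diseases = [m.group(0).lower() for m in DISEASE_RE.finditer(abstract)]
    return [(a, d) for a in sorted(set(alleles)) for d in sorted(set(diseases))]

abstract = ("We confirm the association of HLA-B*27:05 with "
            "ankylosing spondylitis in this cohort.")
print(extract_links(abstract))  # [('HLA-B*27:05', 'ankylosing spondylitis')]
```

As the talk stressed, co-mention is noisy: the papers act as join points, so false positives are inevitable and downstream filtering is needed.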

Correspondence of molecules with allele structures is difficult.

  • use BLAST to find the closest allele match between PDB and allele sequence
  • every PDB and allele residue has a URI
  • relate matching molecules
  • relate each allele residue to the canonical allele
  • annotate various residues with various coordinate systems

This creates a massive map that can be navigated and queried. Example queries:

  • What autoimmune diseases can be indexed against a given allele?
  • What are the variant residues at a position?
  • Classification of amino acids
  • Show alleles perturbed at contacts of 1AGB
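
To make the “massive map” idea concrete, here is a toy sketch of querying such a graph – the triples and URIs below are hypothetical placeholders, and a plain Python set of tuples stands in for a real triple store:

```python
# A toy "triple store" as a set of (subject, predicate, object) tuples.
# All identifiers below are hypothetical stand-ins for the real URIs.
triples = {
    ("allele:HLA-B*27", "associated_with", "disease:ankylosing_spondylitis"),
    ("allele:HLA-DRB1*04", "associated_with", "disease:rheumatoid_arthritis"),
    ("allele:HLA-B*27", "has_residue", "residue:B27_pos70_Lys"),
    ("allele:HLA-B*08", "has_residue", "residue:B08_pos70_Gln"),
}

def objects(store, subject, predicate):
    """All objects o such that (subject, predicate, o) is in the store."""
    return {o for s, p, o in store if s == subject and p == predicate}

def subjects(store, predicate, obj):
    """All subjects s such that (s, predicate, obj) is in the store."""
    return {s for s, p, o in store if p == predicate and o == obj}

# "What autoimmune diseases are indexed against a given allele?"
print(objects(triples, "allele:HLA-B*27", "associated_with"))
# "Which alleles carry a given residue?"
print(subjects(triples, "has_residue", "residue:B27_pos70_Lys"))
```

In the real system these lookups would be SPARQL queries against the loaded RDF, but the navigation pattern – hopping from allele to residue to disease – is the same.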

Summary of Progress to Date:
Elements of Approach in Place: Structure, Variation, transfer of annotation via alignment, information extraction from literature etc…

Nuts and Bolts:

  • Primary source
  • Local copy of source
  • Scripts transform it to RDF
  • Export RDF bundles
  • Get selected RDF bundles and load into triple store
  • Parsers generate in-memory structures (Python, Java)
  • Template files are instructions to format these into OWL
  • Modeling is iteratively refined by editing templates
  • RDF loaded into Neurocommons, some amount of reasoning
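
The parse-then-template steps above might look something like the following sketch – the record, URI scheme and template are all hypothetical, and a real pipeline would read the template from a file rather than inlining it:

```python
import string

# Step "Parsers generate in-memory structures": hypothetical parsed records.
records = [
    {"id": "P01889", "label": "HLA class I antigen, B alpha chain"},
]

# Step "Template files are instructions to format these": a real template
# would live on disk so modelling can be refined by editing it, not the code.
TURTLE_TEMPLATE = string.Template(
    '<http://example.org/protein/$id> rdfs:label "$label" .'
)

def to_rdf(recs):
    """Transform parsed records into Turtle lines ready to bundle and load."""
    return [TURTLE_TEMPLATE.substitute(r) for r in recs]

for line in to_rdf(records):
    print(line)
```

Keeping the modelling in templates rather than code is the point of the iterative-refinement step: the mapping can be changed without touching the parsers.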

RDFHerd package management for data

neurocommons.org/bundles

Can we reduce the burden of data integration?

  • Too many people are doing data integration – wasting effort
  • Use web as platform
  • Too many ontologies…here’s the social pressure again

Challenges

  • have lawyers bless every bit of data integration
  • reasoning over triple stores
  • SPARQL over HTTP
  • Understand and exploit ontology and reasoning
  • Grow a software ecosystem like Firefox
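
On the “SPARQL over HTTP” point, the SPARQL protocol lets a query be shipped to an endpoint as an ordinary HTTP GET. A small sketch of constructing such a request (the endpoint and URIs are hypothetical; the request is built but deliberately not sent):

```python
from urllib.parse import urlencode
from urllib.request import Request

# Hypothetical endpoint and query; real endpoints accept this GET form.
ENDPOINT = "http://example.org/sparql"
QUERY = """SELECT ?disease WHERE {
  <http://example.org/allele/HLA-B27> <http://example.org/associated_with> ?disease .
}"""

def build_sparql_request(endpoint, query):
    """Build (but do not send) a SPARQL-over-HTTP GET request."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})

req = build_sparql_request(ENDPOINT, QUERY)
print(req.full_url[:40])
# Actually sending it would be: urllib.request.urlopen(req).read()
```

The attraction for the “web as platform” idea is that any triple store exposing this interface can be queried by any client, with no shared software stack beyond HTTP.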

Hello from Hinxton

So in my last post I pretty much said good-bye to the Unilever Centre and the people there and now it is time for a hello – a hello to a new job. I have recently joined the Department of Genetics and the group of Prof Ashburner as a Research Associate. While I am formally employed by the university, I will, however, spend most of my time at the European Bioinformatics Institute in the group of Christoph Steinbeck.

My remit here will be to continue to develop chemical ontology and in particular to help, together with my colleagues and the ChEBI user community, to put the ChEBI ontology onto a “formal” footing and to align it with the upper ontology used by the OBO Foundry ontologies. I will blog more about this as the story develops – however, for now, I am very excited about this new opportunity. I have a great set of new colleagues (Duncan Hull has also just joined the ChEBI team and has blogged about it) both in the ChEBI group as well as in the wider EBI community and there is a community of people here that believe in the value of this type of work. So I am very much looking forward to helping create some exciting ontology and resources of value to the chemical and biological community.

As I was walking across the Genome Campus this morning, I couldn’t help but be struck by its beauty – here are some pictures I shot with my mobile phone:

Hinxton High Street - On the way to the Genome Campus

Genome Campus - By Hinxton Hall

The Unilever Centre @ Semantic Technology 2009

In a previous blogpost, I had already announced that both Jim and I had been accepted to speak at Semantic Technology 2009 in San Jose.

Well, the programme for the conference is out now and looks even more mind-blowing (in a very good way) than last year. Jim and I will be speaking on Tuesday, 16th June at 14:00. Here’s our talk abstracts:

PART I | Lensfield – The Working Scientist’s Linked Data Space Elevator (Jim Downing)

The vision of Open Linked Data in long-tail science (as opposed to Big Science, high energy physics, genomics etc) is an attractive one, with the possibility of delivering abundant data without the need for massive centralization. In achieving that vision we face a number of practical challenges. The principal challenge is the steep learning curve that scientists face in dealing with URIs, web deployment, RDF, SPARQL etc. Additionally, most software that could generate Linked Data runs off-web, on workstations and internal systems. The result of this is that the desktop filesystem is likely to remain the arena for the production of data in the near to medium term. Lensfield is a data repository system that works with the filesystem model and abstracts semantic web complexities away from scientists who are unable to deal with them. Lensfield makes it easy for researchers to publish linked data without leaving their familiar working environment. The presentation of this system will include a demonstration of how we have extended Lensfield to produce a Linked Data publication system for small molecule data.

PART II | The Semantic Chemical World Wide Web (Nico Adams)

The development of new drugs, new materials and new personal care products requires the confluence of data and ideas from many different scientific disciplines, and enabling scientists to ask questions of heterogeneous data sources is crucial for future innovation and progress. The central science in much of this is chemistry and therefore the development of a “semantic infrastructure” for this very important vertical is essential and of direct relevance to large industries such as the pharmaceuticals and life sciences, home and personal care and, of course, the classical chemical industry. Such an infrastructure should include a range of technological capabilities, from the representation of molecules and data in semantically rich form to the availability of chemistry domain ontologies and the ability to extract data from unstructured sources.

The talk will discuss the development of markup languages and ontologies for chemicals and materials (data). It will illustrate how ontologies can be used for indexing, faceted search and retrieval of chemical information and for the “axiomatisation” of chemical entities and materials beyond simple notions of chemical structure. The talk will discuss the use of linked data to generate new chemical insight and will provide a brief discussion of the use of entity extraction and natural language processing for the “semantification” of chemical information.

But that’s not all. Lezan has been accepted to present a poster and so she will be there too, showing off her great work on the extraction and semantification of chemical reaction data from the literature. Here is her abstract:

The domain of chemistry is central to a large number of significant industries such as the pharmaceuticals and life sciences industry, the home and personal care industry as well as the “classical” chemical industry. All of these are research-intensive and any innovation is crucially dependent on the ability to connect data from heterogeneous sources: in the pharmaceutical industry, for example, the ability to link data about chemical compounds, with toxicology data, genomic and proteomic data, pathway data etc. is crucial. The availability of a semantic infrastructure for chemistry will be a significant factor for the future success of this industry. Unfortunately, virtually all current chemical knowledge and data is generated in non-semantic form and in many silos, which makes such data integration immensely difficult.

In order to address these issues, the talk will discuss several distinct, but related areas, namely chemical information extraction, information/data integration, ontology-aided information retrieval and information visualization. In particular, we demonstrate how chemical data can be retrieved from a range of unstructured sources such as reports, scientific theses and papers or patents. We will discuss how these sources can be processed using ontologies, natural language processing techniques and named-entity recognisers to produce chemical data and knowledge expressed in RDF. We will furthermore show, how this information can be searched and indexed. Particular attention will also be paid to data representation and visualisation using topic/topology maps and information lenses. At the end of the talk, attendees should have a detailed awareness of how chemical entities and data can be extracted from unstructured sources and visualised for rapid information discovery and knowledge generation.

It promises to be a great conference and I am sure our minds will go into overdrive while we are there….can’t wait to go! See you there!?

Semantic Universe Website with Chemical/Polymer Informatics Contributions now Live

Over on Twitter, Semantic Universe has just announced the relaunch of their website. The purpose of the site is “to educate the world about semantic technologies and applications.”

To quote from the website:

“Semantic Universe and Cerebra today announced the launch of the “Semantic Universe Network”, a vibrant educational and networking hub for the global semantic technology marketplace. Semantic Universe Network will be the educational and information resource for the people and companies within the high-growth semantics sector, covering the latest news, opinions, events, announcements, products, solutions, promotions and research in the industry.”

As part of the re-launch, both Lezan Hawizy and I have written two short contributions reviewing the state of Semantic Chemistry and showcasing our work on how the semantification of chemistry can happen. The contributions were intended to be short “how-tos” and as such are written in a somewhat chatty style. Here are the links:

Semantic Chemistry

The Semantification of Chemistry

Feedback is welcome.

Semantic Web Applications and Tools for Life Sciences – Afternoon Session

Tutorial: The W3C Interest Group on Semantic Web Technologies for Health Care and Life Sciences (M.S. Marshall)

“Scientists should be able to work in terms of commonly used concepts.”

The scientist should be able to work in terms of personal concepts and hypotheses (not forced to map concepts to the terms that have been chosen for him).

Otherwise, a general overview of what the interest group does and how it works….link to the webpage is here.

To participate email [email protected]

Task Forces:

  • Terminology
  • Linking Open Drug Data
  • Scientific Discourse
  • Clinical Observations Interoperability
  • BioRDF – integrated neuroscience knowledge base
  • Other Projects – clinical decision support, URI workshop

Paper in IEEE Software: Software design for empowering scientists

I stopped blogging after this mainly because my batteries were dry and there was a scramble for the power sockets in the room I did not wish to participate in. During the meeting some people said they were blogging this and that there was some discussion on Friendfeed….but I can’t find anything much on either. If anybody has a few links, please give me a shout and I will happily link out.

Semantic Web Applications and Tools for Life Sciences – Morning Session

I am currently at a meeting in Edinburgh with the title “Semantic Web Applications and Tools for Life Sciences”. The title is programmatic and it promises to be a hugely exciting meeting. As far as I can tell, the British ontological aristocracy is here and a few more besides. The following are some notes I made during the meeting.

1. Keynote: Semantic Web Technology in Translational Cancer Research (M. Krauthammer, Yale Univ.)

How to integrate semantic web technologies with the Cancer Biomedical Informatics Grid (caBIG)?

Use case: melanoma…worked on at 5 NCI sites in US: Harvard, Penn, Yale, Anderson….can measure all kinases involved in disease pathways…use semantic technologies to share and integrate data from all sites and link to other data sources…e.g. drug screening results etc…..

MelaGrid consortium: data sharing, omics integration, workflow integration for clinical trials

Data sharing: create community wide resources – a federated repository of melanoma specimens

currently caBIG uses ISO/IEC 11179 metadata standards to register CDEs (common data element) and additional annotation via NCI thesaurus concepts: example of use: caTissue…tissue tracking software (multisite banking, form definition, temporal searches etc.)

omics integration: caBIG domain models are in essence ontologies…..translate into OWL models and integrate with other ontologies (e.g. sequence ontology etc.) to align data from various sources

using Sesame as a triple store, but have performance problems….use SPARQL as query language rather than caBIG’s own query language

2. Semantic Data Integration for Francisella tularensis novicida Proteomic and Genomic Data (Nadia Anwar et al.)

Why is data integration important in biology?

data integration in bioinformatics is not a solved problem…there are no technologies which satisfy all the problems biologists are likely to ask, also issues with data access and permissions…..yet another problem is the heterogeneous nature of data: information discovery is not integrated…all technologies have strengths and weaknesses…data relates – but it doesn’t overlap

Solution: semantic data integration across omes data silos….

Case Study: Francisella tularensis (bacterium, infection through airways…infects immune system….francisella can bypass macrophages….forms phagosome, but can escape from it…bioterrorism fears…..”Hittite plague” been associated with Tularemia)

available datasources: genome data…from international database….convert to simple RDF data, KEGG, NCBI, GO, Poson, transcriptomics data

used data from the proteomics experiment to integrate with the constructed graphs….could show that it was easy to query the whole graph…..but issues with modeling of the data and the resulting rdf graph…so some careful data modeling is still necessary….some performance issues with datasets containing many reified statements…..memory problems…
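
The cost of reified statements is easy to see: standard RDF reification turns every single annotated statement into four extra triples, which multiplies store size quickly. A small sketch (the gene and condition identifiers are hypothetical):

```python
def reify(s, p, o, statement_id):
    """Expand one (s, p, o) statement into the four standard RDF reification triples."""
    return [
        (statement_id, "rdf:type", "rdf:Statement"),
        (statement_id, "rdf:subject", s),
        (statement_id, "rdf:predicate", p),
        (statement_id, "rdf:object", o),
    ]

# One measured fact, reified so provenance can be attached to _:stmt1:
triples = reify("gene:FTN_1234", "ex:expressedIn", "condition:infection", "_:stmt1")
print(len(triples))  # 4 triples instead of 1
```

A dataset with millions of measurements therefore balloons by a factor of roughly five once every statement is reified, which is consistent with the memory problems reported in the talk.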

Summary: In principle it’s easy – in practice it is still hard work

Use of shared lexical resources for efficient ontological engineering (Antonio Jimeno et al.)

Motivation: Health-e-Child Project (creation of an integrated (grid-based) healthcare platform for European Paediatrics)

Use Case: Juvenile Rheumatoid Arthritis Ontology construction
reuse existing ontologies – Galen, NCI but….problem with alignment because of missing information that could facilitate mapping, also many mapping tools based on statistics….thus trust

A common terminological resource for life sciences….generate a reference thesaurus from Galen, NCI and the JRAO thesaurus to normalise term concepts

Def Thesaurus: Collection of entity names in domain with synonyms, taxonomy of more general and specific terms (DAG)…..no axiomatisation

Problems in thesaurus construction: ambiguity (retinoblastoma – gene or disease), inappropriate term labels, maintenance: thesaurus and ontologies need to be updated simultaneously now…

KASBi: Knowledge Bases Analysis in Systems Biology

Problem: Combining data from different data sources – use semweb rather than standard data integration systems for integration…in particular use reasoners….

In KASBi try and integrate reasoners/semweb with traditional database tech: use semtech to generate a “query plan” which specifies how queries need to be carried out across resources

goWeb – Semantic Search Engine for the Life Science Web (Heiko Dietze)

Typical question: “What is the diagnosis for the symptoms for multiple spinal tumors and skin tumors?”, “Which organisms is FGF8 studied in?”

goWeb combines simple key-word web searching, text mining and ontologies for question answering

Keyword search in goWeb is sent to Yahoo, which returns snippets. These are subsequently pushed through NLP to extract concepts and mark them up with ontology concepts…….use ontologies to further filter results…..
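
A toy sketch of that snippet-annotation-and-filter loop – the term-to-concept dictionary below is a hypothetical stand-in for the real ontologies, and real concept tagging is far more sophisticated than substring matching:

```python
# Hypothetical mini-ontology mapping surface terms to concept IDs.
ONTOLOGY = {
    "spinal tumor": "MESH:D013120",
    "skin tumor": "MESH:D012878",
    "neurofibromatosis": "MESH:D009456",
}

def annotate(snippet):
    """Tag a search-result snippet with the ontology concepts it mentions."""
    return {term: cid for term, cid in ONTOLOGY.items() if term in snippet.lower()}

def filter_snippets(snippets, required_concept):
    """Keep only snippets whose annotations include the required concept."""
    return [s for s in snippets if required_concept in annotate(s).values()]

snippets = [
    "Neurofibromatosis type 2 presents with spinal tumor burden ...",
    "Unrelated result about gardening.",
]
print(filter_snippets(snippets, "MESH:D009456"))
```

The key move is the same as in goWeb: the keyword engine does recall, and the ontology layer does precision by discarding snippets that do not resolve to the concepts the question is actually about.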

Path Explorer: Service Mining for Biological Pathways on the Web (George Zheng)

Two major biological data representation approaches: free text (discoverable but not invocable), computer models (constructed but made available in an isolated environment – invocable but not discoverable)

Solution: model biological processes using web service operations (aim: to be invocable and discoverable)…pathways of service-oriented processes can be discovered and invoked

SOA: service providers publish services into a registry where they can be discovered by service consumers

DAMN – slides are much too small…can’t see anything….”entities are service providers and service consumers”
….ok…..he’s lost me now – I can’t see anything anymore…..

Close integration of ML and NLP tools in …
Scope: Fine-grained semantic annotation: e.g. “the GenE protein inhibits…” – mark up GenE protein as a protein, inhibits as a negative interaction etc…..

Availability of NLP pipelines….Alvis/A3P, GATE, UIMA – but domain-specific NLP resources are rare

focus on target knowledge ensures learnability
rigorous manual annotation
high quality annotation and low volumes require proper normalisation of training corpora (syntactic dependencies vs shallow clues)
clarification of different annotation tasks and knowledge – consistency between NE type and semantics

Fine grained annotation is feasible and necessary for high quality services: i.e. in verticals and science….

Right – time for lunch and a break. I have only captured aspects of the presentations and stuff that resonated with me at the time….so please nobody shoot me if they think I haven’t grabbed the most fundamental points….Link to the slides from the event is here
