Semantic Web Applications and Tools for Life Sciences – Morning Session

I am currently at a meeting in Edinburgh with the title “Semantic Web Applications and Tools for Life Sciences”. The title is programmatic and it promises to be a hugely exciting meeting. As far as I can tell, the British ontological aristocracy is here, and a few more besides. The following are some notes I made during the meeting.

1. Keynote: Semantic Web Technology in Translational Cancer Research (M. Krauthammer, Yale Univ.)

How to integrate semantic web technologies with the Cancer Biomedical Informatics Grid (caBIG)?

Use case: melanoma…worked on at 5 NCI sites in the US: Harvard, Penn, Yale, Anderson…can measure all kinases involved in disease pathways…use semantic technologies to share and integrate data from all sites and link to other data sources…e.g. drug screening results etc.

MelaGrid consortium: data sharing, omics integration, workflow integration for clinical trials

Data sharing: create community-wide resources – a federated repository of melanoma specimens

currently caBIG uses the ISO/IEC 11179 metadata standard to register CDEs (common data elements), with additional annotation via NCI Thesaurus concepts. Example of use: caTissue…tissue tracking software (multisite banking, form definition, temporal searches etc.)

omics integration: caBIG domain models are in essence ontologies…translate them into OWL models and integrate with other ontologies (e.g. the Sequence Ontology) to align data from various sources

using Sesame as a triple store, but have performance problems…use SPARQL as the query language rather than caBIG’s own query language
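(As an aside for readers who haven’t met SPARQL: the essence is matching triple patterns against a graph. A toy Python sketch – the data and names here are entirely made up by me, not from the talk, and a real MelaGrid setup would of course query a Sesame store with actual SPARQL:)

```python
# Toy illustration of triple-pattern matching, the core idea behind SPARQL.
# All entities below are hypothetical examples, not real MelaGrid data.

triples = {
    ("melanoma", "involves_kinase", "BRAF"),
    ("melanoma", "involves_kinase", "NRAS"),
    ("BRAF", "screened_against", "sorafenib"),
}

def match(pattern, store):
    """Return all triples matching a single pattern; None acts as a variable."""
    s, p, o = pattern
    return [t for t in store
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# SPARQL analogue: SELECT ?k WHERE { :melanoma :involves_kinase ?k }
kinases = sorted(t[2] for t in match(("melanoma", "involves_kinase", None), triples))
```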

2. Semantic Data Integration for Francisella tularensis novicida Proteomic and Genomic Data (Nadia Anwar et al.)

Why is data integration important in biology?

data integration in bioinformatics is not a solved problem…there is no technology which satisfies all the questions biologists are likely to ask, and there are also issues with data access and permissions…yet another problem is the heterogeneous nature of the data: information discovery is not integrated…all technologies have strengths and weaknesses…data relates – but it doesn’t overlap

Solution: semantic data integration across omes data silos….

Case Study: Francisella tularensis (bacterium, infection through airways…infects the immune system…francisella can bypass macrophages…forms a phagosome, but can escape from it…bioterrorism fears…the “Hittite plague” has been associated with Tularemia)

available data sources: genome data…from international databases…convert to simple RDF data, KEGG, NCBI, GO, Poson, transcriptomics data

used data from a proteomics experiment to integrate with the constructed graphs…could show that it was easy to query the whole graph…but there were issues with the modelling of the data and the resulting RDF graph…so some careful data modelling is still necessary…some performance issues with datasets containing many reified statements…memory problems…
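(The reification problem is easy to see with a toy sketch: standard RDF reification replaces every single triple with four, before any provenance is even attached – small wonder graphs full of reified statements run into memory trouble. The names below are my own illustration, not from the talk:)

```python
# Sketch of why reified statements inflate a graph: standard RDF reification
# turns one triple into four (rdf:type rdf:Statement, rdf:subject,
# rdf:predicate, rdf:object). Entity names here are purely illustrative.

def reify(triple, statement_id):
    s, p, o = triple
    return [
        (statement_id, "rdf:type", "rdf:Statement"),
        (statement_id, "rdf:subject", s),
        (statement_id, "rdf:predicate", p),
        (statement_id, "rdf:object", o),
    ]

plain = [("geneA", "expressed_in", "sample1"),
         ("geneB", "expressed_in", "sample2")]
reified = [t for i, triple in enumerate(plain)
           for t in reify(triple, f"stmt{i}")]
# 2 plain triples become 8 reified ones - a 4x blow-up before any
# actual annotation (evidence, provenance, scores) is added
```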

Summary: In principle it’s easy – in practice it is still hard work

Use of Shared Lexical Resources for Efficient Ontological Engineering (Antonio Jimeno et al.)

Motivation: the Health-e-Child Project (creation of an integrated (grid-based) healthcare platform for European Paediatrics)

Use Case: Juvenile Rheumatoid Arthritis Ontology construction
reuse existing ontologies – Galen, NCI – but…problems with alignment because of missing information that could facilitate mapping; also, many mapping tools are based on statistics…thus trust is an issue

A common terminological resource for life sciences…generate a reference thesaurus that links Galen, NCI and the JRAO thesaurus to normalise term concepts

Def. thesaurus: a collection of entity names in a domain with synonyms, plus a taxonomy of more general and more specific terms (a DAG)…no axiomatisation

Problems in thesaurus construction: ambiguity (retinoblastoma – gene or disease?), inappropriate term labels, and maintenance: thesaurus and ontologies now need to be updated simultaneously…
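(A minimal sketch of that thesaurus structure – synonyms plus a DAG of broader terms, no axioms – with the retinoblastoma ambiguity deliberately built in. All entries are my own toy examples, not from the speakers’ resource:)

```python
# Toy thesaurus as described in the talk: synonyms plus a DAG of broader
# terms, with no axiomatisation. "retinoblastoma" deliberately sits under
# two parents to show the ambiguity problem. All entries are hypothetical.

synonyms = {"retinoblastoma": ["RB", "Rb tumour"]}

broader = {  # term -> more general terms (a DAG, not a tree)
    "retinoblastoma": ["eye disease", "tumour suppressor gene"],
    "eye disease": ["disease"],
    "tumour suppressor gene": ["gene"],
}

def ancestors(term):
    """All more-general terms reachable in the DAG from the given term."""
    seen = set()
    stack = list(broader.get(term, []))
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(broader.get(t, []))
    return seen

# The ambiguity: "retinoblastoma" rolls up to both a disease and a gene,
# so a naive term lookup cannot tell which sense a document means.
```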

KASBi: Knowledge Bases Analysis in Systems Biology

Problem: combining data from different data sources – use semweb rather than standard data integration systems for integration…in particular, use reasoners…

In KASBi they try to integrate reasoners/semweb with traditional database tech: use semantic technologies to generate a “query plan” which specifies how queries need to be carried out across resources

goWeb – Semantic Search Engine for the Life Science Web (Heiko Dietze)

Typical questions: “What is the diagnosis for the symptoms of multiple spinal tumors and skin tumors?”, “Which organisms is FGF8 studied in?”

goWeb combines simple keyword web searching, text mining and ontologies for question answering

A keyword search in goWeb is sent to Yahoo, which returns snippets. These are subsequently pushed through NLP to extract concepts and mark them up with ontology concepts…the ontologies are then used to further filter the results…
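(Roughly, the pipeline looks like this. A toy Python sketch with made-up snippets and a made-up mini-“ontology” – goWeb’s real pipeline uses proper NLP and real ontologies, so take this only as a shape of the idea:)

```python
# Rough sketch of the goWeb idea: search snippets come back, concepts are
# spotted via an ontology-backed dictionary, and results are filtered by
# concept. Snippets and terms below are invented for illustration.

ontology_terms = {
    "spinal tumor": "Disease",
    "skin tumor": "Disease",
    "FGF8": "Gene",
    "zebrafish": "Organism",
    "mouse": "Organism",
}

snippets = [
    "FGF8 expression was studied in zebrafish embryos.",
    "FGF8 signalling in mouse limb development.",
    "Unrelated result about the weather.",
]

def annotate(snippet):
    """Return (term, category) pairs found in a snippet (naive matching)."""
    return [(term, cat) for term, cat in ontology_terms.items()
            if term.lower() in snippet.lower()]

# Ontology-based filter: keep snippets mentioning FGF8 plus some Organism,
# answering "Which organisms is FGF8 studied in?"
hits = [s for s in snippets
        if any(t == "FGF8" for t, _ in annotate(s))
        and any(c == "Organism" for _, c in annotate(s))]
```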

Path Explorer: Service Mining for Biological Pathways on the Web (George Zheng)

Two major biological data representation approaches: free text (discoverable but not invocable) and computer models (constructed, but made available in isolated environments – invocable but not discoverable)

Solution: model biological processes using web service operations (aim: to be both invocable and discoverable)…pathways of service-oriented processes can be discovered and invoked

SOA: service providers publish services into a registry, where they can be discovered by service consumers

DAMN – the slides are much too small…can’t see anything…“entities are service providers and service consumers”
…ok…he’s lost me now – I can’t see anything anymore…

Close Integration of ML and NLP Tools in …
Scope: fine-grained semantic annotation: e.g. in “the GenE protein inhibits…”, mark up “GenE protein” as a protein, “inhibits” as a negative interaction etc.

Availability of NLP pipelines…Alvis/A3P, GATE, UIMA – but domain-specific NLP resources are rare

focus on target knowledge ensures learnability
rigorous manual annotation
high quality annotation and low volumes require proper normalisation of training corpora (syntactic dependencies vs shallow clues)
clarification of the different annotation tasks and knowledge – consistency between NE type and semantics

Fine-grained annotation is feasible and necessary for high quality services, i.e. in verticals and science…

Right – time for lunch and a break. I have only captured aspects of the presentations and stuff that resonated with me at the time…so please nobody shoot me if you think I haven’t grabbed the most fundamental points…A link to the slides from the event is here


The Dutch Polymer Institute

I am currently sitting through one of those corporate presentations and so have some time to blog about the Dutch Polymer Institute (DPI) which is organising the meeting I am currently attending.

The DPI was set up by the Dutch Government in 1996 as part of the “leading technology institute” (LTI – other current institutes are the Netherlands Institute for Metals Research, the Telematics Institute, the Wageningen Centre for Food Sciences, the Dutch Separation Technology Institute, Top Institute Pharma, Wetsus and TI Green Genetics) initiative. The DPI is a public-private partnership (PPP), with funding being provided by industry (25 %), academia (25 %) and government (50 %). A 2003 OECD study suggested that the DPI was one of the purest forms of PPP, and it is certainly one of the few examples I know of that work well. In a typical scenario, an industrial member joins the institute (which is a virtual institute – it has no laboratories or facilities of its own) by purchasing a share (“a ticket”) in the institute, which is currently worth approximately 50,000 Euros per annum with a minimum commitment of four years. Academia contributes the same amount of money (in practice through in-kind contributions) and the Dutch government doubles this sum. An investment of Eur 200 000 by industry thus generates about Eur 800 000 in research funding (the DPI has minimal overheads due to the virtual nature of the institute). This is a daring scheme in many ways: the institute is international and while the largest beneficiary from this funding model is still Dutch research (and in particular the Eindhoven University of Technology (TU/e)), international research is funded too, and we in Cambridge certainly benefit from this as well. It is a great credit to the Netherlands that this is possible and I wonder how many other European governments would be willing to set up such a scheme.
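(For what it’s worth, the numbers check out. A quick back-of-the-envelope calculation – my own reading, taking “doubles this sum” to mean the government matches the combined industry and academia contribution, and all figures approximate as in the text above:)

```python
# Back-of-the-envelope check of the DPI funding model described above.
# Figures are approximate; "ticket" price and ratios as stated in the post.

ticket_per_year = 50_000                 # EUR, one industrial "ticket"
min_years = 4                            # minimum commitment

industry = ticket_per_year * min_years   # 200,000 EUR over four years
academia = industry                      # matched through in-kind contributions
government = industry + academia         # government doubles the combined sum
total = industry + academia + government

shares = {name: amount / total for name, amount in
          [("industry", industry), ("academia", academia), ("government", government)]}
# shares come out at 25% / 25% / 50%, matching the stated funding split
```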

DPI’s main mission is to catalyse the development of fundamental research and to bring it up to the end of the pre-competitive phase, so that it can subsequently be taken up by industry and developed into commercial products. The DPI does this by financing academic research and staff, suggested and requested by the academic partners in the programme and approved by the industrial members. Any intellectual property generated as part of the projects is transferred to the industrial members if there is a request to do so (part of industry’s ROI); if not, it is disseminated in the normal way (i.e. through publications). The institute furthermore provides a platform for networking and recruiting.

The institute has a lot to be proud of: DPI-funded research produces between 150 and 200 papers per year and a good number of patents, a significant number of which have been transferred to industry. For full details check the annual report (2007) here.

A meeting like the one today shows the vibrancy of its community and I can only hope that it will continue to prosper in the future.


An FAQ for Open Access

In a blogpost yesterday, I asked for an FAQ for Open Access for mere mortals. Well, it turns out that Peter Suber has already provided one, which I had previously overlooked. Peter pointed me to it in the comments section of the post. As this is important and tremendously valuable, I thought I should make it more explicit here:

An Open Access Overview

Autogenerated links by Zemanta.


An appetite for open data…

…is what I have encountered here in Antwerp already. I am currently at the annual meeting of the Dutch Polymer Institute, with which I have been associated in various forms over the best part of five years now. We are the guests of Borealis here in Antwerp and as such, it promises to be an interesting meeting. The morning will be taken up with the “Golden Thesis Awards”. The DPI evaluates all the PhD theses it funds by scientific merit and the best PhD students of each year are given an award. This is followed by an excursion to Borealis and in the afternoon there will be thematic sessions: “Polymers and Water” and “Polymers and Time”. The former is self-explanatory and the latter concerns mainly molecular simulations of polymers at short and long time scales. This is followed by poster sessions and a Borealis-hosted dinner in the evening. Tomorrow we will then have several further talks on bio-based polymers, sustainability and solar cells and, in the evening, a brainstorming session: “What could polymers mean for the bottom of the pyramid?” I like DPI meetings – they are extremely young…most of the participants are PhD students and post-docs, always brimming with energy.

In that spirit, I arrived at my hotel last night and sat down for dinner. It didn’t take long before I was surrounded by old and some new acquaintances and we spent the time catching up and discussing what we have been doing. Inevitably, the conversation turned to polymer informatics and open data. There were many questions: “Will extraction of data from a manuscript cause problems with publication later?”, “Why should I trust you and give you my manuscript or thesis to datamine?”, “How does the copyright work?”, “What happens to the publishers – why should they not sell my data?” etc. However, all the minds were open. They see the argument for open data and open knowledge and they agree with it in principle, but there is great uncertainty as to the politics and technicalities associated with open data. The moral of the story is: much more talking needs to be done, and much more education. Open access and open data evangelists should put together an FAQ for “mere mortals”, i.e. researchers who do not think about this all the time and who should not have to think subtly about the differences between “gold OA”, “green OA”, “libre OA” and what have you. We need to do much more talking to the science community. Let’s start now. And let’s not weaken our position by OA sophistry. I will try and blog some more as the meeting goes on and hopefully also provide some photos.

PS: You will see some new and unusual tags at the bottom of this blog post and (UPDATE: no tags apparently) links in the text. I have installed Zemanta to try and make this blog semantically a little richer. The tags and links are autogenerated and I hope the result is worthwhile.
