Semantic Web Tools and Applications for Life Sciences 2009 – A Personal Summary

So another SWAT4LS is behind us, this time wonderfully organised by Andrea Splendiani, Scott Marshall, Albert Burger, Adrian Paschke and Paolo Romano.

I have been back home in Cambridge for a couple of days now and have been asking myself whether there was an overall conclusion from the day – some overarching bottom line that one could take away and against which one could measure the talks at SWAT4LS2010 to see whether there has been progress or not. The programme consisted of a great mixture of longer keynotes, papers, “highlight posters” and highlight demonstrations, illustrating a wide range of activities at the intersection of semantic web technology, computer science and biomedical research.

Topics at the workshop ranged from the analysis of the relationship between HLA structure variation and disease, applications for maintaining patient records in clinical information systems and patient classification on the basis of semantic image annotations, to the use of semantics in chemo- and proteoinformatics, the prediction of drug–target interactions on the basis of sophisticated text mining, and even games such as Onto-Frogger (though I must confess that I somehow missed the point of what that was all about).

So what were the take-home messages of the day? Here are a few points that stood out to me:

  • During his keynote, Alan Ruttenberg coined the dictum of “far too many smart people doing data integration”, which was subsequently taken up by a lot of the other speakers – an indication that most people seemed to agree with the notion that we still spend far too much time dealing with the “mechanics” of data – mashing it up and integrating it, rather than analysing and interpreting it.
  • During last year’s conference, it already became evident that a lot of scientific data is now coming online in semantic form. The data avalanche has certainly continued, and the sense that more data is available, at least in the biosciences, has intensified. While chemistry has been lagging behind, data is becoming available here too: on the one hand, there are Egon’s sterling efforts with openmolecules.net and the solubility data project; on the other, there are big commercial entities like the RSC and ChemSpider. During the meeting, Barend Mons also announced that he had struck an agreement with the RSC/ChemSpider to integrate the content of ChemSpider into his Concept Wiki system. I will reserve judgement as to the usefulness and openness of this until it is further along. In any case, data is trickling out – even in chemistry (a minimal sketch of what querying such semantic data can look like follows this list).
  • Another thing that stood out to me – and I could be quite wrong in this interpretation, given that this was very much a research conference – was that there were many proof-of-principle applications and demonstrators on show, but very few production systems that use semantic technologies at scale. A notable exception was the GoPubMed (and related) system demonstrated by Michael Schroeder, who showed how sophisticated text mining can be used not only to find links between seemingly unrelated concepts in the literature, but also to assist in ontology creation and the prediction of drug–target interactions.
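
To make the point about data coming online in semantic form concrete, here is a minimal sketch of what consuming such data can look like from Python. This is an illustration only: the endpoint URL and the predicate IRIs are invented, so substitute those of a real RDF data source, and the snippet assumes the SPARQLWrapper library is installed.

```python
# Hypothetical sketch: one SPARQL query spanning concepts that would
# otherwise live in separate databases. The endpoint and vocabulary
# below are placeholders, not a real service.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://example.org/sparql")  # hypothetical endpoint
sparql.setQuery("""
    PREFIX ex: <http://example.org/chem#>
    SELECT ?compound ?solubility ?target WHERE {
        ?compound ex:hasSolubility ?solubility .
        ?compound ex:interactsWith ?target .
    }
    LIMIT 10
""")
sparql.setReturnFormat(JSON)

# Each row links a compound, its solubility and an interaction target
# in a single query -- the "integration" happens in the data model.
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["compound"]["value"],
          row["solubility"]["value"],
          row["target"]["value"])
```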

Overall, many good ideas, but, as seems to be the case with all of the semantic web, no killer application as yet – and at every semweb conference I go to, we seem to be scrabbling around for one of those. I wonder if there will ever be one, and what it will be.

Thanks to everybody for a good day. It was nice to see some old friends again and make some new ones. Duncan Hull has also written up some notes on the day – so go and read his perspective. I, for one, am looking forward to SWAT4LS2010.

Tomorrow’s Giants 2 – Dataset Comparison, Data Sharing and Future Literatures

Following my first post from last week, here are more of the questions that the Royal Society wanted us Cambridge researchers to discuss during the preparatory Tomorrow’s Giants meeting in Cambridge.

How can we facilitate inter-laboratory dataset comparison – and is it appropriate to do so?
Great that the question was asked. And the answer is: yes, of course it is. Not only is it appropriate, it is the very essence of scientific endeavour – what else could be called science? That said, the fact that the question even had to be asked, and that the answer is not self-evident, is disappointing. What have science and scientists lost by way of attitude and ethics that makes us even ask it? Admittedly, there may be commercial reasons why this sort of comparison is not desirable. One of the participants in the session was at great pains to point out that there is often commercial interest tied up in data which prevents sharing and re-use, and that is a fair point. However, over the past couple of years I have sat through far too many presentations where the presenter got up and talked about the development of a proprietary model or machine-learning tool using a proprietary dataset and proprietary software. Now that is NOT science – at best it is a piece of local engineering which solves a particular problem for the presenter, but it does not advance human knowledge at all. I, as a fellow scientist, cannot pick up any aspect of such work and build upon it, because it is all proprietary. Local engineering at best.

Does the type of data have an impact on the ways it can be shared?
Flippantly speaking: “you betcha”. Again, great that the question was even asked. The answer is multifaceted, because the question can be read in a number of different ways. It could be read as “does the provenance of the data and the context in which it was generated have an impact on the ways in which it can be shared?” It can also be read as “does the (technical) format the data is in have an impact on the way in which it can be shared?” The answer in both cases is yes. Let’s tackle the two in turn. One of the participants of the workshop works at the faculty of education, and her primary research data consists of a large collection of interviews she has conducted with children over the course of her work. She believes that this data is valuable to other researchers in her field and would dearly love to share it – but finds herself in a mire of legal and ethical concerns with respect to, for example, the children’s privacy, which effectively prevent her from sharing. So yes, the context in which data is produced and the type of data that is generated can be an obstacle to sharing. If “type of data” is understood to mean “format”, then the answer is also yes. A number of my colleagues have pointed out (see here, for example) the data loss that occurs when documents containing scientific data are converted from the format in which they were produced to PDF (examples are here, here and here). The production of data in vernacular or lossy data formats obviously also has an impact on data sharing – particularly when the sharing and exchange format is lossy.
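
To make the format-loss point concrete, here is a toy sketch in Python: a measurement carried in machine-readable markup (the element and attribute names are invented for the example, loosely in the spirit of CML) is flattened to bare text, roughly what scraping a PDF gives back. The units, conditions and identifiers live in attributes and simply vanish.

```python
# Toy illustration of lossy conversion. The markup below is invented
# for the example (loosely CML-flavoured); it is not a real schema.
import xml.etree.ElementTree as ET

source = """
<measurement property="aqueous solubility" units="mol/L" temperature="298 K">
  <compound inchi="InChI=1S/CH4O/c1-2/h2H,1H3">methanol</compound>
  <value>24.7</value>
</measurement>
"""

root = ET.fromstring(source)

# What the structured format knows: units, conditions, identifiers.
print(root.attrib)
# {'property': 'aqueous solubility', 'units': 'mol/L', 'temperature': '298 K'}

# What survives a flatten-to-text conversion (roughly what you recover
# from a PDF): two bare, ambiguous strings.
print("".join(root.itertext()).split())
# ['methanol', '24.7']
```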
However, the fact that the question had to be asked at all, and that it went straight over the heads of most scientists at the meeting who do not work in the data business, is intensely disappointing. Many laboratory researchers have no appreciation of what they are doing when they convert their Word documents to PDF. Data science and informatics are not part of the standard curriculum in the education of scientists – something that desperately needs to change if data loss due to ignorance in data handling is to be avoided in the future.

Future literatures in the wider sense, i.e. not just how findings are published in journals, but how interim findings can be shared and accessed?
That is a great question, and one, as it turns out, that many of the people present at the meeting had already pondered in one form or another. Scientists should not only be assessed on the basis of the journal articles they write, but also, for example, on the (raw) data they publish. However, science has so far not evolved a technical solution to the data publication problem (of course, there isn’t just one solution – there are many, depending on the type of data as well as the specific subject, sub-subject or sub-sub-subject producing it). Interim findings are part of this, and systems like Nature Precedings could point the way (although even Nature Precedings does not allow us to deal with data). Obviously, one has to be careful that these do not just become dumping grounds for lower-quality science. Once we have evolved technical solutions for publishing data, the next step will be to develop an ecosystem of metrics around it. Those metrics should extend only to things like data quality, trust and data provenance. Data “usefulness” – citation indices and the like for data – should, I think, not be part of the mix: it is impossible to predict what data will be useful when and under which circumstances (and incidentally, the same is true for papers). In that sense, data usefulness can be as flighty as fashion and should not be a criterion.
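
As a purely hypothetical sketch of the kind of record such a metric ecosystem might operate on – the field names below are my own invention, not any existing standard – it might carry provenance and quality information only, with usefulness scores deliberately absent:

```python
# Hypothetical data-publication record: provenance and quality fields
# only. Field names are invented for illustration, not a standard.
from dataclasses import dataclass, field
from typing import List

@dataclass
class DatasetRecord:
    identifier: str                  # e.g. a DOI, once datasets get citable IDs
    creator: str
    instrument: str                  # how the raw data was produced
    protocol: str                    # link to the method followed
    derived_from: List[str] = field(default_factory=list)  # provenance chain
    validation: str = "unreviewed"   # quality/trust status
    # deliberately absent: citation counts or other "usefulness" metrics

record = DatasetRecord(
    identifier="doi:10.0000/example.dataset.1",
    creator="A. Researcher",
    instrument="400 MHz NMR spectrometer",
    protocol="http://example.org/protocols/42",
)
print(record)
```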

There were a few more questions – and I will blog about these in a future post.
