Project Management by Committee

I am just catching up on Seth Godin’s blog – as usual his posts are short and poignant. This one struck a particular chord:

“Hi, we’re here to take your project to places you didn’t imagine.

With us on board, your project will now take three times as long.

It will cost five times as much.

And we will compromise the art and the vision out of it, we will make it reasonable and safe and boring.”

Great work is never reasonable, safe or boring. Thanks anyway.

Support the Long-Term Future of KEGG

The KEGG database is an invaluable resource for biologists, bioinformaticians, clinical researchers, chemists and many others, and it has also been invaluable in some of my own work. KEGG is developed in the laboratory of Minoru Kanehisa, who is now approaching his mandatory retirement and is looking to put KEGG on a sustainable footing and give it a viable business model for the future. The following is a complete reproduction (no explicit licence for reuse is provided, so I claim fair use) of a recent post on the KEGG website:

Plea to Support KEGG

Since 1995 the KEGG database has been developed in my laboratories (Kanehisa Laboratories) at Kyoto University and the University of Tokyo thanks to funding from the Japanese Ministry of Education and its agencies. Contrary to popular perception, KEGG has never been a public database, as there has never been an official long-term commitment from any government agency. Although I have managed over the years to obtain multiple and overlapping short-term research grants to support KEGG, this has become more difficult now that I am reaching the mandatory retirement age. Foreseeing this eventuality, together with my colleagues, I started a non-profit organization, NPO Bioinformatics Japan, as a vehicle to raise funds for the service that we have been delivering.

For the last ten years our major source of funding has come from the Institute for Bioinformatics Research and Development (BIRD) of the Japan Science and Technology Agency (JST). As of April 1, 2011 BIRD has been converted to the National Bioscience Database Center (NBDC) in JST. The newly established NBDC focuses on the integration of various databases, and does not support the development of individual databases as BIRD did. The good news is that I was awarded a three-year grant from NBDC for integration of KEGG MEDICUS with disease and drug information used in practice and in society. However, the bad news is that this grant is not sufficient to continue to hire my talented crew of KEGG curators and software developers.

KEGG is now one of the most widely used biological databases in the world as indicated by the web access statistics (150 to 200 thousand unique visitors per month) and the number of KEGG paper citations (one thousand per year). I intend to ensure that KEGG remains a freely available web resource. However, this will be possible only with your support. First, I would like to ask all of you who have benefited from KEGG to write, email, tweet, and blog about your support for KEGG. I hope, in the long run, your voices will increase our chances of getting more stable funding. Second, we will continue to ask commercial organizations to obtain a license to use KEGG from Pathway Solutions Inc. I am very grateful to all the companies who have so far supported KEGG by obtaining license agreements. This licensing revenue is fully reinvested to further the development of KEGG. Unfortunately though, this is still insufficient to maintain the high-quality service that we strive to accomplish. Consequently, I would like to introduce the following mechanism.

Starting on July 1, 2011 the KEGG FTP site for academic users will be transferred from GenomeNet at Kyoto University to NPO Bioinformatics Japan, and it will be available only to paid subscribers. The publicly funded portion, the medicus directory, will continue to be freely accessible at GenomeNet. The KEGG FTP site for commercial customers managed by Pathway Solutions will remain unchanged. The new FTP site is available for free trial until the end of June.

Please register to learn more about the KEGG FTP subscription.

Thank you!

Minoru Kanehisa

2011 – The International Year of Chemistry


In their editorial for the January issue (you will need a Nature subscription to access this; alternatively see the Sceptical Chymist post here), the good folks at Nature Chemistry have reminded us that 2011 is the International Year of Chemistry:

“The United Nations has proclaimed 2011 to be the International Year of Chemistry. Under this banner, chemists should seize the opportunity to highlight the rich history and successes of our subject to a much broader audience — and explain how it can help to solve the global challenges we face today and in the future.”

The year even has a website. On the front page of the site, the UN also singles out two important areas of chemistry – neither of which has chemistry in the name – namely the development of advanced materials and molecular medicine. I am extremely happy to see this: materials, and in particular polymers, have been a long-standing interest of mine, and some of the immunology work I am currently doing has implications for molecular medicine too.

There are several ways to participate in the Year of Chemistry – one of them is through an essay and video competition: “A World Without Polymers”. Students are asked to make short videos or write essays, trying to imagine what the world would be like without polymers. Furthermore there are networking events, conferences and more all across the world. So go and check out the UN’s site, participate and contribute!


Reading the Tea Leaves of 2011 – Data and Technology Predictions for the Year Ahead

The beginning of a new year usually affords the opportunity to join in the prediction game and to think about which topics will not only be on our radar screens in the coming year, but may come to dominate it. I couldn't help but attempt the same in my particular line of work – if for no other reason than to see how wrong I was when I look at this again at the beginning of 2012. Here are what I think will be at least some of the big technology and data topics in 2011:

1. Big, big, big Data
2010 has been an extraordinary year when it comes to data availability. Traditional big data producers such as biology continue to generate vast amounts of sequencing and other data. Government data is pouring in from all over the world, be it here in the United Kingdom or in the United States, and efforts to liberate government data are also starting in other countries. The Linked Open Data Cloud is growing steadily:

Linked Open Data, October 2007 – Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/

Linked Open Data, September 2010 – Linking Open Data cloud diagram, by Richard Cyganiak and Anja Jentzsch. http://lod-cloud.net/

The current linked data cloud contains about 20 billion triples. Britain now has, thanks to the Open Knowledge Foundation, an open bibliography. The Guardian's Datastore is a wonderful example of a commercial company making data available. The New York Times is making an annotated corpus available. Twitter and other user-generated content also provide significant data firehoses from which one can drink and build interesting mashups and applications, such as Ben Marsh's UK Snow Map. These are just some examples of big data, and several issues associated with it will occupy us in 2011.
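To make the idea of tapping into this cloud a little more concrete, here is a minimal sketch that queries the public DBpedia SPARQL endpoint from Python using the SPARQLWrapper library; the particular query and the dbo:ChemicalCompound class are simply illustrative choices, not tied to any of the datasets above.

```python
# Minimal sketch: pull a few chemical compounds and their labels out of the
# linked data cloud via DBpedia's public SPARQL endpoint.
# Requires the SPARQLWrapper package; the query itself is purely illustrative.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbo: <http://dbpedia.org/ontology/>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?compound ?label WHERE {
        ?compound a dbo:ChemicalCompound ;
                  rdfs:label ?label .
        FILTER (lang(?label) = "en")
    }
    LIMIT 10
""")
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["compound"]["value"], "-", row["label"]["value"])
```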

2. Curation and Scalability
A lot of this big data we are talking about is “real-world” and messy. There is no nice underlying ontological model (the stuff that I am so fond of) and by necessity it is exceptionally noisy. Extracting a signal out of clean data is hard enough, but getting one out of messy data requires a great deal of effort and an even greater deal of care. The development of curation tools and methodologies – both automated and social – will therefore continue to be high up on the agenda of the data scientist. And yes, I do believe that this effort is going to become a lot more social – there are signs of this starting to happen everywhere.
However, we are now generating so much data that the sheer amount is starting to outstrip our ability to compute on it – and therefore scalability will become an issue. The fact that service providers such as Amazon are offering Cluster GPU Instances as part of the EC2 offering is highly significant in this respect. MapReduce technologies are extremely popular in “Web 2.0” companies and the Hadoop ecosystem is growing extremely fast – the ability to “make Hadoop your bitch”, as an acquaintance of mine recently put it, seems to be an in-demand skill at the moment and, I think, for the foreseeable future. And – needless to say – successful automated curation of big data, too, requires scalable computing.
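For readers who have not met MapReduce before, the sketch below shows the pattern at its most basic – the canonical word count, written in the Hadoop Streaming style where mapper and reducer read and write plain text. In real use the two phases would live in separate scripts and be distributed by Hadoop over HDFS; the single file here is purely an illustration of the idea.

```python
#!/usr/bin/env python
# Word count in the MapReduce pattern (Hadoop Streaming style).
# Try it locally with:  echo "big data big data big" | python wordcount.py
import sys
from itertools import groupby

def mapper(lines):
    """Map phase: emit a (word, 1) pair for every token."""
    for line in lines:
        for word in line.strip().split():
            yield word.lower(), 1

def reducer(pairs):
    """Reduce phase: sum the counts per word (pairs arrive sorted by key)."""
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    for word, total in reducer(mapper(sys.stdin)):
        print("%s\t%d" % (word, total))
```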

3. Discovery
Having a lot of datasets available to play with is wonderful, but what if nobody knows they are there? Even in science, it is still much, much harder to discover datasets than it ought to be. And even once you have found what you may have been looking for, it is hard to decide whether it really is what you were looking for – descriptive metadata is often extremely poor or missing altogether. There is currently little collaboration between information and data providers. Data marketplaces such as Infochimps, Factual, Public Datasets on Amazon AWS or the Talis Connected Commons (to name but a few) are springing up, but there is still a lot of work to do. And is it just me, or is science – the very community whose primary product is data and knowledge – lagging far behind in developing these marketplaces? Maybe they will develop as part of a change in the scholarly publication landscape (journals such as Open Research Computation have a chance of leading the way here), but it is too early to tell. The increasing availability of data will push this topic further onto the agenda in 2011.

4. An Impassioned Plea for Small Data
One thing that will unfortunately not be on the agenda much is small data. Of course it won't matter to you if you do stuff at web scale or if you are someone working in genomics. However, looking back at my past existence as a laboratory-based chemist in an academic lab, a significant amount of valuable data is being produced by the lone research student who is the only one working on a project, or by a small research group in a much larger department. Although there is a trend in academia towards large-scale projects and away from individual small grants, small-scale data production on small-scale research projects is still the reality in a significant number of laboratories the world over. And the only time this data gets published is as a mangled PDF document in some journal supplementary – and as such it is dead. Sometimes it is perfectly good data which never gets published at all: in my previous workplace we found that our in-house crystallographer was sitting on several thousand structures which were perfectly good and publishable, but had, for various reasons, never been published. And usually it is data that has been produced at great cost to both the funder and the student. Now, small data like this is not sexy per se. But if you manage to collect lots of small data from lots of small laboratories, it becomes big data. So my plea would simply be not to forget small data: to build systems which collect, curate and publish it and make it available to the world. It will be harder to convince funders, institutions and often researchers to engage with it. But please let's not forget it – it's valuable.

Enough soothsaying for one blog post. But let’s get the discussion going – what are your data and technology predictions for 2011?


The Manuscript Submission Process at Science (Magazine)


Today Peter Stern, Senior Editor at Science magazine, was here at the Genome Campus to give a talk about the manuscript submission process at Science. No matter where you stand with respect to scientific publication and the future of scholarly communication, the talk was very engaging, thoughtfully delivered and mercifully done without any audiovisual aids. I made a few notes during the talk and here they are. They are unedited “live typing” and as such not pretty – but hopefully useful when trying to understand the publication process at Science. I have (mostly) refrained from commenting on what he said – though a lot could be said about it – but maybe at a later date. For now, just the raw unadulterated notes:

Peter Stern: The Manuscript Submission Process at Science
Three points to address:

1. Presubmission Enquiries
2. Board Members
3. Review Process

Ad One
Scientists sometimes forget the bigger picture as they work and get their results. Presubmission enquiries can be useful to get some feedback and help place work in that bigger picture. Stern insists on the confidentiality of information provided in presubmission enquiries.

Ad Two
Science has about 28 individuals trying to cover all of science….all editors have science profiles…multiple ones…many have run research groups.

Ad Three
Paper is submitted…Stern makes high play of safety and confidentiality. Paper gets assigned to editor. Editors try to read in full and form an informal opinion, but admits that this is getting harder due to volume of submissions. Now talks at great length about the “wackos” (creationists, people inventing perpetuum mobile). Editors work with advisors – “board members”….again about 10-12 advisors…..but they don’t do a review but rather try to place manuscript in the bigger picture of science. They come back with a short evaluation and a confidence score. Board members are active scientists with labs…..looks for gentleman factor in board members (wants to be sure that they are fair to papers even if paper disagrees scientifically with board member).

Once feedback from board members has been received, the editor opens another round of discussion with fellow editors. If there is a positive decision at this stage the paper will be sent for full review. Most papers fail at this stage. There is also little room for discussion – the decision is essentially binary.

Finding referees: authors can prepare a “negative” list and a positive list of referees. The lists are usually respected…certainly the “negative” list. Editors often scan websites of grant-giving bodies…to avoid friends/collaborators refereeing each other. Recommends “Guardians of Science” – a sociological study of the peer review process. Default option of two referees, sometimes more if necessary. Default review time of two weeks: seen as the right balance between speed – ensuring that authors don’t get scooped – and allowing enough time for in-depth review.

When referee comments come back, there is room for negotiation depending on the comments. What happens next depends on what the referees ask for. If it is reasonable further work, the paper can go back to the authors; if too much further work is requested the editor has to make a decision. “Peer review is not a democratic process.” If referee reviews are all over the shop, an arbitrator – which could be a board member – can be used.

If positive decision is made, editor will do a “pre-edit” to make it fit Science style. If author is native English speaker, editor will focus on logical argument and flow of paper, if non-native speaker, more linguistic help is needed. After pre-edit is done, paper is returned to authors and a revised version is expected back within 4 weeks unless experimental work needs to be done which takes longer. Most of the time revised paper goes back to referees and gets green light if referee comments have been addressed.

Once accepted, papers can go onto Science Express for rapid publication and to allow the scientists to claim precedence of publication. This is followed by harsh copy-editing. Calls orthographic mistakes an “affront to science”. Now talks about how good they are at disseminating science and making their authors famous. Here’s the gatekeeper justification again.


Visualisation of Ontologies and Large Scale Graphs


For a number of reasons, I am currently looking into the visualisation of large-scale graphs and ontologies, and to that end I have made some notes on tools and concepts which might be useful to others. Here they are:

Visualisation by Node-Link and Tree

jOWL: jQuery Plugin for the navigation and visualisation of OWL ontologies and RDFS documents. Visualisations mainly as trees, navigation bars.

OntoViz: A plugin for Protege…at the moment it supports Protege 3.4 and doesn’t seem to work with Protege 4.

IsaViz: Much the same as OntoViz really. The last stable version dates from 2004 and it does not seem to be under active development.

NeOn Toolkit: The Neon toolkit also has some visualisation capability, but not independent of the editor. Under active development with a growing user base.

OntoTrack: OntoTrack is a graphical OWL editor and as such has visualisation capabilities. These are meagre though, and it does not seem to be supported or developed anymore either…the current version is about five years old.

Cone Trees: Cone trees are three-dimensional extensions of 2D tree structures and have been designed to allow a greater amount of information to be visualised and navigated. I have not found any software for download at the moment, but the idea is so interesting that we should bear it in mind. Examples are here and here, and the key reference is Robertson, George G., Mackinlay, Jock D. and Card, Stuart K., “Cone Trees: animated 3D visualizations of hierarchical information”, CHI ’91: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, 1991, ISBN 0-89791-383-3, pp. 189–194 (DOI here).

PhyloWidget: PhyloWidget is software for the visualisation of phylogenetic trees, but should be repurposable for ontology trees. JavaScript – so appropriate for websites. A student project from the Phyloinformatics Summer of Code 2007.

The JavaScript Information Visualization Toolkit: Extremely pretty JS toolkit for the visualisation of graphs etc…..Dynamic and interactive visualisations too…just pretty. Have spent some time hacking with it and I am becoming a fan.

Welkin: Standalone application for the visualisation of RDF graphs. Allows dynamic filtering, colour coding of resources etc…
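If all you need is a quick look at a small RDF graph and none of the tools above quite fit, the same node-link view can be knocked together in a few lines of Python with rdflib and networkx. A rough sketch is below; the file name is a placeholder, and the naive spring layout will not cope with genuinely large graphs.

```python
# Rough sketch: parse a small RDF/OWL file with rdflib and draw its triples
# as a node-link diagram with networkx/matplotlib.
# "my_ontology.owl" is a placeholder file name.
import rdflib
import networkx as nx
import matplotlib.pyplot as plt

g = rdflib.Graph()
g.parse("my_ontology.owl", format="xml")  # RDF/XML is the usual OWL serialisation

nxg = nx.DiGraph()
for s, p, o in g:
    # One edge per triple, labelled with the predicate
    nxg.add_edge(str(s), str(o), label=str(p))

pos = nx.spring_layout(nxg)
nx.draw(nxg, pos, with_labels=True, node_size=300, font_size=6)
plt.show()
```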

Three-Dimensional Visualisation

Ontosphere3D: Visualisation of ontologies on 3D spheres. Does not seem to be supported anymore and requires Java 3D, which is a nightmare in itself.

Cone Trees (see above) with their extension to Disc Trees (for an example of disc trees, see here).

3D Hyperbolic Trees, as exemplified by the Walrus software. Originally developed for website visualisation, it results in stunning images. Not under active development anymore, but the source code is available for download.

Cytoscape: The 1000-pound gorilla in the room of large-scale graph visualisation. There are several plugins available for interaction with the Gene Ontology, such as BiNGO and ClueGO. Both tools treat the ontologies as annotation rather than as knowledgebases in their own right, and can be used to identify GO terms which are overrepresented in a cluster/network. For the visualisation of ontologies themselves, there is the RDFScape plugin.
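For what it is worth, the overrepresentation test behind tools like BiNGO boils down to a hypergeometric test per GO term; the sketch below shows the calculation with SciPy, using made-up counts purely for illustration.

```python
# Sketch of a GO term overrepresentation test (the calculation that BiNGO-style
# tools perform for every term). All counts are made-up illustrative numbers.
from scipy.stats import hypergeom

population = 20000   # annotated genes in the background
annotated = 150      # background genes carrying the GO term of interest
cluster = 200        # genes in the cluster/network under study
hits = 12            # cluster genes carrying the GO term

# P(X >= hits) when drawing `cluster` genes from the background without replacement
p_value = hypergeom.sf(hits - 1, population, annotated, cluster)
print("Overrepresentation p-value: %.3g" % p_value)
```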

Zoomable Visualisations

Jambalaya – A Protege plugin, but it can also run as a browser applet. Uses SHriMP to visualise class hierarchies in ontologies, with arrows between boxes representing relationships.

CropCircles (link is to the paper describing it): CropCircles have been implemented in the SWOOP ontology editor which is not under active development anymore, but where the source code is available.

Information Landscapes – again, no software, just papers.


Merry Christmas Everyone


Another year is coming to a close and it has been nothing short of eventful. There has been the end of one direction of research, the beginning of my existence as a service provider at the EBI and several new strands of research. Not to speak of moving house and a number of other things.

I have learned a lot about people this year, sometimes more than I wanted to. In particular, I have learned that trust is the only thing that allows anyone to manage anything – both in business and in academia. Destroying trust between people, or between people and organisations, causes untold harm in the medium and long term, no matter how expedient it seems at the time.

However, it is Christmas now and the world rests for a few days. Time to reflect on 2009 and to look forward to the new year with all its possibilities and challenges.

A very merry Christmas and a happy New Year to you all, thank you for reading the blog and see you in 2010!


Almost Christmas….

Christmas is almost upon us and many are at home with their friends and family, looking forward to a few quiet days. Should you, however, not wish to forget about science altogether during this period, have a look at Prof Richard Wiseman’s (University of Hertfordshire) Christmas science experiments:

Exploring Chemical Space with GDB – Jean-Louis Reymond (University of Bern)


(These are live notes from a talk Prof Reymond gave at EBI today)

The GDB Database

GDB = Generated Database (of Molecules)

The Chemical Universe Project – how many small molecules are possible?

GDB was put together by starting from graphs – in this case the graphs were hydrocarbon skeletons – and using the GENG software to enumerate all possible graphs (after predefining which graphs are chemically reasonable and incorporating bonding information etc.). Then place atoms, enumerate, get a combinatorial explosion of compounds and apply filters to remove chemical impossibilities: the result is a couple of billion compounds.
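As a toy illustration of this enumerate-then-filter idea (not the actual GDB pipeline, which enumerates mathematical graphs with GENG and applies far more elaborate chemistry filters), the sketch below substitutes C, N, O and F into a fixed four-atom chain and keeps only the combinations that RDKit accepts as valence-sensible molecules.

```python
# Toy enumerate-then-filter sketch: decorate a fixed 4-atom chain with C/N/O/F
# and keep only SMILES that RDKit can sanitise (e.g. fluorine in the middle of
# a chain fails, because F can only form one bond).
from itertools import product
from rdkit import Chem

atoms = ["C", "N", "O", "F"]
valid = set()
for combo in product(atoms, repeat=4):
    smiles = "".join(combo)           # e.g. "CNOC" -> a linear chain
    mol = Chem.MolFromSmiles(smiles)  # returns None if sanitisation fails
    if mol is not None:
        valid.add(Chem.MolToSmiles(mol))  # canonical form removes duplicates

print("%d unique valid molecules out of %d candidates" % (len(valid), len(atoms) ** 4))
```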

 

Some choices restricting diversity: no allenes, no double bonds at bridgeheads etc., problematic heteroatom constellations excluded (did not consider peroxides), no hydrolytically labile functional groups.

In general – number of possible molecules increases exponentially with increasing number of nodes.

Showing that the molecular diversity increases with linear open carbon skeletons – cyclic graphs have fewer substitution possibilities. Chiral compounds offer more diversity than non-chiral ones.

 

GDB Website

 

Now talking about GDB13:

removed fluorine, introduced sulphur, filtered out molecules with “too many” heteroatoms – due to synthetic difficulties and the fact that they may be of lesser interest to medicinal chemistry.

Now showing statistical analysis of molecular types in GDB. 95% of all marketed drugs violate at least two Lipinski Rules. All molecules in the GDB13 are Lipinski conformant.
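For reference, checking Lipinski conformance of a molecule is a one-line-per-rule affair with RDKit; a quick sketch, using aspirin as the example molecule, is below.

```python
# Sketch: count Lipinski rule-of-five violations for a molecule with RDKit.
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin

violations = sum([
    Descriptors.MolWt(mol) > 500,
    Descriptors.MolLogP(mol) > 5,
    Lipinski.NumHDonors(mol) > 5,
    Lipinski.NumHAcceptors(mol) > 10,
])
print("Lipinski violations:", violations)  # 0 -> rule-of-five conformant
```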

Use case: take a known drug and find isomers. For aspirin there are approx. 180 compounds in the database with a Tanimoto similarity score > 0.7. Points out that many of these molecules may never have been imagined by chemists.
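The Tanimoto comparison behind this use case is easy to reproduce with RDKit fingerprints; in the sketch below the second molecule is an arbitrary aspirin-like structure chosen for illustration, not an actual GDB entry.

```python
# Sketch: Tanimoto similarity between aspirin and an aspirin-like analogue,
# using RDKit's default path-based fingerprints.
from rdkit import Chem, DataStructs

aspirin = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")
analogue = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)OC")  # methyl ester of aspirin

fp1 = Chem.RDKFingerprint(aspirin)
fp2 = Chem.RDKFingerprint(analogue)
print("Tanimoto similarity: %.2f" % DataStructs.TanimotoSimilarity(fp1, fp2))
```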

 

GDB15 is just out – corrected some bugs, eliminated enol ethers (due to quick hydrolysis), optimised CPU usage…approx. 26 billion molecules, 1.4 TB (counting them takes a day).

 

Applications of the Database – mainly GDB 11

Use case: Glutamatergic Synapse Binding

Used a Bayesian classifier trained on known actives to retrieve about 11,000 molecules from GDB11. This was followed by high-throughput docking – 22 compounds were selected for lab testing. Enrichment of glycine-containing compounds. Now showing some activity data for selected compounds.
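The classifier step can be mimicked in a few lines with RDKit fingerprints and scikit-learn’s naive Bayes. The sketch below is only a toy – the SMILES strings and labels are made up, and the talk did not specify the exact featurisation – but it shows the shape of the approach: train on known actives and inactives, then rank candidates by predicted probability.

```python
# Toy sketch of Bayesian-classifier screening: fingerprints + naive Bayes.
# SMILES strings and activity labels are made-up placeholders.
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from sklearn.naive_bayes import BernoulliNB

def featurise(smiles_list, n_bits=1024):
    """Morgan fingerprint bit vectors as a numpy matrix."""
    rows = []
    for smi in smiles_list:
        fp = AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smi), 2, nBits=n_bits)
        arr = np.zeros((n_bits,))
        DataStructs.ConvertToNumpyArray(fp, arr)
        rows.append(arr)
    return np.array(rows)

actives = ["NCC(=O)O", "NC(CC(=O)O)C(=O)O"]   # glycine- and aspartate-like placeholders
inactives = ["c1ccccc1", "CCCCCC"]            # benzene, hexane (placeholders)
clf = BernoulliNB().fit(featurise(actives + inactives), np.array([1, 1, 0, 0]))

candidates = ["NCCO", "CCOCC"]                # tiny stand-in for the GDB11 screening set
scores = clf.predict_proba(featurise(candidates))[:, 1]
for smi, score in sorted(zip(candidates, scores), key=lambda t: -t[1]):
    print("%.2f  %s" % (score, smi))
```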

Use case: Glutamate Transporter: applied certain structural selection criteria to the database molecules to obtain a subset of approx. 250k compounds. Again followed by HT docking. Now showing syntheses of some selected candidate structures together with screening data.

 

“Molecular Quantum Numbers”

Classification system for large compound databases. Draws an analogy to the periodic table, a classification system for the elements: we do not have something like this for molecules. Define features for molecules: atom types, bond types, polarity, topology… 42 categories in total. Now examines the ZINC database against these features: can show that there are common features for molecules occupying similar categories. PCA analysis: the first 2 PCs cover 70% of the diversity space; the first PC includes molecular weight… 2D representations are considered to be acceptable. PCA also shows nice grouping of molecules by number of cycles.
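As a rough sketch of this descriptor-plus-PCA idea, the snippet below computes a handful of simple RDKit counts as stand-ins for the 42 real MQN categories (and uses a few arbitrary molecules) before projecting onto two principal components – the flavour of the analysis rather than a reproduction of it.

```python
# Rough MQN-flavoured sketch: count-based descriptors per molecule, then PCA.
# The descriptor set here is a small illustrative stand-in, not the real 42 MQNs.
import numpy as np
from rdkit import Chem
from rdkit.Chem import Descriptors, Lipinski
from sklearn.decomposition import PCA

smiles = ["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O", "C1CCCCC1", "NCC(=O)O"]

def mqn_like(mol):
    # A few simple counts standing in for the full MQN vector
    return [
        mol.GetNumHeavyAtoms(),
        Descriptors.RingCount(mol),
        Lipinski.NumRotatableBonds(mol),
        Lipinski.NumHDonors(mol),
        Lipinski.NumHAcceptors(mol),
    ]

X = np.array([mqn_like(Chem.MolFromSmiles(s)) for s in smiles], dtype=float)
coords = PCA(n_components=2).fit_transform(X)
for smi, (pc1, pc2) in zip(smiles, coords):
    print("%-25s PC1=%6.2f PC2=%6.2f" % (smi, pc1, pc2))
```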

Same analysis for GDB11: the first PCs now mainly account for molecular flexibility and polarity (GDB11 doesn’t contain many rings due to the atom-count limitation).

Analysis for PubChem – difficult to discover information at the moment.

Was on the cover of ChemMedChem this November.

Shows examples of fishing out structural motif analogies for given molecular motifs.


Semantic Web Tools and Applications for Life Sciences 2009 – A Personal Summary


So another SWAT4LS is behind us, this time wonderfully organised by Andrea Splendiani, Scott Marshall, Albert Burger, Adrian Paschke and Paolo Romano.

I have been back home in Cambridge for a couple of days now and have been asking myself whether there was an overall conclusion from the day – some overarching bottom line that one could take away, and against which one could measure the talks at SWAT4LS2010 to see whether there has been progress or not. The programme consisted of a great mixture of longer keynotes, papers, “highlight posters” and highlight demonstrations, illustrating a wide range of activities at the interface of semantic web technology, computer science and biomedical research.

Topics at the workshop ranged from the analysis of the relationship between HLA structure variation and disease, applications for maintaining patient records in clinical information systems, and patient classification on the basis of semantic image annotations, to the use of semantics in chemo- and proteoinformatics, the prediction of drug–target interactions on the basis of sophisticated text mining, and even games such as Onto-Frogger (though I must confess that I somehow missed the point of what that was all about).

So what were the take-home messages of the day? Here are a few points that stood out to me:

  • During his keynote, Alan Ruttenberg coined the dictum of “far too many smart people doing data integration”, which was subsequently taken up by a lot of the other speakers – an indication that most people seemed to agree with the notion that we still spend far too much time dealing with the “mechanics” of data – mashing it up and integrating it, rather than analysing and interpreting it.
  • During last year’s conference, it already became evident that a lot of scientific data is now coming online in a semantic form. The data avalanche has certainly continued and the feeling of increased data availability, at least in the biosciences, has intensified. While chemistry has been lagging behind, data is becoming available here too. On the one hand, there are Egon’s sterling efforts with openmolecules.net and the solubility data project; on the other, there are big commercial entities like the RSC and ChemSpider. During the meeting, Barend Mons also announced that he had struck an agreement with the RSC/ChemSpider to integrate the content of ChemSpider into his Concept Wiki system. I will reserve judgement as to the usefulness and openness of this until it is further along. In any case, data is trickling out – even in chemistry.
  • Another thing that stood out to me – and I could be quite wrong in this interpretation, given that this was very much a research conference – was the fact that there were many proof-of-principle applications and demonstrators on show, but very few production systems that made use of semantic technologies at scale. A notable exception was the GoPubMed (and related) system demonstrated by Michael Schroeder, who showed how sophisticated text mining can be used not only to find links between seemingly unrelated concepts in the literature, but also to assist in ontology creation and the prediction of drug–target interactions.

Overall, many good ideas, but, as seems to be the case with all of the semantic web, no killer application as yet – and at every semweb conference I go to we seem to be scrabbling around for one. I wonder if there will be one and what it will be.

Thanks to everybody for a good day. It was nice to see some old friends again and make some new ones. Duncan Hull has also written up some notes on the day – so go and read his perspective. I, for one, am looking forward to SWAT4LS2010.
