What is a repository (and more importantly: what should it be)?

Every once in a while, things come together and one begins to understand stuff. I have had that experience over the last couple of days.
I spent much of last week trying to put together a white paper explaining what we do, in simple terms, to some managers. I have been through that exercise a couple of times now and so am quite comfortable explaining semantic technologies to chemists. So far so good. Recently, however, we have started working on molecular repositories – Peter has already mentioned this in a previous blog post. Now, of course, we do want to make use of them in polymer science, and so I sat down and tried to explain what a repository is. It turns out I got stuck.

I got stuck because I had to explain what a repository is from the point of view of functionality – when talking to non-specialist managers, that is the only way one can sell and explain these things… they do not care about the technological perspective… the stuff that’s under the hood. I found it impossible to explain what the value proposition of a repository is and how its functionality differentiates it from, say, the document management systems in their respective companies. I talked it over with some colleagues and they couldn’t come up with anything much either. As a matter of fact, we were even unable to come up with a definition of the word “repository” that would satisfy and help differentiate: it seemed a completely meaningless word.

Now today it struck me: a repository should be a place where information is (a) collected, preserved and disseminated, (b) semantically enriched, and (c) put, on the basis of that semantic enrichment, into relationships with other repository content to enable knowledge discovery, collaboration and, ultimately, the creation of knowledge spaces. And to me, this is what the repository of the future should be and how it should be defined.

Let’s take a concrete example to illustrate what I mean: my institutional repository provides me with a workspace, which I can use in my everyday scientific work. In that workspace, I have my collection of literature (scientific papers, other people’s theses etc.), my scientific data (spectra, chromatograms etc.) as well as drafts of the papers that I am working on at the moment. Furthermore, I have the ability to share some of this material with my colleagues and also to permanently archive data and information that I no longer require for current projects.
Now suppose I spend my morning working on a literature review, so I have my manuscript open and I am searching for and retrieving papers from journals and information from the web. As I retrieve this information, I submit it to my workspace. Before I submit, the workspace of course allows me to tag the information, either using my own tags or tags that are actually ontology terms, suggested to me as I type. Submission to the workspace triggers a whole chain of events: the paper I have just added gets submitted to a workflow, which parses it and discovers people, places, chemical information etc., and the relevant metadata is added to my document in the form of RDF. The document thus gets semantically enriched.

Now the system has discovered that the document I just submitted contains the terms “methacrylate”, “ATRP” and “synthesis”, and reasons that the paper is probably about a polymer synthesis. It therefore shows me other papers that talk about acrylates and radical polymerisation, drawn from my own workspace and from the workspaces of colleagues who have either shared this information with me or “befriended” me.

In the afternoon, I go into the lab and am finally able to make, in pure form, the compound that I have been working on for a while. I write up my experimental in my lab notebook and submit the data to the workspace. Again, a background procedure detects chemical entities, actions etc. and augments my document with the relevant metadata. My workspace already contains the NMR spectrum I ran earlier in the afternoon, so I can now simply cross-reference the write-up for the compound with the NMR spectrum and also annotate the spectrum right there and then in my workspace. Once I am satisfied with the assignments, I archive the compound and the data – I know that when it comes to writing this up, the system has a number of built-in templates which allow me, with the click of a button, to pull this data together and autogenerate an experimental section in the style of a particular paper. Furthermore, the system alerts me that someone in another group tried to prepare a very similar compound a while ago… hmm… maybe I should go and talk to that person, as I did encounter some difficulties during the synthesis. Finally, as I am happy with the purity of my compound and the associated data, I allow the system to expose the data on the web after an embargo period (say, until after the publication of the relevant data).

I could carry on with this scenario for quite some time, but the most salient point is this: repositories must not be roach motels; they need to make data do work for the scientist. And by doing so, almost as an afterthought, they will also fulfil all their traditional roles of collecting artefacts, preserving them and disseminating them.

Now all of the technologies for this are, in principle, in place: we have DSpace, for example, for storage and dissemination; natural language processing systems such as OSCAR3, or part-of-speech taggers, for entity recognition; RDF to hold the data; and OWL and SWRL for reasoning. And although the example here was chemistry-specific, the same thing should be doable for any other discipline. As an outsider, it seems to me that what needs to happen now is for these technologies to converge and integrate. Yes, it needs to be done in a subject-specific manner. Yes, every department should have an embedded informatician to take care of the data structures that are specific to a particular discipline. But the important thing is to just make the damn data do work!

D’oh solved…..

So the mystery didn’t last that long, and Peter solved the puzzle quickly (see the comments on the last post). Let me talk it through in my own words… what I expected, what I got, and why I got it (as I say, I am a Java newbie and it took some reading around to figure it out).

The program seems easy enough. First, a new set of Shorts is initialised, and then the program starts to iterate from i = 0 to i = 9 (as i < 10). During each iteration, i gets added to the set. So, first time round, i = 0 is added. With the next instruction, 0 - 1 equals -1, and as the set doesn't contain -1, nothing happens. Now the program goes round the loop again, adding 1 and removing 1 - 1 = 0, which leaves 1 in the set. Third time round, 2 is added and 1 is removed, leaving 2. So we only ever leave the most recent element in the set, and at the end of the loop the set should contain only 9. As the size of the set gets printed, and there is only one element, I expected 1 to be printed.

Imagine my surprise then when, instead of 1, I got 10. I was a bit flummoxed by this, but the answer is quite simple in the end: when elements get added to the set, they are added as Short objects, but what gets passed to remove() is of type Integer.

In essence, the expression i - 1 produces a result of type int… and Java autoboxes this into an Integer object (remember that sets cannot hold primitive values, only object references). Furthermore, Short and Integer objects, even if they contain the same value, do not compare as equal. And so if you have a set where Shorts are being added and Integers removed, well, nothing happens.
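The mismatch can be seen in isolation with two boxed values (a minimal sketch; the class name is just for illustration):

```java
public class BoxingCompare {
    public static void main(String[] args) {
        Short s = 1;   // the constant 1 narrows and boxes to a Short
        Integer i = 1; // here it boxes to an Integer
        System.out.println(s.equals(i));                  // false: different classes
        System.out.println(s.intValue() == i.intValue()); // true: same numeric value
    }
}
```

Short.equals() first checks that its argument is itself a Short, so an Integer holding the same value can never compare as equal.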

And why doesn't the compiler complain when I remove an Integer from a set of Shorts? A look at the javadoc for the Set interface solves that one: the add() method is declared as add(E), so the compiler enforces that only Shorts can be added to a Set<Short>. The remove() method, however, is declared as remove(Object), so it lets you attempt to remove anything from the set.
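A small sketch of what this asymmetry permits – every remove() call below compiles, but none of them touches the stored Short (the class name is my own):

```java
import java.util.HashSet;
import java.util.Set;

public class RemoveAnything {
    public static void main(String[] args) {
        Set<Short> set = new HashSet<Short>();
        set.add((short) 1); // add(E): only Shorts compile here
        set.remove(1);      // remove(Object): boxed to an Integer, removes nothing
        set.remove("1");    // even a String compiles -- and removes nothing
        System.out.println(set.size()); // prints 1: the Short is still there
    }
}
```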

So one way of avoiding this, as Peter has pointed out, is to use int rather than short; or, I guess, you could also cast the result of i - 1 back to short in the program.
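For illustration, the cast-based fix might look like this (a sketch; the names are my own, not the original program's):

```java
import java.util.HashSet;
import java.util.Set;

public class SetPuzzleFixed {
    public static void main(String[] args) {
        Set<Short> set = new HashSet<Short>();
        for (short i = 0; i < 10; i++) {
            set.add(i);                  // boxed as Short
            set.remove((short) (i - 1)); // cast back to short, so also boxed as Short
        }
        System.out.println(set.size()); // prints 1: only 9 remains
    }
}
```

With the cast in place, each iteration really does remove the previous element, and the set ends up holding only the final value.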

D’oh…..

Over on his blog, Peter sometimes likes to tease people with the odd puzzle or two. Over the Easter holidays, I was playing with Java Sets a bit and promptly fell into a trap. Now, as is usual with these things, particularly if you are somewhat of a Java “noob” like me, it took me a while to figure out what went wrong. And in the spirit of sharing the pain, I thought I should pose a puzzle too and blog this one… it might keep some of you amused. Here is the little toy program I was playing with at the time:

code.jpg

(Apologies for the code being in a pic – WordPress messes with the angle brackets.) Looks easy enough, right? And I promise you it compiles and runs. Now, without running it, what do you think it prints? Now run it. What does it print? Did you expect what it printed? (If so, congratulations… :-)) If not, what could be the reason?
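Since the code is only available as a picture, here is my best textual reconstruction of it (class and variable names are guesses; the logic follows the discussion in the accompanying posts):

```java
import java.util.HashSet;
import java.util.Set;

public class SetPuzzle {
    public static void main(String[] args) {
        Set<Short> set = new HashSet<Short>();
        for (short i = 0; i < 10; i++) {
            set.add(i);
            set.remove(i - 1);
        }
        System.out.println(set.size()); // what does this print?
    }
}
```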

What is a document?

The following is a verbatim re-post from Richard Cyganiak’s blog. I normally hate re-blogging other people’s content – however, this one is so funny, it deserves further attention:

QOTD: timbl is a document

Simon Spero on the SKOS list:

The meaning of “document” in this context is extremely broad; if we follow Otlet’s definition of a document as anything which can convey information to an observer, the term would seem to cover anything which can have a subject.

By this standard, timbl is a document, but only when someone’s looking.

Ah, the Semantic Web community! Please leave your common sense at the door …



And Amen to that!

Yahoo! has! announced! support! for! semantic! web! standards!

Well, this blog has remained dormant for far too long, as I got distracted by the “real world” (i.e. papers, presentations and grant proposals – not the real real world) after Christmas.

But I can’t think of a better way to start blogging again than to report that Yahoo! has just announced its support for semantic web technologies. To quote from their search blog:

The Data Web in Action
While there has been remarkable progress made toward understanding the semantics of web content, the benefits of a data web have not reached the mainstream consumer. Without a killer semantic web app for consumers, site owners have been reluctant to support standards like RDF, or even microformats. We believe that app can be web search.

By supporting semantic web standards, Yahoo! Search and site owners can bring a far richer and more useful search experience to consumers. For example, by marking up its profile pages with microformats, LinkedIn can allow Yahoo! Search and others to understand the semantic content and the relationships of the many components of its site. With a richer understanding of LinkedIn’s structured data included in our index, we will be able to present users with more compelling and useful search results for their site. The benefit to LinkedIn is, of course, increased traffic quality and quantity from sites like Yahoo! Search that utilize its structured data.

In the coming weeks, we’ll be releasing more detailed specifications that will describe our support of semantic web standards. Initially, we plan to support a number of microformats, including hCard, hCalendar, hReview, hAtom, and XFN. Yahoo! Search will work with the web community to evolve the vocabulary framework for embedding structured data. For starters, we plan to support vocabulary components from Dublin Core, Creative Commons, FOAF, GeoRSS, MediaRSS, and others based on feedback. And, we will support RDFa and eRDF markup to embed these into existing HTML pages. Finally, we are announcing support for the OpenSearch specification, with extensions for structured queries to deep web data sources.

We believe that our open approach will let each of these formats evolve within their own passionate communities, while providing the necessary incentive to site owners (increased traffic from search) for more widespread adoption. Site owners interested in learning more about the open search platform can sign up here.

I have had many discussions with people over the past year or so concerning the value of the semantic web approach, and some of the people I talked to have been very vocal sceptics. However, the results of some of our work in Cambridge, together with the fact that no matter where I look the semantic web is becoming more prominent, have convinced me that we are doing the right thing. It was pleasing to see that the Semantikers were out in force even at events such as CeBIT, which has just ended. Twine seems to be inching towards a public beta now, and the first reviews of the closed beta are being written, even if they are mixed at the moment. Reuters recently released Calais as a web service, which uses natural language processing to identify people, places, organisations and facts, and makes the metadata available as RDF constructs. So despite all the scepticism, semantic web products are starting to be shipped, and even the mainstream media are picking up the idea of the semantic web – with the usual insouciance, but it nevertheless seems to be judged newsworthy. Academic efforts are coming to fruition too.

Maybe I am gushing a little now. But it seems to me that Yahoo! now lending its support could be a significant step forward for all of us. And sometimes it is just nice to know that one is doing the right thing.