Do we really need discovery metadata?
Many of the projects we are involved in and interact with are about systematising metadata for scientific and other scholarly applications. There are several sorts of metadata; I include at least rights, provenance, semantics/format, and discovery. I’ll go along with the first three - I need to know what I can do with something, where it came from and how to read and interpret it. But do we really need discovery metadata?
Until recently this has been assumed as almost an axiom - if we annotate a digital object with domain-specific knowledge then it should be easier to find and there should be fewer false positives. If I need a thesis on the synthesis of terpenes then surely it should help if it is labelled “chemistry”, “synthesis” and “terpenes”. And it does.
But there are several downsides:
- It’s very difficult to agree on how to structure metadata. This is mainly because everyone has a (valid) opinion and no-one can quite agree with anyone else. So the mechanism involves either interminable committees in “smoke-filled rooms” (except without the smoke) or self-appointed experts who make it up themselves. The first is valuable if we need precise definitions and possibly controlled vocabularies but is not normally designed for discovery. The second, as is happening all over the world, leads to collisions and even conflict.
- Authors won’t comply. They either leave the metadata fields blank, make something up to get it over with, or simply abandon the operation.
- It’s extremely expensive. If a domain expert is required to reposit a document it doesn’t scale.
So is it really necessary? If I have a thesis I can tell without metadata, just by looking, whether it’s about chemistry (whatever language it’s in), whether it has synthesis and whether it contains terpenes. And so can Google. I just type “terpene synthesis” and every result on the first page is about terpene synthesis.
The point is that indexing full text (or the full datument) is normally sufficient for most of our content discovery.
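To make the idea concrete, here is a minimal sketch of full-text indexing in Python. It is not how Google or Lucene actually work; it is a toy inverted index, and the document ids and texts are invented for illustration. It has no stemming, so “terpenes” would not match “terpene”.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lower-case the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Map each term to the set of document ids whose full text contains it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical documents: no discovery metadata, just the text itself.
docs = {
    "thesis-1": "Total synthesis of a terpene natural product",
    "thesis-2": "Crystal structures of zeolites",
    "thesis-3": "Terpene biosynthesis in conifers",
}

index = build_index(docs)
print(sorted(search(index, "terpene synthesis")))
```

Nobody labelled thesis-1 with “chemistry”, “synthesis” or “terpenes”; the query finds it from the words in the text alone.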
Peter Corbett has implemented Lucene, a free-text indexer, and done some clever things with chemistry and chemical compounds. That means his engine is now geared up to discover chemistry on the Web from its content. I’ll speculate that it’s more powerful than the existing chemical metadata…
So it was great to see on Peter Suber’s Blog (can’t stop blogging it!): Full-text cross-archive search from OpenDOAR
I agree! Let’s simply archive our full texts and full datuments. We’ll never be able to add metadata by human means - let the machines do it. And this has enormous benefits for subjects like chemistry - Peter’s OSCAR3 can add chemical metadata automatically in great detail.
So now all we want is chemistry theses… and chemistry papers … in those repositories. You know what you have to do…
This entry was posted on Thursday, October 26th, 2006 at 9:50 pm and is filed under chemistry, open issues.
Well, err, yes and no. Certainly Google is well known for ignoring keywords put in the meta tags at the top of webpages in favour of indexing on the text itself - and a good semantic markup to turn a document into a datument will give the indexing algorithm plenty to work on. In short, losing the distinction between discovery metadata and semantic data won’t hurt. However…
…natural language processing won’t get you 100%. In fact, you’ll be lucky to get near 90% under many circumstances. This is often better than nothing but if you can involve humans in a hybrid annotation then you’ll get better results, especially if one of those is the author.
Google has another source of metadata too - other people’s link text. For example, consider the famous googlebomb “miserable failure”. Apparently “these results may seem politically slanted” - googleblog tells all.
I suspect that at least part of the appeal of discovery metadata comes from the closed-access (and ultimately dead-tree) publishing world. If you can’t let the index have the full text then you’ve got to pull pertinent things out of the text to give the index something to go on. Open access on the web doesn’t have this problem.
(1) Of course you are right, Peter. But, as you say, 90% is better than 0%. And 0% is what a lot of activities currently get.
I have no idea what percentage of DOAR documents have links. If they are HTML they have a chance of being useful (cows). If they are PDF then they probably have no links (hamburgers). Maybe someone from SHERPA can tell us. If they *do* have links, then we get a real chance to build a scholarly linkbase and I am sure there is Open software which can do some exciting things with this. It’s a different problem from general Google as there are (I hope) no academic spammers of repositories!!
[...] Why is this important for me in the data/repositories world? Well, it’s becoming increasingly clear that getting search engines to harvest metadata, or getting them to crawl metadata-only splash pages, doesn’t work, and we should be directing them straight to the full text if we want our content to be indexed effectively. Sitemaps allow us to build search-optimized representations of our content (this applies double for data without a default textual representation) and point the engines straight to them. [...]
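As a rough illustration of the sitemap idea in the excerpt above, the following Python sketch writes a minimal sitemap that lists full-text URLs directly rather than splash pages. The repository URLs are hypothetical; the `urlset`/`url`/`loc` elements and the namespace are those of the public sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls):
    """Return sitemap XML (as a string) with one <url><loc> entry per URL."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for address in urls:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = address
    return ET.tostring(urlset, encoding="unicode")


# Hypothetical full-text locations in a repository (not real URLs).
full_text_urls = [
    "https://repository.example.org/theses/1234/fulltext.html",
    "https://repository.example.org/theses/5678/fulltext.html",
]

sitemap_xml = build_sitemap(full_text_urls)
print(sitemap_xml)
```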