Unilever Centre for Molecular Informatics
 

Peter Corbett

Teaching computers to read chemistry papers

 

Archive for September, 2007

To annotate or not to annotate?

Friday, September 21st, 2007

Hal Daume III writes about whether manual annotation is a bad idea or not. The post is a follow-up to an earlier post, in response to an article in Computational Linguistics (well worth reading, if you can get to it).

The argument goes like this: NLP work needs examples of the sorts of outputs that it is to produce, to use as training and evaluation data. In some cases (for example machine translation), there are examples of this “in the wild” that have been created for purposes other than NLP work. In others (for example part-of-speech tagging) such resources are unlikely to be directly useful to anyone but NLP researchers and so we have to create our own rather than finding ones that have been created for other purposes. Thus we end up with a set of annotated corpora, treebanks etc.

The trouble with annotated corpora is that they are a source of error and oversimplification. Language is complex, and annotation is hard. If you give naive annotators minimal guidelines to work with (“go and mark up all of the genes in these abstracts”), you get a lot of variation in what people mark up. You can reduce this by creating extensive guidelines and doing repeated inter-annotator agreement studies until your metrics are good, but that only proves that your guidelines are coherent; it doesn’t prove that they are well-correlated with anything of direct interest to NLP consumers. Furthermore, doing lots of work on annotation doesn’t fit into people’s research agendas, and so the work is often done in a quick and sloppy manner. These factors conspire to produce corpora that are full of quirks, oddities and oversimplifications, and a bunch of NLP systems that end up being overoptimised for those malannotations and underoptimised for the phenomenon you were actually trying to capture.
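For readers who haven’t met inter-annotator agreement metrics, here is a minimal sketch of one common choice, Cohen’s kappa, which corrects raw agreement between two annotators for the agreement you would expect by chance. The function name and the toy GENE/O token labels are made up for illustration; nothing here comes from a particular corpus or annotation study.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    labels_a and labels_b are equal-length sequences of category labels.
    """
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both pick the same label if each
    # annotator labelled at random according to their own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical token-level annotations: GENE = part of a gene name, O = other.
annotator_1 = ["GENE", "O", "O", "GENE", "O", "O"]
annotator_2 = ["GENE", "O", "GENE", "GENE", "O", "O"]
kappa = cohen_kappa(annotator_1, annotator_2)
```

A high kappa only says that the two annotators applied the guidelines consistently, which is exactly the limitation described above: it says nothing about whether the guidelines capture what an NLP consumer cares about.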

I’ve perpetrated (with collaborators) an extensive set of annotation guidelines (for named-entity recognition) myself. We put a lot of work into them, and when they were published, I was pleasantly surprised by how well-received the paper was (I was steeling myself for the work being perceived as uninteresting). The inter-annotator agreement is pretty good, but how good are the guidelines really? There are still parts of them that make me wince, and yet I don’t want to revise them because reversing the decisions we made would make me wince harder. Having some framework where I could make variations in the annotation guidelines, re-annotate a corpus and see how it affects a task which can be evaluated against some ‘wild’ data would be very helpful in evaluating the guidelines.

The unspoken assumption in all of this worrying about manual annotation-driven NLP is that wild data is better conceived than our annotation guidelines. I’m not convinced that this is necessarily true in all cases. In particular, at various times I’ve had cause to look at various ontologies (and related resources) - many of which were not explicitly designed with NLP work in mind - and there are a lot of compromises and strange ideas there too. See various parts of this blog passim for examples.

I suppose that at this point I have to trot out the tired old cliche that the formalisation of knowledge is hard. But why is this? Is it just that people are hopelessly confused, or is it that informal knowledge is surprisingly powerful in ways that we do not (yet) understand? And if the latter, is it possible to create formalisms that capture the useful properties of informal knowledge, and use that to get around our problems?

Another appearance

Tuesday, September 18th, 2007

I’ll be talking about aspects of chemical information that impinge on the Semantic Web (or if you prefer, the semantic web), at:

SciComp@Cam, Thurs 20th Sept 2007 17:15-19:15 at the Centre for Mathematical Sciences on Wilberforce Road, Cambridge.

There will also be presentations from Matt Wood (Sanger Institute) on microformats and Renato Golin (EBI-EMBL) on RDF.

More information can be found at: http://scicomp.org.uk

Silly name of the day

Monday, September 17th, 2007

octopamine

Octopamine. From the name you’d expect eight of something to be involved - mostly when you see ‘oct’ in a name, it’s a clue to the structure. And indeed the formula is C8H11NO2, eight carbons. But they’re not in a nice straight chain, so maybe it’s just a coincidence - and indeed it is. What there really is is eight legs - and not on the structure itself. It’s a compound that was first isolated from octopuses (octopi? octopodes?), hence the name.

The ‘amine’ bit really does imply organic nitrogen, on the other hand.

The shrinking horizons of $FIELD

Monday, September 17th, 2007

The Shrinking Horizons of Computational Linguistics, Ehud Reiter, Computational Linguistics 2007 (33), 283-287.

Reiter’s thesis is that CL is more insular than it was ten years ago. The evidence: authors of papers for ACL journals and conferences cite more NLP-oriented work than they used to, and less linguistics, psychology and HCI work. Also they cite more mathematics and statistics. Furthermore there is an essentially separate community of research on language-oriented topics that publishes in places such as Artificial Intelligence that cites much more work from the broader language community.

This is very familiar to a chemist with biological leanings. There seems to be a process whereby interdisciplinary work becomes a discipline of its own, which acquires a name, and becomes heavily focussed on a narrow subset of the overlap between the two (or more) parent disciplines. The original Chemistry/Biology crossover was organic chemistry - however after people realised that complex carbon compounds weren’t necessarily the product of biological processes, the name got applied to carbon chemistry. Therefore the chemistry-biology interface gets called “biochemistry”, which acquires a focus on proteins, so you end up getting “molecular biology” which is more related to genetics, and also “biological chemistry”, “chemical biology”, “bioorganic chemistry” (which tend to be more focused on small molecules, unnatural amino acids/nucleotides, and more likely to occur in chemistry departments) and even “bioinorganic chemistry” (which, due to the processes of linguistic evolution, isn’t an oxymoron).

There are advantages to creating these fields with fixed research agendas; they’re very convenient because you can have specialised journals, conferences, peer review processes, courses, textbooks - everything you need to do research all under one roof. Job applications are made easier - you can put the appropriate field name on your CV rather than a couple of sentences of waffle. People working in the field can converge on a single language in which to describe what they’re working on, and all of the pains of genuinely interdisciplinary research melt away. Why get reviews from a statistician, a chemist, a biologist and a computer scientist when you can get reviews from a bunch of QSAR people? Finally, not having to pander to everyone’s worldview lets you get on with developing your field nice and efficiently, allowing you to find pragmatic and empirically-justifiable techniques that work, rather than theory-laden techniques that don’t. There’s an old quote, “every time I fire a linguist our recognition rate goes up”, and there’s a lot of truth to it.

There are several dangers to all of this. The first is that other interdisciplinary areas get neglected. Now I seem to be making a career out of research in neglected cross-disciplinary areas, and there are definite problems (there are also big advantages - there’s much less danger of getting scooped!). The second is gradual loss of outside ideas to fertilise the field. The third is a focus on doing the same old things again and again, making slight incremental improvements rather than trying out bold new ideas (these can be a pain to publish, as some reviewers will knee-jerk reject papers they don’t understand. OTOH some reviewers will evidently knee-jerk accept papers they don’t understand, which is the only explanation I can think of for how some papers can get published). The fourth is an obsession with artificial or artificially constrained problems, where the existence of a named scientific community means that things that are plainly irrelevant to the wider scientific context can maintain their momentum due to a subcommunity of researchers who all cite each other. The fifth - and I make this accusation hesitantly, but I have heard it made a lot more boldly - is that named fields collect mistakes, misconceptions and bad practices. In particular, it seems to be common to accuse various fields of having endemically bad statistics. This can occur because the field does not collaborate with statisticians, the papers are not reviewed by statisticians, and so faulty norms and practices can be kept within the field and away from the people with the background to critique them.

Finally, there’s the danger that your work gets lost to the wider scientific community. People doing the sort of work that gets published in Artificial Intelligence don’t read ACL journals or go to ACL conferences, because it’s not relevant to them. There is an obvious vicious circle here - the people who don’t read ACL journals/conferences don’t publish there either, the field gets less and less appealing to outsiders, and insiders lose more and more of their links to the larger context.

So how do we fight back against this, to connect a hybrid discipline back to its parents? And is it even worth trying? Here, unfortunately, is where I don’t have the answers…

What is it with BioCyc in PubChem??

Tuesday, September 11th, 2007

I’ve had cause to look at and use PubChem quite a lot, and the basic idea - lots of chemical databases all under one roof and accessible to the public - isn’t a bad one. However, there are some deficiencies, especially with names from, ah, certain sources. BioCyc seems to come up quite a lot here - you can see what PubChem has of theirs with this query. The names in BLOCK CAPITALS in particular seem to be associated with quite a lot of strangeness.

In particular, the following are a source of amusement: this, this, this, this and this. But there are less exciting names too. You get cases where all of the locants lose their commas (such as 1235-TETRAHYDROXYBENZENE), oddly abbreviated things like TOLUENECISDHDIOL, and names that just stop prematurely like BETA-HYDROXYANDROST-5-EN-17-ONE-3-SULFAT.

Then again, not every strange thing in PubChem is attributable to BioCyc: for example, this and this.

And now for something completely different…

Monday, September 10th, 2007

doi:10.1016/j.polymer.2007.05.018

Computational linguistics: A new tool for exploring biopolymer structures and statistical mechanics

Ken A. Dill, Adam Lucas, Julia Hockenmaier, Liang Huang, David Chiang and Aravind K. Joshi

Some anonymous (chemistry) labmate put this on my desk. These people have the thesis that computational linguistics techniques can be adapted to help with the problem of folded biopolymers such as proteins and RNA. In particular, they have an algorithm called ZAMDP (Zipping and Assembly Mechanism by Dynamic Programming), which is described as being like the CKY algorithm for chart parsing, and which they use to predict a subset of RNA folds. Due to the limitations of the CFGs that underlie the technique, it can’t handle structures such as pseudoknots. They therefore end with a discussion of some of the more interesting formalisms used within the computational linguistics community that sit in between context free and context sensitive grammars; in particular they mention Tree Adjoining Grammars. At this point I am obliged to jump up and down and shout, “HPSG! Typed feature structures” out of pure project-based favouritism.
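To give a flavour of what “like CKY, but for RNA” can mean, here is a small sketch of a chart-style dynamic programme over sequence spans. To be clear, this is not ZAMDP, which I am not attempting to reproduce; it is the much simpler textbook approach of maximising the number of nested base pairs (Nussinov-style), with a made-up function name and a `min_loop` parameter that I have assumed for the minimum hairpin loop size. Like a CFG parse, it can only build nested structures, so pseudoknots (crossing pairs) are out of its reach.

```python
# Base pairs allowed in this toy model: Watson-Crick plus the G-U wobble pair.
ALLOWED_PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_nested_base_pairs(seq, min_loop=3):
    """Maximum number of nested (non-crossing) base pairs in an RNA sequence.

    best[i][j] holds the answer for the span seq[i..j], filled in order of
    increasing span length, just as a CKY chart is filled for a sentence.
    min_loop is the minimum number of unpaired bases enclosed by a pair.
    """
    n = len(seq)
    if n == 0:
        return 0
    best = [[0] * n for _ in range(n)]

    for span in range(min_loop + 1, n):
        for i in range(n - span):
            j = i + span
            # Case 1: position j is left unpaired.
            score = best[i][j - 1]
            # Case 2: position j pairs with some k, splitting the span into
            # the part to the left of k and the part enclosed by (k, j).
            for k in range(i, j - min_loop):
                if (seq[k], seq[j]) in ALLOWED_PAIRS:
                    left = best[i][k - 1] if k > i else 0
                    score = max(score, left + 1 + best[k + 1][j - 1])
            best[i][j] = score

    return best[0][n - 1]


# A short hairpin: a stem of three pairs closing a loop of three bases.
hairpin_pairs = max_nested_base_pairs("GGGAAAUCC")
```

The cubic loop over (span, start, split point) is the same shape as CKY over (span, start, split point) for a binarised grammar, which is the resemblance the paper is trading on.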

This all looks interesting - although none of it is quite enough within my subspecialisms for me to judge it too closely. These days, most of the stuff I work directly on is finite state, but Real Soon Now I should be getting direct experience with the outputs of UBPSG and HPSG-based parsers. Also, in the biochemistry domain - well, I can remember a few things about folds from second year biochemistry, and I can see how the interactions in parallel beta sheets relate to those strange cross-serial sentences in that dialect of Swiss German, but I’ve not actually done any work on modern fold prediction.

Forthcoming appearances

Wednesday, September 5th, 2007

In November, I’ll be at the German Conference on Chemoinformatics, to talk about some of my work, with particular reference to the relationships between chemical structure and scientific text.