Hal Daume III writes about whether manual annotation is a bad idea or not. The post is a follow-up to an earlier post, in response to an article in Computational Linguistics (well worth reading, if you can get to it).
The argument goes like this: NLP work needs examples of the sorts of outputs that it is to produce, to use as training and evaluation data. In some cases (for example machine translation), such examples exist “in the wild”, having been created for purposes other than NLP work. In others (for example part-of-speech tagging), such resources are unlikely to be directly useful to anyone but NLP researchers, and so we have to create our own rather than finding ones that have been created for other purposes. Thus we end up with a set of annotated corpora, treebanks, etc.
The trouble with annotated corpora is that they are a source of error and oversimplification. Language is complex, and annotation is hard. If you give naive annotators minimal guidelines to work with (“go and mark up all of the genes in these abstracts”), you get a lot of variation in what people mark up. You can reduce this by creating extensive guidelines and doing repeated inter-annotator agreement studies until your metrics are good, but that only proves that your guidelines are coherent; it doesn’t prove that they are well-correlated with anything of direct interest to NLP consumers. Furthermore, doing lots of work on annotation doesn’t fit into people’s research agendas, and so the work is often done in a quick and sloppy manner. These factors conspire to produce corpora that are full of quirks, oddities and oversimplifications, and a bunch of NLP systems that end up being overoptimised for those malannotations and underoptimised for the phenomenon you were actually trying to capture.
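For readers who haven’t met inter-annotator agreement metrics, a common one is Cohen’s kappa, which corrects raw agreement between two annotators for the agreement you would expect by chance. What follows is only a minimal sketch: the function name, the `GENE`/`O` tag set and the toy labels are all made up for illustration, and real studies often use other metrics or more than two annotators.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability that both pick the same label if each
    # annotator labels at random according to their own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical token-level tags from two annotators (illustrative data only).
annotator_1 = ["GENE", "O", "O", "GENE", "O", "O", "GENE", "O"]
annotator_2 = ["GENE", "O", "GENE", "GENE", "O", "O", "O", "O"]

kappa = cohen_kappa(annotator_1, annotator_2)
```

Here the two annotators agree on 6 of 8 tokens (0.75), but chance agreement is already 34/64, so kappa comes out at about 0.47. Note that a high kappa would still only tell you that the annotators are consistent with each other, not that what they are marking up is useful.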
I’ve perpetrated (with collaborators) an extensive set of annotation guidelines (for named-entity recognition) myself. We got to put a lot of work into them, and when they were published, I was pleasantly surprised by how well-received the paper was (I had been steeling myself for the work being perceived as uninteresting). The inter-annotator agreement is pretty good, but how good are the guidelines really? There are still parts of them that make me wince, and yet I don’t want to revise them because reversing the decisions we made would make me wince harder. Having some framework where I could make variations in the annotation guidelines, re-annotate a corpus and see how that affects a task which can be evaluated against some ‘wild’ data would be very helpful in evaluating the guidelines.
The unspoken assumption in all of this worrying about manual annotation-driven NLP is that wild data is better conceived than our annotation guidelines. I’m not convinced that this is necessarily true in all cases. In particular, at various times I’ve had cause to look at ontologies and related resources, many of which were not explicitly designed with NLP work in mind, and there are a lot of compromises and strange ideas there too. See this blog passim for examples.
I suppose that at this point I have to trot out the tired old cliche that the formalisation of knowledge is hard. But why is this? Is it just that people are hopelessly confused, or is it that informal knowledge is surprisingly powerful in ways that we do not (yet) understand? And if the latter, is it possible to create formalisms that capture the useful properties of informal knowledge, and use them to get around our problems?
