Unilever Centre for Molecular Informatics
 

Peter Corbett

Teaching computers to read chemistry papers

 

What’s chloroform?

My life as a postdoc is almost up, and so this blog will soon be reaching some sort of conclusion. Still, there’s some time to write a post or two that I’ve been meaning to write for a while.

In my field, if you want to make progress, you end up having to look at papers and come up with formalisms for what people are trying to say. This isn’t easy, partly due to language being complicated, partly due to the subject matter being complicated, and partly due to the authors’ mental model of the subject matter being complicated. I think this latter concern is potentially quite interesting. This is probably best illustrated through example.

One of the papers we did was Pyridines, Pyridine and Pyridine Rings: Disambiguating Chemical Named Entities, in which we showed how the word “pyridine” can have at least three different meanings - referring to a whole, precise compound, a class of compounds, and a part of a molecular structure of a compound - and how this was a very general process that was widespread throughout nomenclature. We came up with a formalism for this, a set of annotation guidelines, an inter-annotator agreement study, and a classifier. There were also some things we hinted at in the introduction, but didn’t have the time or energy to follow up.

One of these is the bulk substance/molecule distinction. Consider chloroform. If I ask someone what chloroform is, chances are they’ll say something like “trichloromethane” or “CHCl3”, and be satisfied. I’d like to propose - somewhat cautiously, I’ll discuss my doubts later - another, somewhat circular definition: “the liquid found in bottles legitimately marked chloroform”. I say “legitimately” because old chloroform bottles have a number of uses beyond that of storing chloroform. Really you ought to black out the name or cover it over entirely, but sometimes a bit of scribbling will do to cancel the implication that the bottle is a bottle of chloroform while leaving the markings intact. Or maybe you were in a rush and forgot to mark the bottle as no longer a chloroform bottle: very naughty, but I’m pretty sure it has happened.

Partly this distinction arises because there are various contexts that involve chloroform molecules and not bulk chloroform. For example, a crystal structure could include a chloroform molecule or two per unit cell, or you could have an entirely hypothetical chloroform as part of a simulation or a reaction mechanism you’ve drawn, or chloroform as a solute. In situations like that, the properties of chloroform listed on Wikipedia (colourless liquid, 1.48 g/cm3, m.p. -63.5 °C, b.p. 61.2 °C, refractive index (nD) 1.4459, etc.) don’t apply. They do apply to bulk chloroform.

Let me tell you a story about chloroform. My recollection of some of the details may be wrong, but the essential truth isn’t affected by these. In my previous research group, a student was doing experiments where she would dissolve some materials in chloroform, leave them stirring for a few days, take a sample, and analyse it by mass spectrometry (MS) to see what compounds had been formed in the reaction and in what ratios. These experiments seemed to be repeatable and well behaved: small variations in the initial conditions gave small variations in the results, and other people had done similar experiments and achieved similar results. It all looked good. She left that bit of research to do something else, and came back a few months later. The results were not reproducible. I can’t remember whether it was simply a case of the product ratios being very different, or entirely new products being produced, or whether the products came out of solution. Whichever way, the nice repeatable experiment was not reproducible several months later.

She checked all sorts of things to see what she was doing differently, and eventually found an answer. It turns out that there are at least two different grades of chloroform in the lab. “Bench chloroform” (I forget its proper designation, probably “technical grade”) is reasonably cheap, reasonably pure, and is commonly used by synthetic organic chemists as a solvent for their reactions, their purification procedures, and for thin layer chromatography - a quick, cheap and cheerful analytical technique. “Analytical chloroform” is purer, more expensive (but on the scales of the experiment, this extra expense was not significant), and is typically used by analytical chemists in techniques that use expensive machines to make accurate measurements. This distinction is common to many solvents. In the case of chloroform, however, the problem is not just one of random accidental impurities that are removed with varying levels of care. In chloroform, one impurity is deliberately added. Quoth Wikipedia:

During prolonged storage hazardous amounts of phosgene can accumulate in the presence of oxygen and ultraviolet light. To prevent accidents, commercial chloroform is stabilized with ethanol or amylene…

This is not quite our experience, but very close. She found that bench chloroform was stabilised with methanol (quite dissimilar to chloroform in properties), and that analytical chloroform was stabilised with a smaller quantity of amylene (rather more chloroform-like and innocuous). Her earlier work had used bench chloroform, her later work had used analytical chloroform. To confirm this, she added a little methanol to analytical chloroform, and used that as her solvent, and the experiments started working again, just as they always had done when using unadulterated bench chloroform. This artificial bench chloroform was always written up as “chloroform + 0.x% methanol” (I forget the value of x), but was quite possibly purer chloroform than the chloroform that gets written up as “chloroform”.

So, these impurities in our substances exist, but are often - but not always - unimportant. What are the linguistics of this? I could take this in one of two ways. Consider a hypothetical paper that might say “Compound 12 was dissolved in chloroform”. This statement is acceptable and would pass peer review, even though the reviewer knows that chloroform is never absolutely pure. However, if the reviewer happened to know that compound 12 had actually been dissolved in a 1:1 mixture of chloroform and methanol, he would be right to reject the paper, and probably to accuse the author of fraud. Nevertheless, usually there’s only a very little methanol involved, so we need some way of being able to say that the statement is true whilst still maintaining a clean conscience.

One approach is what I call the lexical approach: to say that the word “chloroform” itself has a sense that means “bulk chloroform”, and that that sense includes an acceptable level of impurity. The other approach, which I might slightly loosely term the pragmatics approach, is to say that “chloroform” really does mean just the collection of CHCl3 molecules, and that the statement implicitly includes “and some impurities that don’t make a difference”. This can be taken as read because of context: partly the general context of scientific research, and partly the presence of a paragraph earlier in the experimental section that says where you bought your solvents from and what grade they were. (This paragraph is not always present. If it is, it is commonly cut-and-pasted from the last person in the group to do similar work, and may not be accurate and up to date; even if it is, it may be somewhat vague and give a list of suppliers that doesn’t precisely say which supplier supplied which chemical. There are also issues with storage times, methods of drying, etc., which may affect the precise composition of the solvents.)

I’m inclined to prefer the lexical view; I find it hard to say, for example, that “ultrapure water” is less pure than plain unmarked “water”. Also, there are specific impurities that emerge from handling and storage; thus there is often a need for “dry chloroform” and “degassed chloroform”, which may yet have tiny amounts of residual water and/or oxygen in them. OTOH, I think you’d say that the impurities are in the chloroform, rather than a part of the chloroform, so maybe my linguistic intuitions can’t make their mind up.

Oddly, if you take the lexical view, then what counts as “chloroform” isn’t determined by a purity threshold; bench chloroform is chloroform, whereas if you take analytical chloroform, carefully distill it to remove the amylene, and then carefully add a tiny amount of an additive, that doesn’t count as “chloroform”, despite being purer than bench chloroform. The end result of this is that the word “chloroform” isn’t just ambiguous (a trivial matter, one that can easily be resolved), but also at least one of the senses is vague. Well, perhaps not very vague, but not 100% free from vagueness either. To a certain extent the vagueness is also reduced by the fact that the manufacturers publish specifications for their various grades of chloroform. I’d have to do some research to find out how consistent they were.
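To make the point about thresholds concrete, here is a toy Python sketch. Everything in it is an assumption made up for illustration: the `Sample` class, the purity figures, the grade names and the two predicate functions are hypothetical and do not come from any real supplier specification or from our annotation work. It only shows that if a purer, hand-doctored sample is excluded while bench chloroform is included, then no single purity cut-off can reproduce the lexical judgement.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Sample:
    description: str
    chcl3_fraction: float          # illustrative purity figure, not measured data
    supplier_grade: Optional[str]  # grade on a legitimately marked bottle, else None


# Hypothetical set of grades that a chemist would accept as plain "chloroform".
RECOGNISED_GRADES = {"bench", "analytical"}


def is_chloroform_by_threshold(sample: Sample, threshold: float) -> bool:
    """Purity-threshold reading: anything pure enough counts."""
    return sample.chcl3_fraction >= threshold


def is_chloroform_lexically(sample: Sample) -> bool:
    """Lexical reading: it counts if it is a recognised bulk grade."""
    return sample.supplier_grade in RECOGNISED_GRADES


# Made-up numbers, chosen only so that the doctored sample is purer than bench.
bench = Sample("bench chloroform, methanol-stabilised", 0.990, "bench")
analytical = Sample("analytical chloroform, amylene-stabilised", 0.999, "analytical")
doctored = Sample("analytical, redistilled, trace additive added", 0.998, None)
samples = [bench, analytical, doctored]


def threshold_agrees_with_lexical(threshold: float) -> bool:
    """True if the cut-off gives the lexical verdict on every sample."""
    return all(
        is_chloroform_by_threshold(s, threshold) == is_chloroform_lexically(s)
        for s in samples
    )


# Try a spread of cut-offs; none of them matches the lexical verdicts.
candidate_thresholds = [0.980, 0.990, 0.995, 0.9985, 0.9995]
agreeing = [t for t in candidate_thresholds if threshold_agrees_with_lexical(t)]
print("thresholds that agree with the lexical view:", agreeing)
```

Any cut-off low enough to admit the bench sample also admits the doctored one, which is all the sketch is meant to show. It says nothing about which of the two readings is the right account of the word.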

There is also a set of semi-related issues to do with isotopes, but I’ll gloss over them.

The big upshot of all of this is that the world of chemicals is in many ways very close to being a nice well-defined reality, but not quite there completely, and occasionally the inherent vagueness in using words (either in describing your model of reality, or in describing how your well-defined (quasi-)mathematical model relates to reality) leaks through and causes disturbances, but only very rarely.

Now if I were feeling perverse, I would rush out and say “chloroform is socially constructed” and present this as some great discovery. Obviously I don’t want to do this. From where I’m sitting, at first glance it looks like “X is socially constructed” is code for “X is an ill-founded notion of something that doesn’t really exist, we’re better off without the concept of X, the people who talk about X aren’t doing anything related to the truth and the concept of X only exists to oppress people”, and well, in the case of chloroform in particular and science in general, I disagree very strongly. I say “at first glance”, because I’m very aware that there are whole areas of research and study that get systematically misrepresented - for example see Bad Science or Language Log throughout. It may yet be the case that the notion of social construction that I encounter is a strawman and/or a distortion and that there may yet be a community of people who are prepared to talk about social constructs which may nevertheless be useful, (close to) true and generally a good thing. The fact that I haven’t knowingly met them yet is merely evidence, not proof, for the conjecture that they don’t exist. Either way, it’s not really a very satisfactory situation to be in. I’d like to be able to talk - critically at times - about the structure of scientific knowledge (as implicitly revealed by people’s use of language?) without worrying about getting caught up in a stupid game of “Gotcha!”, and it’s not as easy as it should be.

Anyway, it seems to me that the sort of activities that people like me do as domain experts for computational linguistics has consequences beyond that of merely being able to measure and improve our F scores (and that trying to get and improve F scores gives us a useful perspective on these things), and it would be nice to be able to publish somewhere other than within the BioNLP literature, which at times feels like a ghetto within a ghetto. However, my working arrangements are about to change very soon, so this post is more a regret post than anything else. Oh well…
