Unilever Centre for Molecular Informatics
 

Peter Corbett

Teaching computers to read chemistry papers

 

Oscar3-a5

June 26th, 2009

The files for oscar3-a5 are now up on sourceforge. This is the last release on my watch, but not the last release forever, as among other things OMII are working on the project, and I’m sure Oscar3 will have a future beyond even that. Anyway, new in alpha 5:

  • Extensive improvements to OPSIN, thanks to Daniel Lowe
  • “subtypes” - is a chemical name referring to a specific whole compound, a class of compounds, or part of a compound?
  • A new, improved method of using chemical names from PubChem for name resolution
  • Lots of general code clean up, bug fixes, etc.
  • And more…

Anyway, this effectively concludes my job here at the Unilever Centre, and thus this blog. There are a few more jobs to do, most notably the distribution of cake to fellow UCC people, but for now, I guess this is it.

What’s chloroform?

June 23rd, 2009

My life as a postdoc is almost up, and so this blog will soon be reaching some sort of conclusion. Still, there’s some time to write a post or two that I’ve been meaning to write for a while.

In my field, if you want to make progress, you end up having to look at papers and come up with formalisms for what people are trying to say. This isn’t easy, partly due to language being complicated, partly due to the subject matter being complicated, and partly due to the authors’ mental model of the subject matter being complicated. I think this latter concern is potentially quite interesting. This is probably best illustrated through example.

One of the papers we did was Pyridines, Pyridine and Pyridine Rings: Disambiguating Chemical Named Entities, in which we showed how the word “pyridine” can have at least three different meanings - referring to a whole, precise compound, a class of compounds, and a part of a molecular structure of a compound - and how this was a very general process that was widespread throughout nomenclature. We came up with a formalism for this, a set of annotation guidelines, an inter-annotator agreement study, and a classifier. There were also some things we hinted at in the introduction, but didn’t have the time or effort to do any work with.

One of these is the bulk substance/molecule distinction. Consider chloroform. If I ask someone what chloroform is, chances are they’ll say something like “trichloromethane” or “CHCl3”, and be satisfied. I’d like to propose - somewhat cautiously, I’ll discuss my doubts later - another, somewhat circular definition: “the liquid found in bottles legitimately marked chloroform”. I say “legitimately” because old chloroform bottles have a number of uses beyond that of storing chloroform. Really you ought to black out the name or cover it over entirely, but sometimes a bit of scribbling will do to cancel the implication that the bottle is a bottle of chloroform, and yet leave the markings intact. Or maybe you were in a rush and forgot to mark the bottle as no longer a chloroform bottle; very naughty, but I’m pretty sure it has happened.

Partly this distinction arises because there are various contexts that involve chloroform molecules and not bulk chloroform. For example, a crystal structure could include a chloroform molecule or two per unit cell, or you could have an entirely hypothetical chloroform as part of a simulation or a reaction mechanism you’ve drawn, or chloroform as a solute. In situations like that, the properties of chloroform listed on Wikipedia (colourless liquid, 1.48 g/cm³, m.p. -63.5 °C, b.p. 61.2 °C, refractive index (nD) 1.4459, etc.) don’t apply. They do apply to bulk chloroform.

Let me tell you a story about chloroform. My recollection of some of the details may be wrong, but the essential truth isn’t affected by these. In my previous research group, a student was doing experiments where she would dissolve some materials in chloroform, leave them stirring for a few days, take a sample, and analyse it by mass spectrometry (MS) to see what compounds had been formed in the reaction and in what ratios. These experiments seemed to be repeatable and well behaved: small variations in the initial conditions gave small variations in the results, and other people had done similar experiments and had achieved similar results. It all looked good. She left that bit of research to do something else, and came back a few months later. The results were not reproducible. I can’t remember whether it was simply a case of the product ratios being very different, or entirely new products being produced, or whether the products came out of solution. Whichever way, the nice repeatable experiment was not reproducible several months later.

She checked all sorts of things to see what she was doing differently, and eventually found an answer. It turns out that there are at least two different grades of chloroform in the lab. “Bench chloroform” (I forget its proper designation, probably “technical grade”) is reasonably cheap, reasonably pure, and is commonly used by synthetic organic chemists as a solvent for their reactions, their purification procedures, and for thin layer chromatography - a quick, cheap and cheerful analytical technique. “Analytical chloroform” is purer, more expensive (but on the scales of the experiment, this extra expense was not significant), and is typically used by analytical chemists in techniques that use expensive machines to make accurate measurements. This distinction is common to many solvents. In the case of chloroform, however, the problem is not just one of random accidental impurities where varying levels of care are taken to remove them. In chloroform, one impurity is deliberately added. Quoth Wikipedia:

During prolonged storage hazardous amounts of phosgene can accumulate in the presence of oxygen and ultraviolet light. To prevent accidents, commercial chloroform is stabilized with ethanol or amylene…

This is not quite our experience, but very close. She found that bench chloroform was stabilised with methanol (quite dissimilar to chloroform in properties), and that analytical chloroform was stabilised with a smaller quantity of amylene (rather more chloroform-like and innocuous). Her earlier work had used bench chloroform, her later work had used analytical chloroform. To confirm this, she added a little methanol to analytical chloroform, and used that as her solvent, and the experiments started working again, just as they always had done when using unadulterated bench chloroform. This artificial bench chloroform was always written up as “chloroform + 0.x% methanol” (I forget the value of x), but was quite possibly purer chloroform than the chloroform that gets written up as “chloroform”.

So, these impurities in our substances exist, but are often - but not always - unimportant. What are the linguistics of this? I could take this in one of two ways. Consider a hypothetical paper that might say “Compound 12 was dissolved in chloroform”. This statement is acceptable and would pass peer review, even though the reviewer knows that chloroform is never absolutely pure. However, if the reviewer happened to know that compound 12 had actually been dissolved in a 1:1 mixture of chloroform and methanol, he would be right to reject the paper, and probably to accuse the author of fraud. Nevertheless, usually there’s only a very little methanol involved, so we need some way of being able to say that the statement is true whilst still maintaining a clean conscience.

One approach is what I call the lexical approach: to say that the word “chloroform” itself has a sense that means “bulk chloroform”, and that that sense includes an acceptable level of impurity. The other approach, which I might slightly loosely term the pragmatics approach, is to say that “chloroform” really does mean just the collection of CHCl3 molecules, and that the statement implicitly includes “and some impurities that don’t make a difference”. This can be taken as read because of context: partly the general context of scientific research, and partly the presence of a paragraph earlier in the experimental section that says where you bought your solvents from and what grade they were. (This paragraph is not always present. If it is, it is commonly cut-and-pasted from the last person in the group to do similar work, and may not always be accurate and up to date; even if it is, it may be somewhat vague and give a list of suppliers that doesn’t precisely say which supplier supplied which chemical. Also there are issues with storage times, methods of drying, etc. which may affect the precise composition of the solvents.)

I’m inclined to prefer the lexical view; I find it hard to say, for example, that “ultrapure water” is less pure than plain unmarked “water”. Also, there are specific impurities that emerge from handling and storage; thus there is often a need for “dry chloroform” and “degassed chloroform”, which may yet have tiny amounts of residual water and/or oxygen in them. OTOH, I think you’d say that the impurities are in the chloroform, rather than a part of the chloroform, so maybe my linguistic intuitions can’t make their mind up.

Oddly, if you take the lexical view, then what counts as “chloroform” isn’t determined by a purity threshold; bench chloroform is chloroform, whereas if you take analytical chloroform, carefully distill it to remove the amylene, and then carefully add a tiny amount of an additive, that doesn’t count as “chloroform”, despite being purer than bench chloroform. The end result of this is that the word “chloroform” isn’t just ambiguous (a trivial matter, one that can easily be resolved), but also at least one of the senses is vague. Well, perhaps not very vague, but not 100% free from vagueness either. To a certain extent the vagueness is also reduced by the fact that the manufacturers publish specifications for their various grades of chloroform. I’d have to do some research to find out how consistent they were.

There is also a set of semi-related issues to do with isotopes, but I’ll gloss over them.

The big upshot of all of this is that the world of chemicals is in many ways very close to being a nice well-defined reality, but not quite there completely, and occasionally the inherent vagueness in using words (either in describing your model of reality, or in describing how your well-defined (quasi-)mathematical model relates to reality) leaks through and causes disturbances, but only very rarely.

Now if I were feeling perverse, I would rush out and say “chloroform is socially constructed” and present this as some great discovery. Obviously I don’t want to do this. From where I’m sitting, at first glance it looks like “X is socially constructed” is code for “X is an ill-founded notion of something that doesn’t really exist, we’re better off without the concept of X, the people who talk about X aren’t doing anything related to the truth and the concept of X only exists to oppress people”, and well, in the case of chloroform in particular and science in general, I disagree very strongly. I say “at first glance”, because I’m very aware that there are whole areas of research and study that get systematically misrepresented - for example see Bad Science or Language Log throughout. It may yet be the case that the notion of social construction that I encounter is a strawman and/or a distortion and that there may yet be a community of people who are prepared to talk about social constructs which may nevertheless be useful, (close to) true and generally a good thing. The fact that I haven’t knowingly met them yet is merely evidence, not proof, for the conjecture that they don’t exist. Either way, it’s not really a very satisfactory situation to be in. I’d like to be able to talk - critically at times - about the structure of scientific knowledge (as implicitly revealed by people’s use of language?) without worrying about getting caught up in a stupid game of “Gotcha!”, and it’s not as easy as it should be.

Anyway, it seems to me that the sort of activities that people like me do as domain experts for computational linguistics has consequences beyond that of merely being able to measure and improve our F scores (and that trying to get and improve F scores gives us a useful perspective on these things), and it would be nice to be able to publish somewhere other than within the BioNLP literature, which at times feels like a ghetto within a ghetto. However, my working arrangements are about to change very soon, so this post is more a regret post than anything else. Oh well…

Cambridge/OMII Workshop on OSCAR and OPSIN

June 5th, 2009

As readers of Peter Murray-Rust’s blog will know, there will be an OSCAR/OPSIN workshop in Cambridge, provisionally on the 9th of July.

Can a biologist fix a radio

March 13th, 2008

In case you haven’t already seen it: Can a Biologist Fix a Radio? The paper describes what would happen if a group of biologists were given a broken radio (and a big, big pile of working and broken radios for reference) and left to get on with it with current biology methods. Needless to say the authors don’t think they’ll do very well, and go on to criticise biology for unclear language, a lack of formality, and an unwillingness to use calculus. The paper is well written, entertaining, makes some (as far as a chemist can tell) valid points, and is well worth a read.

Except that I’m not sure I agree with all of the conclusions of the paper. Quoth the paper:

A related argument is that engineering approaches are not applicable to cells because these little wonders are fundamentally different from objects studied by engineers. What is so special about cells is not usually specified, but it is implied that real biologists feel the difference. I consider this argument as a sign of what I call the urea syndrome because of the shock that the scientific community had two hundred years ago after learning that urea can be synthesized by a chemist from inorganic materials. It was assumed that organic chemicals could only be produced by a vital force present in living organisms. Perhaps, when we describe signal transduction pathways properly, we would realize that their similarity to the radio is not superficial. In fact, engineers already see deep similarities between the systems they design and live organisms.

Now I’m as strongly against vitalism as the next man, probably more so even, but I think the author is missing something here. The thing about a radio, or any product of engineering, is that they pretty much come with a promise that they are comprehensible, because the designers have to comprehend them. With cells, there’s no such promise; evolution doesn’t have to understand the things it comes up with, they just have to work. Thus evolution may be at liberty to use components, effects and interactions which engineers are wise to steer clear of.

Also:

First, the radio analogy suggests that an approach that is inefficient in analyzing a simple system is unlikely to be more useful if the system is more complex.

Forgive me for sounding defeatist here, but it depends on how you read the “more useful”. Certainly an approach that gives bad results on a simple system can be expected to give worse results on a more complex system. However it does not follow that an approach that gives good results on a simple system will even work at all on a more complex system, let alone outperform the previously-inefficient approach. In chemistry we know how nuclei and electrons work; the quantum mechanics is not in doubt, and we see no reason to propose any mysterious additional forces. However, the mathematics is intractable if we want exact solutions, even numeric solutions that converge on exactly the right answer. We have approximations, but even those can be difficult, and often give the wrong answer. The last I heard, theoretical chemists were still perplexed by how some of the properties of bulk water were orders of magnitude out from their calculations, for instance. This does not mean that theoretical/computational chemistry (you don’t want to do these calculations by hand) is useless in all cases, just that in many cases a combination of experimentation, intuition, handwaving, diagrams with curly arrows on and a tolerance for repeated failure is an approach that yields actual progress that just can’t be had by clinging too firmly to more formal approaches. Slow, painful, deeply inefficient progress, but progress nevertheless.

Don’t get me wrong, I’m not disparaging any efforts to put some formalism into biology, I’m just a bit put off by the idea that the current biological community is wasting its time and money on the current mainstream approach.

Parsing or execution?

January 28th, 2008

There’s a link I’ve been sent. Buried within it is a set of statements with some interesting implications:

…Adam Kennedy conjectures that Perl is unparseable, and suggests how to prove it. Below I carry out a rigorous version of the proof, which should put the matter beyond doubt… …the term “parse” here is being used in its strict sense to mean static parsing — taking a piece of code and determining its structure without executing it. In that strict sense the Perl program does not parse Perl. The Perl program executes Perl code, but does not determine its structure…

They then go on to prove that parsing Perl is equivalent to solving the Halting Problem, which is well known for not being solvable. The very interesting bit is the last statement: “The Perl program executes Perl code, but does not determine its structure…” In short, the very notion of “parsing Perl” is only ever going to be a model of what’s really going on, and sometimes a faulty model at that.

A while back on a (now effectively dead) project I wrote myself a little Forth-like language that also didn’t really do parsing: it took a line of code and put everything on that line onto a few stacks. (There was a small amount of cleverness with multiple stacks to let me use infix notation, 1 + 1, rather than RPN, 1 1 +.) As such there was never a necessity to make a full representation of what the “structure” of the line of code was. You could try drawing a tree structure if you liked but it didn’t really matter to the interpreter - the interpreter dealt with its input as it came in, and was happily doing computation on one half of a line of code (and forgetting how it had arrived at the results it had arrived at) before it had even really looked at the other half.
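To make the idea concrete, here is a minimal sketch of evaluating an infix line with two stacks and no tree. It is not the language described above (whose details aren't given here); the operator table, the space-separated token format and the function name are all assumptions made for illustration, and parentheses are not handled.

```python
import operator

# Hypothetical operator table: symbol -> (precedence, function).
OPS = {
    "+": (1, operator.add),
    "-": (1, operator.sub),
    "*": (2, operator.mul),
    "/": (2, operator.truediv),
}


def evaluate(line):
    """Evaluate a space-separated infix line using two stacks and no tree."""
    values = []    # operands and intermediate results
    pending = []   # operators still waiting for their right-hand operand

    def reduce_top():
        # Apply the newest pending operator to the two newest values.
        # Only the result is kept; how it was reached is forgotten.
        _, func = OPS[pending.pop()]
        right = values.pop()
        left = values.pop()
        values.append(func(left, right))

    for token in line.split():
        if token in OPS:
            # Finish any earlier operator that binds at least as tightly.
            while pending and OPS[pending[-1]][0] >= OPS[token][0]:
                reduce_top()
            pending.append(token)
        else:
            values.append(float(token))

    while pending:
        reduce_top()
    return values[0]
```

In a line like "10 - 4 - 3", the first subtraction has already been carried out, and its operands discarded, by the time the final 3 is read.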

I now ask myself: how much (if any) of human language processing is like this? Linguists and NLP researchers will happily draw parse trees for entire sentences, and they may be good models[1] for the way language works, but could we do something different, something that takes a greater account of the fact that some words arrive in the brain before (or after) others, and processing of the earlier parts of a sentence can begin before the later parts have been read/heard?

[1] “All models are wrong. Some models are useful.”

Oooops…

January 7th, 2008

A small but important bug seems to have crept into Oscar3 alpha 3. Oh well. You can download the bugfix release, alpha 4, at the usual place.

Oscar3 alpha 3

December 21st, 2007

A new release of Oscar3 has been made: get it from the download page on sourceforge.

Interesting…

December 17th, 2007

Try feeding queries like: “The * salt of *” to Google. Before too long you should get a page saying: “We’re sorry… … but your query looks similar to automated requests from a computer virus or spyware application. To protect our users, we can’t process your request right now. We’ll restore your access as quickly as possible, so try again soon. In the meantime, if you suspect that your computer or network has been infected, you might want to run a virus checker or spyware remover to make sure that your systems are free of viruses and other spurious software. We apologize for the inconvenience, and hope we’ll see you again on Google.” There’s a CAPTCHA, you type in the word, and normal service is resumed.

What Google does

November 19th, 2007

My invite codes for PowerSet came through recently, so I went to try out their systems. At the moment they have some demos based on Wikipedia content: three constrained question-answering demos that let you ask things like “Who did the pope criticize?”, and also a demo that lets you specify a subject, a verb and an object (or one or two of the same) and see what comes out. I had a go, and well, sometimes it worked and sometimes it didn’t. I was left feeling a bit unsatisfied.

Often when I discuss NLP systems and their failings with people, the common way of getting around iffy success rates is to point to search engines like Google and say “how often do you hit ‘I feel lucky’ these days?”. In other words, we already have applications which present some results of varying quality to the users and let them sift through them.

However, something has made me feel a bit different about the situation. Google does two things:

1) It finds documents that match your query
2) It finds documents that meet your information needs

The point about this is that Google and the rest all do (or appear to do) Task 1 really quite well - nice and reliably. It’s easy to understand what’s going on with Task 1, the user can model it quite well. Task 2 is where the unreliability and the need for big results pages creep in. But from a user’s point of view, Task 2 is magic[1] - doing it “properly” would require the computer to be telepathic given the short (often one word) queries that are typically typed in. It’s therefore hard to complain that Google is being “stupid” when it serves up hits that are plainly irrelevant to your information needs - anything better than the results from Task 1 is a bonus.

I wonder if this may be a key to user acceptance of NLP systems: coupling of unreliable “magic” components to add value to nice reliable easy-to-understand string-matching-based systems.
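A rough sketch of that coupling, purely for illustration: a strict word-matching step decides what is returned, and an unreliable scoring function only decides the order. The function names and the `guess_relevance` callback are hypothetical, and this says nothing about how any real search engine works.

```python
def literal_matches(query, documents):
    """Task 1: keep only documents containing every query word."""
    words = query.lower().split()
    return [
        doc for doc in documents
        if all(word in doc.lower().split() for word in words)
    ]


def rank(query, documents, guess_relevance):
    """Task 2: reorder the Task 1 hits with an unreliable scoring function."""
    hits = literal_matches(query, documents)
    # guess_relevance is the "magic" part; it may be wrong, but it can
    # only reorder documents that already matched the query literally.
    return sorted(hits, key=guess_relevance, reverse=True)
```

However badly the scoring function behaves, every result still contains the words the user typed, so the worst case is the plain Task 1 answer in an unhelpful order.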

[1] Arthur C. Clarke’s maxim that sufficiently advanced technology is indistinguishable from magic may be pertinent here.

German Chemoinformatics Conference

November 15th, 2007

I’ve just got back from the 3rd German Conference on Chemoinformatics in Goslar.

The conference started with a Free Software session. The first presentation was of KNIME (the K is silent, as in knight), the Konstanz Information Miner. It’s a data pipelining/mining environment, a bit like Pipeline Pilot. It uses the Eclipse Rich Client platform, and includes wrappers for things like Weka (a machine learning toolkit), R (statistics) and the CDK (dealing with chemistry). It’s an extensible system - they even made a new node for it during the demo. I wonder how easy it is for an inexperienced user to do so - if it is really just an afternoon’s work to write a wrapper, it seems likely that writing wrappers for text mining may be useful. Next up was pgchem::tigress, a module for doing chemical substructure/similarity search within PostgreSQL. Could potentially be useful. Finally, there was RDKit, a hybrid Python/C++ chemical informatics toolkit - the aim is to use Python as the front-end language, with C++ behind the scenes for speed.

This was followed by presentations from commercial vendors. Of note: there’s now a text analytics module for Pipeline Pilot. Also, Tripos have decided to get into KNIME - they’ve integrated lots of their software with it, and quite warmly recommended it during their slot.

On to the presentations. As this was a chemistry conference, and not a computer science conference, these weren’t backed by full papers - some of them were still quite interesting though. There were four subject areas, which I will treat in turn with some selected presentations:

Chemoinformatics and Drug Development:

Jürgen Bajorath (University of Bonn) talked about molecular similarity analysis. His presentation centered around the idea of SAR (Structure-Activity Relationship) landscapes. These were classified using a measure called SARI - the SAR Index, which distinguished essentially continuous SARs (where small changes in structure give small changes in properties) from discontinuous (“clifflike”) SARs (where small changes in structure could have big effects). There was also a subclassification of heterogeneous SARs, which had intermediate SARI values, into “heterogeneous-relaxed” SARs (with a mixture of continuous and discontinuous regions) and “heterogeneous-constrained” (continuous within the boundaries of a structural constraint) - again there were numerical methods for distinguishing these.

Peter Ertl (Novartis) presented work on natural products, which are receiving renewed interest within the Pharma industry. His main work was to generate a natural product-likeness score, based on HOSE codes and Naive Bayes. An interesting detail was the fact that part of the data cleanup he used was a recursive deglycosylation procedure, to turn glycosides into their corresponding aglycones. He commented that glycosylation affects the pharmacokinetics of natural products, but not their activity.

Josef Scheiber (Novartis) addressed the prediction of side effect profiles. The idea was to correlate the side effects of drugs (as described by the MedDRA terminology) with the targets that the drugs acted on - such that if you could predict or measure the activity of a new drug candidate against the various drug targets, you could use those activities to predict the side effect profile.

Computational Materials Science and Nanotechnology: these three presentations (which were a bit outside of my interests, so I won’t go into details) had a strong theme of multi-scale modeling - of producing systems that could connect detailed atomistic simulations with coarser-grained, larger-scaled simulations of bulk properties.

Chemical Information: Stephen Heller (currently consulting for NIST) reviewed the current proliferation of very large chemical databases. A general theme was that a lot of these were really quite disappointing - only a few contained any data or links other than names, structures and other identifiers, and many of them worked by aggregating the contents of other databases. Then there was me. Finally, Roger Sayle (OpenEye) considered the matter of multilingual name-to-structure and structure-to-name. His approach relied on the fact that non-English systematic names all followed the grammar of English systematic names. This means that to do name-to-structure in German, you first analyse your name into its components, translate them into English, and then do English name-to-structure. This turns out to work remarkably well.
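A toy sketch of the first two steps (split into components, translate each one), to show why shared grammar makes this easy. This is not Roger Sayle’s implementation: the six-entry lexicon, the greedy longest-match splitting and the function names are all assumptions for illustration, and the final English name-to-structure step is left out.

```python
# Hypothetical German -> English lexicon of name components (illustrative only).
LEXICON = {
    "di": "di",
    "tri": "tri",
    "chlor": "chloro",
    "brom": "bromo",
    "methan": "methane",
    "ethan": "ethane",
}


def split_components(name):
    """Greedily split a name into known components, longest match first."""
    name = name.lower()
    parts = []
    position = 0
    while position < len(name):
        candidates = [c for c in LEXICON if name.startswith(c, position)]
        if not candidates:
            raise ValueError(f"unknown component at {name[position:]!r}")
        best = max(candidates, key=len)
        parts.append(best)
        position += len(best)
    return parts


def to_english(name):
    """Translate each component and rejoin them in the original order."""
    return "".join(LEXICON[part] for part in split_components(name))
```

Because the components keep their order, `to_english("Trichlormethan")` yields a name that an English name-to-structure tool could then take over. A real system would need a far larger lexicon and something more careful than greedy matching.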

Molecular Modelling: The keynote talk didn’t really fit the session title, but was nevertheless very interesting: Tudor Oprea (University of New Mexico) gave a rather philosophical talk on “Black Swans” - unexpected and unpredictable occurrences that contradict millions of previous observations. His thesis is that pharma is particularly prone to this, because “activity cliffs” exist (as Jürgen Bajorath also addressed): small differences in structure can cause big differences in activity, and in a pharma company’s fortunes. Furthermore, the regulatory process means that the definition of what is and isn’t a drug is a social, and not a scientific, definition. As for the rest of the presentations - well, molecular modelling isn’t really my thing, but Kay Hamacher (Technische Universität Darmstadt) had the very nice idea of creating a pared-to-the-bone derivative of molecular dynamics, treating amino acids within a protein as single entities, connected by springs. The idea is to be able to do high-throughput simulations, producing a middle ground between sequence-based bioinformatics and atomistic MD.
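For a flavour of what “single entities, connected by springs” can mean, here is a generic sketch of a harmonic spring network. It is not necessarily the model presented in the talk: the distance cutoff, the uniform stiffness and the function names are assumptions, and each residue is reduced to a single 3D point.

```python
import math
from itertools import combinations


def build_springs(rest_positions, cutoff):
    """Connect every pair of residues closer than cutoff in the rest structure."""
    springs = []
    for (i, a), (j, b) in combinations(enumerate(rest_positions), 2):
        rest_length = math.dist(a, b)
        if rest_length <= cutoff:
            springs.append((i, j, rest_length))
    return springs


def elastic_energy(positions, springs, stiffness=1.0):
    """Sum the harmonic energy 0.5 * k * (d - d0)**2 over all springs."""
    total = 0.0
    for i, j, rest_length in springs:
        stretch = math.dist(positions[i], positions[j]) - rest_length
        total += 0.5 * stiffness * stretch ** 2
    return total
```

With one point per residue and only pairwise springs, evaluating a conformation is cheap, which is what makes a high-throughput use plausible.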

So that was the conference. In terms of my interest - text - well, it was clear that there were various text-related activities going on here and there, but getting details about them or any sort of coherent picture as to what’s going on is tricky. I am becoming increasingly convinced that what the world needs is for a conference on chemical natural language processing to happen at some point. But is it ever going to happen?