Origins of CML - 0

June 6th, 2009

I shall be writing this blog mainly in the first person but you should realise that CML is the joint product of Henry Rzepa and myself over many years. Simply, CML would not have happened without Henry. Perhaps in 2009 I am the more active contributor but it’s a joint creation. So, if ever I write things that appear to be just due to me, please mentally replace this by PMRz, which is what I call our symbiote.

What are the origins of CML? I think I go back to ca 1980 when I was writing code to extend Sam Motherwell’s great FORTRAN toolkit for the Cambridge database: BIBSER (bibliographic search), CONNSER, the first and greatest chemical substructure algorithm, and GEOM78, a geometry calculation tool. In 1976-1978 I used to visit Cambridge (from Stirling) and work with Sam on extracting structures from the database and analysing them. There was a rough division of labour and ideas. I came with a number of ideas and Sam would modify CONNSER and GEOM to support these, literally within a day or so. Recall (if you were alive then) that this involved overlays, blank COMMON, EQUIVALENCE and other mindbending ways of dumping the core.

I took the problems back to Stirling and integrated Sam’s output with SPSS (a well known stats package from the social sciences). Not surprisingly this can of spaghetti started to get out of control. I did great amounts of analysis on the floor of our living room with an acoustic modem (http://en.wikipedia.org/wiki/Acoustic_coupler) where the handset was plugged into rubber cups. It used to run at 110 baud (110 bits, yes, bits per second). Since a character took 11 bits (to check and repair loss) we got 10 characters a second.

The sums were originally done at Cambridge (on Phoenix) but I ported the software to UMRCC (the regional computing centre at Manchester) on a CDC 7600. The results were printed out on folding line printer paper on a boustrophedonic teletype (ASR33?). I would then extract data by hand and enter them into stats programs, but gradually moved to doing the stats remotely. Remote graphics was always difficult: we could get printer plots from Aberdeen but it took a week. So I generally evolved ASCII plots.

The point of this is that during these sessions I had a lot of time to think about how to do it better. It was obvious the software had to be modular and I gradually got to thinking about modular data. Remember we used FORTRAN IV which did little to encourage modularity.

In 1981/2 I spent a sabbatical with Jenny Glusker in Philadelphia and there developed a VAX version of the software. I extended this to plot aggregations of data in 2 and 3 dimensions. Again the idea of modular components was clear. I returned to start up molecular modelling / computer graphics in Glaxo and found myself working with a completely different set of files: ChemX, Mol, etc. I couldn’t use these with my analysis code. It seemed completely wasteful not to have a common format, so I started an activity within the Molecular Graphics Society to systematize file types.

In effect this was an attempt to build a chemical ontology. The word hadn’t been used then within science and it wouldn’t have helped if it had. I didn’t get much take up and there was active resistance from some software companies who regarded their formats as a commercial weapon.

However the crystallographers had a much more unified view of the world. I continue to congratulate the Int. Union of Crystallography for its efforts in this area. In the mid-1980s there was an active group led by David Brown to create a self-defining format for crystallography. It was, I think, called CSF (Crystallographic Structure File) but I’m open to correction. It was essentially tagged data, with a controlled vocabulary applied to scalars and arrays.

This then developed into the CIF, which is now the standard method of exchanging crystallographic information. This was based on data supported by data dictionaries which themselves were constrained to a dictionary definition specification. I started to use the CIF approach to model my scientific world. This was long before XML but it was essentially isomorphic to XML.

During this period I had gradually advanced my language skills from FORTRAN to BBC-BASIC and C. (Part of this was through teaching the MSc in Birkbeck.) So when C++ came along (1991?) I translated my approach to C++ and started to develop a toolkit/library. That’s probably effectively when CML as a data modelling approach started.

I started with the most obvious components: geometry and numbers. These are still an integral part of CML (the Euclid library). This was then extended to molecules, atoms and bonds and by ca 1993 I had a set of objects. But I needed a way to display and manipulate them.

At that stage I met Henry Rzepa - I’m not sure how. Henry visited me at Greenford and we found we had a common interest in the Internet and its power for disseminating chemistry. It must have been about the time of Mosaic (1993). Anyway Henry and I went to the first WWW meeting; he ran a session on chemistry and I one on biology. We had an early version of RasMol which ran on UNIX and Henry had prepared a demo. We had it running the day before on a CERN machine but when we came back the next day someone had wiped the shared libraries to save space. We got the thing running again 5 mins after Henry’s talk started.

The theme of WWW1 was, of course, the use of HTML (and HTTP) to create distributed information. Because it was in CERN and all HTTP sites at that stage were academic the emphasis was all on science. How could you carry maths in HTML? And so how could you do the same for chemistry? We didn’t know how.

But Henry went to WWW2 later that year and came back with the idea that the future of the web would be SGML. And that is when I started to cast my objects into SGML. So probably late 1994 is when Chemistry and Markup Languages came together.

Later posts will take it from there.

Test of CML in blog post

May 27th, 2009

This is a test of CML. Here I am simply writing some CML into a post and checking that it formats correctly. However, if you are interested, it’s an output from a Gaussian calculation.

<cml id="C2H5" xmlns="http://www.xml-cml.org/schema"   
  xmlns:gauss="http://wwmm.ch.cam.ac.uk/dict/gauss" 
  xmlns:xsd="http://www.w3.org/2001/XMLSchema"
  xmlns:cml="http://www.xml-cml.org/schema"
  xmlns:units="http://www.xml-cml.org/units/units">
    <molecule formalCharge="0" spinMultiplicity="2" title="#NewComb uB971/6-311+G(d,p)">
      <atomArray>
        <atom id="a1" elementType="C" x3="0.9166762042" y3="0.0730268241" z3="4.9909560473"/>
        <atom id="a2" elementType="H" x3="1.8271445526" y3="-0.2241902314" z3="5.5413327124"/>
        <atom id="a3" elementType="H" x3="1.0043564838" y3="1.1498285377" z3="4.8066561748"/>
        <atom id="a4" elementType="H" x3="0.9550579512" y3="-0.4459962779" z3="4.0264362923"/>
        <atom id="a5" elementType="C" x3="-0.321223388" y3="-0.2601308563" z3="5.7505945921"/>
        <atom id="a6" elementType="H" x3="-0.7206191756" y3="0.4228986482" z3="6.4931268102"/>
        <atom id="a7" elementType="H" x3="-0.7720826283" y3="-1.2444366444" z3="5.678007371"/>
      </atomArray>
      <bondArray>
        <bond atomRefs2="a1 a2" id="a1_a2" order="1"/>
        <bond atomRefs2="a1 a3" id="a1_a3" order="1"/>
        <bond atomRefs2="a1 a4" id="a1_a4" order="1"/>
        <bond atomRefs2="a1 a5" id="a1_a5" order="1"/>
        <bond atomRefs2="a5 a6" id="a5_a6" order="1"/>
        <bond atomRefs2="a5 a7" id="a5_a7" order="1"/>
      </bondArray>
      <formula formalCharge="0" concise="C 2 H 5" dictRef="cml:calculatedFormula"/>
      <formula inline="C2H5(2)" convention="gauss:archive"/>
      <parameterList>
        <parameter dictRef="gauss:method">
          <scalar dataType="xsd:string">ub971</scalar>
        </parameter>
        <parameter dictRef="gauss:basis">
          <scalar dataType="xsd:string">6-311+g(d,p)</scalar>
        </parameter>
        <parameter dictRef="gauss:ginc">
          <scalar dataType="xsd:string">BOTTICELLI</scalar>
        </parameter>
        <parameter dictRef="gauss:calctype">
          <scalar dataType="xsd:string">FOpt</scalar>
        </parameter>
        <parameter dictRef="gauss:method">
          <scalar dataType="xsd:string">UB971</scalar>
        </parameter>
        <parameter dictRef="gauss:basis">
          <scalar dataType="xsd:string">6-311+G(d,p)</scalar>
        </parameter>
        <parameter dictRef="gauss:user">
          <scalar dataType="xsd:string">SS663</scalar>
        </parameter>
        <parameter dictRef="gauss:date">
          <scalar dataType="xsd:date">30-Mar-2009</scalar>
        </parameter>
        <parameter dictRef="gauss:zero">
          <scalar dataType="xsd:integer">0</scalar>
        </parameter>
        <parameter dictRef="gauss:keyword">
          <scalar dataType="xsd:string">NewEstmFC</scalar>
        </parameter>
      </parameterList>
      <propertyList>
        <property dictRef="gauss:versionvalue">
          <scalar dataType="xsd:string">AM64L-G03RevE.01</scalar>
        </property>
        <property dictRef="gauss:hfvalue">
          <scalar dataType="xsd:double">-79.1483095</scalar>
        </property>
        <property dictRef="gauss:s2value">
          <scalar dataType="xsd:double">0.754159</scalar>
        </property>
        <property dictRef="gauss:s2-1value">
          <scalar dataType="xsd:double">0.0</scalar>
        </property>
        <property dictRef="gauss:s2avalue">
          <scalar dataType="xsd:double">0.750012</scalar>
        </property>
        <property dictRef="gauss:rmsdvalue">
          <scalar dataType="xsd:double">4.717E-9</scalar>
        </property>
        <property dictRef="gauss:rmsfvalue">
          <scalar dataType="xsd:double">2.774E-6</scalar>
        </property>
        <property dictRef="gauss:thermalvalue">
          <scalar dataType="xsd:double">0.0</scalar>
        </property>
        <property dictRef="gauss:dipolevalue">
          <vector3>0.1354115 0.0062667 -0.0213725</vector3>
        </property>
        <property dictRef="gauss:pgvalue">
          <scalar dictRef="gauss:pgframevalue" dataType="xsd:string">X(C2H5)</scalar>
          <symmetry dictRef="gauss:pointGroup" pointGroup="C01"/>
        </property>
      </propertyList>
    </molecule>
  </cml>
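For readers who want to do something with a file like this, here is a minimal sketch in Python, using only the standard library. It is an illustration rather than part of JUMBO or any other CML toolkit: the function names `read_atoms` and `bond_lengths` are made up for this example, and the embedded CML is an abbreviated copy of the molecule above (the two carbon atoms and the bond between them).

```python
import math
import xml.etree.ElementTree as ET

# The single CML namespace, bound to a prefix for use in path expressions.
NS = {"cml": "http://www.xml-cml.org/schema"}

# Abbreviated copy of the C2H5 example: the two carbons and their bond.
CML_TEXT = """<cml xmlns="http://www.xml-cml.org/schema">
  <molecule formalCharge="0" spinMultiplicity="2">
    <atomArray>
      <atom id="a1" elementType="C" x3="0.9166762042" y3="0.0730268241" z3="4.9909560473"/>
      <atom id="a5" elementType="C" x3="-0.321223388" y3="-0.2601308563" z3="5.7505945921"/>
    </atomArray>
    <bondArray>
      <bond atomRefs2="a1 a5" id="a1_a5" order="1"/>
    </bondArray>
  </molecule>
</cml>"""


def read_atoms(cml_text):
    """Map each atom id to (elementType, (x3, y3, z3))."""
    root = ET.fromstring(cml_text)
    atoms = {}
    for atom in root.iterfind(".//cml:atom", NS):
        xyz = tuple(float(atom.get(key)) for key in ("x3", "y3", "z3"))
        atoms[atom.get("id")] = (atom.get("elementType"), xyz)
    return atoms


def bond_lengths(cml_text):
    """Map each bond id to the distance between the two atoms it references."""
    atoms = read_atoms(cml_text)
    root = ET.fromstring(cml_text)
    lengths = {}
    for bond in root.iterfind(".//cml:bond", NS):
        first, second = bond.get("atomRefs2").split()
        lengths[bond.get("id")] = math.dist(atoms[first][1], atoms[second][1])
    return lengths


if __name__ == "__main__":
    for bond_id, length in bond_lengths(CML_TEXT).items():
        print(bond_id, round(length, 3))
```

The distances come out in whatever units the coordinates use. The point is only that a generic XML parser plus the namespace string is enough to get at the atoms and bonds.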

Harvard and OA

February 16th, 2008

For anyone who hasn’t read the Open Access blogs in the last few days, Harvard has announced a major policy urging/requiring faculty to make all their output Openly accessible. Dorothea puts it bluntly and this is a good place to start:

From Dorothea Salo at Caveat Lector:

…I would be afraid, very afraid, right now if I were a journal publisher who believed my profits depended on preventing widespread self-archiving or playing dog-in-the-manger with copyright. The Harvard policy puts publishers in an extraordinarily weak position. They can’t denounce it; that’s tantamount to denouncing faculty, which would be utterly suicidal. (Publishers can and do slag librarians. They can and do slag government. They can’t slag faculty, and they know it.) I don’t think they can sue….They can’t prevent eager librarians at Harvard from setting up and filling a repository. Even their standard lines of FUD won’t work —they can’t seriously spin this as “a vote against peer review,” because really, is Harvard going to do anything that damages peer review? Of course not! All the publishers can realistically do is plead poverty, and a look at their lobbying budgets and profit margins scotches that argument….

Another viciously clever move was the per-article, in-writing petition requirement for opting out. Suddenly a publisher who wants its articles out of the Harvard IR has to contact each and every Harvard faculty member who publishes with them, for every single article published. Can you imagine the backlash from faculty? …

Stopping other institutions from following in Harvard’s footsteps is a completely different game from stopping legislation in Washington….[T]he opprobrium the AAP faced over PRISM would be a wet firecracker by comparison. Whereas Washington is a single big fat noisy target, faculty governance bodies are legion, and they tend to do their work quietly and in private….

I have a feeling the deafening silence coming from publishers right now is deliberate. Their only realistic hope is that the Harvard policy sinks like a stone in a vast sea of institutional indifference, and the best way for them to create that outcome is to keep their mouths shut so that the initial flurry of coverage and interest fades quicker.

The ball is in our court now, we open-access advocates. We can’t let Harvard’s fusillade go quiet. Come on, Cornell. Come on, California. Come on, MIT and Yale and (dare I say it?) Wisconsin. Let’s do this thing.

PMR: This could snowball very easily. The moral force is overwhelming and there is no case against the moral case. Academics do not publish to create profits for publishers. A publisher who provides a service - for a reasonable fee - will be valued. A publisher who resells academically created content increasingly will not.

So the message is clear. We should write to our own institutions and ask what they are doing in this area. I’m abroad and don’t quite know when I can get my act together. I may even have to find out how my University is run.

Tables in CML - and general strategy

February 11th, 2008

I get personal email about CML which I regard as private but there are often general points which are most usefully answered on the blog as they can then attract further discussion. So a correspondent writes:

The Nov-2006 blog entry The CML blog - Nov 2006
states that the preferred approach to supporting TABLES is via support for the following entities

  • table
  • tableHeader
  • tableHeaderCell
  • tableRowList
  • tableRow
  • tableCell

Hmmm…so a question ==> :-\
Should I regard these 6 entities as being fully-supported CML elements, on a par with <atomArray> or <zMatrix>

PMR: The correspondent is using CML to implement a system and adds:

I’m finding strong reasons for making the native database structures generic enough to support CML, but I’m backing away from attempting to give “first class” support for EVERY CML element. For example, AtomArray is an example of a CML element which is most natural to store in a database as [collection of singleton attributes ] + [array of atoms]

PMR: Thanks very much for these questions. There are some general and then some specific points.

  • Once something has been introduced to CML it has to stay because people may be using it - and they don’t have to register their interest.
  • Everything in CML has to be supportable by code, and by default that is JUMBO. So all elements in CML have JUMBO code (I believe in “rough consensus and running code” and this is the running code bit). That doesn’t mean you have to use JUMBO - simply that an element “works” in some respects. Moreover JUMBO has over 2000 unit tests and these confirm the implementation and also give an indication of what the default behaviour is. JUMBO therefore acts as the reference semantics where these are not clear from the spec. I’m hoping that this blog will allow us to go through the spec (not in alphabetical order) and enhance the communal understanding and agreement.
  • Chemistry is part of physical science. We had anticipated that over the last 10 years many groups would have developed generic approaches to physical science (of which tables are a part). Very few have, so that CML carries the extra burden of representing common non-chemical objects. Henry and I published this as a separate language (STMML. A Markup Language for Scientific, Technical and Medical Publishing) but it hasn’t caught on. Something needs to fill this gap - numeric quantities, matrices, units, etc. - and at present it seems to be CML.
  • CML only has scope as a communication language. How you implement your internal data structures is up to you. An atomArray can easily be held as a relational table as long as the structure is roughly rectangular, but if different atoms have different children this becomes harder. So atomArray is well supported in FoX and presumably uses good old FORTRAN arrays internally. So long as conformant CML goes in and out then it’s not our concern and shouldn’t be.
  • Having said that, CML turns out to be an excellent data representation language for some problems. We couldn’t have built our polymer builder without CML, because we needed to invent symbolic chemical computing, inter alia.
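The point about internal data structures can be sketched as follows. This is an illustration in Python with an in-memory SQLite database, not a description of how FoX, JUMBO or the correspondent's system work. The table layout and the function names `store_atoms` and `load_atoms` are assumptions made for the example, and it only covers the "roughly rectangular" case where every atom carries the same attributes and has no children.

```python
import sqlite3
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"
NS = {"cml": CML_NS}

# A small, made-up atomArray in which every atom has the same attributes.
ATOM_ARRAY = """<atomArray xmlns="http://www.xml-cml.org/schema">
  <atom id="a1" elementType="C" x3="0.0" y3="0.0" z3="0.0"/>
  <atom id="a2" elementType="H" x3="1.09" y3="0.0" z3="0.0"/>
</atomArray>"""


def store_atoms(conn, xml_text):
    """Read CML in: one relational row per <atom>."""
    conn.execute(
        "CREATE TABLE atom (id TEXT PRIMARY KEY, elementType TEXT, "
        "x3 REAL, y3 REAL, z3 REAL)"
    )
    root = ET.fromstring(xml_text)
    rows = [
        (a.get("id"), a.get("elementType"),
         float(a.get("x3")), float(a.get("y3")), float(a.get("z3")))
        for a in root.findall("cml:atom", NS)
    ]
    conn.executemany("INSERT INTO atom VALUES (?, ?, ?, ?, ?)", rows)


def load_atoms(conn):
    """Write CML out: rebuild an <atomArray> element from the rows."""
    array = ET.Element("{%s}atomArray" % CML_NS)
    query = "SELECT id, elementType, x3, y3, z3 FROM atom ORDER BY rowid"
    for atom_id, element, x3, y3, z3 in conn.execute(query):
        ET.SubElement(array, "{%s}atom" % CML_NS, {
            "id": atom_id, "elementType": element,
            "x3": repr(x3), "y3": repr(y3), "z3": repr(z3),
        })
    return array


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    store_atoms(conn, ATOM_ARRAY)
    print(ET.tostring(load_atoms(conn), encoding="unicode"))
```

As soon as some atoms have children that others lack, a single flat table like this stops being enough, which is the difficulty mentioned above.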

Now the specifics. Physical science needs tables. HTML looks attractive but HTML tables are row-based and not necessarily rectangular. We needed all of the following:

  • rectangular row-based tables with potentially complex objects
  • rectangular column-based tables with potentially complex objects
  • and at the insistence of the FORTRAN community a structure to hold large tables in ASCII with minimal markup (i.e. not requiring each cell to be marked). This is as compact as you can reasonably get while preserving “obvious” implicit semantics and without using binary (for which there is no simple solution - see Toby’s latest posting on his blog). BTW if you are computationally inclined Toby discusses a wide range of important issues

So if you have a rectangular table you could use HTML with CML for the cell contents. You’d need to have both namespaces but that’s not difficult now. And there are advantages - you get some of the rendering for free.

But for the others we needed to define the semantics  and they seem to work out well - it’s possible to roundtrip between different representations without semantic loss.

So the only constraint is that the serialized XML markup should be consistent with the CML schema. How you process it is up to you.

And no, no-one other than me has to create a system which honours all CML schema components. FoX is a good example - they have about 10 elements out of the 100:  cml, module, molecule, atomArray, atom (but not bond), property, parameter, scalar, array, matrix. They can do everything they want with these and we are increasingly calling them microformats.

I expect this will be an important and continuing theme on this blog.

Natural language and CML concepts

February 10th, 2008

In an earlier post ( Natural language in chemistry ) I posted a typical account of a synthetic chemical procedure and measurements made on pure substances. Almost all the primary concepts can be captured in CML and while the running text cannot (yet) be deeply parsed we can create a document where the main concepts are identified and marked. Here is an indication of those aspects which CML can represent.

  •  Foo, Bar. Major concepts in CML, usually mapped directly onto XMLElements (to be discussed later).
  • Sectioning. Sections are critical in adding semantics and pragmatics to the understanding of the chemistry.
  • Unnecessary or unrepresentable concepts. In many cases these phrases add little information and could be omitted from the semantics
  • Qualifiers. Many of the main concepts have qualifiers or data associated with them. Attributes or simple content (ASCII) are used to represent these.

Experimental

General

Melting points were recorded using a Kőfler hot stage apparatus and are uncorrected. IR spectra were recorded on a Perkin-Elmer Model 983G instrument coupled to a Perkin-Elmer 3700 Data Station as KBr disks, or films (liquids). Proton nuclear magnetic resonance (NMR) spectra were recorded at 300MHz and 500MHz using Bruker DPX 300 and DRX500 NMR spectrometers respectively. [...snip...]

(S)-5-(1-Hydroxy-1-methyethyl)-2-pyrrolidin-2-one (1).

A solution of methylmagnesium iodide (130 mL, 2M) in diethyl ether was added via canula to a mechanically stirred, ice cooled, solution of (S)-pyroglutamic acid ethyl ester (10.02 g, 63.9 mmol) in THF (150 mL) and the resulting mixture was vigourously stirred for 6 hr. [... snip ...] titled product as a white oily solid (8.83 g, 96%). For characterisation purposes a small sample was recrystallised from ethyl acetate/ether mixture to give the titled compound as a white amorphous solid, m.p. 62-64.5°C. For preparative purposes the crude material was used without purification in the next stage. = +14.9° (c = 0.105, CHCl3); (Found: C, 58.4; H, 9.2; N, 9.9 %. C7H13NO2 requires: C, 58.7; H, 9.1; N, 9.8 %); m/z (%) 144(MH+, 1.4), 128(9), 85(100), 84(73); νmax(KBr) 3403, 3333, 1684, 1379 cm‑1; δH (300 MHz) 1.137 and 1.219 (2 x 3H, 2 s), 1.80-2.00 and 2.00-2.19 (2 x 1H, 2m), 2.30-2.45 (2H, m), 3.56 (1H, dd, J = 8.0, 5.9), 4.18 (1H, br), 7.73 (1H, br); δC (75 MHz) 21.83, 23.15, 26.36, 30.48, 63.62, 71.59, 179.51.

Later posts will show what the XML looks like and how to select the CMLElements required.

FoX Golem and Dalton

February 9th, 2008

As part of the COST project (see COST D37 Meeting in Rome) we are collaborating on interoperability of QM codes. A major mechanism is the short-term exchange of scientists (I forget the acronym) and on Wed/Thursday Kurt Mikkelsen from Copenhagen visited us in Cambridge. Unfortunately I was already away but Toby White and Andrew Walkingshaw took charge. The aim was to explore what would be required to XMLise the Dalton code of which Kurt is a developer.

Dalton is a large Fortran program and so the natural approach is to use Toby’s FoX program to add CML output (we generally start with output as input requires complex logic for many codes). From Toby:

FoX is in use by several leading scientific simulation codes:

In principle all output statements in the program can be converted, but in practice we normally start with the ones of most interest to users. It’s common to write the CML/XML to a separate stream (channel) and this is selected at the start of the program (More details in a later post). You convert as many output (WRITE or PRINT) statements as you want (or have patience for). At this stage the main things that you select are a unique ID (often the FORTRAN variable) and an optional title (for humans to understand). If the quantity is numeric you will have to state the units - we are strict about this!
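To make the idea concrete, here is a rough sketch of what replacing one output statement might amount to. It is written in Python purely for illustration; it is not FoX and does not imitate the FoX API. The helper `make_property`, the `myprog:` dictionary prefix and the `units:hartree` unit name are all hypothetical. What it does show is the rule stated above: an identifier, an optional title, and mandatory units for numeric quantities.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"


def make_property(dict_ref, value, units=None, title=None):
    """Build a CML <property> holding a single <scalar>.

    dict_ref plays the role of the unique ID (e.g. derived from the
    program variable); title is optional and for humans only.
    Numeric values without units are rejected.
    """
    numeric = isinstance(value, (int, float)) and not isinstance(value, bool)
    if numeric and units is None:
        raise ValueError("numeric quantity %r needs units" % dict_ref)

    prop = ET.Element("{%s}property" % CML_NS, {"dictRef": dict_ref})
    if title is not None:
        prop.set("title", title)

    scalar = ET.SubElement(prop, "{%s}scalar" % CML_NS)
    if numeric:
        data_type = "xsd:double" if isinstance(value, float) else "xsd:integer"
        scalar.set("dataType", data_type)
        scalar.set("units", units)
    else:
        scalar.set("dataType", "xsd:string")
    scalar.text = str(value)
    return prop


if __name__ == "__main__":
    # Hypothetical identifier and unit name, chosen only for the example.
    energy = make_property("myprog:etotal", -79.1483095,
                           units="units:hartree", title="total energy")
    print(ET.tostring(energy, encoding="unicode"))
```

In a real code the equivalent call would sit next to (or replace) the original WRITE statement and write to the separate XML stream.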

Then you can run the program with a number of different input examples. This is where Andrew’s Golem comes in. It can act in several ways. The first is to tell you what you have got in these outputs. Yes, you wrote the output statements, but you don’t know exactly when and how often they might be called. Golem analyses all that. Then it suggests what a CML dictionary based on these concepts might look like.
Toby, Andrew and Kurt managed all of this within a day or two. That’s a great credit to all of them, but it also means that CMLising a large program is not terrifying. We haven’t finished, because all large programs (and Dalton is 10 years old) have a lot of output statements and a lot of concepts.

Just to give you an idea of the power, here’s part of a typical entry:


/cml:cml/cml:module[@dictRef="dalton:WAVFUN"]/cml:property[@dictRef="dalton:EMCSCF"]

That says that the concept EMCSCF in the Dalton dictionary always occurs as a CML property which is a child of the CML module described by the WAVFUN concept in the Dalton dictionary. That’s all automatic. And very impressive.
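A path like this can be evaluated with any XPath-aware tool. The sketch below uses Python's standard `xml.etree.ElementTree`, which supports a subset of XPath. Two assumptions should be flagged: the document fragment is invented for the example (the energy value is a placeholder, not real Dalton output), and because ElementTree does not accept absolute paths on an element, the leading `/cml:cml` step is replaced by starting from the root element.

```python
import xml.etree.ElementTree as ET

NS = {"cml": "http://www.xml-cml.org/schema"}

# Invented fragment with the shape the path describes; the value is a placeholder.
DOC = """<cml xmlns="http://www.xml-cml.org/schema">
  <module dictRef="dalton:WAVFUN">
    <property dictRef="dalton:EMCSCF">
      <scalar dataType="xsd:double">-1.0</scalar>
    </property>
  </module>
  <module dictRef="dalton:OTHER">
    <property dictRef="dalton:EMCSCF">
      <scalar dataType="xsd:double">-2.0</scalar>
    </property>
  </module>
</cml>"""

# Relative form of the path shown above, starting at the <cml> root element.
PATH = ("./cml:module[@dictRef='dalton:WAVFUN']"
        "/cml:property[@dictRef='dalton:EMCSCF']")


def find_emcscf(xml_text):
    """Return the scalar text of every property matched by PATH."""
    root = ET.fromstring(xml_text)
    values = []
    for prop in root.findall(PATH, NS):
        scalar = prop.find("cml:scalar", NS)
        values.append(scalar.text)
    return values


if __name__ == "__main__":
    print(find_emcscf(DOC))
```

Only the property under the WAVFUN module is selected; the same concept under a different module is ignored, which is exactly the context information the path encodes.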

In short this couldn’t have been done without CML and without Toby and Andrew.

Thanks

============UPDATE===========

Toby has already blogged this at FoX and Dalton. Excerpt:

So the obvious first step was to equip Dalton with CML output. And I’m happy to announce that in a total of under 7 hours, we:

  • ported Dalton to a new platform, since the most convenient compiler to hand was g95 on Mac OS X.
  • updated its whole configure/build/job-submission process to allow for compiling with FoX, and dealing with an additional XML output file
  • went through Dalton, and added CML output such that the most important quantities (molecular structure and related quantities, basis set metadata, and calculated polarizabilities up to the 3rd harmonic) are all now output to CML.

Kurt went home with a version of the Dalton source code ready to be used for his database calculations immediately (although we compiled on g95, the result is portable to any existing Dalton platform). There will be more to be done before it makes it to a public release of Dalton, of course (which only happens every three or four years anyway) - but we can welcome Dalton to the CML stable now.

The CML Namespace

February 9th, 2008

In this blog I shall cover many aspects of CML. Where possible the order and selection has some pedagogic logic but I will also answer questions from correspondents and I have just got one on namespaces. There has been confusion as CML evolved and this is a clarification and an introduction.

Suppose I write the valid CML:

<list>... </list>

and someone else has independently created a different language which uses the same element name, there will be confusion. How can we tell the difference? The answer is to associate a unique namespace with each and to use prefixes to distinguish them:

<cml:list xmlns:cml="http://www.xml-cml.org/schema">... </cml:list>

<foo:list xmlns:foo="http://www.foo.org/bar">... </foo:list>

Here:

http://www.xml-cml.org/schema
is the CML namespace. It can be associated with elements or attributes through a prefix (in this case “cml”, though it can be any string consistent with the XML name rules: letters, digits, dot, hyphen and underscore, not starting with a digit, dot or hyphen). In the case of XML the namespace is purely a string - it is NOT an address even though it looks like one. It is similar in that respect to a DOI or ISBN.
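A small sketch may help to show that it is the namespace string, not the prefix, that identifies an element. It uses Python's standard `xml.etree.ElementTree`; the wrapper element `root`, the prefix `c` and the function name `count_cml_lists` are made up for the example.

```python
import xml.etree.ElementTree as ET

CML_NS = "http://www.xml-cml.org/schema"

# Three <list> elements: two in the CML namespace (under different
# prefixes) and one in the unrelated "foo" namespace from the text above.
DOC = """<root>
  <cml:list xmlns:cml="http://www.xml-cml.org/schema"/>
  <foo:list xmlns:foo="http://www.foo.org/bar"/>
  <c:list xmlns:c="http://www.xml-cml.org/schema"/>
</root>"""


def count_cml_lists(xml_text):
    """Count child elements that are 'list' in the CML namespace."""
    root = ET.fromstring(xml_text)
    # ElementTree names elements as {namespace}localname; the prefix is gone.
    return len(root.findall("{%s}list" % CML_NS))


if __name__ == "__main__":
    for child in ET.fromstring(DOC):
        print(child.tag)
    print(count_cml_lists(DOC))
```

The parser never tries to fetch the namespace string as a web address. It is compared character by character, which is why a namespace that differs by even a trailing "/core" is a different namespace.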

When we started CML we used a namespace

http://www.xml-cml.org/schema/core
on the basis that as we extended CML to reactions, etc we would have more namespaces to support them. And that as we created new versions the namespaces would reflect that. There would be lots of namespaces in CML representing different components and different versions.

It was unworkable. So about 2-3 years ago we reverted to a single namespace for all CML:

http://www.xml-cml.org/schema

I have promised that it will never change regardless of the version of CML. Versions and components require a different strategy (but we have other approaches for that). All CML-conformant software (JUMBO, XSLT, FoX, Blue Obelisk, etc.) should use the string above. If you are using something different you should change it as soon as it makes business sense as your CML will not be interoperable with the mainstream.

Note that you will come across other namespace-strings in CML - some with the same domain name - but they are not the CML Namespace and not related to it. RDF requires that every component other than literals carry a namespace URI and some of these look similar to the CML namespace. But they are used for different purposes and we’ll cover them later.

So please ask questions about CML as we go along.

Natural language in chemistry

February 5th, 2008

This year I intend to put effort into formalising CML by writing full-text discourse, giving examples, and clarifying specs and practice. This is a large task and even if I did one element or one attribute per day (a reasonable target) it will take a year. Maybe that will happen. In any case at the end I will have something like a “book”. If so it will be a Web2.0 book as it will be informed by your comments.

Why should I use CML and not Molfiles, asks Rich Apodaca.

The first point to make is that CML is not a molecular markup language (we nearly called it that at the start) but is a language for chemistry as a whole. That’s ambitious - almost to the extent of hubris - but I still believe it’s feasible. CML doesn’t have to do everything because there are other components we can reuse - such as MathML, ThermoML, CIF and AniML.  But ultimately CML supports the natural ways that chemists and machines communicate chemistry:

  • human to human (e.g. journal articles and theses)
  • human to machine (please run this job or store this data)
  • machine to human (here is some information I found or calculated)
  • machine to machine (here is data I have calculated for you to use in the next step)

We are therefore informed by how humans write chemistry and how machines write chemistry and take these as our corpus of exemplars. Here is a typical snippet from Beilstein Journal of Organic Chemistry (I have chosen that because it is Open Access so I don’t have to ask permissions):

http://bjoc.beilstein-journals.org/content/pdf/1860-5397-4-6.pdf

Experimental

General

Melting points were recorded using a Kőfler hot stage apparatus and are uncorrected. IR spectra were recorded on a Perkin-Elmer Model 983G instrument coupled to a Perkin-Elmer 3700 Data Station as KBr disks, or films (liquids). Proton nuclear magnetic resonance (NMR) spectra were recorded at 300MHz and 500MHz using Bruker DPX 300 and DRX500 NMR spectrometers respectively. [...snip...]

 

(S)-5-(1-Hydroxy-1-methyethyl)-2-pyrrolidin-2-one (1).

A solution of methylmagnesium iodide (130 mL, 2M) in diethyl ether was added via canula to a mechanically stirred, ice cooled, solution of (S)-pyroglutamic acid ethyl ester (10.02 g, 63.9 mmol) in THF (150 mL) and the resulting mixture was vigourously stirred for 6 hr. Saturated aqueous ammonium chloride solution (100 mL) was added with ice cooling and the mixture was stirred until all the solids dissolved. The organic layer was removed and discarded and the aqueous layer was placed in a continuous extractor and extracted with dichloromethane over 4 days. Each day the extraction solvent was replaced. The combined extracts were dried over sodium sulfate and concentrated to give the titled product as a white oily solid (8.83 g, 96%). For characterisation purposes a small sample was recrystallised from ethyl acetate/ether mixture to give the titled compound as a white amorphous solid, m.p. 62-64.5°C. For preparative purposes the crude material was used without purification in the next stage. = +14.9° (c = 0.105, CHCl3); (Found: C, 58.4; H, 9.2; N, 9.9 %. C7H13NO2 requires: C, 58.7; H, 9.1; N, 9.8 %); m/z (%) 144(MH+, 1.4), 128(9), 85(100), 84(73); νmax(KBr) 3403, 3333, 1684, 1379 cm‑1; δH (300 MHz) 1.137 and 1.219 (2 x 3H, 2 s), 1.80-2.00 and 2.00-2.19 (2 x 1H, 2m), 2.30-2.45 (2H, m), 3.56 (1H, dd, J = 8.0, 5.9), 4.18 (1H, br), 7.73 (1H, br); δC (75 MHz) 21.83, 23.15, 26.36, 30.48, 63.62, 71.59, 179.51.

We set out to make sure we can capture as much of this as possible. It includes text and data relating to apparatus, methods, materials, recipes, compounds, physical and chemical data and chemical reactions. It’s taken years but I am reasonably confident that CML can now describe almost all this content in a formal manner without losing critical information and where a machine can understand most of it. Over several posts I shall try to show the philosophy. You might like to think how you would capture all this - we’d welcome additional insights.

CML - what and why

February 2nd, 2008

Why should I use CML and not Molfiles, asks Rich Apodaca.

I can’t answer that question till I know what you want to do, Rich. So to start here are some basic aspects of CML.

CML - Chemical Markup Language - is a conformant XML language describing chemistry. It has the advantage of being in XML so that any XML-conformant toolset can, in principle, do something useful with it. It has the disadvantage that XML frightens some people and seems unnatural.

CML addresses a wide range of chemistry and solves many, but not all, common problems. It covers:

  • molecules, atoms and bonds
  • chemical substances and recipes
  • reactions
  • spectra
  • crystal structures and other solid state objects
  • chemical computation

It is not simply “another file format” but an expressive language in which a wide range of concepts can be constructed. It can express certainty and uncertainty and be clear about the degree. It has elements of natural language. It also allows for sub-communities to develop dialects which use some of the language and can also indicate that components should be interpreted in particular ways.

It is primarily aimed at communicating chemistry without semantic loss between systems which do not otherwise interoperate. These include:

  • humans to humans (e.g. authors to publishers)
  • humans to machines (e.g. job submission or ingestion of data)
  • machines to humans
  • machines to machines  (program to program)

As a result complex semantic chains (workflows) can be built using XML as the transport layer.

It separates ontology (meaning) from syntax and semantics by coupling concepts to dictionaries (through the dictRef attribute). This allows groups of chemists and other scientists to build their own vocabularies. The three most active areas of CML usage at present are:

  • export and import from repositories (or databases)
  • coupling processes in computational chemistry (e.g. input and output of large QM codes)
  • semantic publishing including the use of several markup languages (CML, MathML, SVG, XSLT, etc.)

In our own group we have used it to create a polymer building system and to represent Markush structures in a machine-processable way. It is also used to hold chemistry resulting from chemical natural language processing (OSCAR3).
We are also using it to transform to and from RDF representations of molecules, substances and their properties.

I’ll continue to write at fairly regular intervals.

Comments

February 2nd, 2008

Many thanks to Rich for sending the following. Some immediate comments, and please stimulate further discussion


Peter, this sounds like a good idea.

Some things I find difficult about CML:

(1) Lack of a large, up-to-date and easily-locatable collection of examples focussed on small molecules and illustrating “best practices” and how to deal with difficult cases. A collection of side-by-side comparisons of molfile and CML for a particular molecule. The blog might be a place to begin assembling it. The entry on CML-2.2.1 at:

http://cml.sourceforge.net/

looks like its examples and XSD were last updated in 2003.

PMR: This is all true. There is a problem in trying to maintain disparate sites. Partly this is because I have moved sufficiently often that things get fragmented - bits of CML are to be found in Birkbeck, Nottingham, Cambridge and xml-cml.org.

(2) A single, authoritative site for up-to-date information on CML and a statement from said site to this effect. For example, there used to be an xml-cml.org site which seems to now be defunct, is listed on the Wikipedia article, and doesn’t redirect to the right site.

PMR: We have had to change our site for technical and business reasons.

(3) A set of simple, practical arguments, with illustrative examples, for why CML should be used in my next project and not molfile.

PMR: This is what the next series of posts will be addressing. We had intended to run the site as a Wiki at Sourceforge and this got seriously spammed. There are reasons why we don’t want the home page to be at cam.ac.uk - Henry is at Imperial. I shall also need to be able to uplift all the material from a site and relocate if necessary and this is a considerable challenge.

In summary - yes the pages are a mess and we are reconstructing them. HTML and writable Wikis have not worked well. Sourceforge has worked well for the code and the formal examples but not for the informal ones. I believe that a blog will allow a free approach to presentation and allow good discussion - let’s see.