Archive for the ‘chemistry’ Category
Working with the NCI
Monday, February 4th, 2008
Automatic assignment of charges by JUMBO
Thursday, January 31st, 2008
Why chemistry-rich RSS feeds matter… data minging,
The example shown by Peter was nicely chosen: something is wrong with that example. It uncovers a bug in the pipeline that could have been uncovered by a simple agent monitoring the RSS feed. That is why this technology is important! It allows pipelining of information between services. Anyway, before you read on, check the structure in the example yourself (Bis(pyrimidine-2-carboxylato-K2N,O)copper(II)). Done? Checked it? You saw the problem, right? Good.
The charges in the structure are indeed wrong. There are two challenges…
- for structures with more than one moiety (isolated fragment) in the structure it is formally impossible to know the charges if the author doesn’t give them. The authors can give them in _chemical_formula_moiety but these entries are often difficult to parse correctly and in any case they often aren’t given. In those cases we don’t try to assign charges. (The crystallographic experiment itself cannot determine charges).
- In cases where the fragment contains only light atoms it is usually (but not always) possible to allocate charges by machine. In cases with metals it’s usually impossible to do a good job. The molecule in question is:

The molecule itself is neutral. The easiest way is not to put any charges. Anything else is uncomfortable. We can have + charges on the N’s which is natural, but then there are 2 - charges on the Cu. That’s formally correct but since the metal is usually described as Cu(II) it’s not happy. Or we can play around with the aromaticity, or dissociate the Cu-N or C-O bonds but that’s not happy either. And this is simple compared with many metal structures.
What we have been doing is to dissociate the metal, do the aromaticity and charges, and then add the metal back. In doing so it’s easy to forget the charges and that is what has happened. We’ll try to fix it.
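A charge-balance check is one way to catch charges that go missing when the metal is added back. The following is only a minimal sketch and not JUMBO's actual code: the representation of atoms as (symbol, formal charge) tuples, the function names, and the abbreviated ligand lists are all assumptions made for illustration.

```python
def total_charge(atoms):
    """Sum the formal charges over a list of (symbol, charge) tuples."""
    return sum(charge for _symbol, charge in atoms)


def reattach_metal(ligand_atoms, metal_symbol, metal_charge, expected_total=0):
    """Add the metal back and fail loudly if the charges no longer balance."""
    atoms = list(ligand_atoms) + [(metal_symbol, metal_charge)]
    actual = total_charge(atoms)
    if actual != expected_total:
        raise ValueError(
            f"charge mismatch: expected {expected_total}, got {actual}"
        )
    return atoms


# Hypothetical, heavily abbreviated ligand set: two carboxylate O atoms
# carrying -1 each (all neutral ligand atoms are omitted for brevity).
ligands_with_charges = [("O", -1), ("O", -1)]
# The failure mode described in the post: ligand charges were dropped.
ligands_charges_forgotten = [("O", 0), ("O", 0)]

complex_atoms = reattach_metal(ligands_with_charges, "Cu", +2)
```

Calling `reattach_metal(ligands_charges_forgotten, "Cu", +2)` raises a `ValueError`, because the total would be +2 for a molecule that is supposed to be neutral.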
But in the end the only thing that matters is the total electron count and the spin state (which normally isn’t given except in the text). Cu2+ is d9 so it has one unpaired electron. But Fe is much more difficult and it’s virtually impossible to do anything automatic. We’ll probably simply leave the charges off…
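The electron counting can be made concrete with a small worked example. This is a sketch under a strong assumption: an idealized octahedral crystal field with simple t2g/eg filling, which real structures (including this copper complex) need not follow. The function names are invented for illustration.

```python
# Periodic-table group numbers for the two metals mentioned in the text.
GROUP = {"Fe": 8, "Cu": 11}


def d_electrons(symbol, oxidation_state):
    """d-electron count of a transition metal ion: group minus oxidation state."""
    return GROUP[symbol] - oxidation_state


def unpaired_octahedral(d, high_spin):
    """Unpaired electrons for a d^n ion in an idealized octahedral field."""
    if high_spin:
        # All five d orbitals are singly occupied before any pairing.
        return d if d <= 5 else 10 - d
    # Low spin: the three t2g orbitals fill completely before the two eg.
    t2g = min(d, 6)
    eg = d - t2g
    unpaired_t2g = t2g if t2g <= 3 else 6 - t2g
    unpaired_eg = eg if eg <= 2 else 4 - eg
    return unpaired_t2g + unpaired_eg


cu2 = d_electrons("Cu", 2)  # d9
fe2 = d_electrons("Fe", 2)  # d6
fe3 = d_electrons("Fe", 3)  # d5
```

In this model d9 gives one unpaired electron whether the spin state is high or low, while d6 gives four (high spin) or zero (low spin) and d5 gives five or one. For Fe the answer therefore depends on both the oxidation state and the spin state, neither of which a crystal structure file reliably supplies.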
Semantic Chemical Computing
Wednesday, January 30th, 2008
Does the semantic web work for chemical reactions
Friday, January 4th, 2008
A very exciting post from Jean-Claude Bradley asking whether we can formalize the semantics of chemical reactions and synthetic procedures. Excerpts, and then comment…
PMR: This is very important to follow - and I’ll give some of our insights. Firstly, we have been tackling this for ca. 5 years, starting from the results as recorded in scientific papers or theses. Most recently we have been concentrating very hard on theses and have just taken delivery of a batch of about 20, all from the same lab. I agree absolutely with J-C that traditional recording of chemical syntheses in papers and theses is very variable and almost always misses large amounts of essential details. I also agree absolutely that the way to get the info is to record the experiment as it happens. That’s what the Southampton projects CombeChem and R4L spent a lot of time doing. The trouble is it’s hard. Hard socially. Hard to get chemists interested (if it was easy we’d be doing it by now). We are doing exactly the same with some industrial partners. They want to keep the lab book. The paper lab book. That’s why electronic notebook systems have been so slow to take off. The lab book works - up to a point - and it also serves the critical issues of managing safety and intellectual property. Not very well, but well enough. J-C asks
Modularizing Results and Analysis in Chemistry
Chemical research has traditionally been organized in either experiment-centric or molecule-centric models.
This makes sense from the chemist’s standpoint. When we think about doing chemistry, we conceptualize experiments as the fundamental unit of progress. This is reflected in the laboratory notebook, where each page is an experiment, with an objective, a procedure, the results, their analysis and a final conclusion optimally directly answering the stated objective. When we think about searching for chemistry, we generally imagine molecules and transformations. This is reflected in the search engines that are available to chemists, with most allowing at least the drawing or representation of a single molecule or class of molecules (via substructure searching). But these are not the only perspectives possible. What would chemistry look like from a results-centric view? Let’s see with a specific example. Take EXP150, where we are trying to synthesize a Ugi product as a potential anti-malarial agent and identify Ugi products that crystallize from their reaction mixture. If we extract the information contained here based on individual results, something very interesting happens. By using some standard representation for actions we can come up with something that looks like it should be machine readable without much difficulty:
- ADD container (type=one dram screwcap vial)
- ADD methanol (InChIKey=OKKJLVBELUTLKV-UHFFFAOYAX, volume=1 ml)
- WAIT (time=15 min)
- ADD benzylamine (InChIKey=WGQKYBSKWIADBV-UHFFFAOYAL, volume=54.6 ul)
- VORTEX (time=15 s)
- WAIT (time=4 min)
- ADD phenanthrene-9-carboxaldehyde (InChIKey=QECIGCMPORCORE-UHFFFAOYAE, mass=103.1 mg)
- VORTEX (time=4 min)
- WAIT (time=22 min)
- ADD crotonic acid (InChIKey=LDHQCZJRKDOVOX-JSWHHWTPCJ, mass=43.0 mg)
- VORTEX (time=30 s)
- WAIT (time=14 min)
- ADD tert-butyl isocyanide (InChIKey=FAGLEPBREOXSAC-UHFFFAOYAL, volume=56.5 ul)
- VORTEX (time=5.5 min)
- TAKE PICTURE
It turns out that for this CombiUgi project very few commands are required to describe all possible actions:
- ADD
- WAIT
- VORTEX
- CENTRIFUGE
- DECANT
- TAKE PICTURE
- TAKE NMR
By focusing on each result independently, it no longer matters if the objective of the experiment was reached or if the experiment was aborted at a later point. Also, if we recorded chemistry this way we could do searches that are currently not possible:
- What happens (pictures, NMRs) when an amine and an aromatic aldehyde are mixed in an alcoholic solvent for more than 3 hours with at least 15 s vortexing after the addition of both reagents?
- What happens (picture, NMRs) when an isonitrile, amine, aldehyde and carboxylic acid are mixed in that specific order, with at least 2 vortexing steps of any duration?
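To illustrate how queries of this kind might run once the actions are machine readable, here is a minimal Python sketch. It is not the representation ChemSpider or the UsefulChem group actually uses: the `Action` class, the parsing rules and the two query helpers are assumptions, and the parser only handles the simple `VERB subject (key=value, ...)` shape of the log lines quoted in the excerpt (values containing commas would break it).

```python
import re
from dataclasses import dataclass, field

# Verbs from the command list in the excerpt; two-word verbs are listed first.
VERBS = ("TAKE PICTURE", "TAKE NMR", "ADD", "WAIT", "VORTEX", "CENTRIFUGE", "DECANT")


@dataclass
class Action:
    verb: str
    subject: str = ""
    params: dict = field(default_factory=dict)


def parse_action(line):
    """Parse one log line such as '- ADD methanol (volume=1 ml)'."""
    text = line.strip().lstrip("-").strip()
    verb = next(v for v in VERBS if text.startswith(v))
    rest = text[len(verb):].strip()
    params = {}
    match = re.search(r"\((.*)\)\s*$", rest)
    if match:
        for pair in match.group(1).split(","):
            key, _, value = pair.partition("=")
            params[key.strip()] = value.strip()
        rest = rest[: match.start()].strip()
    return Action(verb, rest, params)


def count_verb(actions, verb):
    """Number of steps with the given verb, e.g. how many VORTEX steps."""
    return sum(1 for action in actions if action.verb == verb)


def added_in_order(actions, names):
    """True if the named substances were added in the given relative order."""
    added = iter(a.subject for a in actions if a.verb == "ADD")
    return all(name in added for name in names)


LOG = """\
- ADD methanol (InChIKey=OKKJLVBELUTLKV-UHFFFAOYAX, volume=1 ml)
- ADD benzylamine (InChIKey=WGQKYBSKWIADBV-UHFFFAOYAL, volume=54.6 ul)
- VORTEX (time=15 s)
- ADD phenanthrene-9-carboxaldehyde (InChIKey=QECIGCMPORCORE-UHFFFAOYAE, mass=103.1 mg)
- VORTEX (time=4 min)
- TAKE PICTURE
"""

actions = [parse_action(line) for line in LOG.splitlines()]
```

With the actions parsed, "at least 2 vortexing steps" becomes `count_verb(actions, "VORTEX") >= 2`, and "mixed in that specific order" becomes a call to `added_in_order`. Matching on compound classes (amine, aromatic aldehyde) rather than names would need substructure information that this sketch does not attempt.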
I am not sure if we can get to that level of query control, but ChemSpider will investigate representing our results in a database in this way to see how far we can get.
Note that we can’t represent everything using this approach. For example observations made in the experiment log don’t show up here, as well as anything unexpected. Therefore, at least as long as we have human beings recording experiments, we’re going to continue to use the wiki as the official lab notebook of my group. But hopefully I’ve shown how we can translate from freeform to structured format fairly easily. Now one reason I think that this is a good time to generate results-centric databases is the inevitable rise of automation. It turns out that it is difficult for humans to record an experiment log accurately. (Take a look at the lab notebooks in a typical organic chemistry lab - can you really reproduce all those experiments without talking to the researcher?) But machines are good at recording dates and times of actions and all the tedious details of executing a protocol. This is something that we would like to address in the automation component of our next proposal. Does that mean that machines will replace chemists in the near future? Not any more than calculators have replaced mathematicians. I think that automating result production will leave more time for analysis, which is really the test of a true chemist (as opposed to a technician). Here is an example[...]
database, as long as attribution is provided. (If anyone knows of any accepted XML for experimental actions let me know and we’ll adopt that.)
I think this takes us a step closer from freeform Open Notebook Science to the chemical semantic web, something that both Cameron Neylon and I have been discussing for a while now.
If anyone knows of any accepted XML for experimental actions let me know and we’ll adopt that
CML has been designed to support this, and Lezan Hawizy in our group has been working in detail over the last 4 months to see if CML works. It’s capable of managing inter alia:
- observations
- actions
- substances, molecules, amounts
- parameters
- properties (molecules and reactions)
- reactions (in detail) with their conditions
- scientific units
- solvents
- reagents
- apparatus
- procedures
- appearances
- units
- common molecules
Is the scientific archive safe with publishers?
Monday, December 31st, 2007
17. Metalate on December 1, 2007 11:00 AM writes… Has anyone noticed that OL has removed all but the first page of the Supporting Info from the 2006 paper? Is this policy on retracted papers? And if so, why?
PMR: I wasn’t reading this story originally, so went back to the article:
As I’ve [Rudy Baum] written on this page in the past, one important consequence of electronic publishing is to shift primary responsibility for maintaining the archive of STM literature from libraries to publishers. I know that publishers like the American Chemical Society are committed to maintaining the archive of material they publish.
PMR: I am not an archivist but I know some and I don’t know of any who deliberately censor the past. So I have some open questions to the American Chemical Society (and to other publishers who have taken on the self-appointed role of archivist):
- what is the justification for this alteration of the record? Why is the original not still available with an annotation?
- who - apart from the publisher - holds the actual formal record of publications? And how do I get it? (Remember that a University library who subscribes to a journal will probably lose all back issues - unlike paper journals the library has not purchased the articles, only rented them). I assume that some deposit libraries hold copies but I bet it’s not trivial to get this out of the British Library.
- where and how can I get hold of the original supplemental data? And yes, I want it for scientific purposes - to do NMR calculations. Since it was originally free, I assume it is still free.
Open Data: publishers are the problem
Tuesday, December 18th, 2007
- pm286 Says: October 26th, 2007 at 7:54 am (1) All data come from Free sources - i.e. visible without a subscription. Some journals (Acta Crystallographica and RSC for example) do not copyright the data. Others like ACS add copyright notices. It is our contention, and Elsevier has agreed for its own material, that facts are not copyrightable. We have therefore extracted and transformed facts and mounted these. Where the original material (CIF) does not carry copyright we mount it on our pages - where it does we do not, but we have the transformed data. In those cases it would be possible to recreate the original CIF data in semantic form, but not the exact typographical layout which contains meaningless whitespace.
I am not aware that ACS or Elsevier have ever made statements of any kind about our Open Data efforts.
You may scrape anything, but you must honour the source and the metadata and you should add the Open Data sticker. If you scrape the link (simplest) you may simply point to our site. If you scrape more data you should ensure that the integrity of the data is maintained and that if it is re-used the re-used data should still clearly show our metadata.
We have already done the work to scrape certain data from the site but have chosen to be extra careful with taking the declaration of Open Data made to all data sources. My primary worry was with the data scraped from the ACS journals. With this caution in mind I sent a letter to the copyright department at ACS as outlined here. In fact I made a couple of phone calls, sent the email about 2 more times and finally managed to talk to a nice gentleman from the ACS copyright department and brought my concerns to light. Since then we have exchanged multiple emails, spoken again on the phone and I have been told that a meeting of minds from both Washington and Ohio was being scheduled to discuss the situation. That’s 2 months after my original email. Today I received the following email and I am excerpting from it.. “Thank you for your inquiry about the proposed use by ChemSpider of information in the CrystalEye database that has been published within certain ACS journal publications. In light of your query, we are examining the manner in which ACS published material is represented within that database as well as the nature of your proposed use, so that we can respond in an informed manner to your request. <snip> If you will be attending the ACS National Meeting in New Orleans, perhaps we could confer with you at that time to discuss our findings and advise you appropriately? Communicators Name withheld ” What I thought was a simple question and done with the intention that ChemSpider was safe turns out not to be so simple. It could take until March 2008 to get an answer! At this stage we will not be publishing any of the CrystalEye data without confirmation from each of the publishers that this is allowed. 
I asked the question previously “Who gets to declare data open or not?” and even received the question “Why even offer the option of closed?” The primary reason is that we have turbulent times ahead of us around such issues of “openness” and until these are navigated I am working to keep ChemSpider “safe”. I am willing to participate, support and contribute to the evangelism of openness but am equally concerned with keeping ChemSpider alive for the close to 3000 users per day now accessing the service. It was an interesting day to receive this email about a potential FIVE MONTH delay to a decision about Open Data especially now that Science Commons have released a Protocol for Implementing Open Access Data just yesterday. … So, while protocols are exposed to the community by Science Commons the challenge of utilizing them now begins… I will be in communication with members of the Science Commons soon to determine how ChemSpider can fit into the model…
PMR: This is, unfortunately, completely typical. Earlier this year I wrote to Tetrahedron (an Elsevier journal) asking if they would consider posting CIFs (crystallographic data):
Request for Open publication of crystallographic data in Elsevier’s Tetrahedron
=========== Open letter to editors of Tetrahedron ==========
Professor L. Ghosez, Professor Lin Guo-Qiang, Professor T. Lectka, Professor S.F. Martin, Professor W.B. Motherwell, Professor R.J.K. Taylor, Professor K. Tomioka
Subj: Request for Open publication of crystallographic data in Tetrahedron
Dear editors,
I have recently been reviewing access to supplemental data in chemistry publications, in particular crystallographic data (“CIFs”). Many publishers (IUCr, RSC, ACS…) expose these on their websites as Open Data (for examples see: http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=455). The data are acknowledged not to be copyrightable (see http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=447) where your colleague Jennifer Jones (copied) has confirmed:
- Dear Peter Murray-Rust
- Thanks for your email. Data is not copyrighted. If you are reusing the entire presentation of the data, then you have to seek permission, otherwise, you can use the data without seeking our permission.
- Yours sincerely
- Jennifer Jones
- Rights Assistant
- Global Rights Department
- Elsevier Ltd
- PO Box 800
- Oxford OX5 1GB
- UK
- Tel: + 44 (1) 865 843830
- Fax: +44 (1) 865 853333
- email: j.jones@elsevier.com
Other Elsevier journals such as those publishing thermochemistry (see last blog post) are now actively making the supplemental data Openly available on the journal website. I am therefore asking whether Tetrahedron (and perhaps other Elsevier chemistry journals) might consider publishing their data Openly in this way and would be grateful for your views. (This is an Open letter (http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=456) and I would like to publish your reply so please mark any confidential material as such). Thank you for considering this
PMR: Five editors - I haven’t had the courtesy of a reply. This is not uncommon - I didn’t get replies on Open topics from Wiley, Springer (first time round) either. Either journals are not in the habit of replying - they consider ordinary scientists too low in the foodchain to merit consideration (most likely) - or they regard anything Open as a pain and want to slow it by inaction (also most likely). They have their set way of doing things - God ordained in 1972 that the world belongs to the publishers and they don’t want to see it change. Another typical example. I was invited to write an article for Serials Review on Open Data. I asked if I could write my article in HTML and embed my own copyright material, noted as such under appropriate licence. The editorial office said they would come back to me. It’s now past the closing date of the submission. After ca. 6 weeks I got the reply:
Facts and data are not copyrightable but the expression of data is copyrightable. If you wish to use third-party data in a different format within your article, including full acknowledgement to the source of the data, then that would be acceptable. However, if you wish to retain the expression of the data, then you will need to include alternate diagrams within the article.
So I can use the data - IF I can get it. If I can only get a graph then I can’t unless I redraw it. Is redrawing a graph a useful activity for science - do I need to answer? The only value is that it adds some random errors to the data (or systematic ones) that would be fun to give as exercises in bad scientific practice for students. “Expression of the data” - i.e. the author’s graphs - are not re-usable. So what’s the answer? Currently I use the “ask forgiveness, not ask permission” mode. And if the “owners” of the data (read “appropriators”) send the lawyers and ask for a take-down - make a huge public fuss. As the world did when Shelly Batts “stole” a graph from Wiley (Sued for 10 Data Points). And Wiley backed down. The publishers don’t like public fuss. So a few months ago I would have advised Chemspider “go ahead”. But they ran foul of another publisher (I think it was the Royal Society of Chemistry). I never understood the details but Chemspider linked to publicly visible papers (not Open) and were asked to take the links out of the Chemspider database. This doesn’t even seem to make sense. I would have thought publishers would like people linking to their papers - maybe it was the metadata. So I appreciate Chemspider’s wish to remain on the correct legal side of the publisher. But [the publishers'] actions destroy scientific data in the current century. Chemistry publishers [OA publishers and IUCr excepted] are actively and passively resisting the re-use of data. They copyright factual data, hide it, require take-downs, refuse to reply to reasonable letters - everything.
They are simply in the way between the creator of the data and the consumer. As I have blogged we now have an exciting project sponsored by Microsoft on eChemistry. We are going to fill repositories with data. And we are going to get that data (“not copyrightable” - see above) from any source we reasonably can. It will be available to the whole world. It will probably be stamped CCZero. CrystalEye will be in there. We shall, of course, include the source (provenance) as we really care about it and metadata. So people will know where it came from. Why can’t the ACS reply “Yes” to Chemspider by return? Does it really make sense for chemistry publishers to be universally seen as Luddites? Because the world will sweep these restrictive practices away, and the business will have moved from the publishers to somewhere in the twenty-first century (the one we are in).
Open Access - Chemistry World reviews the dilemma
Thursday, December 6th, 2007
This was a commissioned article, I think (Rebecca interviewed a number of people including me by phone) and does not, I think, represent any explicit or implicit policy of the RSC itself. I think the article gives a fair account of the current position in chemistry (the article is free-to-read and I give selected quotes):
But the saga [NIH bill] has highlighted a widening rift in the chemical community over open access publishing - and the contentious provision could yet be revived.
…
Major scholarly societies joined the Association of American Publishers (AAP) in lobbying against the proposal, including the American Chemical Society (ACS), the American Association for Clinical Chemistry, the Biochemical Society, and the RSC (publishers of Chemistry World).
PMR: I suspect, though I do not know, that this is distinct from the PRISM movement which was also launched from the AAP.
…
But the battle lines are already being drawn. The ACS wants the NIH policy to remain voluntary. ‘Depending on how they implement this, it could represent a federal taking of copyrighted materials,’ ACS spokesman Glenn Ruskin told Chemistry World.
A compulsory policy would need costly monitoring and penalisation systems, Ruskin said. ‘Why expend monies on a mandatory policy, when they could get to their endpoint a lot quicker if they just worked more cooperatively with the publishers?’
‘The idea of public access to research information is a little bit specious,’ added Robert Parker, managing director of RSC publishing. ‘The UK government will be funding the London Olympics in 2012, but that doesn’t mean that everybody can have free tickets - there is a big difference between funding something and having it be freely available.’
PMR: Factually the current position is that almost all chemistry publishers (such as ACS and RSC) continue to hold the copyright on closed access articles funded by governments. Maybe the analogy with the Olympics is a little bit stretched.
The Partnership for Research Integrity in Science and Medicine (PRISM) argues that the Congress bill could damage peer review by compromising the viability of non-profit and commercial journals. Predictably, the campaign has sparked outrage among open access lobby groups. In the wake of the furore, nine publishers have disavowed PRISM, including Cambridge University Press, Oxford University Press, Columbia University Press and University of Chicago Press. The ACS - which had been closely involved with PRISM - has now also played down links with the campaign.
PMR: PRISM is playing Haydn’s farewell symphony. No one seems to support it (I don’t know about the RSC - maybe this is a chance for them to comment). Is anyone left?
As a result, the steps taken by the RSC and ACS to enter this new world of publishing have received a stilted response from chemists.
For roughly a year, the RSC has had an Open Science service that allows authors to pay to make their article freely accessible to all. The basic fee for a primary research article is £1600 with a 15 per cent discount for RSC members, owner societies of RSC journals, and authors from subscribing organisations. So far, just four authors have participated.
PMR: Just in case anyone is unfamiliar with the RSC’s use of “Open Science” - this is not full Open Access under the BBB declaration but is a free-to-read version where the journal retains copyright. Readers can decide whether this is a good bargain compared with full Open Access offerings (it’s not the worst).
Indeed, there are calls for bold and decisive leadership on this increasingly divisive issue from all sides of the chemistry community. ‘Vision is needed. Where we are at the moment is unacceptable,’ said the ACS’s Ruskin.
PMR: I have indeed argued frequently that bold and decisive leadership is necessary and that it should come from learned societies and International Unions who are respected by the community. But if it doesn’t come from there, the community will find another way and in the Internet era that can happen very quickly.
Dog food is tasty!
Tuesday, November 13th, 2007
PMR: Thanks. I’ll need help. First we need to make sure the WordPress version is correct. I have 2.03. There are no immediate plans to upgrade but this might swing it. I would re-open the CMLBlog (which is sleeping till I can author better). I probably need some hand-holding. I think we are gradually getting places. Some years ago we (Henry, Egon and me) hacked CMLRSS. It works, but only with a complicated bespoke client. Now we’ve got a better handle of the technology and with Atom+PNG we can direct intravenous feeds of CrystalEye. Every new structure with full structural diagram (well every organic one). Here’s Jim’s post… That’s what real publishers should be thinking about. What’s inside the post as well as on the surface. Come to think of it we can probably put it on ICE.
I’m looking forward to seeing Peter Murray Rust eat my dog food. He’s lucky cos at our place the hounds eat relatively benign dry food. [...]
I’m going to follow up on using ICE for blogging to WordPress soon which is what that dog food stuff is about, but Peter has just pointed out some issues with getting papers into institutional repositories and I wanted to discuss some of his points here.
[...]I liked the last bit, so I added some emphasis:
And if I were funding repositories I would certainly put resource into communal authoring environments. If you do that, then it really is a one-click reposition instead of the half-day mess of trying to find the lost documents.
I’ll be sure to mention this to our friends at DEST.
CrystalEye: data loss and corruption through legacy files
Sunday, November 4th, 2007
- Andrew Dalke Says: November 4th, 2007 at 2:32 am
- PMR: Moreover crystal structures contain problems such as disorder and partial occupancy which are impossible to hold in an SDFile as far as I know without corrupting the data.
- “Corruption” is a strong word. Why not think of it as the way you wrote in your “Round-trip format conversion” wikipedia article?
- eMolecules gives the formula NH=O and the molecular weight as 31.014.
- The PubChem Project gives both NH=O and the correct formula (NO)
  (PubChem results shown: 1: CID 945; 2: CID 145068.)
- Chemspider gives 4 different formulae (HN=O, NO+, NO, NO-), all with different molecular weights.
- When a document in one format is converted to another there is likely to be information loss. Is “information loss” necessarily “corruption”? From my experience in dealing with PDB files, which has some of these crystallographic properties, I think there can be meaningful information despite the information loss. So long as the tools and the users understand that there are limitations in the conversion.
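The molecular weights quoted in this thread can be checked with simple arithmetic. The sketch below uses abridged standard atomic weights; the function name is invented for illustration. It shows that 31.014 is exactly NO plus one hydrogen. Why a hydrogen would appear during conversion (for example, a tool filling the nitrogen valence) is an assumption, not something the thread establishes.

```python
# Abridged standard atomic weights in g/mol.
ATOMIC_WEIGHT = {"H": 1.008, "N": 14.007, "O": 15.999}


def molecular_weight(composition):
    """Molecular weight from an {element: count} mapping, rounded to 3 places."""
    total = sum(ATOMIC_WEIGHT[element] * count for element, count in composition.items())
    return round(total, 3)


no = molecular_weight({"N": 1, "O": 1})           # 14.007 + 15.999
hno = molecular_weight({"H": 1, "N": 1, "O": 1})  # adds 1.008 for the extra H
difference = round(hno - no, 3)
```

Here `no` is 30.006 and `hno` is 31.014, so the two formulae differ by one hydrogen atom (1.008). A converted record that silently gains or loses that hydrogen describes a different chemical species, which is the sense in which the loss matters for downstream users.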
COST D37 Meeting in Rome
Saturday, October 27th, 2007
What is COST? COST is one of the longest-running instruments supporting co-operation among scientists and researchers across Europe. COST now has 35 member countries and enables scientists to collaborate in a wide spectrum of activities in research and technology. [...]
PMR: I’m always proud to be involved in European collaborations. When I was born Europe was tearing itself apart. Whatever we may think of the bureaucracy involved it’s worth it. Science and scientists have always been a major force in international collaboration, and the prevention of conflict. The meeting itself (COST D37) is aimed at interoperability on chemical computation:
Objective
Realistic modelling in chemistry often requires the orchestration of a variety of application programs into complex workflows (multi-scale modelling, hybrid methods). The main objective of this working group (WG) is the implementation, evaluation and scientific validation of workflow environments for selected illustrator scenarios.
Goals
In the CCWF group, the focus is on the implementation and evaluation of quantum chemical (QC) workflows in distributed (Grid) environments. This is accomplished by:
- The implementation of workflow environments for QC by adapting standard Grid technologies.
- Fostering standard techniques (interfaces) for handling quantum chemical data in a flexible and extensible format to ensure application program interoperability and support of an efficient access to chemical information based on a Computational Chemistry ontology.
- The implementation of computational chemistry illustrator scenarios from areas of heterogeneous catalysis, QSAR/QSPR, and rational materials design to demonstrate the applicability of our approach.
PMR: So I’ll be talking about the World Wide Molecular Matrix (WWMM) and Andrew will talk on Golem - which will transduce the output of computational programs into ontologically supported components that can be fed into other programs without loss of information. I shall try to present as much as possible from the WWW, linking into CrystalEye and OpenNMR.