Unilever Centre for Molecular Informatics
 

petermr’s blog

A Scientist and the Web

 

Archive for the ‘"virtual communities"’ Category

OPSIN: why it can become the de facto name2structure

Saturday, May 16th, 2009
In a previous post I reviewed our chemical language processing tools - OSCAR and OPSIN. This post updates progress on OPSIN, the IUPACName2Structure converter.

Why do we need a name2structure converter? It’s because chemists use language to communicate the identities of objects. It’s possible to talk simple chemistry over the phone whereas it wouldn’t be easy to describe star maps, isotherms, engineering drawings, etc. And, because of this, chemists often abbreviate names - it’s easier to say “mesitylene” than “1,3,5-trimethyl benzene” or “DDT” instead of “paradichlorodiphenyltrichloroethane” (experts will cringe at the horror of this name which is seriously non-systematic and which could not be worked out by man or machine. There is, however, a lovely limerick based on it).

The rules for naming compounds are set out by the Int. Union of Pure and Applied Chemistry. Even if you are not a chemist, have a look at the IUPAC Nomenclature Home Page, which represents years of devoted work by chemists, much of the organization done by Gerry Moss. There are many reasons why the field is complicated:

  • Almost all compounds can be named in many ways. Thus CH3-O-CH3 could be called methyl ether, dimethyl ether, 2-oxa-propane and so on. IUPAC has recommendations for which of these should be used but they are often ignored, and sometimes are honoured in the breach. Most practising chemists, unless they routinely patent a lot of compounds, neither know these recommendations nor care.
  • Errors are common. Letters can be elided, brackets missed etc. and plain mistakes made. How many readers could say accurately what the structure (if any) is of capric chloride, caproic chloride, caproyl chloride, caprilyl chloride, and capriloyl chloride? Don’t be a goat, it matters:


Buy Caprylic Acid Tablets. Stay fit and healthy, naturally. HollandandBarrett.com/CaprylicAcid

AND

Capric Acid Bulk tankers, drums and other sizes call 877-KIC-Bulk for pricing



So nomenclature is a black art. It’s semi-finite in that there are currently a finite number of compounds known (some 10s of millions) and a finite set of rules that can be used to generate an infinite set of names. In a similar way there is a finite set of English words that can be used to generate an infinite set of articles. So, in principle, we could encode a finite set of rules, updated every year when IUPAC generate more rules, that would completely interpret chemical name space.

In practice however the labour of doing this has been too great for anyone. Even the market leaders in name2structure would not correctly interpret all the examples in the IUPAC rulebook. There’s a very long tail - many rules which apply to only a few compounds - or none - in the 30 million. Not cost-effective at this stage. [There would be a cost-effective way if IUPAC rules were semantically encoded, but that's many years away if at all.]



Ideally there should be one name2structure converter, sanctioned by IUPAC. Just like there is one InChI, sanctioned by IUPAC. In bioscience this would have happened. But in chemistry we have a mess of competitive products, of very variable quality. They cost money (some are free to academics), have many errors, have no agreed standard of quality, have no believable metrics, and offer no way for the community to contribute.

A classic picture of anticommons.

So why are we developing OPSIN? In research terms it’s a “solved problem”. We are frequently told academia shouldn’t do things that the commercial sector does better.

In fact we are doing things better and we are doing language research. The motivations are:

  • generic use of language. Chemistry often uses phrases like “substituted pyridines”. There is no formal way of representing this concept and we are developing languages that provide a grammar. This is hard, it’s research and it’s valuable for the community, such as interpreting patents.
  • disambiguation. This is a key problem in NLP and certainly worthy of research. What does “chloroethylbenzene” mean? It’s ambiguous and could be any of 5 structures (ClCCc1ccccc1, CC(Cl)c1ccccc1, Clc1ccccc1CC, Clc1cc(CC)ccc1, Clc1ccc(CC)cc1), of which one has further stereoisomers. Which did the author mean? Can this be deduced from context? OPSIN will indicate whether a structure is ambiguous and in time may even attempt to reason what was meant.
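
As a rough illustration of the ambiguity (this is not OPSIN code), the Python sketch below stores the five candidate SMILES quoted above, checks that they all contain the same heavy atoms and so are genuine isomers, and flags the name as ambiguous. The names `CANDIDATES`, `heavy_atoms` and `is_ambiguous` are invented for this example, and the atom counter is deliberately crude: it only understands the C, c and Cl symbols that occur in these strings.

```python
import re
from collections import Counter

# The five candidate structures for "chloroethylbenzene" quoted in the post.
CANDIDATES = {
    "chloroethylbenzene": [
        "ClCCc1ccccc1",
        "CC(Cl)c1ccccc1",
        "Clc1ccccc1CC",
        "Clc1cc(CC)ccc1",
        "Clc1ccc(CC)cc1",
    ]
}


def heavy_atoms(smiles):
    """Count heavy atoms in a SMILES string (hypothetical helper).

    Only 'Cl', aliphatic 'C' and aromatic 'c' are recognised, which is
    enough for the strings above; this is not a general SMILES parser.
    """
    counts = Counter()
    # 'Cl' is listed first so that it is not read as carbon followed by 'l'.
    for symbol in re.findall(r"Cl|C|c", smiles):
        counts["Cl" if symbol == "Cl" else "C"] += 1
    return dict(counts)


def is_ambiguous(name):
    """Treat a name as ambiguous if it maps to more than one distinct SMILES."""
    return len(set(CANDIDATES.get(name, []))) > 1


if __name__ == "__main__":
    for smiles in CANDIDATES["chloroethylbenzene"]:
        print(smiles, heavy_atoms(smiles))
    print("ambiguous:", is_ambiguous("chloroethylbenzene"))
```

Every candidate comes out with eight carbons and one chlorine, so nothing in the name alone can choose between them; that choice has to come from context.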


These are the research reasons. We’ve now been joined by Daniel Lowe, a first-year PhD student supported by Boehringer Ingelheim to do research into machine interpretation of patents containing chemistry. Daniel’s made an excellent start, primarily by extending OPSIN. When he took this over from PeterC it was not a competitive tool.

Now it is.

How do we measure its success? There are no agreed corpora or metrics for chemistry NLP so we have to be careful. The essentials are to be Open and systematic and to invite community buy-in.

In essence Daniel has taken a representative set of 10000 “formally correct” IUPAC names and analysed them with OPSIN and 2 other commercial programs. (You will appreciate that it is not easy to get funding to buy programs simply to test them so there are others we cannot use). At present we find for one corpus  progA ~ OPSIN ~ progB and in two others progA > OPSIN > progB (yes, you will be kept guessing).

Treat all metrics with great suspicion, but OPSIN’s recall (i.e. names it translates correctly) is around 80% and it has the lowest error rate (incorrectly translated names) of all programs (ca 1%). [You should ask "on what corpus?" - and shortly we'll tell you and Open it.] We believe that the main reason why OPSIN < progA is vocabulary. Adding vocabulary is tedious as there is a very long tail. It’s good to do while watching cricket (as I am doing) but it’s still slow.
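
For readers who want the two figures pinned down, here is a minimal Python sketch of how they could be computed from a list of per-name outcomes. It uses the definitions given above (recall = fraction of names translated correctly, error rate = fraction translated incorrectly). The function name, the outcome labels and the ten-name toy corpus are all invented for illustration; they are not Daniel's test harness or his data.

```python
def name2structure_metrics(outcomes):
    """Summarise name2structure results.

    `outcomes` holds one label per name in the corpus:
    'correct'   - a structure was produced and it matches the reference,
    'incorrect' - a structure was produced but it is wrong,
    'unparsed'  - the program declined to produce a structure.
    """
    total = len(outcomes)
    if total == 0:
        raise ValueError("cannot compute metrics on an empty corpus")
    return {
        "recall": outcomes.count("correct") / total,
        "error_rate": outcomes.count("incorrect") / total,
        "unparsed_rate": outcomes.count("unparsed") / total,
    }


if __name__ == "__main__":
    # Toy corpus of ten names, made up purely to exercise the function.
    toy = ["correct"] * 8 + ["incorrect"] + ["unparsed"]
    print(name2structure_metrics(toy))
```

Keeping "incorrect" separate from "unparsed" matters: a program that declines to answer is much less harmful than one that confidently returns the wrong structure.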

So this is the time when we can invite crowdsourcing. Until recently that wasn’t an option, but now OPSIN has a good infrastructure and it’s possible to add vocabulary without having to modify code. Much of OPSIN’s vocabulary is in external files which are fairly easy to modify and which won’t break the system.

OPSIN has, of course, always been Open Source and so - in principle - anyone could modify it. But in practice many OS projects have an incubation period where the infrastructure is being built and it’s very difficult to have an uncontrolled community process. Now we can offer a controlled community process where large numbers of people can make small but useful contributions.

There are two methods of approach, and we’ll start with the first:
  • become a developer on Sourceforge and modify the template files to add vocabulary. Some examples of vocabulary we are missing are carbohydrates, nucleic acid components and amino-acids.
  • We should develop an interface that allows users of OPSIN to add vocabulary interactively. Thus if it fails to parse 1,5-dihydroxymanxane, OPSIN would tell the user it didn’t know what manxane was and ask for a structure+locants.
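
The second approach might look something like the Python sketch below. It is only a toy: the vocabulary is a plain dict rather than OPSIN's external files (whose real format is not shown here), the name splitter handles nothing beyond optional locants and a hydroxy prefix, and every function name is invented. The SMILES for manxane is written by hand from the bicyclo[3.3.3]undecane skeleton and stands in for what a user would supply.

```python
import re

# Toy vocabulary: parent name -> SMILES (illustrative entries only).
VOCABULARY = {"benzene": "c1ccccc1", "propane": "CCC"}


def split_name(name):
    """Split e.g. '1,5-dihydroxymanxane' into (locants, prefix, parent).

    Handles only an optional locant list and an optional (di/tri)hydroxy
    prefix; anything else is treated as part of the parent name.
    """
    match = re.fullmatch(r"(?:(\d+(?:,\d+)*)-)?((?:di|tri)?hydroxy)?(\w+)", name)
    if match is None:
        raise ValueError(f"cannot split name: {name!r}")
    return match.groups()


def find_unknown_parent(name, vocabulary):
    """Return the parent name if it is missing from the vocabulary, else None."""
    _locants, _prefix, parent = split_name(name)
    return None if parent in vocabulary else parent


def add_vocabulary(vocabulary, parent, smiles):
    """Record a user-supplied structure for a previously unknown parent."""
    vocabulary[parent] = smiles


if __name__ == "__main__":
    vocab = dict(VOCABULARY)
    missing = find_unknown_parent("1,5-dihydroxymanxane", vocab)
    if missing is not None:
        print(f"I don't know what '{missing}' is - please supply a structure.")
        # Hand-written SMILES for bicyclo[3.3.3]undecane, standing in for user input.
        add_vocabulary(vocab, missing, "C12CCCC(CCC1)CCC2")
    print(find_unknown_parent("1,5-dihydroxymanxane", vocab))
```

A real interface would also need the locant numbering of the new parent, which is why the user is asked for structure+locants and not just a structure.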


So if you are interested in helping with OPSIN please let us know. Half a dozen vocabulary contributors could make rapid progress.

And when this is done we’ll have a tool that interprets IUPAC names and which, as it is Open, can become a de facto standard.

funding models for software, OSCAR meets OMII

Saturday, May 16th, 2009
In a previous post I introduced our chemical natural language tools OSCAR and OPSIN. They are widely used, but in academia there is a general problem - there isn’t a simple way to finance the continued development and maintenance of software. Some disciplines (bioscience, big science) recognize the value of funding software but chemistry doesn’t. I can count the following other approaches (there may be combinations):
  • Institutional funding. That’s the model that ICE: The Integrated Content Environment uses. The major reason is that the University has a major need for the tool and it’s cost-effective to do this as it allows important new features to be added.
  • Consortium funding. Often a natural progression from the latter. Thus all the major repository software (DSPACE, ePrints, Fedora) and content/courseware (Moodle, Sakai) have a large formal member base of institutions with subventions. These consortia may also be able to raise grants.
  • Marginal costs. Some individuals or groups are sufficiently committed that they devote a significant amount of their marginal time to creating. An excellent example of this is George Sheldrick’s SHELX where he single-handedly developed the major community tool for crystallographic analysis. I remember the first distributions - in ca 1974 - when it was sent as a compressed deck of FORTRAN cards (think about that). For aficionados there was a single variable A(32768) in which different locations had defined meanings only in George’s head. Add EQUIVALENCE, blank COMMON and any alteration to the code except by George led to immediate disaster. A good strategy to avoid forks. My own JUMBO largely falls into this category (but with some OS contribs).
  • Commercial release. Many groups have developed methods for generating a commercial income stream. Many of the computational chemistry codes (e.g. Gaussian) go down this route - an academic group either licenses the software to a commercial company, or sets up a company itself, or recovers costs from users. The model varies. In some cases charges are only made to non-academics, and in some cases there is an active academic developer community who contribute to the main branch, such as for CASTEP.
  • Open Source and Crowdsourcing. This is very common in ICT areas (e.g. Linux) but does not come naturally to chemistry. We have created the BlueObelisk as a loose umbrella organisation for Open Data, Open Standards and Open Source in chemistry. I believe it’s now having an important impact on chemical informatics - it encourages innovation and public control of quality. Most of the components are created on marginal costs. It’s why we have taken the view that - at the start - all our software is Open. I’ll deal with the pros and cons later but note that not all OS projects are suited for crowdsourcing on day one - a reliable infrastructure needs to be created.
  • 800-pound gorilla. When a large player comes into an industry sector they can change the business models. We are delighted to be working with Microsoft Research - gorillas can be friendly - who see the whole chemical informatics arena as being based on outdated technology and stovepipe practices. We’ve been working together on Chem4Word which will transform the role of the semantic document in chemistry. After a successful showing at BioIT we are discussing with Lee Dirks, Alex Wade and Tony Hey the future of C4W.
  • public targeted productisation. In this there is specific public funding to take an academic piece of software to a properly engineered system. A special organisation, OMII, has been set up in the UK to do this…
So what and why and who and where are OMII?
OMII-UK is an open-source organisation that empowers the UK research community by providing software for use in all disciplines of research. Our mission is to cultivate and sustain community software important to research. All of OMII-UK’s software is free, open source and fully supported.
OMII was set up to exploit and support the fruits of the UK eScience program. It concentrated on middleware, especially griddy stuff, and this is of little use to chemistry which needs Open chemistryware first. However last year I bumped into Dave DeRoure and Carole Goble and they told me of an initiative - ENGAGE - sponsored by JISC - whose role is to help eResearchers directly:
The widespread adoption of e-Research technologies will revolutionise the way that research is conducted. The ENGAGE project plans to accelerate this revolution by meeting with researchers and developing software to fulfil their needs. If you would like to benefit from the project, please contact ENGAGE (info@omii.ac.uk) or visit their website (www.engage.ac.uk).
ENGAGE combines the expertise of OMII-UK and the NGS - the UK’s foremost providers of e-Research software and e-Infrastructure. The first phase, which began in September, is currently identifying and interviewing researchers that could benefit from e-Research but are relatively new to the field. “The response from researchers has been very positive” says Chris Brown, project leader of the interview phase, “we are learning a lot about their perceptions of e-Research and the problems they have faced”. Eleven groups, with research interests that include Oceanography, Biology and Chemistry, have already been interviewed. The results of the interviews will be reviewed during ENGAGE’s second phase. This phase will identify and publicise the ‘big issues’ that are hindering e-Research adoption, and the ‘big wins’ that could help it. Solutions to some of the big issues will be developed and made freely available so that the entire research community will benefit. The solutions may involve the development of new software, which will make use of OMII-UK’s expertise, or may simply require the provision of more information and training. Any software that is developed will be deployed and evaluated by the community on the NGS. “It’s very early in the interview phase, but we’re already learning that researchers want to be better informed of new developments and are keen for more training and support.” says Chris Brown. ENGAGE is a JISC-funded project that will collaborate with two other JISC projects - e-IUS and e-Uptake - to further e-Research community engagement within the UK. “To improve the uptake of e-Research, we need to make sure that researchers understand what e-Research is and how it can benefit them” says Neil Chue Hong, OMII-UK’s director, “We need to hear from as many researchers and as many fields of research as possible, and to do this, we need researchers to contact ENGAGE.”
Dave and Carole indicated that OSCAR could be a candidate for an ENGAGE project and so we’ve been working with OMII. We had our first f2f meeting on Thursday where Neil, and two colleagues, Steve and Steve came up from Southampton (that’s where OMII is centered although they have projects and colleagues elsewhere). We had a very useful session where OMII have taken the ownership of the process of refactoring OSCAR and also evangelising it. They’ve gone into OSCAR’s architecture in depth and commented favourably on it. They are picking PeterC’s brains so that they are able to navigate through OSCAR. The sorts of things that they will address are:
  • Singletons and startup resources
  • configuration (different options at startup, vocabularies, etc.)
  • documentation, examples and tutorials
  • regression testing
  • modularisation (e.g. OPSIN and pre- and post-processing)
And then there is the evangelism. Part of OMII-ENGAGE’s remit is to evangelise, through brochures and meetings. So we are tentatively planning an Open OSCAR-ENGAGE meeting in Cambridge in June. Anyone interested at this early stage should mail me and I’ll pass it onto the OMII folks. … and now OPSIN…

OPSIN and OSCAR - Chemical language processing

Saturday, May 16th, 2009

This blog is about new developments in our chemical language processors OSCAR and OPSIN and about how OMII (eScience) and we are taking them forward. We also have a JISC project with NaCTeM - CheTA and I’ll write more later about that.

Many of you will know that we have been interested for several years in the Natural Language Processing (NLP) of chemistry texts. “Text-mining” - the extraction of information from texts - is now commonplace (and will remain so until we move away from PDF as the only means of communication). Our interest has been wider - with Ann Copestake and Simone Teufel in the Computer Laboratory we’ve been trying to get machines to understand the language of chemical discourse - “why was this paper written?”, “what is the author’s relation to others?”, etc.

But to do this we needed language processing tools which were chemistry-specific, and since 2002 we’ve developed the OSCAR and OPSIN tools (see http://sourceforge.net/projects/oscar3-chem). OSCAR was the first, developed initially by Joe Townsend and Chris Waudby through summer studentships from the Royal Society of Chemistry. The first version of OSCAR was developed to check the validity of data in chemical syntheses and has been mounted on the RSC’s website for 5-6 years.

I know from hearsay that this is widely used though I don’t have any download figures. This software is variously referred to as OSCAR and internally as OSCAR-DATA or OSCAR1. It is a measure of its quality that it has been mounted for > 5 years and has run with no reported problems and required no maintenance. I continue to emphasize the value of making undergraduates full members of the research and development process and why in our group we continue to highlight their importance.

You will need some terms now:

  • chemical natural language processing - applying the full power of NLP to chemically oriented text.  This includes approaches such as tree banking where we try to interpret all the possible meanings of a sentence or phrase: “time flies like an arrow” (Marx) or “pretty little girls school”. There are relatively few systems which do this, at least in public.
  • chemical entity recognition. A subset of chemical NLP where the parsers identify words and phrases representing chemical concepts. To do this properly it’s necessary to recognize the precise phrase. Thus “benzene sulfonic acid” represents a single phrase and to interpret it as “benzene” and “sulfonic acid” is wrong. We also recognize phrases to do with reactions, enzymes, apparatus, etc. This is an area where we have put in a lot of work.
  • Chemical name recognition is an important subset of chemical entity recognition. Names can be recognised by at least (a) direct lookup - required for trivial or trade names (“cholesterol”, “panadol”) (b) machine-learning techniques on letter or n-gram frequencies and (c) interpretation (below).
  • Chemical name interpretation, e.g. of (IUPAC) names (e.g. 1-chloro-2-methyl-benzene). The Int. Union of Pure and App. Chemistry (IUPAC) oversees the rules for naming chemicals, which run to hundreds of pages. It looks algorithmic to code or decode chemical names. It is NOT. Some computer scientists have taken this as a toy language system and been defeated, because it is actually a natural language with rules, exceptions, irregular formations and a great deal of non-semantic vocabulary. It includes combinations (semi-systematic) such as 7-methyl-guanosine where if you don’t know what guanosine is you can make little progress (but not none, you know there is a methyl group).
  • Information extraction. The (often large-scale) extraction of information from documents. This is never 100% “correct”, partly through lack of vocabulary, partly through variations in language including “errors”, and partly because of ambiguity. We use the terms recall (how many of the known chemical phrases were actually found) and precision (how many of the retrieved phrases were correctly identified as chemical). Note that this requires agreement as to which phrases are chemical and this must be done by humans. This annotated corpus requires much tedious work, and to be useful must be redistributable in the community. Without it any reported metrics on the performance of tools are essentially worthless. There is commercial value in extracting chemical information and so, unfortunately, most metrics in this area are published as marketing figures. Note that the performance of a tool is not absolute but depends critically on the selection of documents on which it is run.
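
To make recall and precision concrete, here is a small Python sketch using the definitions just given. The function name and the tiny "gold" and "retrieved" phrase sets are made up for illustration; a real evaluation needs a human-annotated corpus as described above, and would compare phrase positions rather than bare strings.

```python
def precision_recall(gold, retrieved):
    """Compare retrieved phrases against a human-annotated gold standard.

    precision = correctly retrieved / all retrieved
    recall    = correctly retrieved / all gold phrases
    """
    gold, retrieved = set(gold), set(retrieved)
    true_positives = len(gold & retrieved)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    # Invented example: the tool wrongly splits "benzene sulfonic acid" in two.
    gold = {"benzene sulfonic acid", "acetyl chloride", "pyridine"}
    retrieved = {"benzene", "sulfonic acid", "acetyl chloride", "pyridine"}
    precision, recall = precision_recall(gold, retrieved)
    print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case splitting one phrase costs twice: the two fragments count against precision (2 of 4 retrieved phrases are right) and the missed whole phrase counts against recall (2 of 3 gold phrases are found).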
During this process Joe and Chris enhanced OSCAR by adding chemical name recognition using n-grams and Bayesian methods. This gave a tool which was able to recognize and interpret large amounts of the world’s published chemical syntheses. It’s at that stage that we run into the non-technical problems such as publisher firewalls, contracts, copyright and all the defences mounted against the free digital era (but that’s a different post).

The next phase was a collaborative grant between Ann Copestake and Simone Teufel of the Cambridge Computer Laboratory and myself, funded by EPSRC (SciBorg). I reemphasize that SciBorg is about many aspects of language processing besides information extraction. We were delighted to include publishers as partners, RSC, Int. Union of Crystallography and Nature Publishing Group. All these have contributed corpora, although these are not wholly Open.

In NLP an important aspect is interpreting sentence structure through Part-of-speech-tagging. Thus “dihydroxymanxane reacts with acetyl chloride” has the structure NounPhrase Verb Preposition NounPhrase. There’s a splendid tool, WordNet, that will interpret natural language components - here is what it does for “acetyl chloride” (identifying it as a Noun). But it fails on “dihydroxymanxane” - not surprising as my colleague Willie Parker coined the name manxane in 1972 and the dihydroxy derivative is generated semi-systematically. There are an infinite number of chemical names and we need tools to identify and interpret them.
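
The gap can be shown with a toy lexicon-based tagger in Python. This is neither WordNet nor OSCAR: the lexicon, the tag names and the `tag` function are all invented for the sketch, which simply looks up the longest known phrase at each position and marks anything else as UNKNOWN.

```python
# Toy general-language lexicon (illustrative entries only).
LEXICON = {
    "reacts": "Verb",
    "with": "Preposition",
    "acetyl chloride": "NounPhrase",
}


def tag(sentence, lexicon):
    """Tag a sentence by greedy longest-match lookup in `lexicon`."""
    tokens = sentence.split()
    tagged = []
    i = 0
    while i < len(tokens):
        # Try the longest phrase first so "acetyl chloride" beats "acetyl".
        for length in range(len(tokens) - i, 0, -1):
            phrase = " ".join(tokens[i:i + length])
            if phrase in lexicon:
                tagged.append((phrase, lexicon[phrase]))
                i += length
                break
        else:
            # No phrase starting here is known to the lexicon.
            tagged.append((tokens[i], "UNKNOWN"))
            i += 1
    return tagged


if __name__ == "__main__":
    sentence = "dihydroxymanxane reacts with acetyl chloride"
    print(tag(sentence, LEXICON))
    # Pretend a chemical name recogniser has contributed the missing entry.
    with_chemistry = dict(LEXICON, dihydroxymanxane="NounPhrase")
    print(tag(sentence, with_chemistry))
```

With the general lexicon alone the first word is tagged UNKNOWN; once a chemistry-aware component supplies it, the sentence comes out as NounPhrase Verb Preposition NounPhrase. Since no finite lexicon can list every chemical name, that component has to recognise names rather than look them up.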

OSCAR was therefore developed further by Peter Corbett to recognise chemical names in text and our indications are that its methods are not surpassed by any other tool. Remember that results are absolutely dependent on an annotated corpus and on the actual corpora analysed. It’s easy for any tool to get good results on the corpus it’s been trained on and lousy ones for different material. But, on a typical corpus from RSC publications OSCAR3 scores over 80% combined precision and recall. (Before you brag that your tool can do better, the study also showed that expert chemists only agreed 90%, so that is the upper limit. If chemists cannot agree on something, then machines cannot either).

OSCAR3 is now widely used. There have been over 2600 downloads from SourceForge (yes, of course OSCAR3 is Open Source). We get little feedback because chemistry is a secretive science but this at least means that there are relatively few bugs. Of course there may also be people who find they can’t install OSCAR3 but don’t contact us. The European Patent Office has used OSCAR3 on over 70,000 patents.

So OSCAR can justify some effort to make it even more usable and that’s why we have approached OMII. See below…

When we first started OSCAR we realised that we needed a name2structure parser if we were going to understand the chemistry. It’s valuable to know that dihydroxymanxane is a chemical, but even better if we know it is 1,5-dihydroxybicyclo[3.3.3]undecane because chemists can interpret that. So I started  by writing a separate tool to interpret chemical names (there weren’t and there aren’t now any other Open Source programs to do this). Joe Townsend took over and researched the literature for parsing methods, and handed this over to PeterC at the start of Sciborg. Peter made useful enhancements to this and included it as a subcomponent OPSIN. Peter deliberately did enough work to interpret common chemical names and included it in the OSCAR processing chain.

I want to be very clear. OPSIN has never been promoted as a tool to compete with commercial name2structure tools (there are 3-4). It was an adjunct in the Sciborg program. If PeterC or I had spent more time increasing its power it would have been at the expense of what the grant was for. It met its given purpose well - to highlight the value of automatic translation and markup of names, and led - in part - to the RSC’s development of Project Prospect where chemical concepts in publications are semantically marked. From time to time we see anecdotal reports that OPSIN is not up to the standard of commercial tools and that is used as an argument for poor quality in Open Source projects and - sometimes - the relative inability of academics to do things properly. That’s unfair, but we have to bite our lips.

That’s now massively changing and I believe that in a few months’ time OSCAR and OPSIN will be seen as a community standard in chemical language processing and chemical entity interpretation. Being Open Source, that will lead to increased community effort which has the power to leapfrog some of the commercial offerings. More in the next blog post.

Trust in scientific publishing

Saturday, May 9th, 2009
[Please excuse formatting - reinstalling ICE soon]

Two stories have coincided – both relate to the role of trust in scientific publishing.


The first is when I was rung by Emma Marris, reporting for Nature, last week and asked what I thought of the financial problems in the American Chemical Society. I said that I wasn’t really the right person to ask as I had no previous or specialist area, but that it was essential that Scientific Societies were a key part of the scientific community. She’s included a quote in this week’s Nature:


http://www.nature.com/news/2009/090506/full/459017a.html


American Chemical Society makes cutbacks to fight financial losses.

Emma Marris

The American Chemical Society (ACS), the world’s biggest scientific society, is feeling the effects of the global economic downturn.


On 28 April, six months after tightening its belt a first notch, the society laid off 56 people, 3% of its employees…. [the rest is Pay-to-read]


I can’t reproduce the article (copyright) but here’s my bit …


Even vocal critics of the society’s opposition to open-access publishing aren’t delighting in its financial woes. Peter Murray Rust of the University of Cambridge, UK, whose blog covers open-access chemical information, says that he wishes the society well. “I have not been a supporter of many of [its] policies,” he says, “but I would say that we absolutely need national scientific societies.”


As Emma says I have been critical of some aspects of the ACS’s public policy, most notably its proactive role in PRISM – a coalition of (a few) leading publishers to discredit Open Access. From Peter Suber’s blog (2007):

[3]   July 2006 - As Nature later reports, Several publishing executives with ACS, Wiley and Elsevier meet with PR operative, Eric Dezenhall, to discuss a plan to defeat open access.  Dezenhall advises the executives to equate Open Access with a reduction in peer review quality.


This and similar actions have led people to question the scientific integrity of the participants.

In the C21 one of the critical commodities is trust. A typical (and misguided) mantra is: “You can’t trust anything in Wikipedia”. So who can, by their nature, be trusted in the scientific arenas? I’ll try the following list and am happy for comments:

  • learned societies (and international scientific unions)

  • universities, national laboratories and government agencies

  • libraries

  • funding bodies including (most) charities

  • (some) regulatory bodies if business is conducted publicly


Scientific societies have a critical role and that’s why I wish to see a healthy and growing involvement of scientific societies in establishing trust. Trust cannot be mandated, it has to be earned. It is hard won and easily lost. In C21 Openness and democratisation are major tools in speeding up the growth of trust.


I’ve excluded the commercial publishers. There are worthy ones but there are also ones driven at least partly by the search for revenue at the cost of trust. The following story (http://www.earlham.edu/~peters/fos/2009/05/elsevier-and-merck-published-fake.htmll) broke recently about Elsevier’s publication – for money paid by Merck – of a fake journal. The “journal” was made to look like a typical medical peer-reviewed journal:


Merck paid an undisclosed sum to Elsevier to produce several volumes of a publication that had the look of a peer-reviewed medical journal, but contained only reprinted or summarized articles–most of which presented data favorable to Merck products–that appeared to act solely as marketing tools with no disclosure of company sponsorship. …


The Australasian Journal of Bone and Joint Medicine, which was published by Exerpta Medica, a division of scientific publishing juggernaut Elsevier, is not indexed in the MEDLINE database, and has no website (not even a defunct one). …


This might well have gone unnoticed in a pre-digital age and it’s clear that the blogosphere is a major tool in detecting unacceptable publication. So – as many have noted – here is a commercial company which has campaigned to rubbish Open Access as “junk science” behaving in a manner which totally destroys any trust in their ethics and practice. I have no option but to say that I now cannot absolutely trust the ethical integrity of every piece of information in Elsevier journals.


The need for Open, trusted, scientific data and discourse is now clear. The scientific societies are well placed to help us make the change from closed paper to open trusted semantic digital. They clearly need a business model that transforms the new qualities into a revenue stream. This will not be easy but it has to be tried – there is no alternative. Some of the modern tools will help – the ability to mashup, aggregate, etc. will lead to new forms of high-quality information that will have monetary value. Certified validated information will lead to productivity gains and may be a valuable commodity.


So this should be a time for scientific societies to look positively to the future rather than fearfully at the receding past.

British Library document on copyright

Saturday, May 9th, 2009
From Ben White of the BL (who sought views from me and others to go into the document). There is a lot positive in this and I really hope the Government takes the recommendations seriously in revising the law. [BTW the format of the document itself is strange and rather difficult to read on screen - it looks more like a poster].

Please find attached the British Library’s latest paper on Copyright and Research. http://www.bl.uk/ip/pdf/copyrightresearchreport.pdf

We had an event (see podcast if you have the time at www.bl.uk/ip) this Tuesday to discuss copyright and research - those on the panel included Lynne Brindley, CEO of the British Library, as well as IP and Higher Education Minister David Lammy, Torin Douglas BBC etc. Lots of great people in the audience too of course!

Please spread the word regarding the paper!

Sincerely yours

Ben White

Here is an excerpt:
In a supreme irony, the ease of access enabled by the digital age actually leads to greater access restrictions:

1. Researchers increasingly find a black hole when researching 21st Century material– ironically the material of previous centuries has become easier to access than the websites, word documents and blogs of today because clearing rights to give access to modern day material can be lengthy and expensive.
 Currently Google blocks post 1868 material on their Google Books site from users in the European Union because of the longer duration of copyright in the EU. This means that European researchers wanting to read material up to 1923 have to travel to the United States to view material that is freely available there on the web but not in Europe. Much of this material was of course produced by Europeans…
 Some historical publishers have had to abandon post war social history projects as the rights issues are too complex.

2. Researchers of the future find a black hole when researching late 20th Century history as much of our digital history has decayed and become digitally corrupted.
 Parts of the British Library’s archive of celebrated photographer, Fay Godwin, may no longer be accessible to researchers when Microsoft and Adobe no longer support Windows XP/ Vista and Photoshop (CS3) servers, as the servers are essential for viewing some of her digital photographic collection. Restrictions in copyright law mean that the British Library can do little practically to prevent this.

3. Computer based research techniques become restricted by copyright and contract law. Computer technology has already significantly changed the way in which scientific research is conducted. Scientists increasingly do not read books or journals, but by writing computer programmes search, analyse and extract data from written sources in a technique known as ‘data mining’ or ‘text mining’. Science is propelled forward by access and collaborative reuse of scientific information. It is important that computer based research techniques are allowed for by future copyright law, in the way that in the analogue world we have protected research activity through ‘fair dealing’.
 Medical researchers write their own computer programmes to search across thousands of digitised articles in their libraries to extract important medical data, such as the relationship between a certain enzyme and the spread of cancer. Despite this, the researcher is not able to share the results of their findings with other scientists as this will contravene the terms of their licence with the database provider, and the relationship between the provider and the university.
It is heartening to see such a positive view being promoted at a national level. Perhaps this is something that individual libraries can help to support and propagate. Hopefully it can give encouragement to those who wish to challenge the unacceptable status quo.

BioIT - Chem4Word

Monday, April 27th, 2009

I’m in Boston for the Bio-IT World Conference & Expo 2009 for two main reasons: an invited talk, “The Chemical Semantic Web” (Computational Chemistry track), and our first public demonstration of the Chem4Word software (research.microsoft.com/en-us/projects/chem4word/). For those who are at the meeting, the talk is on Wednesday morning and the demo on Tuesday lunchtime.

The C4W demo has been worked on very hard for the last month. There was a dress rehearsal in Redmond at the Microsoft External Research meeting which was ready about 5 minutes before the presentation. We took the decision to freeze that functionality and to show it in Boston after the bugs had been ironed out. The discipline of having a fixed deadline (an international meeting) is an excellent way of concentrating minds within a project. Rudy Potenzone is demo-ing the software but I’ve got the demo on my machine as well.

What does Chem4Word do? It’s more important to say what it is.

At one level it’s an add-in that chemists can use to author documents. At the other end it’s a toolkit which can be used to develop the next generation of bench-top chemical software. I owe Rudy some introductory material, so I might as well use this blog to do it.

Chem4Word is an Open Platform for collaborative chemical software development in a .NET environment.

C4W will be transferred to CodePlex (the MS Open software site) and will be available for anyone to help develop, much in the spirit of the Blue Obelisk. Learning from other Open Source chemistry projects, we have thought carefully about sustainability of management.

Chem4Word is an Add-In to Word2007 that creates a semantic authoring tool for chemistry.

Word2007 is a platform that supports semantic authoring. Its use of smartTags allows words and phrases to be linked to a range of document components, including a Gallery and a Navigator.

Chem4Word uses (chemical) Ontologies.

With the new Microsoft Research Ontology Add-In and external ontologies (we use Nico Adams’ ChemAxiom), document components can be managed by a formal ontology. At one level this is a chemical spell-checker, at another a thesaurus, at another a converter between scientific units and at yet another a transformation tool between scientific concepts.

Chem4Word emphasizes semantics by using CML as its exposed data model

Current chemical toolkits require a fixed data model for objects. C4W communicates with CML (and other XML) as its data model. This gives a declarative programming model where there are no side effects. Effectively this is a new programming language for chemistry, both formal and flexible.

Chem4Word is modular

The graphics and UI are decoupled from the chemical engine. This means that commands can be issued to the engine from sources other than the UI. The document is also modular – it’s possible to examine the chemistry, the links, the tags all as XML and to build document processors independent of Word.

Chem4Word supports validation

All CML has to conform to a schema (CML-Lite) and can be validated at every stage. The import pipeline takes 4-5 stages with validation and normalization. It is impossible to import or author an invalid file. This is intended as an important contribution to bringing needed quality into chemistry.
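
To make the idea of validate-at-every-stage concrete, here is a minimal sketch in Python. It is not the CML-Lite schema and not Chem4Word code: the function name validate_molecule and the three rules it checks (atoms have unique ids, atoms have an elementType, bonds refer to exactly two known atoms) are assumptions chosen for illustration, and the namespace URI is assumed to be the usual CML one.

```python
import xml.etree.ElementTree as ET

# Assumed CML namespace; the rules below are illustrative, not CML-Lite itself.
CML_NS = "http://www.xml-cml.org/schema"


def validate_molecule(cml_text):
    """Return a list of problems found; an empty list means the checks passed."""
    try:
        root = ET.fromstring(cml_text)
    except ET.ParseError as exc:
        # Stage 1: the input must at least be well-formed XML.
        return [f"not well-formed XML: {exc}"]

    problems = []
    ns = {"cml": CML_NS}
    if root.tag != f"{{{CML_NS}}}molecule":
        problems.append("root element is not cml:molecule")

    # Stage 2: every atom needs a unique id and an elementType.
    atom_ids = set()
    for atom in root.findall(".//cml:atom", ns):
        atom_id = atom.get("id")
        if not atom_id:
            problems.append("atom without id")
        elif atom_id in atom_ids:
            problems.append(f"duplicate atom id {atom_id}")
        else:
            atom_ids.add(atom_id)
        if not atom.get("elementType"):
            problems.append(f"atom {atom_id} has no elementType")

    # Stage 3: every bond must point at exactly two atoms that exist.
    for bond in root.findall(".//cml:bond", ns):
        refs = (bond.get("atomRefs2") or "").split()
        if len(refs) != 2:
            problems.append("bond does not reference exactly two atoms")
        for ref in refs:
            if ref not in atom_ids:
                problems.append(f"bond references unknown atom {ref}")
    return problems


VALID = """<molecule xmlns="http://www.xml-cml.org/schema" id="m1">
  <atomArray>
    <atom id="a1" elementType="C"/>
    <atom id="a2" elementType="O"/>
  </atomArray>
  <bondArray>
    <bond atomRefs2="a1 a2" order="1"/>
  </bondArray>
</molecule>"""

if __name__ == "__main__":
    print(validate_molecule(VALID))
    print(validate_molecule(VALID.replace("a1 a2", "a1 a9")))
```

An importer built this way would refuse any file for which the returned list is non-empty, which is the sense in which an invalid file cannot get in.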

Chem4Word integrates Text and chemistry and styles

The Word document introduces ChemistryZones, which are chunks of the document representing chemistry. These are all backed by a CML object which itself can have many components, currently:

  • single molecule

  • compound molecule (salts, hydrates, complexes)

  • formula

  • name

Each of these can be displayed in a chemistry zone, making it possible to change the representation of an object, while preserving the semantics. The Navigator allows the user to select a given zone or to navigate from it.

Current functionality

The current project had to balance functionality, semantics and aesthetics and has put most emphasis on semantics. The primary functionality is currently:

  • manage gallery, navigator and other Word concepts

  • create chemistry zones

  • import CML molecules

  • validate them

  • render them, with different styles in different zones

  • tweak them (move atoms to prettify the molecule)

  • change atoms

We have deliberately not (yet) introduced chemical editing tools as we wish to get the UI framework correct and validate the semantics. With the large number of molecules now available (e.g. in Pubchem) we can convert these to valid CML outside C4W and import them. This means that unless chemists are working with new molecules C4W will already support many of their authoring needs.

The future

The current project runs for another few months, at the end of which we’ll have a release version. (We shall make the current version available to a few pre-alpha collaborators). A major emphasis is to create a distribution which is well designed for development, even if that means limiting the initial functionality. We’ll work hard on developing use cases where C4W is useful, especially in the creation of compound documents.

We’ll tell you where this is going after that.

This blog authored with ICE + Open Office; thanks to Peter Sefton and USQ

(Note: Just when I thought I had the ICE plugin working, it now fails to post. I think this may be due to firewalls or something else, but I can’t grab the error message as it disappears. So I have to cut and paste. I think that’s why the fonts go wonky)

Three days to save the European Internet

Sunday, April 26th, 2009

Two days ago I had no idea the European Internet was under severe threat, and I’m a European. Part of the problem is that Europe is incredibly complicated and the governance is baroque and bizarre. It uses terms like “acquis communautaire”. Admittedly I suffer from Anglophone blindness, but in any language the complexity of terminology and governance is horrendous.

The normal thing most Brits do is ignore it. I have a cosy feeling that continentals are more educated but that’s probably false. So we have a governance process that’s out of control. They pay themselves huge allowances, are regularly corrupt, but as a war baby I reckon that’s a small price to pay for not carpet-bombing civilians. Yes, the UK tabloids regularly bash the Common Agricultural Policy, etc. but…

I was shocked out of my complacency when the issue of Software Patents in Europe arose. I went to UCL (London) to hear Richard Stallman talk on this and was embarrassed to find an American who knew how European government worked. He knew where the power lay, the Council of Ministers (who are unelected), etc. and he gave us clear instructions as to how best to mobilise.

Now we are at it again. Although I’m an educated citizen of Europe I don’t know how to promote my views best. But one of the great powers of the Web is that it promotes e-democracy. Not only can anyone say what they want but groups can use crowdsourcing to assemble arguments and advocacy. So I know that I can read up rapidly on the issues and know what the best use of my very limited efforts is. (Here I think it’s mainly raising the issues on this blog and writing as an individual to my MEP).

I’ve found Twitter very useful here. Two or three followers have, in the rather cryptic style of Twitter, pointed out that there are two issues:

  • Net neutrality

  • 3-strikes

Both are evil but the wisdom seems to be that net (non)neutrality is even more evil. What’s NN? Here’s a helpful site (http://www.savetheinternet.com/=faq). Essentially Net Neutrality is about the infrastructure of the net as provided by companies such as the telcos, which by default do not have our interests at heart.

From the site:

What is Network Neutrality?

Network Neutrality — or “Net Neutrality” for short — is the guiding principle that preserves the free and open Internet.

Put simply, Net Neutrality means no discrimination. Net Neutrality prevents Internet providers from blocking, speeding up or slowing down Web content based on its source, ownership or destination.

Net Neutrality is the reason why the Internet has driven economic innovation, democratic participation, and free speech online. It protects the consumer’s right to use any equipment, content, application or service on a non-discriminatory basis without interference from the network provider. With Net Neutrality, the network’s only job is to move data — not choose which data to privilege with higher quality service.

Who wants to get rid of Net Neutrality?

The nation’s largest telephone and cable companies — including AT&T, Verizon, Comcast and Time Warner — want to be Internet gatekeepers, deciding which Web sites go fast or slow and which won’t load at all.

They want to tax content providers to guarantee speedy delivery of their data. They want to discriminate in favor of their own search engines, Internet phone services, and streaming video — while slowing down or blocking their competitors.

These companies have a new vision for the Internet. Instead of an even playing field, they want to reserve express lanes for their own content and services — or those from big corporations that can afford the steep tolls — and leave the rest of us on a winding dirt road.

The big phone and cable companies are spending hundreds of millions of dollars lobbying Congress and the Federal Communications Commission to gut Net Neutrality, putting the future of the Internet at risk.

Isn’t the threat to Net Neutrality just hypothetical?

No. By far the most significant evidence regarding the network owners’ plans to discriminate is their stated intent to do so.

The CEOs of all the largest telecom companies have made clear their intent to build a tiered Internet with faster service for the select few companies willing or able to pay the exorbitant tolls. Network Neutrality advocates are not imagining a doomsday scenario. We are taking the telecom execs at their word.

And you should read more.

Here’s an analogy. I shall start my journey to BioIT on two trains, East Coast Capital Connect (used to be British Rail) and Transport for London (the tube). Each makes up its own rules as to what services operate and what the fare structure is. For example if I want to travel from Cambridge to London they decide that I cannot have a cheap fare at certain times even though I have a concession. So as a class of citizen I am discriminated against in favour of corporate passengers (customers). That’s Train non-neutrality.

If I travel at the wrong time I incur a penalty. Let’s call that a strike. And let’s assume that a company decides that a recidivist breaker of this rule gets banned from travelling. That’s a per person decision, and somewhat analogous to the three strikes rule. There may be good reasons for wanting to ban individuals: repeated disorderly behaviour, for example. I don’t know, but I expect there are people banned from rail travel.

So in writing to my MEP I referred him to a summary of the issues, which is better than trying to explain them myself when I don’t know what’s being voted on when and by whom.

I hope he knows.

Open Chemistry Data at NIST

Friday, April 24th, 2009

I had a wonderful mail this morning from Steve Heller …

Peter

I am helping the NIST folks get additional GC/MS EI (electron impact only) mass spectral for their WebBook and mass spec database. http://webbook.nist.gov/chemistry/ and http://www.nist.gov/srd/nist1a.htm The question I have for you is would you be willing to post something on your blog suggesting it would be useful for people to donate their EI MS to the NIST folks. The WebBook is Open Data which is where the spectra would go first/initially. In addition, the spectra would also go into the NIST mass spec database to add to the existing database they provide. NIST is in the process of setting up an arrangement with the Open Access Chemistry Central folks to do this and I wanted to see if you also would be willing to cooperate/collaborate as well. Cheers Steve

PMR: Many of us have known the NIST webbook for many years. It was the first, and for some time the only, openly accessible chemistry resource on the web (outside bio-stuff like PDB). NIST are a US government agency whose role is – in large part – to produce standards (data, specs) for resources in science and engineering. Part of this role is to support US commerce through these activities.

The webbook has many thousands of entries for compounds. Even if you aren’t a chemist, have a look as it’s an ideal exemplar of how data should be organised. The impressive thing is that it has complete references for all data and also concentrates on error estimation. In many ways it is the gold standard of chemical data. (I agree that things like Landolt-Börnstein are very important but in the modern web-world monographs costing thousands of dollars are increasingly dated). And it was Steve and colleagues (especially Steve Stein) who got the InChI process started – because they had so much experience in managing data publicly it made sense to promote the InChI identifier for compounds.

(In passing, NIST has also made an important contribution to our understanding of the universe by measuring the fundamental constants to incredible accuracy).

So is NIST in CKAN – the Open Knowledge Foundation’s growing list of packages of Open Data? YES (from http://www.ckan.net/package/read/nist)

Metadata:

Notes:

About

The NIST Data Gateway provides easy access to NIST scientific and technical data. These data cover a broad range of substances and properties from many different scientific disciplines.

Openness

Much of the material appears to be in the public domain as it is produced by the US Federal Government, but it varies from dataset to dataset.

Note that there is some fuzziness about what is meant by openness here – the NIST pages carry “all rights reserved” and “the right to charge in future”. But Steve’s motivation is clear here and it’s part of the role of OKFN/CKAN to help determine what the rights are.

I’m also interested in the reference to Open Access Chemistry Central. This raises the very important question of where Open Data should be located. The bioscience community has shown that a mixture of (inter)governmental organizations can work extremely well but this is less clear in chemistry at present. We are in an exploration phase, with a number of initiatives trying out models such as Pubchem (gov), Chemspider (independent/commercial), Crystaleye (academic), NIST (gov), Wikipedia Chemistry (independent), NMRShiftDB (academia), Chemistry Central (commercial/publisher) etc. I am sure there will be a need for multiple outlets – the variation in the sites above is too great for any single organization.

What is important is that this is Linked Open Data because then it does not matter who exposes it. LOD has a number of requirements including

  • Open Data (not just accessible)

  • Semantic infrastructure (e.g. XML/RDF)

  • Identifier systems

  • Appropriate metadata and/or Ontologies
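
As a rough illustration of how those four requirements might look together, here is a small Python sketch that writes a compound description as N-Triples. The example.org URIs, the “inchi” property and the function names are made up for the sketch; only the RDFS label and Dublin Core licence properties are existing vocabulary terms, and the CC0 licence URI is used purely as an example of an explicit Open statement.

```python
# Existing vocabulary terms (RDF Schema and Dublin Core Terms).
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"
DCT_LICENSE = "http://purl.org/dc/terms/license"
# Hypothetical namespace for properties that have no agreed URI in this sketch.
EX = "http://example.org/chem#"


def ntriple(subject, predicate, obj, literal=False):
    """Format one N-Triples statement; obj is a URI unless literal is True."""
    if literal:
        escaped = obj.replace("\\", "\\\\").replace('"', '\\"')
        obj_term = f'"{escaped}"'
    else:
        obj_term = f"<{obj}>"
    return f"<{subject}> <{predicate}> {obj_term} ."


def describe_compound(base, local_id, inchi, label, licence_uri):
    """Return triples giving a compound a URI, a name, an InChI and a licence."""
    subject = base + local_id                                   # identifier system
    return [
        ntriple(subject, RDFS_LABEL, label, literal=True),      # metadata
        ntriple(subject, EX + "inchi", inchi, literal=True),    # chemical identifier
        ntriple(subject, DCT_LICENSE, licence_uri),             # explicit Open Data
    ]


if __name__ == "__main__":
    for line in describe_compound(
        "http://example.org/compound/",
        "methane",
        "InChI=1S/CH4/h1H4",
        "methane",
        "http://creativecommons.org/publicdomain/zero/1.0/",
    ):
        print(line)
```

Because every statement carries its own identifiers and licence, it does not matter which site publishes it: anyone can merge triples from several sources by matching on the InChI.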

I’ll be talking about this at BioIT next week in Boston (where I shall meet up with Steve). I’ll be blogging more over the next two days.

In Cambridge we have just been funded by JISC to enhance our repository of chemistry data, which will include Mass Spec. I don’t know how much is EI, but our mission is to make the data Open and where this happens then we will certainly send it off to Steve. There’s a certain amount of technology needed but between us I think we could get an excellent public prototype.

More – much more – soon.

This blogpost was prepared with ICE+OpenOffice.

ICE-cold in Toowoomba

Monday, April 20th, 2009
I am here for all too short a time working with Peter Sefton and colleagues on a number of collaborations on authoring and publishing tools. Peter runs the Australian Digital Futures Institute at the University of Southern Queensland in Toowoomba - a lovely place in the mountains west of Brisbane.

We have a joint project funded by JISC - ICE-Theorem - and I’ll blog later when we have had the demo. This is a great arrangement because we have been able to contract much of the work to Peter’s group. Having now met the current group (and it’s grown since I was last here) I can say that it has a critical mass of committed developers which is very hard to put together in most academic institutions, especially those which depend on “research” output rather than technology. We’ve built up a strong mutual understanding over the last 3 or so years.

We have our differences of approach, but wherever possible we are looking for these to complement each other. Good academic web tools will depend on a mixture of diversity and synergy. That means trying out new ideas but not getting locked into one’s own approach because you want glory or money (the chances are relatively small).

What often happens in the academic content/publishing world is that technology “empires” spring up - managing repositories, courseware, etc. They often mutate into political organisations with large consortia, where the pace is governed not by technology but by the need to satisfy everybody’s interests. At the other end of the spectrum are the geeks - in the best sense - who want to build systems in days.

They often do. And Toowoomba is one of the places where it happens.

Peter has been showing me the Fascinator - it’s a lightweight desktop repository based on Fedora (but that’s exchangeable). We have an apparently similar approach in Jim Downing’s Lensfield. However we are looking to see how these two complement each other - Peter is document-centered, we are data-centered and there is enough difference that it makes sense to go forward on both fronts.

But I have to rush …

Conspiracy and chemistry and an invitation to lunch

Friday, April 10th, 2009
Antony Williams (Chemspider) and Stuart Cantrill (Nature) have recently blogged about what the blogosphere is seeing as censorship on the Web by the American Chemical Society. This is a bold and serious claim and needs some background. The facts,  from Stuart:
One story that caught our eye was from Outsell, which, in their words, ‘is the only research and advisory firm focused on the publishing, information, and education industries’. The article was entitled ‘Chemical Bonding InChI by InChI’ and it offered an analysis of how certain publishers were making use of InChIs – those of you unfamiliar with InChIs can go here for a primer.

Daniel Pollock at Outsell had published an article on March 30th 2009 entitled “Chemical Bonding InChI by InChI”. He discussed the InChI Resolver and the efforts to raise enthusiasm for the InChI. He also discussed the efforts of both Nature Publishing Group and the Royal Society of Chemistry to proliferate the use of InChIs. [...]

The article then moves on to consider whether CAS (Chemical Abstracts Service), which is owned by the ACS, will also embrace InChIs. The conclusion was that we may have to wait a while for that to happen.

So why do you need to know this? Well, the story from Outsell has been withdrawn (on April 8th) – and more than that, in fact, it has been removed from their archives (although the original story is cached on Google and you can find it here).

Whether it is right to completely remove every trace of a story that you withdraw is a discussion for another day – but now all that remains is a brief notice indicating that the original story did not hold up to Outsell’s internal standards.

Outsell now say that the original article wasn’t balanced and that the ‘tone of the piece could be taken to single out CAS as being late in responding to the trends’. Surely readers could make that judgement for themselves?

The great shame is that the whole article has simply been removed and an analysis of how cross-publisher development on an important topic such as the InChI – which may have a significant role to play in chemistry publishing – has been lost.
Antony uses stronger language and speculation  (Conspiracy Theories and InChIs - Why was the Article Removed? - there’s much more that is worth reading) :
Conspiracy theories are already moving around the community. The majority of people I have discussed this with believe that the retraction was likely forced by CAS
PMR: In short the best guess is that CAS see InChIs as a threat (I’ll discuss the foolishness of this below) and that they put pressure on Outsell to retract. I don’t know under what auspices Daniel writes for Outsell - employee, invited contributor - etc. but Outsell have the right to moderate what is published on their site. They may feel that Daniel’s article detracted from their brand; I take the opposite view - that the article was well written and that the retraction has done Outsell damage. (Contrast a forward-looking company like Talis whose Panlibus blog written by employees is a major enhancement). It emphasizes the problem of employees publishing under their company name, and I have empathy for Daniel.

The retraction seems to be typical of the knee-jerk reaction that CAS applies to anything that could conceivably be seen as challenging their monopoly. For example last year Wikipedia volunteers started checking CAS numbers for correctness and the first reaction was to tell them that they were in breach of contract. Not “we are glad to see quality applied to chemistry”; “glad to see responsible use of unique identifiers”. After the natural blogosphere outrage (including this blog) CAS relented. I doubt they will relent this time.

It’s difficult to know what the reality is - but there are too many stories about clandestine and lobbying practices at ACS to ignore them. PRISM, the ACS mole, the constant lobbying, etc. ACS frequently resort to legal action (e.g. against Google) and I suspect there was a phone call from lawyers. We’ll probably never know.

Does this protect CAS’s monopoly? No, quite the reverse. It makes them look foolish, out of touch, and ultra-monopolistic. They have a huge turnover, and a monopoly of complete chemical information so they are immune from competition, right? Wrong.

Here is the UK Guardian newspaper recounting the demise of Encyclopedia Britannica (which I estimate has similar turnover to CAS):
By 1990 sales revenues had reached $650m. Yet within five years, EB underwent a near-death experience. What almost killed it was a product that most of its executives regarded as a joke, an encyclopedia on CD-Rom launched by Microsoft and called Encarta. The original content was licensed from an outfit with the Dickensian name of Funk and Wagnalls, and some of it gave trash a bad name. So Microsoft spruced it up, added multimedia content and made it easy to use. To the astonishment of EB’s board, this meretricious object triggered a precipitous decline in sales of their gold-standard product.

Faced with catastrophe, the Benton foundation put EB up for sale. It took 18 months to find a buyer, a Swiss billionaire named Jacob Safra who bought the company for half its book value. The story of Britannica is now a business-school case study in how rapidly competitors can emerge - apparently from nowhere - in a digital world. The First Rule of Business nowadays is that somewhere out there someone (and not just Google) is incubating a business plan that is based on eating your lunch
So where are the lunch-eaters coming from? Surely we cannot recreate a database of 30 million compounds? No, we can’t - we can create something much better - Linked Open Chemistry. It won’t come from a single source but from all the Open chemistry efforts that have grown over the last 1-3 years. They include Chemspider, Pubchem, ChEBI, Wikipedia, The Blue Obelisk, CrystalEye, Open Notebook Science and a number of others. None have found the ACS as a body receptive to the new wave of chemistry. They are bringing to the lunch table:
  • Openness
  • Re-use and sharing
  • Immediacy
  • Innovation
  • Linkedness
  • Semantics
  • Quality Control
  • Community
Not all of these are fully developed but they are part of the Linked Data of the future and they can grow quickly. CAS’s actions and perceived stance are uniting them in a common effort to make chemical data free. Antony and I are meeting at the end of this month and we’ll be seeing how our offerings fit together. Yes, we’ve had differences, but these have helped to re-orient both of us and we now have at least a common goal of liberating chemistry.

There are some simple approaches which can revolutionize the way chemistry is captured and aggregated. Our own approach, semantic publishing (e.g. Chem4Word), means that the tacky business of text- and image-mining could disappear. Yes, it needs a culture change in chemistry, but that is looking likelier all the time. Meanwhile, high prices and restrictive practices will only serve to drive more people (including those outside chemistry) to create Linked Open Chemical Data.

After all it’s OUR DATA.