Egon Willighagen blogged this. There is now a real opportunity for the Open Source chemistry community to create high-quality tools for the extraction of molecular information from legacy documents. Besides full-text articles other good areas to look are probably theses and supplemental data.
Before I copy the post, I’ll review the methods available (to the Open Source community):
explicit connection table. This is the best, but rare. It might occur in theses, but is uncommon. (Some word documents include binary CDX and/or MDL files but this is an awful hack. I’ve done it and don’t recommend it)
Implicit connection table. PLEASE USE InChI! In the absence of this there might be a SMILES
crystal structure. This is very good and uses CIF2CML. see CrystalEye (http://wwmm.ch.cam.ac.uk). Crystal structure coordinates are often reported in theses and supplemental data
output of computational chemistry programs. Again very good and uses CIF2CML code.
Chemical name. Parsable by OPSIN (part of the OSCAR3 package). Probably succeeds on between 25% and 70% of names, depending on the domain. Will be improved by lots of little incremental bits (see below).
Spectral data. Very variable and usually incomplete. Works for small molecules. Use SENECA or lookup against shifts in NMRShiftDB. Very useful to check structures created by other methods
Chemical structure diagram. This is what is discussed below. Remember that although it’s easy for a human to understand a picture it can be very difficult for a machine. We can divide it into three parts: (a) turn a bitmap into a series of graphics primitives (lines, text); (b) turn the graphics primitives into chemical primitives (bonds, atoms, labels); (c) interpret the chemical primitives as chemistry. The first can be very hard, especially for fuzzy diagrams. The second is much easier, especially when the first has worked well. It is well suited to cases where the input is PDF, which although disgusting and horrendous can reveal the graphics primitives. I have done this for several instances of supplemental data and the results are variable. With an increasing number of diagrams munged into PDF the vectors are often captured well. The third depends on the chemical semantics. Much of it involves recognising conventions (e.g. what does “OBz” mean?). I’m hopeful.
In names, spectra and diagrams alike there are a lot of heuristics and this is where everyone can help. There are probably a few hundred abbreviations, groups, etc. in common use, and covering them would be enough to give us a high degree of success. If we all add a few of these we can make rapid progress. You don’t have to be a programmer to do it.
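As a rough illustration of how such contributions could be collected, here is a minimal sketch of an abbreviation table kept as plain data. The table name, the function name and the handful of entries are assumptions made for this example; they are not taken from OSCAR3, OPSIN or OSRA, and the SMILES fragments are written as they would attach through their first atom.

```python
# Hypothetical abbreviation table: label as drawn -> SMILES fragment.
# Entries are illustrative only; a real table would hold a few hundred.
ABBREVIATIONS = {
    "Me": "C",                # methyl
    "Et": "CC",               # ethyl
    "Ph": "c1ccccc1",         # phenyl
    "OMe": "OC",              # methoxy
    "OAc": "OC(C)=O",         # acetoxy
    "OBz": "OC(=O)c1ccccc1",  # benzoyloxy
}


def expand_label(label):
    """Return the SMILES fragment for a diagram label, or None if unknown."""
    return ABBREVIATIONS.get(label.strip())


print(expand_label("OBz"))
print(expand_label("XYZ"))
```

Because the table is only data, a chemist can add a line to it without touching the code that uses it.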
Also, as Egon says, the combination of the methods will help a lot. What’s “THF”? It could be tetrahydrofuran or tetrahydrofolate. If you know the formula is C4H8O you know it’s the first. If you know it’s got two fused six-rings in, even if you can’t work out the atoms, it’s clearly not the first. And so on.
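The THF case can be sketched as a simple lookup that filters the candidate meanings of an abbreviation by a second piece of evidence, here the molecular formula. The dictionary and function names are hypothetical, and the two formulae (C4H8O for tetrahydrofuran, C19H23N7O6 for tetrahydrofolic acid) are the only data in it.

```python
# Hypothetical candidate table: abbreviation -> list of (name, molecular formula).
CANDIDATES = {
    "THF": [
        ("tetrahydrofuran", "C4H8O"),
        ("tetrahydrofolate", "C19H23N7O6"),
    ],
}


def disambiguate(abbreviation, formula):
    """Return the candidate names whose formula matches the one observed."""
    return [
        name
        for name, candidate_formula in CANDIDATES.get(abbreviation, [])
        if candidate_formula == formula
    ]


print(disambiguate("THF", "C4H8O"))
```

The same pattern would work with other evidence (ring counts from a diagram, shifts from a spectrum) in place of the formula.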
Enough from me:
We would like to announce a new addition to the set of chemoinformatics tools available from the Computer-Aided Drug Design Group at the NCI-Frederick. OSRA is a utility designed to convert graphical representations of chemical structures, such as they appear in journal articles, patent documents, textbooks, trade magazines etc., into SMILES. OSRA can read a document in any of the over 90 graphical formats parseable by ImageMagick (GIF, JPEG, PNG, TIFF, PDF, PS etc.) and generate the SMILES representation of the molecular structure images encountered within that document.
The email does not give any information on the failure rate, but the demo they provide via the web interface does show some minor glitches (the bromine is not recognized):
The source reuses OpenBabel and uses the GPL license. The value is equal to that of text mining tools like OSCAR3, and together they sound like the Jordan and Pippen of mining chemical literature.
I posted about it yesterday not knowing that you had already posted it. That’s funny! I found it in my del.icio.us network and you via CCL … so the social network seems to work
Joerg, I am officially on holiday, but reading my email… so, missed the del.icio.us trigger… Interesting that you mention the CCL mailing list as a social network… to me, social networks were more like being able to socialize with accounts outside my main areas of interest, which CCL would be…
I did some testing on this the day it was released, found a number of issues during the tests and blogged about it here: http://www.chemspider.com/blog/?p=83 However, as a first release it definitely has potential and I am looking forward to helping them.
… and, whether or not it’s usable directly in other code we should be able to abstract much of the functionality into code-independent data files
=========== Open letter to editors of Tetrahedron ==========
Professor L. Ghosez ,
Professor Lin Guo-Qiang ,
Professor T. Lectka ,
Professor S.F. Martin ,
Professor W.B. Motherwell ,
Professor R.J.K. Taylor ,
Professor K. Tomioka
Subj: Request for Open publication of crystallographic data in Tetrahedron
Dear editors,
I have recently been reviewing access to supplemental data in chemistry publications, in particular crystallographic data (“CIFs”). Many publishers (IUCr, RSC, ACS…) expose these on their websites as Open Data (for examples see: http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=455). The data are acknowledged not to be copyrightable (see http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=447) where your colleague Jennifer Jones (copied) has confirmed:
Dear Peter Murray-Rust
Thanks for your email. Data is not copyrighted. If you are reusing the entire presentation of the data, then you have to seek permission, otherwise, you can use the data without seeking our permission.
Yours sincerely
Jennifer Jones
Rights Assistant
Global Rights Department
Elsevier Ltd
PO Box 800
Oxford OX5 1GB
UK
Tel: + 44 (1) 865 843830
Fax: +44 (1) 865 853333
email: j.jones@elsevier.com
Other Elsevier journals such as those publishing thermochemistry (see last blog post) are now actively making the supplemental data Openly available on the journal website. I am therefore asking whether Tetrahedron (and perhaps other Elsevier chemistry journals) might consider publishing their data Openly in this way and would be grateful for your views.
(This is an Open letter (http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=456) and I would like to publish your reply so please mark any confidential material as such).
Thank you for considering this
Peter Murray-Rust
Unilever Centre for Molecular Sciences Informatics
University of Cambridge,
Lensfield Road, Cambridge CB2 1EW, UK
+44-1223-763069
=========== Open letter to editors of Tetrahedron ==========
I have been reviewing the availability of Open Data for cyberscience - concentrating recently on crystallography and chemical spectra as examples. I’ll propose a new business model here, still very ill-formed and I welcome comments. It applies particularly to disciplines where the data are collected in a fragmented manner rather than being coordinated as in, for example, survey of the earth or sky. I call this fragmentation “hypopublication”.
However the Internet has the power to pull together this fragmentation if the following conditions are met:
the data are fully Open and exposed. There must be no cost, no impediment to access, no registration (even if free), no forms to fill in.
the data must conform to a published standard and the software to manage that standard must be Openly available (almost necessarily Open Source). The metadata should be Open.
the exposing sites must be robot-friendly (and in return the robots should be courteous).
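The last condition, a robot-friendly site met by a courteous robot, can be sketched with the standard library's robots.txt parser. The robots.txt content, the user-agent string and the publisher.example URLs below are all made up for illustration; a real robot would fetch the site's own robots.txt and sleep for the requested delay between downloads.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt, standing in for what a publisher site might serve.
EXAMPLE_ROBOTS = """\
User-agent: *
Crawl-delay: 5
Disallow: /private/
"""

AGENT = "example-crystal-robot"  # hypothetical user-agent name


def make_policy(robots_txt):
    """Parse robots.txt text into a policy object the robot can consult."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


policy = make_policy(EXAMPLE_ROBOTS)
allowed = policy.can_fetch(AGENT, "http://publisher.example/cif/abc.cif")
blocked = policy.can_fetch(AGENT, "http://publisher.example/private/x.cif")
delay = policy.crawl_delay(AGENT)  # seconds the site asks robots to wait

print(allowed, blocked, delay)
```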
Such a state nearly exists in modern crystallography. The situation for macromolecules is that authors are required to deposit data in a central repository (http://www.rcsb.org). For small molecules there is less Open Data but a significant amount is available because of the work put in by:
the International Union of Crystallography (IUCr), which for at least 30 years has pioneered the development of data standards and ontologies emerging in its current Crystallographic Information File specification.
a number of publishers who have Openly exposed CIF data files on their websites for every article which contains relevant crystallography. They include the IUCr itself, the Royal Society of Chemistry, the American Chemical Society, the Chemical Society of Japan, and the American Mineralogist. (There may be others - if so I apologize and ask them to come forward). The licences are occasionally a bit fuzzy but the spirit and intention is clear. The data are there as a scientific record and to be re-used.
The Crystallography Open Database - a volunteer activity which has aggregated approximately 50 K CIFs from donations.
The Internet now means that the data can be reliably aggregated as in our Crystaleye knowledgebase. This also acts as an immediate alerting system - as soon as a new piece of interesting crystallography is published, subscribers to our RSS feeds are notified immediately.
The criticism is sometimes made that unless data is inspected by humans it cannot be certified as fit for purpose. This depends entirely on what the purpose is. It’s often better to have data of variable quality than no data at all. And it’s always better to have data of variable KNOWN quality rather than none, even if the quality is often known to be low. It’s a balance of precision and recall (Why 100% is never achievable). Joe Townsend here has shown in his PhD that if we lower the recall of crystallographic data (i.e. throw out everything that is known to have errors) we can get very high precision indeed without having to inspect the data.
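To make the trade-off concrete, here is a small worked example of precision and recall. The counts are invented purely for illustration (they are not Joe Townsend's results): a pool of 1000 structures of which 900 are sound, first accepted wholesale and then passed through a strict automatic filter.

```python
def precision_recall(true_pos, false_pos, false_neg):
    """Precision = fraction of accepted items that are good;
    recall = fraction of good items that were accepted."""
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    return precision, recall


# Invented numbers: 900 sound structures and 100 flawed ones.
# Accept everything: all 900 sound ones kept, all 100 flawed ones kept too.
accept_all = precision_recall(true_pos=900, false_pos=100, false_neg=0)

# Strict filter: keeps 600 sound structures and 1 flawed one, drops 300 sound ones.
strict = precision_recall(true_pos=600, false_pos=1, false_neg=300)

print("accept all: precision %.3f, recall %.3f" % accept_all)
print("strict:     precision %.3f, recall %.3f" % strict)
```

With these made-up counts the strict filter gives up a third of the good data but what remains is almost entirely sound, which is the shape of the argument above.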
Our remaining problem is that not all publishers expose the data Openly. The rest of this post explores why they should think of doing so.
Before the Internet it was necessary to have central repositories to put data in, but now with all publishers online the data can just as easily be posted on their sites. Even if there is no intrinsic search mechanism on the publisher sites, researchers like Nick Day (here) can create tools for managing the data and metadata in CrystalEye. So why don’t all publishers expose their crystallography? I think it’s just a matter of priorities and hope this post will advance the case.
Data costs money. True, but the amount is falling. I don’t know how much it costs the publishers above to manage the exposure of the crystallography files - and I’m not asking - but it’s obviously not prohibitive. They’ve done it (I assume) because they think it’s an important part of the publication process - allowing science to be verified, providing a record, allowing new research to build on old. So they have - presumably - included the cost within the general cost of publication (which is covered mainly by subscriptions but for some of the articles also paid-by-author/funder Open Access).
The main cost of the process - the creation of communal metadata - is already past. This is probably the largest barrier to any group trying to emulate the idea. But it’s also happening in thermochemistry (ThermoML) where a number of journals all require data to be published at source and made Openly available. Here’s a sample issue which lists the Open data:
==================================
ThermoML Data for The Journal of Chemical Thermodynamics, Vol. 39, No. 6 June 2007
Developed in cooperation between The Journal of Chemical Thermodynamics and the Thermodynamics Research Center (TRC)
The full Table of Contents for this issue is available from JCT. The numbers below correspond to the numbers in the full Table of Contents.
2.
Low pressure solubility and thermodynamics of solvation of oxygen, carbon dioxide, and carbon monoxide in fluorinated liquids
Pages 847-854
J. Deschamps, D.-H. Menz, A.A.H. Padua and M.F. Costa Gomes
ThermoML Data (To download: right-click on link and select “Save Link Target As” )
3.
High pressure phase behaviour of the binary mixture for the 2-hydroxyethyl methacrylate, 2-hydroxypropyl acrylate, and 2-hydroxypropyl methacrylate in supercritical carbon dioxide
Pages 855-861
Hun-Soo Byun and Min-Yong Choi
ThermoML Data (To download: right-click on link and select “Save Link Target As” )
===================================
You’ll see that the data are Open.
So couldn’t this be a model for all of science? As I have posted recently I’m going to write to the editors of Elsevier’s Tetrahedron suggesting that they make all their crystallographic data available Openly. They agree it’s not their copyright, so it’s just a question of how to do it - files on a website shouldn’t be a major expense.
And funders should encourage this. If you are urging authors and journals to publish Open full-text, please extend this to data. Yes, there are some technical difficulties in some cases such as metadata, complexity and size but they probably aren’t too scary. And in any case the community will help work out how to use them.
==== copy of letter to CCDC requesting clarification on copyright ====
To:data_request@ccdc.cam.ac.uk
Greetings
(Sorry to use a generic address but I am not sure who is the person to contact about permissions).
We have a systematic program of carrying out quantum mechanics calculations on organic crystal structures which uses the original CIFs as deposited by authors of peer-reviewed publications. In some cases the CIFs are openly accessible and openly re-usable from the publisher’s website (e.g. Acta Cryst., RSC). In other cases (e.g. Elsevier) the CIFs have been deposited at CCDC and are requestable without charge, and we have started to do this (see http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=452).
I would be grateful if you could clarify the copyright and re-use position of the retrieved files. Both your website and the actual file carry a notice which suggests that the files may be copyrighted and there are also apparently restrictions on re-use. To quote:
Conditions of Use of CIFs provided from the CCDC CIF archive Individual CIF data sets are provided freely by the CCDC on the understanding that they are used for bona fide research purposes only. They may contain copyright material of the CCDC or of third parties, and may not be copied or further disseminated in any form, whether machine-readable or not, except for the purpose of generating routine backup copies on your local computer system.
If you agree to the foregoing terms and conditions then please click on the “Accept” button below. If you do not accept the foregoing terms and conditions you should not click on the “Accept” button but should click on the “Do NOT accept” button below or the “back” button on your browser.
Elsevier (and other major publishers) have confirmed that these files are data and therefore not copyrightable.
You will appreciate that an adherence to formal wording of this licence could prevent proper scientific work being carried out. For example we routinely make all our raw data Openly available so that people can repeat our work (and have deposited 250,000 molecular structures and calculations in our Institutional Repository). Could you please confirm that the CIFs are not, in fact, copyrighted and that we have the right to re-use them in an Open manner and to redistribute them. We will provide complete provenance so that the authors’ identities (and where possible the article alongside which they were published) will be made clear.
Many thanks
Peter
NOTE: This letter is published to my blog: http://wwmm.ch.cam.ac.uk/blogs/murrayrust
(probably as http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=454) where we have been able to clarify licences on several publishers’ web sites. I would like to publish your reply in the same way (so please indicate if there is any material which should not be made public).
Peter Murray-Rust
Unilever Centre for Molecular Sciences Informatics
University of Cambridge,
Lensfield Road, Cambridge CB2 1EW, UK
+44-1223-763069
Tommy Thompson, Republican presidential candidate and former Secretary of the Department of Health and Human Services, has announced his science platform: double the budget of the NIH (to $58 billion/year), cure breast cancer in 10 years, and this:
Create an open source research community on the Internet where research can be organized and discussions can be conducted with experts. This online community will be a centralized repository for research where all of the world’s people can contribute their time, money or expertise toward helping with this global fight.
PMR: This seems amazing. I am an illiterate in US politics but it’s gratifying to know that Open Source Research has a high-enough profile that it is taken seriously at political level. It’s clear that we should be able to keep pushing politicians on this issue. It’s good for science, good for humanity. If we are going to save the planet we need Open Science as well as cycling to work. People and organizations who hide science and hide data through inertia or private gain are going to have an increasingly difficult position justifying their actions to the community.
We should remember that the US managed the moon-shot and also built the National Cancer Institute to cure cancer. The physical world seems to be easier than the biological world but I am certain of one thing: we shall need all the shared information we can get. Publishers, information companies, pharma - think about different ways of doing things. The answer may already be out there.
Regular readers will know of our Crystaleye repository where Nick Day’s robots have - quite legally - extracted ca 100,000 crystal structures from the Open AND closed literature. However it is not yet comprehensive as some publishers do not expose their data Openly but have a deposition arrangement with the Cambridge Crystallographic Data Centre (CCDC) [No formal relationship to us, although 100 metres away and built by the same architect]. So we are now looking to extract these structures into Crystaleye.
As I blogged recently (THANK YOU ELSEVIER!) Elsevier have no objection to the extraction of crystallographic data from their journals. The first journal I’m starting with is Tetrahedron [1]. Here I take you through the process from first the author’s point of view and then the reader’s/user’s. From the Guide for authors (and omitting large amounts of unexceptionable material) we have:
X-ray crystallographic data: Prior to submission of the manuscript, the author should deposit crystallographic data for organic and metalorganic structures with the Cambridge Crystallographic Data Centre. The data, without structure factors, should be sent by e-mail to deposit@ccdc.cam.ac.uk, as an ASCII file, preferably in CIF format. Hard copy data should be sent to CCDC, 12 Union Road, Cambridge CB2 1EZ. A checklist of data items for deposition can be obtained from the CCDC Home Page on the World Wide Web (http://www.ccdc.cam.ac.uk) or by e-mail to: fileserv@ccdc.cam.ac.uk, with the one-line message, send me checklist. The data will be acknowledged, within three working days, with one CCDC deposition number per structure deposited. These numbers should be included with the following standard text in the manuscript: Crystallographic data (excluding structure factors) for the structures in this paper have been deposited with the Cambridge Crystallographic Data Centre as supplementary publication nos. CCDC. Copies of the data can be obtained, free of charge, on application to CCDC, 12 Union Road, Cambridge CB2 1EZ, UK, (fax: +44-(0)1223-336033 or e-mail: deposit@ccdc.cam.ac.Uk). Deposited data may be accessed by the journal and checked as part of the refereeing process. If data are revised prior to publication, a replacement file should be sent to CCDC.
PMR: Relatively simple - if I want the crystallographic data from a structure it can be retrieved from CCDC. Let’s see it from the reader’s point of view. Although Tetrahedron is closed, there is a Free-to-view issue which includes:
Calix[4]azacrowns: self-assembly and effect of chain length and O-alkylation on their metal ion-binding properties
Pages 62-70
Issam Oueslati, Pierre Thuéry, Oleksandr Shkurenko, Kinga Suwinska, Jack M. Harrowfield, Rym Abidi and Jacques Vicens
SummaryPlus | Full Text + Links | PDF (1255 K)
PMR: Now all of you should be able to click along with me since the Full Text is Free-to-read… and we find towards the end…
5.3. X-ray crystal data for 1 and 4
Crystal data and refinement details for 1·CH3CN·CH3OH. C53H71N3O7, M=862.13, monoclinic, space group C2/c, a=35.499(1), b=11.8598(2), c=25.6668(8) Å, β=115.288(1)°, V=9770.5(4) Å³, Z=8, Dc=1.172 g cm⁻³, μ=0.077 mm⁻¹, F(000)=3728. Refinement of 601 parameters on 6923 independent reflections out of 42440 measured reflections (Rint=0.037) led to R=0.081, wR=0.192, and S=1.13. Crystal data and refinement details for 4·CH3CN·CHCl3. C54H70Cl3N3O6, M=963.48, orthorhombic, space group Pna21, a=12.6378(2), b=33.1849(13), c=12.6464(5) Å, V=5303.7(3) Å³, Z=4, Dc=1.207 g cm⁻³, μ=0.223 mm⁻¹, F(000)=2056. Refinement of 608 parameters on 7285 independent reflections out of 25830 measured reflections (Rint=0.054) led to R=0.056, wR=0.157, and S=1.07. Crystallographic data for the structures of 1 and 4 have been deposited with the Cambridge Crystallographic Data Centre as supplementary publication nos. CCDC 621387 and CCDC 621388. Copies of data can be obtained, free of charge, on application to CCDC, 12 Union Road, Cambridge CB2 1EZ, UK.
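Numbers like these can be sanity-checked by machine. As a minimal sketch, the cell volume of a monoclinic cell is V = a·b·c·sin β and of an orthorhombic cell V = a·b·c. The function names are my own, and the monoclinic check assumes β = 115.288°, which is the value that reproduces the reported volume of 9770.5 Å³.

```python
import math


def monoclinic_volume(a, b, c, beta_deg):
    """Cell volume in cubic angstroms for a monoclinic cell (alpha = gamma = 90)."""
    return a * b * c * math.sin(math.radians(beta_deg))


def orthorhombic_volume(a, b, c):
    """Cell volume in cubic angstroms for an orthorhombic cell (all angles 90)."""
    return a * b * c


# Compound 1 solvate (monoclinic, C2/c); beta is assumed to be 115.288 degrees.
v1 = monoclinic_volume(35.499, 11.8598, 25.6668, 115.288)

# Compound 4 solvate (orthorhombic, Pna21).
v4 = orthorhombic_volume(12.6378, 33.1849, 12.6464)

print("compound 1: %.1f (reported 9770.5)" % v1)
print("compound 4: %.1f (reported 5303.7)" % v4)
```

Checks of this kind are one way a robot can attach a known quality to data without a human inspecting every entry.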
PMR: (1 and 4 signify compounds 1 and 4 in the main text; most authors and publishers use a unique numbering scheme within the paper.) So I can apply to CCDC for the structures which should be free. Will they be Open? Here’s what we have to do: On the Request Structure page:
Since 1994, under official deposition arrangements with a number of journals, the Cambridge Crystallographic Data Centre (CCDC) has provided copies of the supplementary data of individual published structures for bona fide research purposes. Data from before 1994 are currently only available from the distributed Cambridge Structural Database (CSD).
Supplementary data arriving at the CCDC electronically in CIF format whether as part of journal deposition arrangements or directly from individuals are held on trust in the CCDC Supplementary Data Archive on behalf of those journals and individuals. After publication, these data are converted into CSD entries by the addition of bibliographic and chemical text, chemical structural data, and the results of crystal structure validation.
In January 2002 CCDC provided a web form for data retrieval, which requires you to enter brief literature citation details and the CCDC Deposition Number (CCDCnnnnnn) which should appear in the paper.
This free service permits rapid access to supplementary CIF data for bona fide research purposes. The complete Cambridge Structural Database containing fully validated information may also be available within your institution or department.
PMR: and now the conditions…
Conditions of Use of CIFs provided from the CCDC CIF archive Individual CIF data sets are provided freely by the CCDC on the understanding that they are used for bona fide research purposes only. They may contain copyright material of the CCDC or of third parties, and may not be copied or further disseminated in any form, whether machine-readable or not, except for the purpose of generating routine backup copies on your local computer system.
If you agree to the foregoing terms and conditions then please click on the “Accept” button below. If you do not accept the foregoing terms and conditions you should not click on the “Accept” button but should click on the “Do NOT accept” button below or the “back” button on your browser.
PMR: This doesn’t look crystal clear. “They may contain copyright material of the CCDC or of third parties”. A very fuzzy statement. They may only be used for “bona fide research purposes”. This is an unclear phrase. “may not be copied or further disseminated in any form, whether machine-readable or not”. This is fairly clear. The user has very few rights if any. Anyway I “agree” to the conditions for once and find the form:
(e.g. one: 217777
more than one: 217777 218383
range: 218383-218386
other: 1220/32 or wn6031)
Journal:
Year:
First page:
Volume:
(omit if journal has no volume)
Author surname:
(First or principal surname, e.g. Cox, Smith)
PMR: I have to fill in a form for each paper. This is rather tedious (the data are transmitted by email), but let’s continue with at least one. I send it off and give it the email I want the CIF to be sent to. A minute later I get the email which looks like:
Thank you for using the Cambridge Crystallographic Data Centre
CIF Depository request form.
Your request returned 1 structure.
Tetrahedron (2007), 63, 62
Deposition Number(s) 621387
CIF file for 1 structure is attached to this message.
========================================================================
CCDC No Acell Bcell Ccell Space Gp.
621387 12.6378 33.1849 12.6464 Pna21
========================================================================
CCDC Depository
http://www.ccdc.cam.ac.uk/
LEGAL NOTICE
Unless expressly stated otherwise, information contained in this
message is confidential. If this message is not intended for you,
please inform postmaster@ccdc.cam.ac.uk and delete the message.
The Cambridge Crystallographic Data Centre is a company Limited
by Guarantee and a Registered Charity.
Registered in England No. 2155347 Registered Charity No. 800579
Registered office 12 Union Road, Cambridge CB2 1EZ.
#######################################################################
#
# Cambridge Crystallographic Data Centre
# CCDC
#
#######################################################################
#
# This CIF contains data from an original supplementary publication
# deposited with the CCDC, and may include chemical, crystal,
# experimental, refinement, atomic coordinates,
# anisotropic displacement parameters and molecular geometry data,
# as required by the journal to which it was submitted.
#
# This CIF is provided on the understanding that it is used for bona
# fide research purposes only. It may contain copyright material
# of the CCDC or of third parties, and may not be copied or further
# disseminated in any form, whether machine-readable or not,
# except for the purpose of generating routine backup copies
# on your local computer system.
#
# For further information on the CCDC, data deposition and
# data retrieval see:
# www.ccdc.cam.ac.uk
#
# Bona fide researchers may freely download Mercury and enCIFer
# from this site to visualise CIF-encoded structures and
# to carry out CIF format checking respectively.
#
#######################################################################
data_4.CH~3~CN.CHCl~3~
_database_code_depnum_ccdc_archive 'CCDC 621387'
_audit_creation_method SHELXL
[... rest of CIF snipped (seems to be verbatim author deposition) ...]
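For readers who have not met CIF before, the header above is just comments (lines starting with #), a data block name (data_...) and tag-value pairs (tags start with an underscore). A minimal sketch of reading those simple pairs follows. It is not a full CIF parser (no loops, no multi-line values), the function name is my own, and the sample text is retyped from the lines shown above.

```python
def read_simple_items(cif_text):
    """Return (block_name, {tag: value}) for single-line CIF tag-value pairs.

    Ignores comments and blank lines; does not handle loop_ or
    multi-line text fields, which a real CIF parser must.
    """
    block = None
    items = {}
    for raw in cif_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("data_"):
            block = line[len("data_"):]
        elif line.startswith("_"):
            parts = line.split(None, 1)  # tag, then the rest of the line
            if len(parts) == 2:
                items[parts[0]] = parts[1].strip().strip("'\"")
    return block, items


SAMPLE = """\
# comment lines are skipped
data_4.CH~3~CN.CHCl~3~
_database_code_depnum_ccdc_archive 'CCDC 621387'
_audit_creation_method SHELXL
"""

block, items = read_simple_items(SAMPLE)
print(block)
print(items)
```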
PMR: and I now have one extra file for Crystaleye. But am I allowed to post it on our server. We’ll write to the CCDC and find out. But this post is quite long enough for today…
[1] One of Robert Maxwell’s first journals. When it came out it was rather exciting. A specialist journal for carbon compounds (organic chemistry). And because carbon often has a tetrahedral environment, this was a very trendy name for the 1970’s. I published in it.
I talked today with a scientist (R) whom I meet frequently and who works in a leading bioscientific research establishment (not a University, but with Nobel laureates and FRSs on the staff). In addition to their day job, R acts as a peer-reviewer for a leading bioscience journal (J). R gets about 1 paper a week from J and has to return a review within 21 days. R does not get paid for their reviewing, which can certainly run into hours per paper. Like many other scientists R does it because it contributes to science. R and J take review seriously and J has given R training in how to review. The reviewing can be seen as a credit on R’s CV.
Some of you may not be familiar with how peer-review in scholarly journals works, so here’s a rough overview. An author (A) sends a manuscript (M) to an editor (E) on J who decides which reviewers (R, R1, R2) should review M. R, R1… are normally asked to decide on:
whether M is in scope for J
whether it reports a significant scientific advance (i.e. not repeating work already known)
whether the science - as reported in M - is sound and potentially capable of being reproduced
whether A shows that they are aware of published work that impinges on M
and in many journals to give some idea of the “score” or “importance” of M. This could be “very important”, “minor advance, but useful”, etc.
R is normally dependent on:
the material in M.
the references (citations) in M to other work, C1, C2…
R’s knowledge of the field from meetings, conversations, reading, etc.
R must keep M confidential but is normally allowed to consult close colleagues in confidence on small matters of fact.
R relies heavily on the material in the references (C1, C2…). These can contain background material, precise recipes, data, closely argued positions, etc. Without C1, C2 it is normally impossible to review the paper.
R tells me that on average there are at least 3 references (C1,C2,C3) per manuscript (M) which are closed and to which R’s institution does not have access. R cannot review the paper responsibly without C1, C2, C3. So what should R do?
send M back and say they cannot review it
pay the cost (3 * USD30 = 90 USD) for access to these references
ask R’s institution to pay for access (why should they?)
get an interlibrary loan (takes days and anyway costs money)
ask a friend (me) for a copy of C1, C2, because Cambridge subscribes to these closed journals. I have to say no, because that would be a breach of copyright and whatever byzantine conditions the publishers have agreed with our library.
do a bad review
THE LACK OF ACCESS TO PUBLICATIONS HARMS PEER-REVIEW
This is a sample of one, but by mathematical induction it applies to all peer-review. Therefore the hidden cost to science of closed access is enormous. Either the reviewers pay 90 USD per paper (which I strongly doubt), or they violate copyright (unthinkable), or they do bad reviews (certainly not) or …
So, closed access publishers, by preventing reviewers (and there are tens of thousands) from reading your publications you are damaging peer-review. In a paper age this was accepted - you couldn’t read everything - or it took ages. But in the electronic age it isn’t necessary. We ought to be getting better, quicker peer-review because of e-paper.
Are we? and do you care enough to do something about it? If not Open Access (the obvious answer), WHAT?
Avid and continual readers of this blog will remember that some of us in the Blue Obelisk have set out to monitor the posted policy and licenses of “open access” publishers or publishers which offer some “open access” products. We are going systematically, though more slowly than we would have liked (and happy to have committed volunteers), through the public pages of these publishers.
Some are easy - they state simply that they offer CC-BY licenses. Others have more complex pages, which sometimes are inconsistent. We are blogging such instances - hopefully in an objective fashion - and giving the publishers the opportunity to clarify policies.
I commented factually on a journal published by Libertas Academica (“Open Access” at libertas academica). Now we have great news - they (Tom Hill) understand the issue and are making simple and positive changes to their site. I reproduce the mail in full.
Subject: Open Access at Libertas Academica
From: “Tom Hill”
To:
Dear Dr Rust,
Earlier today I read your blog entry on OA at Libertas Academica (http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=415) with great interest. I regret not having done so earlier, although I came upon it as a result of an uncharacteristic bout of Saturday morning corporate navel-gazing by way of a Google search on the company’s name. I have now added your blog’s RSS feed to my bookmarks and I’m sure I’ll become a regular reader.
I very much appreciate your critique on our OA policy. It appears that in our haste to develop our journals we have neglected to make our policy, particularly with respect to copyright, as transparent as it should be.
I have therefore made the following changes:
1. We now clearly apply the CC-BY licence.
2. I have asked our web developer to remove the obsolete copyright statement from the bottom of all web pages. Given that this is Saturday, this change will probably not take place until Monday NZ time.
I wonder if you would be willing to communicate the gist of these changes to your readers in some way? Irrespective of this, thanks for your feedback.
Regards,
Tom Hill
____________________________________________
Tom Hill
“Analytical Chemistry Insights” “Biomarker Insights” “Bioinformatics and Biology Insights” “Cancer Informatics” “Clinical Medicine: Arthritis and Musculoskeletal Disorders” “Clinical Medicine: Cardiology” “Clinical Medicine: Circulatory, Respiratory and Pulmonary Medicine” “Clinical Medicine: Oncology” “Drug Target Insights” “Evolutionary Bioinformatics” “Gene Regulation and Systems Biology” “Integrative Medicine Insights” “Perspectives in Medicinal Chemistry” “Translational OncoGenomics”
LIBERTAS ACADEMICA
This is wonderful - and I hope that many of the problems we have will turn out to be simple lack of clarity on web pages.
On our side we hope - in time - to be able to summarise the access and re-use rights of all “open access” chemistry publishers. We started with “Analytical Chemistry Insights” because it was the first in the alphabet - so if you are a publisher of chemistry listed on the DOAJ list and your journal is later in the alphabet than “A” and you wish to clarify your website before we get to you … please drop us a note.
In summary this shows dramatically the value of labels.
Here are some quotations from the ICSU report:
“…Full and open access” to data implies equitable,
non-discriminatory access to all data that are of
value for science. It does not necessarily equate to
immediate access or ‘free of cost’ at the point of
delivery, although this is certainly the ideal in many
situations, particularly with regard to publicly
funded data. Data should be made available with
minimal delay but a short ‘privileged access’ period
for original data producers may be justified in some
situations. Excessive charging for data that is by
definition discriminatory against some scientists is
clearly contrary to the principle of full and open
access but some cost-recovery is not necessarily
excluded…”
“…There are several economic models for providing
scientists with access to data for research and education.
They include, among others, (1) free and open access to
research data by scientists, with financial support for data
dissemination and preservation assumed by others,
including government science agencies and private
foundations; (2) open access to scientific data for research
and education for the cost of reproduction (that is,
recovering the operational costs of data dissemination);
(3) free and open access to metadata, and cost-recovery
pricing for data (or data licenses) in order to support the
full data infrastructure. When this last approach is
employed by a commercial company, the financial charges
for data must be sufficient to recover all investment costs
and to make a profit for investors. An important variation
on this includes licensing for scientists to use specific
bodies of data at reduced cost…”
PMR: As I said earlier I have considerable respect for CODATA and if this is their position I know they have laboured hard over preparing it - the content, the intent and the phrasing. So they have chosen “open access” as a descriptive phrase and used it several times. However they make it quite clear that this does not necessarily mean “toll-free”.
Whereas in another branch of ICSU, ICSTI, it is very clear that “open access” is used in the sense of BOAI.
Oh dear.
We have a major and committed organisation using phrases in a completely confusing way. So it is not, perhaps, surprising that we do not always make ourselves clear to each other. And occasionally world views collide and raise the heat of debate.
What should we do? I think we have to be more precise about what we are talking about. We need to devise labels that we understand. And that will be the theme of later posts here. But I wonder if CODATA/CSPR might not consider removing the phrase “full and open access” from its sponsored pay-to-view databases.
In recent posts (“Request for CODATA definition of Open Access” and http://wwmm.ch.cam.ac.uk/blogs/murrayrust/?p=445) I was concerned about the use of “open access” to describe a pay-to-access database. I had a very useful and constructive reply from softCon about the “open access” database of spectra which I append below, together with my reply. I will comment in a later post…
PMR: Many thanks for your prompt, full and constructive reply. I hope I can make a similarly constructive response. I will embed comments in your text, which is otherwise verbatim.
Dear Dr. Murray-Rust,
thank you for your e-mail concerning the “UV/Vis+ Spectra Data Base”. Please
let me begin my answer with some additional information.
We started the database in August 2000 and in the beginning all data
(spectra and datasheets) are completely free accessible for everyone. We
thought that this would be helpful to convince the users of the database to
help us in maintaining the database and to convince commercial users
(unilever, bayer, basf, pfizer etc) which benefits from the database or
governmental organizations to provide us with financial support, but that
was naive from our side. During the first six months when the database was
on-line we’ve several thousands of users (commercial and non-commercial) but
we have only 2 (TWO!) users which were willing to help us in maintaining the
database by the provision of data and we’ve got no financial support. Due to
this experience, we’ve decided to change our database policy. The database
was subdivided into a complete free-of-charge “Literarure-Service”
(meta-data) and a “Spectra-Service” (spectral data) for which a subscription
is required. You can access the “Spectra-Service” either by supporting us in
maintaining the database (provision of spectra data) or by paying a moderate
annual fee. Currently almost three fourths of the “spectra-service” users
have complete free-of-charge” access. We make no profit with the database.
PMR: I understand and appreciate the business model. I don’t have any concern about charging for data per se.
To maintain a fast growing database is not only a really hard and never
ending work but also cost-intensively. However, to operate and maintain such
a database financial support is required. Both database services are
operated in accordance to the “Open Access” definitions and regulations of
the CSPR Assessment Panel on Scientific Data and Information (International
Council for Science, 2004, ICSU Report of the CSPR Assessment Panel on Data
and Information; ISBN 0-930357-60-4). We’ve added a link to the original
document on our web-site.
Here are some quotations from the ICSU report:
“…Full and open access” to data implies equitable,
non-discriminatory access to all data that are of
value for science. It does not necessarily equate to
immediate access or ‘free of cost’ at the point of
delivery, although this is certainly the ideal in many
situations, particularly with regard to publicly
funded data. Data should be made available with
minimal delay but a short ‘privileged access’ period
for original data producers may be justified in some
situations. Excessive charging for data that is by
definition discriminatory against some scientists is
clearly contrary to the principle of full and open
access but some cost-recovery is not necessarily
excluded…”
“…There are several economic models for providing
scientists with access to data for research and education.
They include, among others, (1) free and open access to
research data by scientists, with financial support for data
dissemination and preservation assumed by others,
including government science agencies and private
foundations; (2) open access to scientific data for research
and education for the cost of reproduction (that is,
recovering the operational costs of data dissemination);
(3) free and open access to metadata, and cost-recovery
pricing for data (or data licenses) in order to support the
full data infrastructure. When this last approach is
employed by a commercial company, the financial charges
for data must be sufficient to recover all investment costs
and to make a profit for investors. An important variation
on this includes licensing for scientists to use specific
bodies of data at reduced cost…”
PMR: Thank you very much for this. I may comment later in a blog that it is unfortunate for the “open access” publishing community that ICSU has chosen the phrase “open access” to mean “an affordable charge structure”. I accept that by some standards 100EUR is non-discriminatory.
Indeed we are one step ahead to the ICSU recommendations since we provide
free-of-charge access to the meta data/related data without any
cost-recovery and in addition the database user can decide if he is willing
to help us in maintaining the database or to pay a moderate utilization fee
which ensures that the database will be operated, developed and maintained
in the future.
Again, currently all meta-data (datasheets) are free accessible as well as
other related data (e.g. software, satellite-data etc t.b.d.).
Finally, as mentioned in the ICSU report “WHO PAYS - Data production and
management are costly”. We’ve currently no idea how to finance this database
except by charging some of its users with a moderate fee. Do you have any
ideas?
PMR: I agree that data are costly, though technology brings some costs down. With Open Access in the publishing sense there is a strong movement towards author-pays supported-by-funder. The major charities (Wellcome, HHMI) are making allowances for authors to pay for publication as Open Access (toll-free access and hopefully re-use).
I suggest you have a look at what the NIST group (Michael Frenkel and colleagues) have done with ThermoML. Here the publishers have a model where if thermochemistry is published (there are 4 or 5 journals) it has to be in ThermoML and has to go into an Open Access database. This seems to work to everyone’s benefit. I’ll write more later … but this might be a useful model for you. It won’t pay YOU directly but may create a data stream at near zero cost.
PMR: more comments in later post…
Peter, I hope that these information will give you an idea about our
intensions.
Best regards,
Andreas
————————–
Dr. Andreas Noelle
science-softCon
Auf der Burg 4
63477 Maintal
Germany
Phone: +49 6181 498414
Fax: +49 6181 498415
e-mail: andreas.noelle@science-softcon.de
www.s-sc.de