Unilever Centre for Molecular Informatics, University of Cambridge

CrystalEye

Frequently Asked Questions

The data

Where does the crystallography come from?

Where are you currently aggregating the crystallography from?

How is the aggregation going?

Do any authors wish their data to be copyrighted and withheld from the community?


The CrystalEye system

Can you convert every CIF file you find into CML?

How does CrystalEye manage chemistry?

How is the data recommunicated?

What 2D and 3D rendering software do you use?

What other software have you used?

So it's all automated?

How is this related to the Cambridge Crystallographic Data Centre (CCDC) and the CSD?


Browsing the crystallography

How do I browse the crystallography?

What do the coloured ((DP)), ((DU)) and [[P]] symbols mean on the issue pages?

What are the differences in the way organic, inorganic and organometallic structures are represented?

Why can I not get access to the structures before year X?


Miscellaneous

Who has worked/is working on this project?


More to come soon (or even sooner if you email with a question) ...



The data

Where does the crystallography come from?

There are thousands of crystal structures published in online journals every month. When an author has a structure published, they are obliged to provide the (complete) output of the structure elucidation experiment (in the form of a CIF file) as supplementary material.

Because this supplementary data is a set of facts and is not part of the article full text, it does not fall under copyright and should therefore be free to both view and download. We have created a web spider which 'listens' for new journal issues to be published and checks them for any CIF files. When it finds a CIF file, it downloads it, and the data within is then recommunicated by passing it through the CrystalEye system.
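As a rough illustration of the "checks them for any CIF files" step, the sketch below scans the HTML of an issue page for links that end in .cif. It is a minimal sketch and not the CrystalEye spider itself: the page content, the journal.example.org address and the names CifLinkFinder and find_cif_links are all assumptions made for the example, and real publisher pages may expose CIFs differently.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class CifLinkFinder(HTMLParser):
    """Collect absolute URLs of <a> links whose path ends in .cif."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.cif_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore any query string when testing the file extension.
        if href and href.split("?")[0].lower().endswith(".cif"):
            self.cif_urls.append(urljoin(self.base_url, href))


def find_cif_links(html, base_url):
    """Return the CIF links found in one page of HTML."""
    finder = CifLinkFinder(base_url)
    finder.feed(html)
    return finder.cif_urls


# Hypothetical issue page, for illustration only.
SAMPLE_PAGE = """
<html><body>
<a href="/articles/a1.html">Article 1</a>
<a href="/suppdata/a1.cif">Supplementary CIF</a>
<a href="suppdata/b2.CIF?download=1">CIF</a>
</body></html>
"""

links = find_cif_links(SAMPLE_PAGE, "https://journal.example.org/issue/7/")
for url in links:
    print(url)
```

Only the link discovery is shown; downloading the files and respecting each publisher's terms are separate concerns.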

In the near future, we hope to extend the system to aggregate from institutional repositories, and also to provide a method for self-deposition.

Where are you currently aggregating the crystallography from?

We are doing this for all publishers (that we know of) who provide links to the supplementary CIFs from their sites, namely the RSC, IUCr, ACS, the Chemical Society of Japan (Chemistry Letters only) and Elsevier (Polyhedron only). Wiley, Springer and Blackwell do not expose CIFs (thereby depriving the scientific community of data). We have also merged CIFs from the Crystallography Open Database with the data from the publishers' websites.

However, even if the CIF files are free to download, that doesn't mean the website owner looks kindly on you sending a web spider to do it for you. Both the Royal Society of Chemistry and the International Union of Crystallography have kindly allowed us to do this. For other publishers we aggregate the CIFs by hand before passing them through the CrystalEye system.

How is the aggregation going?

So far we have aggregated around 100,000 CIF files in this way (as of 4 September 2007).

Do any authors wish their data to be copyrighted and withheld from the community?

We haven't found any. We suggest that authors publishing CIFs use a Creative Commons license, making their views clear. A simple way to do this would be to add this into the software that produces CIFs.


The CrystalEye system

Can you convert every CIF file you find into CML?

As long as the CIF conforms to the CIF specification we have no trouble parsing it - of course in the real world this isn't always going to happen. Our CIF parser has a small set of heuristics in it to fix commonly encountered minor problems, but there are some that we can't (and wouldn't want to) recover from.

How does CrystalEye manage chemistry?

An accurate modern structure contains all the atoms (including hydrogen). We check that the atom count is the same as that given in _chemical_formula_moiety. If not, we flag the structure as problematic. Otherwise we use the author-assigned charge on the moieties and our own heuristics for assigning double bonds. That works in most cases. Sometimes the authors omit the charges on a charged structure, in which case we try to guess, but ultimately it is the author's statement. (Hopefully CrystalEye will help to raise the quality of chemical information in CIFs.) It is a tragedy that authors do not use the _chemical_conn* records, which are specifically for this.
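The atom-count check can be sketched as follows. This is an illustration under stated assumptions, not the CrystalEye code: it assumes the moiety formula uses the common CIF style of space-separated element tokens, comma-separated moieties, an optional integer multiplier such as 2(H2 O), and charge tokens such as 1+, and it assumes the atom list has already been expanded to complete molecules. The function names are hypothetical, and non-integer counts are deliberately rejected rather than guessed at.

```python
import re
from collections import Counter


def parse_moiety_formula(formula):
    """Return element counts for a string such as 'C12 H16 N2 O6, 5(H2 O)'."""
    totals = Counter()
    for moiety in formula.split(","):
        moiety = moiety.strip()
        if not moiety:
            continue
        multiplier = 1
        wrapped = re.fullmatch(r"(\d+)\s*\((.*)\)", moiety)
        if wrapped:
            multiplier = int(wrapped.group(1))
            moiety = wrapped.group(2)
        for token in moiety.split():
            if re.fullmatch(r"\d*[+-]", token):
                continue  # a charge such as "1+" or "2-", not an atom
            element = re.fullmatch(r"([A-Z][a-z]?)(\d*)", token)
            if element is None:
                raise ValueError(f"unrecognised token {token!r}")
            count = int(element.group(2) or 1)
            totals[element.group(1)] += multiplier * count
    return totals


def atom_count_matches(formula, atom_symbols):
    """True if the atoms present agree with the stated moiety formula."""
    return parse_moiety_formula(formula) == Counter(atom_symbols)


# Ethanol dihydrate: C2 H6 O plus two waters gives C2 H10 O3 in total.
atoms = ["C"] * 2 + ["H"] * 10 + ["O"] * 3
print(atom_count_matches("C2 H6 O, 2(H2 O)", atoms))       # True
print(atom_count_matches("C2 H6 O, 2(H2 O)", atoms[:-1]))  # False: one O missing
```

A structure for which the second case occurs would be the kind flagged as problematic above.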

How is the data recommunicated?

For each journal issue passed through the system, a set of webpages is generated to allow easy browsing of the crystallography within. Both 2D and 3D renderings of the structures are provided (e.g. here). The webpages further down provide access to the original CIF, as well as all the data files generated by the system from it. These include:

  • CML (which also contains the CheckCIF data, the article DOI and the InChI and SMILES for the structure, and additional chemistry such as bond orders and charges),
  • CheckCIF,
  • ellipsoid plot,
  • bond length, angle and torsion summaries,
  • fragments (ring-nuclei, metal ligands, metal centres, metal clusters, ring-ring and ring-terminus linkers).

The system also maintains a number of RSS and CMLRSS feeds which summarize the latest crystallography to have been published. You can subscribe to feeds by class, journal, atoms or bonds. So, if for instance you were interested in Ag-C bonds, you could subscribe to one of the feeds here. Alternatively, if you were interested in structures containing a particular atom, you would go here.
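For readers who would rather process a feed in a script than in a feed reader, the sketch below lists the entries of a feed. It assumes a plain RSS 2.0 layout (channel, item, title, link); the actual CrystalEye feeds, and in particular the CMLRSS ones that embed CML, may be structured differently. The feed text and the example.org links are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content, for illustration only.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Hypothetical feed: structures containing Ag-C bonds</title>
    <item>
      <title>C12 H10 Ag N O3</title>
      <link>https://example.org/summary/1</link>
    </item>
    <item>
      <title>C8 H12 Ag Cl</title>
      <link>https://example.org/summary/2</link>
    </item>
  </channel>
</rss>
"""


def read_feed_items(feed_xml):
    """Return (title, link) pairs for every item in an RSS 2.0 document."""
    root = ET.fromstring(feed_xml)
    return [
        (item.findtext("title"), item.findtext("link"))
        for item in root.iter("item")
    ]


for title, link in read_feed_items(SAMPLE_FEED):
    print(title, "->", link)
```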

What 2D and 3D rendering software do you use?

For the 2D layouts we use the CDK and for 3D we use Jmol. Both have large, active (and very helpful) communities based at SourceForge.

What other software have you used?

All of the software used and created in this project is Open Source. I won't list it all here, but if you are interested, the description of the project on our group wiki (here) will give you an idea of the different software used for the various parts of the system.

So it's all automated?

Yes. Within the CrystalEye system nothing is done by hand: the aggregation (for publishers that permit spidering), file and website generation and RSS updating are all done robotically.

How is this related to the Cambridge Crystallographic Data Centre (CCDC) and the CSD?

It isn't. CCDC is a not-for-profit organisation which for many years has aggregated crystal structures from the literature and applied a variety of cleaning methods to the data. Its database, the CSD, has records going back about 50 years and is available by subscription only. At one stage many journals required that a copy of the crystallographic data associated with a publication be deposited directly with CCDC (often by the journal editors), and some journals still do this. However, with the advent of CIF and ePublishing, several journals, most notably those of the IUCr, run online checking facilities and directly publish the CIF as supplemental data. It is this Open Data publication that makes CrystalEye possible. CCDC produces its own version of the CIF file (for subscribers or one-off queries) which includes a unique 6- or 8-character REFCODE. CrystalEye does not use this CIF, the REFCODE or any other data or software from CCDC.

The main differences between CrystalEye and CCDC are:

CrystalEye                                              CCDC
CIFs from 1991-present                                  comprehensive for organics and organometallics
Robotically cleaned                                     cleaning includes humans and machines
Chemistry generated from CIF                            chemistry added by humans and machines
Links directly to journal articles                      unknown
RSS feeds on daily basis                                unknown
Includes inorganic structures                           does not include inorganic structures
Conserves all data from original CIF                    unknown
Metadata from journal free text when publishers allow   unknown
Per-journal browsing facility                           unknown

CrystalEye will be introducing chemistry- and data-based search as we include the backlog.


Browsing the crystallography

How do I browse the crystallography?

An issue table of contents looks something like the image below.

In this, each row in the table corresponds to one crystal structure found in the issue. A row might look like:

If you click on the left-hand column, the structure represented by that row will be shown in the 2D and 3D rendering section at the bottom of the webpage. The middle column provides a link back to the original article (by using the DOI). If you click on the right-hand column, you will be taken to another webpage summarizing the structure in closer detail.

You can also navigate through the structures using the navigation arrows above the 3D image. The arrows above the 2D image are slightly different: they are for browsing the 2D images of different molecules within the same crystal structure. At present there is a bug that may cause the unit cell not to appear automatically in the Jmol applet. To force it to be shown, right-click on the applet and follow the options style>unitcell>dotted.

What do the coloured ((DP)), ((DU)) and [[P]] symbols mean on the issue pages?

((DP)) or ((DU)) next to the structure formula in the left-hand column indicates that the crystal structure in the corresponding CIF file is disordered. ((DP)) indicates that our system could resolve the disorder and display the major occupied structure from the crystal. ((DU)) indicates that the system could not understand the disorder, and hence the structure shown will still contain all of the disorder information.
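One way to picture "display the major occupied structure" is sketched below. It is an assumption about how such a step could work, not a description of CrystalEye's algorithm: each atom is taken to carry an occupancy plus the disorder assembly and group labels that a CIF can supply (_atom_site_disorder_assembly and _atom_site_disorder_group), and for each assembly the group with the highest mean occupancy is kept. The Atom type and major_component function are hypothetical names.

```python
from collections import defaultdict
from typing import NamedTuple, Optional


class Atom(NamedTuple):
    label: str
    occupancy: float
    assembly: Optional[str] = None  # disorder assembly label, if any
    group: Optional[str] = None     # disorder group label, if any


def major_component(atoms):
    """Keep ordered atoms plus, per assembly, the most occupied disorder group."""
    occupancies = defaultdict(list)
    for atom in atoms:
        if atom.group is not None:
            occupancies[(atom.assembly, atom.group)].append(atom.occupancy)

    # For each assembly, remember the group with the highest mean occupancy.
    best = {}
    for (assembly, group), values in occupancies.items():
        mean = sum(values) / len(values)
        if assembly not in best or mean > best[assembly][1]:
            best[assembly] = (group, mean)

    return [
        atom for atom in atoms
        if atom.group is None or best[atom.assembly][0] == atom.group
    ]


atoms = [
    Atom("C1", 1.0),
    Atom("O1A", 0.7, "A", "1"),
    Atom("H1A", 0.7, "A", "1"),
    Atom("O1B", 0.3, "A", "2"),
    Atom("H1B", 0.3, "A", "2"),
]
print([atom.label for atom in major_component(atoms)])  # ['C1', 'O1A', 'H1A']
```

A CIF whose disordered atoms lack usable group labels or occupancies could not be handled by this sketch at all, which is one plausible reading of the ((DU)) case.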

[[P]] indicates that the structure is a polymeric organometallic. We flag this because the system represents polymeric structures by displaying the unit cell with all of its atoms, rather than trying to display a discrete moiety.

What are the differences in the way organic, inorganic and organometallic structures are represented?

For inorganic or polymeric organometallic structures we generate the unit cell with all atoms inside. For all other structures we generate the unique discrete molecules in the unit cell.

The structures for which we generate the unit cells with all atoms are not assigned bond orders or charges. There is obviously no point in generating 2D images for such structures either.

Those structures for which we generate the unique discrete molecules are assigned bond orders and charges by the system. If the system is able to do this, then it also generates 2D structure diagrams of the molecules.
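The rules in this answer can be summarised as a small decision function. This is only a restatement of the text in code form; the class labels, the Representation type and the choose_representation function are hypothetical, and how CrystalEye actually decides that a structure is inorganic or polymeric is not covered here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Representation:
    unit_cell: bool                # True: whole unit cell; False: unique discrete molecules
    bond_orders_and_charges: bool  # whether chemistry is assigned
    diagrams_2d: bool              # whether 2D structure diagrams are generated


def choose_representation(structure_class, polymeric=False, chemistry_assigned=True):
    """Map a structure's class onto the representation described above."""
    if structure_class == "inorganic" or (
        structure_class == "organometallic" and polymeric
    ):
        # Unit cell with all atoms: no bond orders, charges or 2D images.
        return Representation(True, False, False)
    # Discrete molecules: 2D diagrams only if bond orders and charges were assigned.
    return Representation(False, chemistry_assigned, chemistry_assigned)


print(choose_representation("inorganic"))
print(choose_representation("organometallic", polymeric=True))
print(choose_representation("organic"))
```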

Why can I not get access to the structures before year X?

Since a major aspect of CrystalEye is the RSS feeds for current awareness, we don't want to flood readers with all past data at once. We are therefore concentrating on the latest journals (mainly 2007) so that the CMLRSS can be tried out. Simultaneously we shall be adding a search facility to the retrospective data and shall announce this shortly. When that happens all entries will be retrievable.


Miscellaneous

Who has worked/is working on this project?

CrystalEye was started by Nick Day as part of his PhD, working under the supervision of Peter Murray-Rust at the Unilever Centre for Molecular Informatics, University of Cambridge, UK.

The following people and organisations have provided significant assistance in the development of CrystalEye, in the form of advice, bug reports, testing, feedback etc.

  • Jim Downing
  • Simon 'Billy' Tyrrell
  • Mark Holt (summer student funded by the International Union of Crystallography)
  • The International Union of Crystallography
  • The Royal Society of Chemistry

If you have any comments or questions, please direct them to ned24@cam.ac.uk.