Unofficial InChI FAQ

FAQ Overview
        What is this FAQ?
        Can I help?
        Who is responsible for InChI?
        Where can I find out more?
        Is there an InChI mailing list?
        Are there other InChI FAQs?
        Who maintains this FAQ?
Introduction to InChI
        What is an InChI?
        Wasn't it called IChI?
        Is the name (or capitalisation) likely to change again?
        So....how is InChI pronounced?
        Is the version number considered to be part of the InChI string?
        What is the purpose of the InChI?
        What is the scope of the InChI?
        Is InChI free?
        Is InChI open?
        Is InChI stable?
        Do I need to know about InChI versions?
        What does an InChI look like?
        Who is using InChI?
        Where can I find examples of InChIs?
        Are there any tutorials?
Understanding InChIs
        Why are layers used in an InChI?
        How is a layer represented in the identifier?
        Specifically, what are InChI layers?
        Isn't InChI too complicated?
        Is information for each layer required in the input information?
        Are layers reusable?
        Is InChI extensible?
        Can an InChI be invalid?
        How do I check that the InChI represents my compound?
        May I edit an InChI independently of the wInChI-1.exe or cInChI-1.exe results?
        What is the benefit of using w-InChI (the GUI application) over c-InChI (the command line application)?
Chemical Structure Representation Issues
        How is an InChI created from the input information?
        How does InChI deal with the many equivalent ways of arranging bonds and charges in delocalized structures?
        How is stereochemistry represented?
        What is the difference between ? and u in chiral centres?
        Why may a stereo layer appear several times in a single InChI?
        How does InChI represent salts?
        How does InChI represent organometallic compounds?
        What is the difference between salt and metal disconnection?
        How does InChI deal with structures that are composed of multiple interconnected (covalently bonded) components?
        How does InChI represent compounds with mobile H-atoms (tautomerism, for example)?
        Why is there a Fixed-H layer if tautomerism is represented in the Main layer?
        How does InChI manage isotopes?
        How does InChI manage charge?
        Can InChI represent radicals?
        Can InChI represent different spin states?
        What Can InChI Currently Not Represent?
InChI Syntax
        Does the formula always represent the complete composition of the substance?
        Is there always a connection table layer (/c)?
        Is there always an H layer (/h)?
        Does the total number of hydrogens in the /h layer represent the number of hydrogens in the input compound?
        What does the /p layer mean?
Comparing InChIs
        Can I compare structures by looking at their InChIs?
        Can I compare structures by looking at specific layers from their InChIs?
        If two InChIs are the same, do they refer to the same compound?
        If two InChIs are different, are they different compounds?
        How can I compare similar compounds?
The Current InChI Release
        Where can I get the current InChI release?
        Where can I find the InChI Technical Manual?
        How do I install InChI?
        How do I create an InChI?
        Can I link or call InChI from my program?
        Which formats does InChI accept?
        With which chemical drawing packages can you 'cut-and-paste' directly into the InChI Generator?
        Can I use InChI if I don't know the connection table?
        Is there a way to generate an InChI if I have a connection table, but not in CML, Mol or SDF form?
        Other than a connection table, what is needed to generate an InChI?
        Can I regenerate the structure from InChI?
        What happens if the input structure has no mobile hydrogen atoms but mobile hydrogen perception is specified?
        What is the 'Auxiliary Information' in the InChI output?
Program Flags
        The InChI program has many flags both in the GUI (wInChI-1) and the command line (cInChI-1). Can they affect the generated InChI?
        What do the stereochemical flags do?
        What does RecMet do?
        What does FixedH do?
        What does Compress do?
Strategies for Creating InChIs
        Do I need to know how my molecular information was created?
        Does InChI require all atoms including hydrogens?
        What are the problems if I can't find out about this?
        Can InChI fix these problems automatically?
InChI in the Real World
        How is InChI being developed?
        I am a chemical supplier - is InChI useful to me?
        I maintain a chemical database - is InChI useful to me?
        I am a publisher - is InChI useful to me?
        I am in pharma - is InChI useful to me?
        What Is InChI Not Designed For?
InChI and Other Technologies
        Can search engines use InChIs?
        How does CML relate to InChI?
        How does InChI differ from SMILES?
Questions and Answers from the InChI Discuss mailing list
        So...what is this section all about?
        Who is DT who has answered all of the questions asked to the mailing list?
        The InChI Generation Process
        General InChI representation of structures
        InChI representation of multiple components
        InChI input
        Hydrogen Layer
        Stereochemical Layer
        Isotopic Layer
        Reconnected Layer
        InChI failures

FAQ Overview

What is this FAQ?

This FAQ is an attempt to answer common questions on the concepts, structure and meaning of InChIs. It has no official status but we work very closely with the IUPAC and NIST groups on InChI and revise this FAQ frequently. Where possible we quote directly from the official InChI/IUPAC site and the distribution.

Can I help?

That would be great! Please get in touch with us at ned24@cam.ac.uk if you have any:

  • corrections to make,
  • questions that you would like answering,
  • questions that you know the answers to and would like to see included,
  • details of collections that have been InChIfied,
  • details of publications which have mentioned, or used InChI

All communications are most welcome!

Who is responsible for InChI?

InChI is a project of IUPAC described at: http://www.iupac.org/inchi/

The current members of the Project are:

  • Task Group
    Chairman: A. McNaught
    Members: S. Heller, S. Stein, D. Tchekhovskoi, J. Kahovec and A. Yerin.

Where can I find out more?

In reverse chronological order:

  1. The technical manual and other material in the distribution is the currently normative reference. The download of the current release is linked from the InChI page at the IUPAC website.
    http://www.iupac.org/inchi

  2. An Open Source/Open Access/Open Data and the IUPAC International Chemical Identifier - (InChI), Stephen R. Heller, ACS Washington DC meeting, August 28th 2005.
    http://www.hellers.com/steve/pub-talks/acs-805/frame.htm

  3. Chemical Naming Method Unveiled, Sophie Rovner, C&E News, 22nd August 2005, Volume 83, Number 34, pp39-40.
    http://pubs.acs.org/isubscribe/journals/cen/83/i34/html/8334sci1.html

  4. Analysis of a Set of 2.6 Million Unique Compounds Gathered from the Libraries of 32 Chemical Providers, A. Monge, A. Arrault, C. Marot, L. Morin-Allory.
    http://www.univ-orleans.fr/icoa/eposter/eccc10/monge/

  5. International chemical identifier goes online, Chem. World, 16 May 2005.
    http://www.rsc.org/chemistryworld/Issues/2005/June/this_month/International_chemical_identifier.asp

  6. Application of InChI to Curate, Index, and Query 3-D Structures, M. D. Prasanna, Jiri Vondrasek, Alexander Wlodawer, T.N. Bhat, PROTEINS: Structure, Function and Bioinformatics, 60:1-4 (2005)
    Note: InChI version 0.932 is used throughout this article.
    http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=pubmed&dopt=Abstract&list_uids=15861385

  7. Enhancement of the chemical semantic web through the use of InChI identifiers, Simon J. Coles, Nick E. Day, Peter Murray-Rust, Henry S. Rzepa, Yong Zhang, Organic & Biomolecular Chemistry, 2005, 3(10), 1832 - 1834, DOI:10.1039/b502828k
    http://pubs.rsc.org/ej/OB/2005/b502828k.pdf
    The supplementary data from this paper can be found at: http://www.rsc.org/suppdata/ob/b5/b502828k/index.html

  8. More on InChI, Distributed Structure-Searchable Toxicity (DSSTox) Public Database Network.
    http://www.epa.gov/nheerl/dsstox/MoreonINChI.html

  9. Googling for INChIs; A remarkable method of chemical searching, P. Murray-Rust, H. S. Rzepa and Y. Zhang, W3C Workshop on Semantic Web for Life Sciences, 27-28 October 2004, Cambridge, Massachusetts USA.
    http://lists.w3.org/Archives/Public/public-swls-ws/2004Oct/att-0019/

  10. The INChI as an LSID for molecules in lifescience, P. Murray-Rust, H. S. Rzepa and S. Stein, W3C Workshop on Semantic Web for Life Sciences, 27-28 October 2004, Cambridge, Massachusetts USA.
    http://lists.w3.org/Archives/Public/public-swls-ws/2004Sep/att-0026/inchi.html

  11. Representation and use of Chemistry in the Global Electronic Age, P. Murray-Rust, H. S. Rzepa, S. M. Tyrrell and Y. Zhang, Org. Biomol. Chem., 2004, 2, 3192 to 3203.
    http://www.ch.ic.ac.uk/rzepa/obc/

  12. That INChI Feeling, David Bradley, Reactive Reports, Sep 2004 (issue 40)
    http://www.reactivereports.com/40/40_3.html

  13. XML in Chemistry and Chemical Identifiers, Tony N. Davies, Chemistry International, Vol. 26, No. 4, July-August 2004
    http://www.iupac.org/publications/ci/2004/2604/pp6_2002-022-1-024.html

  14. IUPAC Project Meetings: Extensible Markup Language (XML) Data Dictionaries and Chemical Identifier, Wendy A. Warr, NIST, Gaithersburg, Maryland, USA, November 12-14 2003
    This contains many useful reports of discussion on the subject.
    http://www.warr.com/inchi.pdf

  15. An Open Standard for Chemical Structure Representation: The IUPAC Chemical Identifier, Stephen E. Stein, Stephen R. Heller, and Dmitrii Tchekhovskoi, in Proceedings of the 2003 International Chemical Information Conference (Nimes), Infonortics, pp. 131-143.
    http://www.hellers.com/steve/resume/p157.html

  16. Unique Labels for Compounds, Michael Freemantle, C&EN, Vol.80, No. 48, 2 Dec 2002
    http://pubs.acs.org/isubscribe/journals/cen/80/i48/html/8048sci1.html

  17. Chemists synthesize a single naming system, Nature, 23 May 2002
    http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=12024181&dopt=Abstract

  18. That IChI Feeling, David Bradley, The Alchemist, April 24th 2002
    http://www.chemweb.com/alchem/articles/1015947904091.html

  19. What's in a Name, The Alchemist, March 21st 2002
    http://www.chemweb.com/alchem/articles/1015947151360.html

Is there an InChI mailing list?

Yes. A SourceForge project for InChI has been set up to provide an Open Source focus for the development of InChI facilities and applications under its Artistic Licence. The project has a discussion list where "comments, questions and offers of help are welcomed". To sign up for the discussion list, visit this page.

Are there other InChI FAQs?

As far as we are aware, there are currently no other InChI FAQs available on the web.

Who maintains this FAQ?

This FAQ has been created by Nick Day as part of his PhD project on the Chemical Semantic Web at the Unilever Centre, Department of Chemistry. For any questions, comments or additions, please send an e-mail to ned24@cam.ac.uk. We have reused material from a number of sources which we acknowledge below.

Our group (http://wwmm.ch.cam.ac.uk) has done a considerable amount of work developing InChI applications (referred to by "we" in the text).

Introduction to InChI

What is an InChI?

An InChI (IUPAC International Chemical Identifier) is a string of characters capable of uniquely representing a chemical substance. It is derived from a structural representation of that substance in a way designed to be independent of the way that the structure was drawn (thus a single compound will always produce the same identifier). It provides a precise, robust, IUPAC approved tag for representing a chemical substance.

Example of different structures giving the same InChI string

Wasn't it called IChI?

Yes it was. Originally it was called the IChI (IUPAC Chemical Identifier) and in July 2004 was renamed INChI (IUPAC-NIST Chemical Identifier) to acknowledge the development work at NIST. It was renamed again in November 2004 to InChI (IUPAC International Chemical Identifier) to allow trademark, copyright and licensing issues to be resolved before distribution of version 1.0.

Is the name (or capitalisation) likely to change again?

No, this is the third and final name. The capitalisation should now also be finalised and reflects necessary IPR aspects.

So....how is InChI pronounced?

We believe the correct pronunciation is IN'chee.

Is the version number considered to be part of the InChI string?

Yes, the version number is an inalienable part of the Identifier.

What is the purpose of the InChI?

The project aim is to create a method for generating a freely available, non-proprietary identifier for chemical substances that can be used in printed and electronic data sources, thus enabling easier linking of diverse data compilations and unambiguous identification of chemical substances.

InChI is not a registry system. It does not depend on the existence of a database of unique substance records to establish the next available sequence number for any new chemical substance being assigned an InChI. The chemical structure of a compound is its true identifier, but structures are not unique or convenient for computers. So the project seeks to convert the structure (in the form of its connection table) to a unique string of characters by fixed algorithms, generating the InChI. Two requirements must be fulfilled in doing this:

  • Different compounds must have different identifiers, with all the information needed to distinguish the structures.
  • Any one compound has only one identifier, including only the necessary information to identify that compound.
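As a rough illustration of the "easier linking" aim, the sketch below merges records from two data compilations using the InChI string as the shared key. The record fields and catalogue number are hypothetical; the identifier is the naphthalene example used elsewhere in this FAQ.

```python
from collections import defaultdict

NAPHTHALENE = "InChI=1/C10H8/c1-2-6-10-8-4-3-7-9(10)5-1/h1-8H"

def link_by_inchi(*sources):
    """Merge records from several sources, keyed by their InChI string."""
    linked = defaultdict(dict)
    for source in sources:
        for inchi, fields in source.items():
            linked[inchi].update(fields)
    return dict(linked)

# Hypothetical records from two unrelated compilations.
supplier_catalogue = {NAPHTHALENE: {"catalogue_no": "X-001"}}
property_table = {NAPHTHALENE: {"name": "naphthalene"}}

merged = link_by_inchi(supplier_catalogue, property_table)
print(merged[NAPHTHALENE])
```

Because one compound has only one identifier, no registry lookup is needed to decide that the two records describe the same substance.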

What is the scope of the InChI?

Taken from the InChI Technical Manual.

"It was agreed at IUPAC meetings prior to the start of this project that the first version of the InChI should cover well-defined, covalently-bonded organic molecules. It was also agreed to include substances with mobile hydrogen atoms (tautomers, for instance). In the course of this project, it was found that [by] straightforward extension organometallic compounds could be represented. Methods were found to also include variable protonation. Also, the present version only considers traditional organic stereochemistry (double bond - sp2 and tetrahedral - sp3) and the most common forms of H-migration (tautomerism). However, the layered structure of the InChI allows future refinements with little or no change to the layers currently used."

The current version of InChI can also represent salts and structures with isotopic enrichment.

Is InChI free?

Yes it is.

Is InChI open?

It is intended that the source code is freely re-usable and a license has been developed to reflect that. To view the InChI License Agreement go to http://www.iupac.org/inchi/license.html. Since the InChI code has a normative role (i.e. it acts as the final arbiter) it is not freely modifiable (minor modifications are allowed for portability).

Is InChI stable?

The InChI specification is now stable with the first full release version - InChI v1.

Do I need to know about InChI versions?

To date there have been two test releases of InChI: 0.932beta and 1.12beta and one Release Candidate: 1.0RC. These versions are now obsolete due to the first full release of InChI, available from http://www.iupac.org/inchi. As of September 2005 we estimate that >99% of InChIs on the Web have been created using InChI 1.

While you do not necessarily need to know about previous versions of InChI, you may well come across examples on the internet - at least before conscientious webmasters update them to the current version! InChIs generated with the full release start with "InChI=1" whereas InChIs generated with the previous releases will start with "INChI=0.932Beta", "INChI=1.12Beta" and "InChI=1.0RC".
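A minimal sketch of how the prefixes listed above could be used to flag identifiers from obsolete releases. The function name and the returned labels are our own, and the check for the full release assumes that the version is followed directly by the "/" delimiter, as in the naphthalene example.

```python
# Prefixes of the obsolete releases, as listed in this FAQ.
OBSOLETE_PREFIXES = {
    "INChI=0.932Beta": "0.932beta (test release)",
    "INChI=1.12Beta": "1.12beta (test release)",
    "InChI=1.0RC": "1.0RC (release candidate)",
}

def inchi_release(identifier):
    """Guess which InChI release produced an identifier from its prefix."""
    # Check the obsolete prefixes first: "InChI=1.0RC" also starts with "InChI=1".
    for prefix, label in OBSOLETE_PREFIXES.items():
        if identifier.startswith(prefix):
            return label
    if identifier.startswith("InChI=1/"):
        return "1 (full release)"
    return "unknown"

print(inchi_release("InChI=1/C10H8/c1-2-6-10-8-4-3-7-9(10)5-1/h1-8H"))
```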

What does an InChI look like?

It is a text string composed of segments (layers) separated by delimiters (/). If multiple disconnected parts of a structure are present, semicolons within each layer separate them.

Each layer in an InChI string contains a specific class of structural information. This format is designed for compactness, not readability, but can be interpreted manually. The length of an identifier is roughly proportional to the number of atoms in the substance. Numbers inside a layer usually represent the canonical numbering of the atoms from the first layer (chemical formula) except H.

The InChI string for naphthalene, for instance, is: InChI=1/C10H8/c1-2-6-10-8-4-3-7-9(10)5-1/h1-8H

Structure of naphthalene
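The naphthalene string can be taken apart with ordinary string handling. The sketch below splits it at the "/" delimiters and checks that the numbers in the connection segment are the canonical numbers of the ten non-hydrogen atoms. It is an illustration only and is not part of the InChI software.

```python
import re

inchi = "InChI=1/C10H8/c1-2-6-10-8-4-3-7-9(10)5-1/h1-8H"

# The first piece carries the version, the second the chemical formula.
prefix, formula, *rest = inchi.split("/")

# The connection segment is the one that starts with "c".
connections = next(seg[1:] for seg in rest if seg.startswith("c"))

# Every number in that segment is a canonical atom number.
atom_numbers = sorted({int(n) for n in re.findall(r"\d+", connections)})
carbon_count = int(re.match(r"C(\d+)", formula).group(1))

print(prefix, formula)
print(atom_numbers)
```

The atom numbers run from 1 to 10, matching the ten carbon atoms in the formula; hydrogen atoms are not numbered.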

Who is using InChI?

InChI is currently being incorporated into a variety of public and commercial chemistry databases:

  • US National Institute of Standards and Technology (NIST) - 150,000 structures
  • NIH/NCBI/PubChem project - >3.2 million structures
  • Thomson ISI - 2+ million structures
  • US National Cancer Institute (NCI) Database - 23+ million structures
  • US Environmental Protection Agency (EPA) DSSTox Database - 1450 structures
  • Kyoto Encyclopaedia of Genes and Genomes (KEGG) database - 9584 structures
  • University of California at San Francisco ZINC - >3.3 million structures
  • BRENDA enzyme information system (University of Cologne) - 36,000 structures
  • Chemical Entities of Biological Interest (ChEBI) database of the European Bioinformatics Institute - 5000 structures
  • University of California Carcinogenic Potency Project - 1447 structures
  • Compendium of Pesticide Common Names - 1437 structures (as of 2005-03-03)

Journals that have adopted InChI:

  • Nature Chemical Biology.
  • Beilstein Journal of Organic Chemistry.

Software that has incorporated InChI generation:

InChI websites

InChI-to-Structure software

InChI generation web-services:

InChI web-search services:

  • We have provided an InChI/Google Web Service which can be found at http://wwmm-svc.ch.cam.ac.uk/wwmm/html/googleinchiserver.html. This allows you to draw a structure into a 2D editor and then at the click of a button all InChI-fied instances of that molecule on the web will be located and summarised for you!!

Other InChI developments:

  • The European Patent Office is considering its use as their standard for structure representation.

Where can I find examples of InChIs?

We have generated InChIs for ca. 250,000 molecules from the NCI database and 9585 molecules from the KEGG database. These can be found at:

Other InChIs can be found at:

Are there any tutorials?

We are currently working on other InChI tutorials and demonstrations to supplement this FAQ. They will be posted here as soon as they are complete!

Understanding InChIs

Why are layers used in an InChI?

Taken from the InChI Technical Manual.

"Since a given compound may be represented at different levels of detail, in order to create a robust expression of chemical identity it was decided to create a hierarchical 'layered' form of the Identifier, where each layer holds a distinct and separable class of structural information, with the layers ordered to provide successive structural refinement."

Layers are used because they are logical (they separate the variables) and understandable; they are flexible for chemists (they represent known levels of information) and extra layers could easily be added to future releases of the identifier (e.g. conformations, coordination etc).

How is a layer represented in the identifier?

Layers and sub-layers are both separated by the "/" delimiter. All layers and sub-layers (except for the chemical formula sub-layer of the Main Layer) start with /? where ? is a lower-case letter to indicate the type of information held in that layer. If we look again at the InChI for naphthalene:

Diagram highlighting the delimiters between layers and sub-layers in an InChI

If you would like to view a full list of all the possible sub-layers in an InChI string it is worth consulting page 9 of the Technical Manual in the InChI Release.
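A small sketch of the prefix-letter convention: each segment after the chemical formula is labelled by its leading lower-case letter. The function name is hypothetical, and the sketch assumes the identifier contains a chemical formula segment. A list of pairs is returned rather than a dictionary because the same letter may occur more than once in one identifier.

```python
def labelled_segments(inchi):
    """Return (label, content) pairs for each '/'-separated segment."""
    version, formula, *segments = inchi.split("/")
    labelled = [("version", version.split("=", 1)[1]), ("formula", formula)]
    # Every later segment starts with a lower-case letter naming its type.
    labelled += [(seg[0], seg[1:]) for seg in segments if seg]
    return labelled

for label, content in labelled_segments(
    "InChI=1/C10H8/c1-2-6-10-8-4-3-7-9(10)5-1/h1-8H"
):
    print(label, content)
```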

Specifically, what are InChI layers?

There are 6 InChI layer types, each representing a different class of structural information:

  1. Main layer
  2. Charge layer
  3. Stereochemical layer
  4. Isotopic layer
  5. Fixed-H layer
  6. Reconnected Layer

Note: The Fixed-H layer is optional and can be selected by unchecking the 'Mobile H Perception' box in the InChI generation program. The Reconnected Layer is also optional and can be selected by checking the 'Include Bonds to Metal' box in the InChI generation program.

Structures showing InChIs with each of the 6 different layer types

While the InChI is divided up into different layers to describe different types of structural information, each of these layers is also split into sub-layers to allow full description of each part of the structure (note: there is no sub-sub-layering). For instance, the Main layer can be split up into three sub-layers:

  1. Chemical formula
  2. Atom connections
  3. Hydrogen atoms

You can see that the first of the structures above has an InChI with all three sub-layers of the Main layer.

Breakdown of sublayers in the Main layer

The Main layer is also the only layer that will appear in every possible InChI that can be generated and the only sub-layer of the Main layer that will always appear is Chemical Formula.

If you would like to view a full list of all the possible sub-layers in an InChI string it is worth consulting page 9 of the Technical Manual in the InChI Release.

Isn't InChI too complicated?

The InChI for the structure below shows the challenge in representing our different views of chemistry. For example, to a bioscientist "glutamic acid" and "glutamate" are the same thing, but to a computational chemist the loss of a proton, or its variable site of attachment, is critical. InChI is the only approach that allows us to describe this flexibility.

A structure showing all possible InChI layer types

Most of us can probably use a simple subset of the layers, such as in the naphthalene example, where there is no complication. If you require fuzzy concepts and searches, you will develop a knowledge of which part of InChI you need.

And remember that much chemistry is currently represented in an imprecise form or even missing, stereochemistry being an obvious example.

Is information for each layer required in the input information?

Taken from the InChI Technical Manual.

"No, the 'layered' model allows chemists to represent chemical substances at a level of detail of their choosing. Except for the Main layer (atoms and their bonds), the presence of a layer is not required and appears only when corresponding input information has been provided."

Example of extra layers being added to an InChI string as more information is input in the chemical structure

Are layers reusable?

Detailed information contained in a layer depends on preceding layers, so layers may not be 'excised' and reused. However, bottom layers may be 'pruned', leaving a valid, though less constraining InChI.

For instance, if all layers following formula are eliminated, the InChI will apply to all substances with that formula.

Diagram showing how pruning layers from an InChI affects the structure that is represented
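Pruning can be sketched as truncating the string at a "/" delimiter. The helper below is hypothetical and works purely on the text: it does not check that the cut falls on a layer boundary, so choosing a sensible cut is left to the caller.

```python
def prune(inchi, keep_segments):
    """Keep the version prefix plus the first `keep_segments` segments."""
    if keep_segments < 1:
        raise ValueError("the chemical formula must be kept")
    parts = inchi.split("/")
    return "/".join(parts[: 1 + keep_segments])

naphthalene = "InChI=1/C10H8/c1-2-6-10-8-4-3-7-9(10)5-1/h1-8H"

# Keeping only the formula gives an identifier that applies to
# every substance with the formula C10H8.
print(prune(naphthalene, 1))
```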

Is InChI extensible?

Yes, as InChI is composed of hierarchical layers, new layers could be added to the specification to refine the information represented by current layers. Future versions of InChI, for example, could include phase information and crystal structure, conformations, electronic states and additional classes of stereochemistry.

Consideration is currently being given to further extension of the program to include polymers.

Can an InChI be invalid?

If an Identifier is produced, it will be a unique representation of whatever was submitted and is not 'invalid'. While some checking is done, and warnings are issued if the input structure is ambiguous, errors and ambiguities in the input will remain in the output.

If, for example, a molecule with hypervalent C is submitted (e.g. 'C(CH3)6'), a valid InChI will be produced, though with a warning that the carbon valence has been exceeded.

Example of an InChI produced for an invalid structure

InChI should not be used to represent mixtures (except in the special case of a racemic mixture).

How do I check that the InChI represents my compound?

To better understand what InChI does it is strongly suggested that you run the Win32 GUI application wInChI-1.exe against your test structures because it displays:

  • input structures as InChI understands them, with all H and charges;
  • their initial numberings;
  • canonical numberings, equivalence, tautomeric groups;
  • stereo parities;
  • bond changes (these cannot be observed in any other way but in wInChI or under a debugger).

You may try wInChI under WINE -- we have a report about a success in running wInChI-1.exe under WINE build 20050524 running on Mandrake 10.1.

May I edit an InChI independently of the wInChI-1.exe or cInChI-1.exe results?

Although this may give apparently reasonable answers, it should NOT be done: it is error-prone and may break relations within the InChI. Editing an InChI by hand is HIGHLY DEPRECATED.

What is the benefit of using w-InChI (the GUI application) over c-InChI (the command line application)?

To better understand what InChI does, it is strongly recommended that you run the Win32 GUI application wInChI-1.exe against your test structures, as it displays:

  • Input structures as InChI understands them, with all H and charges.
  • Their initial numberings.
  • Canonical numberings, equivalence and tautomeric groups.
  • Stereo parities.
  • Bond changes (these cannot be observed in any other way but in w-InChI or under a debugger).

It is also possible to run w-InChI under WINE. Some parties have been successful in running w-InChI.exe under WINE build 20050524 running on Mandrake 10.1.

Chemical Structure Representation Issues

How is an InChI created from the input information?

An InChI identifier is created from an input connection table (in MOL, SDF or CML format) in three steps:

  1. Normalization - conventions are removed while maintaining a complete description of the compound. Steps involved are:
    • Ignore electron density and use simple atom connectivity only.
    • Disconnect salts and metal atoms in organometallic compounds.
    • Normalise mobile-hydrogens, variable protonation and charge.

  2. Canonicalization - a set of atom labels is algorithmically generated that does not depend on how the structure was initially drawn. The algorithm used for this step is based on the Morgan algorithm [1].

  3. Serialization - the set of labels derived during canonicalization are converted into a string of characters, the InChI.
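To give a feel for the canonicalization step, here is a toy version of the extended-connectivity idea from the Morgan algorithm. It is not the algorithm used by the InChI software, which is only described above as being based on it. The graph is the ten-carbon naphthalene skeleton with an arbitrary input numbering of our own choosing, and all function names are hypothetical.

```python
def morgan_values(adjacency):
    """Iterate extended-connectivity values until the number of
    distinct values stops increasing, then return the last useful set."""
    values = {atom: len(nbrs) for atom, nbrs in adjacency.items()}
    n_classes = len(set(values.values()))
    while True:
        new_values = {
            atom: sum(values[n] for n in nbrs)
            for atom, nbrs in adjacency.items()
        }
        new_n = len(set(new_values.values()))
        if new_n <= n_classes:
            return values
        values, n_classes = new_values, new_n

def naphthalene_skeleton():
    """Ten atoms in a ring (0..9) plus one bridging bond between 4 and 9."""
    edges = [(i, (i + 1) % 10) for i in range(10)] + [(4, 9)]
    adjacency = {i: [] for i in range(10)}
    for a, b in edges:
        adjacency[a].append(b)
        adjacency[b].append(a)
    return adjacency

def relabel(adjacency, mapping):
    """Redraw the same graph with a different input numbering."""
    return {
        mapping[a]: [mapping[n] for n in nbrs]
        for a, nbrs in adjacency.items()
    }

drawn_one_way = naphthalene_skeleton()
drawn_another_way = relabel(drawn_one_way, {i: 9 - i for i in range(10)})

print(sorted(morgan_values(drawn_one_way).values()))
print(sorted(morgan_values(drawn_another_way).values()))
```

Both drawings give the same multiset of values, with three distinct classes of atom. A real canonicalization goes further and breaks the remaining ties to give every atom a unique number.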

How does InChI deal with the many equivalent ways of arranging bonds and charges in delocalized structures?

When computing atom numbers (during canonicalization), bond orders and charge positions are ignored. Electron density and pi-electrons are important for describing a large portion of interesting chemistry, but they can be ignored here as they are not important for naming.

This does not introduce ambiguity as long as all H-atoms are accounted for. InChI only uses bond orders for perceiving stereochemistry ((Z)- vs (E)- but-2-ene, for example) and mobile H. It only stores the net charge, without regard to position.

Example of the differing delocalisation giving the same InChI string

Note that the above InChI does not contain any information on the double bond position or charges.
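The effect of ignoring bond orders can be sketched with the two Kekulé drawings of benzene (our own example, with a hypothetical atom numbering). Once bond orders are dropped, the two drawings have an identical set of connections.

```python
def skeleton(bonds):
    """Drop bond orders, keeping only which atoms are connected."""
    return frozenset(frozenset((a, b)) for a, b, _order in bonds)

# (atom, atom, bond order) for the two alternating arrangements.
kekule_a = [(1, 2, 2), (2, 3, 1), (3, 4, 2), (4, 5, 1), (5, 6, 2), (6, 1, 1)]
kekule_b = [(1, 2, 1), (2, 3, 2), (3, 4, 1), (4, 5, 2), (5, 6, 1), (6, 1, 2)]

print(kekule_a == kekule_b)                      # the drawings differ
print(skeleton(kekule_a) == skeleton(kekule_b))  # the connectivity does not
```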

How is stereochemistry represented?

The two types of represented stereochemistry, sp2 and sp3, are expressed as separate sub-layers. At present, sp2 stereochemistry is extracted from input x,y coordinates (and is not calculated for rings of 7 or fewer members), while sp3 stereochemistry is derived from 'in-out' wedge bond types or x,y,z coordinates. Relative, absolute and racemic stereoisomers are distinguished.

Stereodescriptors may also be explicitly entered as unknown, which is distinct from cases where an expected stereodescriptor is missing, in which case an 'unspecified' (or undefined) tag is used.

Therefore, depending on the completeness of the stereo description entered, a variety of sets of stereodescriptors are possible for a structure with multiple stereocenters. This is a common source of ambiguity and error in chemical structure representation. An advantage of the layered representation is that all of these variations are contained in a single layer.

Example of an InChI with both double-bond and sp3 stereochemistry

What is the difference between ? and u in chiral centres?

A non-stereo bond (straight line) encodes to ?, a wiggly bond encodes to u. The semantics of the wiggly bond are poorly defined and may mean "unknown" or "mixture" according to the author. InChI is capable of identifying unknown stereocentres and we deprecate the use of wiggly (or any other similar) bonds.

Examples of u and ? stereochemistry

Why may a stereo layer appear several times in a single InChI?

As additional levels are added, stereodescriptors may change, and this may result in a new stereo layer.

An example is shown below. In the Stereo Layer no isotopes are taken into consideration, so H and D are seen as equivalent and only the chiral centre at atom 3 is calculated. In the Isotopic Layer H and D are seen as different, so an Isotopic Stereo sub-layer is included with stereo information for chiral centres 2 and 3.

Example of an InChI with more than one stereo layer

How does InChI represent salts?

Each separate, covalently bonded entity in a salt is treated independently. The information for each component is separated by ';' in each layer. Note, however, that in keeping with common convention, in the chemical formula sub-layer of the Main layer the components are separated by a dot. InChI uses simple rules to separate these components if they are entered as a single entity.

Example of an InChI for a salt

Click here for an explanation of the InChI definition of a salt.

How does InChI represent organometallic compounds?

No widely accepted means of representing organometallic substances exists. Ferrocene, for instance, may be drawn with the central iron atom connected to each of the two rings, to each of the atoms in the rings, to each of the bonds in the rings or not connected at all.

The default approach taken by InChI is to represent the structure as the individual, interconnected components along with the separated, unconnected metal atoms. For a large majority of organometallic compounds, this provides a unique InChI.

If a bonded organometallic structure representation is desired, however, it may be specified by selecting "Include bonds to metal", which adds an extra 'Reconnected' layer to the end of the current InChI. This layer, however, may depend on drawing conventions.

Showing the two different representations of ferrocene possible by InChI

Note that in the InChI for the default representation, the layers are sectioned by ; to separate the information for each of the components. To minimize the length of the InChI, as the two cyclopentadienyl rings are identical both are represented in the same section of each layer (indicated by 2*). In the Reconnected Layer, the structure is treated as one large component.

What is the difference between salt and metal disconnection?

With metal disconnection, the user may request to append the Reconnected layer (which represents the structure as given in the input structure) to the normalized InChI. This is done by selecting "Include bonds to metal" in the InChI generation program.

A disconnected salt cannot be reconnected in this way. For instance, if you were to enter sodium ethanoate with a bond between O and Na (as below) into the InChI generation program, this bond would be disconnected with no choice to reconnect it.

Sodium ethanoate

Click here for an explanation of the InChI definition of a salt.

How does InChI deal with structures that are composed of multiple interconnected (covalently bonded) components?

Many substances are best represented as multiple, independent structures. InChI represents such substances by appending the information for each component within each layer and sorting the components using a set of fixed rules (the components appear as conventional 'dot-disconnected' units in the formula layer and are separated by semicolons in the other layers).

InChI creation assumes that if multiple structures are present in a single input connection table, they are components of a single compound. In most cases, it is possible to extract the InChI of each component from a composite InChI by excising the corresponding part of each layer. The order of the components in the layers is strictly defined.

How does InChI represent compounds with mobile H-atoms (tautomerism, for example)?

The Main Layer must be the same for any arrangement of mobile hydrogen atoms. This is achieved by the logical removal of mobile-H atoms and the tagging of H-donor and H-receptor atoms.

As an example we shall look at guanine (taken from the InChI Technical Manual), some of whose tautomeric structures are shown below:

Tautomeric structures of guanine

If we create an InChI for one of those tautomeric forms (with the optional Fixed-H layer selected) we get:

The InChI of Guanine

Note: Donors and receptors of H and changeable bonds are highlighted.

If we take a closer look at the H-atom sub-layer of the Main Layer we see that on generation the InChI program has signified that atom number 1 has one H and that 4H atoms are shared by atoms 6, 7, 8, 9, 10 and 11.
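
The reading of the H-atom sub-layer given above can be reproduced with a few lines of Python. This is only a sketch: the function name mobile_h_groups is made up, the input string is written out from the description above (one H on atom 1, four H shared by atoms 6 to 11), and charged mobile groups are not handled.

```python
import re


def mobile_h_groups(h_layer):
    """Return (hydrogen_count, atom_numbers) for each mobile-H group.

    h_layer is the text that follows '/h'. A group such as '(H4,6,7,8)'
    means four hydrogens shared by atoms 6, 7 and 8; a bare '(H,3,4)'
    means a single shared hydrogen.
    """
    groups = []
    for count, atoms in re.findall(r"\(H(\d*)((?:,\d+)+)\)", h_layer):
        atom_numbers = [int(a) for a in atoms.split(",") if a]
        groups.append((int(count) if count else 1, atom_numbers))
    return groups


print(mobile_h_groups("1H,(H4,6,7,8,9,10,11)"))
# [(4, [6, 7, 8, 9, 10, 11])]
```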

Why is there a Fixed-H layer if tautomerism is represented in the Main layer?

The Fixed-H layer is useful if you wish to represent one particular tautomer of a given structure. If we create InChIs for the tautomers below with 'Mobile H Perception' checked, then the normalisation performed by the generation program will make their InChIs identical.

InChIs of two different tautomers with Mobile H Perception

However, if we uncheck 'Mobile H Perception' then the extra Fixed-H layer will be appended. This layer is essentially an InChI for the whole structure without normalization of the mobile hydrogens, thus giving an InChI that specifies a single tautomeric form of the structure.

InChIs of two different tautomers without Mobile H Perception

How does InChI manage isotopes?

InChI represents isotopes as a single layer in the identifier.

Example of an isotopic layer

For each isotopically enriched atom in the structure, the InChI layer holds that atom's canonical number followed by the isotopic shift (e.g. +0 for chlorine-35 (35-35) or +1 for carbon-13 (13-12)), followed by the isotopic hydrogen (D or T) if present, e.g.

Example of a structure with isotopic hydrogens (D or T)
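
The isotopic shift arithmetic can be sketched in Python as follows. The reference mass numbers for chlorine and carbon come from the two examples above; the other entries in the table, and the function names, are assumptions added for illustration and should be checked against the Technical Manual.

```python
# Reference mass number per element. Cl and C follow the examples in the
# text; H, N and O are assumed here for illustration only.
REFERENCE_MASS = {"H": 1, "C": 12, "N": 14, "O": 16, "Cl": 35}


def isotopic_shift(element, mass_number):
    """Signed difference between the isotope and the reference mass number."""
    shift = mass_number - REFERENCE_MASS[element]
    return f"{shift:+d}"


def isotopic_entry(canonical_number, element, mass_number):
    """Canonical atom number followed by the isotopic shift."""
    return f"{canonical_number}{isotopic_shift(element, mass_number)}"


print(isotopic_shift("Cl", 35))    # +0
print(isotopic_shift("C", 13))     # +1
print(isotopic_entry(2, "C", 13))  # 2+1
```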

The only complexity arises when there are isotopically labelled hydrogens that can undergo tautomerism. In the Hydrogen sub-layer of the Main layer these hydrogen atoms are treated as non-isotopic; the number of these mobile isotopic hydrogen atoms is appended to the "exchangeable isotopic hydrogen atoms" part of the isotopic layer. The same is done for isotopic hydrogen atoms that may be subject to heterolytic dissociation in aqueous solution (for example D in RS-D).

Example of a structure with isotopic hydrogens that can undergo tautomerism

The Hydrogen sub-layer of the Main layer does not take isotopic labelling into account and thus treats the deuterium atoms as hydrogen. The layer states that the four H are shared between atoms with canonical numbers 2, 3 and 4 (the two N and O).

The Exchangeable Isotopic-H sub-layer of the Isotopic layer states that two deuterium atoms are shared over the whole structure.

How does InChI manage charge?

For most compounds the /q layer uses a positive or negative integer to represent the actual charge on the species and the formula represents the correct composition.

For certain hydrides, or compounds derivable from hydrides, the charge is derived by adding proton(s) to, or removing proton(s) from, a neutral hydride. The formula is then NOT the actual composition of the compound but that of the neutral hydride from which it is derived by (de)protonation. Although a table is given in Appendix 1 of the Technical Manual, you should not try to work out which method the InChI will use.

Examples of charge

Can InChI represent radicals?

No. InChI can be used to calculate the total number of electrons and nuclear charge from which the parity of electrons can be worked out. However it cannot represent triplet states.

Examples of radicals

Can InChI represent different spin states?

No.

What Can InChI Currently Not Represent?

InChI currently does not support the representation of:

  • Polymers
  • Conformers
  • Complex organometallics
  • Cluster molecules
  • Polymorphs
  • Excited state and spin isomers
  • Non-local stereochemistry/chirality
  • Topological isomers
  • Mixtures
  • Reactions
  • Unspecific isotopic enrichment
  • Markush Structures

InChI Syntax

Does the formula always represent the complete composition of the substance?

Normally yes, but if a charged species can be described by (de)protonation from a neutral compound, then it will not. For example:

Hydrides of oxygen

Is there always a connection table layer (/c)?

No. For example:

  • This might not be known: ethylene oxide and ethanal both have a formula C2H4O and the string -- InChI=1/C2H4O -- is compatible with either. The semantics of such an InChI are not precisely defined.
  • Mononuclear hydrides such as OH2 have no connection table. (Note that the InChI for H-H, InChI=1/H2/h1H, is an inconsistency.)

Is there always an H layer (/h)?

No. Compounds that do not contain hydrogen, such as CO2, have no /h layer.

Does the total number of hydrogens in the /h layer represent the number of hydrogens in the input compound?

Normally yes, but if the /p layer is present the count must be adjusted by the number of protons that it specifies.

Examples of (de)protonation

What does the /p layer mean?

This is the number of protons that must be added to or removed from the formula to give the input composition. See examples above.
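
The adjustment can be written out as a small Python sketch. The function name input_hydrogen_count is invented for this example, the input strings are illustrative InChIs for protonated and deprotonated water, and InChIs with Fixed-H or Reconnected layers are assumed not to occur.

```python
import re


def input_hydrogen_count(inchi):
    """Hydrogens in the input composition: H in the formula plus the /p value."""
    _, formula, *layers = inchi.split("/")
    hydrogens = 0
    for component in formula.split("."):
        # optional leading multiplier, as in '2C5H5'
        multiplier = int(re.match(r"(\d*)", component).group(1) or 1)
        # 'H' with an optional count, but not He, Hf, Hg, Ho or Hs
        for count in re.findall(r"H(?![a-z])(\d*)", component):
            hydrogens += multiplier * int(count or 1)
    protons = 0
    for layer in layers:
        if layer.startswith("p"):
            protons = int(layer[1:])
    return hydrogens + protons


print(input_hydrogen_count("InChI=1/H2O/h1H2/p+1"))  # 3
print(input_hydrogen_count("InChI=1/H2O/h1H2/p-1"))  # 1
print(input_hydrogen_count("InChI=1/H2O/h1H2"))      # 2
```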

Comparing InChIs

Can I compare structures by looking at their InChIs?

Taken from the InChI Technical Manual.

"If two InChIs are the same, then it is safe to assume that the compounds that they represent are the same. However, the layered structure of InChI permits the representation of some compounds at different levels of detail or completeness.

If for example, one InChI is completely contained in another, then the second may be viewed as a more detailed representation of the first (for example, (Z)-but-2-ene may be viewed as a more detailed representation than but-2-ene). Or, for example, if one set of InChIs were derived from a collection with no stereo information and another contains complete stereo information, comparisons should be made with stereo information removed. Of course, manual confirmation may be necessary using chemical names if stereo distinctions are important."

Example of two compounds, one a more detailed description of the other

The above image shows two structures, one being a more specific description of the other. Both have identical Main Layers, so the chemical formula, atom connection and hydrogen information for the structures are identical. The added stereochemical and isotopic information in the lower structure gives it two extra layers in the InChI.
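
The "completely contained" test from the quotation can be approximated in Python. This sketch takes the simplest reading, in which the less detailed InChI must be a layer-by-layer prefix of the more detailed one; the function name is hypothetical and the but-2-ene strings are illustrative input.

```python
def is_more_detailed(detailed, general):
    """True if 'detailed' begins with every layer of 'general' and adds more.

    Only prefix containment is checked, which is a simplification.
    """
    d = detailed.split("/")
    g = general.split("/")
    return len(d) > len(g) and d[:len(g)] == g


but_2_ene = "InChI=1/C4H8/c1-3-4-2/h3-4H,1-2H3"
z_but_2_ene = "InChI=1/C4H8/c1-3-4-2/h3-4H,1-2H3/b4-3-"

print(is_more_detailed(z_but_2_ene, but_2_ene))  # True
print(is_more_detailed(but_2_ene, z_but_2_ene))  # False
```

Splitting on '/' rather than comparing raw string prefixes avoids a false match when one layer merely starts with the same characters as another.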

Can I compare structures by looking at specific layers from their InChIs?

Taken from the InChI Technical Manual.

"Values computed for each layer depend on prior layers. As a consequence, for example, two stereochemical layers for different compounds cannot be directly compared - comparisons must involve the complete set of preceding layers. Layers do not, however, depend on successive layers."

Example showing the comparison of two identical stereochemical layers from different compounds

The above image highlights why you cannot directly compare layers such as the stereochemical layer without taking earlier layers into account. The two compounds are clearly different (and thus have different main layers), yet both have the same stereochemical layer. This is because the stereochemical layer only holds the canonical number for a stereocentre and its stereochemistry, with all the information from the earlier layers used in calculating the stereochemistry being absent. Therefore, directly comparing the stereochemical layers of the two compounds would be analogous to stating "these two compounds are identical, as each has a stereocentre at atom no. 2 with stereochemistry of '-' ".

If two InChIs are the same, do they refer to the same compound?

If the compounds have been properly represented, then they should be identical regardless of the original method of representation. See Appendix 4 of the Technical Manual.

If two InChIs are different, are they different compounds?

Starting from the same structure, with the same degree of certainty in all facets, it is not possible to generate different InChIs.

Formally, you cannot assert that they are the same in all respects, although the differences may only represent different levels of knowledge. If they differ only in certain layer(s), or in the absence of layer(s), then they represent "the same compound" with different levels of knowledge. See here (1, 2) for examples.

How can I compare similar compounds?

It is not possible to use InChI syntax to compare molecules with different but similar connection tables.

It may be possible to compare different tautomers.

Examples of comparing InChIs of two tautomers

It may be possible to show that compounds differ in chirality.

Examples of comparing InChIs of two stereoisomers

If two compounds are declared enantiomers they will have the SAME /t string and differ only in the /m layer (/m0 or /m1).

Examples of comparing InChIs of two enantiomers
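
A check of this kind might look like the following Python sketch. It assumes single-component InChIs, the function name is invented, and the two alanine strings are used only as sample input.

```python
def are_declared_enantiomers(inchi_a, inchi_b):
    """True if the two InChIs differ in exactly one layer, /m0 versus /m1."""
    a = inchi_a.split("/")
    b = inchi_b.split("/")
    if len(a) != len(b):
        return False
    differing = [(x, y) for x, y in zip(a, b) if x != y]
    return len(differing) == 1 and set(differing[0]) == {"m0", "m1"}


l_ala = "InChI=1/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
d_ala = "InChI=1/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m1/s1"

print(are_declared_enantiomers(l_ala, d_ala))  # True
print(are_declared_enantiomers(l_ala, l_ala))  # False
```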

If two compounds have components in common (e.g. ions in salts, or ligands disconnected from metals) it will be possible to identify identical fields. For example, A+B- and C2+2B- will have fields which in principle can be syntactically separated.

Examples of excising components from layers in InChIs

The Current InChI Release

Where can I get the current InChI release?

InChI software is available as a free download and can be obtained from http://www.iupac.org/inchi.

The download includes programs for generating and testing the Identifier, along with a User Guide, a Technical Manual and sample structure files (MOL files, SDF files and CML files).

Where can I find the InChI Technical Manual?

The InChI Technical Manual is a document which is part of the InChI release (download as explained above).

How do I install InChI?

To use this program, first extract the contents of the zip file to a directory of your choice. To start the InChI Generator (which runs under 32-bit Windows Operating Systems) simply open the extracted file wInChI-1.exe.

For users of Unix-like operating systems or for those familiar with the Windows 'Command Prompt', an executable program is also provided - cInChI-1. The principal use of the program is to allow batch processing within other programs for the processing of multiple structure files. At present, this program is intended primarily for processing SDF files.

How do I create an InChI?

The InChI test release contains a User Guide that directs you through the creation of InChIs using the sample structure files and the generation program provided.

For a demonstration of how to create InChIs using the sample structures and generation program provided in the InChI release click here.

We have also implemented the current InChI release to create an InChI Web Service with which you can create InChIs. Not only can you use this to create InChIs from Mol, SDF or CML files, but you can also draw a structure and create the InChI directly from that. Information on how to use the Web Service is given on the page.

Can I link or call InChI from my program?

InChI is written in C and can be compiled on most systems. It can be packaged into a DLL for Windows or a library for UNIX. It is not available in other languages as it contains substantial algorithms which would require manual porting and conformance testing. In our own work we have been able to call it from Java using Runtime.exec. This is not ideal but it seems to work.
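
The same "run it as an external program" approach can be sketched in Python with the standard subprocess module. Everything specific to the command line here is an assumption to be checked against the User Guide: that the executable is called cInChI-1 and is on the search path, that it takes the input and output file names as its first two arguments, and that options are prefixed with '/' on Windows and '-' elsewhere. The function names are invented for this example.

```python
import subprocess
import sys


def build_cinchi_command(input_path, output_path, options=(), executable="cInChI-1"):
    """Build the argument list for the command line generator.

    Assumed convention: options are prefixed with '/' on Windows and '-'
    on other systems.
    """
    prefix = "/" if sys.platform.startswith("win") else "-"
    return [executable, input_path, output_path] + [prefix + opt for opt in options]


def run_cinchi(input_path, output_path, options=()):
    """Run the generator; raises CalledProcessError on a non-zero exit status."""
    command = build_cinchi_command(input_path, output_path, options)
    return subprocess.run(command, check=True, capture_output=True, text=True)


# Only the command is printed here; nothing is executed.
print(build_cinchi_command("structures.sdf", "structures.txt", ["FixedH", "RecMet"]))
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with file names that contain spaces.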

Which formats does InChI accept?

The InChI Test release accepts Mol files (*.mol), concatenated Mol files (*.sdf), CML files (*.cml) or the program output produced with the "Full auxiliary information" (i.e. the output produced when you click on the 'Write Result' button in the InChI Generator).

The simplest way is to drag the input structure file from the Windows Explorer directory list into the InChI window. Structures may also be copied from certain chemical structure editors (ISIS/Draw with the "Copy Mol/Rxnfile to the Clipboard" option, or ACD/ChemSketch) and pasted into the InChI window (select Edit -> Paste from the InChI menu). The input structure file pathname may also be provided as a command line option when you start wInChI-1.

For a demonstration of how to "copy-and-paste" structures from 2D drawing packages directly into the InChI generation program click here.

Selection of the input structure may also be done by clicking the 'Open' button in the top-left hand corner of the InChI generator.

With which chemical drawing packages can you 'cut-and-paste' directly into the InChI Generator?

The packages you can use to do this are (as far as we are aware - this is by no means exhaustive!):

  • BKChem
  • ChemDraw
  • Marvin
  • ACD/ChemSketch

Can I use InChI if I don't know the connection table?

Yes. If you do not know the connection table (i.e. do not have access to the *.mol, *.sdf or *.cml file representation of a chemical structure) then you can simply draw the structure in one of the 2D chemical drawing packages described above and then 'cut-and-paste' that structure into the Windows application.

Once this has been done, the InChI is created automatically for you and presented in the lower section of the application window.

Is there a way to generate an InChI if I have a connection table, but not in CML, Mol or SDF form?

There is, although the InChI generation program itself currently accepts only CML, Mol or SDF files.

We have set up an Open Babel Server at our WWMM Portal which can be used to convert to CML, Mol or SDF form from file types as listed here.

Help on how to use the Server is provided on the page.

Other than a connection table, what is needed to generate an InChI?

If the species has potentially mobile hydrogen atoms, the user needs to specify whether to represent the substance assuming H-mobility as associated with tautomerization or as a substance with all H-atoms fixed. Most users will probably wish to select "Mobile H Perception".

For a demonstration of switching "Mobile-H Perception" on and off and the effects it has on the InChI click here.

Also, if the structure is organometallic, the user needs to specify whether the fully bonded structure will be appended to the disconnected metal atoms structure (by selecting "Include bonds to metal"). Except for compounds composed entirely of covalent bonds, this representation is generally not preferable (as the InChI will contain a layer that may depend on drawing conventions).

Also, when appropriate, a stereodescription may be entered as a single enantiomer, a racemic mixture, or, with two or more stereocenters, as a relative stereochemical description.

Can I regenerate the structure from InChI?

Yes. If you have the program output produced with "Full auxiliary information" then you can simply open this file in the Windows application and the structure will be redrawn for you.

The InChI layers are created without loss, so several programs can recreate connection tables that hold this information. Some connection table formats cannot hold the information from certain layers, so that information will be lost.

We are committed to making InChI=>CML a completely lossless transformation.

What happens if the input structure has no mobile hydrogen atoms but mobile hydrogen perception is specified?

If there are no mobile hydrogen atoms, no mobile hydrogen layer will be present, so this specification will have no effect. The same idea holds for the stereochemistry and isotopic layers - if these features are not present in the structure, the corresponding layers are absent from the InChI.

What is the 'Auxiliary Information' in the InChI output?

The InChI code produces a range of auxiliary information. This includes warnings and errors, atom equivalence information for checking the correctness of the structure, a mapping of input atom positions to output positions, and 'reversibility' information for re-drawing the structure.

Program Flags

The InChI program has many flags both in the GUI (wInChI-1) and the command line (cInChI-1). Can they affect the generated InChI?

Yes. We detail the cases below. There are two sorts:

  1. those that affect the InChI
    • SAbs, SRel, SRac, SUCF, SNon, SUU, NEWPS - affect stereochemical layer
    • RecMet - affects metal bonding
    • FixedH, NoADP, DoNotAddH - hydrogens and specification of particular tautomer
    • Compress - produce compressed InChI
  2. those that affect the running, display, destination, etc. but not the InChI

You can toggle all of the above options in cInChI-1 (the command line application) but you cannot toggle SUU, Compress and DoNotAddH in wInChI-1 (the Windows application).

Note that all flags modify the InChI by appending layers, not by altering the core InChI.

What do the stereochemical flags do?

1. SAbs, SRel, SRac are three mutually exclusive statements about the overall chirality of the compound:

  1. SAbs: the compound is a single enantiomer whose absolute configuration is known at all of the known centres in the /t layer. There may also be unknown centres, but at least one must be absolutely known. The flag /s1 is set automatically. The /m flag will always be present (it is always absent in SRel and SRac). Note that two enantiomeric InChIs will have identical /t layers but will differ by having /m0 or /m1. The /m1 configuration is logically identical to the /m0 configuration with all signs changed.

    Example of inverting stereochemistry

    Note that /m forces /s1, but this flag is still a required part of the InChI. SAbs is the default, so beware that leaving this unaltered will imply that the compound is of known absolute configuration.

    If a mirror image of a structure is identical to the structure then no /s1 or /m will be added. The same will happen in the case of SRel or SRac. For example:

    Example of inverting stereochemistry
  2. SRel: the compound is a single enantiomer but its absolute configuration is not known. There may be additional /t centres which are not known but at least one must be signed. This forces /s2.
  3. SRac: the compound is a 1:1 mixture of enantiomers. At least one /t centre will be signed. The sign only acts to provide the relative stereochemistry to other signed /t centres.
Examples of the three stereo options

2. SUCF only applies to MDL Molfiles in which the CHIRAL flag is set. By default this is set to 0/off. The combinations are:

  • SUCF on, CHIRAL 1 => SAbs
  • SUCF on, CHIRAL 0 => SRel
  • SUCF off, CHIRAL ignored. However defaults to SAbs
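
The three combinations can be restated as a small Python function. This is just the list above expressed as code, with an invented function name; it says nothing about how the generation program implements the flag, and it assumes the user has not explicitly chosen SRel or SRac.

```python
def effective_stereo_mode(sucf, chiral_flag):
    """Stereo interpretation implied by SUCF and the Molfile CHIRAL flag."""
    if not sucf:
        # CHIRAL is ignored and the default interpretation applies
        return "SAbs"
    return "SAbs" if chiral_flag == 1 else "SRel"


for sucf in (True, False):
    for chiral in (1, 0):
        print(sucf, chiral, effective_stereo_mode(sucf, chiral))
```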

There appears to be no way of using SUCF and CHIRAL to denote a racemic mixture. This is a confusing issue and it is probable that the creators of MOLFiles with CHIRAL had a range of views as to what it meant, as the MDL documentation is sparse ("1=chiral, 0 = not chiral"). We recommend that SAbs/SRel/SRac be used in place of SUCF where possible.

Note that many drawing programs do not allow the user to specify the chiral flag so the information is very variable. It is more likely that the maintainer of a collection will know whether some or all of the compounds are of known chirality.

3. SNon excludes the stereochemical layer. We believe this is purely syntactic and should have no effect on the rest of the InChI. However we see no point in excluding it as it throws away information, even if that is to state that no information is known about the centres. DEPRECATED.

Example of SNon in action

4. SUU. This includes stereodescriptors on omitted or undefined chiral centres (i.e. with ? or u) if all stereocentres in the structure are omitted or undefined.

Example of SUU in action

5. NEWPS. If unset, this interprets BOTH ends of a wedge bond as stereocentres. This convention has been widely condemned throughout the chemical community and we STRONGLY DEPRECATE it. It is a great pity that it was set as the default - indeed it should never have been introduced. We therefore recommend that all InChI generation turn on NEWPS by setting this flag on the command line.

Unfortunately the default convention creates additional "known" centres about which nothing is in fact known. It is likely that careless InChI generation will create thousands of incorrect structures. WE HAVE ASKED THAT THIS BE REGARDED AS A BUG AND REMOVED ASAP.

What does RecMet do?

It appends the (metal) Reconnected layer (/r) to the InChI. See here for an example.

What does FixedH do?

It appends an additional fixed hydrogen layer (/f). See here for an example.

What does Compress do?

Compress mainly transforms some base-10 numbers to base-27 numbers, where letters are used instead of digits. A more compact expression of the connection table is also used. This avoids most of the punctuation signs that split the InChI string into short segments, which could prevent effective searching for longer segments of the InChI on the web. No information is lost through compression; it is just an alternative expression of the same identifier.

As there is currently no simple way of comparing the normal and compressed representations, the compressed InChI should be used only for internal purposes, not for structure information exchange.

Strategies for Creating InChIs

Do I need to know how my molecular information was created?

Yes!

Conventional methods of representing molecular structure often have much fuzziness and you may have to make assumptions about the defaults used by the creator. Examples are:

  • Are all hydrogen atoms explicit?
    If you know they are, then you should run InChI with the DoNotAddH flag set. This is by far the most powerful way of ensuring that your InChI is likely to be correct.
  • Are all charges included in the files?
    Sometimes the creation mechanism omits charges on atoms. This can make it very difficult to calculate the correct molecular constitution and the total electron count.
  • Are all stereo centres explicit?
    Many historical data files have no stereochemistry. Many others have partial stereochemistry (e.g. "everyone knows what the stereochemistry of androstane steroids is so we needn't put it in").
  • Is the stereochemistry absolute or relative?
    Even if all the stereocentres are given, the absolute stereochemistry may be unknown or the sample may be racemic.

Does InChI require all atoms including hydrogens?

While bond orders are not used in the representation, hydrogen atoms are required. If there is ambiguity concerning the number of H-atoms in a structure (i.e., its chemical formula is not clear), a reliable InChI cannot be created.

The InChI generator uses accepted valence rules to detect such ambiguity and issues warnings (in the output Auxiliary information) when it is detected.

What are the problems if I can't find out about this?

The worst are likely to be that:

  • the hydrogen count is seriously wrong.
  • the total electron count is wrong and the bond orders are incorrect.
  • stereocentres are flagged as known when they are not.

Can InChI fix these problems automatically?

Not really for hydrogens. It is possible to ask InChI to add hydrogens but this can go wrong and we suggest this should be done before the InChIfication and subjected to quality control.

InChI will flag unknown stereocentres. It may then be possible to search this InChI against other InChIs where the stereocentres are known and perhaps to add the missing information by hand.

InChI in the Real World

How is InChI being developed?

Work is currently underway to extend InChI to include polymer representation. The need for other extensions, including the ability to handle Markush structures and other attributes such as phases and excited states, is also being explored.

To enable development of InChI facilities and applications in an Open Source context, a project to encompass this work has been registered with SourceForge.net (see http://sourceforge.net/projects/inchi); people wishing to participate should contact the project administrator (mcnaughta@rsc.org) or the IUPAC Secretariat (secretariat@iupac.org). To receive and discuss proposals for InChI enhancements, an internet listserver has also been established; people wishing to participate in these discussions should contact Alan McNaught (mcnaughta@rsc.org).

I am a chemical supplier - is InChI useful to me?

We think so! We have shown that InChI scales well (we have over 200,000 compounds indexed by InChI). If your catalog is InChIfied then it can be easily indexed automatically by search engines, giving you greatly increased exposure. Your customers will have a better indication of the precise nature of each compound you supply and will be more easily able to integrate it into the supply chain.

I maintain a chemical database - is InChI useful to me?

Certainly. See, for example, the PubChem site. Other chemical databases are starting to InChI-fy their content.

I am a publisher - is InChI useful to me?

Certainly. The value of InChI is outlined here in "Representation and use of Chemistry in the Global Electronic Age" - Peter Murray-Rust, Henry S. Rzepa, Simon M. Tyrrell and Y. Zhang.

Nature Chemical Biology has recently adopted InChI and discussions are under way to use InChI as the chemical identifier in the Beilstein Journal of Organic Chemistry (BJOC).

I am in pharma - is InChI useful to me?

Certainly. InChI will allow a precise description of your in-house compound collection. It also makes it easier to discover what your partners have when you merge with or acquire someone. The regulators (e.g. patents) are keen on using InChI.

What Is InChI Not Designed For?

  • Manual generation:
    For all but the simplest structures, the algorithms are too complex to be implemented manually.
  • Human parsing:
    With an understanding of the syntax of the Identifier, it may be 'reverse-engineered' to show its various layers, but its compact form is not well suited to this. It may, however, be easily parsed by software and the contents of each layer examined and traced to the original structure, though end users would never be expected to do this. The demonstration program provides such parsing.
  • Substructure searching:
    The Identifier has no advantages over the more commonly used connection table formats for substructure and structure similarity searching. The InChI layers are designed solely to deal with the different ways of representing the same compound.
  • Structure display:
    Coordinates are not a part of the Identifier. While these may optionally be stored along with the identifier as auxiliary information, more flexible and widely used connection table formats exist for this purpose.
  • A connection table:
    The Identifier may be thought of as a very restricted sort of connection table since it contains the 'connectivity' of a compound. However, it holds only the information needed to uniquely identify a substance, so does not include information often held in 'connection tables' such as coordinates, bond types, positions of charges or moveable bonds, etc. The ordering of atoms is important in InChI - this order is not important in most connection tables.

InChI and Other Technologies

Can search engines use InChIs?

Yes! An InChI string can be used as a robust web-based query that has high recall and precision. We have written a paper on this titled "Enhancement of the chemical semantic web through the use of InChI identifiers", which can be found at http://pubs.rsc.org/ej/OB/2005/b502828k.pdf.

The supplemental data from this paper can be found at http://www.rsc.org/suppdata/ob/b5/b502828k/index.html.

Please contact us at ned24@cam.ac.uk if you have any queries.

We have also provided an InChI/Google Web Service which can be found at http://wwmm-svc.ch.cam.ac.uk/wwmm/html/googleinchiserver.html. This allows you to draw a structure into a 2D editor and then at the click of a button all InChI-fied instances of that molecule on the web will be located and summarised for you!!

How does CML relate to InChI?

CML and InChI are distinct projects but there is much commonality. CML is designed to hold InChI as XML <identifier> and we see InChI+CML as a primary method of representing chemistry in an XML environment.

We have generated CML (with the InChI string) for the NCI and Kegg databases. These can be viewed at http://wwmm.ch.cam.ac.uk/data/nci and http://wwmm.ch.cam.ac.uk/data/kegg respectively.

How does InChI differ from SMILES?

Like InChI, the SMILES language allows a canonical serialization of molecular structure. However, SMILES is proprietary and unlike InChI is not an open project. This has led to the use of different generation algorithms, and thus, different SMILES versions of the same compound have been found.

In fact, we have found seven different unique SMILES for caffeine on Web sites:

  1. [c]1([n+]([CH3])[c]([c]2([c]([n+]1[CH3])[n][cH][n+]2[CH3]))[O-])[O-]
  2. CN1C(=O)N(C)C(=O)C(N(C)C=N2)=C12
  3. Cn1cnc2n(C)c(=O)n(C)c(=O)c12
  4. Cn1cnc2c1c(=O)n(C)c(=O)n2C
  5. N1(C)C(=O)N(C)C2=C(C1=O)N(C)C=N2
  6. O=C1C2=C(N=CN2C)N(C(=O)N1C)C
  7. CN1C=NC2=C1C(=O)N(C)C(=O)N2C
The structure of caffeine and its InChI string

Questions and Answers from the InChI Discuss mailing list

So...what is this section all about?

This section covers all relevant questions and answers that have been sent to the InChI-Discuss mailing list. We have extracted the important information from the messages sent and added diagrams and a few extra words to aid clarity. The Q&A's have been grouped into (hopefully) sensible categories to aid their discovery.

Who is DT who has answered all of the questions asked to the mailing list?

'DT' refers to Dmitrii Tchekhovskoi, who has designed and written virtually all of the code that goes into the InChI applications.

The InChI Generation Process

What are the general steps of going from an input file to an InChI?

InChI performs the following steps, in this order:

1. Preprocesses the whole structure:
1a. Creates the disconnected preprocessed structure
1b. Creates the reconnected preprocessed structure
2. Extracts a component from 1a
3a. Normalizes the component as one that has mobile H
4a. Normalizes the component as one that has fixed mobile H
5. Canonicalizes the 3a+4a results (constitutional)
6. Canonicalizes the 5+3a+4a results (stereo)
7. Repeats 2-6 for the rest of the components from 1a
8. Executes 2-6 for all components from 1b instead of 1a
9. Sorts all the canonicalized components
10. Serializes all the results

Although steps 2-6 are run back to back for each component, the normalization can be done independently of the canonicalization, at the expense of increased memory allocation.

To do this, in CreateOneStructureINChI() [runichi.c:3285] the Main Cycle [runichi.c:3603]

for ( i = 0, nRet = 0;
      !sd->bUserQuitComponent &&
      i < cur_prep_inp_data->num_components; i ++ ) {...}


should be split in two, and a container for holding all structures that are sequentially extracted by GetOneComponent() into inp_cur_data should be created, etc.
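
The difference between the current single loop and the suggested split can be sketched as follows. This is only an illustration of the control flow, not the InChI source: the functions toy_normalize and toy_canonicalize are hypothetical stand-ins that operate on strings, and the list named normalized plays the role of the extra container mentioned above.

```python
from typing import Callable, List


def one_pass(components: List[str],
             normalize: Callable[[str], str],
             canonicalize: Callable[[str], str]) -> List[str]:
    """Single loop: each component is normalized and canonicalized in turn."""
    results = []
    for comp in components:
        results.append(canonicalize(normalize(comp)))
    return sorted(results)  # step 9: sort the canonicalized components


def two_pass(components: List[str],
             normalize: Callable[[str], str],
             canonicalize: Callable[[str], str]) -> List[str]:
    """Split loop: normalize everything first, then canonicalize."""
    # The container below holds every normalized component at once,
    # which is where the extra memory goes.
    normalized = [normalize(comp) for comp in components]
    results = [canonicalize(item) for item in normalized]
    return sorted(results)


# Toy stand-ins (assumptions for illustration, NOT InChI algorithms).
def toy_normalize(text: str) -> str:
    return text.strip().upper()


def toy_canonicalize(text: str) -> str:
    return "".join(sorted(text))
```

Both layouts give the same result for the same input; the split version simply makes the normalized structures available before any canonicalization starts.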

General InChI representation of structures

Can InChI represent a proton?

Yes. A proton may be represented in InChI because it is an ionized atom of a chemical element present in the Periodic Table - hydrogen.

InChI=1/p+1/i/hH/fH/q+1/i1+0

Can InChI represent electrons or neutrons?

Currently neither of these can be represented in InChI because they are not [ionized] atoms of a chemical element present in the Periodic Table.

Can InChI represent an alpha particle?

Yes. An alpha particle may be represented in InChI as a doubly ionized atom of a chemical element - helium - present in the Periodic Table:

InChI=1/He/q+2/i1+0

Chemical elements in InChI

Q: There is an element symbol in the InChI code consisting of "" (an empty string). Is this trapped by the InChI software as not being an element?

A: A non-empty element name string that matches one of 104 predefined element names is interpreted as an element. All other element names are not elements and produce errors.

An element symbol "" in the code is used for the sole purpose of detecting the end of the element list. It may be considered as a programming trick.

The description of the 104 valid elements given above needs a footnote:

Element name "D" is equivalent to element="H", isotope 2.
Element name "T" is equivalent to element="H", isotope 3.
Any isotopic attributes of "D" or "T" provided in the structure are ignored.
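
The rules above can be sketched as a small lookup. This is a minimal illustration under stated assumptions, not the table or code used by the InChI software: the function name interpret_element is hypothetical, and the element list is simply the first 104 elements of the Periodic Table (H to Rf).

```python
from typing import Optional, Tuple

# First 104 elements of the Periodic Table, H to Rf.
ELEMENTS = """
H He Li Be B C N O F Ne Na Mg Al Si P S Cl Ar
K Ca Sc Ti V Cr Mn Fe Co Ni Cu Zn Ga Ge As Se Br Kr
Rb Sr Y Zr Nb Mo Tc Ru Rh Pd Ag Cd In Sn Sb Te I Xe
Cs Ba La Ce Pr Nd Pm Sm Eu Gd Tb Dy Ho Er Tm Yb Lu
Hf Ta W Re Os Ir Pt Au Hg Tl Pb Bi Po At Rn
Fr Ra Ac Th Pa U Np Pu Am Cm Bk Cf Es Fm Md No Lr Rf
""".split()
ELEMENT_SET = set(ELEMENTS)

# "D" and "T" are treated as hydrogen with a fixed isotope.
HYDROGEN_ALIASES = {"D": 2, "T": 3}


def interpret_element(name: str,
                      isotope: Optional[int] = None) -> Tuple[str, Optional[int]]:
    """Return (element, isotope) or raise ValueError for a non-element."""
    if name in HYDROGEN_ALIASES:
        # Any isotopic attribute supplied for D or T is ignored.
        return "H", HYDROGEN_ALIASES[name]
    if name and name in ELEMENT_SET:
        return name, isotope
    # Empty strings and unknown names are not elements.
    raise ValueError(f"not an element: {name!r}")
```

For instance, interpret_element("D", 5) yields hydrogen with isotope 2 regardless of the supplied isotope, while an empty string raises an error.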

What does an empty InChI represent? Is it valid? How can it be generated?

It represents an empty structure that, for example, may appear in a database when a structure of a substance is not available. No attempt is made to tell a missing structure from a non-existent one.

It is valid in the sense that (a) it is what was requested during the InChI testing period and (b) InChI interprets the corresponding AuxInfo as an empty structure.

To generate an empty structure, cInChI-1 may be fed a Molfile that has zero atoms and zero bonds; the InChI command line parameter "WarnOnEmptyStructure" is required. This parameter is the last in the list displayed when cInChI-1 is run without any command line parameters.

The strings of empty InChI and AuxInfo are:
InChI=1//
AuxInfo=1//
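
As an illustration, one possible zero-atom, zero-bond Molfile can be written out as below. This is a sketch that assumes the common V2000 layout (three header lines, a counts line, and "M  END"); the function name empty_molfile is hypothetical, and whether cInChI-1 accepts exactly this file has not been verified here.

```python
def empty_molfile(title: str = "") -> str:
    """Build the text of a V2000 Molfile with no atoms and no bonds."""
    # Counts line: atom count and bond count (both 0), eight further
    # three-character fields set to 0, then "999 V2000".
    counts = f"{0:3d}{0:3d}" + "  0" * 8 + "999 V2000"
    lines = [
        title,     # header line 1: molecule name (may be blank)
        "",        # header line 2: program/timestamp (may be blank)
        "",        # header line 3: comment (may be blank)
        counts,
        "M  END",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(empty_molfile("empty structure"), end="")
```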

InChI representation of multiple components

In InChIs of structures containing more than one component, is the ; separator necessary between information for components if one is empty?

Yes. Delimiters between components are kept if at least one connection table is not empty. Consider the InChI for Zn(CH2CH3)2.

The chemical formula layer "2C2H5.Zn" shows us that the two ethyl groups will be considered first in all subsequent layers. The atom connection layer "c2*1-2;" can be broken up into:

  • c - statement that this is the start of the atom connection layer.
  • 2* - statement that there are two identical components with the following atom connections.
  • 1-2 - atom with canonical number 1 is bonded to the atom with number 2.
  • ; - separator between the atom connection information for 2C2H5 and Zn.
  • empty string - statement that the component Zn has no atom connection information.

In the case of a structure where no component contains any information for a particular layer (for instance the atom connection layer in CH3-Zn-CH2-Zn-CH3) the layer is omitted.
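
The breakdown of "c2*1-2;" given above can be reproduced with a few lines of code. This is a simplified sketch that only handles the ";" separator and the "n*" multiplier described in this answer; the function name split_connection_layer is hypothetical.

```python
import re
from typing import List, Tuple


def split_connection_layer(layer: str) -> List[Tuple[int, str]]:
    """Split a connection layer into (multiplier, connections) entries."""
    if not layer.startswith("c"):
        raise ValueError("an atom connection layer starts with 'c'")
    entries = []
    for part in layer[1:].split(";"):
        # An optional "n*" prefix means n identical components.
        match = re.fullmatch(r"(?:(\d+)\*)?(.*)", part)
        count = int(match.group(1)) if match.group(1) else 1
        entries.append((count, match.group(2)))
    return entries


if __name__ == "__main__":
    # Two ethyl components sharing "1-2", then Zn with no connections.
    print(split_connection_layer("c2*1-2;"))
```

The multipliers add up to three components, matching the formula layer "2C2H5.Zn", and the trailing empty string is the Zn entry that the ";" keeps in place.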

InChI input

General rules for InChI input

0. InChI input is a chemical structure made out of atoms optionally connected by bonds. Optionally, an empty structure (which yields an empty InChI) is allowed.

1. Atoms. Atoms have types and attributes.

1.1. Atom Types.

InChI may represent only chemical structures made out of atoms of the first 104 chemical elements present in the Periodic Table (from H to Rf).

1.2. Number of Atoms.

InChI input may contain not more than 1023 atoms.

1.3. Atom Attributes (all optional).

The atoms inside the chemical structure may have the following attributes: charge (integral), radical (list:singlet, doublet, triplet), isotopic mass (integral), coordinates (2D or 3D).

2. Bonds connect pairs of atoms (those are 2-center bonds; no 3-center, etc. bonds allowed). Bonds have types and optionally stereo attributes. Any pair of atoms may be connected by not more than one bond.

2.1. Bond Types.

Bond types are single, double, and triple.

Note.
As an exception, "aromatic bonds" are also allowed, but only to be immediately converted into alternating single/double bonds by the algorithm. This conversion is far from bulletproof, partly because no formal, universally accepted definition of aromatic bonds exists, so the use of aromatic bonds is strongly discouraged.

2.2. Number of Bonds.

Not more than 20 bonds per atom are allowed.
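
The numeric limits and bond rules listed so far lend themselves to a simple pre-check before a structure is passed on. The sketch below is an assumption-laden illustration, not part of the InChI software: the function name input_problems and the tuple-based bond representation are hypothetical, and aromatic bonds are deliberately left out.

```python
from collections import Counter
from typing import Iterable, List, Tuple

MAX_ATOMS = 1023
MAX_BONDS_PER_ATOM = 20
BOND_TYPES = {"single", "double", "triple"}


def input_problems(num_atoms: int,
                   bonds: Iterable[Tuple[int, int, str]]) -> List[str]:
    """Check (atom_a, atom_b, bond_type) bonds, with 0-based atom indices."""
    problems = []
    if num_atoms > MAX_ATOMS:
        problems.append(f"{num_atoms} atoms exceeds the limit of {MAX_ATOMS}")
    seen_pairs = set()
    degree = Counter()
    for a, b, kind in bonds:
        if kind not in BOND_TYPES:
            problems.append(f"bond {a}-{b}: unsupported type {kind!r}")
        if a == b or not (0 <= a < num_atoms and 0 <= b < num_atoms):
            problems.append(f"bond {a}-{b}: must join two distinct atoms")
            continue
        pair = frozenset((a, b))
        if pair in seen_pairs:
            problems.append(f"bond {a}-{b}: atom pair already bonded")
        seen_pairs.add(pair)
        degree[a] += 1
        degree[b] += 1
    for atom, count in sorted(degree.items()):
        if count > MAX_BONDS_PER_ATOM:
            problems.append(f"atom {atom}: {count} bonds exceeds "
                            f"{MAX_BONDS_PER_ATOM}")
    return problems
```

An empty list means none of these particular rules was violated; it says nothing about the other rules in this section, such as stereo attributes.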

2.3. Bond stereo attributes.

In the case of 2D coordinates, single bonds may have the stereo attributes Up or Down. In the case of 3D or 0D coordinates these attributes are ignored. The Either attribute is always recognized as Unknown stereochemistry.

The Either attribute of a double bond is always recognized as an Unknown stereo bond geometry.

The Unknown stereo (when no stereo description is available) is considered different from the Undefined one.

3. Implied conventions or semantics.

3.1. Stereobonds in 2D. With the "NEWPS" option, only the atom at the narrow end of a single stereobond receives a relevant stereochemistry; otherwise (by default) both atoms do.
It should be pointed out that, at least in the 21st century, the latter, "perspective", interpretation has always been criticized by the experts in stereochemistry (see, for example, http://www.iupac.org/projects/2003/2003-045-3-800.html). However, not all followers of the "perspective" depiction have been eradicated or converted yet.

3.2. Implied H. Since the InChI code has to be able to process legacy Mol/SDfiles that often contain implied H atoms, some "standard" valences are assumed (see Appendix 1 in the InChI Technical Manual) and H atoms may be added unless this addition is prohibited by the user (-DoNotAddH option).

Hydrogen Layer

Multiple mobile hydrogen groups

Q: Will an InChI ever contain multiple mobile hydrogen groups in the same layer?

A: Such structures do exist. For example, oxalamide:

The mobile hydrogen groups are:

  • (H2,3,5) - showing that there are two hydrogen atoms that are mobile between the atoms with canonical numbers 3 and 5.
  • (H2,4,6) - showing that there are two hydrogen atoms that are mobile between the atoms with canonical numbers 4 and 6.

Please notice that there is no comma between the two groups, that is, between ")" and "(".

In the AuxInfo you may even find that these two groups are constitutionally equivalent.
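
Reading such groups out of a hydrogen layer can be sketched with a regular expression. This is a simplified illustration that only covers the plain "(Hn,a,b,...)" form shown above; the function name mobile_h_groups is hypothetical.

```python
import re
from typing import List, Tuple

# "(H", an optional count, then one or more ",<canonical number>" items.
MOBILE_H = re.compile(r"\(H(\d*)((?:,\d+)+)\)")


def mobile_h_groups(h_layer: str) -> List[Tuple[int, List[int]]]:
    """Return (number of mobile H, [canonical atom numbers]) per group."""
    groups = []
    for count, atoms in MOBILE_H.findall(h_layer):
        n_hydrogens = int(count) if count else 1  # "(H,..." means one H
        numbers = [int(a) for a in atoms.split(",") if a]
        groups.append((n_hydrogens, numbers))
    return groups


if __name__ == "__main__":
    # The two groups quoted above for oxalamide.
    print(mobile_h_groups("(H2,3,5)(H2,4,6)"))
```

Because adjacent groups are written back to back with no comma between them, matching each parenthesised group separately is simpler than splitting the layer on commas.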

Stereochemical Layer

Can the /s sub-layer of the stereochemical layer appear in more than one layer?

In general, /s may appear in M, MI, F, and FI (the main, main isotopic, fixed-H, and fixed-H isotopic layers). However, you may want to take a look at "Appendix 2. Abbreviations and Layer Precedence", section "a. Layer Precedence", in the InChI Technical Manual.

The reason is that the /s segment (as well as /t, /b, and /m) may be deliberately omitted if it is the same as in the preceding layer. For stereochemical layers the precedence is somewhat complicated because it depends on the existence (that is, non-emptiness) of the layers in M or F.

One more (not quite obvious) reason for omitting a contribution from a non-isotopic component to an isotopic stereo segment /t or /b (located in MI or FI) is that this component, unlike isotopic components, *always* has isotopic stereo exactly the same as its non-isotopic stereo in the preceding layer. The contribution from such a non-isotopic component to the isotopic /m segment is ".", as if the inversion does not change its /t stereo.

/s may not be omitted in, for example, MI if some of the components have no /t in M but do have /t in MI, even if another component has absolute or relative stereo in both M and MI and, as a result, both /s segments are the same.

The /s sub-layer

Q: It's not clear to me what "/s" modifies - is it "/t" (tetrahedral center), "/b" (double bond stereo) or both? Is it possible to modify both "/t" and "/b" independently?

A: /s refers only to the stereochemistry that changes upon spatial inversion (or upon reflection in a plane, since a reflection is an inversion combined with a proper rotation). Inversion cannot change (E) or (Z) double bond stereo. Therefore /s modifies /t (sp3 stereochemistry) and has nothing to do with double bond stereo /b.

How /s modifies the information in the rest of the stereochemical layer

Q: Let's say I have a molecule that has two tetrahedral chiral centers, and two double bonds. I only know the relative configurations of the tetrahedral stereocenters, but I know the absolute geometries of both double bonds.

How would an InChI capture this information? Does "/s" modify the entire stereo block that precedes it? Or can both "/b" and "/t" be modified independently? For example, would the following be legal?

A: The syntax of the InChI you wrote is formally incorrect (please excuse my nit-picking: I hope these subtle details may help your understanding of the syntax issues; see a-c and e-g below). The correct syntax (for a single component) would be:

"InChI=1/.../b2-1+/t4+,5-/s2..." -- relative stereo
"InChI=1/.../b2-1+/t4+,5-/m1/s1..." -- absolute stereo, or
"InChI=1/.../b2-1+/t4+,5-/m0/s1..." -- absolute stereo

but not

"InChI=1/.../b2-1+/t4+,5-/m./s1..." -- wrong, see (i)
"InChI=1/.../b2-1+/t4+,5-/s1..." -- wrong, see (f)
"InChI=1/.../b2-1+/t4+,5-/m1/s2..." -- wrong, see (g)
"InChI=1/.../b2-1+/t4+,5-/m1..." -- wrong, see (f)

Explanation:

(a) in /b, the delimiter between the canonical numbers of the atoms connected by a stereo bond is "-"

(b) in /b, the first canonical number of the two atoms connected by a stereo bond is always greater than the second

(c) the order of the segments describing the stereochemistry is always this:
/b /t /m /s

(d) in /b, the character after the second canonical number may be only one of the following four:
+ - ? u

(e) (/m and /s1) or (/s2) may be present only if /t is present

(f) /s1 is present if and only if /m is present

(g) In case of a relative stereo (/s2) it does not matter whether the /t stereo shown in InChI was inverted or not. Moreover, /m would make different InChIs of the same relative stereo drawn as inverted and non-inverted. Therefore, in case of a /s2 (relative stereo) the /m segment is never present.

(h) If the spatial inversion does not affect /t then neither /s nor /m is present in InChI.

(i) Contents of /m should include at least one character 1 or 0 (which means at least one of the components is chiral), otherwise both /m and /s are omitted.

(j) there may not be more than one /m and /s in each of M, MI, F, FI.

Some of the above statements or their combinations may be equivalent.
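
Rules (c), (e), (f), (g), (i) and (j) can be turned into a rough consistency check for the stereo segments of one layer. The sketch below is an illustration only: the function name stereo_problems is hypothetical, it looks at a single layer in isolation, and it does not account for segments that are omitted because they repeat the preceding layer.

```python
from typing import Dict, List

STEREO_ORDER = "btms"  # rule (c): /b /t /m /s


def stereo_problems(layer: str) -> List[str]:
    """Check a string such as "/b2-1+/t4+,5-/m1/s1" against the rules."""
    problems: List[str] = []
    values: Dict[str, str] = {}
    seen_order: List[str] = []
    for segment in layer.split("/"):
        if not segment:
            continue
        name, value = segment[0], segment[1:]
        if name not in STEREO_ORDER:
            problems.append(f"unexpected segment /{name}")
            continue
        if name in values:
            problems.append(f"(j) more than one /{name}")
        values[name] = value
        seen_order.append(name)
    if seen_order != sorted(seen_order, key=STEREO_ORDER.index):
        problems.append("(c) segments must be in the order /b /t /m /s")
    if ("m" in values or "s" in values) and "t" not in values:
        problems.append("(e) /m and /s require /t")
    # (f) and (g): /m accompanies /s1 and nothing else.
    if ("m" in values) != (values.get("s") == "1"):
        problems.append("(f, g) /m must be present exactly when /s1 is")
    if "m" in values and not set(values["m"]) & {"0", "1"}:
        problems.append("(i) /m must contain at least one 0 or 1")
    return problems
```

Applied to the seven strings above (with the "InChI=1/.../" prefix and the trailing "..." removed), the first three produce no problems and each of the last four produces one.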

Missing /s sub-layer in isotopic stereochemical layer

Q: In trying to understand the Isotopic Layer, I came across the following InChI in the inchi-samples/Samples.sdf file of the InChI binary distribution (Structure 28):

InChI=1/C7H8O4/c1-3(9)5-4(2-8)6(5)7(10)11/h2,5-6,8H,1H3,(H,10,11)/b4-2-/t5-,6+/m0/s1/i1TD,3+1/t1-,5+,6-/m1

I am unclear about the lack of "/s" following "/m" in the Isotopic Layer:

/i1TD,3+1/t1-,5+,6-/m1

Given previous guidance that:
"(c) /s segment is not present only in the following cases:
(c1) there is no stereogenic atoms in the structure, or
(c2) all stereogenic atoms have unknown or undefined stereo, or
(c3) the structure is not chiral"

A: The stereochemical layer is "repetitive". You may want to take a look at the "Appendix 2. Abbreviations and Layer Precedence", section "Layer Precedence" in the InChI Technical Manual.

Below I call a "segment" any /string of the following form:
.../string/... or .../string[End of Line]
where there is no slash inside the string.

In general, if an InChI "repetitive" segment /e in a given layer L is exactly the same as its namesake segment /e' in the preceding layer L', then /e is omitted (the strings /e and /e' are identical; the ' marks the position inside the InChI string).

Also, the parts of the segments that refer to components that are not isotopic ("entities") are left empty in the isotopic layer and are represented there by ";".

Note that the /m segment is included in both stereochemical layers because the stereochemistry is inverted between the two.
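
The effect of the omission rule on the InChI from the question can be illustrated by splitting it into its main and isotopic segments. This is a deliberately simplified sketch: the function names are hypothetical, only the main and isotopic layers are handled (no /f or /r), and the fallback to the main layer is a stand-in for the full precedence rules in the Technical Manual.

```python
from typing import Dict, Optional, Tuple


def split_main_and_isotopic(inchi: str) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (main, isotopic) dicts keyed by each segment's first letter."""
    segments = inchi.split("/")[2:]  # skip "InChI=1" and the formula
    main: Dict[str, str] = {}
    isotopic: Dict[str, str] = {}
    current = main
    for segment in segments:
        if not segment:
            continue
        if segment.startswith("i"):
            current = isotopic  # everything from /i on is isotopic
        current[segment[0]] = segment[1:]
    return main, isotopic


def effective(name: str, main: Dict[str, str],
              isotopic: Dict[str, str]) -> Optional[str]:
    """Isotopic value if written out, otherwise the main-layer value."""
    return isotopic.get(name, main.get(name))


if __name__ == "__main__":
    example = ("InChI=1/C7H8O4/c1-3(9)5-4(2-8)6(5)7(10)11"
               "/h2,5-6,8H,1H3,(H,10,11)/b4-2-/t5-,6+/m0/s1"
               "/i1TD,3+1/t1-,5+,6-/m1")
    main, iso = split_main_and_isotopic(example)
    print(sorted(iso))                 # segments written in the isotopic layer
    print(effective("s", main, iso))   # /s taken from the main layer
```

For this InChI the isotopic layer writes out /t and /m, which differ from the main layer, while /s and /b are not repeated.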

Isotopic Layer

What is the ordering of Deuterium and Tritium in the isotopic layer?

The ordering of isotopes goes in descending atomic mass. Thus the ordering for Tritium, Deuterium and Protium is T>D>1H.

Note that if you specify the generic H (rather than the specific protium isotope 1H) then it will obviously not appear in the isotopic layer (which only deals with specific isotopic enrichment).

Reconnected Layer

Multiple entities in the reconnected layer

Q: Can a reconnected layer ("/r") ever consist of more than one entity? I'm defining "entity" as a connected molecular graph. I've seen examples of InChIs that contain a single "reconnected" entity (such as the ferrocene example, for which I appreciate the help I got). But would a reconnected layer ever contain multiple reconnected entities?

A: In general, it is possible. In fact, it is the case if you calculate an InChI of a compound drawn as two or more coordination compounds (entities) that are not connected to each other by chemical bonds.

For example:

General note on the Reconnected layer

Dmitrii Tchekhovskoi -
I would like to emphasize that only the disconnected form is generally reliable. The 'reconnected' form depends on drawing conventions including not only H-atoms, but also the connectivity drawn in the original. The 'reconnected' form was included in InChI to allow individuals to store their specific connectivity if they wish.

The suggested default is reconnected = OFF, and in fact the default is reconnected = OFF. Comparisons between InChIs should certainly be made with reconnected = OFF. This is a most obvious application of our layering method.

Sub-layers of the reconnected layer

Q: Is it the case that the "reconnected" structure could contain a full-fledged InChI consisting of Main Layer, Charge Layer, Isotopic Layer, Fixed H Layer, and Fixed/Isotopic Combination Layer? (Of course, the version would not be present after "/r".) Or, are only subsets of these layers permitted in the "reconnected" structure? If so, which ones? The algorithm on pages 59-60 of the Technical Manual implies that all InChI layers may be present after "/r".

A: This is correct: all InChI layers may be present after "/r". As you rightfully suggested, after /r is "a full-fledged InChI" that may include all possible layers, including, for example, up to 4 (four) stereochemical layers for M, MI, F, and FI layers. Examples showing repetitions may easily be constructed by adding components (entities) that have no bonds to metal. These components will be repeated after /r.

From the programming point of view, the full InChI normalization and canonicalization algorithm in the case of ferrocene is called twice: (a) for the structure after disconnection of the bonds to metal, and (b) if the -RecMet option was applied, for the original structure.

As you may conclude, for the purpose of the restoration of the original structure, if /r layer is present then you may ignore everything before /r.

InChI failures

Problems reading Alias statements in MOL files

Q: In NMR, we sometimes report determined chemical shift assignments as Alias statements following the connection table in the MOL file. Some examples are below (determined carbon shifts in ppm):

A 1
212.28
A 2
211.06

Unfortunately, InChI fails when we include this information in the MOL files that we are generating InChIs from. As soon as they are removed the InChI is generated correctly. Shouldn't InChI generation ignore the Alias statements as the atom types are already specified in the atom lines of the MOL file?

A: No, it should not. Aliases are used by structure editing software to express chemical structures, not the data associated with them.

Some structure editors do not allow atoms D or T, therefore the only way to add them is through the Alias.

Some SDfile-based databases treat a hydrogen isotope entered as 2H differently from D.

In addition, aliases like NH2 sometimes occur.

Choosing the lesser evil, InChI treats a Molfile alias as an atom, possibly with hydrogen atoms and a charge. If this interpretation fails, InChI produces an error. This allows it to rightfully fail if an alias is, for example, a valid abbreviation that InChI does not recognize, such as an amino acid abbreviation like Asp or Lys.

Ignoring the failure to interpret an alias and calculating InChI as if the aliased atom is a carbon instead of what the Alias was supposed to mean would be a mistake. In general, failure is better than producing a wrong identifier.