What is the difference between a substance and a compound in PubChem?

PubChem users sometimes ask about the difference between a substance and a compound.  The question is not surprising, as the names “substance” and “compound” do not by themselves convey the difference.  In PubChem terminology, a substance is a chemical sample description provided by a single source, and a compound is a normalized chemical structure representation found in one or more contributed substances.  The distinction is important because PubChem is organized into three separate databases: Compound, Substance, and BioAssay.  The diagram below illustrates the difference, but let’s explore this further.

[Figure: the three PubChem databases (Compound, Substance, and BioAssay)]

To understand the different databases in PubChem, it is helpful to know where the information comes from.  PubChem (http://pubchem.ncbi.nlm.nih.gov/) is an open archive of chemical substances and information about their biological activities.  Data is provided by hundreds of contributors (http://pubchem.ncbi.nlm.nih.gov/sources/), including publishers, researchers, chemical vendors, pharmaceutical companies, and a number of important chemical biology resources.  Each of these data sources contributes descriptions of the chemical substance samples for which it has information.

PubChem calls these community-provided sample descriptions “substances.”  Each record found in the PubChem Substance database (http://www.ncbi.nlm.nih.gov/pcsubstance) contains information provided by an individual contributor about a particular chemical substance.  Substance records are independent of each other.  Two different Substance records (from the same or different providers) could provide different information about the same chemical structure.  For example, one substance record may give information about the biological role of aspirin, while another may give information about a research grade sample of aspirin.  The Substance database maintains the provenance of chemical substance information in PubChem.  It helps users see who provided what.  As a result, there may be many substance records about a given molecule, presenting a problem for users who are interested in an aggregated view of information on the molecule.  This is where the PubChem Compound database (http://www.ncbi.nlm.nih.gov/pccompound) comes into play.

[Figure: PubChem substance vs. compound]

The Compound database is derived from the chemical structure contents found in the Substance database.  Each chemical structure is computationally examined through a series of validation and normalization steps.  This process results in a normalized representation of the chemical structure for a substance record.  Chemical substances in the Substance database that are not completely described or that fail normalization procedures are not included in the Compound database.  Substances that pass the chemical structure normalization procedures are linked to a “compound” record in the Compound database.  If two substances refer to the same chemical structure, they point to the same compound.  This allows data from different Substance data providers to be aggregated through a common Compound record.  Separate substance records nevertheless remain valuable to users who, for example, are interested in the provenance of a substance or in a particular state of the chemical (e.g., a different tautomeric form).  In essence, a primary purpose of the PubChem Compound database is to provide a “non-redundant” view of the depositor-contributed chemical structure contents stored in the PubChem Substance database.
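
The many-to-one relationship described above can be sketched as a toy model in Python.  This is only an illustration under stated assumptions, not PubChem's actual pipeline: the `Substance` class, the `normalize` placeholder, the `build_compounds` helper, and the sample identifiers and depositor names are all hypothetical, and real structure normalization involves chemical validation far beyond trimming a string.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Substance:
    """One depositor-provided sample description (hypothetical fields)."""
    sid: int        # made-up substance identifier
    source: str     # made-up depositor name
    structure: str  # depositor-supplied structure string; "" if not described


def normalize(structure: str):
    """Placeholder for structure normalization (NOT PubChem's real procedure).

    Returns a canonical key, or None when no structure is available.
    """
    cleaned = structure.strip()
    return cleaned or None


def build_compounds(substances):
    """Group substance IDs under the normalized structure they share."""
    compounds = defaultdict(list)
    for sub in substances:
        key = normalize(sub.structure)
        if key is None:
            continue  # remains a substance only; no compound record is created
        compounds[key].append(sub.sid)
    return dict(compounds)


# Two depositors describe aspirin; a third gives no usable structure.
substances = [
    Substance(1, "Vendor A", "CC(=O)OC1=CC=CC=C1C(=O)O"),
    Substance(2, "Journal B", " CC(=O)OC1=CC=CC=C1C(=O)O "),
    Substance(3, "Lab C", ""),
]
compounds = build_compounds(substances)
```

In this sketch, substances 1 and 2 end up under a single compound key, while substance 3 stays in the substance list alone, mirroring how incompletely described substances have no Compound record.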

So, to answer the question posed at the beginning, what is the difference between a substance and a compound?  A substance is a contributed chemical substance sample description from a particular PubChem data provider.  A compound is a normalized chemical structure representation found in one or more contributed substance descriptions.

To read more on this topic, please consider exploring these links: