Atomic mass changes in PubChem

PubChem is now using the latest International Union of Pure and Applied Chemistry (IUPAC) recommendations for atomic mass and isotopic composition information.  In addition, PubChem is now restricting the allowed isotopes for a given element to those with a half-life of one millisecond or greater.

Fundamental changes within atomic mass information

Normally, atomic mass updates are not blog-worthy; however, there are some fundamental changes in the way masses are conceptualized that affect the molecular weight values computed for nearly all compounds in PubChem.

Molecular weight is one of the most frequently requested pieces of information about a chemical.  To compute a molecular weight of a molecule, one consults a periodic chart and sums the average atomic weights of the elements comprising the chemical, while considering any specified isotopic enrichment information.  Although the molecular weight computation seems straightforward, as greater degrees of precision in atomic masses are known, the chemical science community is recognizing complex issues with average atomic weight and isotopic data.
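As a minimal sketch of the summation described above, the Python snippet below adds up per-element average atomic weights for a composition given as element counts.  The function name `molecular_weight` and the small weight table are illustrative assumptions rather than PubChem code; the three values are the IUPAC conventional atomic weights for hydrogen, carbon, and oxygen, and isotopic enrichment is not handled.

```python
# Illustrative subset of average atomic weights (IUPAC conventional values).
# This table is an assumption for the example, not PubChem's internal data.
ATOMIC_WEIGHTS = {"H": 1.008, "C": 12.011, "O": 15.999}


def molecular_weight(composition, digits=3):
    """Sum average atomic weights for a {element: atom count} mapping.

    The result is rounded to `digits` places after the decimal point.
    """
    total = sum(ATOMIC_WEIGHTS[element] * count
                for element, count in composition.items())
    return round(total, digits)


water = {"H": 2, "O": 1}
glucose = {"C": 6, "H": 12, "O": 6}
print(molecular_weight(water))
print(molecular_weight(glucose))
```

A lookup table keyed by element symbol keeps the arithmetic transparent; a real implementation would also need a formula parser and isotope-specific masses.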

The abundance ratio between different isotopes of a given element is used to determine its average atomic weight.  As the sensitivity of measuring equipment has increased, scientists now notice a distinct difference in these abundance ratios depending on the material source of that element.  To reflect this variation, and as explained in this IUPAC technical report, many elements are now given an atomic weight interval, consisting of a range of known discrete values reflecting the varying isotopic abundance ratios found in different elemental material sources. For example, the atomic weight interval of carbon is 12.0096 to 12.0116.
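To make the effect of an atomic weight interval concrete, the sketch below propagates the carbon interval quoted above (12.0096 to 12.0116) to the carbon contribution of a molecule.  The function name `carbon_contribution_range` is hypothetical, and only carbon is considered; other elements with intervals would widen the range further.

```python
# Carbon atomic weight interval quoted in the text above.
CARBON_INTERVAL = (12.0096, 12.0116)


def carbon_contribution_range(n_carbon):
    """Return the (low, high) mass contribution of n_carbon carbon atoms."""
    low, high = CARBON_INTERVAL
    return n_carbon * low, n_carbon * high


low, high = carbon_contribution_range(6)  # e.g. a molecule with six carbons
print(f"carbon contributes {low:.4f} to {high:.4f}")
print(f"spread: {high - low:.4f}")
```

Because the interval width for carbon is 0.0020, the spread grows linearly with the number of carbon atoms (6 × 0.0020 = 0.012 for six carbons).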

Another complicating factor is that the abundance ratio of naturally occurring isotopes is not available for all elements.  Some elements, like radon, have no stable isotope and no characteristic isotopic composition in terrestrial materials.  This means that no average atomic weight can be determined!  There are also a growing number of elements that do not exist in nature and are instead “synthesized” in the lab.  These artificially created elements are unstable, rapidly decaying into other elements.  Importantly, because different isotopes of a given element decay at different rates, the isotopic abundance ratio between isotopes is time-dependent.

All of these considerations contribute to the uncertainty in atomic weight and isotopic information, which in turn impacts the molecular weight of a compound.

What changes did PubChem make?

All molecular weights in the PubChem Compound database were updated as follows:

  • Adoption of “conventional atomic weights”
    To provide a single, representative average atomic-weight value for an element ignoring any material source uncertainties, the latest IUPAC recommendations include a concept of “conventional atomic weight value” whereby most or all atomic-weight variation in normal materials is covered (with an interval of ± 1 in the last digit).  PubChem has adopted this approach for the twelve elements (hydrogen, lithium, boron, carbon, nitrogen, oxygen, magnesium, silicon, sulfur, chlorine, bromine, and thallium) with standard atomic weights given as intervals.
  • Standard atomic weights updated
    Standard atomic weights in PubChem use the latest values provided by IUPAC (except when a conventional atomic weight value is used).  For the thirty-four elements without any abundance information (e.g., technetium), the atomic weight of the most stable, non-theoretical isotope was used, as found in the NuBase2012 evaluation (http://amdc.in2p3.fr/nubase/nubtab12.asc) of nuclear and decay properties.
  • Trimmed precision of molecular weights
    To take into account the uncertainties in elemental abundances and masses, the precision of all molecular weight values was reduced from six to three digits beyond the decimal point.
  • Updated allowed isotopes for elements
    The internal PubChem knowledgebase used to generate the PubChem Compound database from the PubChem Substance database was updated.  (Read this blog if you are not familiar with how these two databases differ from each other.)  As a part of this, only isotopes with an experimentally measured half-life of one millisecond or greater are allowed, based on the NuBase2012 evaluation of nuclear and decay properties (http://amdc.in2p3.fr/nubase/nubtab12.asc).  This (slightly) modifies the scope of what can be found in the PubChem Compound database.
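The half-life rule in the last item can be pictured as a simple filter.  The sketch below is an assumption about the general shape of such a check, not PubChem's actual implementation: the `Isotope` record, the `allowed_isotopes` function, and the sample half-life values are all hypothetical placeholders rather than entries copied from NuBase2012.  Stable isotopes are represented with a half-life of `None` and are always kept.

```python
from dataclasses import dataclass
from typing import Optional

MIN_HALF_LIFE_SECONDS = 1e-3  # one millisecond, the threshold from the text


@dataclass(frozen=True)
class Isotope:
    element: str
    mass_number: int
    half_life_seconds: Optional[float]  # None means stable


def allowed_isotopes(isotopes):
    """Keep stable isotopes and those with a half-life >= one millisecond."""
    return [
        iso for iso in isotopes
        if iso.half_life_seconds is None
        or iso.half_life_seconds >= MIN_HALF_LIFE_SECONDS
    ]


# Hypothetical sample data for illustration only (element "X" is made up).
samples = [
    Isotope("X", 10, None),    # stable: kept
    Isotope("X", 11, 3600.0),  # one hour: kept
    Isotope("X", 12, 1e-3),    # exactly one millisecond: kept
    Isotope("X", 13, 5e-7),    # half a microsecond: dropped
]
print([iso.mass_number for iso in allowed_isotopes(samples)])
```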

Where can you learn more about this topic?

To learn more about this topic, please read the following:

  • Atomic weights of the elements 2013 (IUPAC Technical Report)
    Meija et al., Pure Appl. Chem. 2016; 88(3): 265-291.
    doi: 10.1515/pac-2015-0305
  • Isotopic compositions of the elements 2013 (IUPAC Technical Report)
    Meija et al., Pure Appl. Chem. 2016; 88(3): 293-306.
    doi: 10.1515/pac-2015-0503

New PubChem Data Sources Page

The PubChem Data Sources page is now updated.

What is the Data Sources page?

As an archive, PubChem contains information from hundreds of sources from all over the world.  Contributors can provide different types of content, such as substances, bioassays, and annotation.  The Data Sources page is an interface that helps one to determine, among other things, who provided what information.

What changed?

As a part of an underlying technology update of PubChem, this page has been completely overhauled with a new look and feel.  The categorization describing the organization types providing content was simplified.  Sources of hierarchical classifications and textual annotations are now included.  There is now a unified data source table containing all primary information.  The updated interface provides new and improved capabilities to navigate as a function of data type, category, and country, while also including keyword searching, counts, and geographic visualization.

  • Filtering capability
    A panel (on the left-hand side of the screen) now summarizes (by count) key aspects of PubChem data sources.  By clicking the check boxes, one can filter the data sources listed.

By type
Classification of the type of information provided to PubChem.  This includes the ability to consider data ‘on-hold’ (to be released at a later date).

By category
General-purpose groupings that describe the contributing organization.

By status
Separates active contributors from legacy.  As explained in this post, some contributors or projects no longer exist (although their contributed data may still have substantial utility or value).

By geographic region
PubChem data contributors span the globe.  One can now filter and visualize by country.


  • Expanded sorting capability
    The improved Data Sources page allows users to sort by record counts and last-modified date.  For example, sorting by last-modified date helps to identify organizations that recently updated their content.
  • Exploring sources on a mobile device
    As with other PubChem pages developed in recent years, great effort is taken to make the page adapt to the unique experience of mobile devices.  This means that without sacrificing features, the layout scales and complexity adjusts to match the appropriate screen size.
  • Improved individual data source page
    If you click on a data source link in PubChem, it now directs you to a dedicated page for that depositor.  Beyond showing contact information with its location displayed on a Google Map, it provides the date content was last updated and the current counts of submitted records.

Recent PubChem Publications: Read about What’s New!

The PubChem team published an article in the 2016 Nucleic Acids Research Database issue (Kim et al., Nucl. Acids Res., 2016, 44(D1), D1202-D1213, PMID: 26400175).  This article provides an overview of the PubChem Compound and Substance databases, including organization, contents, interfaces, programmatic access and other relevant tools and services.  Considerable changes have been made since these two databases were described in a previous paper published in 2008 (Bolton et al., Ann. Rep. Comput. Chem., 2008, 4, 217-241), and the newly published paper provides updated information on these resources.

Additional papers published about PubChem by the team in 2015 include:

To get a complete list of all articles published by the PubChem team, please visit the PubChem Publication page.

PubChem adds a “legacy” designation for outdated data

Sometimes information provided to PubChem by data contributors becomes outdated.  To address this, PubChem is introducing a “legacy” designation for collections that are not regularly updated.  This “legacy” designation applies to projects or contributors that appear to no longer be active, as well as to their individual records.  This designation will help PubChem users quickly identify records that may have out-of-date information and/or hyperlinks.

Why a “legacy” designation?

As an archive, PubChem accepts scientific data from contributors and maintains that data even if the contributing project is discontinued. While this helps ensure community access to the information lasts beyond the lifetime of a given scientific endeavor, the archival nature of PubChem does not allow anyone other than the data contributor to modify provided information.  Therefore, some records in PubChem can persist with outdated (or incorrect) data.  To help identify such cases, we are introducing a “legacy” indication for contributors and their records.  Please note that this does not mean that data identified as “legacy” is without value.  Quite to the contrary, some legacy collections successfully collected valuable scientific data for the research community, and are simply no longer updating the information.

How is a “legacy” designation determined?

A “legacy” designation is arrived at via a semi-manual, semi-automated procedure.  It involves aspects of examining contributor account information, individual records, and user reports.  For example, if the depositor website does not work for a period of time, attempts are made to contact the submitting organization.  If PubChem staff are unable to make contact with the data contributor or if an organization is no longer updating records, a legacy designation may be initiated.  Please note that a “legacy” designation can be removed at any time, when contact is reestablished and updates resume.

Impacts of legacy designation?

If a data contributor is designated as “legacy”, all records deposited by the contributor are also designated as “legacy”.  While still searchable, these records will clearly indicate that they are “legacy”.  Please note that “legacy” records will not be shown in the “Chemical Vendors” section of Compound Summary pages.  In addition, in the “Substances by Category” section of the Compound Summary page, “legacy” substance records will be found only under “Legacy Depositors”.

Future plans?

The way PubChem implements both manual and automated processes to ascertain a “legacy” indication will likely evolve over time.  In addition, we are looking at the possibility of enabling users to separate out legacy records when searching and analyzing the database.

Significant Update to PubChemRDF!

PubChemRDF 1.5β is now available.  The new version is faster, supports linked data in new formats, features improved search and query functions, and contains new links.

What is PubChemRDF?

PubChemRDF expresses data in a Resource Description Framework (RDF) format using ontological frameworks and semantic web technologies.  It facilitates data sharing and analysis, and integrates with other National Center for Biotechnology Information (NCBI) resources along with external resources across scientific domains.  To learn more about this project, please see our earlier blog post and PubChemRDF release notes.


What is new in PubChemRDF 1.5β?

The 1.5β release contains a number of new features and technological improvements including:

  • Faster Speed
    PubChemRDF data is now served from a triple-store and provides a noticeable speed improvement, especially for records with lots of data.  Previously, RDF was generated on the fly from data stored in disparate data systems.
  • Addition of MeSH
    Major improvements were made to the reference subdomain.  Most notable is the addition of Medical Subject Heading (MeSH) annotation of PubMed records.  This includes MeSH topical descriptors (with optional qualifier) that indicate the subject of an article and MeSH (supplementary) concepts that indicate things like chemicals and diseases discussed in an article.
  • Direct links to authoritative RDF resources
    PubChemRDF now enhances cross-integration by providing direct links to available authoritative RDF resources within applicable subdomains, including: reference, synonym, and inchikey to MeSH RDF; protein to UniProt RDF; protein and substance to PDB RDF; biosystem to Reactome RDF; substance to ChEMBL RDF; and compound to WikiData RDF.  For example, the links to PDB RDF help to distinguish proteins and associated chemical substances found in a Protein Data Bank (PDB) crystal structure.
  • Addition of ‘concept’ subdomain
    A new ‘concept’ subdomain provides the means to annotate PubChemRDF subdomains.  For example, annotation between nodes within the concept subdomain allows a hierarchy of concepts to be created, such as those in the WHO ATC classification.  These can then be applied, such as in the case of adding links from chemical substance synonyms to a WHO ATC classification to indicate its therapeutic and pharmacological properties.
  • New links added between the compound and biosystem subdomains
    Previously, the biosystem subdomain linked only to the protein subdomain.  The added links between the compound and biosystem subdomains help to indicate the chemical structure involved in a given pathway.
  • Support for protein complexes
    Protein complex targets are now distinguished within the bioassay subdomain and are linked to the component protein units.
  • Linked Data using JSON
    JSON-LD (or JavaScript Object Notation for Linked Data) is a method of transporting Linked Data using JSON. This addition helps those wanting to use JSON formatted data, for example, with JavaScript.
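For readers new to JSON-LD, the sketch below parses a small hand-written JSON-LD fragment with Python's standard `json` module.  The document, its `@id`, and the property name are invented for illustration and are not actual PubChemRDF output; the point is only that JSON-LD is ordinary JSON with reserved keys such as `@context` and `@id`.  The helper `expand_term` is likewise hypothetical and covers only the simple prefix case.

```python
import json

# Hypothetical JSON-LD fragment, written by hand for this example.
DOC = """
{
  "@context": {"ex": "http://example.org/vocab#"},
  "@id": "http://example.org/compound/1",
  "ex:label": "example compound"
}
"""


def expand_term(term, context):
    """Expand a 'prefix:name' term to a full IRI using the @context map."""
    prefix, sep, name = term.partition(":")
    if sep and prefix in context:
        return context[prefix] + name
    return term


data = json.loads(DOC)
context = data["@context"]
for key, value in data.items():
    if not key.startswith("@"):  # skip JSON-LD keywords
        print(expand_term(key, context), "=", value)
```

For anything beyond a quick look, a dedicated JSON-LD processor is a better choice than hand-rolled prefix expansion.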

Where can I learn more about PubChemRDF?

To read more on this topic, please consider exploring these links:

Ten years of service

September 16, 2004 is a special day in the history of PubChem (https://pubchem.ncbi.nlm.nih.gov/).  It marks the beginning of PubChem as an on-line resource.  Now fast forward ten years.  PubChem provides information daily to many tens of thousands of users.  Despite the passage of time, PubChem’s primary mission remains the same: providing comprehensive information on the biological activities of chemical substances.

Growth

PubChem has faced many challenges over the years.  Chief among them is scalability.  For example, within the first year of operation, the amount of available data in PubChem more than doubled.  To this day, the growth of contributors and data remains very strong, with hundreds of contributing organizations, 20% of which provide biological activity information to PubChem.  These data providers represent a highly varied cross-section of academic, commercial, and governmental entities.  Combined, they have contributed information on a significant fraction of all known organic small molecule chemical entities, numbering in the tens of millions.

PubChem was created to archive the output of the recently concluded Molecular Libraries Program (MLP – http://mli.nih.gov) high-throughput screening (HTS) initiative.  Most of the biological activity results in PubChem (>95%) are from MLP HTS centers; however, it is interesting to note that MLP represents only a small fraction (<1%) of the biological experiments.  All told, there are over 225 million publicly available biological activity reports in PubChem, with approximately two million chemicals having some form of biological testing data.  In addition, RNAi screening experiments are increasingly found in PubChem.

Interfaces

Providing chemical information to researchers in the biomedical science community is a key part of PubChem’s purpose.  Over the years, PubChem introduced and incrementally developed several interfaces, each with its own distinct purpose and set of use cases.  Primary to these is the Entrez search interface (https://www.ncbi.nlm.nih.gov/), where PubChem is organized as three distinct databases: Substance, Compound, and BioAssay.  Substance provides substance descriptions (accession number: SID), Compound provides the unique small-molecule chemical content of Substance (accession number: CID), and BioAssay provides biological experiment results for substances (accession number: AID).  [Go here to learn more about the difference between Substance and Compound.]  Each of these databases has an advanced search interface and contains numerous indexes and filters, which can be combined to construct elaborate queries.  Additional interfaces exist to search and analyze information in PubChem, including the ability to analyze bioactivity information, download chemical and assay data, search by chemical structure or protein sequence, navigate using integrated classifications, visualize chemical 3-D information, and more.

PubChem continues to evolve the way it provides on-line content.  External search engines (like Google, Bing, and others) are now a key way in which researchers locate data.  In addition, programmatic interfaces now account for a significant portion of PubChem’s overall usage (+50%).  Key programmatic interfaces to PubChem include Entrez Utilities and PUG/REST.
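As a rough illustration of programmatic access, the sketch below builds a PUG REST URL that requests the molecular weight of a compound by CID and, optionally, fetches it.  The helper names `property_url` and `fetch_property` are hypothetical, the URL layout follows the public PUG REST pattern as understood at the time of writing, and the shape of the JSON response should be checked against the current PUG REST documentation.

```python
import json
from urllib.request import urlopen

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def property_url(cid, prop="MolecularWeight"):
    """Build a PUG REST URL for one property of one compound (by CID)."""
    return f"{PUG_REST_BASE}/compound/cid/{int(cid)}/property/{prop}/JSON"


def fetch_property(cid, prop="MolecularWeight", timeout=10):
    """Fetch and decode the JSON response (requires network access)."""
    with urlopen(property_url(cid, prop), timeout=timeout) as response:
        return json.load(response)


# Only the URL is built here; call fetch_property(2244) to query the service.
print(property_url(2244))
```

Keeping URL construction separate from the network call makes the request easy to inspect and to test offline.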

Future

The world of information is forever changing and improving.  If the past ten years are any indication of what the future will bring to PubChem, the next ten are sure to be very exciting, with more data from a greater number of sources, additional types of data, increased annotation, improved interfaces, and advancements in ease of access.  With your support as contributors and users, PubChem will continue to serve the needs of the community.

Why contribute your data to PubChem?

PubChem is an open archive of chemical substances and their biological experimental results.  “Open” means that you can put your scientific data in PubChem and that others may use it. What kinds of chemical substances can you provide information about?  All kinds, including small molecule chemicals, RNAs, carbohydrates, peptides, complex mixtures, natural products, and more. And you can also provide the results of your biological experiments with these substances. Appropriate biological experimental results include biological assay screens (such as phenotypic, whole cell, defined target, high throughput, dose-response, validation, etc.), physical property measurements, and beyond.

There are many reasons to upload your data to PubChem:

  • Maximize the benefit of your research.
    When research data is made publicly available, it helps to promote new scientific discovery. Other researchers can find your data, use it, and build upon it. This can lead to new research collaborations and improved insight into your results, thus helping to increase the impact of your research efforts and advance science more quickly. 
  • Save time and effort in open-access data sharing.
    Maintaining your own data archive and user interface takes precious time and adds to research costs. Data sharing requirements by journals and granting agencies may be satisfied by use of the PubChem data archiving platform. PubChem provides high-capacity interfaces, so you know your data will be accessible. Given that PubChem is part of NLM, you can rest assured that your data will be preserved and available without (login or paywall) barriers, now and for the foreseeable future. 
  • Maintain control over when your data becomes public.
    Timing when you release scientific data can be critical.  Release data too soon and you might not be able to file a patent or publish a paper.  If you need to time the release of your data with the publication of a paper, the filing of a patent, or in coordination with a grant administrator, you can set a hold-until date of up to one year in the future.  If anything changes, your hold-until date can be adjusted (shortened or extended). 
  • Share your held data with only those you choose.
    When you first submit data to PubChem, you are assigned stable identifiers for your substances and bioassays.  These identifiers can be used to prove that you have submitted data to PubChem even if the data are not yet publicly viewable.  If your data are on-hold, you can log in to your PubChem Upload account and dynamically create unique, private URLs to individual data submissions to share with reviewers and collaborators.  At any moment during the hold-until period, you can revoke access to these URLs.

Sharing scientific data is important. PubChem is upgrading its service to make it easier than ever to rapidly upload information about your chemical substances and biological experiment results. Scientific data, however, can be complex. PubChem Upload provides wizards to help guide you through the process of making data public.  In addition, use of standard spreadsheet formats and private FTP uploads for large datasets help to streamline data submission.

For more information, please see the following:

If you have any questions concerning these topics, please contact us via email at pubchem-deposit-help@ncbi.nlm.nih.gov.