Updates to the PubChem Assay Data Model

PubChem is updating its data model used for storing bioassay information.  This update will change the format of data uploaded to or downloaded from PubChem.  As a result, assay data depositors need to format their data based on the new data model to submit them to PubChem.  Also, software programs that download PubChem’s assay data (e.g., in ASN.1, XML, and CSV) for further analysis will need to be updated to load PubChem data correctly.

 

Major changes to the assay data specifications

Some important changes in the new data model are summarized below.  A full data specification is available at the PubChem FTP site.

  • Changes to panel assay specification
    A panel assay contains bioactivity data for multiple targets (sometimes up to thousands).  In the past, the data for each target in a panel assay were stored in a few columns of the data table.  This led to data tables with widely varying numbers of columns (up to tens of thousands), making it difficult to handle and display panel assay data.  In the new data model, the input format for panel assays will no longer be column-based, and each data point will be stored in a row, as shown in this example:
    https://pubchem.ncbi.nlm.nih.gov/bioassay/1433#section=Data-Table
    Our upload system will make changes accordingly. Note that all archived panel assays have been converted to this new row-based format.
  • GI to accession
    In the past, numeric identifiers called GI numbers were used to specify the proteins or nucleotides relevant to PubChem bioassays (e.g., assay targets or cross-references).  However, NCBI phased out the use of GI numbers in its databases, as explained in a series of blog posts.  Accordingly, GIs are replaced with accessions in the assay specification and new assay submissions will accept accessions only. All GIs in archived blobs are converted to accessions.
  • Inclusion of endpoint qualifiers
    Endpoint qualifiers (e.g. >, >=, =, <, <=) are included in the data specification.  Without these qualifiers, bioactivity data could be misinterpreted.  For example, while compounds with IC50 = 1 mM against a given target have different bioactivity from those with IC50 > 1 mM or IC50 < 1 mM against the same target, they could all look the same without endpoint qualifiers.  While many assays in PubChem have this qualifier information, users unknowingly ignored it in assay data analysis.  To address this issue, the new data format explicitly includes the endpoint qualifiers, e.g. the “Standard Relation” field as shown in this example:
    https://pubchem.ncbi.nlm.nih.gov/bioassay/2916#section=Data-Table
    All existing assays with the qualifier information have been annotated accordingly.
  • UTF-8 character support
    Assay data archived in PubChem often contain UTF-8 characters, which are not presented correctly in a text file.  Examples are Greek letters (α, β, γ, …), commonly used in target names (e.g., β-glucuronidase) or units (°C or °F), often found in experimental protocols.  The new data format supports UTF-8 characters, as exemplified in the following assay (note that the assay title contains the character “β”):
    https://pubchem.ncbi.nlm.nih.gov/bioassay/1347299
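
To make the panel-assay and endpoint-qualifier changes above more concrete, here is a minimal Python sketch.  It is not PubChem's upload code or file format: the column names ("SID", "<target>_value", "<target>_relation") and both helper functions are hypothetical, chosen only to illustrate reshaping a column-based panel record into one row per data point and reading a value together with its qualifier.

```python
from typing import Dict, Iterator, List, NamedTuple


class DataPoint(NamedTuple):
    sid: str
    target: str
    relation: str  # endpoint qualifier: ">", ">=", "=", "<", "<=", or "" if unknown
    value: float


def wide_to_rows(record: Dict[str, str], targets: List[str]) -> Iterator[DataPoint]:
    """Turn one column-based panel record into one row per measured target.

    Column names here are illustrative assumptions, not PubChem headers.
    """
    for target in targets:
        raw = record.get(f"{target}_value", "")
        if raw == "":
            continue  # no measurement for this target
        relation = record.get(f"{target}_relation", "")
        yield DataPoint(record["SID"], target, relation, float(raw))


def definitely_at_most(point: DataPoint, threshold: float) -> bool:
    """Return True only if the qualifier guarantees endpoint <= threshold."""
    if point.relation in ("=", "<", "<="):
        return point.value <= threshold
    # ">" and ">=" give only a lower bound; an unknown qualifier proves nothing.
    return False


record = {
    "SID": "101",
    "T1_value": "1.0", "T1_relation": ">",
    "T2_value": "0.5", "T2_relation": "=",
}
rows = list(wide_to_rows(record, ["T1", "T2", "T3"]))
```

In this sketch the T1 data point (> 1.0) is never counted as being at or below a threshold of 1.0, whereas ignoring the qualifier would have treated it as exactly 1.0.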

 

The transition plans

All existing assay data have been converted to the new data format and are publicly available on our web pages and at our FTP site (under the Bioassay2 directory).  The data in the old data specification is still available at the FTP site (under the Bioassay directory) but will be archived as /Other/Bioassay1 by June 1, 2021. The Bioassay2 directory will then become Bioassay as the default.

PubChemRDF 1.7β has been released

A significant update has been made to PubChemRDF, machine-readable PubChem data formatted using the Resource Description Framework (RDF) (https://www.w3.org/RDF/).  (If you have never heard about PubChemRDF before, please read this PubChem blog first.)

What is PubChemRDF?

RDF is a World Wide Web Consortium (W3C) standard model for data interchange on the web.  In RDF, knowledge is expressed as statements, each of which consists of three discrete parts: a subject, an object, and a predicate that specifies the relationship between them.  This trio of parts is called a triple.  For example, the sentence “asbestos can cause mesothelioma” consists of “asbestos” (subject), “mesothelioma” (object), and “can cause” (predicate).  Similarly, the sentence “ethanol is metabolized to acetaldehyde” can be broken down into a triple of “ethanol” (subject), “acetaldehyde” (object), and “is metabolized to” (predicate).  In essence, RDF expresses knowledge as a directed, labeled graph.
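
As a toy illustration of the triple idea, the two example sentences can be written down as (subject, predicate, object) records in Python.  This is only a sketch: real RDF, including PubChemRDF, identifies subjects, predicates, and objects with IRIs rather than plain strings, and the Triple class and objects_of helper below are hypothetical names introduced for this example.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str


# The two example statements from the text, stored as edges of a
# directed, labeled graph (subject --predicate--> object).
TRIPLES = [
    Triple("asbestos", "can cause", "mesothelioma"),
    Triple("ethanol", "is metabolized to", "acetaldehyde"),
]


def objects_of(triples: List[Triple], subject: str, predicate: str) -> List[str]:
    """Return every object linked to `subject` by `predicate`."""
    return [t.obj for t in triples
            if t.subject == subject and t.predicate == predicate]
```

Following an edge in the graph then amounts to a lookup such as objects_of(TRIPLES, "ethanol", "is metabolized to").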

PubChemRDF refers to the RDF-formatted PubChem data.  It contains information on various entities in PubChem (chemicals, bioassays, genes, proteins, pathways, literature, etc.) and their relationships.  With PubChemRDF, researchers can work with PubChem data using Semantic Web technologies (https://en.wikipedia.org/wiki/Semantic_Web).  In addition, PubChemRDF facilitates PubChem data sharing, analysis, and integration with data from other resources.

What’s new in PubChemRDF 1.7β?

  • Updated vocabularies
    To define the semantic relationships (that is, predicates) between entities (subjects and objects), PubChemRDF uses pre-existing, domain-specific ontological frameworks (rather than creating new ones), such as Chemical Entities of Biological Interest (ChEBI), CHEMical INFormation ontology (CHEMINF), Protein Ontology (PRO), Gene Ontology (GO), and BioAssay Ontology (BAO), among others.  Since PubChemRDF was first introduced, some terms in these ontologies have been deprecated or replaced with new ones.  These changes are now reflected in PubChemRDF 1.7β.
  • New subdomain
    In PubChemRDF 1.7β, a new subdomain, called Pathway, is added to encode information on biological pathways and their relationship with genes, proteins, and chemicals. This Pathway subdomain supersedes the BioSystem subdomain used in the previous versions of PubChemRDF.
  • GI to accession
    In the previous versions, numeric identifiers called GI numbers were used to denote proteins or genes.  However, NCBI phased out the use of GI numbers in its databases, as explained in a series of blog posts.  Accordingly, changes have been made to allow one to access PubChemRDF data using the ‘accession’ identifiers.

Where can I learn more about PubChemRDF 1.7β?

To learn more about this topic, please read the following:

Introducing PubChem Pathway Pages

PubChem Pathway Pages are now available. Each PubChem Pathway page provides information about chemicals, proteins, genes, and diseases involved in or associated with the biological pathway, which can be very important to provide a context to observed biological activity. In addition, all pathways associated with a given chemical, protein or gene are summarized on the corresponding page.

All content comes from existing Pathway resources without any attempt to merge or combine them.  Each page for a given pathway can be accessed via a URL of this form:

https://pubchem.ncbi.nlm.nih.gov/pathway/SOURCE:PATHID

where SOURCE is the information source for the pathway and PATHID is the record identifier used by the source.  For example, the following URL directs to the pathway page for the citric acid cycle in human and mouse (ID: SMP0000057 and SMP0063477, respectively) from PathBank:

https://pubchem.ncbi.nlm.nih.gov/pathway/PathBank:SMP0000057  (for human)

https://pubchem.ncbi.nlm.nih.gov/pathway/PathBank:SMP0063477  (for mouse)

PubChem Pathways supersedes the NCBI BioSystems database, which is no longer being updated. If you have an NCBI BioSystems identifier (BSID) and the page exists in PubChem Pathways, you can access the corresponding PubChem Pathway page via, for example:

https://pubchem.ncbi.nlm.nih.gov/pathway/BSID:703092
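
The SOURCE:PATHID pattern described above can be sketched as a small Python helper.  The function name pathway_url is hypothetical and nothing here checks that the page actually exists; it only assembles the URL form shown in the examples.

```python
PATHWAY_BASE = "https://pubchem.ncbi.nlm.nih.gov/pathway/"


def pathway_url(source: str, path_id: str) -> str:
    """Build a Pathway page URL of the form .../pathway/SOURCE:PATHID."""
    if not source or not path_id:
        raise ValueError("both source and path_id are required")
    return f"{PATHWAY_BASE}{source}:{path_id}"


human_tca = pathway_url("PathBank", "SMP0000057")
mouse_tca = pathway_url("PathBank", "SMP0063477")
legacy = pathway_url("BSID", "703092")
```

Treating "BSID" as just another source keeps the legacy BioSystems identifiers on the same code path as the other pathway resources.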

Chemicals, proteins, and genes presented on PubChem Pathway pages are linked to corresponding PubChem pages, providing quick access to more detailed information on these entities.  In addition, the Pathway page provides information on the interactions or reactions among these entities.  The PubChem Pathway pages are searchable within PubChem Search.

Lastly, PubChem Pathway information is integrated with the NCBI Gene database (https://www.ncbi.nlm.nih.gov/gene/).  For instance, the following web page presents all pathways associated with the human EGFR (NCBI Gene ID: 1956) in PubChem Pathways:

https://www.ncbi.nlm.nih.gov/gene/1956#pathways

To learn more about the PubChem Pathway page, please read this Help page (https://pubchemdocs.ncbi.nlm.nih.gov/pathways).

Molecular property links to SpringerMaterials are now in PubChem

More than 32,000 compounds in PubChem now have links to hundreds of chemical and physical properties pertinent to chemistry, material science, physics, and other related fields available from SpringerMaterials (see this press release).  These links will help you quickly locate articles for the property in question.

Chemicals with SpringerMaterials links will contain a “SpringerMaterials Properties” section in the “Chemical and Physical Properties” table of contents.  This provides a list of chemical properties available at SpringerMaterials for this compound.  For example, the following link shows the list of the material properties for benzene (Figure 1).

https://pubchem.ncbi.nlm.nih.gov/compound/benzene#section=SpringerMaterials-Properties

Figure 1. The SpringerMaterials properties for benzene (CID 241) (https://pubchem.ncbi.nlm.nih.gov/compound/241#section=SpringerMaterials-Properties). The 13C nuclear magnetic resonance spectrum link takes you to the entry at the SpringerMaterials site.

Clicking on one of the properties in this list directs you to the SpringerMaterials web page showing a list of articles containing detailed information on that property.  Currently, more than 32,000 compounds have links to SpringerMaterials property data.  A list of these compounds is available through the PubChem Sources page or via the PubChem Classification Browser.

The addition of SpringerMaterials links to PubChem assists users in finding important data and literature available for chemicals.

 

Integration of WIPO’s PATENTSCOPE data with PubChem

The World Intellectual Property Organization (WIPO) is an international organization that aims to promote the protection of intellectual property throughout the world.  WIPO provided PubChem with more than 16 million chemical structures searchable in its patent database called PATENTSCOPE (see this press release).

For each of the chemical structures contributed by WIPO, PubChem provides a direct link to PATENTSCOPE, which allows users to perform searches for patent documents relevant to that chemical structure.  For example, the following URL directs users to the “WIPO PATENTSCOPE” section of the cholesterol-lowering medication atorvastatin (CID 60823):

https://pubchem.ncbi.nlm.nih.gov/compound/60823#section=WIPO-PATENTSCOPE

By clicking the direct link presented in this section, users can search PATENTSCOPE for patent documents relevant to CID 60823 and further analyze returned hits using the tools available at PATENTSCOPE.

A list of chemical structures contributed by WIPO can be obtained through the PubChem Source page for PATENTSCOPE:

https://pubchem.ncbi.nlm.nih.gov/source/23607

The integration of WIPO’s chemical information with PubChem makes it easier for PubChem users to find pertinent patent information about chemicals.

Webinar on current access to TOXNET resources

NLM staff will participate in the next American Chemical Society webinar for the chemical information and cheminformatics community: An Overview of NLM’s Post-TOXNET Resources. TOXNET (the TOXicology Data NETwork) was retired in December 2019 as part of the reorganization associated with the NLM Strategic Plan. Most of TOXNET’s databases have been incorporated into other NLM resources such as PubChem and Bookshelf, or continue to be available elsewhere. This webinar will show you where to go now for TOXNET information.

  • Date and Time: Tuesday, March 17 at 1:00pm EDT.
  • Register 

A live Q&A session will follow the webinar.

PubChem presents at the American Chemical Society National Meeting in San Diego (August 25-29, 2019)

On August 25-29, 2019, the American Chemical Society National Meeting will be held in San Diego, CA, the theme of which is “Chemistry & Water”.  The PubChem team will be at the ACS meeting to present new developments and recent changes in PubChem.  Below is a list of presentations involving PubChem staff.

 

Day 1 (Sunday, August 25)

 

Day 2 (Monday, August 26)

  • CHAS 17: PubChem LCSS (J. Zhang)
    Rancho Santa Fe 3 – Marriott Marquis San Diego Marina, 3:35 PM – 3:55 PM

Day 3 (Tuesday, August 27)

 

Day 4 (Wednesday, August 28)

 

San Diego Convention Center; Photo Credit: Visitor7 [CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0)]

 

PubChem Periodic Table and Element pages

UPDATE (July 29, 2020): See also the paper published in Chemistry Teacher International (doi:10.1515/cti-2020-0006).

The periodic table of chemical elements is one of the most recognized tools in science.  As we mark the 150th anniversary of the periodic table, the scientific community has declared 2019 to be “The International Year of the Periodic Table”.  PubChem is celebrating by launching the PubChem Periodic Table and corresponding Element pages.

PubChem gives each chemical its own Compound page, and elements have such pages too.  However, Compound pages are not suited for displaying information specific to elements (such as electronegativity and electron configuration).  The PubChem Periodic Table and Element pages help you navigate the abundant chemical element data available within PubChem, while providing a convenient entry point to explore additional information, such as bioactivities and health and safety data, available in PubChem Compound pages for specific elements and their isotopes.

PubChem Element page content comes from scientific articles and various authoritative data sources, such as the International Union of Pure and Applied Chemistry (IUPAC), National Institute of Standards and Technology (NIST), International Atomic Energy Agency (IAEA), Jefferson Laboratory, and Los Alamos National Lab.

The PubChem Periodic Table provides three distinct views.  Table View is the traditional periodic table any scientist would instantly recognize.  List View provides a summary view, allowing you to see all properties available for each element at once.   Game View, added as an educational feature, helps test your knowledge of element names and symbols.

Clicking an element in the PubChem Periodic Table directs you to the corresponding Element page.  This page presents a wide variety of element information, including atomic properties (electron affinity, electronegativity, ionization potential, oxidation states, electron configuration, etc.) as well as isotopes, history, uses, and, most importantly, information sources.  The Element page can also be reached directly via URLs that include the atomic number, symbol, or name (all case insensitive).  For example, the following URLs are for the Element page for carbon:

https://pubchem.ncbi.nlm.nih.gov/element/Carbon

https://pubchem.ncbi.nlm.nih.gov/element/C

https://pubchem.ncbi.nlm.nih.gov/element/6

In addition, the data presented in the Periodic Table and Element pages are also available through programmatic access, using PUG-REST and PUG-View.
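
The three Element page URL forms above can be generated with a short Python helper.  This is a sketch under stated assumptions: the name element_url is hypothetical, it only builds the web page URLs shown in this post, and it does not call PUG-REST or PUG-View.

```python
from typing import Union

ELEMENT_BASE = "https://pubchem.ncbi.nlm.nih.gov/element/"


def element_url(identifier: Union[int, str]) -> str:
    """Build an Element page URL from an atomic number, symbol, or name."""
    text = str(identifier).strip()
    if not text:
        raise ValueError("identifier must not be empty")
    return ELEMENT_BASE + text


carbon_urls = [element_url("Carbon"), element_url("C"), element_url(6)]
```

Because the identifiers are described as case insensitive, the helper passes them through unchanged rather than normalizing their case.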

PubChem Homepage has a new look and feel!

We’ve redesigned PubChem’s homepage to give you easier access to the information you need, where you need it. The mobile-friendly, responsive design works on the device you want to use. And the streamlined, intuitive interface puts the data you need at your fingertips.

Here are some of the changes you can expect to see at the new PubChem homepage:

The menus at the top of the page and the sidebar have been replaced with a minimal set of important links. These links include “About,” “Blog,” “Submit,” and “Contact.” The “About” link will bring you to the PubChem Docs site, where you can find an exhaustive list of PubChem services and documentation.

In addition, data count and data source statistics have been highlighted. Each also includes a link you can follow to get more information on these statistics.

Finally, we’ve improved PubChem’s search capabilities. The three search boxes for compounds, substances and bioassays have been replaced with a single search box that covers all search types. Search results from the formerly separate search types (compound, substance, and bioassay) have also been integrated into a single search results display. In addition, search in PubChem now directly supports formula and structure search. We have many more details to share with you about the new PubChem search in a separate blog post, so keep your eyes open for that!

We want to know what you think!

PubChem’s new look and feel is a big step forward for PubChem, and we’re excited to share all of the improvements we’re making across PubChem with you!

What’s working well? What’s not? What’s missing? Send an email to pubchem-help@ncbi.nlm.nih.gov

Stay tuned to this blog for future announcements about the roll-out of all the new designs and features coming to PubChem!

PubChem presents at the American Chemical Society National Meeting in Orlando (March 31-April 4, 2019)

On March 31-April 4, 2019, the 257th American Chemical Society National Meeting will be held in Orlando, FL, the theme of which is “Chemistry for New Frontiers”.  The PubChem team will be at the ACS meeting to present new developments and recent changes in PubChem.  Below is a list of presentations that will be given by the PubChem staff.

 

Day 1 (Sunday, March 31)

 

Day 2 (Monday, April 1)

 

Day 4 (Wednesday, April 3)

 

Orlando Convention Center; Photo Credit: Visitor7 [ CC BY-SA 3.0 (https://creativecommons.org/licenses/by-sa/3.0) ]