Important Changes to PubChem Web Protocols

PubChem will retire HTTP web URLs in favor of HTTPS by September 30, 2016.

What does this mean to you?

Currently, PubChem supports both HTTP and HTTPS web URLs. For example, both URLs http://pubchem.ncbi.nlm.nih.gov and https://pubchem.ncbi.nlm.nih.gov take you to PubChem. However, by September 30, 2016, the HTTP web protocol will be retired in favor of the HTTPS protocol. Furthermore, the HTTPS web protocol will be implemented according to the HTTPS-Only Standard. Any attempt to access PubChem after September 30, 2016 using a web URL starting with “http:” may no longer work.

For the most part, this change will be invisible to you, as PubChem started to use the HTTPS protocol in early 2014. Today, many sites already use HTTPS when linking to PubChem with a URL. However, links and tools that still access PubChem using the HTTP protocol will need to be updated to the HTTPS protocol.
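
For those maintaining stored links, the update can be scripted. The following is a minimal Python sketch, not an official PubChem tool: the function name upgrade_to_https is hypothetical, and it assumes the stored URLs carry no explicit port such as ":80", which would need separate handling.

```python
from urllib.parse import urlsplit, urlunsplit


def upgrade_to_https(url: str) -> str:
    """Return url with an "http" scheme replaced by "https".

    URLs that already use another scheme (including https) are returned
    unchanged. Assumption: the URL has no explicit port such as ":80".
    """
    parts = urlsplit(url)
    if parts.scheme != "http":
        return url
    # SplitResult is a named tuple, so _replace swaps only the scheme.
    return urlunsplit(parts._replace(scheme="https"))


if __name__ == "__main__":
    for link in (
        "http://pubchem.ncbi.nlm.nih.gov",
        "https://pubchem.ncbi.nlm.nih.gov",
    ):
        print(link, "->", upgrade_to_https(link))
```

Parsing the URL rather than replacing the text "http:" avoids touching that substring where it appears inside a path or query string.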

Why the change?

On June 8, 2015, the US federal government issued an HTTPS-only policy for all publicly accessible Federal websites.  As a part of this mandate, the National Center for Biotechnology Information (NCBI) recently announced important changes to NCBI Web Protocols to adopt HTTPS on September 30, 2016. A webinar is available on the NCBI YouTube channel that explains how this will affect access to web pages. PubChem resides at NCBI and will adopt the same HTTPS-only policy.

Why is this change being mandated?

The unencrypted HTTP protocol does not protect data from interception or alteration, which can subject users to eavesdropping, tracking, and the modification of received data. It may also expose potentially sensitive information about users to attackers. This information may include browser identities, website contents, search terms, user-submitted information, and more. Many commercial organizations such as banks have already adopted HTTPS-only policies to protect users when using their websites and services.

HTTPS verifies the identity of a website or web service for a connecting client, and encrypts nearly all information sent between the website or service and the user. Protected information includes cookies, user agent details, URL paths, form submissions, and query string parameters. HTTPS is designed to prevent this information from being read or changed while in transit. HTTPS provides a layer of protection for web users; however, it has several important limitations. IP addresses and destination domain names are not encrypted during communication. Even encrypted traffic can reveal some information indirectly, such as time spent on site, or the size of requested resources or submitted information.

To learn more, visit these websites:

 

Significant Update to PubChemRDF!

PubChemRDF 1.5β is now available.  The new version is faster, supports linked data in new formats, features improved search and query functions, and contains new links.

What is PubChemRDF?

PubChemRDF expresses data in a Resource Description Framework (RDF) format using ontological frameworks and semantic web technologies.  It facilitates data sharing and analysis, and integrates with other National Center for Biotechnology Information (NCBI) resources along with external resources across scientific domains.  To learn more about this project, please see our earlier blog post and PubChemRDF release notes.

PubChem RDF v1.5-beta

What is new in PubChemRDF 1.5β?

The 1.5β release contains a number of new features and technological improvements including:

  • Faster Speed
    PubChemRDF data is now served from a triple-store and provides a noticeable speed improvement, especially for records with lots of data.  Previously, RDF was generated on the fly from data stored in disparate data systems.
  • Addition of MeSH
    Major improvements were made to the reference subdomain.  Most notable is the addition of Medical Subject Headings (MeSH) annotation of PubMed records.  This includes MeSH topical descriptors (with optional qualifier) that indicate the subject of an article and MeSH (supplementary) concepts that indicate things like chemicals and diseases discussed in an article.
  • Direct links to authoritative RDF resources
    PubChemRDF now enhances cross-integration by providing direct links to available authoritative RDF resources within applicable subdomains, including: reference, synonym, and inchikey to MeSH RDF; protein to UniProt RDF; protein and substance to PDB RDF; biosystem to Reactome RDF; substance to ChEMBL RDF; and compound to WikiData RDF.  For example, the links to PDB RDF help to distinguish proteins and associated chemical substances found in a Protein Data Bank (PDB) crystal structure.
  • Addition of ‘concept’ subdomain
    A new ‘concept’ subdomain provides the means to annotate PubChemRDF subdomains.  For example, annotation between nodes within the concept subdomain allows a hierarchy of concepts to be created, such as those in the WHO ATC classification.  These can then be applied, such as in the case of adding links from chemical substance synonyms to a WHO ATC classification to indicate its therapeutic and pharmacological properties.
  • New links added between the compound and biosystem subdomains
    Previously, the biosystem subdomain linked only to the protein subdomain.  The added links between the compound and biosystem subdomains help to indicate the chemical structure involved in a given pathway.
  • Support for protein complexes
    Protein complex targets are now distinguished within the bioassay subdomain and are linked to the component protein units.
  • Linked Data using JSON
    JSON-LD (or JavaScript Object Notation for Linked Data) is a method of transporting Linked Data using JSON. This addition helps those wanting to use JSON formatted data, for example, with JavaScript.
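
To illustrate the JSON-LD item above, here is a small Python sketch of how a JSON-LD document pairs ordinary JSON with a "@context" that maps short keys to full URIs. The document shown is hypothetical (the example.org identifier and the label value are invented, and it is not actual PubChemRDF output), and expand_term is a deliberately naive helper rather than a full JSON-LD processor.

```python
import json

# Hypothetical JSON-LD document; not real PubChemRDF output.
DOC_TEXT = """
{
  "@context": {"label": "http://www.w3.org/2000/01/rdf-schema#label"},
  "@id": "http://example.org/compound/1",
  "label": "example compound"
}
"""

doc = json.loads(DOC_TEXT)


def expand_term(document: dict, term: str) -> str:
    """Naively look up a short key in the document's @context.

    Returns the full URI if the context defines the term as a plain
    string, otherwise returns the term unchanged. Real JSON-LD expansion
    has many more rules than this.
    """
    mapping = document.get("@context", {}).get(term, term)
    return mapping if isinstance(mapping, str) else term


if __name__ == "__main__":
    print("node:", doc["@id"])
    print("label key expands to:", expand_term(doc, "label"))
    print("label value:", doc["label"])
```

Because the document is plain JSON, any JSON parser can read it; the "@context" is what lets a linked-data tool interpret the same keys as RDF predicates.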

Where can I learn more about PubChemRDF?

To read more on this topic, please consider exploring these links:

What is the difference between a substance and a compound in PubChem?

PubChem users sometimes ask about the difference between a substance and a compound.  The question is not surprising as the names “substance” and “compound” alone do not inherently convey the difference.  In PubChem terminology, a substance is a chemical sample description provided by a single source and a compound is a normalized chemical structure representation found in one or more contributed substances.  The distinction is important as PubChem is organized in three separate databases: Compound, Substance, and BioAssay.  The diagram below explains the difference, but let’s explore this further.

PubChem 3 DBs

To understand the different databases in PubChem, it is helpful to know where the information comes from.  PubChem (http://pubchem.ncbi.nlm.nih.gov/) is an open archive of chemical substances and information about their biological activities.  Data is provided by hundreds of contributors (http://pubchem.ncbi.nlm.nih.gov/sources/), including publishers, researchers, chemical vendors, pharmaceutical companies, and a number of important chemical biology resources.  Each of these data sources contributes a description of chemical substance samples for which they have information.

PubChem calls these community-provided sample descriptions “substances.”  Each record found in the PubChem Substance database (http://www.ncbi.nlm.nih.gov/pcsubstance) contains information provided by an individual contributor about a particular chemical substance.  Substance records are independent of each other.  Two different Substance records (from the same or different providers) could provide different information about the same chemical structure.  For example, one substance record may give information about the biological role of aspirin, while another may give information about a research grade sample of aspirin.  The Substance database maintains the provenance of chemical substance information in PubChem.  It helps users see who provided what.  As a result, there may be many substance records about a given molecule, presenting a problem for users who are interested in an aggregated view of information on the molecule.  This is where the PubChem Compound database (http://www.ncbi.nlm.nih.gov/pccompound) comes into play.

PubChem substance vs compound

The Compound database is derived from the chemical structure contents found in the Substance database.  Each chemical is computationally examined with a series of validation and normalization steps.  This process results in a normalized representation of the chemical structure for a substance record.  Chemical substances in the Substance database that are not completely described or that fail normalization procedures are not included in the Compound database.  Those substances in the Substance database that pass chemical structure normalization procedures are linked to a “compound” record in the Compound database.  If two substances refer to the same chemical structure, they point to the same compound.  This allows data from different Substance data providers to be aggregated through a common Compound record.  However, keeping separate substance records is still valuable to users, who, for example, might be interested in the provenance of a substance or a particular state of the chemical (e.g., a different tautomeric form).  In essence, a primary purpose of the PubChem Compound database is to provide a “non-redundant” view of the depositor-contributed chemical structure contents stored in the PubChem Substance database.
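
The many-to-one relationship described above can be sketched in a few lines of Python. This is a toy illustration only: the records, identifiers, and source names are invented, and the normalize function merely trims whitespace as a stand-in for PubChem's actual validation and normalization procedures, which are far more involved.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Hypothetical substance records from three contributors.
SUBSTANCES = [
    {"sid": 101, "source": "Source A", "structure": "CC(=O)OC1=CC=CC=C1C(=O)O"},
    {"sid": 202, "source": "Source B", "structure": "  CC(=O)OC1=CC=CC=C1C(=O)O "},
    {"sid": 303, "source": "Source C", "structure": None},  # no structure given
]


def normalize(structure: Optional[str]) -> Optional[str]:
    """Toy stand-in for structure normalization: trim whitespace.

    Returns None when the structure is missing or empty, which models a
    substance that is not completely described.
    """
    if structure is None or not structure.strip():
        return None
    return structure.strip()


def link_substances(substances: List[dict]) -> Tuple[Dict[str, List[int]], List[int]]:
    """Group substance IDs under one key per normalized structure."""
    compounds: Dict[str, List[int]] = defaultdict(list)
    unlinked: List[int] = []
    for record in substances:
        key = normalize(record["structure"])
        if key is None:
            unlinked.append(record["sid"])  # remains a substance-only record
        else:
            compounds[key].append(record["sid"])
    return dict(compounds), unlinked


if __name__ == "__main__":
    compounds, unlinked = link_substances(SUBSTANCES)
    print("compounds:", compounds)
    print("substances without a compound:", unlinked)
```

In this sketch the two records that describe the same structure end up under one key, while the incompletely described record stays in the substance list only.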

So, to answer the question posed at the beginning, what is the difference between a substance and a compound?  A substance is a contributed chemical substance sample description from a particular PubChem data provider.  A compound is a normalized chemical structure representation found in one or more contributed substance descriptions.

To read more on this topic, please consider exploring these links:

PubChemRDF is Launched

Introducing PubChemRDF!

The PubChemRDF project encodes PubChem information using the Resource Description Framework (RDF).  One of the aims of the PubChemRDF project is to help researchers work with PubChem data on local computing resources using semantic web technologies.  Another aim is to harness ontological frameworks to help facilitate PubChem data sharing, analysis, and integration with resources external to the National Center for Biotechnology Information (NCBI) and across scientific domains.

What is RDF?

RDF stands for Resource Description Framework and constitutes a family of World Wide Web Consortium (W3C) specifications for data interchange on the Web. RDF breaks down knowledge into discrete, machine-readable pieces called “triples.” Each “triple” is organized as a trio of “subject-predicate-object.” For example, in the phrase “atorvastatin may treat hypercholesterolemia,” the subject is “atorvastatin,” the predicate is “may treat,” and the object is “hypercholesterolemia.” RDF uses a Uniform Resource Identifier (URI) to name each part of the “subject-predicate-object” triple. A URI looks just like a typical web URL.
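
The atorvastatin example can be written out as a triple in a short Python sketch. The example.org namespace and the three URIs are placeholders invented for illustration, not the identifiers PubChemRDF actually uses; the output line follows the W3C N-Triples syntax, in which each URI is wrapped in angle brackets and the statement ends with a period.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement whose three parts are all URIs."""
    subject: str
    predicate: str
    obj: str


# Hypothetical namespace; real PubChemRDF URIs differ.
EX = "http://example.org/"

statement = Triple(
    subject=EX + "atorvastatin",
    predicate=EX + "may_treat",
    obj=EX + "hypercholesterolemia",
)


def to_ntriples(triple: Triple) -> str:
    """Serialize a URI-only triple as a single N-Triples line."""
    return f"<{triple.subject}> <{triple.predicate}> <{triple.obj}> ."


if __name__ == "__main__":
    print(to_ntriples(statement))
```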

RDF is a core part of semantic web standards.  As an extension of the existing World Wide Web, the semantic web attempts to make it easier for users to find, share, and combine information.  The semantic web leverages the following technologies: Extensible Markup Language (XML), which provides syntax for RDF; Web Ontology Language (OWL), which extends the ability of RDF to encode information; Resource Description Framework (RDF), which expresses knowledge; and RDF query language (SPARQL), which enables query and manipulation of RDF content.

How can PubChemRDF help your research?

PubChem users have frequently expressed interest in having a downloadable, schema-less database. PubChemRDF enables NoSQL-style access to and querying of PubChem data.  Using PubChemRDF, one can download the desired RDF formatted data files from the PubChem FTP site, import them into a triplestore, and query using a SPARQL query interface. There are a number of open-source or commercial triplestores, such as Apache Jena TDB and OpenLink Virtuoso (a list of triplestores can be found here: http://en.wikipedia.org/wiki/Triplestore). Other than triplestores, PubChemRDF data can also be loaded into RDF-aware graph databases such as Neo4j, and graph traversal algorithms can be used to query the RDF graphs. Last but not least, the ontological representation of the PubChem knowledge base allows logical inference, such as forward/backward chaining.
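
To give a feel for what querying a triplestore involves, the sketch below matches a single triple pattern against an in-memory list in plain Python. It is a toy stand-in, not a triplestore or a SPARQL engine: the data and the example.org URIs are invented, and the match function handles only one pattern in which None plays the role of a SPARQL variable.

```python
from typing import Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]

# Hypothetical data; real PubChemRDF files would be loaded into a triplestore.
EX = "http://example.org/"
TRIPLES: List[Triple] = [
    (EX + "atorvastatin", EX + "may_treat", EX + "hypercholesterolemia"),
    (EX + "simvastatin", EX + "may_treat", EX + "hypercholesterolemia"),
    (EX + "aspirin", EX + "may_treat", EX + "pain"),
]


def match(
    triples: List[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield triples that agree with every non-None part of the pattern."""
    for triple in triples:
        if (
            (s is None or triple[0] == s)
            and (p is None or triple[1] == p)
            and (o is None or triple[2] == o)
        ):
            yield triple


if __name__ == "__main__":
    # Roughly comparable to the SPARQL pattern:
    #   SELECT ?s WHERE { ?s ex:may_treat ex:hypercholesterolemia }
    for subject, _, _ in match(
        TRIPLES, p=EX + "may_treat", o=EX + "hypercholesterolemia"
    ):
        print(subject)
```

A real triplestore adds indexing, joins across several patterns, and the full SPARQL language, but the basic operation is the same: find every triple that fits a partially specified subject-predicate-object pattern.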

The RDF data on the PubChem FTP site is arranged in such a way that you only need to download the type of information in which you are interested, so you can avoid downloading parts of PubChem data you will not use.  For example, if you are just interested in computed chemical properties, you only need to download PubChemRDF data in the compound descriptor subdomain. In addition to bulk download, PubChemRDF also provides programmatic data access through a RESTful interface.

Where can you learn more about this?

To get an overview of the PubChemRDF project, please view this presentation.  To learn more about detailed aspects of PubChemRDF and how to use it, please view this presentation. The PubChemRDF Release Notes provide additional technical information about the project.

Additional blog posts will follow on PubChemRDF project topics, including: the FTP site layout, the RESTful interface, and ways to utilize PubChemRDF for research purposes including using SPARQL queries.

PubChem Upload 1.0f Released

Submitting your data to PubChem is now easier than ever. The new PubChem Upload system offers streamlined procedures for data submissions and includes an extensive set of wizards, inline help tips, and templates to assist users.  First released as a beta in April 2013, PubChem Upload is now in final form (1.0f) and replaces the Deposition Gateway as the primary PubChem data submission system.  The PubChem Deposition Gateway, first introduced in April 2005, has been superseded as an interface and will be completely phased out in 2014.

What does it do?

PubChem Upload is a data submission system.  It allows contributors to provide substance descriptions (including chemical structures, names, crosslinks, and comments), assay experiment descriptions, and the results of substances being tested in assays.  There is a great deal of flexibility in the information that can be provided to PubChem.  For example, there are no limits (beyond the practical) on the number of assay readouts or the count of substances per assay that can be provided.  An abbreviated list of PubChem Upload features includes:

  • The means to enter data and descriptive information by web form or by file, based on user preference.
  • Convenient spreadsheet formats (CSV, Excel & OpenOffice) as well as XML-based data specifications accommodate both one-off and frequent data providers.
  • A “Preview” function displays incoming data to show how it will appear in PubChem before being loaded.
  • An automated suite of validation checks helps contributors identify potential issues before data is made public.

Why the new release?

Advances in web technologies provided us the opportunity to enhance the user experience by reducing the time and effort required to make substance descriptions and their associated biological activities available and useful for the public. The new PubChem Upload interface greets a new contributor who may only be interested in making a quick submission with a simple decision-tree set of wizards to guide them through the process of publishing their data in PubChem.  For the experienced user, the wizards can be avoided, and the enhanced upload and editing capabilities used instead.

There are many improvements over the older Deposition Gateway system. One noteworthy feature is that PubChem Upload offers an expanded ability to edit data directly in the browser.  The spreadsheet editor gives PubChem contributors the ability to upload large spreadsheets with minimal reformatting and to edit those large datasets online.

Potential future directions

PubChem staff places a high importance on continuing to improve the submission process and increasing the usefulness of data to the PubChem end-user.  One such direction is the use of controlled vocabulary annotations, or ontologies, such as BAO, GO, and MeSH, to help streamline the description of provided data.  This may, for example, improve the ability of PubChem end-users to utilize and analyze bioactivity results.

The new PubChem Upload system utilizes a RESTful model of data communication between client and server.  As such, it is now technically possible to document and support the creation of upload utilities that can be incorporated into third-party software such as electronic laboratory notebooks (ELNs) and laboratory information management systems (LIMS). Interfacing PubChem Upload directly with a properly configured laboratory data system may dramatically reduce the effort to publish data in PubChem.

Where can I learn more about PubChem Upload?

To get an overview of the PubChem Upload system, please view this presentation.  To get basic information, please read this abbreviated help document.  For a more extensive overview and detailed information about the features, please read the complete help document.