Why contribute your data to PubChem?

PubChem is an open archive of chemical substances and their biological experimental results.  “Open” means that you can put your scientific data in PubChem and that others may use it. What kinds of chemical substances can you provide information about?  All kinds, including small molecule chemicals, RNAs, carbohydrates, peptides, complex mixtures, natural products, and more. You can also provide the results of your biological experiments with these substances. Appropriate biological experimental results include biological assay screens (such as phenotypic, whole cell, defined target, high throughput, dose-response, and validation screens), physical property measurements, and beyond.

There are many reasons to upload your data to PubChem:

  • Maximize the benefit of your research.
    When research data is made publicly available, it helps to promote new scientific discovery. Other researchers can find your data, use it, and build upon it. This can lead to new research collaborations and improved insight into your results, thus helping to increase the impact of your research efforts and advance science more quickly. 
  • Save time and effort in open-access data sharing.
    Maintaining your own data archive and user interface takes precious time and adds to research costs. Data-sharing requirements of journals and granting agencies may be satisfied by use of the PubChem data archiving platform. PubChem provides high-capacity interfaces, so you know your data will be accessible. Given that PubChem is part of NLM, you can rest assured that your data will be preserved and available without login or paywall barriers, now and for the foreseeable future.
  • Maintain control over when your data becomes public.
    Timing when you release scientific data can be critical.  Release data too soon and you might not be able to file a patent or publish a paper.  If you need to time the release of your data with the publication of a paper, the filing of a patent, or in coordination with a grant administrator, you can set a hold-until date of up to one year in the future.  If anything changes, your hold-until date can be adjusted (shortened or extended). 
  • Share your held data with only those you choose.
    When you first submit data to PubChem, you are assigned stable identifiers for your substances and bioassays.  These identifiers can be used to prove that you have submitted data to PubChem even if the data are not yet publicly viewable.  If your data are on hold, you can log in to your PubChem Upload account and dynamically create unique, private URLs to individual data submissions to share with reviewers and collaborators.  At any moment during the hold-until period, you can revoke access to these URLs.

Sharing scientific data is important. PubChem is upgrading its service to make it easier than ever to rapidly upload information about your chemical substances and biological experiment results. Scientific data, however, can be complex. PubChem Upload provides wizards to help guide you through the process of making data public.  In addition, the use of standard spreadsheet formats and private FTP uploads for large datasets helps to streamline data submission.

For more information, please see the following:

If you have any questions concerning these topics, please contact us via email at pubchem-deposit-help@ncbi.nlm.nih.gov.

What is the difference between a substance and a compound in PubChem?

PubChem users sometimes ask about the difference between a substance and a compound.  The question is not surprising as the names “substance” and “compound” alone do not inherently convey the difference.  In PubChem terminology, a substance is a chemical sample description provided by a single source and a compound is a normalized chemical structure representation found in one or more contributed substances.  The distinction is important as PubChem is organized in three separate databases: Compound, Substance, and BioAssay.  The diagram below explains the difference, but let’s explore this further.

Figure: The three PubChem databases (Compound, Substance, and BioAssay)

To understand the different databases in PubChem, it is helpful to know where the information comes from.  PubChem (http://pubchem.ncbi.nlm.nih.gov/) is an open archive of chemical substances and information about their biological activities.  Data is provided by hundreds of contributors (http://pubchem.ncbi.nlm.nih.gov/sources/), including publishers, researchers, chemical vendors, pharmaceutical companies, and a number of important chemical biology resources.  Each of these data sources contributes a description of chemical substance samples for which they have information.

PubChem calls these community-provided sample descriptions “substances.”  Each record found in the PubChem Substance database (http://www.ncbi.nlm.nih.gov/pcsubstance) contains information provided by an individual contributor about a particular chemical substance.  Substance records are independent of each other.  Two different Substance records (from the same or different providers) could provide different information about the same chemical structure.  For example, one substance record may give information about the biological role of aspirin, while another may give information about a research grade sample of aspirin.  The Substance database maintains the provenance of chemical substance information in PubChem.  It helps users see who provided what.  As a result, there may be many substance records about a given molecule, presenting a problem for users who are interested in an aggregated view of information on the molecule.  This is where the PubChem Compound database (http://www.ncbi.nlm.nih.gov/pccompound) comes into play.

Figure: PubChem substance vs. compound

The Compound database is derived from the chemical structure contents found in the Substance database.  Each chemical is computationally examined with a series of validation and normalization steps.  This process results in a normalized representation of the chemical structure for a substance record.  Chemical substances in the Substance database that are not completely described or that fail normalization procedures are not included in the Compound database.  Those substances in the Substance database that pass chemical structure normalization procedures are linked to a “compound” record in the Compound database.  If two substances refer to the same chemical structure, they point to the same compound.  This allows data from different Substance data providers to be aggregated through a common Compound record.  Separate substance records nonetheless remain valuable to users who, for example, might be interested in the provenance of a substance or in a particular state of the chemical (e.g., a different tautomeric form).  In essence, a primary purpose of the PubChem Compound database is to provide a “non-redundant” view of the depositor-contributed chemical structure contents stored in the PubChem Substance database.
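The relationship between substances and compounds can be sketched in a few lines of Python. This is only an illustration of the idea, not PubChem's actual pipeline: the record fields, the identifiers, the source names, and the lookup-table "normalization" are all assumptions made for the example (real normalization involves a series of chemical validation and standardization steps).

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class Substance:
    sid: int                  # hypothetical substance identifier
    source: str               # who contributed the record (provenance)
    structure: Optional[str]  # depositor-provided structure (SMILES), if any


# Toy stand-in for structure normalization (an assumption for this sketch):
# two different ways of writing aspirin map to one canonical key.
ASPIRIN = "CC(=O)OC1=CC=CC=C1C(=O)O"
_KNOWN_FORMS = {
    ASPIRIN: ASPIRIN,
    "O=C(C)Oc1ccccc1C(=O)O": ASPIRIN,
}


def normalize(structure: Optional[str]) -> Optional[str]:
    """Return a canonical key, or None if the structure is missing or unknown."""
    if structure is None:
        return None
    return _KNOWN_FORMS.get(structure)


def build_compounds(substances: Iterable[Substance]) -> Dict[str, List[int]]:
    """Group substance IDs under the normalized structure they share."""
    compounds: Dict[str, List[int]] = defaultdict(list)
    for substance in substances:
        key = normalize(substance.structure)
        if key is not None:  # substances that fail normalization get no compound
            compounds[key].append(substance.sid)
    return dict(compounds)


substances = [
    Substance(1, "Vendor A", "CC(=O)OC1=CC=CC=C1C(=O)O"),
    Substance(2, "Lab B", "O=C(C)Oc1ccccc1C(=O)O"),
    Substance(3, "Publisher C", None),  # no fully described structure
]
compounds = build_compounds(substances)
```

Here substances 1 and 2 end up under the same compound key, while substance 3 stays a substance-only record. All three records still exist individually, which mirrors how provenance is preserved.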

So, to answer the question posed at the beginning, what is the difference between a substance and a compound?  A substance is a contributed chemical substance sample description from a particular PubChem data provider.  A compound is a normalized chemical structure representation found in one or more contributed substance descriptions.

To read more on this topic, please consider exploring these links:

PubChemRDF is Launched

Introducing PubChemRDF!

The PubChemRDF project encodes PubChem information using the Resource Description Framework (RDF).  One of the aims of the PubChemRDF project is to help researchers work with PubChem data on local computing resources using semantic web technologies.  Another aim is to harness ontological frameworks to help facilitate PubChem data sharing, analysis, and integration with resources external to the National Center for Biotechnology Information (NCBI) and across scientific domains.

What is RDF?

RDF stands for Resource Description Framework and constitutes a family of World Wide Web Consortium (W3C) specifications for data interchange on the Web. RDF breaks down knowledge into discrete, machine-readable pieces called “triples.” Each triple is organized as a trio of “subject-predicate-object.” For example, in the phrase “atorvastatin may treat hypercholesterolemia,” the subject is “atorvastatin,” the predicate is “may treat,” and the object is “hypercholesterolemia.” RDF uses a Uniform Resource Identifier (URI) to name each part of the “subject-predicate-object” triple. A URI looks just like a typical web URL.
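As a rough illustration, the atorvastatin example can be written as a triple of URIs and serialized in the W3C N-Triples line format. The URIs below live under example.org and are placeholders invented for this sketch; they are not the identifiers PubChemRDF actually uses.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    predicate: str
    obj: str

    def to_ntriples(self) -> str:
        """Serialize a URI-only triple as one N-Triples line."""
        return f"<{self.subject}> <{self.predicate}> <{self.obj}> ."


# Placeholder namespace (an assumption; not a real PubChemRDF namespace).
EX = "http://example.org/"

triple = Triple(
    subject=EX + "drug/atorvastatin",
    predicate=EX + "vocab/may_treat",
    obj=EX + "disease/hypercholesterolemia",
)
line = triple.to_ntriples()
print(line)
```

Each of the three parts is a URI, so a program can tell that two statements refer to the same drug or disease simply by comparing strings.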

RDF is a core part of semantic web standards.  As an extension of the existing World Wide Web, the semantic web attempts to make it easier for users to find, share, and combine information.  The semantic web leverages the following technologies: Extensible Markup Language (XML), which provides a syntax for RDF; Web Ontology Language (OWL), which extends the ability of RDF to encode information; Resource Description Framework (RDF), which expresses knowledge; and the RDF query language (SPARQL), which enables query and manipulation of RDF content.

How can PubChemRDF help your research?

PubChem users have frequently expressed interest in having a downloadable, schema-less database. PubChemRDF enables NoSQL-style access to and querying of PubChem data.  Using PubChemRDF, one can download the desired RDF-formatted data files from the PubChem FTP site, import them into a triplestore, and query them using a SPARQL query interface. There are a number of open-source and commercial triplestores, such as Apache Jena TDB and OpenLink Virtuoso (a list of triplestores can be found here: http://en.wikipedia.org/wiki/Triplestore). Besides triplestores, PubChemRDF data can also be loaded into RDF-aware graph databases such as Neo4j, where graph traversal algorithms can be used to query the RDF graphs. Last but not least, the ontological representation of the PubChem knowledge base allows logical inference, such as forward/backward chaining.
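To give a feel for the download, import, and query workflow without installing a triplestore, here is a minimal in-memory sketch. It is a toy stand-in, not SPARQL and not a real triplestore: the parser handles only URI-only N-Triples lines, the sample data uses made-up example.org URIs, and the function names are hypothetical.

```python
import re
from typing import Iterable, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]

# Matches one N-Triples line whose subject, predicate, and object are all URIs.
_URI_TRIPLE = re.compile(r"^<([^>]*)>\s+<([^>]*)>\s+<([^>]*)>\s*\.$")


def load_triples(lines: Iterable[str]) -> Set[Triple]:
    """'Import' step: parse URI-only N-Triples lines into a set of triples."""
    store: Set[Triple] = set()
    for line in lines:
        m = _URI_TRIPLE.match(line.strip())
        if m:  # skip blank lines and anything this toy parser cannot handle
            store.add((m.group(1), m.group(2), m.group(3)))
    return store


def match(store: Set[Triple], s: Optional[str] = None,
          p: Optional[str] = None, o: Optional[str] = None) -> List[Triple]:
    """'Query' step: return triples matching a pattern; None is a wildcard."""
    return sorted(
        t for t in store
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


EX = "http://example.org/"  # placeholder namespace, not a PubChemRDF one
data = f"""
<{EX}drug/atorvastatin> <{EX}vocab/may_treat> <{EX}disease/hypercholesterolemia> .
<{EX}drug/simvastatin> <{EX}vocab/may_treat> <{EX}disease/hypercholesterolemia> .
<{EX}drug/atorvastatin> <{EX}vocab/has_role> <{EX}role/statin> .
"""

store = load_triples(data.splitlines())

# Roughly what this SPARQL query would ask of a real triplestore:
#   SELECT ?drug WHERE {{ ?drug <.../vocab/may_treat> <.../disease/hypercholesterolemia> }}
hits = match(store, p=EX + "vocab/may_treat", o=EX + "disease/hypercholesterolemia")
drugs = [subject for subject, _, _ in hits]
print(drugs)
```

A real triplestore adds indexing, the full SPARQL language, and support for literals and blank nodes, but the underlying idea of matching triple patterns is the same.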

The RDF data on the PubChem FTP site is arranged in such a way that you only need to download the type of information in which you are interested, so you can avoid downloading parts of PubChem data you will not use.  For example, if you are just interested in computed chemical properties, you only need to download the PubChemRDF data in the compound descriptor subdomain. In addition to bulk download, PubChemRDF also provides programmatic data access through a RESTful interface.

Where can you learn more about this?

To get an overview of the PubChemRDF project, please view this presentation.  To learn more about detailed aspects of PubChemRDF and how to use it, please view this presentation. The PubChemRDF Release Notes provide additional technical information about the project.

Additional blog posts will follow on PubChemRDF project topics, including the FTP site layout, the RESTful interface, and ways to utilize PubChemRDF for research purposes, including the use of SPARQL queries.

PubChem Upload 1.0f Released

Submitting your data to PubChem is now easier than ever. The new PubChem Upload system offers streamlined procedures for data submissions and includes an extensive set of wizards, inline help tips, and templates to assist users.  First released as a beta in April 2013, PubChem Upload is now in final form (1.0f) and replaces the Deposition Gateway as the primary PubChem data submission system.  The PubChem Deposition Gateway, first introduced in April 2005, has been superseded as an interface and will be completely phased out in 2014.

What does it do?

PubChem Upload is a data submission system.  It allows contributors to provide substance descriptions (including chemical structures, names, crosslinks, and comments), assay experiment descriptions, and the results of substances being tested in assays.  There is a great deal of flexibility in the information that can be provided to PubChem.  For example, there are no limits (beyond the practical) on the number of assay readouts or the count of substances per assay that can be provided.  An abbreviated list of PubChem Upload features includes:

  • The means to enter data and descriptive information by web form or by file, based on user preference.
  • Convenient spreadsheet formats (CSV, Excel & OpenOffice) as well as XML-based data specifications accommodate both one-off and frequent data providers.
  • A “Preview” function displays incoming data to show how it will appear in PubChem before being loaded.
  • An automated suite of validation checks helps contributors identify potential issues before data is made public.

Why the new release?

Advances in web technologies provided us the opportunity to enhance the user experience by reducing the time and effort required to make substance descriptions and their associated biological activities available and useful for the public. The new PubChem Upload interface greets a new contributor who may only be interested in making a quick submission with a simple decision-tree set of wizards to guide them through the process of publishing their data in PubChem.  For the experienced user, the wizards can be avoided, and the enhanced upload and editing capabilities used instead.

There are many improvements over the older Deposition Gateway system. One noteworthy feature is that PubChem Upload offers an expanded ability to edit data directly in the browser.  The spreadsheet editor gives PubChem contributors the ability to upload large spreadsheets with minimal reformatting and to edit those large datasets online.

Potential future directions

PubChem staff place high importance on continuing to improve the submission process and on increasing the usefulness of data to the PubChem end-user.  One such direction is the use of controlled vocabulary annotations, or ontologies, such as BAO, GO, and MeSH, to help streamline the description of provided data.  This may, for example, improve the ability of PubChem end-users to utilize and analyze bioactivity results.

The new PubChem Upload system utilizes a RESTful model of data communication between client and server.  As such, it is now technically possible to document and support the creation of upload utilities that can be incorporated into third-party software such as ELNs and LIMS. Interfacing PubChem Upload directly with a properly configured laboratory data system may dramatically reduce the effort to publish data in PubChem.

Where can I learn more about PubChem Upload?

To get an overview of the PubChem Upload system, please view this presentation.  To get basic information, please read this abbreviated help document.  For a more extensive overview and detailed information about the features, please read the complete help document.