PubChemRDF is Launched

Introducing PubChemRDF!

The PubChemRDF project encodes PubChem information using the Resource Description Framework (RDF). One aim of the project is to help researchers work with PubChem data on local computing resources using semantic web technologies. Another aim is to harness ontological frameworks to facilitate PubChem data sharing, analysis, and integration with resources external to the National Center for Biotechnology Information (NCBI) and across scientific domains.

What is RDF?

RDF stands for Resource Description Framework and constitutes a family of World Wide Web Consortium (W3C) specifications for data interchange on the Web. RDF breaks knowledge down into discrete, machine-readable pieces called “triples.” Each triple is organized as a trio of “subject-predicate-object.” For example, in the phrase “atorvastatin may treat hypercholesterolemia,” the subject is “atorvastatin,” the predicate is “may treat,” and the object is “hypercholesterolemia.” RDF uses a Uniform Resource Identifier (URI) to name each part of the “subject-predicate-object” triple. A URI looks just like a typical web URL.
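To make the triple idea concrete, here is a minimal Python sketch of the atorvastatin example. The `http://example.org/` URIs are hypothetical placeholders chosen for illustration, not identifiers used by PubChemRDF, and the `Triple` class and `to_ntriples` helper are names invented here. The output line follows the N-Triples convention of writing each URI in angle brackets and ending the statement with a period.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement: subject, predicate, and object, each named by a URI."""
    subject: str
    predicate: str
    obj: str


def to_ntriples(t: Triple) -> str:
    """Serialize a triple whose three parts are all URIs as one N-Triples line."""
    return f"<{t.subject}> <{t.predicate}> <{t.obj}> ."


# Hypothetical URIs, for illustration only.
BASE = "http://example.org/"
statement = Triple(
    subject=BASE + "drug/atorvastatin",
    predicate=BASE + "vocab/may_treat",
    obj=BASE + "disease/hypercholesterolemia",
)

print(to_ntriples(statement))
```

Because every part of the statement is a URI, two datasets that use the same URI for atorvastatin can be merged simply by pooling their triples.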

RDF is a core part of semantic web standards. As an extension of the existing World Wide Web, the semantic web attempts to make it easier for users to find, share, and combine information. The semantic web leverages the following technologies: Resource Description Framework (RDF), which expresses knowledge; Extensible Markup Language (XML), which provides a syntax for RDF; Web Ontology Language (OWL), which extends the ability of RDF to encode information; and SPARQL Protocol and RDF Query Language (SPARQL), which enables query and manipulation of RDF content.

How can PubChemRDF help your research?

PubChem users have frequently expressed interest in having a downloadable, schema-less database. PubChemRDF enables NoSQL-style access to, and querying of, PubChem data. Using PubChemRDF, one can download the desired RDF-formatted data files from the PubChem FTP site, import them into a triplestore, and query them through a SPARQL query interface. A number of open-source and commercial triplestores are available, such as Apache Jena TDB and OpenLink Virtuoso (a list of triplestores can be found here: http://en.wikipedia.org/wiki/Triplestore). Besides triplestores, PubChemRDF data can also be loaded into RDF-aware graph databases such as Neo4j, where graph traversal algorithms can be used to query the RDF graphs. Last but not least, the ontological representation of the PubChem knowledge base allows logical inference, such as forward or backward chaining.
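As a rough illustration of what querying loaded triples involves, the sketch below matches triple patterns against a small in-memory list, with `None` playing the role of a SPARQL variable. It is a toy stand-in for a real triplestore, not how Jena TDB or Virtuoso work internally, and the data, the `http://example.org/` URIs, and the function names are all hypothetical.

```python
from typing import Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]

# Hypothetical data standing in for triples loaded from downloaded RDF files.
EX = "http://example.org/"
STORE: List[Triple] = [
    (EX + "compound/A", EX + "vocab/has_name", "atorvastatin"),
    (EX + "compound/A", EX + "vocab/may_treat", EX + "disease/hypercholesterolemia"),
    (EX + "compound/B", EX + "vocab/has_name", "aspirin"),
    (EX + "compound/B", EX + "vocab/may_treat", EX + "disease/pain"),
]


def match(
    store: List[Triple],
    s: Optional[str] = None,
    p: Optional[str] = None,
    o: Optional[str] = None,
) -> Iterator[Triple]:
    """Yield every triple that fits the pattern; None acts as a wildcard."""
    for triple in store:
        if all(want is None or want == got for want, got in zip((s, p, o), triple)):
            yield triple


def names_of_compounds_treating(store: List[Triple], disease_uri: str) -> List[str]:
    """Join two patterns on a shared subject, roughly like the SPARQL query
    SELECT ?name WHERE { ?c ex:may_treat <disease> . ?c ex:has_name ?name }
    """
    names = []
    for subject, _, _ in match(store, p=EX + "vocab/may_treat", o=disease_uri):
        for _, _, name in match(store, s=subject, p=EX + "vocab/has_name"):
            names.append(name)
    return names


print(names_of_compounds_treating(STORE, EX + "disease/hypercholesterolemia"))
```

A real triplestore answers the same kind of question with indexes rather than a linear scan, which is what makes it practical at the scale of PubChem.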

The RDF data on the PubChem FTP site is arranged so that you only need to download the type of information in which you are interested, avoiding the parts of PubChem data you will not use. For example, if you are interested only in computed chemical properties, you only need to download the PubChemRDF data in the compound descriptor subdomain. In addition to bulk download, PubChemRDF also provides programmatic data access through a RESTful interface.
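Programmatic access of this kind usually comes down to building an address for one resource and requesting it. The sketch below only constructs such an address. The base URL, the subdomain/identifier path layout, and the format suffix are assumptions made for illustration, and `resource_url` is a hypothetical helper; check the PubChemRDF Release Notes for the actual interface before relying on any of it.

```python
from urllib.parse import quote

# Assumed base address and path layout; verify against the PubChemRDF documentation.
BASE_URL = "https://rdf.ncbi.nlm.nih.gov/pubchem"


def resource_url(subdomain: str, identifier: str, fmt: str = "json") -> str:
    """Build the address of one RDF resource in the requested serialization.

    Each path component is percent-encoded so unexpected characters cannot
    change the structure of the URL.
    """
    return f"{BASE_URL}/{quote(subdomain, safe='')}/{quote(identifier, safe='')}.{quote(fmt, safe='')}"


# Hypothetical example: the compound record with identifier CID2244.
print(resource_url("compound", "CID2244"))
```

If the assumed layout holds, the resulting address could be fetched with `urllib.request.urlopen` and the response parsed with a standard JSON or RDF library.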

Where can you learn more about this?

To get an overview of the PubChemRDF project, please view this presentation.  To learn more about detailed aspects of PubChemRDF and how to use it, please view this presentation. The PubChemRDF Release Notes provide additional technical information about the project.

Additional blog posts will follow on PubChemRDF project topics, including the FTP site layout, the RESTful interface, and ways to use PubChemRDF for research purposes, including SPARQL queries.

PubChem Upload 1.0f Released

Submitting your data to PubChem is now easier than ever. The new PubChem Upload system offers streamlined procedures for data submissions and includes an extensive set of wizards, inline help tips, and templates to assist users.  First released as a beta in April 2013, PubChem Upload is now in final form (1.0f) and replaces the Deposition Gateway as the primary PubChem data submission system.  The PubChem Deposition Gateway, first introduced in April 2005, has been superseded as an interface and will be completely phased out in 2014.

What does it do?

PubChem Upload is a data submission system. It allows contributors to provide substance descriptions (including chemical structures, names, crosslinks, and comments), assay experiment descriptions, and the results of substances being tested in assays. There is a great deal of flexibility in the information that can be provided to PubChem. For example, there are no limits (beyond the practical) on the number of assay readouts or the count of substances per assay that can be provided. An abbreviated list of PubChem Upload features includes:

  • The means to enter data and descriptive information by web form or by file, based on user preference.
  • Convenient spreadsheet formats (CSV, Excel & OpenOffice) as well as XML-based data specifications accommodate both one-off and frequent data providers.
  • A “Preview” function displays incoming data to show how it will appear in PubChem before being loaded.
  • An automated suite of validation checks helps contributors identify potential issues before data is made public.

Why the new release?

Advances in web technologies gave us the opportunity to enhance the user experience by reducing the time and effort required to make substance descriptions and their associated biological activities available and useful to the public. For a new contributor who may only be interested in making a quick submission, the new PubChem Upload interface offers a simple decision-tree set of wizards that guides them through the process of publishing their data in PubChem. Experienced users can bypass the wizards and use the enhanced upload and editing capabilities instead.

There are many improvements over the older Deposition Gateway system. One noteworthy feature is that PubChem Upload offers an expanded ability to edit data directly in the browser.  The spreadsheet editor gives PubChem contributors the ability to upload large spreadsheets with minimal reformatting and to edit those large datasets online.

Potential future directions

PubChem staff place high importance on continuing to improve the submission process and increasing the usefulness of data to the PubChem end-user. One such direction is the use of controlled vocabulary annotations, or ontologies, such as the BioAssay Ontology (BAO), the Gene Ontology (GO), and Medical Subject Headings (MeSH), to help streamline the description of provided data. This may, for example, improve the ability of PubChem end-users to use and analyze bioactivity results.

The new PubChem Upload system uses a RESTful model of data communication between client and server. As such, it is now technically possible to document and support the creation of upload utilities that can be incorporated into third-party software such as electronic laboratory notebooks (ELNs) and laboratory information management systems (LIMS). Interfacing PubChem Upload directly with a properly configured laboratory data system may dramatically reduce the effort needed to publish data in PubChem.

Where can I learn more about PubChem Upload?

To get an overview of the PubChem Upload system, please view this presentation.  To get basic information, please read this abbreviated help document.  For a more extensive overview and detailed information about the features, please read the complete help document.