PubChem is updating the data model it uses to store bioassay information. This update changes the format of data uploaded to or downloaded from PubChem. As a result, assay data depositors need to format their data according to the new data model before submitting them to PubChem. Software programs that download PubChem’s assay data (e.g., in ASN.1, XML, or CSV) for further analysis will also need to be updated to load the data correctly.
Major changes to the assay data specifications
Some important changes in the new data model are summarized below. A full data specification is available at the PubChem FTP site.
- Changes to panel assay specification
A panel assay contains bioactivity data for multiple targets (sometimes up to thousands). In the past, the data for each target in a panel assay were stored in a few columns of the data table. As a result, data tables varied widely in their number of columns (up to tens of thousands), which made panel assay data difficult to handle and display. In the new data model, the input format for panel assays is no longer column-based; each data point is stored in its own row, as shown in this example:
Our upload system will make changes accordingly. Note that all archived panel assays have been converted to this new row-based format.
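For software that still holds panel data in the old column-based layout, the move to one row per data point can be sketched roughly as follows. This is only an illustration of the idea, not the PubChem specification: the column names (`SID`, `TargetA_IC50`, and so on) and the `target_endpoint` naming scheme are hypothetical, and the authoritative layout is the full data specification on the FTP site.

```python
import csv
import io

# Hypothetical column-based layout: one activity column per target.
# These column names are assumptions for illustration only.
WIDE_CSV = """SID,TargetA_IC50,TargetB_IC50
1001,0.5,12.0
1002,3.2,
"""

def wide_to_rows(wide_text, id_column="SID"):
    """Turn one row per substance into one row per (substance, target) data point."""
    rows = []
    for record in csv.DictReader(io.StringIO(wide_text)):
        for column, value in record.items():
            if column == id_column or not value:
                continue  # skip the identifier column and empty cells
            # Assumed naming scheme: "<target>_<endpoint>"
            target, endpoint = column.rsplit("_", 1)
            rows.append({
                id_column: record[id_column],
                "target": target,
                "endpoint": endpoint,
                "value": value,
            })
    return rows

long_rows = wide_to_rows(WIDE_CSV)
for row in long_rows:
    print(row)
```

With a row-based layout, adding a target adds rows rather than columns, so the table keeps the same fixed set of columns no matter how large the panel is.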
- GI to accession
In the past, numeric identifiers called GI numbers were used to specify the proteins or nucleotides relevant to PubChem bioassays (e.g., assay targets or cross-references). However, NCBI has phased out the use of GI numbers in its databases, as explained in a series of blog posts. Accordingly, GIs have been replaced with accessions in the assay specification, and new assay submissions will accept accessions only. All GIs in archived blobs have been converted to accessions.
- Inclusion of endpoint qualifiers
Endpoint qualifiers (e.g., >, >=, =, <, <=) are now included in the data specification. Without these qualifiers, bioactivity data can be misinterpreted. For example, a compound with IC50 = 1 mM against a given target has a different bioactivity from one with IC50 > 1 mM or IC50 < 1 mM against the same target, yet all three look the same once the qualifier is dropped. Although many assays in PubChem already carried this qualifier information, users often unknowingly ignored it when analyzing assay data. To address this issue, the new data format explicitly includes the endpoint qualifiers, e.g., the “Standard Relation” field, as shown in this example:
All existing assays with the qualifier information have been annotated accordingly.
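When downloaded data are analyzed, the qualifier can be read together with the value rather than discarded. The sketch below shows one possible way to do this, turning each qualifier into the range of values it implies. Only the “Standard Relation” field name comes from the text above; the other column names, the sample rows, and the `interpret` helper are hypothetical.

```python
import csv
import io

# Hypothetical rows; only the "Standard Relation" field name is taken from
# the announcement. The other column names are assumptions.
SAMPLE = """SID,Standard Relation,Standard Value,Standard Units
2001,=,1.0,mM
2002,>,1.0,mM
2003,<,1.0,mM
"""

def interpret(relation, value):
    """Return (lower, upper) bounds implied by the qualifier; None means unbounded."""
    if relation == "=":
        return (value, value)
    if relation in (">", ">="):
        return (value, None)
    if relation in ("<", "<="):
        return (None, value)
    raise ValueError(f"unexpected qualifier: {relation!r}")

bounds = {}
for row in csv.DictReader(io.StringIO(SAMPLE)):
    bounds[row["SID"]] = interpret(row["Standard Relation"], float(row["Standard Value"]))

print(bounds)
```

All three rows carry the same numeric value, but the bounds differ, which is exactly the distinction that is lost when the qualifier is ignored. Note that this simple helper does not distinguish strict from non-strict inequalities.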
- UTF-8 character support
Assay data archived in PubChem often contain non-ASCII characters, which were not presented correctly in text files. Examples include Greek letters (α, β, γ, …), commonly used in target names (e.g., β-glucuronidase), and unit symbols (°C or °F), often found in experimental protocols. The new data format supports UTF-8 characters, as exemplified in the following assay (note that the assay title contains the character “β”):
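For programs that read downloaded text files, the practical consequence is to decode them explicitly as UTF-8 rather than relying on a platform default. The following is a minimal sketch under that assumption; the column names and the sample title are made up for illustration and are not taken from a real assay record.

```python
import csv
import io

# Bytes as they might arrive from a download. The column names and the
# title are invented for this example.
raw = "AID,Title\n1,Inhibition of β-glucuronidase at 37 °C\n".encode("utf-8")

# Decode explicitly as UTF-8 instead of relying on the platform default.
text = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", newline="")
rows = list(csv.DictReader(text))
title = rows[0]["Title"]  # non-ASCII characters survive intact

# Decoding the same bytes with a single-byte encoding such as Latin-1
# garbles every non-ASCII character.
garbled = raw.decode("latin-1")
```

When reading from disk, the equivalent is `open(path, encoding="utf-8", newline="")`, where `path` is whatever local file the data were saved to.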
The transition plans
All existing assay data have been converted to the new data format and are publicly available on our web pages and at our FTP site (under the Bioassay2 directory). Data in the old format are still available at the FTP site (under the Bioassay directory) but will be archived as /Other/Bioassay1 by June 1, 2021. The Bioassay2 directory will then be renamed Bioassay and become the default.