Updates to the PubChem Data Model

PubChem is updating the data model for objects returned by the PUG View server. These objects are used by both programmatic users and by PubChem web pages. PubChem web users will not be directly affected by the data model changes. Programmatic users, however, will need to update the programs that retrieve and interpret data from PUG View. The following major changes are being made to the data model of the PUG View JSON/XML blobs:

  1. No more HTML markup within strings; instead, we will have an explicit markup object that separates primary strings from the various markup types.
  2. All values are lists; there are no longer separate fields for single values.
  3. No more embedded tables in the data blobs.

 

No more HTML markup within strings.

PubChem is making a major effort to remove all embedded HTML from within the various strings in the data blobs. Such embedded markup is difficult for parsers to deal with when only a plain string is desired. For example, this is the old model:

{
    "StringValue": "Flipo RM: [Are the NSAIDs able to compromising the cardio-preventive efficacy of <a class=\"pubchem-internal-link CID-2244\" href=\"https://pubchem.ncbi.nlm.nih.gov/compound/aspirin\">aspirin</a>?]. Presse Med. 2006 Sep;35(9 Spec No 1):1S53-60.",
    "URL": "https://www.ncbi.nlm.nih.gov/pubmed/17078596"
}

In the new data model, the main string is plain text, and the URL links (or other types of markup) are stored separately. Each markup entry gives its position in the string as a start offset and a length in characters. In the example below the offsets are zero-based: a start of 81 points at the “a” of “aspirin”. For example:

{
    "String": "Flipo RM: [Are the NSAIDs able to compromising the cardio-preventive efficacy of aspirin?]. Presse Med. 2006 Sep;35(9 Spec No 1):1S53-60. [PMID: 17078596]",
    "Markup": [
        {
            "Start": 139,
            "Length": 14,
            "URL": "https://www.ncbi.nlm.nih.gov/pubmed/17078596",
            "Type": "General link"
        },
        {
            "Start": 81,
            "Length": 7,
            "URL": "https://pubchem.ncbi.nlm.nih.gov/compound/aspirin",
            "Type": "PubChem Internal Link",
            "Extra": "CID-2244"
        }
    ]
}

This new format will make it easier for parsers to get at the relevant text data without a lot of programming overhead. Please note that this removal of embedded HTML also covers HTML escaped entities. Each of these will instead be represented by a single UTF-8 character within the base string (for example, “&deg;” → “°”).
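As an illustration of how a parser might consume the markup object, here is a minimal Python sketch built on the example above. It assumes that "Start" is a zero-based character offset (which matches the sample values) and that markup spans do not overlap. The helper names marked_spans and to_html are hypothetical and are not part of any PubChem library.

```python
from html import escape

# Sample record copied from the new-model example above.
record = {
    "String": (
        "Flipo RM: [Are the NSAIDs able to compromising the cardio-preventive "
        "efficacy of aspirin?]. Presse Med. 2006 Sep;35(9 Spec No 1):1S53-60. "
        "[PMID: 17078596]"
    ),
    "Markup": [
        {"Start": 139, "Length": 14,
         "URL": "https://www.ncbi.nlm.nih.gov/pubmed/17078596",
         "Type": "General link"},
        {"Start": 81, "Length": 7,
         "URL": "https://pubchem.ncbi.nlm.nih.gov/compound/aspirin",
         "Type": "PubChem Internal Link", "Extra": "CID-2244"},
    ],
}


def marked_spans(rec):
    """Return (substring, markup) pairs; assumes zero-based Start offsets."""
    text = rec["String"]
    return [(text[m["Start"]:m["Start"] + m["Length"]], m)
            for m in rec.get("Markup", [])]


def to_html(rec):
    """Rebuild a linked HTML string from the plain text (hypothetical helper)."""
    text = rec["String"]
    out, pos = [], 0
    # Walk the markup left to right; assumes spans do not overlap.
    for m in sorted(rec.get("Markup", []), key=lambda m: m["Start"]):
        start, end = m["Start"], m["Start"] + m["Length"]
        out.append(escape(text[pos:start]))
        span = escape(text[start:end])
        if "URL" in m:
            span = '<a href="{}">{}</a>'.format(escape(m["URL"]), span)
        out.append(span)
        pos = end
    out.append(escape(text[pos:]))
    return "".join(out)


for substring, markup in marked_spans(record):
    print(markup["Type"], "->", substring)
```

If you only need the plain text, you can read record["String"] directly and ignore "Markup" altogether.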

 

All values are lists.

In the new PubChem data model, all values are being converted to list types. We’ve done this so that data parsers no longer have to check separate fields for single values vs. lists. We’ll use JSON format for the following examples, but the XML data model is parallel to the JSON. Here are two examples of the old format:

{
    "ReferenceNumber": 19,
    "Name": "Melting Point",
    "Reference": [
        "PhysProp"
    ],
    "NumValue": 135,
    "ValueUnit": "°C"
}

or

"Information": [
    {
        "ReferenceNumber": 135,
        "Name": "Standard non-polar",
        "NumValueList": [
            1270,
            1315,
            1309,
            1309
        ]
    }
]

In the examples above, note the fields “NumValue” vs. “NumValueList” – this is cumbersome to code against. In the new system, the examples above would look like this, “Number” being used in both cases within a list structure:

{
    "ReferenceNumber": 35,
    "Name": "Melting Point",
    "Reference": [
        "PhysProp"
    ],
    "Value": {
        "Number": [
            135
        ],
        "Unit": "°C"
    }
}

or

"Information": [
    {
        "ReferenceNumber": 75,
        "Name": "Standard non-polar",
        "Value": {
            "Number": [
                1270,
                1315,
                1309,
                1309
            ]
        }
    }
]

These changes will make it easier to code your data parsers.
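To make the benefit concrete, here is a small Python sketch that reads numeric values the same way for both of the new-model examples above. It also includes a hypothetical upgrade_numeric helper that converts the old numeric fields shown earlier ("NumValue", "NumValueList", "ValueUnit") into the new shape, which may be handy while you transition existing code. Both function names are our own inventions for illustration, and the sketch covers only the fields shown in this post.

```python
def numbers(info):
    """Return the numeric values of a new-model entry, always as a list."""
    return info.get("Value", {}).get("Number", [])


def upgrade_numeric(old):
    """Convert an old-model numeric entry to the new shape (hypothetical helper)."""
    old_keys = ("NumValue", "NumValueList", "ValueUnit")
    new = {k: v for k, v in old.items() if k not in old_keys}
    if "NumValueList" in old:
        value = {"Number": list(old["NumValueList"])}
    elif "NumValue" in old:
        value = {"Number": [old["NumValue"]]}
    else:
        return new  # nothing numeric to convert
    if "ValueUnit" in old:
        value["Unit"] = old["ValueUnit"]
    new["Value"] = value
    return new


# New-model entries copied from the examples above.
melting_point = {
    "ReferenceNumber": 35,
    "Name": "Melting Point",
    "Reference": ["PhysProp"],
    "Value": {"Number": [135], "Unit": "°C"},
}
retention_index = {
    "ReferenceNumber": 75,
    "Name": "Standard non-polar",
    "Value": {"Number": [1270, 1315, 1309, 1309]},
}

# One code path handles both the single value and the list.
for info in (melting_point, retention_index):
    unit = info["Value"].get("Unit", "")
    print(info["Name"], [f"{n} {unit}".strip() for n in numbers(info)])

# Old-model entry copied from the first old-format example.
old_entry = {
    "ReferenceNumber": 19,
    "Name": "Melting Point",
    "Reference": ["PhysProp"],
    "NumValue": 135,
    "ValueUnit": "°C",
}
print(upgrade_numeric(old_entry))
```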

 

No more embedded tables in the data blobs.

The use of embedded tables in the old system made it difficult for programmatic users to extract specific fields from within a table. The format required you to dig down into the rows and cells of the table to find the value you needed. For example:

"Information": [
    {
        "ReferenceNumber": 182,
        "Name": "Computed Properties",
        "Table": {
            "ColumnName": [
                "Property Name",
                "Property Value"
            ],
            "Row": [
                {
                    "Cell": [
                        {
                            "StringValue": "Molecular Weight"
                        },
                        {
                            "NumValue": 180.159,
                            "ValueUnit": "g/mol"
                        }
                    ]
                }, …

In the new data model, shown below, the fields are explicitly labeled with section names, in the same way as other (non-table) values in the data:

{
    "TOCHeading": "Molecular Weight",
    "Description": "Molecular weight or molecular mass refers to the mass of a molecule. It is calculated as the sum of the mass of each constituent atom multiplied by the number of atoms of that element in the molecular formula.",
    "Information": [
        {
            "ReferenceNumber": 120,
            "Name": "Molecular Weight",
            "Value": {
                "Number": [
                    180.159
                ],
                "Unit": "g/mol"
            }
        }
    ]
}

These changes will make it easier to retrieve data from tables without a lot of programming overhead.
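As a rough sketch of what retrieval can look like under the new model, the Python below searches a parsed record for a section by its "TOCHeading" and then reads the value directly. The find_section helper is hypothetical, and the outer "Section" nesting in the sample record is an assumption added for illustration; only the inner "Molecular Weight" section comes from the example above (its "Description" is omitted for brevity).

```python
def find_section(node, heading):
    """Depth-first search for a dict whose TOCHeading equals `heading`."""
    if isinstance(node, dict):
        if node.get("TOCHeading") == heading:
            return node
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None  # strings, numbers, etc. cannot contain sections
    for child in children:
        found = find_section(child, heading)
        if found is not None:
            return found
    return None


# Section copied from the new-model example above (Description omitted).
molecular_weight = {
    "TOCHeading": "Molecular Weight",
    "Information": [
        {
            "ReferenceNumber": 120,
            "Name": "Molecular Weight",
            "Value": {"Number": [180.159], "Unit": "g/mol"},
        }
    ],
}

# Hypothetical wrapper: the "Section" key and parent heading are assumptions.
record = {
    "Section": [
        {"TOCHeading": "Example parent section", "Section": [molecular_weight]}
    ]
}

section = find_section(record, "Molecular Weight")
value = section["Information"][0]["Value"]
print(value["Number"][0], value["Unit"])
```

Because the search walks every dict and list it encounters, it does not depend on the exact name of the key that holds nested sections.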

 

We want to know what you think!

In summary, PubChem’s new data model makes it easier to retrieve the data you need. As the data model is updated and released, you’ll be able to find detailed information on the schema here: https://pubchemdocs.ncbi.nlm.nih.gov/pug-view.

What’s working well? What’s not? What’s missing? Send an email to pubchem-help@ncbi.nlm.nih.gov.