September 16, 2004 is a special day in the history of PubChem (https://pubchem.ncbi.nlm.nih.gov/). It marks the beginning of PubChem as an on-line resource. Now fast forward ten years. PubChem provides information daily to many tens of thousands of users. Despite the passage of time, PubChem’s primary mission remains the same: providing comprehensive information on the biological activities of chemical substances.
PubChem has faced many challenges over the years. Chief among them is scalability. For example, within the first year of operation, the amount of available data in PubChem more than doubled. To this day, the growth of contributors and data remains very strong, with hundreds of contributing organizations, 20% of which provide biological activity information to PubChem. These data providers represent a highly varied cross-section of academic, commercial, and governmental entities. Combined, they have contributed information on a significant fraction of all known organic small-molecule chemical entities, numbering in the tens of millions.
PubChem was created to archive the output of the recently concluded Molecular Libraries Program (MLP – http://mli.nih.gov) high-throughput screening (HTS) initiative. Most of the biological activity results in PubChem (>95%) are from MLP HTS centers; however, MLP represents only a small fraction (<1%) of the biological experiments. All told, there are over 225 million publicly available biological activity reports in PubChem, with approximately two million chemicals having some form of biological testing data. In addition, RNAi screening experiments are increasingly found in PubChem.
Providing chemical information to researchers in the biomedical science community is a key part of PubChem’s purpose. Over the years, PubChem introduced and incrementally developed several interfaces, each with its own distinct purpose and set of use cases. Primary among these is the Entrez search interface (https://www.ncbi.nlm.nih.gov/), where PubChem is organized as three distinct databases: Substance, Compound, and BioAssay. Substance provides substance descriptions (accession number: SID), Compound provides the unique small-molecule chemical content of Substance (accession number: CID), and BioAssay provides biological experiment results for substances (accession number: AID). [Go here to learn more about the difference between Substance and Compound.] Each of these databases has an advanced search interface and contains numerous indexes and filters, which can be combined to construct elaborate queries. Additional interfaces exist to search and analyze information in PubChem, including the ability to analyze bioactivity information, download chemical and assay data, search by chemical structure or protein sequence, navigate using integrated classifications, visualize chemical 3-D information, and more.
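As a rough illustration of how the three databases line up with their accession types, here is a minimal Python sketch that builds a search URL for NCBI's Entrez ESearch utility. The Entrez database names (pcsubstance, pccompound, pcassay) and the ESearch endpoint are the ones NCBI documents; the helper name esearch_url, the PUBCHEM_DATABASES table, and the example search term are assumptions made for this example only. The sketch only constructs the URL and does not send a request.

```python
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# PubChem database -> (Entrez database name, accession type).
# This lookup table is an illustrative helper, not part of any PubChem API.
PUBCHEM_DATABASES = {
    "Substance": ("pcsubstance", "SID"),
    "Compound": ("pccompound", "CID"),
    "BioAssay": ("pcassay", "AID"),
}


def esearch_url(database, term, retmax=20):
    """Build an ESearch URL for one of the three PubChem databases."""
    db_name, _accession = PUBCHEM_DATABASES[database]
    params = {"db": db_name, "term": term, "retmax": retmax, "retmode": "json"}
    return EUTILS_ESEARCH + "?" + urlencode(params)


if __name__ == "__main__":
    # Print one search URL per database, labelled with its accession type.
    for name, (_db, accession) in PUBCHEM_DATABASES.items():
        print(f"{name} ({accession}): {esearch_url(name, 'aspirin')}")
```

The identifiers returned by such a search would be SIDs, CIDs, or AIDs, depending on which database was queried.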
PubChem continues to evolve the way it provides on-line content. External search engines (like Google, Bing, and others) are now a key way in which researchers locate data. In addition, programmatic interfaces now account for a significant portion of PubChem’s overall usage (more than 50%). Key programmatic interfaces to PubChem include Entrez Utilities and PUG/REST.
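To give a sense of what programmatic access through PUG/REST looks like, the following Python sketch assembles a request URL for compound properties. It follows the documented PUG REST URL pattern (input, operation, output); the function name compound_property_url and the choice of properties are assumptions for illustration. No network request is made here, so the sketch shows only how such a URL is put together.

```python
from urllib.parse import quote

# Base path of the PUG REST service.
PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def compound_property_url(cid, properties, output="JSON"):
    """Build a PUG REST URL that asks for properties of one compound (by CID)."""
    # Properties are passed as a comma-separated list in the URL path.
    prop_list = quote(",".join(properties), safe=",")
    return f"{PUG_REST_BASE}/compound/cid/{int(cid)}/property/{prop_list}/{output}"


if __name__ == "__main__":
    # CID 2244 is aspirin.
    print(compound_property_url(2244, ["MolecularFormula", "MolecularWeight"]))
```

The resulting URL can be opened in a browser or fetched with any HTTP client; heavy users should keep request rates within PubChem's published usage policy.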
The world of information is forever changing and improving. If the past ten years are any indication of what the future will bring to PubChem, the next ten are sure to be very exciting, with more data from a greater number of sources, additional types of data, increased annotation, improved interfaces, and advancements in ease of access. With your support as contributors and users, PubChem will continue to serve the needs of the community.