PubChemRDF 1.7β has been released
Posted on October 19, 2020
A significant update has been made to PubChemRDF, the machine-readable form of PubChem data expressed in the Resource Description Framework (RDF) (https://www.w3.org/RDF/). (If you have never heard of PubChemRDF before, please read this PubChem blog post first.)
RDF is a World Wide Web Consortium (W3C) standard model for data interchange on the web. In RDF, knowledge is expressed as statements, each of which consists of three discrete parts: a subject, an object, and a predicate that specifies the relationship between them. Because it has three parts, such a statement is called a triple. For example, the sentence “asbestos can cause mesothelioma” consists of “asbestos” (subject), “mesothelioma” (object), and “can cause” (predicate). Similarly, the sentence “ethanol is metabolized to acetaldehyde” can be broken down into a triple of “ethanol” (subject), “acetaldehyde” (object), and “is metabolized to” (predicate). In essence, RDF expresses knowledge as a directed, labeled graph, in which subjects and objects are nodes and predicates are the labeled edges that connect them.
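As a rough illustration of the triple model, the two example sentences can be written as (subject, predicate, object) tuples and collected into a directed, labeled graph. This is only a minimal sketch in plain Python: the strings stand in for the URIs that real RDF data uses, and `build_graph` is a hypothetical helper, not part of PubChemRDF or of any RDF library.

```python
from collections import defaultdict

# Each statement is a (subject, predicate, object) triple.
# Plain strings stand in for the URIs that real RDF data would use.
triples = [
    ("asbestos", "can cause", "mesothelioma"),
    ("ethanol", "is metabolized to", "acetaldehyde"),
]


def build_graph(statements):
    """Return a directed, labeled graph: subject -> list of (predicate, object) edges."""
    graph = defaultdict(list)
    for subject, predicate, obj in statements:
        graph[subject].append((predicate, obj))
    return dict(graph)


graph = build_graph(triples)

# Print every edge as "subject --[predicate]--> object".
for subject, edges in graph.items():
    for predicate, obj in edges:
        print(f"{subject} --[{predicate}]--> {obj}")
```

Each subject becomes a node, and each predicate becomes the label on an edge pointing to the object, which is the sense in which a set of triples forms a graph.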
PubChemRDF refers to the RDF-formatted PubChem data. It contains information on various entities in PubChem (chemicals, bioassays, genes, proteins, pathways, literature, etc.) and their relationships. With PubChemRDF, researchers can work with PubChem data using Semantic Web technologies (https://en.wikipedia.org/wiki/Semantic_Web). In addition, PubChemRDF facilitates PubChem data sharing, analysis, and integration with data from other resources.
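Semantic Web query languages such as SPARQL are built around matching triple patterns in which some positions are left open. The sketch below imitates that idea over an in-memory list of triples. It is a toy under stated assumptions: the identifiers (`compound:A`, `bioassay:X`, and so on), the predicate labels, and the `match` helper are all hypothetical and do not reflect actual PubChemRDF URIs or vocabulary.

```python
# Toy triples; the identifiers and predicates are hypothetical placeholders,
# not real PubChemRDF URIs or ontology terms.
triples = [
    ("compound:A", "is active in", "bioassay:X"),
    ("compound:B", "is active in", "bioassay:X"),
    ("bioassay:X", "targets", "protein:P"),
]


def match(statements, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for s, p, o in statements
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


# Which compounds are active in bioassay X?
active = [s for s, _, _ in match(triples, predicate="is active in", obj="bioassay:X")]
print(active)
```

Leaving the subject open while fixing the predicate and object is the in-memory analogue of a single-pattern query; chaining such matches is how relationships across entities (for example, compound to bioassay to protein) can be followed.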
- Updated vocabularies
To define the semantic relationships (that is, predicates) between entities (subjects and objects), PubChemRDF uses pre-existing, domain-specific ontological frameworks rather than creating new ones. These include Chemical Entities of Biological Interest (ChEBI), CHEMical INFormation ontology (CHEMINF), Protein Ontology (PRO), Gene Ontology (GO), and BioAssay Ontology (BAO), among others. Since PubChemRDF was first introduced, some terms in these ontologies have been deprecated or replaced with new ones. These changes are now reflected in PubChemRDF 1.7β.
- New subdomain
In PubChemRDF 1.7β, a new subdomain, called Pathway, has been added to encode information on biological pathways and their relationships with genes, proteins, and chemicals. This Pathway subdomain supersedes the BioSystem subdomain used in the previous versions of PubChemRDF.
- GI to accession
In the previous versions, numeric identifiers called GI numbers were used to denote proteins or genes. However, NCBI phased out the use of GI numbers in its databases, as explained in a series of blog posts. Accordingly, changes have been made to allow one to access PubChemRDF data using the ‘accession’ identifiers.
To learn more about this topic, please read the following:
- Help page: PubChemRDF 1.7β (https://pubchemdocs.ncbi.nlm.nih.gov/rdf)
- Publication: “PubChemRDF: towards the semantic annotation of PubChem compound and substance databases” (https://doi.org/10.1186/s13321-015-0084-4)