PubChemRDF
- V1.7.2 beta (see Release notes). For the latest statistics on counts of triples, please see RDF Statistics. Additional information on PubChemRDF is provided in the following paper:
[PubMed PMID: 26175801] [PubMed Central PMCID: PMC4500850] [Free Full Text]
Contents
- 1 Introduction
- 2 Ontology-based Data Integration
- 3 PubChemRDF URI Constructions
- 4 PubChemRDF Subdomains
- 4.1 PubChem Compound
- 4.2 PubChem Substance
- 4.3 PubChem Descriptors
- 4.4 PubChem InChIKey
- 4.5 PubChem Synonym
- 4.6 PubChem BioAssay
- 4.7 PubChem MeasureGroup
- 4.8 PubChem Endpoint
- 4.9 PubChem Protein
- 4.10 PubChem ConservedDomain
- 4.11 PubChem Gene
- 4.12 PubChem Pathway
- 4.13 PubChem Neighbor
- 4.14 PubChem Source
- 4.15 PubChem Reference
- 4.16 PubChem Concept
- 4.17 PubChem Taxonomy
- 5 RESTful Interface
- 6 RDF FTP Download Directory Layout
- 6.1 PubChem Compound
- 6.2 PubChem Substance
- 6.3 PubChem Descriptor
- 6.4 PubChem InChIKey
- 6.5 PubChem Synonym
- 6.6 PubChem BioAssay
- 6.7 PubChem MeasureGroup
- 6.8 PubChem Endpoint
- 6.9 PubChem Protein
- 6.10 PubChem ConservedDomain
- 6.11 PubChem Gene
- 6.12 PubChem Pathway
- 6.13 PubChem Source
- 6.14 PubChem Reference
- 6.15 PubChem Taxonomy
- 7 Loading PubChemRDF
- 8 PubChemRDF Use Cases
- Case 1: What protein targets does donepezil (CHEBI_53289) inhibit with an IC50 less than 10 microMolar?
- Case 2: What pharmacological roles of SID46505803 are defined by CHEBI?
- Case 3: What compounds have a pharmacological role of NSAID as defined by CHEBI and a molecular weight less than 200 g/mol?
- Case 4: What substances have a pharmacological role of NSAID as defined by CHEBI and the depositor-provided 3D X-ray structure information?
- Case 5: What protein targets are inhibited by substances with an IC50 less than 10 µM and have a pharmacological role of cholinesterase inhibitors as defined by CHEBI?
- Case 6: Which substances inhibit protein targets similar to ACCP05979 and have the function domain PSSMID188648?
- Case 7: What protein targets are inhibited by substances with IC50 less than 10 µM and have the same standardized chemical structure (CID3152)?
- Case 8: What substances inhibit the proteins involved in the same biological pathway: prostaglandin biosynthetic process (GO:0001516), with an IC50 less than 10 µM?
- Case 9: What are the pharmacological roles, as defined by CHEBI, of the substances that inhibit protein target ACCBF717249 with an IC50 less than 10 µM?
- Case 10: Summarize the statistics about the total number of substances tested in the PubChem database against each protein target.
- 9 Document Version History
1 Introduction
Semantic Web technologies are emerging as an increasingly important approach to distributing and integrating scientific data. These technologies include the trio of the Resource Description Framework (RDF), Web Ontology Language (OWL), and SPARQL query language. The PubChemRDF project provides RDF-formatted information for the PubChem Compound, Substance, and BioAssay databases.
1.1 What is RDF?
RDF constitutes a family of World Wide Web Consortium (W3C) specifications for data interchange on the Web. RDF breaks down knowledge into machine-readable discrete pieces, called “triples.” Each “triple” is organized as a trio of ‘subject-predicate-object’. For example, in the phrase “atorvastatin may treat hypercholesterolemia,” the subject is “atorvastatin”, the predicate is “may treat”, and the object is “hypercholesterolemia.” RDF uses a Uniform Resource Identifier (URI) to name each part of the “subject-predicate-object” triple. A URI looks just like a typical web URL. RDF is a core part of semantic web standards. As an extension of the existing World Wide Web, the semantic web attempts to make it easier for users to find, share, and combine information. Semantic web leverages the following technologies: extensible markup language (XML), which provides syntax for RDF; web ontology language (OWL), which extends the ability of RDF to encode information; resource description framework (RDF), which expresses knowledge; and RDF query language (SPARQL), which enables query and manipulation of RDF content.
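As a minimal sketch of the triple structure described above, the Python snippet below models the atorvastatin example as a subject-predicate-object tuple and prints it as one N-Triples line. The `http://example.org/` URIs and the `Triple` and `to_ntriples` names are placeholders invented for illustration; they are not PubChemRDF identifiers.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One RDF statement; each part is named by a URI."""
    subject: str
    predicate: str
    object: str


# Hypothetical namespace, used only for this illustration.
EX = "http://example.org/"

statement = Triple(
    subject=EX + "atorvastatin",
    predicate=EX + "may_treat",
    object=EX + "hypercholesterolemia",
)


def to_ntriples(t: Triple) -> str:
    """Serialize a URI-only triple as a single N-Triples line."""
    return f"<{t.subject}> <{t.predicate}> <{t.object}> ."


print(to_ntriples(statement))
```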
1.2 How can PubChemRDF help your research?
PubChem users have frequently expressed interest in having a downloadable database. Using PubChemRDF, one can download the desired RDF-formatted data files from the PubChem FTP site, import them into a triplestore, and query them through a SPARQL query interface. Together, these tools enable schema-less database access and querying. There are a number of open-source and commercial triplestores, such as Apache Jena TDB and OpenLink Virtuoso (a list can be found here: http://en.wikipedia.org/wiki/Triplestore). Besides triplestores, PubChemRDF data can also be loaded into RDF-aware graph databases such as Neo4j, where graph traversal algorithms can be used to query the RDF graphs. Finally, the ontological representation of the PubChem knowledge base allows logical inference, such as forward and backward chaining. The RDF data on the PubChem FTP site is arranged so that you only need to download the type of information in which you are interested, and can avoid downloading parts of PubChem data you will not use. For example, if you are interested only in computed chemical properties, you only need to download the PubChemRDF data in the compound descriptor directory. In addition to bulk download, PubChemRDF also provides programmatic data access through a RESTful interface.
This document provides detailed technical information (release notes) about the PubChemRDF project.
Additional information is available as follows:
- Slide presentation: PubChemRDF introduction
- Slide presentation: PubChemRDF detail
- Slide presentation: PubChemRDF tutorial
- Publication: PubChemRDF: towards the semantic annotation of PubChem compound and substance databases.
- PubChem Blog
- PubChemRDF FTP Site
1.3 PubChemRDF Graphs
2 Ontology-based Data Integration
As depicted in Figure 1, the PubChemRDF content includes a number of semantic relationships, such as those between compounds and substances, the chemical descriptors associated with compounds and substances, the relationships between compounds, the provenance and attribution metadata of substances, and the concise bioactivity data view of substances. Whenever possible, pre-existing ontological frameworks were used to semantically describe information available in the PubChem archive, rather than creating new ones. However, in some cases, no suitable types or relations were defined in standard ontologies, and a PubChem vocabulary was created to define these terms. The standardized ontologies used to define the domain-specific knowledge are listed in Table 1 and include: Chemical Entities of Biological Interest (ChEBI), CHEMical INFormation ontology (CHEMINF), Protein Ontology (PRO), Gene Ontology (GO), Semanticscience Integrated Ontology (SIO), Basic Formal Ontology (BFO), Ontology for Biomedical Investigations (OBI), Information Artifact Ontology (IAO), BioAssay Ontology (BAO), Units of Measurement (UO), Citation Typing Ontology (CiTO), FRBR-aligned Bibliographic Ontology (FaBiO), Dublin Core Metadata Initiative (DCMI) Terms, Simple Knowledge Organization System (SKOS), BioPAX, National Drug File-Reference Terminology (NDF-RT), and National Cancer Institute Thesaurus (NCIt). All of the biomedical ontologies, such as ChEBI, CHEMINF, PRO, GO, BFO, SIO, and BAO, are made available by the NIH Roadmap National Center for Biomedical Ontology (NCBO) through its BioPortal, and comply with an evolving set of shared principles established by the Open Biomedical Ontologies (OBO) Foundry. Adoption of these core ontologies helps to ensure that the mapping of chemical and biological information is compatible across multiple Semantic Web resources.
Table 1. The prefixes and corresponding namespaces of standardized ontologies used in PubChemRDF.
Vocabulary | Namespace | Prefix | Use in PubChemRDF |
---|---|---|---|
RDF Schema | http://www.w3.org/2000/01/rdf-schema# | rdfs | Vocabulary predicates used in PubChemRDF |
RDF | http://www.w3.org/1999/02/22-rdf-syntax-ns# | rdf | Vocabulary predicates used in PubChemRDF |
OWL | http://www.w3.org/2002/07/owl# | owl | OWL is mainly used to define the classes and predicates in the PubChemRDF vocabulary, which has the namespace http://rdf.ncbi.nlm.nih.gov/pubchem/vocabulary# |
XML Schema | http://www.w3.org/2001/XMLSchema# | xsd | XML Schema is used to define the data types of literals; typical data types used in PubChemRDF include integer, float, and date |
NDF-RT | http://evs.nci.nih.gov/ftp1/NDF-RT/NDF-RT.owl# | ndfrt | The National Drug File - Reference Terminology (NDF-RT) classification of drugs and drug ingredients is used to annotate PubChemRDF compounds |
NCIt | http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl# | ncit | The National Cancer Institute Thesaurus (NCIt) classification of chemicals and drugs is used to annotate PubChemRDF compounds |
SIO | http://semanticscience.org/resource/ | sio (a) | Semanticscience Integrated Ontology (SIO) predicates used in PubChemRDF |
CHEMINF | http://semanticscience.org/resource/ | cheminf (a) | Chemical Information Ontology (CHEMINF) predicates and classes used in PubChemRDF |
SKOS | http://www.w3.org/2004/02/skos/core# | skos | Simple Knowledge Organization System (SKOS) predicates used in PubChemRDF |
BFO, OBI, IAO, RO, SO, UO | http://purl.obolibrary.org/obo/ | obo | Basic Formal Ontology (BFO), Ontology for Biomedical Investigations (OBI), Information Artifact Ontology (IAO), and Relation Ontology (RO) predicates used in PubChemRDF; Sequence Ontology (SO), Gene Ontology (GO), Basic Formal Ontology (BFO), and Unit Ontology (UO) classes used in PubChemRDF |
ChEBI, PR, GO | http://purl.obolibrary.org/obo/ | obo | The Chemical Entities of Biological Interest (ChEBI) classification is used to annotate PubChemRDF compounds and substances; the Protein Ontology (PR) and Gene Ontology (GO) classifications are used to annotate PubChemRDF protein targets |
BAO | http://www.bioassayontology.org/bao# | bao | BioAssay Ontology (BAO) predicates and classes used in PubChemRDF |
BioPAX | http://www.biopax.org/release/biopax-level3.owl# | bp | Biological PAthway eXchange ontology (BioPAX) predicates used in PubChemRDF |
CiTO | http://purl.org/spar/cito/ | cito | Citation Typing Ontology (CiTO) predicates used in PubChemRDF |
FaBiO | http://purl.org/spar/fabio/ | fabio | FRBR-aligned Bibliographic Ontology (FaBiO) predicates and classes used in PubChemRDF |
PDBo | http://rdf.wwpdb.org/schema/pdbx-v40.owl# | pdbo | Protein Data Bank Ontology (PDBo) predicates used in PubChemRDF |
DCMI Terms | http://purl.org/dc/terms/ | dcterms | Dublin Core Metadata Initiative (DCMI) Terms predicates used in PubChemRDF |
(a) The sio and cheminf ontologies share a URI namespace but are distinct.
3 PubChemRDF URI Constructions
In this document, PubChemRDF statements are written in the Turtle syntax, with Uniform Resource Identifiers (URIs) abbreviated as prefixed names. Turtle prefix directives resolve each prefixed name to a full URI by prepending the namespace URI to the local part. The PubChem subdomain namespaces are listed in Table 2. Both “303 URIs” and “hash URIs” were employed in the PubChemRDF project according to W3C recommendations; however, “hash URIs” were used only for the PubChem vocabulary subdomain, and “303 URIs” were used for the rest of the PubChemRDF subdomains. The PubChem vocabulary serves as a terminology defining the types and relations of some PubChem-specific terms.
For instance, the URI for the type of PubChem-specific 3-D structural similarity defined in the PubChem vocabulary is as follows:
http://rdf.ncbi.nlm.nih.gov/pubchem/vocabulary#3D_structural_similarity
Table 2. The prefixes and corresponding namespaces of subdomains used in PubChemRDF. (For the latest statistics on counts of triples, see RDF Statistics)
Prefix | Namespace |
---|---|
compound | http://rdf.ncbi.nlm.nih.gov/pubchem/compound/ |
substance | http://rdf.ncbi.nlm.nih.gov/pubchem/substance/ |
descriptor | http://rdf.ncbi.nlm.nih.gov/pubchem/descriptor/ |
inchikey | http://rdf.ncbi.nlm.nih.gov/pubchem/inchikey/ |
synonym | http://rdf.ncbi.nlm.nih.gov/pubchem/synonym/ |
bioassay | http://rdf.ncbi.nlm.nih.gov/pubchem/bioassay/ |
measuregroup | http://rdf.ncbi.nlm.nih.gov/pubchem/measuregroup/ |
endpoint | http://rdf.ncbi.nlm.nih.gov/pubchem/endpoint/ |
protein | http://rdf.ncbi.nlm.nih.gov/pubchem/protein/ |
conserveddomain | http://rdf.ncbi.nlm.nih.gov/pubchem/conserveddomain/ |
pathway | http://rdf.ncbi.nlm.nih.gov/pubchem/pathway/ |
gene | http://rdf.ncbi.nlm.nih.gov/pubchem/gene/ |
reference | http://rdf.ncbi.nlm.nih.gov/pubchem/reference/ |
nbr (a) | http://rdf.ncbi.nlm.nih.gov/pubchem/neighbor/ |
source | http://rdf.ncbi.nlm.nih.gov/pubchem/source/ |
concept | http://rdf.ncbi.nlm.nih.gov/pubchem/concept/ |
taxonomy | http://rdf.ncbi.nlm.nih.gov/pubchem/taxonomy/ |
(a) The RDF triples for the neighbor subdomain are currently only available through the RESTful interface.
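The prefixes in Table 2 work like Turtle `@prefix` declarations: a prefixed name is expanded by prepending the namespace to the local part. The following Python sketch illustrates that expansion under the assumption that only the Table 2 prefixes are needed; the helper names `expand_prefixed_name` and `turtle_prefix_directives` are hypothetical and not part of any PubChem tool.

```python
PUBCHEM_BASE = "http://rdf.ncbi.nlm.nih.gov/pubchem/"

# Table 2 prefixes whose name equals the last path segment of the namespace.
_SUBDOMAINS = (
    "compound", "substance", "descriptor", "inchikey", "synonym",
    "bioassay", "measuregroup", "endpoint", "protein", "conserveddomain",
    "pathway", "gene", "reference", "source", "concept", "taxonomy",
)
PREFIXES = {name: f"{PUBCHEM_BASE}{name}/" for name in _SUBDOMAINS}
# The neighbor subdomain is the one prefix that differs from its path segment.
PREFIXES["nbr"] = PUBCHEM_BASE + "neighbor/"


def expand_prefixed_name(name: str) -> str:
    """Expand e.g. 'compound:CID60823' into a full URI."""
    prefix, sep, local = name.partition(":")
    if not sep or prefix not in PREFIXES:
        raise ValueError(f"unknown or missing prefix in {name!r}")
    return PREFIXES[prefix] + local


def turtle_prefix_directives() -> str:
    """Render the prefix table as Turtle @prefix lines."""
    return "\n".join(
        f"@prefix {p}: <{ns}> ." for p, ns in sorted(PREFIXES.items())
    )


print(expand_prefixed_name("compound:CID60823"))
```

Keeping the prefix table in one dictionary means the same data can drive both URI expansion and the header of a generated Turtle file.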
The URIs for PubChem compounds and substances were constructed based on primary accession identifiers (CID and SID). For instance, the URIs for CID60823 and SID103554720 can be represented as:
http://rdf.ncbi.nlm.nih.gov/pubchem/compound/CID60823
http://rdf.ncbi.nlm.nih.gov/pubchem/substance/SID103554720
which can be abbreviated as compound:CID60823 and substance:SID103554720, respectively. The InChIKey URIs were constructed based on the value of InChIKey. For instance, the URI for InChIKey with value of “XUKUURHRXDUEBC-KAYWLYCHSA-N” (case-insensitive) can be represented as:
http://rdf.ncbi.nlm.nih.gov/pubchem/inchikey/XUKUURHRXDUEBC-KAYWLYCHSA-N
Most chemical descriptor namespace URIs were constructed based on a combination of CID/SID and descriptor labels, except in the case of depositor-provided synonyms. For instance, the URI for the molecular weight of CID60823 can be represented as:
http://rdf.ncbi.nlm.nih.gov/pubchem/descriptor/CID60823_Molecular_Weight
or simply as descriptor:CID60823_Molecular_Weight. The URIs for depositor-provided synonyms were constructed based on MD5 hash values, after first converting chemical names to lower-case. For example, ‘Atorvastatin [INN:BAN]’ becomes ‘atorvastatin [inn:ban]’ to produce the MD5 hash ‘7be8fb160fff31a7beea9df539fd36bd’ and ‘(3R,5R)-7-[3-(anilinocarbonyl)-5-(4-fluorophenyl)-2-isopropyl-4-phenyl-1H-pyrrol-1-yl]-3,5-dihydroxyheptanoic acid’ becomes ‘(3r,5r)-7-[3-(anilinocarbonyl)-5-(4-fluorophenyl)-2-isopropyl-4-phenyl-1h-pyrrol-1-yl]-3,5-dihydroxyheptanoic acid’ to produce the MD5 hash ‘c576a26b0c67fa6b072b61a0b4c57a6c’. The use of an MD5 hash in place of the actual chemical name allows PubChem information associated with any given chemical name to be directly accessed using RDF. For instance, the depositor-provided synonym of ‘Atorvastatin’ can be represented as:
http://rdf.ncbi.nlm.nih.gov/pubchem/synonym/MD5_9a05646d461669f86de312d88ab5748a
or simply as synonym:MD5_9a05646d461669f86de312d88ab5748a.
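The lower-case-then-hash rule described above can be sketched in Python with the standard `hashlib` module. This is an illustration only: it assumes the lower-cased name is encoded as UTF-8 before hashing and that no other normalization (such as whitespace trimming) is applied, neither of which is stated in this document. The function name `synonym_uri` is hypothetical.

```python
import hashlib

SYNONYM_NS = "http://rdf.ncbi.nlm.nih.gov/pubchem/synonym/"


def synonym_uri(name: str) -> str:
    """Build a synonym URI from a chemical name.

    Assumption: the name is lower-cased, encoded as UTF-8, and hashed
    with MD5; the exact byte encoding is not specified in the text.
    """
    digest = hashlib.md5(name.lower().encode("utf-8")).hexdigest()
    return f"{SYNONYM_NS}MD5_{digest}"


# Names that differ only in case map to the same URI.
print(synonym_uri("Atorvastatin [INN:BAN]") == synonym_uri("atorvastatin [inn:ban]"))
```

Because the hash depends on every byte of the input, URIs produced this way should be checked against the published data before being relied on.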
Some of the PubChem synonyms that are equivalent to World Health Organization (WHO) International Nonproprietary Names (INNs) represent pharmaceutical substances, and some of them are assigned WHO Anatomical Therapeutic Chemical (ATC) codes. The ATC classification system can be used to search and group active ingredients in clinical drugs. Each ATC class was exposed as a skos:Concept in the PubChemRDF concept subdomain, and the ATC codes were used to construct the URIs of those concepts. For instance, protein kinase inhibitors have the ATC code “L01XE”, and the URI for this concept is:
http://rdf.ncbi.nlm.nih.gov/pubchem/concept/ATC_L01XE
PubChem BioAssay records were annotated in different ways depending on the assay type, in accordance with the BioAssay Ontology (BAO). Literature-extracted bioassays, such as those from ChEMBL, were represented as instances of BAO measure group (BAO_0000040), since these are summary results abstracted from the literature and lack specific information on how the biological experiment was performed. The URIs for assay records are constructed based on the PubChem BioAssay accession identifiers (AID). For instance, the URI for AID447528 can be assigned as:
http://rdf.ncbi.nlm.nih.gov/pubchem/measuregroup/AID447528
or abbreviated as measuregroup:AID447528.
Some assays in PubChem aggregate literature-abstracted bioactivity data from multiple publications. For instance, AID578, deposited by BindingDB, contains bioactivity data tested against the epidermal growth factor receptor from different publications. In literature-derived assays of this type, a single bioassay record is broken down into multiple measure groups, and the identifier of each individual measure group is based on the combination of AID and PubMed identifier (PMID), for example:
http://rdf.ncbi.nlm.nih.gov/pubchem/measuregroup/AID578_PMID8879541
or abbreviated as measuregroup:AID578_PMID8879541. However, in contrast to literature-extracted assays, biological screening experiments, such as those from the NIH Molecular Library Program (MLP), were represented as an instance of BAO bioassay (BAO_0000015). For instance, the URI for AID1788 can be assigned as:
http://rdf.ncbi.nlm.nih.gov/pubchem/bioassay/AID1788
or abbreviated as bioassay:AID1788.
Although screening assays and literature-extracted assays are different, they are related. Each screening assay refers to an operational unit and may have one or more instances of BAO measure group (BAO_0000040). If the screening assay is a panel assay (for instance, testing against a panel of multiple targets as occurs when performing a lead profiling screen), the URIs are constructed based on the combination of AID and panel component identifier (PID), for example:
http://rdf.ncbi.nlm.nih.gov/pubchem/measuregroup/AID1788_1
or abbreviated as measuregroup:AID1788_1. Therefore, the measure group serves as a basic concept interlinking chemical substances, molecular targets, and the bioactivity endpoints for a given PubChem BioAssay record.
The URIs of bioactivity endpoints were constructed based on the combination of SID and AID, plus the PID if the endpoints were produced by panel screening assays or the PMID if the endpoints were derived from an aggregated literature-extracted assay. The following URIs demonstrate the different reference approaches used for bioactivity endpoints:
http://rdf.ncbi.nlm.nih.gov/pubchem/endpoint/SID103164874_AID443491
http://rdf.ncbi.nlm.nih.gov/pubchem/endpoint/SID99445338_AID2202_1
http://rdf.ncbi.nlm.nih.gov/pubchem/endpoint/SID8034062_AID578_PMID8879541
or abbreviated as endpoint:SID103164874_AID443491, endpoint:SID99445338_AID2202_1, and endpoint:SID8034062_AID578_PMID8879541, respectively.
The first URI above refers to an endpoint derived from a ChEMBL assay, the second URI refers to an endpoint produced by a panel screening assay (assay panel PID is 1), and the last URI refers to an endpoint derived from a literature-extracted assay (PMID is 8879541).
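The three endpoint URI patterns can be reproduced with a small helper. The Python sketch below assumes that a panel component identifier (PID) and a PubMed identifier (PMID) are never combined in one endpoint URI, which is how the text reads but is not stated outright; the function name `endpoint_uri` is hypothetical.

```python
from typing import Optional

ENDPOINT_NS = "http://rdf.ncbi.nlm.nih.gov/pubchem/endpoint/"


def endpoint_uri(sid: int, aid: int,
                 pid: Optional[int] = None,
                 pmid: Optional[int] = None) -> str:
    """Build an endpoint URI from SID and AID, plus an optional PID or PMID."""
    # Assumption: PID (panel assays) and PMID (aggregated literature
    # assays) are mutually exclusive.
    if pid is not None and pmid is not None:
        raise ValueError("give either pid or pmid, not both")
    local = f"SID{sid}_AID{aid}"
    if pid is not None:
        local += f"_{pid}"
    elif pmid is not None:
        local += f"_PMID{pmid}"
    return ENDPOINT_NS + local


print(endpoint_uri(103164874, 443491))
print(endpoint_uri(99445338, 2202, pid=1))
print(endpoint_uri(8034062, 578, pmid=8879541))
```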
In the case of protein targets, the URIs are created based upon the National Center for Biotechnology Information (NCBI) Protein accessions (note that NCBI has phased out GI numbers):
http://rdf.ncbi.nlm.nih.gov/pubchem/protein/ACCAAI32976
or abbreviated as protein:ACCAAI32976. In the case of protein complexes as bioassay targets, the URIs are constructed by concatenating the accessions of the component proteins in ascending order:
http://rdf.ncbi.nlm.nih.gov/pubchem/protein/ACCNP_056953ACCNP_858045
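A minimal Python sketch of the protein URI construction follows. It assumes that "ascending order" means plain lexicographic string ordering of the accessions, which matches the example above but is not confirmed by the text; the function name `protein_uri` is hypothetical.

```python
PROTEIN_NS = "http://rdf.ncbi.nlm.nih.gov/pubchem/protein/"


def protein_uri(*accessions: str) -> str:
    """Build a URI for a single protein or a protein complex."""
    if not accessions:
        raise ValueError("at least one accession is required")
    # Assumption: component accessions are sorted lexicographically,
    # and each is prefixed with "ACC" before concatenation.
    return PROTEIN_NS + "".join(f"ACC{acc}" for acc in sorted(accessions))


print(protein_uri("AAI32976"))
print(protein_uri("NP_858045", "NP_056953"))
```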
The protein targets tested in the PubChem BioAssay database can be linked to other NCBI databases, including Conserved Domain, Gene, and PubMed. Protein conserved domains contain recurring sequence patterns, which define the functional and/or structural units of protein sequences. NCBI conserved domains were identified through multiple sequence alignment and are distinguished by position-specific scoring matrix (PSSM) models. Each PSSM model has a unique PSSM identifier (PSSMID). NCBI Gene integrates gene information for various species. Each gene has a unique Gene ID (GID). The NCBI PubMed database comprises more than 23 million citations from the biomedical literature. Each PubMed record is assigned a unique identifier (PMID). PubChemRDF provides RDF triples that expose the linkage information and basic descriptions of those resources.
The URIs for conserved domains use PSSMIDs:
http://rdf.ncbi.nlm.nih.gov/pubchem/conserveddomain/PSSMID132758
or abbreviated as conserveddomain:PSSMID132758. The URIs for genes use NCBI gene IDs:
http://rdf.ncbi.nlm.nih.gov/pubchem/gene/GID367
or abbreviated as gene:GID367. The URIs for pathways use PubChem Pathway internal identifiers (PWIDs):
http://rdf.ncbi.nlm.nih.gov/pubchem/pathway/PWID10790
or abbreviated as pathway:PWID10790. The URIs for publication references use PMIDs:
http://rdf.ncbi.nlm.nih.gov/pubchem/reference/PMID10395478
or abbreviated as reference:PMID10395478.
The URIs for PubChem depositors are based on the names of depositors:
http://rdf.ncbi.nlm.nih.gov/pubchem/source/ChEMBL
or abbreviated as source:ChEMBL. If a depositor name is purely numeric, the prefix “ID” was added; if the name contains any of the symbols “,”, “.”, “&”, “(”, “)”, or “/”, those symbols were deleted; and if the name contains spaces, they were replaced by “_”.
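The depositor-name rules can be sketched as follows. This Python snippet is one reading of the rules as stated here: it assumes that symbol deletion and space replacement are both applied to non-numeric names, and it has not been checked against the full list of published source URIs, which may differ in detail. The function name `source_uri` and the depositor name "Acme Labs, Inc." are hypothetical.

```python
SOURCE_NS = "http://rdf.ncbi.nlm.nih.gov/pubchem/source/"

# Translation table that deletes the symbols listed in the text.
_DELETE_SYMBOLS = str.maketrans("", "", ",.&()/")


def source_uri(depositor_name: str) -> str:
    """Derive a source URI from a depositor name."""
    if depositor_name.isdigit():
        # Purely numeric names receive an "ID" prefix.
        local = "ID" + depositor_name
    else:
        # Assumption: delete the listed symbols first, then replace spaces.
        local = depositor_name.translate(_DELETE_SYMBOLS).replace(" ", "_")
    return SOURCE_NS + local


print(source_uri("ChEMBL"))
print(source_uri("23465"))
print(source_uri("Acme Labs, Inc."))  # hypothetical depositor name
```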
The URIs for PubChem Compound 2-D and 3-D similarity neighbors and PubChem BioAssay protein target sequence similarity neighbors are available in these examples:
http://rdf.ncbi.nlm.nih.gov/pubchem/neighbor/CID60823_CID68019409_2DSimilarity
http://rdf.ncbi.nlm.nih.gov/pubchem/neighbor/CID60823_CID11330946_3DSimilarity
or abbreviated as nbr:CID60823_CID68019409_2DSimilarity, and nbr:CID60823_CID11330946_3DSimilarity, respectively.
4 PubChemRDF Subdomains
4.1 PubChem Compound
PubChem Compound RDF triples expose the linkage from compound to the chemical descriptor resources and interrelated compounds, such as compound identity groups (CIGs). [See Figure 1 for a diagram of links to other RDF subdomains.] For example, to resolve the URI in the RESTful interface for compound CID60823:
http://rdf.ncbi.nlm.nih.gov/pubchem/compound/CID60823
Link Type | Example RDF Triple |
---|---|
calculated chemical descriptor | compound:CID60823 sio:has-attribute descriptor:CID60823_Molecular_Weight . |
parent compound | compound:CID23665101 vocab:has_parent compound:CID60823 . |
component compound | compound:CID22765305 cheminf:CHEMINF_000478 compound:CID2244 . |
compound identity group (CIG) | compound:CID60823 cheminf:CHEMINF_000462 compound:CID53233926 . |
2-D similarity neighbor (a) | compound:CID60823 cheminf:CHEMINF_000482 compound:CID60822 . |
3-D similarity neighbor (a) | compound:CID60823 cheminf:CHEMINF_000483 compound:CID10745515 . |
(a) Given the large number of similarity neighbors for a given compound, the RDF statements for similarity neighbors are moved to a separate URI: http://rdf.ncbi.nlm.nih.gov/pubchem/compound/CID60823/nbr
If the compound has links to the corresponding Wikidata records, the RDF triples representing the cross-reference relations are provided:
If the compound can be mapped to the drug ontologies (i.e., ChEBI, NDF-RT and NCIt), the drug classes defined in the classification terminologies are used to annotate the compound:
compound:CID60823 rdf:type ndfrt:N0000022046 .
compound:CID60823 rdf:type ncit:C61527 .
4.2 PubChem Substance
PubChem Substance RDF triples expose the linkage between: substance and chemical descriptor resources, substance and standardized compound resources, substance and measure group resources, and substance and data source resources. For example, to resolve the URI for SID8032774:
http://rdf.ncbi.nlm.nih.gov/pubchem/substance/SID8032774
Link Type | Example RDF Triple |
---|---|
depositor-provided descriptor | substance:SID8032774 sio:has-attribute descriptor:SID8032774_Depositor_Identifier . |
standardized compound | substance:SID8032774 cheminf:CHEMINF_000477 compound:CID5327844 . |
data source | substance:SID8032774 dcterms:source source:BindingDB . |
measure group | substance:SID8032774 obo:RO_0000056 measuregroup:AID578_PMID9357527 . |
If the substance was deposited by ChEMBL database, a cross-link to the ChEMBL RDF resource is provided:
If the substance was deposited by NCBI MMDB, a cross-link to the RDF-based PDB resource is provided:
4.3 PubChem Descriptors
PubChem descriptor RDF triples expose the type, value and unit for a given descriptor.
For example, to resolve the URIs in the RESTful interface for the molecular weight of PubChem Compound record CID60823 and for a depositor-provided descriptor of PubChem Substance record SID8032774:
http://rdf.ncbi.nlm.nih.gov/pubchem/descriptor/CID60823_Molecular_Weight
http://rdf.ncbi.nlm.nih.gov/pubchem/descriptor/SID8032774_Substance_Version
Link Type | Example RDF Triple |
---|---|
type | descriptor:CID60823_Molecular_Weight rdf:type cheminf:CHEMINF_000334 . |
value | descriptor:CID60823_Molecular_Weight sio:has-value "558.639803"^^xsd:double . |
unit | descriptor:CID60823_Molecular_Weight sio:has-unit obo:UO_0000055 . |
4.4 PubChem InChIKey
PubChem InChIKey RDF triples expose the type, value and the link to the corresponding compound(s) for a given InChIKey. For example, to resolve the URI for the InChIKey with a value of “BSYNRYMUTXBXSQ-UHFFFAOYSA-N” (case-insensitive):
http://rdf.ncbi.nlm.nih.gov/pubchem/inchikey/BSYNRYMUTXBXSQ-UHFFFAOYSA-N
Link Type | Example RDF Triple |
---|---|
type | inchikey:BSYNRYMUTXBXSQ-UHFFFAOYSA-N rdf:type cheminf:CHEMINF_000399 . |
value | inchikey:BSYNRYMUTXBXSQ-UHFFFAOYSA-N sio:has-value "BSYNRYMUTXBXSQ-UHFFFAOYSA-N"@en . |
compound | inchikey:BSYNRYMUTXBXSQ-UHFFFAOYSA-N sio:is-attribute-of compound:CID2244 . |
If the InChIKey represents a chemical structure in the FDA UNII database, and the UNII code is incorporated as a registry number in a MeSH concept, the annotation of the InChIKey using the MeSH concept is provided:
4.5 PubChem Synonym
PubChem synonym RDF triples expose the type and value of a given MD5 hash string, and the link to the corresponding compound(s). For example, to resolve the URI for the synonym “aspirin [ban:jan]”:
http://rdf.ncbi.nlm.nih.gov/pubchem/synonym/MD5_f90df4a14db08de040a09d3546c1bb58
Link Type | Example RDF Triple |
---|---|
compound | synonym:MD5_f90df4a14db08de040a09d3546c1bb58 sio:is-attribute-of compound:CID2244 . |
type | synonym:MD5_f90df4a14db08de040a09d3546c1bb58 rdf:type cheminf:CHEMINF_000339 . |
value | synonym:MD5_f90df4a14db08de040a09d3546c1bb58 sio:has-value "aspirin [ban:jan]"@en . |
If the synonym represents a MeSH term, or a registry number for a MeSH concept, the annotation of the synonym using the MeSH concept is provided:
If the synonym represents a WHO INN that has been assigned an ATC code, the annotation of the synonym using the ATC classification system is provided:
4.6 PubChem BioAssay
PubChem BioAssay RDF triples expose the type, title, data source and the linkage to measure groups for a given assay. For example, to resolve the URI for the PubChem Assay record AID1788:
http://rdf.ncbi.nlm.nih.gov/pubchem/bioassay/AID1788
Link Type | Example RDF Triple |
---|---|
type | bioassay:AID1788 rdf:type bao:BAO_0000015 . |
title | bioassay:AID1788 dcterms:title "Discovery of novel allosteric modulators of the M1 muscarinic receptor: Agonist Ancillary Activity"@en . |
data source | bioassay:AID1788 dcterms:source source:Vanderbilt_Screening_Center_for_GPCRs__Ion_Channels_and_Transporters . |
measure group | bioassay:AID1788 bao:BAO_0000209 measuregroup:AID1788_1 . |
4.7 PubChem MeasureGroup
For high throughput screening assays, including panel assays, PubChem measure group RDF triples expose the title, type, as well as the linkage to the participating proteins and endpoints. For example, to resolve the URI for assay panel 1 from the PubChem Assay record AID1788:
http://rdf.ncbi.nlm.nih.gov/pubchem/measuregroup/AID1788_1
Link Type | Example RDF Triple |
---|---|
type | measuregroup:AID1788_1 rdf:type bao:BAO_0000040 . |
title | measuregroup:AID1788_1 dcterms:title "Adenosine A1 (human)"@en . |
protein | measuregroup:AID1788_1 obo:RO_0000057 protein:ACCNP_000665 . |
endpoint | measuregroup:AID1788_1 obo:OBI_0000299 endpoint:SID56353039_AID1788_1 . |
For literature-extracted assays, PubChem measure group RDF triples expose the title, type, data source, as well as the linkage to the participating proteins and endpoints. For example, to resolve the URI for literature-extracted assay AID447528:
http://rdf.ncbi.nlm.nih.gov/pubchem/measuregroup/AID447528
Link Type | Example RDF Triple |
---|---|
type | measuregroup:AID447528 rdf:type bao:BAO_0000040 . |
title | measuregroup:AID447528 dcterms:title "Inhibition of ovine COX1 by enzyme immunoassay"@en . |
data source | measuregroup:AID447528 dcterms:source source:ChEMBL . |
protein | measuregroup:AID447528 obo:RO_0000057 protein:ACCP05979 . |
endpoint | measuregroup:AID447528 obo:OBI_0000299 endpoint:SID103164874_AID447528 . |
4.8 PubChem Endpoint
PubChem endpoint RDF triples expose the type, value, unit, reference, and the linkage to substance. For example, to resolve the URI for the bioassay endpoint between PubChem records SID103164874 and AID443491:
http://rdf.ncbi.nlm.nih.gov/pubchem/endpoint/SID103164874_AID443491
Link Type | Example RDF Triple |
---|---|
type | endpoint:SID103164874_AID443491 rdf:type bao:BAO_0000190 . |
value | endpoint:SID103164874_AID443491 sio:has-value "0.162"^^xsd:float . |
unit | endpoint:SID103164874_AID443491 sio:has-unit uo:micromolar . |
substance | endpoint:SID103164874_AID443491 obo:IAO_0000136 substance:SID103164874 . |
reference | endpoint:SID103164874_AID443491 cito:citesAsDataSource reference:PMID19880317 . |
4.9 PubChem Protein
PubChem protein RDF triples expose the type (Protein Ontology), title, similarity neighbors, conserved domains, encoding genes, organisms, references, and the cross-links to UniProt RDF. For example, to resolve the URI for the NCBI Protein record ACCP00533:
http://rdf.ncbi.nlm.nih.gov/pubchem/protein/ACCP00533
Link Type | Example RDF Triple |
---|---|
type | protein:ACCP00533 rdf:type bp:Protein . protein:ACCP00533 rdf:type obo:PR_000006933 . |
title | protein:ACCP00533 dcterms:title "Epidermal growth factor receptor"@en . |
similarity neighbor | protein:ACCP00533 vocab:hasSimilarProtein protein:ACCQ01279 . |
conserved domain | protein:ACCP00533 obo:RO_0002180 conserveddomain:PSSMID213054 . |
encoding gene | protein:ACCP00533 up:encodedBy gene:GID1956 . |
cross-link to UniProt | protein:ACCP00533 skos:closeMatch uniprot:P00533 . |
organism | protein:ACCP00533 bp:organism taxonomy:9606 . |
If a protein entity has a crystallized 3D structure in the PDB database, a cross-link to the RDF-based PDB resource is provided, for example:
If the protein complexes have been tested in the bioassays, the measure groups are linked to the protein complexes, which are typed and linked to their component protein units, for example:
protein:ACCNP_056953ACCNP_858045 rdf:type obo:GO_0043234 .
protein:ACCNP_056953ACCNP_858045 obo:BFO_0000178 protein:ACCNP_056953 .
protein:ACCNP_056953ACCNP_858045 obo:BFO_0000178 protein:ACCNP_858045 .
4.10 PubChem ConservedDomain
PubChem conserved domain RDF triples expose the type, title, and references of a given conserved protein domain. For example, to resolve the URI for the NCBI Conserved Domain record PSSMID132758:
http://rdf.ncbi.nlm.nih.gov/pubchem/conserveddomain/PSSMID132758
Link Type | Example RDF Triple |
---|---|
type | conserveddomain:PSSMID132758 rdf:type obo:SO_0000417 . |
title | conserveddomain:PSSMID132758 dcterms:title "NR_LBD_AR"@en . |
reference (a) | conserveddomain:PSSMID132758 cito:isDiscussedBy reference:PMID17940184 . |
(a) The links to literature references are obtained from the NCBI Entrez system.
4.11 PubChem Gene
PubChem gene RDF triples expose the type, title, symbol, description, organism, and references. For example, to resolve the URI for NCBI Gene record GID367:
http://rdf.ncbi.nlm.nih.gov/pubchem/gene/GID367
Link Type | Example RDF Triple |
---|---|
type | gene:GID367 rdf:type bp:Gene . |
title a | gene:GID367 dcterms:title "androgen receptor"@en . |
organism | gene:GID367 bp:organism taxonomy:9606 . |
symbol | gene:GID367 vocab:geneSymbol "AR"@en . |
reference a | gene:GID367 cito:isDiscussedBy reference:PMID19815331 . |
a The links to literature references are obtained from NCBI Entrez system.
4.12 PubChem Pathway
PubChem is superseding the NCBI BioSystems database with PubChem Pathways. PubChem pathway RDF triples expose the type, title, data source, organism, references, and participants (e.g., proteins, genes, and chemicals). For example, to resolve the URI for the PubChem Pathway Reactome:R-HSA-1474228:
http://rdf.ncbi.nlm.nih.gov/pubchem/pathway/PWID10790
Link Type | Example RDF Triple |
---|---|
type | pathway:PWID10790 rdf:type bp:Pathway . |
title | pathway:PWID10790 dcterms:title "Degradation of the extracellular matrix"@en . |
organism | pathway:PWID10790 bp:organism taxonomy:9606 . |
reference a | pathway:PWID10790 cito:isDiscussedBy reference:PMID21917992 . |
source | pathway:PWID10790 dcterms:source source:ID23465 . |
has-participant | pathway:PWID10790 obo:RO_0000057 protein:ACCP08254 . |
a The links to literature references are obtained from NCBI Entrez system.
If the pathway is from the Reactome pathway database, a cross-link to the Reactome RDF is provided.
4.13 PubChem Neighbor
PubChem neighbor RDF triples describe similarity relationships and their supporting information. Currently, these exist between chemical records and between protein sequences, and the information is available only through the RESTful interface. The chemical 2-D similarity neighbor RDF triples expose the similarity relation type, the compounds involved in the neighboring relation, the value and type of the similarity score, and the linkage between the neighboring relation and the evaluating score. For example, to resolve the URIs for the neighboring relationship between the PubChem Compound records CID60823 and CID68019409:
http://rdf.ncbi.nlm.nih.gov/pubchem/neighbor/CID60823_CID68019409_2DSimilarity
http://rdf.ncbi.nlm.nih.gov/pubchem/neighbor/CID60823_CID68019409_2DTanimotoScore
Link Type | Example RDF Triple |
---|---|
relation type | nbr:CID60823_CID10030610_2DSimilarity rdf:type vocab:PC2D_structural_similarity . |
compound | nbr:CID60823_CID10030610_2DSimilarity sio:refers-to compound:CID60823 , compound:CID10030610 . |
supporting score | nbr:CID60823_CID10030610_2DSimilarity sio:has-measurement-value nbr:CID60823_CID10030610_2DTanimotoScore . |
score type | nbr:CID60823_CID10030610_2DTanimotoScore rdf:type vocab:PC2D_Fingerprint_TanimotoScore . |
score value | nbr:CID60823_CID10030610_2DTanimotoScore sio:has-value "0.98"^^xsd:double . |
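The two neighbor URIs above share a common stem built from the pair of CIDs. The following sketch reproduces that naming pattern as it appears in the example URIs; the function name is hypothetical, and the pattern is inferred from this page rather than taken from a formal specification.

```python
from typing import Tuple

NEIGHBOR_BASE = "http://rdf.ncbi.nlm.nih.gov/pubchem/neighbor/"


def neighbor_uris_2d(cid_a: int, cid_b: int) -> Tuple[str, str]:
    """Return (relation URI, score URI) for a 2-D similarity neighbor pair.

    The suffixes '_2DSimilarity' and '_2DTanimotoScore' are copied from the
    example URIs; the CID order is kept exactly as given by the caller.
    """
    stem = f"{NEIGHBOR_BASE}CID{cid_a}_CID{cid_b}"
    return stem + "_2DSimilarity", stem + "_2DTanimotoScore"


relation_uri, score_uri = neighbor_uris_2d(60823, 68019409)
print(relation_uri)
print(score_uri)
```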
The chemical 3-D similarity neighbor RDF triples expose the similarity relation type, compounds involved in the neighboring relation, the value and type of the shape and feature similarity scores (ST and CT, respectively), as well as the linkage between the neighboring relation and the evaluating score. For example, to resolve the URI for the neighboring relationship between the PubChem Compound records CID60823 and CID11330946:
http://rdf.ncbi.nlm.nih.gov/pubchem/neighbor/CID60823_CID11330946_3DSimilarity
http://rdf.ncbi.nlm.nih.gov/pubchem/neighbor/CID60823_CID11330946_3DFeatureTanimotoScore
Link Type | Example RDF Triple |
---|---|
relation type | nbr:CID60823_CID11330946_3DSimilarity rdf:type vocab:PC3D_structural_similarity . |
compound | nbr:CID60823_CID11330946_3DSimilarity sio:refers-to compound:CID60823 , compound:CID11330946 . |
supporting score | nbr:CID60823_CID11330946_3DSimilarity sio:has-measurement-value nbr:CID60823_CID11330946_3DFeatureTanimotoScore , nbr:CID60823_CID11330946_3DShapeTanimotoScore . |
score type | nbr:CID60823_CID11330946_3DFeatureTanimotoScore rdf:type vocab:PC3D_Feature_TanimotoScore . nbr:CID60823_CID11330946_3DShapeTanimotoScore rdf:type vocab:PC3D_Shape_TanimotoScore . |
score value | nbr:CID60823_CID11330946_3DFeatureTanimotoScore sio:has-value "0.59"^^xsd:double . nbr:CID60823_CID11330946_3DShapeTanimotoScore sio:has-value "0.88"^^xsd:double . |
4.14 PubChem Source
PubChem data source RDF triples expose the type, title, contributor, homepage, and substance categorization classification for a given data source. For example, to resolve the URI for the PubChem data source “ChEMBL”:
http://rdf.ncbi.nlm.nih.gov/pubchem/source/ChEMBL
Link Type | Example RDF Triple |
---|---|
type | source:ChEMBL rdf:type dcterms:Dataset . |
title | source:ChEMBL dcterms:title "ChEMBL"@en . |
subject | source:ChEMBL dcterms:subject concept:Research_and_Development . source:ChEMBL dcterms:subject concept:Curation_Efforts . |
4.15 PubChem Reference
PubChem reference RDF triples expose the type, publication date, citation, title, and MeSH headings/subheadings (in the MeshHeadingList of PubMed XML files), as well as the literature abstract mentions of chemicals (in the ChemicalList of PubMed XML files) and diseases (in the SupplMeshList of PubMed XML files) provided by Medline indexers. All of the headings/subheadings are represented by MeSH Descriptor (identifier starts with ‘D’)/Qualifier (identifier starts with ‘Q’) pairs, all of the chemicals are represented by MeSH concepts (identifier starts with ‘M’), and all of the diseases are represented by MeSH supplementary concept records (SCRs, identifier starts with ‘C’). For example, to resolve the URI for the NCBI PubMed record PMID 10395478:
http://rdf.ncbi.nlm.nih.gov/pubchem/reference/PMID10395478
Link Type | Example RDF Triple |
---|---|
type | reference:PMID10395478 rdf:type fabio:JournalArticle . |
citation | reference:PMID10395478 dcterms:bibliographicCitation "B D Palmer, A J Kraker, B G Hartl, A D Panopoulos, R L Panek, B L Batley, G H Lu, S Trumpp-Kallmeyer, H D Showalter, W A Denny; Journal of medicinal chemistry; 1999 Jul; 42(13):2373-82" . |
title | reference:PMID10395478 dcterms:title "Structure-activity relationships for 5-substituted 1-phenylbenzimidazoles as selective inhibitors of the platelet-derived growth factor receptor"@en . |
publication date | reference:PMID10395478 dcterms:date "1999-07-01"^^xsd:date . |
heading/subheading | reference:PMID10395478 fabio:hasSubjectTerm mesh:D000255Q000378 . |
chemical list | reference:PMID10395478 cito:discusses mesh:M0000395 . |
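The identifier conventions described above (‘D’/‘Q’ pairs for headings/subheadings, ‘M’ for concepts, ‘C’ for supplementary concept records) can be illustrated with a small classifier. This is only a sketch: the function name and the returned labels are hypothetical, and the digit counts of MeSH identifiers are deliberately left unconstrained.

```python
import re

# A heading/subheading pair is a Descriptor ID immediately followed by a Qualifier ID.
_PAIR = re.compile(r"^(D\d+)(Q\d+)$")
_KINDS = {
    "D": "descriptor",
    "Q": "qualifier",
    "M": "concept",
    "C": "supplementary concept record",
}


def classify_mesh_id(local_id: str):
    """Classify the local part of a mesh: identifier by its leading letter."""
    pair = _PAIR.match(local_id)
    if pair:
        return ("heading/subheading", pair.group(1), pair.group(2))
    kind = _KINDS.get(local_id[:1])
    if kind is None:
        raise ValueError(f"unrecognized MeSH identifier: {local_id!r}")
    return (kind, local_id)


print(classify_mesh_id("D000255Q000378"))
print(classify_mesh_id("M0000395"))
```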
The literature abstract mentions of diseases are optional.
4.16 PubChem Concept
The PubChem “concept” subdomain exposes RDF triples related to the biomedical concepts used to annotate the PubChemRDF resource, for instance, the WHO ATC codes used to annotate synonym instances. For example, to resolve the URI for the WHO ATC code L01XE:
http://rdf.ncbi.nlm.nih.gov/pubchem/concept/ATC_L01XE
Link Type | Example RDF Triple |
---|---|
type | concept:ATC_L01XE rdf:type skos:Concept . |
concept scheme | concept:ATC_L01XE skos:inScheme concept:ATC . |
source | concept:ATC_L01XE pav:importedFrom source:WHO . |
parent concept | concept:ATC_L01XE skos:broader concept:ATC_L01X . |
label | concept:ATC_L01XE skos:prefLabel "protein kinase inhibitors"@en . |
4.17 PubChem Taxonomy
The PubChem "Taxonomy" subdomain exposes the RDF triples of title, type, and closeMatch. For example, to resolve the URI for the PubChem Taxonomy Homo sapiens (human):
http://rdf.ncbi.nlm.nih.gov/pubchem/taxonomy/TAXID9606
Link Type | Example RDF Triple |
---|---|
title | taxonomy:TAXID9606 dcterms:title "Homo sapiens (human)"@en . |
type | taxonomy:TAXID9606 rdf:type biopax:organism . |
closeMatch | taxonomy:TAXID9606 skos:closeMatch mesh:D006801 . |
5 RESTful Interface
5.1 URI Dereferencing
URIs can be resolved through the RESTful interface by either URI Suffix Extension or HTTP Accept Header. By default, the interface checks the URI Suffix Extension first, then the HTTP Accept Header. If neither is provided, it returns RDF triples in HTML format. The supported formats are listed in Table 3.
Table 3. The MIME types supported in the PubChemRDF REST interface for dereferencing URIs.
MIME Type | HTTP Accept Header | URI Suffix Extension |
---|---|---|
Abbreviated RDF/XML | application/rdf+xml+abbrev | rdfxml-abbrev |
RDF/XML | application/rdf+xml text/rdf | rdfxml rdf xml |
HTML | application/xhtml+xml text/html | html htm |
TURTLE a | application/n3 application/rdf+n3 application/turtle application/x-turtle text/n3 text/turtle text/rdf+n3 text/rdf+turtle | turtle ttl n3 |
JSON b | application/json text/json | json |
JSON-LD c | application/x-json+ld application/x-json+rdf application/json+ld application/json+rdf application/ld+json application/rdf+json | jsonld json-ld ldjson ld-json |
N-TRIPLES | text/plain | ntriples |
a Turtle is an abbreviation for Terse RDF Triples Language; b JSON is short for JavaScript Object Notation; c JSON-LD is short for JavaScript Object Notation for Linked Data.
For instance, the following URLs can present the RDF triples with respect to CID2244 (Aspirin) in various RDF data formats:
http://rdf.ncbi.nlm.nih.gov/pubchem/compound/CID2244.rdf
http://rdf.ncbi.nlm.nih.gov/pubchem/compound/CID2244.html
http://rdf.ncbi.nlm.nih.gov/pubchem/compound/CID2244.turtle
http://rdf.ncbi.nlm.nih.gov/pubchem/compound/CID2244.json
http://rdf.ncbi.nlm.nih.gov/pubchem/compound/CID2244.jsonld
http://rdf.ncbi.nlm.nih.gov/pubchem/compound/CID2244.ntriples
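The URLs above all follow the same shape: base, subdomain, identifier, and an optional format suffix. A minimal sketch of that construction is shown below; the helper name is hypothetical, and the suffix set is copied from Table 3 in lower case, which is an assumption about how the server matches suffixes.

```python
from typing import Optional

BASE = "http://rdf.ncbi.nlm.nih.gov/pubchem/"

# URI suffix extensions from Table 3 (lower-cased here as an assumption).
SUFFIXES = {
    "rdfxml-abbrev", "rdfxml", "rdf", "xml", "html", "htm",
    "turtle", "ttl", "n3", "json",
    "jsonld", "json-ld", "ldjson", "ld-json", "ntriples",
}


def dereference_url(subdomain: str, identifier: str, suffix: Optional[str] = None) -> str:
    """Build a dereferenceable URL, optionally with a format suffix."""
    url = f"{BASE}{subdomain}/{identifier}"
    if suffix is None:
        return url
    if suffix not in SUFFIXES:
        raise ValueError(f"suffix not listed in Table 3: {suffix!r}")
    return f"{url}.{suffix}"


print(dereference_url("compound", "CID2244", "turtle"))
```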
Different presentations can also be produced by specifying the HTTP Accept header. For instance, if the Linux cURL command is used to retrieve RDF triples for CID2244, the following command writes the RDF triples to a file:
- curl -L -H "Accept: text/rdf" -o CID2244.rdf http://rdf.ncbi.nlm.nih.gov/pubchem/compound/CID2244
If no HTTP Accept header is specified, the default output format is HTML (text/html).
If web browsers are used to retrieve RDF triples, HTML format is typically the default. For instance, what Google Chrome sends in the accept header would be something like:
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
which means that it will take HTML or XML or others, but it prefers HTML (q=1.0) to XML (q=0.9) and others (q=0.8).
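To make the preference order concrete, the sketch below parses an Accept header into media types ranked by their q-values. It is a simplified illustration (it ignores parameters other than `q` and does not validate the header), and the function name is hypothetical; it is not a description of how the PubChemRDF server itself negotiates content.

```python
from typing import List, Tuple


def parse_accept(header: str) -> List[Tuple[str, float]]:
    """Return (media type, q) pairs sorted by descending preference."""
    ranked = []
    for item in header.split(","):
        parts = [part.strip() for part in item.split(";")]
        media_type, q = parts[0], 1.0  # q defaults to 1.0 when omitted
        for param in parts[1:]:
            name, _, value = param.partition("=")
            if name.strip() == "q":
                q = float(value)
        ranked.append((media_type, q))
    # sorted() is stable, so entries with equal q keep their header order.
    return sorted(ranked, key=lambda pair: pair[1], reverse=True)


chrome = "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"
for media_type, q in parse_accept(chrome):
    print(media_type, q)
```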
5.2 Query RESTful Interface
The resolution of URIs under the http://rdf.ncbi.nlm.nih.gov/pubchem/ domain returns a 303 redirect HTTP status code, and the request is redirected to the https://pubchem.ncbi.nlm.nih.gov/rest/rdf/ domain. The RESTful interface under the latter domain can also be used to query RDF triples. The query calls share the same base URL:
https://pubchem.ncbi.nlm.nih.gov/rest/rdf/query?
The input queries can be provided through key-value pairs, and the keys can be as follows (all lowercase): “graph” (or “domain”), “name” (or “string”), “return” (or “retrieve”), “contain” (or “substring”), “subject” (or “subj”), “predicate” (or “pred”), “object” (or “obj”), “offset”, and “format”.
5.2.1 Queries Based on String Values
The following string values can be used to query PubChemRDF resources: substance synonyms, inchikey values, protein names, gene symbols, data source names, conserved domain titles, pathway titles, bioassay titles, measuregroup titles, reference titles, and concept labels. Two basic parameters must be provided including “graph” (or “domain”) and “name” (or “string”). For instance, the following query can retrieve the PubChemRDF synonym resource having the value of “aspirin”:
https://pubchem.ncbi.nlm.nih.gov/rest/rdf/query?graph=synonym&name=aspirin
Substring search is supported as well with the parameter “contain” (or “substring”), which can be either true or false:
https://pubchem.ncbi.nlm.nih.gov/rest/rdf/query?graph=synonym&name=aspirin&contain=true
The above queries return synonym resources. If the related compounds or substances are intended, another parameter, “return” (or “retrieve”), should be provided, which can be either “compound” (or “cid”) or “substance” (or “sid”):
https://pubchem.ncbi.nlm.nih.gov/rest/rdf/query?graph=synonym&name=aspirin&return=compound
The query functions support content negotiation with parameter “format” specified in Table 4. For instance, the following query will return JSON format:
https://pubchem.ncbi.nlm.nih.gov/rest/rdf/query?graph=synonym&name=aspirin&format=json
Table 4. The MIME types allowed and used in the PubChemRDF REST interface for query functions.
MIME Type | HTTP Accept Header | URI Suffix Extension |
---|---|---|
RDF/XML | application/rdf+xml text/rdf | rdfxml rdf xml |
HTML | application/xhtml+xml text/html | html htm |
JSON a | application/json text/json | json |
CSV b | text/csv | csv |
a JSON is short for JavaScript Object Notation; b CSV is short for comma-separated values.
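The string-based query URLs in this subsection can be assembled from key-value pairs. The following sketch uses only the parameter names given in the text (“graph”, “name”, “contain”, “return”, “format”); the function name and its keyword arguments are hypothetical.

```python
from typing import Optional
from urllib.parse import urlencode

QUERY_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/rdf/query?"


def string_query_url(graph: str, name: str, contain: Optional[bool] = None,
                     ret: Optional[str] = None, fmt: Optional[str] = None) -> str:
    """Build a string-value query URL; optional keys are added only when set."""
    params = {"graph": graph, "name": name}
    if contain is not None:
        params["contain"] = "true" if contain else "false"
    if ret is not None:
        params["return"] = ret  # e.g. "compound" or "substance"
    if fmt is not None:
        params["format"] = fmt  # e.g. "json" or "csv" (Table 4)
    return QUERY_BASE + urlencode(params)


print(string_query_url("synonym", "aspirin", fmt="json"))
```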
5.2.2 Queries Based on Triple Patterns
The PubChemRDF REST interface provides simple SPARQL-like query capabilities for grouping and filtering relevant resources. Given the high computational cost of complicated SPARQL queries, only one triple pattern is allowed per query. Two basic parameters must be provided: “graph” (or “domain”) and “predicate” (or “pred”). For instance, the following query can retrieve the ChEBI class assignments for the PubChem substances:
https://pubchem.ncbi.nlm.nih.gov/rest/rdf/query?graph=substance&predicate=rdf:type
The number of records returned by each query request can be configured using the parameter “limit”, whose default and maximum value is 10 000. Since all of the records have been pre-sorted, the remaining records can be retrieved by specifying the “offset” parameter. For instance, the next 10 000 records (10 001 to 20 000) can be retrieved using the following query:
https://pubchem.ncbi.nlm.nih.gov/rest/rdf/query?graph=substance&predicate=rdf:type&offset=10000
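Paging through a large result set amounts to stepping “offset” in multiples of the page size. A minimal sketch, assuming the default page size of 10 000 stated above; the generator name is hypothetical and no requests are actually sent.

```python
from urllib.parse import urlencode

QUERY_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/rdf/query?"
PAGE_SIZE = 10_000  # default and maximum value of "limit" per the text


def page_urls(graph: str, predicate: str, pages: int):
    """Yield query URLs for the first `pages` pages of a triple-pattern query."""
    for page in range(pages):
        params = {"graph": graph, "predicate": predicate}
        if page > 0:
            params["offset"] = page * PAGE_SIZE
        # Keep ':' unescaped so prefixed names stay readable in the URL.
        yield QUERY_BASE + urlencode(params, safe=":")


for url in page_urls("substance", "rdf:type", 3):
    print(url)
```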
In addition to the two basic parameters, either “subject” (or “subj”) or “object” (or “obj”) can be provided for filtering and grouping purposes. For instance, the following query can retrieve the first 10 000 synonyms that are drug brand names (trademarks):
https://pubchem.ncbi.nlm.nih.gov/rest/rdf/query?graph=synonym&pred=rdf:type&obj=sio:CHEMINF_000561
Multiple values for “subject” (or “subj”) or “object” (or “obj”) can be supplied in one query, delimited by commas (","). For instance, a single query can retrieve the synonyms that are either Chemical Abstracts Service registry numbers or European Commission numbers. The pagination parameter “offset” controls which page of the results is returned.
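A sketch of how a triple-pattern query with several comma-delimited values might be assembled is given below. The function name is hypothetical, the rule that subject and object are mutually exclusive is one reading of the text above, and the `ex:` values in the second call are placeholders rather than real PubChemRDF identifiers.

```python
from typing import Sequence
from urllib.parse import urlencode

QUERY_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/rdf/query?"


def triple_pattern_url(graph: str, predicate: str,
                       subjects: Sequence[str] = (),
                       objects: Sequence[str] = (),
                       offset: int = 0) -> str:
    """Build a single-triple-pattern query URL with comma-delimited values."""
    if subjects and objects:
        raise ValueError("provide either subjects or objects, not both")
    params = {"graph": graph, "pred": predicate}
    if subjects:
        params["subj"] = ",".join(subjects)
    if objects:
        params["obj"] = ",".join(objects)
    if offset:
        params["offset"] = offset
    # Leave ':' and ',' unescaped so the URL matches the documented style.
    return QUERY_BASE + urlencode(params, safe=":,")


print(triple_pattern_url("synonym", "rdf:type", objects=["sio:CHEMINF_000561"]))
# Placeholder type identifiers, for illustration only:
print(triple_pattern_url("synonym", "rdf:type", objects=["ex:TypeA", "ex:TypeB"], offset=10000))
```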
5.3 HTTP Response Status
If the operation after redirection on RESTful interface was successful, the RDF triples will be retrieved along with a 200 HTTP status code. If the server encounters an error, it will return an HTTP status code other than 200 in the response header. The HTTP codes in the 400 range indicate errors on the request side (invalid input of some form), and the HTTP codes in the 500 range indicate errors on the PubChem side (timeout or other issue). In the response content, some descriptive messages will be returned, indicating the potential causes of the errors.
The HTTP status codes and corresponding descriptions are as follows:
HTTP Status | Error | Description |
---|---|---|
400 | eBadRequest | Bad Query URL or Request URI |
404 | eNotFound | Input URI is invalid or cannot be identified in databases |
405 | eNotAllowed | MIME output format is unspecified or invalid |
500 | eServerError | Some problem on the server side occurs |
504 | eTimeout | The request timed out (over 28 seconds) |
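Client code can use the table above to turn a status code into a readable message. The sketch below copies the codes, error names, and descriptions from the table and falls back to the 400-range and 500-range rule for unlisted codes; the function name and the fallback wording are hypothetical.

```python
# Status codes, error names, and descriptions copied from the table above.
STATUS_TABLE = {
    400: ("eBadRequest", "Bad Query URL or Request URI"),
    404: ("eNotFound", "Input URI is invalid or cannot be identified in databases"),
    405: ("eNotAllowed", "MIME output format is unspecified or invalid"),
    500: ("eServerError", "Some problem on the server side occurs"),
    504: ("eTimeout", "The request timed out (over 28 seconds)"),
}


def describe_status(code: int) -> str:
    """Map an HTTP status code to a short human-readable description."""
    if code == 200:
        return "success"
    if code in STATUS_TABLE:
        name, text = STATUS_TABLE[code]
        return f"{name}: {text}"
    if 400 <= code < 500:
        return "request-side error"
    if 500 <= code < 600:
        return "server-side error"
    return "unexpected status"


print(describe_status(404))
```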
Please note that the HTTPS protocol works seamlessly in place of HTTP protocol for all URIs in the PubChemRDF RESTful interface. If you request data using ‘HTTPS’, URIs returned will be ‘https’. This is a feature and it may cause issues for some software packages that depend on the URI uniquely identifying an entity, down to the protocol requesting the URI. Generally speaking, all PubChem web-based resources are configured to work seamlessly with or without HTTP encryption (via ‘https’ or ‘http’ protocol, respectively). By default, and on the PubChemRDF FTP site, all URIs are specified using the HTTP protocol to the NCBI RDF website (http://rdf.ncbi.nlm.nih.gov).
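Software that compares URIs as opaque strings may want to normalize the protocol before matching against the bulk data, which uses `http`. A minimal sketch of such a normalization follows; the function name is hypothetical, and restricting the rewrite to the rdf.ncbi.nlm.nih.gov host is a design choice made here, not a documented rule.

```python
from urllib.parse import urlsplit, urlunsplit

RDF_HOST = "rdf.ncbi.nlm.nih.gov"


def canonical_uri(uri: str) -> str:
    """Rewrite https:// PubChemRDF URIs to the http:// form used in the bulk files."""
    parts = urlsplit(uri)
    if parts.scheme == "https" and parts.netloc == RDF_HOST:
        parts = parts._replace(scheme="http")
    return urlunsplit(parts)


print(canonical_uri("https://rdf.ncbi.nlm.nih.gov/pubchem/compound/CID2244"))
```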
6 RDF FTP Download Directory Layout
PubChemRDF data is available for bulk download at the PubChem FTP site (https://ftp.ncbi.nlm.nih.gov/pubchem/RDF/). It is highly recommended to use the FTP URL (ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/) to avoid restrictions, such as Robot Exclusion Standard (robots.txt).
Data is updated in its entirety approximately once per week [i.e., no incremental update is provided at this time]. A Vocabulary of Interlinked Datasets (VoID) description file (void.ttl) is provided at the root directory of the PubChemRDF FTP site. This file provides general metadata information about each release of PubChemRDF, such as provenance information, statistics (e.g., triple counts), dataset release date, and files.
The PubChemRDF FTP site is partitioned into subsets corresponding to the different PubChemRDF subdomains, so that individual subdomains can be downloaded separately. The top-level FTP directories correspond to the subdomains: compound, substance, descriptor, inchikey, synonym, bioassay, measuregroup, endpoint, protein, conserveddomain, gene, pathway, source, concept, and reference. Since the compound and descriptor subdomains have the largest numbers of triples, additional partitions have been applied to them. The compound subdomain has three subsets: general, nbr2d, and nbr3d; the descriptor subdomain has two subsets: compound and substance. Within each subdomain, the RDF triples are further split by semantic relation, such that the RDF predicates in each downloadable file are the same.
All RDF data files are in Turtle format and gzip compressed, as indicated by the suffix “.ttl.gz”. Data from one RDF subdomain may refer to other subdomains. Figure 1 depicts these interdependencies by means of arrows indicating outgoing references to other subdomains. Each file name follows the pattern “pc_<link>_<range>.ttl.gz”. The <link> indicates the file content type, and the optional <range> is a number that distinguishes the files of a large data set that has been split into several parts. For example, “pc_compound2descriptor_000001.ttl.gz” indicates that the semantic links are from the compound to the descriptor subdomain, and the number indicates that there are other files containing the same semantic associations.
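The file-name pattern can be parsed with a regular expression, which is handy when selecting files by semantic link. This is a sketch under the assumption that <range> is always a run of digits preceded by an underscore; the names `RdfFileName` and `parse_file_name` are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# pc_<link>[_<range>].ttl.gz, where <range> is assumed to be all digits.
_NAME = re.compile(r"^pc_(?P<link>.+?)(?:_(?P<range>\d+))?\.ttl\.gz$")


class RdfFileName(NamedTuple):
    link: str
    file_range: Optional[str]  # None when the data set is a single file


def parse_file_name(name: str) -> RdfFileName:
    """Split a PubChemRDF file name into its <link> and optional <range>."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"not a PubChemRDF data file name: {name!r}")
    return RdfFileName(match.group("link"), match.group("range"))


print(parse_file_name("pc_compound2descriptor_000001.ttl.gz"))
print(parse_file_name("pc_substance_type.ttl.gz"))
```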
6.1 PubChem Compound
Data for the PubChem “compound” RDF subdomain directory can be found here:
https://ftp.ncbi.nlm.nih.gov/pubchem/RDF/compound/
There are three subdirectories: “general”, “nbr2d”, and “nbr3d”. Each directory may have a ‘README’ file with more current information or additional information.
6.1.1 PubChem Compound “general”
Information contained here includes the links to chemical descriptors, ChEBI types, and the non-similarity based compound interrelationships, including parent, components, and compound identity group (CIG). CIGs relate chemicals by varying degrees of identity, for example, chemicals with identical connectivity (same atoms and bonds) whose stereoisomer and isotopic information may vary. The semantic links with the corresponding file names or prefixes are listed in the following table:
Semantic link | File name or prefix |
---|---|
sio:has-attribute a | pc_compound2descriptor_ |
vocab:has-parent a | pc_compound2parent_ |
cheminf:has-component | pc_compound2component.ttl.gz |
cheminf:has-stereoisomer | pc_compound2stereoisomer.ttl.gz |
cheminf:has-isotopologue | pc_compound2isotopologue.ttl.gz |
cheminf:has-uncharged-counterpart | pc_compound2uncharged.ttl.gz |
cheminf:has-same-connectivity-with a | pc_compound2sameconnectivity_ |
rdf:type | pc_compound_type.ttl.gz |
a File prefixes followed by the range numbers.
6.1.2 PubChem Compound “nbr2d”
Information contained here includes links between compounds according to the PubChem 2-D “Similar Compounds” neighboring relationship. The semantic link is “cheminf:has-2D-similar-compound”.
6.1.3 PubChem Compound “nbr3d”
Information contained here includes links between compounds according to the PubChem 3-D “Similar Conformers” neighboring relationship. The semantic link is “cheminf:has-3D-similar-compound”.
6.2 PubChem Substance
Data for the “substance” RDF subdomain directory can be found here:
https://ftp.ncbi.nlm.nih.gov/pubchem/RDF/substance/
Information contained here includes the links to the ChEBI types, the PDB crystal structures, depositor-provided descriptors including synonyms, the depositor-provided PubMed references, the standardized compound records, the measure groups, data sources, and the cross links to ChEMBL RDF. The semantic links with the corresponding file names or prefixes are listed in the following table:
Semantic link | File name or prefix |
---|---|
rdf:type | pc_substance_type.ttl.gz |
pdbo:link_to_pdb | pc_substance2pdb.ttl.gz |
sio:has-attribute a | pc_substance2descriptor_ |
cito:isDiscussedBy | pc_substance2reference.ttl.gz |
cheminf:has-standardized-compound a | pc_substance2compound_ |
bfo:participates-in a | pc_substance2measuregroup_ |
dcterms:source a | pc_substance_source_ |
skos:exactMatch | pc_substance_match.ttl.gz |
a File prefixes followed by the range numbers.
6.3 PubChem Descriptor
Data for the “descriptor” RDF subdomain directory can be found here:
https://ftp.ncbi.nlm.nih.gov/pubchem/RDF/descriptor/
There are two subdirectories: “compound” and “substance”.
6.3.1. PubChem Descriptor “compound”
Information contained here includes the type, value and unit (when applicable) of chemical descriptors (not including InChIKey). Since the chemical descriptor is a very large subdomain, it is further categorized into different descriptor types, and the downloadable files are organized accordingly. The file prefixes have the following pattern: pc_descr_<type>_<link>_<range>.ttl.gz. The descriptor types include InChI, canSMILES, isoSMILES, IUPACName, HBondDonor, HBondAcceptor, RotatableBond, Complexity, TautomerCount, XLogP3, DefinedAtomStereoCount, DefinedBondStereoCount, IsotopeAtomCount, HeavyAtomCount, UndefinedAtomStereoCount, UndefinedBondStereoCount, CovalentUnitCount, MolecularFormula, FormalCharge, MolecularWeight, MonoIsotopicWeight, ExactMass, and TPSA.
Representative semantic links with the corresponding file prefixes are listed in the following table:
Semantic type | Semantic link | File prefix |
---|---|---|
TPSA | rdf:type | pc_descr_TPSA_type_ a |
TPSA | sio:has-value | pc_descr_TPSA_value_ a |
TPSA | sio:has-unit | pc_descr_TPSA_unit_ a |
a File prefixes followed by the range numbers.
6.3.2. PubChem Descriptor “substance”
Information contained here includes the type and value for a given descriptor (i.e. substance version). The semantic links with the corresponding file prefixes are listed in the following table:
Semantic link | File prefix |
---|---|
rdf:type | pc_SubstanceVersion_type_ a |
sio:has-value | pc_SubstanceVersion_value_ a |
a File prefixes followed by the range numbers.
6.4 PubChem InChIKey
Data for the PubChem “inchikey” RDF subdomain directory can be found here:
https://ftp.ncbi.nlm.nih.gov/pubchem/RDF/inchikey/
Information contained here includes the type and the value of a given InChIKey.
The semantic links with the corresponding file names or prefixes are listed in the following table:
Semantic link | File name or prefix |
---|---|
rdf:type | pc_inchikey_type_ |
sio:has-value | pc_inchikey_value_ |
dcterms:subject | pc_inchikey_topic.ttl.gz |
sio:is-attribute-of | pc_inchikey2compound_ a |
a File prefixes followed by the range numbers.
6.5 PubChem Synonym
Data for the PubChem “synonym” RDF subdomain directory can be found here:
https://ftp.ncbi.nlm.nih.gov/pubchem/RDF/synonym/
Information contained here includes the type and the corresponding name string of a given MD5 hash string, as well as the links to the corresponding CID(s). The MD5 hash is used to provide a stable identifier for a given synonym. The mappings of synonyms to CIDs may be a subset of those possible from corresponding SID(s). PubChem performs processing on aggregated chemical information between PubChem contributors. This consistency filtering helps to eliminate promiscuous synonyms that correspond to multiple chemical structures (perhaps erroneously). The semantic links with the corresponding file names or prefixes are listed in the following table:
Semantic link | File name or prefix |
---|---|
rdf:type | pc_synonym_type_ |
sio:has-value | pc_synonym_value_ |
dcterms:subject | pc_synonym_topic.ttl.gz |
sio:is-attribute-of | pc_synonym2compound_ a |
a File prefixes followed by the range numbers.
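The idea of deriving a stable identifier from a synonym string can be illustrated with Python's `hashlib`. This is only a sketch of the general technique: the normalization applied here (trimming, lower-casing, UTF-8 encoding) is an assumption, since this page does not state how PubChem prepares a synonym before hashing, so the digests below are not guaranteed to match PubChemRDF identifiers.

```python
import hashlib


def synonym_hash(name: str) -> str:
    """Return the MD5 hex digest of a synonym string.

    Assumption: the name is trimmed, lower-cased, and UTF-8 encoded before
    hashing. PubChem's actual normalization may differ.
    """
    normalized = name.strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


digest = synonym_hash("Aspirin")
print(digest, len(digest))
```

Because the digest depends only on the normalized string, the same synonym always maps to the same identifier regardless of which depositor supplied it.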
6.6 PubChem BioAssay
Data for the “bioassay” RDF subdomain directory can be found here:
https://ftp.ncbi.nlm.nih.gov/pubchem/RDF/bioassay/
The file “pc_bioassay.ttl.gz” contains the descriptive information for a given AID that is represented as an instance of BAO_0000015, including the type, title, depositor, as well as the links to bioassay neighbors (if it has any) and the corresponding measure groups. The semantic links are “rdf:type”, “dcterms:title”, “dcterms:source”, “bao:has-measure-group”, and “bao:has-summary-assay” (optional).
6.7 PubChem MeasureGroup
Data for the “measuregroup” RDF subdomain directory can be found here:
https://ftp.ncbi.nlm.nih.gov/pubchem/RDF/measuregroup/
The files contain the descriptive information for a given AID that cannot be represented as an instance of BAO_0000015, as well as the links from measure group to the participating protein targets (if it has any), and the links from measure group to the corresponding endpoints (if it has any). The semantic links with the corresponding file names or prefixes are listed in the following table:
Semantic link | File name or prefix |
---|---|
rdf:type | pc_measuregroup_type.ttl.gz |
dcterms:title | pc_measuregroup_title.ttl.gz |
dcterms:source | pc_measuregroup_source.ttl.gz |
bfo:has-participants | pc_measuregroup2protein.ttl.gz |
obi:has-specified-output | pc_measuregroup2endpoint_ a |
a File prefixes followed by the range numbers.
6.8 PubChem Endpoint
Data for the “endpoint” RDF subdomain directory can be found here:
https://ftp.ncbi.nlm.nih.gov/pubchem/RDF/endpoint/
There are several files containing the type, value, unit, and label of a given endpoint, as well as the links to the substance and reference. The semantic links with the corresponding file names or prefixes are listed in the following table:
Semantic link | File name or prefix |
---|---|
rdf:type | pc_endpoint_type.ttl.gz |
sio:has-value | pc_endpoint_value.ttl.gz |
sio:has-unit | pc_endpoint_unit.ttl.gz |
vocab:PubChemAssayOutcome | pc_endpoint_outcome_ a |
rdfs:label | pc_endpoint_label.ttl.gz |
iao:is-about | pc_endpoint2substance_ a |
cito:citeAsDataSource | pc_endpoint2reference.ttl.gz |
a File prefixes followed by the range numbers.
6.9 PubChem Protein
Data for the “protein” RDF subdomain directory can be found here:
https://ftp.ncbi.nlm.nih.gov/pubchem/RDF/protein/
The file “pc_protein.ttl.gz” contains descriptive information for a given protein accession, including the type, title, alternative names, cross-links, conserved domain associations, and encoding gene information, as well as the links to the measure groups and the neighboring relationship between proteins. The semantic links are “rdf:type”, “dcterms:title”, “dcterms:alternative”, “skos:closeMatch”, “bfo:has-part”, “uniprot:encodedBy”, “bp:organism”, “cito:isDiscussedBy”, “pdbo:link_to_pdb”, and “vocab:has-similar-protein”.
6.10 PubChem ConservedDomain
Data for the “conserveddomain” RDF subdomain directory can be found here:
https://ftp.ncbi.nlm.nih.gov/pubchem/RDF/conserveddomain/
The file “pc_conserveddomain.ttl.gz” contains the type of a given PSSMID. The semantic links are “rdf:type”, “dcterms:title”, “dcterms:abstract”, and “cito:isDiscussedBy”.
6.11 PubChem Gene
Data for the “gene” RDF subdomain directory can be found here:
https://ftp.ncbi.nlm.nih.gov/pubchem/RDF/gene/
The file “pc_gene.ttl.gz” contains the type and symbol of a given GID. The semantic links are “rdf:type”, “bp:organism”, “cito:isDiscussedBy”, “dcterms:title”, “dcterms:description”, “skos:closeMatch”, and “sio:gene-symbol”.
6.12 PubChem Pathway
Data for the “pathway” RDF subdomain directory can be found here:
https://ftp.ncbi.nlm.nih.gov/pubchem/RDF/pathway/
The RDF statements contained in the file “pc_pathway.ttl.gz” provide the basic descriptions, including the type, title, and source for a given PWID. The semantic links are “rdf:type”, “bp:organism”, “cito:isDiscussedBy”, “skos:exactMatch”, “dcterms:title”, and “dcterms:source”.
6.13 PubChem Source
Data for the “source” RDF subdomain directory can be found here:
https://ftp.ncbi.nlm.nih.gov/pubchem/RDF/source/
The file “pc_source.ttl.gz” contains descriptive information for a given PubChem contributor; including data source identifier, display name, alternative names, organization, and homepage, as well as any classification of the given source, such as the Substance Categorization Classification information.
6.14 PubChem Reference
Data for the “reference” RDF subdomain directory can be found here:
https://ftp.ncbi.nlm.nih.gov/pubchem/RDF/reference/
There are several files containing the type, topics, citation, title, and date of a given reference. The semantic links with the corresponding file names or prefixes are listed in the following table:
Semantic link | File name or prefix |
---|---|
rdf:type | pc_reference_type.ttl.gz |
cito:discusses | pc_reference2chemical_disease_ a |
fabio:hasSubjectTerm | pc_reference2meshheading_ a |
dcterms:bibliographicCitation | pc_reference_citation_ a |
dcterms:title | pc_reference_title_ a |
dcterms:date | pc_reference_date.ttl.gz |
a File prefixes followed by the range numbers.
6.15 PubChem Taxonomy
Data for the “taxonomy” RDF subdomain directory can be found here:
https://ftp.ncbi.nlm.nih.gov/pubchem/RDF/taxonomy
The RDF statements contained in the file "pc_taxonomy.ttl.gz" provide semantic links: “dcterms:title”, “rdf:type”, and "skos:closeMatch".
7 Loading PubChemRDF
This section gives some SPARQL query examples of how PubChemRDF can be used under available Semantic Web frameworks. [Please note that these use cases assume some familiarity and proficiency with these tools.] Three popular Semantic Web frameworks that provide multiple collections of API functions to process RDF data are Apache Jena, OpenRDF Sesame, and the Redland RDF libraries. Jena and Sesame are publicly available Java frameworks, and Redland comprises a set of open source C libraries. All of them can be readily used to read, write, parse, serialize, and interpret RDF statements, and all of them provide both in-memory and persistent storage, as well as SPARQL querying mechanisms.
Recent technology development has changed the landscape, in particular for very large RDF stores (such as FRANZ AllegroGraph, OpenLink Virtuoso, Ontotext OWLIM, Garlik 4store, and SYSTAP Blazegraph) that can handle fast loading and querying of billions of triples. AllegroGraph is compatible with the Jena framework. Virtuoso provides fully operational data access and management through interface implementations of the Jena, Sesame, and Redland frameworks. Blazegraph supports Sesame API functions. OWLIM can deliver extensible and configurable performance with the Jena and Sesame frameworks. According to the most recent DB-Engines ranking, Jena and Virtuoso are among the most popular RDF persistent stores. The open source version of Virtuoso 7 installed on a single server can readily handle the core subset of PubChemRDF data (over 7 billion triples), which does not include the compound 2D/3D similarity triples. Therefore, we will describe how to load and query the PubChemRDF data (the core subset) on a CentOS 6 Linux system.
The core subset of PubChemRDF data can be downloaded using any FTP software. Here, we use a popular command line tool, wget, as an example. You can copy the following shell script and save it in a file named as "download_script.sh".
wget -r -A ttl.gz -nH --cut-dirs=3 -P compound ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/compound/general
wget -r -A ttl.gz -nH --cut-dirs=2 ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/substance
wget -r -A ttl.gz -nH --cut-dirs=2 ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/descriptor
wget -r -A ttl.gz -nH --cut-dirs=2 ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/synonym
wget -r -A ttl.gz -nH --cut-dirs=2 ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/inchikey
wget -r -A ttl.gz -nH --cut-dirs=2 ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/bioassay
wget -r -A ttl.gz -nH --cut-dirs=2 ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/measuregroup
wget -r -A ttl.gz -nH --cut-dirs=2 ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/endpoint
wget -r -A ttl.gz -nH --cut-dirs=2 ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/protein
wget -r -A ttl.gz -nH --cut-dirs=2 ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/pathway
wget -r -A ttl.gz -nH --cut-dirs=2 ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/conserveddomain
wget -r -A ttl.gz -nH --cut-dirs=2 ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/gene
wget -r -A ttl.gz -nH --cut-dirs=2 ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/source
wget -r -A ttl.gz -nH --cut-dirs=2 ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/concept
wget -r -A ttl.gz -nH --cut-dirs=2 ftp://ftp.ncbi.nlm.nih.gov/pubchem/RDF/reference
Then make the file executable:
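A minimal sketch of this step, assuming the script was saved as download_script.sh in the directory where the data should be stored. The helper name fetch_pubchemrdf is hypothetical; it only wraps the standard chmod call and then runs the script.

```shell
# Hypothetical helper: mark the download script as executable, then run it.
# The default file name follows the text above; pass another name if needed.
fetch_pubchemrdf() {
  script="${1:-download_script.sh}"
  chmod +x "$script" || return 1   # add execute permission
  "./$script"                      # start the wget downloads
}
```

Calling fetch_pubchemrdf from the target directory starts the download. Running chmod +x download_script.sh followed by ./download_script.sh by hand is equivalent.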
This script will download all of the PubChemRDF data into the current working directory, except the compound 2D/3D similarity triples. The downloaded files (over 40 GB) are organized in the same way as on the FTP site.
It is recommended to install Virtuoso on a server with at least 64 GB of memory and a 500 GB SSD (solid-state disk), and to tune the performance of the Virtuoso quad store by editing the “virtuoso.ini” file. Linux swapping behavior should be tuned as well. Extensive experiments have shown that the default index scheme of Virtuoso 7 already yields acceptable performance.
To further improve query performance, one more index, OGSP (O: object; G: graph; S: subject; P: predicate), should be added by running the following command in the “isql” command line:
Virtuoso has built-in bulk load functions to load RDF triples from multiple files in parallel. Before running the bulk load, it is recommended to change the transaction isolation level to “read committed” to avoid deadlocks during bulk loading. The transaction isolation level can be changed by adding the following line to the “virtuoso.ini” file:
Another way to change the transaction isolation level is through the “isql” command line:
It is highly recommended to load PubChemRDF data into multiple graphs, and to specify the graphs using the FROM clause in SPARQL queries. The graph index provided by Virtuoso can then be used to avoid full index scans and, as a result, to improve query performance. Bulk loading is done in two steps:
First, register all the files to be loaded from a given directory with the corresponding graph. The following statements can be run in the “isql” command line to register the datasets to be loaded into the given graphs (<Path> should be replaced by the local directory):
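ld_dir_all ('<Path>/compound/general', '*.ttl.gz', 'http://rdf.ncbi.nlm.nih.gov/pubchem/compound');
ld_dir_all ('<Path>/substance', '*.ttl.gz', 'http://rdf.ncbi.nlm.nih.gov/pubchem/substance');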
ld_dir ('<Path>/descriptor/compound', '*.ttl.gz', 'http://rdf.ncbi.nlm.nih.gov/pubchem/descriptor/compound');
ld_dir ('<Path>/descriptor/substance', '*.ttl.gz', 'http://rdf.ncbi.nlm.nih.gov/pubchem/descriptor/substance');
ld_dir_all ('<Path>/synonym', '*.ttl.gz', 'http://rdf.ncbi.nlm.nih.gov/pubchem/synonym');
ld_dir_all ('<Path>/inchikey', '*.ttl.gz', 'http://rdf.ncbi.nlm.nih.gov/pubchem/inchikey');
ld_dir ('<Path>/measuregroup', '*.ttl.gz', 'http://rdf.ncbi.nlm.nih.gov/pubchem/measuregroup');
ld_dir ('<Path>/endpoint', '*.ttl.gz', 'http://rdf.ncbi.nlm.nih.gov/pubchem/endpoint');
ld_dir ('<Path>/bioassay', '*.ttl.gz', 'http://rdf.ncbi.nlm.nih.gov/pubchem/bioassay');
ld_dir ('<Path>/protein', '*.ttl.gz', 'http://rdf.ncbi.nlm.nih.gov/pubchem/protein');
ld_dir ('<Path>/pathway', '*.ttl.gz', 'http://rdf.ncbi.nlm.nih.gov/pubchem/pathway');
ld_dir ('<Path>/conserveddomain', '*.ttl.gz', 'http://rdf.ncbi.nlm.nih.gov/pubchem/conserveddomain');
ld_dir ('<Path>/gene', '*.ttl.gz', 'http://rdf.ncbi.nlm.nih.gov/pubchem/gene');
ld_dir ('<Path>/reference', '*.ttl.gz', 'http://rdf.ncbi.nlm.nih.gov/pubchem/reference');
ld_dir ('<Path>/source', '*.ttl.gz', 'http://rdf.ncbi.nlm.nih.gov/pubchem/source');
ld_dir ('<Path>/concept', '*.ttl.gz', 'http://rdf.ncbi.nlm.nih.gov/pubchem/concept');
checkpoint;
Second, you can execute the bulk load function (“rdf_loader_run()”) multiple times, like:
isql 1111 dba dba exec="rdf_loader_run();" &
isql 1111 dba dba exec="rdf_loader_run();" &
isql 1111 dba dba exec="rdf_loader_run();" &
isql 1111 dba dba exec="rdf_loader_run();" &
isql 1111 dba dba exec="rdf_loader_run();" &
isql 1111 dba dba exec="rdf_loader_run();" &
wait
isql 1111 dba dba exec="checkpoint;"
The number of loader processes to run depends on the number of available processors. The core subset of PubChemRDF data can be loaded into the Virtuoso quad store using 10 processes within 10 hours, and the loaded datasets take approximately 500 GB of SSD space at the time of writing.
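The repeated loader calls can also be written as a loop so that the number of processes is easy to adjust. The sketch below assumes the same port (1111) and dba credentials used in the commands above; the function name run_bulk_loaders and its default of six processes are illustrative choices, not part of Virtuoso.

```shell
# Hypothetical helper: start N parallel bulk-loader sessions, wait for all
# of them to finish, then write a checkpoint.
run_bulk_loaders() {
  n="${1:-6}"                     # number of parallel loader processes
  i=0
  while [ "$i" -lt "$n" ]; do
    isql 1111 dba dba exec="rdf_loader_run();" &
    i=$((i + 1))
  done
  wait                            # block until every loader has exited
  isql 1111 dba dba exec="checkpoint;"
}
```

For example, run_bulk_loaders 10 starts ten loader processes, matching the process count mentioned above.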
After loading, it is recommended to inspect the loading registry table (DB.DBA.load_list) to see whether any job was killed or failed due to errors, by running the following command in the “isql” command line:
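A minimal sketch of such a check, assuming the standard columns of the Virtuoso bulk loader registry (ll_file, ll_state, ll_error), where an ll_state of 2 marks a completed file, and the same port and credentials as the isql examples in this section. The wrapper name list_unfinished_loads is hypothetical.

```shell
# Hypothetical helper: list registered files that did not finish loading
# (ll_state other than 2) or that recorded an error message.
list_unfinished_loads() {
  isql 1111 dba dba exec="SELECT ll_file, ll_state, ll_error FROM DB.DBA.load_list WHERE ll_state <> 2 OR ll_error IS NOT NULL;"
}
```

An empty result set suggests that every registered file was loaded without a recorded error.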
It is always good practice to check the “virtuoso.log” file after any operation. If the files to be loaded contain syntax errors, the error messages will appear in the log file.
Virtuoso has a dashboard graphical user interface (GUI) called Virtuoso Conductor, which can be accessed through http://<server-name>:<port-number>. After logging in, you can run SPARQL queries through Virtuoso Conductor just as you would through the “isql” command line. Another option is the Virtuoso SPARQL endpoint, which can be accessed through http://<server-name>:<port-number>/sparql. The SPARQL endpoint is protected by a set of parameters defined in the “virtuoso.ini” file, such as the timeout limit and the maximum number of concurrent users. You can change these settings either through Virtuoso Conductor or by editing the “virtuoso.ini” file directly.
You can access the SPARQL endpoint service either through a browser or by sending HTTP GET/POST requests to the SPARQL query service:
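As a sketch of the HTTP route, the function below sends a query with curl as a GET request. It assumes curl is installed and that the endpoint accepts the standard SPARQL protocol "query" parameter; the function name sparql_get and the requested JSON result format are illustrative choices.

```shell
# Hypothetical helper: send a SPARQL query to an endpoint over HTTP GET.
#   $1 = endpoint URL, e.g. http://<server-name>:<port-number>/sparql
#   $2 = SPARQL query text
sparql_get() {
  endpoint="$1"
  query="$2"
  # -G appends the URL-encoded data to the URL as a query string.
  curl -G "$endpoint" \
    -H 'Accept: application/sparql-results+json' \
    --data-urlencode "query=$query"
}
```

For example, sparql_get 'http://<server-name>:<port-number>/sparql' 'SELECT * WHERE { ?s ?p ?o } LIMIT 5' would request five arbitrary triples from the store.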
It is recommended to set namespace prefixes for the ontologies used in PubChemRDF using the predefined function (DB.DBA.XML_SET_NS_DECL), which can be run from the “isql” command line:
DB.DBA.XML_SET_NS_DECL ('compound', 'http://rdf.ncbi.nlm.nih.gov/pubchem/compound/', 2);
DB.DBA.XML_SET_NS_DECL ('substance', 'http://rdf.ncbi.nlm.nih.gov/pubchem/substance/', 2);
DB.DBA.XML_SET_NS_DECL ('descriptor', 'http://rdf.ncbi.nlm.nih.gov/pubchem/descriptor/', 2);
DB.DBA.XML_SET_NS_DECL ('synonym', 'http://rdf.ncbi.nlm.nih.gov/pubchem/synonym/', 2);
DB.DBA.XML_SET_NS_DECL ('inchikey', 'http://rdf.ncbi.nlm.nih.gov/pubchem/inchikey/', 2);
DB.DBA.XML_SET_NS_DECL ('bioassay', 'http://rdf.ncbi.nlm.nih.gov/pubchem/bioassay/', 2);
DB.DBA.XML_SET_NS_DECL ('measuregroup', 'http://rdf.ncbi.nlm.nih.gov/pubchem/measuregroup/', 2);
DB.DBA.XML_SET_NS_DECL ('endpoint', 'http://rdf.ncbi.nlm.nih.gov/pubchem/endpoint/', 2);
DB.DBA.XML_SET_NS_DECL ('reference', 'http://rdf.ncbi.nlm.nih.gov/pubchem/reference/', 2);
DB.DBA.XML_SET_NS_DECL ('protein', 'http://rdf.ncbi.nlm.nih.gov/pubchem/protein/', 2);
DB.DBA.XML_SET_NS_DECL ('conserveddomain', 'http://rdf.ncbi.nlm.nih.gov/pubchem/conserveddomain/', 2);
DB.DBA.XML_SET_NS_DECL ('gene', 'http://rdf.ncbi.nlm.nih.gov/pubchem/gene/', 2);
DB.DBA.XML_SET_NS_DECL ('pathway', 'http://rdf.ncbi.nlm.nih.gov/pubchem/pathway/', 2);
DB.DBA.XML_SET_NS_DECL ('source', 'http://rdf.ncbi.nlm.nih.gov/pubchem/source/', 2);
DB.DBA.XML_SET_NS_DECL ('concept', 'http://rdf.ncbi.nlm.nih.gov/pubchem/concept/', 2);
DB.DBA.XML_SET_NS_DECL ('vocab', 'http://rdf.ncbi.nlm.nih.gov/pubchem/vocabulary#', 2);
DB.DBA.XML_SET_NS_DECL ('obo', 'http://purl.obolibrary.org/obo/', 2);
DB.DBA.XML_SET_NS_DECL ('sio', 'http://semanticscience.org/resource/', 2);
DB.DBA.XML_SET_NS_DECL ('skos', 'http://www.w3.org/2004/02/skos/core#', 2);
DB.DBA.XML_SET_NS_DECL ('bao', 'http://www.bioassayontology.org/bao#', 2);
DB.DBA.XML_SET_NS_DECL ('bp', 'http://www.biopax.org/release/biopax-level3.owl#', 2);
DB.DBA.XML_SET_NS_DECL ('ndfrt', 'http://evs.nci.nih.gov/ftp1/NDF-RT/NDF-RT.owl#', 2);
DB.DBA.XML_SET_NS_DECL ('ncit', 'http://ncicb.nci.nih.gov/xml/owl/EVS/Thesaurus.owl#', 2);
DB.DBA.XML_SET_NS_DECL ('wikidata', 'http://www.wikidata.org/entity/', 2);
DB.DBA.XML_SET_NS_DECL ('ops', 'http://www.openphacts.org/units/', 2);
DB.DBA.XML_SET_NS_DECL ('cito', 'http://purl.org/spar/cito/', 2);
DB.DBA.XML_SET_NS_DECL ('fabio', 'http://purl.org/spar/fabio/', 2);
DB.DBA.XML_SET_NS_DECL ('uniprot', 'http://purl.uniprot.org/uniprot/', 2);
DB.DBA.XML_SET_NS_DECL ('up', 'http://purl.uniprot.org/core/', 2);
DB.DBA.XML_SET_NS_DECL ('pdbo', 'http://rdf.wwpdb.org/schema/pdbx-v40.owl#', 2);
DB.DBA.XML_SET_NS_DECL ('pdbr', 'http://rdf.wwpdb.org/pdb/', 2);
DB.DBA.XML_SET_NS_DECL ('taxonomy', 'http://identifiers.org/taxonomy/', 2);
DB.DBA.XML_SET_NS_DECL ('reactome', 'http://identifiers.org/reactome/', 2);
DB.DBA.XML_SET_NS_DECL ('chembl', 'http://rdf.ebi.ac.uk/resource/chembl/molecule/', 2);
DB.DBA.XML_SET_NS_DECL ('chemblchembl', 'http://linkedchemistry.info/chembl/chemblid/', 2);
DB.DBA.XML_SET_NS_DECL ('foaf', 'http://xmlns.com/foaf/0.1/', 2);
DB.DBA.XML_SET_NS_DECL ('void', 'http://rdfs.org/ns/void#', 2);
DB.DBA.XML_SET_NS_DECL ('dcterms', 'http://purl.org/dc/terms/', 2);
With these prefixes in place, query results are much easier to read and understand, in particular in the RDF Turtle format. Note that the “isql” command cannot override existing namespaces, so if a prefix with the same name already exists, you need to either remove it with another predefined function (DB.DBA.XML_REMOVE_NS_BY_PREFIX) or locate and change it manually through the Virtuoso Conductor, under the tab “Linked Data” -> “Namespaces”.
For example, to remove the existing namespace for the prefix ‘obo’, the following ‘isql’ command can be used:
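A sketch of that call, assuming DB.DBA.XML_REMOVE_NS_BY_PREFIX takes the prefix and the same persistence flag (2) used with DB.DBA.XML_SET_NS_DECL above, and the port and credentials from the earlier isql examples. The wrapper name remove_ns_prefix is hypothetical.

```shell
# Hypothetical helper: drop an existing namespace prefix declaration so that
# it can be registered again with a different URI.
#   $1 = prefix to remove, e.g. obo
remove_ns_prefix() {
  prefix="$1"
  isql 1111 dba dba exec="DB.DBA.XML_REMOVE_NS_BY_PREFIX('$prefix', 2);"
}
```

After remove_ns_prefix obo, the 'obo' prefix can be declared again with DB.DBA.XML_SET_NS_DECL.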
8 PubChemRDF Use Cases
The following sample SPARQL queries can help you to understand more about the PubChemRDF dataset. You can build other SPARQL queries based on the sample queries below.
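To try the sample queries from the command line, one option is to pass them to isql with the SPARQL keyword in front, which is how Virtuoso distinguishes SPARQL from SQL statements. The sketch assumes the port and credentials used in Section 7; the function name run_sparql is hypothetical.

```shell
# Hypothetical helper: run one SPARQL query through the isql command line.
#   $1 = SPARQL query text, without a trailing semicolon
run_sparql() {
  query="$1"
  # The leading SPARQL keyword tells Virtuoso to parse the rest as SPARQL.
  isql 1111 dba dba exec="SPARQL $query;"
}
```

For example, run_sparql 'SELECT (COUNT(*) AS ?n) FROM <http://rdf.ncbi.nlm.nih.gov/pubchem/protein> WHERE { ?s ?p ?o }' counts the triples loaded into the protein graph.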
Case 1: What protein targets does donepezil (CHEBI_53289) inhibit with an IC50 less than 10 microMolar?
SELECT DISTINCT ?protein ?title
from <http://rdf.ncbi.nlm.nih.gov/pubchem/protein>
from <http://rdf.ncbi.nlm.nih.gov/pubchem/measuregroup>
from <http://rdf.ncbi.nlm.nih.gov/pubchem/endpoint>
from <http://rdf.ncbi.nlm.nih.gov/pubchem/substance>
WHERE {
?sub rdf:type obo:CHEBI_53289 ; obo:RO_0000056 ?mg .
?mg obo:RO_0000057 ?protein ; obo:OBI_0000299 ?ep .
?protein rdf:type bp:Protein ; dcterms:title ?title .
?ep rdf:type bao:BAO_0000190 ; obo:IAO_0000136 ?sub ; sio:has-value ?value .
filter (?value < 10 )
}
Case 2: What pharmacological roles of SID46505803 are defined by CHEBI?
Note: CHEBI ontology should be downloaded, and loaded into a separate graph, i.e. <http://rdf.ncbi.nlm.nih.gov/pubchem/ruleset>
SELECT DISTINCT ?rolelabel
from <http://rdf.ncbi.nlm.nih.gov/pubchem/substance>
from <http://rdf.ncbi.nlm.nih.gov/pubchem/compound>
from <http://rdf.ncbi.nlm.nih.gov/pubchem/ruleset>
WHERE {
substance:SID46505803 sio:CHEMINF_000477 ?comp .
?comp rdf:type ?chebi .
?chebi rdfs:subClassOf [ a owl:Restriction ;
owl:onProperty obov:has_role ;
owl:someValuesFrom ?role ] .
?role rdfs:label ?rolelabel .
}
Case 3: What compounds have a pharmacological role of NSAID as defined by CHEBI and a molecular weight less than 200 g/mol?
Note: CHEBI ontology should be downloaded, and loaded into a separate graph, i.e. <http://rdf.ncbi.nlm.nih.gov/pubchem/ruleset>
SELECT distinct ?compound
from <http://rdf.ncbi.nlm.nih.gov/pubchem/compound>
from <http://rdf.ncbi.nlm.nih.gov/pubchem/descriptor>
from <http://rdf.ncbi.nlm.nih.gov/pubchem/ruleset>
WHERE {
?compound rdf:type ?chebi .
?chebi rdfs:subClassOf [ a owl:Restriction ;
owl:onProperty obov:has_role ;
owl:someValuesFrom obo:CHEBI_35475 ] .
?compound sio:has-attribute ?MW .
?MW rdf:type sio:CHEMINF_000334 .
?MW sio:has-value ?MWValue .
filter (?MWValue < 200 )
}
Case 4: What substances have a pharmacological role of NSAID as defined by CHEBI and the depositor-provided 3D X-ray structure information?
Note: CHEBI ontology should be downloaded, and loaded into a separate graph, i.e. <http://rdf.ncbi.nlm.nih.gov/pubchem/ruleset>
SELECT DISTINCT ?substance ?source
from <http://rdf.ncbi.nlm.nih.gov/pubchem/substance>
from <http://rdf.ncbi.nlm.nih.gov/pubchem/source>
from <http://rdf.ncbi.nlm.nih.gov/pubchem/ruleset>
WHERE {
?substance dcterms:source ?source .
?source dcterms:subject concept:Protein_3D_Structures .
?substance rdf:type ?chebi .
?chebi rdfs:subClassOf [ a owl:Restriction ;
owl:onProperty obov:has_role ;
owl:someValuesFrom obo:CHEBI_35475 ] .
}
Case 5: What protein targets are inhibited by substances with an IC50 less than 10 µM and have a pharmacological role of cholinesterase inhibitors as defined by CHEBI?
select distinct ?title
from <http://purl.obolibrary.org/obo>
from <http://rdf.ncbi.nlm.nih.gov/pubchem/substance>
from <http://rdf.ncbi.nlm.nih.gov/pubchem/measuregroup>
from <http://rdf.ncbi.nlm.nih.gov/pubchem/endpoint>
from <http://rdf.ncbi.nlm.nih.gov/pubchem/protein>
from <http://rdf.ncbi.nlm.nih.gov/pubchem/ruleset>
where {
?chebi rdfs:subClassOf [ a owl:Restriction ; owl:onProperty obov:has_role ; owl:someValuesFrom obo:CHEBI_37733 ] .
?sub rdf:type ?chebi ; obo:RO_0000056 ?mg .
?mg obo:RO_0000057 ?protein ; obo:OBI_0000299 ?ep .
?protein rdf:type bp:Protein ; dcterms:title ?title .
?ep rdf:type bao:BAO_0000190 ; obo:IAO_0000136 ?sub ; sio:has-value ?value .
filter (?value < 10 )
}
Case 6: Which substances inhibit protein targets similar to ACCP05979 and have the function domain PSSMID188648?
select distinct ?substance
from <http://rdf.ncbi.nlm.nih.gov/pubchem/substance>
from <http://rdf.ncbi.nlm.nih.gov/pubchem/measuregroup>
from <http://rdf.ncbi.nlm.nih.gov/pubchem/endpoint>
from <http://rdf.ncbi.nlm.nih.gov/pubchem/protein>
from <http://rdf.ncbi.nlm.nih.gov/pubchem/conserveddomain>
where {
?substance obo:RO_0000056 ?measuregroup .
?measuregroup obo:RO_0000057 ?protein .
protein:ACCP05979 vocab:hasSimilarProtein ?protein .
?protein obo:RO_0002180 conserveddomain:PSSMID188648 .
?measuregroup obo:OBI_0000299 ?endpoint .
?endpoint obo:IAO_0000136 ?substance .
?endpoint rdf:type bao:BAO_0000190 .
?endpoint sio:has-value ?value .
}
Case 7: What protein targets are inhibited by substances with IC50 less than 10 µM and have the same standardized chemical structure (CID3152)?
select distinct ?protein ?title
from <http://rdf.ncbi.nlm.nih.gov/pubchem/protein>
from <http://rdf.ncbi.nlm.nih.gov/pubchem/measuregroup>
from <http://rdf.ncbi.nlm.nih.gov/pubchem/endpoint>
from <http://rdf.ncbi.nlm.nih.gov/pubchem/substance>
where {
?sub sio:CHEMINF_000477 compound:CID3152 ; obo:RO_0000056 ?mg .
?mg obo:RO_0000057 ?protein ; obo:OBI_0000299 ?ep .
?protein rdf:type bp:Protein ; dcterms:title ?title .
?ep rdf:type bao:BAO_0000190 ; obo:IAO_0000136 ?sub ; sio:has-value ?value .
filter (?value < 10 )
}
Case 8: What substances inhibit the proteins involved in the same biological pathway: prostaglandin biosynthetic process (GO:0001516), with an IC50 less than 10 µM?
select distinct ?substance
from <http://rdf.ncbi.nlm.nih.gov/pubchem/substance>
from <http://rdf.ncbi.nlm.nih.gov/pubchem/measuregroup>
from <http://rdf.ncbi.nlm.nih.gov/pubchem/endpoint>
from <http://rdf.ncbi.nlm.nih.gov/pubchem/protein>
from <http://rdf.ncbi.nlm.nih.gov/pubchem/gene>
from <http://rdf.ncbi.nlm.nih.gov/pubchem/pathway>
where {
?substance obo:RO_0000056 ?measuregroup .
?measuregroup obo:RO_0000057 ?protein .
?protein rdf:type bp:Protein .
?protein up:encodedBy ?gene .
?gene rdf:type bp:Gene .
?gene obo:RO_0000056 obo:GO_0001516 .
?measuregroup obo:OBI_0000299 ?endpoint .
?endpoint obo:IAO_0000136 ?substance .
?endpoint rdf:type bao:BAO_0000190 .
?endpoint sio:has-value ?value .
filter (?value < 10)
}
Case 9: What are the pharmacological roles, as defined by CHEBI, of the substances that inhibit protein target ACCQ12809 with an IC50 less than 10 µM?
select distinct ?rolelabel
from <http://rdf.ncbi.nlm.nih.gov/pubchem/measuregroup>
from <http://rdf.ncbi.nlm.nih.gov/pubchem/endpoint>
from <http://rdf.ncbi.nlm.nih.gov/pubchem/substance>
from <http://rdf.ncbi.nlm.nih.gov/pubchem/ruleset>
from <http://purl.obolibrary.org/obo>
where {
?sub obo:RO_0000056 ?mg .
?mg obo:RO_0000057 protein:ACCQ12809 ; obo:OBI_0000299 ?ep .
?sub rdf:type ?chebi .
?chebi rdfs:subClassOf _:I .
_:I a owl:Restriction .
_:I owl:onProperty <http://purl.obolibrary.org/obo#has_role> .
_:I owl:someValuesFrom ?role .
?role rdfs:label ?rolelabel .
?ep obo:IAO_0000136 ?sub ; rdf:type bao:BAO_0000190 ; sio:has-value ?value .
filter (?value < 10 )
}
Case 10: Summarize the statistics about the total number of substances tested in the PubChem database against each protein target.
Note: this may be a time-consuming query.
select ?protein (count(distinct ?sub) as ?subcnt)
from <http://rdf.ncbi.nlm.nih.gov/pubchem/substance>
from <http://rdf.ncbi.nlm.nih.gov/pubchem/measuregroup>
from <http://rdf.ncbi.nlm.nih.gov/pubchem/endpoint>
from <http://rdf.ncbi.nlm.nih.gov/pubchem/protein>
where {
?sub obo:RO_0000056 ?mg .
?mg obo:RO_0000057 ?protein .
?protein rdf:type bp:Protein .
?mg obo:OBI_0000299 ?ep .
?ep rdf:type bao:BAO_0000190 ; obo:IAO_0000136 ?sub ; sio:has-value ?value .
}
group by ?protein
order by ?subcnt
9 Document Version History
V1.7.2.b - 2022Feb - Fixed bug in the docs of downloading data with wget over ftp instead of https to avoid robots.txt restrictions.
V1.7.1.b - 2021Sep - Fixed bug in the docs of Endpoint to make it consistent with data: qudt:numericValue to sio:has-value; xsd:double to xsd:float; qudt:unit to sio:has-unit; ops:Micromolar to uo:micromolar.
V1.7.0.b – 2020Apr - overview of major changes:
- Now using NCBI accession as protein identifier (NCBI phased out GI).
- biosystem subdomain was replaced by pathway (PubChem superseded the NCBI BioSystems database). Compounds, proteins, and genes are defined as participants of pathways, i.e. outgoing links (vs. incoming links).
- Added a new disease subdomain
- GO links were removed from the protein subdomain in favor of gene-GO links in the gene subdomain
- The source subdomain now includes pathway sources
- Some predicates from PubChem internal vocabulary were replaced in favor of external ones (e.g. vocab:geneSymbol -> sio:gene-symbol); Non-addressable predicates were updated, e.g. obo:BFO_0000057 -> obo:RO_0000057
V1.6.3.b – 2019Nov25 – Fix bugs of an empty link in the example of measuregroup due to a revoked assay.
V1.6.2.b – 2018Dec19 – Fix bugs in URI resolving and content negotiation in RESTful Interfaces.
V1.6.1.b – 2016Sep14 – PubChemRDF is HTTPS-only now. All of the URIs are redirected to 'https', including the URLs in the PubChemRDF Release Notes
V1.6.b – 2016March – overview of major changes:
- Added more figures to demonstrate PubChemRDF graphs
- Listed the classes and predicates used for each existing ontology leveraged by PubChemRDF
- Included up-to-date statistics, i.e. the total number of triples, in the release notes
V1.5.2.b – 2015Oct01 – updated Section 5.2.2. The simple SPARQL-like queries allow multiple values to be provided for subject and object; Changed "SYSTAP Bigdata" to "SYSTAP Blazegraph".
V1.5.1.b – 2015Aug04 – updated Figure 1 with synonym type link and changed the namespace of NDFRT.
V1.5.b – 2015June – overview of major changes:
- Added more figures to demonstrate PubChemRDF graphs
- Listed the classes and predicates used for each existing ontology leveraged by PubChemRDF
- Added the subdomain “conserveddomain” to help organize and complement protein annotation
- Integrated MeSH with the “reference” subdomain
- Added direct links to external RDF triple stores: NLM’s MeSH RDF; EBI’s ChEMBL RDF, PDB RDF, UniProt RDF, and Reactome RDF; and WikiData RDF
- Added dependencies on the ontologies: “ncit” and “ndfrt”
- Inferable (and therefore redundant) predicates were removed to reduce triple count
- JSON-LD format is now available in URI de-referencing
- New RESTful API functions for substring search and simple SPARQL-like query
- Numerous improvements to documentation, including major changes to Figure 1 reflecting external ontology and external RDF triple store linking information
V1.1.b – 2014Mar05 – added sections 1.1 and 1.2 and added Virtuoso bulk loading description.
V1.0.b – 2014Jan13 – Initial beta release.