Co-occurrences in Literature
Biomedical literature records in public repositories, together with genomic and chemical data, could be a source of knowledge for biomedical discovery if the interrelations between different pieces of data are well understood. However, collecting, curating, and transforming the data into an easy-to-comprehend knowledge representation is a major challenge. We aim to uncover and summarize important relationships between chemicals, genes, proteins, and diseases by analyzing co-occurrences of terms in biomedical literature.
We utilize natural-language processing (NLP) software to annotate PubMed records, and then use our own software along with additional information from PubChem and other public databases to perform statistical analysis and relevance-based sampling to identify the most relevant information and summarize it to a compact form in PubChem Compound and Gene pages [1-6].
For more details about this topic, please read this paper:
PubMed PMID: 34322655; PubMed Central PMCID: PMC8311438.
Knowledge panels are created for compound and gene pages to show the compound, gene, and disease terms most frequently co-occurring in PubMed records with the compound or gene for which the page is created. These entities, most frequently co-occurring in literature records, are called neighbors. Several relevant PubMed records are listed for each selected neighbor.
The neighbors are selected based on the co-occurrence of the text entities in PubMed records. For a query compound, its neighbors are ranked based on the co-occurrence score, with the neighbor having the largest co-occurrence score considered the most related. We use the information gain-based co-occurrence score described below.
Since it is important to show the user non-redundant neighbors, the exclusion rules described below are applied to avoid redundancy. PubMed records co-mentioning the query and the selected neighbor are sampled based on the relevance score.
Annotating PubMed records. We utilize NLP software [7, 8] to annotate PubMed records, then use our own software to match the chemical terms found in the records to PubChem filtered synonyms, and to perform statistical analysis and relevance-based sampling.
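The synonym-matching step can be sketched as a dictionary lookup. The helper below is a simplified illustration (the function name, the case-insensitive matching, and the toy CID value are assumptions for this sketch; the production matching described in [2] is considerably more involved):

```python
def match_synonyms(annotated_terms, filtered_synonyms):
    """Map chemical terms found by an NLP annotator to compound ids.

    filtered_synonyms: dict of lower-cased synonym -> compound id (CID),
    standing in for the PubChem filtered-synonym list mentioned in the text.
    Matching here is a plain case-insensitive lookup.
    """
    matches = {}
    for term in annotated_terms:
        cid = filtered_synonyms.get(term.lower())
        if cid is not None:
            matches[term] = cid
    return matches
```

Terms that do not match any filtered synonym are simply dropped, so only recognized entities feed into the co-occurrence counts.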
Scoring the co-occurrences. The information gain-based co-occurrence score for entities i and j is calculated by the formula

score(i, j) = Nij × log2( (Nij × NDS) / (Ni × Nj) )

where Nij is the number of records in which both entities i and j are annotated; Ni and Nj are the numbers of records in which entity i (respectively, j) has been annotated along with any object of the type the other entity belongs to (say, Nj is the number of records containing annotations for disease j along with any compound annotation); and NDS is the effective dataset size. This score is derived from the Kullback–Leibler divergence, also known as relative entropy [9, 10]. The information gain-based score can be considered a variant of the term frequency–inverse document frequency (TF-IDF) score [10-13].
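A minimal sketch of the scoring and ranking steps, assuming the score has the information-gain form Nij · log2(Nij · NDS / (Ni · Nj)) implied by the Kullback–Leibler derivation (function and variable names are our own):

```python
import math

def cooccurrence_score(n_ij: int, n_i: int, n_j: int, n_ds: int) -> float:
    """Information gain-based co-occurrence score for entities i and j.

    n_ij: records annotating both i and j
    n_i, n_j: records annotating i (or j) along with any entity of the
              other entity's type
    n_ds: effective dataset size
    """
    if n_ij == 0:
        return 0.0
    return n_ij * math.log2((n_ij * n_ds) / (n_i * n_j))

def rank_neighbors(counts, n_i, n_ds):
    """Rank candidate neighbors of a query entity by descending score.

    counts: dict neighbor id -> (n_ij, n_j)
    """
    scored = [(j, cooccurrence_score(n_ij, n_i, n_j, n_ds))
              for j, (n_ij, n_j) in counts.items()]
    return sorted(scored, key=lambda t: t[1], reverse=True)
```

The neighbor with the largest score is then considered the most related to the query entity.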
Selecting non-redundant neighbor compounds. In the case of PubChem compounds, redundancy and near-redundancy elimination is performed. For each query compound, we select several neighbor compounds with the highest frequency of co-occurrence, based on the counts of PubMed records mentioning both compounds (*). To avoid redundancy (that is, showing close neighbors of the query compound, as well as close neighbors of already-selected neighbor compounds), we apply exclusion rules. The rules are first applied before looking for neighbor compounds, with the query compound as the center compound, and then after each neighbor-selection step, with the last selected neighbor compound as the center compound.
(*) Only synonyms from the PubChem list of filtered synonyms that matched PubMed records are taken into consideration here.
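The greedy selection loop described above can be sketched as follows. Since the exact exclusion rules are not spelled out in this excerpt, they are modeled as an abstract callback (`excluded_around`); all names here are illustrative:

```python
def select_neighbors(query, candidates, scores, excluded_around, k=5):
    """Greedy selection of non-redundant neighbor compounds.

    candidates: iterable of candidate compound ids
    scores: dict candidate -> co-occurrence score with the query
    excluded_around: function(center) -> set of compounds considered too
        close to `center` (a stand-in for the exclusion rules)
    k: number of neighbors to keep
    """
    # Exclusion rules are first applied with the query as the center compound.
    excluded = set(excluded_around(query)) | {query}
    selected = []
    pool = sorted(candidates, key=lambda c: scores.get(c, 0.0), reverse=True)
    for c in pool:
        if len(selected) == k:
            break
        if c in excluded:
            continue
        selected.append(c)
        # ...and re-applied after each selection step, with the newly
        # selected neighbor as the center compound.
        excluded |= set(excluded_around(c))
    return selected
```

Each selected neighbor thus prunes its own close neighbors from the remaining candidate pool, which keeps the final list diverse.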
Sampling relevant PubMed records. PubMed records co-mentioning a pair of text entities are sampled based on the relevance score. The score is calculated based on the following factors:
- Whether both entities are co-mentioned in the title;
- Whether both entities are co-mentioned in the same sentence in the abstract;
- Whether one of the entities is mentioned in the title;
- Number of times each entity is mentioned;
- Whether the publication is a review;
- Publication date.
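The factors above could be combined into a single score along these lines. The weights and the record layout below are purely illustrative assumptions, not PubChem's actual values:

```python
from datetime import date

def relevance_score(rec, e1, e2, today=date(2024, 1, 1)):
    """Toy relevance score for a PubMed record co-mentioning e1 and e2.

    rec: dict with keys title (str), sentences (list of abstract
    sentences), mentions (dict entity -> count), is_review (bool),
    pub_date (date). All weights are illustrative.
    """
    score = 0.0
    in_title = lambda e: e in rec["title"]
    if in_title(e1) and in_title(e2):
        score += 4.0                        # both entities in the title
    elif in_title(e1) or in_title(e2):
        score += 1.0                        # one entity in the title
    if any(e1 in s and e2 in s for s in rec["sentences"]):
        score += 2.0                        # co-mention in one sentence
    # Reward repeated mentions of either entity.
    score += 0.1 * (rec["mentions"].get(e1, 0) + rec["mentions"].get(e2, 0))
    if rec["is_review"]:
        score += 1.0
    # Mild recency bonus that decays with publication age.
    age_years = (today - rec["pub_date"]).days / 365.25
    score += max(0.0, 2.0 - 0.2 * age_years)
    return score
```

Records co-mentioning the pair are then sampled with probability weighted by this score, so title co-mentions and recent reviews surface first.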
Selecting the time period for the publication dates. While reconciliation of different relevance factors is an intricate problem, balancing the publication date against the other relevance factors can be especially difficult and strongly depends on the user's needs. To enable user input while operating on pre-calculated data in the default setting (to ensure PubChem system efficiency), we allow selection of the preferred publication period from a limited number of options (last year, last 5 years, or last 10 years). The page view is formed and updated at the front end from the pre-calculated data, based on the user's selection.
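Because the data are pre-calculated, the front end only needs to filter the already-ranked records by publication window. A minimal sketch (function name and record layout are assumptions):

```python
from datetime import date, timedelta

def filter_by_period(records, period_years, today=date(2024, 1, 1)):
    """Filter pre-calculated, relevance-ranked records by publication window.

    period_years: 1, 5, or 10, mirroring the page's last year / 5 years /
    10 years options. Records are dicts with a 'pub_date' key; their order
    (the pre-computed relevance ranking) is preserved.
    """
    cutoff = today - timedelta(days=round(365.25 * period_years))
    return [r for r in records if r["pub_date"] >= cutoff]
```

Keeping the filter on the client side means switching between the three period options requires no new server-side computation.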
1. Zaslavsky, L. et al. Strategies to improve PubChem data quality and search effectiveness through data analysis. The 252nd ACS National Meeting, Philadelphia, August 21-25, 2016.
2. Zaslavsky, L. et al. Improving chemical names matching for verification, rating and validation of PubChem Compound records. The 253rd ACS National Meeting, San Francisco, April 2-6, 2017.
3. Zaslavsky, L. et al. Towards linking chemical-disease and chemical-gene information in PubChem. The 254th ACS National Meeting, Washington, DC, August 20-24, 2017.
4. Zaslavsky, L. et al. From text mining to knowledge: PubChem knowledge panels provide synopsis of chemical, gene, protein and disease term co-occurrences in biomedical literature. The 256th ACS National Meeting, Boston, August 19-23, 2018.
5. Zaslavsky, L. et al. Enhancing data-driven summarization of relations between chemicals, genes, proteins, and diseases based on text mining of biomedical literature. The 258th ACS National Meeting, San Diego, August 25-29, 2019.
6. Kim S, Chen J, Cheng T, Gindulyte A, He J, He S, Li Q, Shoemaker BA, Thiessen PA, Yu B, Zaslavsky L, Zhang J, Bolton EE. PubChem 2019 update: improved access to chemical data. Nucleic Acids Res. 2019 Jan 8;47(D1):D1102-D1109.
7. Lowe, D.M., Sayle, R.A. LeadMine: a grammar and dictionary driven approach to entity recognition. J Cheminform. 2015; 7(Suppl 1): S5.
8. LeadMine. NextMove Software. https://www.nextmovesoftware.com/leadmine.html.
9. Kullback, S. and Leibler, R. A. (1951). "On information and sufficiency." Ann Math Stat 22(1): 79-86.
10. Manning, C., Raghavan, P., Schütze, H. (2008). "Introduction to Information Retrieval." Cambridge University Press, 1st edition.
11. Rajaraman, A. and Ullman, J. (2011). "Mining of Massive Datasets." Cambridge University Press, 1st edition.
12. Spärck Jones, K. (1972). "A Statistical Interpretation of Term Specificity and Its Application in Retrieval." Journal of Documentation. 28: 11-21.
13. Robertson, S. (2004). "Understanding inverse document frequency: On theoretical arguments for IDF." Journal of Documentation. 60(5): 503-520.