Please use the various tabs to find out more about KMINE.
For questions, feature requests and other suggestions, please email us at kmine@tenwise.nl.
KMINE
Welcome to the KMINE web server. This server was created by TenWise on the basis of the KMAP database, which holds more than 200 million literature relations that have been automatically mined from the PubMed database using a number of text mining algorithms. This site is organized into several analysis modes (shown in the tabs) that allow you to explore relations between biological concepts (KMAP-Scoping), build your own biological literature networks (KMAP-Multi) and organize the literature evidence for your concept (KMAP-Evidence). This flow is shown in the figure below. The details of each analysis mode are explained on this page and also on the Help tabs in each individual analysis mode.

 

Start page
On this page you enter your concept of interest.
A concept is a biological entity that can be:
  • a human gene
  • a bacterial gene
  • a metabolite
  • a molecular pathway
  • a phenotype
  • a disease
  • a drug
  • an organism
All these concepts are derived from a number of datasources. See the tab Datasources for more info on these concept lists.
KMAP-Scoping

On this page, the relations between your concept and the vocabularies selected in the left panel are shown. The total number of hits with your concept and the total number of terms in the vocabulary are displayed in parentheses.

Relations: The generated table has the following columns:

  • Name: Name of the concept. Clicking on the header sorts alphabetically.
  • Type: The vocabulary from which the concept is derived.
  • Nr. Abstracts: The number of abstracts in which both concepts co-occur.
  • Score: A score denoting the uniqueness of the relation. It takes into account the frequency of the individual terms and the frequency of their co-occurrence.

The Search field can be used to search the entire table for a keyword. Selecting a row in the table shows the abstracts in the results panel on the right. Abstracts in that panel can be selected by clicking one or more rows and added to your Abstract Set by pressing the button Add selected abstracts to Abstract Set. The collected abstracts can be visited at any time via the My Abstracts tab. The selection in My Abstracts does not change when the search conditions are altered.

When a concept is selected in the table, clicking the button ‘Add Concept to KMAP Multi Set’ adds this concept to your concept set, which you can explore further using the KMAP-Multi tab.

Wordclouds: The wordcloud is a visual representation of the data in the Relations table. The font size of each concept is proportional to its Score. The sliders can be used to set the maximum number of words per category and the scaling of the fonts.
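
The effect of the two sliders can be sketched in a few lines of Python. This is only an illustration and not the code behind the server: the function name `wordcloud_sizes`, the `(name, category, score)` tuple layout and the example scores are all assumptions.

```python
from collections import defaultdict

def wordcloud_sizes(relations, max_words=20, scale=2.0):
    """Map concept names to font sizes proportional to their score.

    relations: iterable of (name, category, score) tuples (assumed layout).
    max_words: slider 1, the maximum number of words kept per category.
    scale:     slider 2, the multiplier that turns a score into a font size.
    """
    by_category = defaultdict(list)
    for name, category, score in relations:
        by_category[category].append((name, score))

    sizes = {}
    for items in by_category.values():
        # Keep only the highest-scoring words of each category.
        items.sort(key=lambda item: item[1], reverse=True)
        for name, score in items[:max_words]:
            sizes[name] = scale * score
    return sizes

# Hypothetical relations for one query concept.
example_relations = [
    ("IL6", "gene", 10.0),
    ("TNF", "gene", 4.0),
    ("asthma", "disease", 6.0),
]
example_sizes = wordcloud_sizes(example_relations, max_words=1, scale=2.0)
```

With `max_words=1`, only the top concept of each category survives, so lowering the first slider thins out the cloud without changing the relative sizes of the remaining words.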

Network: This network shows the data from the Relations table, with the selected concept in the center and up to 50 other concepts per category. The concepts (dots with their name inside) are coloured by type, using the same colours as the wordclouds. In the upper right-hand corner, the bar called ‘number of connections per category’ can be used to create and alter the network picture using between 0 and 50 concepts per category. The thickness of the arrows between the nodes indicates the strength of the relationship between the concepts: the thicker the arrow, the higher the score for the abstracts in which the two co-occur. Clicking on an arrow redirects you to the TenWise server, which shows the abstracts with the co-occurrences. Clicking on a node redirects you to the TenWise server, which shows the abstracts on this single concept. Finally, the network shape can be changed manually by dragging the nodes across the screen. Screenshots can be made and used for reporting.
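
As a rough sketch of how such a star-shaped network could be assembled from the Relations table, the Python below keeps the best relations per category and derives an arrow width from the score. The function name `build_star_network`, the tuple layout and the `max_width` parameter are assumptions made for illustration; scores are assumed to be positive.

```python
import heapq
from collections import defaultdict

def build_star_network(center, relations, per_category=50, max_width=8.0):
    """Return edges (center, name, category, width) for the top relations.

    relations:    iterable of (name, category, score) tuples (assumed layout).
    per_category: number of connections kept per category (0 to 50 on the site).
    max_width:    width given to the highest-scoring arrow (hypothetical).
    """
    by_category = defaultdict(list)
    for name, category, score in relations:
        by_category[category].append((score, name))

    kept = []
    for category, items in by_category.items():
        # nlargest returns the highest-scoring (score, name) pairs first.
        for score, name in heapq.nlargest(per_category, items):
            kept.append((name, category, score))

    if not kept:
        return []
    top_score = max(score for _, _, score in kept)
    if top_score <= 0:
        raise ValueError("scores are assumed to be positive")
    # Arrow width grows linearly with the score of the relation.
    return [
        (center, name, category, max_width * score / top_score)
        for name, category, score in kept
    ]

# Hypothetical relations around the query concept CXCR3.
example_edges = build_star_network(
    "CXCR3",
    [("IL6", "gene", 10.0), ("TNF", "gene", 5.0), ("asthma", "disease", 2.5)],
    per_category=1,
)
```

Setting `per_category=0` yields an empty edge list, which corresponds to a picture with only the central concept.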

 

KMAP-Evidence

This page shows abstracts for your selected concept, scored between 0 and 1 for various types of evidence. The score is determined by running a previously trained Random Forest model on the abstracts. Evidence is defined as the biological context of the abstract in which the concept appears. For example, an abstract with a score of 1 for clinical is most likely about a clinical trial or an observational study with patients. Likewise, an abstract with a high score for mouse is most likely about a mouse experimental model. The type of model can be selected in the left-hand selection panel; the models are described below. Not all models are available if you are not logged in.

  • Clinical: Clinical trials, patient observations
  • Variant: Polymorphisms, DNA rearrangements, mutations
  • Microbiome: Microbiota, metagenomics and 16S profiling studies
  • Mouse: Experimental mouse models, studies on mouse-derived data
  • Rat: Experimental rat models, studies on rat-derived data

In all cases, the 100 abstracts with the highest score for the selected model are shown. When you change the selected model, the output in the Evidence tab is updated automatically. The generated table (Article classes) has the following columns:

Citation: The articles in which your concept occurs, selected for high scores with the selected model.

Score: The score (between 0 and 1) from the Random Forest model.

Year: The year of the publication.

The search field can be used to filter on words used in the citations. This can be any word, including author names and journal abbreviations.

Clicking a row in the table shows the corresponding abstract on the right-hand side. Clicking on the hyperlink (number) brings you to the abstract on the PubMed site. Abstracts in the table can be selected and added to your set by pressing the button Add selected abstracts to My Abstracts. The selected abstracts are then listed under My Abstracts, so you can revisit them at a later stage. The selection in My Abstracts does not change when the search conditions are altered.
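
The way the Evidence table is filled can be pictured as a ranking step followed by an optional keyword filter. The Python below is a minimal sketch under assumptions: the scores are taken as given (the trained Random Forest model itself is not shown), and the function name `top_abstracts` and the dictionary keys `citation`, `score` and `year` are hypothetical.

```python
def top_abstracts(scored, limit=100, keyword=None):
    """Return the highest-scoring abstracts, optionally filtered by keyword.

    scored:  list of dicts with 'citation', 'score' and 'year' keys (assumed).
    limit:   number of abstracts shown (100 on the site).
    keyword: text typed in the search field; matched case-insensitively
             against the citation, so author names and journal
             abbreviations match as well.
    """
    ranked = sorted(scored, key=lambda row: row["score"], reverse=True)[:limit]
    if keyword:
        needle = keyword.lower()
        ranked = [row for row in ranked if needle in row["citation"].lower()]
    return ranked

# Hypothetical classifier output for the model 'Clinical'.
example_rows = [
    {"citation": "Smith J. J Clin Invest.", "score": 0.42, "year": 2015},
    {"citation": "Jansen P. Gut.", "score": 0.97, "year": 2019},
    {"citation": "Lee K. J Clin Invest.", "score": 0.88, "year": 2011},
]
example_top = top_abstracts(example_rows, limit=2)
example_hits = top_abstracts(example_rows, keyword="clin invest")
```

In this sketch the keyword filter is applied after the top abstracts have been selected, which is an assumption about the order of the two steps.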

 

KMAP-Multi

This tab shows analysis results for the set of concepts that have been saved in the concept set. This tab has a number of sub-tabs. 

Concept set: The concepts that are currently in your set.

X-table: A table in which all concepts in your set are linked to each other. Clicking on a row in the table shows the abstracts in which both concepts co-occur. This can also be done in sentence mode, in which case only the sentences in which both concepts co-occur are shown.

Heatmap: A heatmap in which the dot size depicts the strength of the relation (Score) between two concepts and the color indicates the overlap (number of abstracts).

Network: Network view, in which all concepts in your set are linked (if there is a literature connection). If there is no connection between the concepts in your set, a warning message is shown.
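
The pairwise linking behind the X-table, heatmap and network can be sketched as follows. This is an illustration only: it assumes each concept is represented by the set of PubMed identifiers of the abstracts that mention it, and the function name `cross_table` is hypothetical.

```python
from itertools import combinations

def cross_table(concept_pmids):
    """List every pair of concepts that shares at least one abstract.

    concept_pmids: dict mapping a concept name to the set of PMIDs of the
                   abstracts in which it occurs (assumed representation).
    Returns (concept_a, concept_b, number_of_shared_abstracts) tuples.
    """
    rows = []
    for a, b in combinations(sorted(concept_pmids), 2):
        shared = concept_pmids[a] & concept_pmids[b]
        if shared:
            rows.append((a, b, len(shared)))
    return rows

# Hypothetical concept set with made-up PMIDs.
example_set = {
    "CXCR3": {101, 102, 103},
    "asthma": {102, 103, 104},
    "glucose": {900},
}
example_table = cross_table(example_set)
if not example_table:
    print("Warning: no literature connection between the concepts in the set")
```

An empty result corresponds to the situation in which the network view shows its warning message.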

 

Abstract Set

On this page, the abstracts that are currently in the set are shown. Using the radio buttons in the left panel, you can choose the output format that you need. The default output shows all the abstracts with citation information. If you need the identifiers, for example to import them into a reference manager program like EndNote, choose the PMIDS option. If you want to copy the abstracts directly into your document, select the Citations option.

You can use the download button to download the abstracts to an Excel file. This file contains all the information on authors, journal etc. An example is shown below.

 

Reference Download Image

 

Datasources
Human Genes
A vocabulary of human gene names, along with their symbols and synonyms. This collection was downloaded from the HGNC site. Since a lot of genes have symbols and synonyms that may also relate to non-gene words, a TenWise proprietary gene calling algorithm was used to discard false positive hits. The current vocabulary has 96342 terms, covering 43139 concepts.

Drugs
This is a vocabulary of pharmaceutical research compounds. This vocabulary was constructed by TenWise using a proprietary algorithm to detect these compounds in biomedical text. The current set contains 16031 concepts.

Metabolites
This is a vocabulary constructed by TenWise by merging metabolite names from several repositories (such as KEGG, ChEBI and Reactome) and text books on biochemical pathways. This vocabulary is still under construction, and new entities are added on a monthly basis.

Pathways
A vocabulary of biological pathways. This vocabulary was constructed by TenWise using a proprietary method to detect pathway names, signalling routes and metabolic pathways in biomedical text. In addition, terms from the Gene Ontology Biological Process subset were curated for text mining and added to this set. The current set contains 2533 concepts.

Phenotypes
A collection of terms describing human phenotypes. As such, this collection has some overlap with the Skin Disease and Human Disease vocabularies. This collection was downloaded from the HPO GitHub repository and the terms were optimized for text mining. The current vocabulary has a total of 27961 terms covering 12956 concepts.

Human Disease
A collection of terms describing human diseases. This collection was downloaded from the BioOntology Portal and the terms were optimized for text mining. The current vocabulary has a total of 25483 terms covering 9189 concepts.

Skin Disease
A collection of terms describing dermatological disorders and diseases. This collection was downloaded from the BioOntology Portal and the terms were optimized for text mining. The current vocabulary has a total of 5512 terms covering 3427 concepts.

Organisms
This vocabulary is based on the taxonomical names obtained from the NCBI FTP site. From this set, only the names for Archaea, Bacteria and Eukaryota were used for text mining. In all cases, the full name (e.g. Escherichia coli) and the abbreviated form (E. coli) were used for mining. The current vocabulary contains 418443 concepts.
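
Deriving the abbreviated form from a full name can be sketched in Python as below. This is an illustrative assumption about how such a form could be produced, not the procedure TenWise uses; the function name `abbreviate` is hypothetical.

```python
def abbreviate(full_name):
    """Turn 'Escherichia coli' into 'E. coli'.

    Returns None for names that consist of a single word, since there is
    no species part left to keep after shortening the genus.
    """
    parts = full_name.split()
    if len(parts) < 2:
        return None
    genus, rest = parts[0], parts[1:]
    # Keep the first letter of the genus and everything after it unchanged.
    return f"{genus[0]}. {' '.join(rest)}"

example_forms = [abbreviate(name) for name in
                 ("Escherichia coli", "Lactobacillus plantarum", "Bacteria")]
```

Abbreviated forms are ambiguous across genera that share a first letter, so a production system would need additional context to resolve them; that step is outside the scope of this sketch.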

Bacterial genes
A set of bacterial gene names that was obtained by creating a non-redundant list of bacterial gene names from public GenBank repositories for bacterial genomes. Since many gene names are shared between multiple bacteria across a wide range of species, these names are provided as general names and not mapped back to any particular organism.
How is the score calculated?
The score represents the uniqueness of the relation. If two concepts are very often described in literature, e.g. the organism Escherichia coli and the molecule glucose, then the fact that they occur together in many abstracts is not so surprising. On the other hand, if a very rare disease is described in only 5 abstracts, and always together with a gene that also occurs in a limited number of abstracts, the combination of these two is highly specific and points to a biological relation.
In other words, the score is calculated on the basis of the number of abstracts in which both concepts co-occur, corrected for the number of abstracts in which each concept occurs alone, i.e. not with the other. This is also graphically explained in the figure below.

For a thorough description of how to score co-occurrences, we suggest reading the introductory chapter of the PhD thesis of Stefan Evert, The Statistics of Word Cooccurrences.
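
The exact formula behind the score is not given on this page. To make the idea concrete, the Python sketch below uses pointwise mutual information, one of the standard association measures for co-occurrence data, as a stand-in. The choice of measure, the function name `pmi_score` and all counts are assumptions for illustration.

```python
import math

def pmi_score(n_both, n_a, n_b, n_total):
    """Pointwise mutual information (in bits) for two concepts.

    n_both:  abstracts that mention both concepts
    n_a:     abstracts that mention concept A (with or without B)
    n_b:     abstracts that mention concept B (with or without A)
    n_total: abstracts in the whole corpus
    """
    if n_both == 0:
        return float("-inf")
    # Number of co-occurrences expected if A and B were unrelated.
    expected = n_a * n_b / n_total
    return math.log2(n_both / expected)

# Two frequent concepts that co-occur about as often as chance predicts.
common_pair = pmi_score(n_both=2000, n_a=300_000, n_b=200_000,
                        n_total=30_000_000)

# A rare disease seen in 5 abstracts, always together with a rare gene.
rare_pair = pmi_score(n_both=5, n_a=5, n_b=40, n_total=30_000_000)
```

With these made-up counts the frequent pair scores close to zero even though it shares 2000 abstracts, while the rare pair scores far higher on only 5 shared abstracts, which mirrors the reasoning above.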


Why don't you do text mining on full text articles?
Although this is in principle possible, we have decided not to implement it yet in this web server. There are two reasons for this.
  • We want to provide a direct link from our text mining results to the underlying source. If we ran our text mining on full text papers, this would become problematic, because not all users have access to the same set of full text articles due to restrictions in subscriptions. This could be solved by running only on the subset of articles that are freely available as full text, but compared to the entire MEDLINE database this is a small set, and we might miss important relations.
  • Mining on full text yields more relations, but it also retrieves relations that are less informative. For example, relations retrieved from the introduction are often a repetition of known knowledge, whereas relations from the discussion section may represent speculations that are not always backed up by experimental evidence.
    This does not mean that full text papers should not be used at all for text mining, but we believe that for this particular web application, abstracts provide a good trade-off between sensitivity (retrieving known facts) and specificity (discarding false positives).

Can you discriminate between facts from journals with a high and low impact factor?
Although we know the originating source of the co-occurrences, we do not use this in the calculation of the score. This would require some kind of scoring model, in which a consensus needs to be reached on what is considered a high or low impact journal. (Note that the impact factor is only part of that assessment, and that it is also critically dependent on the scientific field.) Also, the impact of a journal may change over time, meaning that for scoring we would also need to take into account the exact time point at which the fact was published in a particular journal. Lastly, we would need to decide whether to place greater or smaller trust in facts published in high impact journals. All in all, this would make the score non-transparent and therefore less useful.

More importantly, we believe that:

  • Such weighting schemes will not significantly alter the biological relations that are observed in literature.
  • Only biological experts can make the final call on whether a biological relation that is retrieved from literature is most useful for their particular research question, data analysis or hypothesis generation.

Please login to get access to all vocabularies and all predictive models.
If you do not have login information, contact us for a trial account.