Knowledgebase exploration in DES-RV

Introduction

Examples were extracted from the DES-rv KB; minor differences might exist for the other KBs built on DES.

Dragon Exploration System Knowledgebase dictionaries

Central to the knowledgebase is the notion of a “term” or “concept”. A term is a biological name or phrase (e.g. Crohn’s disease, autophagy, T300A) used to mine the literature, as illustrated in Figure 2. Terms are organized into thematic dictionaries, as shown in Figure 3; more detailed information about dictionaries can be found on the page "Dictionaries". A term can be enriched with other data. Information sources such as scientific papers and bioinformatics databases are analyzed, and relevant terms are extracted and recorded in the knowledgebase. In addition, connections between terms (called “Enriched Pairs” in the knowledgebase) are inferred from their co-occurrence. For example, the terms “Crohn’s disease” and “T300A” co-occur if they are mentioned in close proximity in the same paper. The knowledgebase system can also infer hypotheses, which can serve as a starting point for further investigation. As a knowledgebase user, you will work with terms: you can use them as keywords to search the literature, list them, rank them by relevance, find hidden relationships between them, or ask the system to create hypotheses.

Knowledgebase Exploration

DES-rv is a web-based application. To start exploring, open the home page in your browser. This takes you to the knowledgebase opening screen, shown in Figure 4 for DES-rv. To navigate through the knowledgebase, use the main menu shown in Figure 5.

Knowledgebase opening screen

Knowledgebase navigation and exploration modes

You can navigate the knowledgebase using the main menu shown in Figure 5. Clicking an option highlights it and opens the corresponding exploration mode. The knowledgebase exploration modes are:

Navigation menu

Enriched Concepts exploration mode

In this mode, you can explore the terms occurring in the knowledgebase, as shown in Figure 6. The terms are displayed in a table that can be sorted by any column by clicking on the column heading. Users can also filter the list by any column by typing a keyword, or part of a keyword, in the boxes above the table heading.

Enriched terms browser

The following columns are available:

Right-clicking a term’s name opens a visualization menu associated with the concept, as shown in Figure 6. The menu has two options:

Here we explain the options in more detail.

Network option

This option opens a visualization window, shown in Figure 7, that displays an interconnection graph for the top 50 terms. Each term on the graph is presented as a network node linked to another node by a line labeled with the number of documents supporting the link. The graph can be explored in multiple ways, which we explain later.

Terms interconnections

Term Co-occurrence

Displays associated terms categorized by dictionary, as shown in Figure 8. This is a useful and informative feature: the user can see the top 50 terms associated with the selected one (after the p-value cut-off), organized by dictionary. The lists can be sorted by frequency (raw co-occurrence counts), p-value, or PMI (Point-wise Mutual Information). These enriched associations are cut off at 0.05 after Bonferroni correction for multiple testing. The user can further restrict this threshold using the P_VALUE MAX input.

Terms co-occurrence table

Enriched Term Pairs exploration mode

This exploration mode allows users to explore terms linked across multiple documents. In its simplest form, a link is the co-occurrence of two terms in the same document. However, the information provided is much richer and is displayed as a table, as shown in Figure 9.

Enriched term pairs display
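As a rough illustration of the simplest case described above, the sketch below counts, for each pair of terms, how many documents mention both. The document IDs, the term annotations and the function name `count_cooccurrences` are hypothetical. The sketch works at the document level only and ignores proximity within the text.

```python
from collections import Counter
from itertools import combinations

def count_cooccurrences(docs):
    """Count per-term and per-pair document frequencies.

    docs maps a (hypothetical) document ID to the set of terms found in it.
    Returns (term_counts, pair_counts); pairs are stored as sorted tuples.
    """
    term_counts = Counter()
    pair_counts = Counter()
    for terms in docs.values():
        unique = sorted(set(terms))
        term_counts.update(unique)
        # Each unordered pair of distinct terms in one document counts once.
        pair_counts.update(combinations(unique, 2))
    return term_counts, pair_counts

# Toy corpus; the IDs and annotations are made up for illustration.
docs = {
    "doc1": {"Crohn's disease", "T300A", "autophagy"},
    "doc2": {"Crohn's disease", "T300A"},
    "doc3": {"autophagy"},
}
term_counts, pair_counts = count_cooccurrences(docs)
print(term_counts["Crohn's disease"],                  # count for term A
      term_counts["T300A"],                            # count for term B
      pair_counts[("Crohn's disease", "T300A")])       # joint count
```

The three printed numbers play the role of per-term and joint document counts for one pair of terms.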

The table displays term pairs marked as A and B. Information about term association is organized in the following columns:

The table can be ordered by any of its columns, and cut-offs can be applied to any of its numerical columns at the table header or footer. Terms can be searched from the Filter input in the header, and individual dictionaries can be searched using the text inputs at the table header or footer. Associations are first captured through raw co-occurrence counts, and then two more accurate measures are used to rank the strength of these associations. The Point-wise Mutual Information measure is a well-known metric for quantifying association in information theory and statistics:

Point-wise Mutual Information measure

In the above, p(x) can be thought of as the frequency of the first term (A Count column), p(y) as the frequency of the second term (B Count column), and p(x,y) as the co-occurrence frequency (AB Count column). The values range from -1 (weakest associations) to +1 (strongest).
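The description above can be made concrete with a small sketch. It assumes the standard definition PMI(x, y) = log2(p(x, y) / (p(x) p(y))), with probabilities estimated as counts divided by the total number of articles. It also computes the normalized variant, NPMI = PMI / (-log2 p(x, y)), which is the form bounded by -1 and +1. Whether DES uses exactly this normalization is an assumption, as are the log base, the function names and the parameter names; none of them are taken from the DES implementation.

```python
import math

def pmi(a_count, b_count, ab_count, n_docs):
    """Point-wise mutual information (log base 2) from document counts."""
    if ab_count <= 0:
        raise ValueError("PMI is undefined for pairs that never co-occur")
    p_x = a_count / n_docs     # frequency of the first term
    p_y = b_count / n_docs     # frequency of the second term
    p_xy = ab_count / n_docs   # co-occurrence frequency
    return math.log2(p_xy / (p_x * p_y))

def npmi(a_count, b_count, ab_count, n_docs):
    """PMI divided by -log2 p(x, y), which bounds the value to [-1, +1]."""
    p_xy = ab_count / n_docs
    if p_xy == 1.0:
        return 1.0  # both terms occur in every document (limit value)
    return pmi(a_count, b_count, ab_count, n_docs) / -math.log2(p_xy)

# Hypothetical counts: term A in 100 articles, term B in 50, both in 25,
# out of 10,000 articles in total.
print(pmi(100, 50, 25, 10_000), npmi(100, 50, 25, 10_000))
```

With this normalization, two terms that always appear together score +1 and two independent terms score 0.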

The associations are also ranked by the probability P((X,Y) >= AB Count | X = A Count, Y = B Count), where A and B are drawn from the fixed number of articles in the knowledgebase. This is the hypergeometric p-value for enrichment of the less frequent concept against the more frequent one. From our observations, this seems to be a robust measure for quantifying relevant associations. The table of associations is ordered by this p-value by default. The p-value is corrected for multiple testing using the Bonferroni method, and a default cut-off is applied at 0.05. Values therefore range from 0 to 0.05.
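The tail probability described above can be sketched with the Python standard library alone. The sketch assumes that the population is the set of articles in the knowledgebase, that the articles mentioning the more frequent term are the "successes", and that the articles mentioning the less frequent term are the "draws". The Bonferroni step multiplies the p-value by the number of tested pairs and caps it at 1. The function names, the parameter `n_tests` and the example counts are hypothetical, and DES may compute these values differently.

```python
from math import comb

def hypergeom_tail(n_docs, a_count, b_count, ab_count):
    """P(overlap >= ab_count) if the two terms were spread independently.

    n_docs   : total number of articles (population size)
    a_count  : articles mentioning term A
    b_count  : articles mentioning term B
    ab_count : articles mentioning both terms
    """
    successes = max(a_count, b_count)  # the more frequent term
    draws = min(a_count, b_count)      # the less frequent term
    total = comb(n_docs, draws)
    tail = sum(
        comb(successes, k) * comb(n_docs - successes, draws - k)
        for k in range(ab_count, draws + 1)
    )
    return tail / total

def bonferroni(p_value, n_tests):
    """Bonferroni-adjusted p-value, capped at 1."""
    return min(1.0, p_value * n_tests)

# Hypothetical example: 1,000 articles, A in 50, B in 20, both in 10,
# with 5,000 term pairs tested in total.
p = hypergeom_tail(1000, 50, 20, 10)
print(p, bonferroni(p, 5000), bonferroni(p, 5000) <= 0.05)
```

A pair would be kept in the table only if its adjusted value is at or below the cut-off (0.05 by default).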

In the example shown in Figure 9, two dictionaries are selected: “ChEBI” and “Biological Process (GO)”. In addition, either of these two dictionaries can be further filtered by filling in the search boxes underneath.

Associations mining example

The associations within this table can be visualized as a concept-centric network by right-clicking on a term and choosing the 'Network' option from the context menu. Figure 11 shows the resulting network. The menu at the top (the dark green band) lets you control and customize the view. In addition, right-clicking any of the nodes displays a context menu (the light gray menu in the figure) with the following options:

Visualizing connections between terms

The main visualization menu (at the top) consists of a number of dropdowns, as shown in Figure 12.

Visualization menus

Links filter: allows filtering links between terms according to the following criteria:

Dictionaries: allows filtering particular dictionaries in or out of the network visualization.

Select Layout: provides a number of network layouts that the user can choose from:

Export the network: exports the network graph to a portable image format (.png).

Reset the network: right-click on the network area to return to the original node and its associations.

Semantic Similarity

Semantic similarity is a metric which establishes the likeness or closeness of two concepts in terms of their meaning. Semantic similarity can be the result of semantic relatedness, such as synonymy, antonymy, hypernymy, etc. For example, tall and short are semantically similar even though they are antonyms because they both share the semantic dimension of ‘height’.

Semantic similarity within DES is calculated as the cosine distance between two concept embeddings (vector representations in a latent semantic space). These embeddings are obtained using a skip-gram Word2Vec model trained on the DES-RedoxVasc literature corpus with normalized concept annotation. Therefore, the underlying assumption for semantic similarity in DES is concept co-occurrence, but not necessarily direct co-occurrence.

Within the DES interface, semantic similarity is used to sort the KB concepts (the table on the right) by their similarity to a chosen concept (selectable from the table on the left), as shown in Figure 13. Note that the top hits for a chosen concept are potential association candidates, which may or may not co-occur directly with it in the text.

Semantic Similarity
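The ranking described above can be sketched as follows. The sketch assumes each concept already has an embedding vector. The three-dimensional toy vectors and the concept names are made up (real Word2Vec embeddings have many more dimensions), and the function names are hypothetical. Candidates are sorted by cosine similarity to the chosen concept. Cosine distance is one minus this similarity, so sorting by descending similarity is the same as sorting by ascending distance.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_by_similarity(query, embeddings):
    """Return (concept, similarity) pairs, most similar to `query` first."""
    q = embeddings[query]
    scores = [
        (name, cosine_similarity(q, vec))
        for name, vec in embeddings.items()
        if name != query  # do not rank the chosen concept against itself
    ]
    return sorted(scores, key=lambda item: item[1], reverse=True)

# Toy embeddings, invented for illustration only.
embeddings = {
    "concept_a": [1.0, 0.0, 0.0],
    "concept_b": [0.9, 0.1, 0.0],
    "concept_c": [0.0, 1.0, 0.0],
    "concept_d": [-1.0, 0.0, 0.0],
}
ranking = rank_by_similarity("concept_a", embeddings)
for name, score in ranking:
    print(name, round(score, 3))
```

Because the score depends only on the vectors, two concepts can rank as highly similar without ever appearing in the same document.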

Literature exploration mode

This page allows literature searches by selected terms or dictionaries. Figure 14 shows such a search, where the Mutations (tmVar) dictionary and Cancer are selected as search keywords. Search keywords are displayed in the line below the search box and can be removed by clicking the “x” button. A new keyword can be added by typing it in the input box; the autocomplete feature lists all relevant terms to choose from. The result of the search is a list of PubMed and PubMed Central articles relevant to your search. All terms of interest are highlighted in the text. In addition, right-clicking a term brings up the visualization submenu, extended with an extra option, “Add to Key Terms”, that adds the clicked term to the search criteria.

Literature search example

Literature search

Conclusion

DES KBs are free for academic and nonprofit users. Users can access the knowledgebase with any mainstream web browser, including Firefox, Safari and Chrome. As far as we know, the only feature with cross-browser compatibility issues is the network export option, which is available only in Chrome.