Biostat 823 - Semantic Similarity

Hilmar Lapp

Duke University, Department of Biostatistics & Bioinformatics

2024-10-08

Semantic similarity

Semantic similarity is defined as a quantitative metric between “documents”, or concepts (terms) ordered in a hierarchy (such as an ontology).
- For concepts in a hierarchy, we can consider their textual definitions as the “documents”.
- For text documents, we can represent these as concepts that occur in them, and then evaluate semantic similarity as between the two sets of concepts.
- More generally, we can consider any entity linked to a set of concepts as “documents” for evaluating their semantic similarity in this way.
Semantic similarity metrics use a Directed Acyclic Graph (DAG) representation of the concept hierarchy (such as an ontology).

Concept DAG

For evaluating semantic similarity, we use only edges representing concept inclusion (e.g., subClassOf).

Concept DAG

To evaluate the semantic similarity between two concepts, say M and F, we consider their induced subgraphs of subsumers.

Induced subgraphs of subsumers

The induced subgraph for M are M and all ancestors (subsumers) of M, following the edges that represent concept inclusion.

Induced subgraphs of subsumers

The induced subgraph for F is defined analogous.

Common and disjoint ancestors

The intersection of the induced subgraphs is the common ancestors (subsumers). Subtracting the intersection from the union yields the disjoint ancestors.

Common and disjoint ancestors

I and D are examples of what is sometimes referred to as “disjoint common subsumers”, i.e., the common subsumers that do not subsume another common subsumer.

Pairwise semantic similarity

Pairwise metrics are between a pair of concepts \(C_1\) and \(C_2\)
There are two major categories of pairwise metrics:
- Metrics that use only topology of the induced subgraphs
  - Sometimes referred to as Edge-based (because they are based on edge paths)
  - Can be computed from the concept graph alone (which is fast)
- Metrics that use information content (IC).
  - Sometimes referred to as Node-based (because they use a property of nodes)
  - Require a corpus against which to determine concept probabilities (which can be slow for a large and complex ontology and a large corpus, and is specific to the choice of corpus)

Subgraph topology-based metrics

Jaccard Index \(SimJ(C_1,C_2)=\frac{|S(C_1)\cap S(C_2)|}{|S(C_1)\cup S(C_2)|}\)

where S(C) is the set of subsumers (induced subgraph) of concept C.

For the example graph, SimJ(M,F) = 3/8.
Note that Jaccard index does incorporate depth through the numerator. Intuitively, two concepts deeper in the concept hierarchy should be semantically more similar than two concepts with equal path distance close to the root.

Subgraph topology-based metrics

Binary vector representation of subsumer subgraphs: \(\mathbf{S}_C: s_i=\begin{cases}1\;\mathrm{if}\ C_i\in S(C)\\ 0\; \mathrm{otherwise}\end{cases}\)

allows using vector algebra for metric calculation:

\(SimJ(C_1,C_2)=\frac{\mathbf{S}_{C_1}\cdot\mathbf{S}_{C_2}}{||\mathbf{S}_{C_1}||^2+||\mathbf{S}_{C_2}||^2-\mathbf{S}_{C_1}\cdot\mathbf{S}_{C_2}}\)
\(1-SimJ\) is a proper distance metric (satisfying the triangle inequality).

Subgraph topology-based metrics

Cosine similarity: \(SimCos(C_1,C_2)=\frac{\mathbf{S}_{C_1}\cdot\mathbf{S}_{C_2}}{||\mathbf{S}_{C_1}||\ ||\mathbf{S}_{C_2}||}\)
1-SimCos is not a proper distance metric (violates the triangle inequality). Need to convert to normalized angle: \(D_\theta(C_1,C_2) = \frac{\arccos(SimCos(C_1,C_2))}{\pi}\)

IC-based metrics

Information content (a.k.a. Shannon information) was first defined in information theory by C. Shannon:
- Relates information of an event to the probability of the event occurring.
- Formalizes the notion that the less probable an event is, the bigger the surprise, and thus the more information it yields. An event with 100% probability to occur yields zero information.
- Defined as \(IC(x) =-log(P(x))\) where P(x) is the probability of event \(x\) occurring.
For semantic similarity, the probability of encountering a concept C is usually estimated by its frequency f (number of occurrences) in an appropriate corpus.
- For example, for the human genome as a corpus, the frequency with which a gene function concept is annotated to some gene.
- P(C) = f(C)/N where N is the size of the corpus, and 0 ≤ f(C) ≤ N.
- Frequencies can differ between corpora (e.g., from one genome to another). Hence, choice of corpus needs to be well considered.

IC-based metrics

In a concept hierarchy, the frequency of a concept C includes the direct (without subsumption) frequencies of all concepts it subsumes: \(f(C) = \sum_{C_i\in S(C)} f_D(C_i)\)
Therefore, \(C_i\sqsubseteq C_j \Rightarrow IC(C_i)\ge IC(C_j)\)
\(0 \le IC(C) \lt\infty\)

Can be normalized using corpus size N: \[0 \le IC(C)/maxIC\le 1;\\ maxIC=-\log(1/N)\]

IC-based metrics: Resnik

Resnik P (1995) Using information content to evaluate semantic similarity in a taxonomy. Proc. of the 14th International Joint Conference on Artificial Intelligence. pp. 448–453.
Resnik similarity: \(SimRes(C_1,C_2)=IC(MICA(C_1,C_2))\)

where MICA is the most informative common ancestor: \(MICA(C_1,C_2) =\\ arg\,max_{i,\ C_i\in S(C_1)\cap S(C_2)} IC(C_i)\)
MICA is the same as the max IC of the disjoint common subsumers. Related metrics use the average instead.

IC values as weights

Using the binary vector of subsumers representation, instead of 1 for non-zero values we can also use weights: \(\mathbf{S}_C: s_i=\begin{cases}w_i, 0<w_i<\infty\;\mathrm{if}\ C_i\in S(C)\\ 0\; \mathrm{otherwise}\end{cases}\)
- Use Jaccard (which then becomes Tanimoto similarity) or Cosine similarity using such weighted vectors
IC values provide meaningful weights
Can also train weights using ML

Profile semantic similarity

In a typical application, need to evaluate the semantic similarity between sets of concepts, called profiles:
- Sets of gene function GO terms annotated to two genes, or proteins.
- Sets of phenotype or disease ontology terms annotated to two gene variants. Etc
Profile semantic similarity uses a pairwise metric. Two major categories depending on how the pairwise metric is applied.
Groupwise (a.k.a. graph-based): the induced subgraphs of the concepts in a profile are combined, then the pairwise metric is applied to the combined subgraphs.
- Groupwise metrics are always symmetric.
Pairwise: The pairwise metric is applied to all combinations of pairs, then the pairwise scores are aggregated using some function.
- Symmetric if the aggregation function is commutative, and asymmetric otherwise.

Pairwise profile similarity

Often used in conjunction with IC for pairwise similarity scoring
Symmetric functions for aggregating scores (a.k.a. “all pairs”):
- Max; if used with IC, known sometimes as MaxIC profile metric.
- Average
Asymmetric aggregation:
- Best pairs: \[SimBP(G_1,G_2)=\frac{1}{|G_1|}\sum_{C_i\in G_1}\max_{Cj\in G_2}(simP(C_i,C_j))\] with SimP being the pairwise metric (such as IC or Jaccard)
- Natural when querying a query profile against a database of profiles
- Can be made symmetric by averaging SimBP(A,B) and SimBP(B,A).

Biostat 823 - Semantic Similarity

Semantic similarity

Concept DAG

Concept DAG

Induced subgraphs of subsumers

Induced subgraphs of subsumers

Common and disjoint ancestors

Common and disjoint ancestors

Pairwise semantic similarity

Subgraph topology-based metrics

Subgraph topology-based metrics

Subgraph topology-based metrics

IC-based metrics

IC-based metrics

IC-based metrics: Resnik

IC values as weights

Profile semantic similarity

Pairwise profile similarity

Further reading