Biostat 823 - Semantic Similarity
Hilmar Lapp
Duke University, Department of Biostatistics & Bioinformatics
2024-10-08
Semantic similarity
- Semantic similarity is defined as a quantitative metric between “documents”, or concepts (terms) ordered in a hierarchy (such as an ontology).
- For concepts in a hierarchy, we can consider their textual definitions as the “documents”.
- For text documents, we can represent these as concepts that occur in them, and then evaluate semantic similarity as between the two sets of concepts.
- More generally, we can consider any entity linked to a set of concepts as “documents” for evaluating their semantic similarity in this way.
- Semantic similarity metrics use a Directed Acyclic Graph (DAG) representation of the concept hierarchy (such as an ontology).
Concept DAG
For evaluating semantic similarity, we use only edges representing concept inclusion (e.g., subClassOf).
Concept DAG
To evaluate the semantic similarity between two concepts, say M and F, we consider their induced subgraphs of subsumers.
Induced subgraphs of subsumers
The induced subgraph for M are M and all ancestors (subsumers) of M, following the edges that represent concept inclusion.
Induced subgraphs of subsumers
The induced subgraph for F is defined analogous.
Common and disjoint ancestors
The intersection of the induced subgraphs is the common ancestors (subsumers). Subtracting the intersection from the union yields the disjoint ancestors.
Common and disjoint ancestors
I and D are examples of what is sometimes referred to as “disjoint common subsumers”, i.e., the common subsumers that do not subsume another common subsumer.
Pairwise semantic similarity
- Pairwise metrics are between a pair of concepts \(C_1\) and \(C_2\)
- There are two major categories of pairwise metrics:
- Metrics that use only topology of the induced subgraphs
- Sometimes referred to as Edge-based (because they are based on edge paths)
- Can be computed from the concept graph alone (which is fast)
- Metrics that use information content (IC).
- Sometimes referred to as Node-based (because they use a property of nodes)
- Require a corpus against which to determine concept probabilities (which can be slow for a large and complex ontology and a large corpus, and is specific to the choice of corpus)
Subgraph topology-based metrics
Jaccard Index \(SimJ(C_1,C_2)=\frac{|S(C_1)\cap S(C_2)|}{|S(C_1)\cup S(C_2)|}\)
where S(C) is the set of subsumers (induced subgraph) of concept C.
For the example graph, SimJ(M,F) = 3/8.
Note that Jaccard index does incorporate depth through the numerator. Intuitively, two concepts deeper in the concept hierarchy should be semantically more similar than two concepts with equal path distance close to the root.
Subgraph topology-based metrics
Binary vector representation of subsumer subgraphs: \(\mathbf{S}_C: s_i=\begin{cases}1\;\mathrm{if}\ C_i\in S(C)\\ 0\; \mathrm{otherwise}\end{cases}\)
allows using vector algebra for metric calculation:
\(SimJ(C_1,C_2)=\frac{\mathbf{S}_{C_1}\cdot\mathbf{S}_{C_2}}{||\mathbf{S}_{C_1}||^2+||\mathbf{S}_{C_2}||^2-\mathbf{S}_{C_1}\cdot\mathbf{S}_{C_2}}\)
\(1-SimJ\) is a proper distance metric (satisfying the triangle inequality).
Subgraph topology-based metrics
Cosine similarity: \(SimCos(C_1,C_2)=\frac{\mathbf{S}_{C_1}\cdot\mathbf{S}_{C_2}}{||\mathbf{S}_{C_1}||\ ||\mathbf{S}_{C_2}||}\)
1-SimCos is not a proper distance metric (violates the triangle inequality). Need to convert to normalized angle: \(D_\theta(C_1,C_2) = \frac{\arccos(SimCos(C_1,C_2))}{\pi}\)
IC-based metrics
- Information content (a.k.a. Shannon information) was first defined in information theory by C. Shannon:
- Relates information of an event to the probability of the event occurring.
- Formalizes the notion that the less probable an event is, the bigger the surprise, and thus the more information it yields. An event with 100% probability to occur yields zero information.
- Defined as \(IC(x) =-log(P(x))\) where P(x) is the probability of event \(x\) occurring.
- For semantic similarity, the probability of encountering a concept C is usually estimated by its frequency f (number of occurrences) in an appropriate corpus.
- For example, for the human genome as a corpus, the frequency with which a gene function concept is annotated to some gene.
- P(C) = f(C)/N where N is the size of the corpus, and 0 ≤ f(C) ≤ N.
- Frequencies can differ between corpora (e.g., from one genome to another). Hence, choice of corpus needs to be well considered.
IC-based metrics
In a concept hierarchy, the frequency of a concept C includes the direct (without subsumption) frequencies of all concepts it subsumes: \(f(C) = \sum_{C_i\in S(C)} f_D(C_i)\)
Therefore, \(C_i\sqsubseteq C_j \Rightarrow IC(C_i)\ge IC(C_j)\)
\(0 \le IC(C) \lt\infty\)
Can be normalized using corpus size N: \[0 \le IC(C)/maxIC\le 1;\\ maxIC=-\log(1/N)\]
IC-based metrics: Resnik
Resnik P (1995) Using information content to evaluate semantic similarity in a taxonomy. Proc. of the 14th International Joint Conference on Artificial Intelligence. pp. 448–453.
Resnik similarity: \(SimRes(C_1,C_2)=IC(MICA(C_1,C_2))\)
where MICA is the most informative common ancestor: \(MICA(C_1,C_2) =\\ arg\,max_{i,\ C_i\in S(C_1)\cap S(C_2)} IC(C_i)\)
MICA is the same as the max IC of the disjoint common subsumers. Related metrics use the average instead.
IC values as weights
- Using the binary vector of subsumers representation, instead of 1 for non-zero values we can also use weights: \(\mathbf{S}_C: s_i=\begin{cases}w_i, 0<w_i<\infty\;\mathrm{if}\ C_i\in S(C)\\ 0\; \mathrm{otherwise}\end{cases}\)
- Use Jaccard (which then becomes Tanimoto similarity) or Cosine similarity using such weighted vectors
- IC values provide meaningful weights
- Can also train weights using ML
Profile semantic similarity
- In a typical application, need to evaluate the semantic similarity between sets of concepts, called profiles:
- Sets of gene function GO terms annotated to two genes, or proteins.
- Sets of phenotype or disease ontology terms annotated to two gene variants. Etc
- Profile semantic similarity uses a pairwise metric. Two major categories depending on how the pairwise metric is applied.
- Groupwise (a.k.a. graph-based): the induced subgraphs of the concepts in a profile are combined, then the pairwise metric is applied to the combined subgraphs.
- Groupwise metrics are always symmetric.
- Pairwise: The pairwise metric is applied to all combinations of pairs, then the pairwise scores are aggregated using some function.
- Symmetric if the aggregation function is commutative, and asymmetric otherwise.
Pairwise profile similarity
- Often used in conjunction with IC for pairwise similarity scoring
- Symmetric functions for aggregating scores (a.k.a. “all pairs”):
- Max; if used with IC, known sometimes as MaxIC profile metric.
- Average
- Asymmetric aggregation:
- Best pairs: \[SimBP(G_1,G_2)=\frac{1}{|G_1|}\sum_{C_i\in G_1}\max_{Cj\in G_2}(simP(C_i,C_j))\] with SimP being the pairwise metric (such as IC or Jaccard)
- Natural when querying a query profile against a database of profiles
- Can be made symmetric by averaging SimBP(A,B) and SimBP(B,A).
Further reading
- Pesquita C, Faria D, Falcão AO, Lord P, Couto FM (2009) Semantic Similarity in Biomedical Ontologies. PLoS Comput Biol 5(7): e1000443.
- Landmark overview and categorization of semantic similarity metrics used in biostatistics & biomedical informatics.
- Lord P, Stevens R, Brass A, Goble C (2003) Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics 19: 1275–1283.
- Landmark paper: the first major application of semantic similarity for a bio-ontology. Uses the Resnik metric (IC).
- Manda, P., Vision, T. (2018). An analysis and comparison of the statistical sensitivity of semantic similarity metrics. Proc. of the 9th Intern. Conf. on Biol. Ontology (ICBO 2018).
- Comparison of the statistical susceptibility to noise for various groupwise and pairwise semantic similarity metrics.
- Kulmanov M, Smaili FZ, Gao X, Hoehndorf R. (2021) Semantic similarity and machine learning with ontologies. Brief Bioinform. 22(4):bbaa199.