Computing the semantic similarity between two phenotype profiles using hypergeometric methods


The datasets used in this analysis are taken from the supplementary data of Washington et al. The following five genes were each curated by three independent curators:

  1. EYA1

  2. PAX2

  3. SOX10

  4. SOX9

  5. TTN

In total, we have 15 datasets, three per gene.

As an initial step, each gene dataset is expanded: the ancestors of every term (up to the root) are computed, and the union of these ancestors is added to the dataset. The expanded dataset is called Ancestor_Dataset and represents the subtree spanned by the original dataset.

\[Ancestor\_Dataset = Gene\_Dataset \cup \bigcup_{i=1}^{T} Ancestors(term_i)\]

where T = total number of terms in Gene_Dataset
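The expansion step can be sketched in Python as follows. This is only an illustration: the `parents` mapping (term to its direct parents), the function names, and the toy ontology are assumptions made for the example, not part of the original datasets.

```python
from typing import Dict, Set


def ancestors(term: str, parents: Dict[str, Set[str]]) -> Set[str]:
    """Return every ancestor of `term` up to the root (the term itself is excluded)."""
    seen: Set[str] = set()
    stack = list(parents.get(term, ()))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            # Keep climbing towards the root.
            stack.extend(parents.get(current, ()))
    return seen


def ancestor_dataset(gene_dataset: Set[str], parents: Dict[str, Set[str]]) -> Set[str]:
    """Gene_Dataset united with the ancestors of each of its terms."""
    expanded = set(gene_dataset)
    for term in gene_dataset:
        expanded |= ancestors(term, parents)
    return expanded


# Hypothetical toy ontology: B and C are children of A; A and D are children of root.
toy_parents = {"root": set(), "A": {"root"}, "B": {"A"}, "C": {"A"}, "D": {"root"}}
expanded_example = ancestor_dataset({"B", "D"}, toy_parents)
```

With this toy ontology, the dataset containing B and D expands to B, D, A and root; C is not included because it is not an ancestor of either term.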

Calculating hypergeometric probability

The following parameters are used to calculate the hypergeometric probability:

  1. X: The total number of datasets. In our experiment, X = 15.

  2. Count_Dataset_1

  3. Count_Dataset_2

  4. Count_MasterDataset

  5. Count_SharedDataset_1,2

The Master Dataset is the union of all X gene datasets. Each gene dataset is assumed to have already been expanded to include all ancestors of every EQ term it contains.

\[Master\_Dataset = \bigcup_{i=1}^{X} Gene\_Dataset_i\]
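As a minimal sketch, assuming each gene dataset is held as a Python set of term identifiers (the function name and the example terms are hypothetical), the union can be built like this:

```python
from typing import Iterable, Set


def master_dataset(gene_datasets: Iterable[Set[str]]) -> Set[str]:
    """Union of all (ancestor-expanded) gene datasets."""
    master: Set[str] = set()
    for dataset in gene_datasets:
        master |= dataset
    return master


# Hypothetical example with three small datasets instead of the full X = 15.
toy_datasets = [{"root", "A", "B"}, {"root", "A", "C"}, {"root", "D"}]
toy_master = master_dataset(toy_datasets)
```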

The sum of the annotation counts for the subtree represented by a dataset, Count_Dataset, is calculated as follows:

\[Count\_Dataset = \sum\limits_{i=1}^{TotalTerms} \sum\limits_{j=1}^{X} Count(Term_{i,j})\]
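The double sum can be illustrated with a short Python sketch. It rests on an assumption that the text does not state explicitly: Count(Term_{i,j}) is read here as the number of annotations of term i in gene dataset j. The function name and the sample counts are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def count_dataset(terms: Iterable[str], annotation_counts: List[Counter]) -> int:
    """Sum, over every term in the subtree and every gene dataset, the term's annotation count.

    `annotation_counts[j][term]` is assumed to hold Count(Term_{i,j});
    a Counter returns 0 for terms that a dataset does not contain.
    """
    return sum(counts[term] for term in terms for counts in annotation_counts)


# Hypothetical counts for two gene datasets.
toy_counts = [Counter({"A": 2, "B": 1}), Counter({"A": 1, "C": 3})]
toy_total = count_dataset({"A", "B"}, toy_counts)  # (2 + 1) for A, (1 + 0) for B
```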

Given Count_Dataset_1, Count_Dataset_2, Count_SharedDataset_1,2 and Count_MasterDataset, the hypergeometric probability of Count_SharedDataset_1,2 is defined as:

\[A = \binom{Count_{Dataset_1}}{Count_{SharedDataset_{1,2}}} \cdot \binom{Count_{MasterDataset} - Count_{Dataset_1}}{Count_{Dataset_2} - Count_{SharedDataset_{1,2}}} \bigg/ \binom{Count_{MasterDataset}}{Count_{Dataset_2}}\]

The lower the value of A, the less likely it is that the overlap between the two profiles arose by chance, and hence the more significant the shared terms are.
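The probability A can be evaluated directly with binomial coefficients from the Python standard library. The sketch below is illustrative: the function name, argument names, and the numbers in the example are assumptions, not values from the datasets.

```python
from math import comb


def hypergeometric_probability(count_dataset_1: int, count_dataset_2: int,
                               count_shared: int, count_master: int) -> float:
    """Probability of observing exactly `count_shared` shared annotations.

    Implements C(D1, S) * C(M - D1, D2 - S) / C(M, D2), where D1 and D2 are the
    counts of the two datasets, S the shared count, and M the master count.
    """
    if not 0 <= count_shared <= min(count_dataset_1, count_dataset_2):
        raise ValueError("shared count must lie between 0 and the smaller dataset count")
    if max(count_dataset_1, count_dataset_2) > count_master:
        raise ValueError("dataset counts cannot exceed the master count")
    numerator = (comb(count_dataset_1, count_shared)
                 * comb(count_master - count_dataset_1, count_dataset_2 - count_shared))
    return numerator / comb(count_master, count_dataset_2)


# Hypothetical counts: D1 = 5, D2 = 4, S = 2, M = 20.
toy_probability = hypergeometric_probability(5, 4, 2, 20)  # 10 * 105 / 4845
```

Summing the function over every possible shared count (0 to 4 in this example) gives 1, which is a quick sanity check that the three binomial terms are wired up consistently.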