A new gene ontology-based measure for the functional similarity of gene products

QI, Guo-long; QIAN, Shi-yu; FANG, Ji-qian

doi: 10.3760/cma.j.issn.0366-6999.20131252
Original article

Background Although biomedical ontologies have standardized the representation of gene products across species and databases, a method for determining the functional similarities of gene products has not yet been developed.

Methods We proposed a new semantic similarity measure based on Gene Ontology that considers the semantic influences from all of the ancestor terms in a graph. Our measure was compared with Resnik's measure in two applications, which were based on the association of the measure used with the gene co-expression and the protein-protein interactions.

Results The results showed a considerable association between the semantic similarity and the expression correlation and between the semantic similarity and the protein-protein interactions, and our measure performed the best overall.

Conclusion These results revealed the potential value of our newly proposed semantic similarity measure in studying the functional relevance of gene products.

Department of Medical Statistics and Epidemiology, School of Public Health, Sun Yat-sen University, Guangzhou, Guangdong 510080, China (Qi GL and Fang JQ)

Department of Medical Information, School of Medicine, Jinan University, Guangzhou, Guangdong 510632, China (Qian SY)

Correspondence to: FANG Ji-qian, Department of Medical Statistics and Epidemiology, School of Public Health, Sun Yat-sen University, Guangzhou, Guangdong 510080, China (Tel: 86–20–87330671. Email:

(Received May 12, 2013)

Edited by CUI Yi

The appearance of the ontology provides a novel means to compare the functional differences among gene products. One can adopt a common and objective knowledge representation to characterize and classify functional features of genes by terms; then, based on the statistical and topological information of the terms in the ontology, a semantic similarity can be measured to reflect the closeness in the biomedical sense between any two gene products.1–7

The gene ontology (GO) project is “a collaborative effort to address the needs for consistent descriptions of gene products in different databases”.8 Three structured, controlled ontologies were developed to standardize the representation of gene products in terms of their attributes, including biological process (BP), cellular component (CC), and molecular function (MF). The structure of the GO can be described as a directed acyclic graph (DAG), where each GO term is a node, and the relationships between the terms are arcs between the nodes; each node may have one or more parent nodes. In the GO annotation database, a gene product annotated with a given term is automatically annotated with its ancestor terms, and all of the annotated gene products would have attributes in the three node terms: BP, CC, and MF. Figure 1 depicts the sub-graph of the GO DAG for the term alcohol catabolic process, where the numbers in brackets are the annotation numbers of the gene products from Saccharomyces cerevisiae of the corresponding terms, according to the data from the July 2012 UniProt Knowledgebase (UniProtKB) release.
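
The annotation rule described above, under which a gene product annotated with a term is implicitly annotated with all of that term's ancestors, can be illustrated with a minimal sketch. The small DAG, the gene names, and the helper functions below are hypothetical and simplified; they are not the data shown in Figure 1.

```python
from collections import defaultdict

# Hypothetical toy DAG: child term -> list of parent terms (illustrative only).
PARENTS = {
    "alcohol catabolic process": ["alcohol metabolic process", "catabolic process"],
    "alcohol metabolic process": ["metabolic process"],
    "catabolic process": ["metabolic process"],
    "metabolic process": ["biological process"],
    "biological process": [],
}

def ancestors_and_self(term, parents):
    """Return the term together with all of its ancestors in the DAG."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen

def propagate_annotations(direct, parents):
    """Map each term to the gene products annotated to it or to any descendant."""
    propagated = defaultdict(set)
    for gene, terms in direct.items():
        for term in terms:
            for t in ancestors_and_self(term, parents):
                propagated[t].add(gene)
    return dict(propagated)

# Made-up direct annotations for three hypothetical gene products.
DIRECT = {
    "geneA": ["alcohol catabolic process"],
    "geneB": ["alcohol metabolic process"],
    "geneC": ["catabolic process"],
}

# Annotation count per term after propagation to ancestors.
COUNTS = {t: len(g) for t, g in propagate_annotations(DIRECT, PARENTS).items()}
print(COUNTS)
```

Counts obtained this way never decrease when moving from a term to one of its ancestors, which is the property that the frequency-based measures discussed in this article rely on.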

Figure 1.

The currently prevalent approaches for determining the semantic similarity of the GO terms are the methods developed by Resnik, Lin, and Jiang and Conrath.9–11 These methods all rely on the information content (IC), which describes the information a term contains and is defined as IC(t) = −ln(p(t)), where p(t) is the probability of term t occurring in a specific corpus (such as UniProtKB). The IC of the common ancestor of the two terms is then used to measure their semantic similarity. Recent evaluation studies of semantic similarity measures in GO show that Resnik's method is better than the other two measures in most assessment settings.1,4–6,12,13 However, Resnik's measure only considers the most informative common ancestor and disregards additional information in the graph; thus, it cannot distinguish between distinct term-pairs that share the same most informative common ancestor.
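
The limitation of Resnik's measure noted above can be seen in a small sketch. The term names, counts, and ancestor sets below are hypothetical toy values, not GO data; the two functions simply follow the definition IC(t) = −ln(p(t)) and take the IC of the most informative common ancestor.

```python
import math

# Hypothetical annotation counts (already propagated to ancestors); not real GO data.
COUNTS = {"root": 100, "M": 20, "C1": 5, "C2": 8}
TOTAL = COUNTS["root"]

# Each term mapped to itself plus all of its ancestors in a toy DAG,
# where C1 and C2 are both children of M.
ANCESTORS = {
    "root": {"root"},
    "M": {"M", "root"},
    "C1": {"C1", "M", "root"},
    "C2": {"C2", "M", "root"},
}

def information_content(term):
    """IC(t) = -ln(p(t)), with p(t) estimated as count / total."""
    return -math.log(COUNTS[term] / TOTAL)

def resnik(term_a, term_b):
    """IC of the most informative common ancestor of the two terms."""
    common = ANCESTORS[term_a] & ANCESTORS[term_b]
    return max(information_content(t) for t in common)

# Two different term pairs with the same most informative common ancestor (M).
print(resnik("C1", "C2"))
print(resnik("C1", "M"))
```

Both calls return the IC of M, so the sibling pair (C1, C2) and the parent-child pair (C1, M) are indistinguishable under this measure.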

The GO annotation rule clearly indicates that the semantics of a GO term are related not only to the meaning of the term itself but also to the meaning of all its ancestor terms. Therefore, a measure of the semantic similarity of two GO terms must incorporate the semantic influences from all ancestor terms related to the compared terms and not just consider the most informative common ancestor in the GO graph.

According to this assumption, we proposed a new method based on the GO information to measure the semantic similarity. Correlation and cluster analyses were then performed to compare our new measure with Resnik's measure. The final results demonstrated the advantage of our measure.


Semantic similarity measure

Semantic value of the GO terms

For any term A, its self-semantic value is defined as

S(A) = log2(p(A) + 1)    (1)

where p(A) is the occurrence probability of term A in a specific corpus, such as the UniProtKB. The base-2 logarithm is applied to the sum of p(A) and 1 so that S(A) lies in the range of 0 to 1 and log2(0) is avoided when p(A)=0.

Because the annotation numbers of all descendants are counted in the total annotation number of their ancestor term, the S value is monotonically non-decreasing, from 0 to 1, as the term moves from the bottom of the ontology to the top.

For example, using Eq. 1 and the data in Figure 1, we can compute the S value of the GO term alcohol catabolic process as follows:

The S value of its ancestor, alcohol metabolic process, is 0.023, and the S value of the root term biological process is equal to 1.
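
As a minimal sketch of the self-semantic value (the base-2 logarithm of p(A) + 1, as described above), the following function estimates p(A) as a relative annotation frequency. The function name and the counts are hypothetical and are not the values from Figure 1.

```python
import math

def self_semantic_value(annotation_count, total_annotations):
    """S(A) = log2(p(A) + 1), with p(A) estimated as a relative annotation frequency."""
    if total_annotations <= 0:
        raise ValueError("total_annotations must be positive")
    p = annotation_count / total_annotations
    return math.log2(p + 1.0)

# Hypothetical counts, ordered from a specific term up to the root.
TOTAL = 5000
for name, count in [("specific term", 10), ("its parent", 80), ("root", TOTAL)]:
    print(name, round(self_semantic_value(count, TOTAL), 4))
```

An unannotated term gets S = 0 and the root gets S = 1, with values increasing as the annotation count grows toward the top of the ontology.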

Intuitively, the ancestor terms closer to a specific term should contribute more to its semantics, whereas the ancestor terms farther from this term should contribute less because these are more generic than the term being analyzed. Thus, we use TA to represent the set of GO terms in the sub-graph of the GO DAG for term A, which includes term A and all of its ancestor terms in GO. Then, a variable denoted Semantic Influence (SI) was defined to describe the semantic influence that a term t ∈ TA imposes on a specific term A:

Because p(t) ≥ p(A), from Eq. 2, the SI value is also between 0 and 1. A special case is SI (A, A). The SI values of all of the terms in the sub-graphs for the GO terms alcohol catabolic process and alcohol metabolic process are listed in Table 1.

Table 1

We defined the entire semantic value of a GO term as an aggregation of all of the semantic influences from its ancestor terms and itself. Therefore, the semantic value (SV) of a specific term A can be calculated as

SV(A) = Σ_{t ∈ TA} SI(t, A)    (3)

For instance, according to Eq. 3, the SV value of the term alcohol catabolic process in the BP ontology is equal to 1.223. The SV value of the term alcohol metabolic process is 1.198, and the SV value of the BP root node is still 1.

Semantic similarity between two GO terms

The similarity of two GO terms should account for the influences derived from their shared terms as well as the locations of these two terms in the entire GO graph. Therefore, a variable Sim (A,B) was defined as the semantic similarity between terms A and B:

where TA and TB are the GO term sets in the sub-graph of the GO DAG for terms A and B, respectively, and TA,B = TA ∩ TB represents the set of terms shared by terms A and B. Because TA,B ⊆ TA and TA,B ⊆ TB, the semantic similarity value between the two terms A and B will be in the range of 0 to 1.

The terms alcohol catabolic process and alcohol metabolic process in Figure 1 have four common terms, and the most informative common ancestor is alcohol metabolic process. However, the semantic influences from the same ancestor term to these two specific GO terms will be different. According to the data shown in Table 1, the final SV values of the set of shared common terms are 0.104 and 1.198 for the terms alcohol catabolic process and alcohol metabolic process, respectively. Based on Eq. 4, we have

These results demonstrate the superiority of this newly proposed similarity measure because the corresponding two results based on Resnik's measures are the same:

Functional similarity between gene products

The term “functional similarity” is widely used to describe the similarity between two gene products by applying the semantic similarity measure between the GO term sets with which they are annotated.14 For the assessment of the functional similarity between gene products, the two principles “compare within a particular GO category” and “compare within the set of terms that the gene represents rather than a single term” are accepted by most measures because the gene products are usually annotated with multiple terms from each of the three GO categories.2,14–18

For example, in the biochemical pathway of Saccharomyces cerevisiae denoted “isoleucine degradation”, the genes PDC1 and PDC6 are annotated in the BP ontology by the GO term sets {GO:0000949, GO:0006090, GO:0006559, GO:0006569, GO:0019655} and {GO:0000949, GO:0006067, GO:0006559, GO:0006569}, respectively. The corresponding semantic similarity values between these terms are listed in Table 2. By comparing the semantic similarities of the GO terms in these two sets, we can calculate their functional similarity with respect to the biological processes that the two genes are involved in.

Table 2

Pair-wise approaches are the prevalent measures used for computing the functional similarity of two gene products.14 These approaches consider the semantic similarities of term-pairs drawn from the term sets with which the two gene products are annotated. Several techniques are used to combine these similarity values, including the average, maximum, or best pair.16 The average best-pair approach is considered to produce better results than the methods based on the maximum and average strategies14 and has been widely adopted by many researchers.2,15,17,18

Using the concept underlying the best-pair strategy, we defined the semantic similarity between a term ta and a gene product G as follows:

Sim(ta, G) = max_{1≤i≤m} Sim(ta, gi)    (5)

where G={g1, g2,…, gm} represents the annotating term set of gene G. We defined the functional similarity between two given genes G1 and G2 as the average for all possible best-matches:

Sim(G1, G2) = [Σ_{i=1}^{m} Sim(g1i, G2) + Σ_{j=1}^{n} Sim(g2j, G1)] / (m + n)    (6)

where G1={g11, g12,…, g1m} and G2={g21, g22,…, g2n} represent the annotating term sets of genes G1 and G2, respectively.

In the example shown in Table 2, we used Eqs. 5 and 6 to compute the functional similarity value between the genes PDC1 and PDC6 on the BP ontology as follows:

Because the range of the semantic similarity value is 0 to 1, the functional similarity value will also range from 0 to 1, where 0 indicates that the two gene products have almost no biological relationship, and 1 indicates the highest functional similarity between them. The obtained value, 0.78, suggests that the genes PDC1 and PDC6 are functionally similar in terms of their biological processes. This conclusion is consistent with the biological perception because these two genes are involved in the same pathway.
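
The best-pair strategy and the averaging over all best matches described above can be sketched as follows. This is one reading of the definitions, in which the best match of every term in either gene's annotation set is averaged over the m + n terms. The term identifiers, the pairwise similarity values, and the function names are hypothetical and are not taken from Table 2.

```python
# Hypothetical pairwise term similarities (symmetric); identical terms score 1.
PAIR_SIM = {
    frozenset(("t1", "t2")): 0.4,
    frozenset(("t1", "t3")): 0.5,
    frozenset(("t2", "t3")): 0.2,
}

def term_sim(a, b):
    """Look up the similarity of two terms in the toy table."""
    return 1.0 if a == b else PAIR_SIM[frozenset((a, b))]

def term_to_gene_similarity(term, gene_terms, sim=term_sim):
    """Best-pair similarity between one term and a gene's annotation set."""
    return max(sim(term, g) for g in gene_terms)

def functional_similarity(terms_1, terms_2, sim=term_sim):
    """Average of the best matches of all terms in both annotation sets."""
    best_1 = [term_to_gene_similarity(t, terms_2, sim) for t in terms_1]
    best_2 = [term_to_gene_similarity(t, terms_1, sim) for t in terms_2]
    return (sum(best_1) + sum(best_2)) / (len(terms_1) + len(terms_2))

GENE_1 = ["t1", "t2"]
GENE_2 = ["t1", "t3"]
print(functional_similarity(GENE_1, GENE_2))
```

Passing the term-level similarity as a parameter keeps the gene-level combination independent of which term measure is used, so the same code could combine values from either measure compared in this article.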

Semantic similarity and gene co-expression

The grouping of genes or proteins according to their differential expression profiles is the most common method used to interpret microarray data. This method rests on the observation that genes expressed in a coordinated manner are likely to be involved in the same biological processes or to have a similar function.19 Moreover, growing evidence shows that the GO can reflect the functional similarity of gene products through the closeness of the terms with which they are annotated.1–7 Therefore, it is reasonable to assume that gene products with similar expression patterns might have similar annotation profiles, i.e., that the expression correlation might relate to the semantic similarity.3–5

Based on this hypothesis, we aimed to explore the possible quantitative relationship between the semantic similarity of gene products and their co-expression correlation. We used a real microarray dataset to calculate both the correlation of co-expression and the semantic similarity of the gene products.

The GO data were based on the July 2012 release downloaded from the Gene Ontology website. The annotation data corresponding to the GO terms were derived from the Gene Ontology Annotation (UniProtKB-GOA) Database. The Saccharomyces cerevisiae microarray expression data reported by Chu et al20 were obtained from the Stanford Microarray Database. Chu's dataset includes the expression profile of the whole yeast genome at seven distinct times during the developmental program of sporulation, which is suitable for measuring gene expression changes related to molecular functions and biological processes.

By analyzing Chu's data, we obtained the expression data for 6118 ORFs of yeast, and 6382 gene products annotated to 4201 GO terms were included in our study. We used the Pearson correlation coefficient to describe the expression correlation. The functional similarity was derived by our semantic similarity measure (denoted by Qi hereinafter) and by Resnik's measure using the BP, CC, and MF ontologies. We then calculated both the Pearson and Spearman correlation coefficients between the semantic similarity and the expression correlation.

In addition, to ascertain the underlying association between the semantic similarity and the expression correlation, we adopted the method used in previous studies,3–5 in which all gene-product pairs are divided into several sub-groups based on the absolute values of their expression correlation coefficients, and the mean semantic similarity value is computed for each sub-group. For a given sub-group, the correlation coefficient will follow a certain statistical distribution, and the mean of the similarity values gives an estimate of the average level of semantic similarity in the sub-group.
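
The grouping procedure described above can be sketched as follows. The sketch assumes equal-width intervals of the absolute correlation on [0, 1], which is an assumption made here for illustration; the function names and the sample values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def mean_similarity_by_bin(correlations, similarities, n_bins=10):
    """Mean semantic similarity per equal-width bin of |correlation|.

    Returns a list of length n_bins; empty bins are reported as None.
    """
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for r, s in zip(correlations, similarities):
        # |r| = 1.0 is placed in the last bin.
        idx = min(int(abs(r) * n_bins), n_bins - 1)
        sums[idx] += s
        counts[idx] += 1
    return [sums[i] / counts[i] if counts[i] else None for i in range(n_bins)]

# Made-up gene-pair values: expression correlations and semantic similarities.
CORRS = [0.05, -0.07, 0.95, -1.0]
SIMS = [0.1, 0.3, 0.8, 0.6]
BIN_MEANS = mean_similarity_by_bin(CORRS, SIMS)
print(BIN_MEANS)
```

Plotting the per-bin means against the bin midpoints gives the kind of group-level trend that the sub-group analysis is meant to reveal, even when pair-level correlations are weak.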

Semantic similarity and protein-protein interaction

It has been shown that semantic similarity can improve the clustering of co-expressed gene products by taking into account their functional similarity.21,22 Therefore, we compared the performance of two semantic similarity measures and the microarray expression correlation in the functional clustering of genes.

Our functional similarity measure of gene products can be transformed into a distance measure by defining

which is suitable for cluster analysis. Based on this distance measure, we performed a K-means clustering, where K, the number of clusters, was varied from 5 to 100 in increments of 5.

Recent reports have shown that GO-based semantic similarity measures have been applied for validating and predicting the functions and interactions of gene products.6,7,12,23,24 Many of the most important biological processes in a cell are performed by large molecular machines that are built from a large number of protein components and organized by their protein-protein interactions (PPI). This suggests a high degree of biological relevance between gene function and protein interactions.25 Thus, we defined the Rate of Interaction Pairs within cluster i (RIPi) as the ratio between the number of interaction pairs and the number of all gene pairs within cluster Ci:

RIPi = Ni / [|Ci|(|Ci| − 1)/2]

where Ni is the number of interaction pairs within cluster Ci and |Ci| is the number of genes in Ci.

To measure the quality of the whole partition and to demonstrate that semantic similarity measures can be used to verify the functional correlation between gene products, a comprehensive variable RIP(K) was defined by averaging all of the RIP values of the K clusters:

RIP(K) = (1/K) Σ_{i=1}^{K} RIPi

The results based on Resnik's measure and our measure were compared to explore their differences, and the results from a microarray dataset were used as a baseline representing the experimental method. The microarray and GO datasets are the same as those described above. The PPI data were downloaded from the Biological General Repository for Interaction Datasets (BioGRID); Version 3.1.90 (July 2012) was selected as the validation source.
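
The rate of interaction pairs within a cluster, and its average over the K clusters of a partition, can be sketched as follows. The gene names, the interaction set, and the function names are hypothetical; treating a cluster with fewer than two genes as contributing 0 is an assumption made here, since such a cluster has no gene pairs.

```python
from itertools import combinations

def rip_of_cluster(genes, interactions):
    """Fraction of within-cluster gene pairs that are known interaction pairs."""
    pairs = [frozenset(p) for p in combinations(sorted(set(genes)), 2)]
    if not pairs:
        return 0.0  # assumption: clusters without any gene pair contribute 0
    hits = sum(1 for p in pairs if p in interactions)
    return hits / len(pairs)

def rip_of_partition(clusters, interactions):
    """Average of the per-cluster rates over all K clusters."""
    return sum(rip_of_cluster(c, interactions) for c in clusters) / len(clusters)

# Made-up interaction pairs and a partition into K = 2 clusters.
INTERACTIONS = {frozenset(("g1", "g2")), frozenset(("g3", "g4"))}
CLUSTERS = [["g1", "g2", "g3"], ["g4", "g5"]]

print(rip_of_cluster(CLUSTERS[0], INTERACTIONS))
print(rip_of_partition(CLUSTERS, INTERACTIONS))
```

Storing each interaction as a frozenset makes the lookup independent of the order in which the two genes are listed.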


Association with gene co-expression

The Pearson correlation coefficients between the semantic similarity and the expression correlation for individual pairs of gene products were all statistically significant, although the values were very low. The corresponding Spearman correlation coefficients were also low, and some were not statistically significant (left side of Table 3). The low correlation coefficients might be due to the intrinsic complexity of the biology. Because each gene is described by only a few molecular function terms, and the number of genes participating in each biological process is also finite, each gene should be closely associated with only a small number of other genes, and its relationships to most other genes are distant. Thus, most of the semantic similarity values are relatively low, and only a relatively small number of high similarity values are generated.

Table 3

The two types of correlation coefficients at the grouping level were calculated using the two semantic similarity measures, and these were also statistically significant with P <0.001 (right side of Table 3). Figure 2 summarizes the relationship between the mean semantic similarity and the absolute expression correlation of the gene products for all three ontologies. The horizontal axis was divided into 10 absolute correlation intervals, and the vertical axis shows the mean similarity values and their 95% confidence intervals for each interval. The results showed that high GO similarity values are related to high expression correlation values. Similar trends were obtained for other numbers of intervals. All of these results demonstrate that our semantic similarity measure clearly outperforms Resnik's measure.

Figure 2.

Association with protein-protein interactions

The results of the K-means cluster analysis on RIP(K) for any K showed the potential ability of the semantic similarity measures in functional clustering. Figure 3A summarizes the RIP(K) results for the two semantic similarity measures and the microarray expression correlation. The information related to the semantic similarity was derived from the BP ontology. Our semantic similarity measure performed better than the others. In addition, the biological relevance (as measured by RIP(K)) increased with K, the number of clusters. Similar patterns were obtained from the analyses based on the CC and MF ontologies. These results are illustrated in Figures 3B and 3C, which depict the associations between the semantic similarity and the protein interactions. All of these results confirmed that our similarity measure performed the best, and the results based on the expression correlation were the worst.

Figure 3.


In this work, we proposed a new GO-based semantic similarity measure and applied it to explore the associations between semantic similarity and functional similarity. All of the results showed that our measure performed better than Resnik's because it found a higher correlation with the co-expression and more PPI pairs on all three ontologies. This result was obtained because our measure extracted additional annotation information from all of the ancestor terms rather than only the most informative ancestor. Moreover, our measure does not significantly increase the computational complexity: compared with the existing node-based measures, the only additional step is the calculation of the information content for non-shared terms.

We have assessed our measure on small-scale simulations based on classification information in biochemical pathways and on genome-scale cluster analysis. All of the results showed that our measure is suitable for studying the functional relevance of gene products. Because of limits on article length, we only reported part of the analysis of Saccharomyces cerevisiae in this work. For more details, see Qi's PhD dissertation.27 We also provide a relevant program as a supplement for readers' use, with which one can easily extend the functional similarity to more species and various settings.

The degrees of biological relevance derived from semantic similarity measures were significantly better than those based only on the gene expression data. This finding indicates that, in quantitative studies on the databases of human knowledge, such as GO, the intelligent assessment techniques could be reliable tools for exploring the functional relationship within a gene network and that these are more stable and efficient than any single experiment. Further studies that focus on how such types of tools can be used to benefit more biological studies would be worthwhile.


1. Lord P, Stevens R, Brass A, Goble C. Semantic similarity measures as tools for exploring the gene ontology. Proceedings of the 8th Pacific Symposium on Biocomputing 2003: 601-612.
2. Schlicker A, Domingues FS, Rahnenführer J, Lengauer T. A new measure for functional similarity of gene products based on gene ontology. BMC Bioinform 2006; 7: 302.
3. Azuaje F, Bodenreider O. Incorporating ontology-driven similarity knowledge into functional genomics: An exploratory study. Proceedings of the IEEE 4th Symposium on Bioinformatics and Bioengineering (BIBE-2004) 2004.
4. Wang H, Azuaje F, Bodenreider O, Dopazo J. Gene expression correlation and gene ontology-based similarity: an assessment of quantitative relationships. Proceedings of the IEEE Sym. Computational Intelligence in Bioinformatics and Computational Biology (CIBCB' 2004) 2004.
5. Sevilla JL, Segura V, Podhorski A, Guruceaga E, Mato JM, Martínez-Cruz LA, et al. Correlation between Gene Expression and GO Semantic Similarity. IEEE/ACM Trans Comput Biol Bioinform 2005; 2: 330-338.
6. Lubovac Z, Gamalielsson J, Olsson B. Combining functional and topological properties to identify core modules in protein interaction networks. Proteins 2006; 64: 948-959.
7. Guo X, Liu RX, Shriver CD, Hu H, Liebman MN. Assessing semantic similarity measures for the characterization of human regulatory pathways. Bioinformatics 2006; 22: 967-973.
8. The Gene Ontology Consortium. Gene Ontology: tool for the unification of biology. Nat Genet 2000; 25: 25-29.
9. Resnik P. Semantic similarity in a taxonomy: an information-based measure and its application to problems of ambiguity in natural language. J Artif Intell Res 1999; 11: 95-130.
10. Lin D. An information-theoretic definition of similarity. Proceedings of the 15th International Conference on Machine Learning 1998.
11. Jiang J, Conrath D. Semantic similarity based on corpus statistics and lexical taxonomy. Proceedings of the International Conference on Research in Computational Linguistics 1998.
12. Mistry M, Pavlidis P. Gene ontology term overlap as a measure of gene functional similarity. BMC Bioinform 2008; 9: 327.
13. Xu T, Du L, Zhou Y. Evaluation of GO-based functional similarity measures using S. cerevisiae protein interaction and expression profile data. BMC Bioinform 2008; 9: 472.
14. Pesquita C, Faria D, Falcão AO, Lord P, Couto FM. Semantic similarity in biomedical ontologies. PLoS Comput Biol 2009; 5: e1000443.
15. Couto FM, Silva MJ, Coutinho PM. Semantic similarity over the gene ontology: family correlation and selecting disjunctive ancestors. Proceedings of the ACM Conference in Information and Knowledge Management 2005.
16. Lei ZD, Dai Y. Assessing protein similarity with gene ontology and its use in subnuclear localization prediction. BMC Bioinform 2006; 7: 491.
17. Azuaje F, Wang H, Bodenreider O. Ontology-driven similarity approaches to supporting gene functional assessment. Proceedings of the ISMB 2005 SIG meeting on Bio-ontologies 2005.
18. Wang JZ, Du Z, Payattakool R, Yu PS, Chen CF. A new method to measure the semantic similarity of GO terms. Bioinformatics 2007; 23: 1274-1281.
19. Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 1998; 95: 14863-14868.
20. Chu S, DeRisi J, Eisen M, Mulholland J, Botstein D, Brown PO, et al. The transcriptional program of sporulation in budding yeast. Science 1998; 282: 699-705.
21. Chen JL, Liu Y, Sam LT, Li J, Lussier YA. Evaluation of high-throughput functional categorization of human disease genes. BMC Bioinform 2007; 8 Suppl 3: S7.
22. Wolting C, McGlade JC, Tritchler D. Cluster analysis of protein array results via similarity of gene ontology annotation. BMC Bioinform 2006; 7: 338.
23. Zhu M, Gao L, Guo Z, Li Y, Wang D, Wang J, et al. Globally predicting protein functions based on co-expressed protein-protein interaction networks and ontology taxonomy similarities. Gene 2007; 391: 113-119.
24. Taylor IW, Linding R, Warde-Farley D, Liu Y, Pesquita C, Faria D, et al. Dynamic modularity in protein interaction networks predicts breast cancer outcome. Nat Biotechnol 2009; 27: 199-204.
25. Marcotte EM, Pellegrini M, Ng HL, Rice DW, Yeates TO, Eisenberg D. Detecting protein function and protein-protein interactions from genome sequences. Science 1999; 285: 751-753.
26. Breitkreutz BJ, Stark C, Tyers M. The GRID: the General Repository for Interaction Datasets. Genome Biol 2003; 4: R23.
27. Qi GL. Knowledge based measure for functional associations of gene products [dissertation]. Guangzhou: Sun Yat-Sen University; 2013.

gene ontology; semantic similarity; clustering; protein-protein interaction

© 2013 Chinese Medical Association