A Power Law Semantic Similarity from Gene Ontology


Data are currently being generated on a massive scale across the most diverse areas of knowledge. Bioinformatics is one such area: it produces huge amounts of data that must be analyzed and summarized in order to be understood. Semantic similarity can be seen as an approach that considers the features of objects in a given context in order to establish how similar or dissimilar those objects are. The Gene Ontology (GO) has been widely employed as a source of features for estimating the semantic similarity between its terms, and several methods for doing so have been proposed in the literature. However, these methods are based on parametric distributions or arbitrarily defined parameters that do not take the distribution of GO data into account. In this context, this work presents a data-driven method for estimating the semantic similarity between GO terms that exploits the power-law distribution. The proposed method was evaluated on a set of five metabolic pathways and compared with some of the principal methods in the literature. The results showed that the proposed method is adequate for estimating semantic similarities and that, among all the methods considered, it produced the most compact gene clusters while keeping an adequate distance between them, leading to clusters that are more assertive and less susceptible to errors. The proposed method is freely available at https://github.com/EricIto/plawss.

Keywords: Semantic similarity, Complex networks, Power-law, Bioinformatics, Pattern recognition


ITO, Eric Augusto; VICENTE, Fábio Fernandes da Rocha; PEREIRA, Luiz Felipe Protasio; LOPES, Fabricio Martins. A Power Law Semantic Similarity from Gene Ontology. In: SIMPÓSIO BRASILEIRO DE BIOINFORMÁTICA (BSB), 16., 2023, Curitiba/PR. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 123-135. ISSN 2316-1248.