Contextual Similarity Among Identifier Names: An Empirical Study

  • Remo de Oliveira Gresta Universidade Federal de São João del Rei
  • Elder Cirilo Universidade Federal de São João del Rei


Identifiers are one of the most important sources of domain information in software development. It is therefore widely recognized that the proper use of names directly affects the comprehensibility, maintainability, and quality of code. Our goal in this work is to expand the current knowledge about names by considering not only their quality but also their contextual similarity. To that end, we extracted names from four large-scale open-source projects written in Java. We then computed the semantic similarity between classes and their attributes/variables using Fasttext, a word embedding algorithm. As a result, we observed that source code, in general, preserves an acceptable level of contextual similarity, that developers avoid using names outside the default dictionary (e.g., the domain), and that files with more changes and maintained by distinct contributors tend to have better contextual similarity.
Keywords: Identifier names, Semantic Similarity, Word Embedding, Empirical Study
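
The abstract describes measuring semantic similarity between a class name and the names of its attributes/variables through word embeddings. A minimal sketch of that idea follows. It is not the authors' implementation: the tokenization rule, the averaging of word vectors, the use of cosine similarity, and the tiny `TOY_VECTORS` table (a hypothetical stand-in for a trained embedding model such as Fasttext) are all assumptions made for illustration.

```python
import math
import re

# Hypothetical toy vectors standing in for a trained word embedding model.
# The values are invented for illustration only.
TOY_VECTORS = {
    "customer": [0.9, 0.1, 0.0],
    "order": [0.7, 0.3, 0.1],
    "name": [0.8, 0.2, 0.1],
    "buffer": [0.0, 0.2, 0.9],
    "index": [0.1, 0.1, 0.8],
}


def split_identifier(name):
    """Split camelCase, PascalCase, and snake_case identifiers into lowercase words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [part.lower() for part in parts]


def embed(name, vectors=TOY_VECTORS):
    """Average the vectors of the known words in an identifier (assumed strategy)."""
    known = [vectors[word] for word in split_identifier(name) if word in vectors]
    if not known:
        return None  # no word of the identifier is in the vocabulary
    size = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(size)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contextual_similarity(class_name, member_name, vectors=TOY_VECTORS):
    """Similarity between a class name and one of its member names, or None."""
    class_vec = embed(class_name, vectors)
    member_vec = embed(member_name, vectors)
    if class_vec is None or member_vec is None:
        return None
    return cosine(class_vec, member_vec)


if __name__ == "__main__":
    for member in ("customerName", "bufferIndex"):
        print(member, contextual_similarity("CustomerOrder", member))
```

With these toy vectors, `customerName` scores closer to `CustomerOrder` than `bufferIndex` does, which is the kind of contrast a contextual-similarity measure is meant to expose. In practice the table would be replaced by vectors from a trained model.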


How to Cite

GRESTA, Remo de Oliveira; CIRILO, Elder. Contextual Similarity Among Identifier Names: An Empirical Study. In: WORKSHOP ON SOFTWARE VISUALIZATION (VEM), 8., 2020, Evento Online. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2020. p. 49-56. DOI: