Contextual Similarity Among Identifier Names: An Empirical Study
Identifiers are one of the most important sources of domain information in software development. It is therefore widely recognized that the proper use of names directly affects the code's comprehensibility, maintainability, and quality. Our goal in this work is to expand the current knowledge about names by considering not only their quality but also their contextual similarity. To achieve that, we extracted names from four large-scale open-source projects written in Java. Then, we computed the semantic similarity between classes and their attributes/variables using fastText, a word embedding algorithm. As a result, we observed that source code, in general, preserves an acceptable level of contextual similarity, that developers avoid using names outside the default dictionary (e.g., domain), and that files with more changes and maintained by distinct contributors tend to have a better contextual similarity.
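As an illustration of the similarity computation described above, the following is a minimal sketch, not the procedure used in the study. It assumes that each identifier is split into words, that the word vectors are averaged, and that a class name and a member name are compared by cosine similarity. The function names and the three-dimensional `TOY_VECTORS` table are hypothetical stand-ins; in practice the vectors would come from a trained fastText model, which can also produce vectors for unseen words from subword n-grams.

```python
import math
import re

# Hypothetical toy vectors standing in for real fastText embeddings (assumption).
TOY_VECTORS = {
    "customer": [0.9, 0.1, 0.0],
    "order": [0.8, 0.2, 0.1],
    "address": [0.7, 0.3, 0.0],
    "buffer": [0.0, 0.2, 0.9],
}


def split_identifier(name):
    """Split a camelCase or snake_case identifier into lowercase words."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def embed(name, vectors):
    """Average the vectors of the known words in a name; None if no word is known."""
    known = [vectors[w] for w in split_identifier(name) if w in vectors]
    if not known:
        return None
    dim = len(known[0])
    return [sum(v[i] for v in known) / len(known) for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def contextual_similarity(class_name, member_name, vectors=TOY_VECTORS):
    """Similarity between a class name and one of its attribute/variable names."""
    a = embed(class_name, vectors)
    b = embed(member_name, vectors)
    if a is None or b is None:
        return None  # the toy table has no vector for this name
    return cosine(a, b)


if __name__ == "__main__":
    for member in ("customerAddress", "buffer"):
        print(member, contextual_similarity("CustomerOrder", member))
```

With these toy vectors, `customerAddress` scores much closer to `CustomerOrder` than `buffer` does, which is the kind of contrast a contextual similarity measure is meant to expose. Averaging word vectors is only one possible aggregation choice.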