Contextual Similarity Among Identifier Names: An Empirical Study
Abstract
Identifiers are one of the most important sources of domain information in software development. It is therefore widely recognized that the proper use of names directly affects the comprehensibility, maintainability, and quality of code. Our goal in this work is to expand current knowledge about names by considering not only their quality but also their contextual similarity. To that end, we extracted names from four large-scale open-source projects written in Java. We then computed the semantic similarity between classes and their attributes/variables using fastText, a word embedding algorithm. We observed that source code, in general, preserves an acceptable level of contextual similarity, that developers avoid using names outside the default dictionary (e.g., domain-specific terms), and that files with more changes and maintained by distinct contributors tend to have a better contextual similarity.
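To make the idea of contextual similarity between a class name and its attribute/variable names more concrete, the following is a minimal sketch, not the study's actual pipeline. It assumes identifiers are split into lowercase tokens, each token is mapped to a vector, token vectors are averaged, and the two averages are compared with cosine similarity. The tiny `TOY_VECTORS` table, the function names, and the averaging step are all illustrative assumptions; a real setup would load pre-trained fastText vectors in place of the toy table.

```python
import math
import re

# Hypothetical toy embeddings used only for illustration.
# A real experiment would replace this with vectors from a trained model.
TOY_VECTORS = {
    "customer": [0.9, 0.1, 0.0],
    "order": [0.7, 0.3, 0.1],
    "address": [0.8, 0.2, 0.1],
    "buffer": [0.0, 0.2, 0.9],
}


def split_identifier(name):
    """Split a camelCase or snake_case identifier into lowercase tokens."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [part.lower() for part in parts]


def embed(name, vectors=TOY_VECTORS):
    """Average the vectors of the known tokens; None if no token is known."""
    known = [vectors[tok] for tok in split_identifier(name) if tok in vectors]
    if not known:
        return None
    dims = len(known[0])
    return [sum(vec[i] for vec in known) / len(known) for i in range(dims)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def contextual_similarity(class_name, member_name, vectors=TOY_VECTORS):
    """Similarity between a class name and one of its member names.

    Returns None when either name has no token in the vocabulary.
    """
    class_vec = embed(class_name, vectors)
    member_vec = embed(member_name, vectors)
    if class_vec is None or member_vec is None:
        return None
    return cosine(class_vec, member_vec)


if __name__ == "__main__":
    for member in ["customerAddress", "ioBuffer", "xyz"]:
        print(member, contextual_similarity("CustomerOrder", member))
```

With these toy vectors, a member such as `customerAddress` scores closer to the class `CustomerOrder` than `ioBuffer` does, and a name with no known tokens yields `None`. One reason fastText suits identifiers is that it builds vectors from subword units, so a real model could still produce a vector for such a name rather than skipping it.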