EmbSE: A Word Embeddings Model Oriented Towards Software Engineering Domain

Eliane Maria De Bortoli Fávero; Dalcimar Casanova; Andrey Ricardo Pimentel

Eliane Maria De Bortoli Fávero, Federal Technological University of Parana
Dalcimar Casanova, Federal Technological University of Parana
Andrey Ricardo Pimentel, Federal University of Parana

Abstract

Representing context is essential in Natural Language Processing (NLP) tasks. In software engineering, classifying similar texts within a specific context is difficult because the texts produced by many software development processes (e.g., agile methods) are informal and inherently complex. Word embeddings capture semantic and syntactic information about individual words and represent them as dense, low-dimensional vectors. This property makes embedding vectors an important input feature for machine learning algorithms that classify texts. Although word embeddings have been applied in many areas, to the best of our knowledge no study has yet explored their use to build a model specific to the software engineering domain. This article therefore proposes an embedding model, called embeddings model for software engineering (EmbSE), that can recognize specific and relevant terms in the software engineering context. The model can serve as the main input for classifying the various textual artifacts generated during a software development project. The results are promising: EmbSE achieves a 48% improvement in mean average precision (mAP) over the model trained on a generic corpus. This supports the hypothesis that a model of this kind can bring significant improvements to text classification in this area.
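For readers unfamiliar with the metric, the following is a minimal sketch of how mean average precision and a relative improvement between two models are commonly computed. It is not the evaluation code used in the article. The function names are hypothetical, the numbers are toy values, and the sketch assumes every relevant item appears somewhere in each ranked list.

```python
from typing import Sequence


def average_precision(ranked_relevance: Sequence[bool]) -> float:
    """Average precision for one query.

    ranked_relevance[i] is True when the item at rank i + 1 is relevant.
    Assumption: all relevant items are present in the ranked list, so the
    sum of precision values is divided by the number of hits found.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(queries: Sequence[Sequence[bool]]) -> float:
    """Mean of the per-query average precision values."""
    if not queries:
        return 0.0
    return sum(average_precision(q) for q in queries) / len(queries)


def relative_improvement(baseline: float, candidate: float) -> float:
    """Fractional gain of candidate over baseline, e.g. 0.2 means 20%."""
    return (candidate - baseline) / baseline


if __name__ == "__main__":
    # Toy relevance judgments for two queries (not data from the article).
    toy_queries = [[True, False, True], [False, True]]
    print(mean_average_precision(toy_queries))
    # Toy mAP scores for a generic and a domain-specific model.
    print(relative_improvement(baseline=0.50, candidate=0.60))
```

Under this definition, a reported improvement is relative to the baseline score, so the same percentage can correspond to different absolute gains in mAP.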

Keywords: word embedding, domain-specific model, pre-trained model, software engineering, machine learning
