From Bag-of-Words to Pre-trained Neural Language Models: Improving Automatic Classification of App Reviews for Requirements Engineering

  • Adailton Araujo Universidade de São Paulo
  • Marcos Golo Universidade de São Paulo
  • Breno Viana Universidade de São Paulo
  • Felipe Sanches Universidade de São Paulo
  • Roseli Romero USP-SC
  • Ricardo Marcacini ICMC/USP


Popular mobile applications receive millions of user reviews. These reviews contain relevant information, such as problem reports and improvement suggestions. This information is a valuable knowledge source for software requirements engineering, since analyzing review feedback helps developers make strategic decisions to improve app quality. However, due to the large volume of text, manually extracting the relevant information is impractical. In this paper, we investigate and compare textual representation models for app review classification. We discuss different aspects and approaches for representing reviews, ranging from the classic Bag-of-Words models to the most recent state-of-the-art pre-trained neural language models. Our findings show that the classic Bag-of-Words model, combined with a careful analysis of text pre-processing techniques, is still a competitive model. However, pre-trained neural language models proved more advantageous, since they obtain good classification performance, provide significant dimensionality reduction, and deal more adequately with semantic proximity between the reviews' texts; this holds especially for the multilingual neural language models.
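
To make the Bag-of-Words representation discussed in the abstract concrete, the following is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the pipeline used in the paper: the three reviews are invented, and the helper names (`tokenize`, `bag_of_words`, `cosine`) are hypothetical. It shows how two reviews describing the same problem can share few vocabulary terms, which is the kind of semantic proximity that a purely lexical model handles poorly.

```python
import math
import re
from collections import Counter

# Hypothetical toy reviews; they are NOT taken from the paper's datasets.
reviews = [
    "The app crashes when I open the camera",   # problem report
    "App keeps crashing on camera launch",      # same problem, different wording
    "Please add a dark mode option",            # improvement suggestion
]

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens (a very simple pre-processing step)."""
    return re.findall(r"[a-z]+", text.lower())

def bag_of_words(texts):
    """Return the sorted vocabulary and one term-count vector per text."""
    counts = [Counter(tokenize(t)) for t in texts]
    vocab = sorted(set().union(*counts))
    vectors = [[c[term] for term in vocab] for c in counts]
    return vocab, vectors

def cosine(u, v):
    """Cosine similarity between two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vocab, vectors = bag_of_words(reviews)
sim_same_issue = cosine(vectors[0], vectors[1])   # both reviews report a crash
sim_unrelated = cosine(vectors[0], vectors[2])    # crash report vs. feature request

print("vocabulary size:", len(vocab))
print("similarity, same issue:", round(sim_same_issue, 3))
print("similarity, unrelated:", round(sim_unrelated, 3))
```

In this sketch every distinct token becomes its own dimension, so "crashes" and "crashing" do not match and the two crash reports overlap only on "app" and "camera". Pre-processing such as stemming would narrow that gap, while a pre-trained sentence encoder would instead map each review to a dense, fixed-size vector; which option works better on real reviews is the empirical question the paper addresses.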

Keywords: opinion mining, sentiment analysis, data-driven requirements engineering, crowd RE, mobile apps, app review, software review, user feedback, natural language processing, automatic classification


ARAUJO, Adailton; GOLO, Marcos; VIANA, Breno; SANCHES, Felipe; ROMERO, Roseli; MARCACINI, Ricardo. From Bag-of-Words to Pre-trained Neural Language Models: Improving Automatic Classification of App Reviews for Requirements Engineering. In: ENCONTRO NACIONAL DE INTELIGÊNCIA ARTIFICIAL E COMPUTACIONAL (ENIAC), 17. , 2020, Evento Online. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2020 . p. 378-389. ISSN 2763-9061. DOI: