Evaluating a New Auto-ML Approach for Sentiment Analysis and Intent Recognition Tasks

Douglas Nunes de Oliveira; Milo Noronha Rocha Utsch; Diogo Villela Pedro de Almeida Machado; Nina Goulart Pena; Ramon Gomes Durães de Oliveira; Arthur Iperoyg Rodrigues Carvalho; Luiz Henrique de Campos Merschmann

doi:10.5753/jis.2023.3161

Authors

Douglas Nunes de Oliveira, Take Blip - Research & Innovation, https://orcid.org/0000-0002-0698-1845
Milo Noronha Rocha Utsch, Take Blip - Research & Innovation, https://orcid.org/0000-0001-9735-7054
Diogo Villela Pedro de Almeida Machado, Take Blip - Research & Innovation, https://orcid.org/0000-0001-5367-5587
Nina Goulart Pena, Take Blip - Research & Innovation, https://orcid.org/0000-0002-6212-0695
Ramon Gomes Durães de Oliveira, Take Blip - Research & Innovation, https://orcid.org/0000-0001-8631-8404
Arthur Iperoyg Rodrigues Carvalho, Take Blip - Research & Innovation, https://orcid.org/0000-0002-0442-5677
Luiz Henrique de Campos Merschmann, Federal University of Lavras, https://orcid.org/0000-0002-9948-2673

Keywords:

AutoML, bias correction cross-validation, genetic algorithm, Bayesian optimization, intent recognition, chatbot

Abstract

Automated Machine Learning (AutoML) is a research area that aims to help humans solve Machine Learning (ML) problems by automatically discovering good ML pipelines (algorithms and their hyperparameters for every stage of a machine learning process) for a given dataset. Because this is a combinatorial optimization problem for which it is impossible to evaluate all possible pipelines, most AutoML systems use a Genetic Algorithm (GA) or Bayesian Optimization (BO) to find a good solution. These systems usually evaluate the performance of the pipelines using K-fold cross-validation, for which the more pipelines are evaluated, the higher the chance of finding an overfitted solution. To avoid this issue, we propose a system named Auto-ML System for Text Classification (ASTeC), which uses the Bootstrap Bias Corrected CV (BBC-CV) method to evaluate the performance of the pipelines. More specifically, the proposed system combines GA, BO, and BBC-CV to find a good ML pipeline for the text classification task. We evaluated our approach by comparing it with state-of-the-art systems: for the Sentiment Analysis (SA) task, we compared our approach with TPOT (Tree-based Pipeline Optimization Tool) and the Google Cloud AutoML service, and for the Intent Recognition (IR) task, we compared it with TPOT and MLJAR AutoML. Concerning the data, we analysed seven public datasets from the SA domain and sixteen from the IR domain. Four of those sixteen are composed of written English text, while all of the others are in Brazilian Portuguese. Statistical tests show that, in 21 out of 23 datasets, our system's performance is equivalent to or better than that of the others.
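The overfitting risk described above, and the bootstrap correction that BBC-CV applies to it, can be illustrated with a minimal sketch. This is not ASTeC's code: the function names, the 0/1 "correct prediction" matrix, and the simulated data are all assumptions made for illustration. The sketch assumes the out-of-sample predictions of every candidate pipeline have already been pooled from K-fold cross-validation into one matrix. Each bootstrap round selects the best pipeline on the resampled rows and scores that pipeline only on the rows left out of the resample.

```python
import numpy as np


def naive_best_score(correct):
    """Accuracy of the best configuration, measured on the same data used to pick it."""
    return float(np.asarray(correct, dtype=float).mean(axis=0).max())


def bbc_cv_score(correct, n_bootstraps=200, seed=0):
    """Bootstrap bias-corrected estimate for the selected configuration.

    correct: (n_samples, n_configs) array of 0/1 values; correct[i, j] is 1 when
    configuration j classified sample i correctly while i was held out
    (hypothetical input layout, assumed for this sketch).
    """
    rng = np.random.default_rng(seed)
    correct = np.asarray(correct, dtype=float)
    n_samples = correct.shape[0]
    oob_scores = []
    for _ in range(n_bootstraps):
        # Resample rows with replacement; rows never drawn are out-of-bag.
        in_bag = rng.integers(0, n_samples, size=n_samples)
        out_mask = np.ones(n_samples, dtype=bool)
        out_mask[in_bag] = False
        if not out_mask.any():
            continue  # extremely unlikely, but avoids an empty evaluation set
        # Select the winner on the in-bag rows only ...
        best = np.argmax(correct[in_bag].mean(axis=0))
        # ... and score that winner on rows it was not selected on.
        oob_scores.append(correct[out_mask, best].mean())
    return float(np.mean(oob_scores))


# Simulated example: 500 configurations that all have a true accuracy of 0.7.
demo_rng = np.random.default_rng(42)
demo = (demo_rng.random((200, 500)) < 0.7).astype(int)
print("naive best-of-500 accuracy:", naive_best_score(demo))
print("bias-corrected estimate:   ", bbc_cv_score(demo))
```

In this simulation every configuration is equally good, so any gap between the two printed numbers comes purely from selecting the maximum among many noisy estimates. The gap grows as more configurations are evaluated, which is the effect the abstract attributes to plain K-fold cross-validation.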

