Machine Learning for Text Type Classification: A Case Study on Texts Written in Brazilian Portuguese

Gabriel A. Barbosa; Hyan H. N. Batista; Péricles Miranda; Jário Santos; Seiji Isotani; Thiago Cordeiro; Ig Ibert Bittencourt; Rafael Ferreira Mello

doi:10.5753/sbie.2022.224769

Gabriel A. Barbosa, Universidade Federal Rural de Pernambuco
Hyan H. N. Batista, Universidade Federal Rural de Pernambuco
Péricles Miranda, Universidade Federal Rural de Pernambuco, https://orcid.org/0000-0002-5767-7544
Jário Santos, Universidade de São Paulo
Seiji Isotani, Universidade de São Paulo / Harvard University, http://orcid.org/0000-0003-1574-0784
Thiago Cordeiro, Universidade Federal de Alagoas
Ig Ibert Bittencourt, Universidade Federal de Alagoas / Harvard University, http://orcid.org/0000-0001-5676-2280
Rafael Ferreira Mello, Universidade Federal Rural de Pernambuco / Centro de Estudos e Sistemas Avançados do Recife, http://orcid.org/0000-0003-3548-9670

Abstract

Classifying texts by text type is highly important for some Natural Language Processing (NLP) applications. In recent years, machine learning algorithms have achieved good results on this task for texts in English. However, research on detecting text types in texts written in Portuguese is still scarce, and much remains to be studied in this context. This article therefore presents an experimental study that investigates the use of machine learning algorithms to classify Portuguese texts by text type. To this end, we propose a new corpus composed of Portuguese texts of two text types: narrative and dissertative. The performance of three machine learning algorithms was evaluated on this corpus in terms of precision, recall, and F1 score. In addition, we analyzed the features involved in the process to identify which textual characteristics matter most for this task. The results showed that it is possible to achieve high levels of precision and recall when classifying narrative and dissertative texts. The algorithms obtained similar metric values, demonstrating the quality of the extracted features.

Keywords: NLP, Text classification, Text typology, Linguistic features
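
The abstract reports results in terms of precision, recall, and F1 score for two classes, narrative and dissertative. As a minimal sketch of how these three metrics are computed per class, the snippet below uses plain Python. The function name `precision_recall_f1` and the label lists are hypothetical, invented for illustration only; they are not the authors' code or data from the study.

```python
from collections import Counter


def precision_recall_f1(y_true, y_pred, positive):
    """Return (precision, recall, F1) treating `positive` as the target class."""
    counts = Counter(zip(y_true, y_pred))
    # True positives: the class was expected and predicted.
    tp = sum(n for (t, p), n in counts.items() if t == positive and p == positive)
    # False positives: the class was predicted but another one was expected.
    fp = sum(n for (t, p), n in counts.items() if t != positive and p == positive)
    # False negatives: the class was expected but another one was predicted.
    fn = sum(n for (t, p), n in counts.items() if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical gold labels and predictions (assumption, not study data).
y_true = ["narrative", "narrative", "narrative",
          "dissertative", "dissertative", "dissertative"]
y_pred = ["narrative", "narrative", "dissertative",
          "dissertative", "dissertative", "narrative"]

for label in ("narrative", "dissertative"):
    p, r, f1 = precision_recall_f1(y_true, y_pred, positive=label)
    print(f"{label}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Reporting the metrics per class, as done here, makes it visible when a classifier favors one text type over the other, which a single accuracy figure would hide.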
