Machine Learning for Classification of Textual Types: A Case Study in Texts Written in Brazilian Portuguese
Abstract
The classification of texts by textual type is of great importance for some Natural Language Processing (NLP) applications. In recent years, machine learning algorithms have achieved good results on this task for English texts. However, research on detecting the textual types of texts written in Portuguese is still scarce, and much remains to be investigated. This article therefore presents an experimental study of machine learning algorithms for classifying Portuguese texts by textual type. To this end, we propose a new corpus of Portuguese texts covering two textual types: narrative and dissertative (essay). Three machine learning algorithms were evaluated on the proposed corpus in terms of accuracy, recall, and F1 score. In addition, the features involved in the process were analyzed to identify which textual characteristics matter most for this task. The results showed that it is possible to achieve high precision and recall in classifying narrative and essay texts. The algorithms obtained similar scores, which suggests that the extracted features are of good quality.
Keywords:
NLP, Text classification, Text typology, Linguistic features
References
Altmann, A., Toloşi, L., Sander, O., and Lengauer, T. (2010). Permutation importance: a corrected feature importance measure. Bioinformatics, 26(10):1340–1347.
Awad, M. and Khanna, R. (2015). Support vector machines for classification. In Efficient learning machines, pages 39–66. Springer.
Botta, A., de Donato, W., Persico, V., and Pescape, A. (2016). Integration of cloud computing and internet of things: A survey. Future Generation Computer Systems, 56:684–700.
Breiman, L. (2001). Random forests. Machine learning, 45(1):5–32.
Camelo, R., Justino, S., and de Mello, R. F. L. (2020). Coh-metrix pt-br: Uma api web de análise textual para a educação. In Anais dos Workshops do IX Congresso Brasileiro de Informática na Educação, pages 179–186. SBC.
Chandrashekar, G. and Sahin, F. (2014). A survey on feature selection methods. Computers & Electrical Engineering, 40(1):16–28.
Dabre, R., Chu, C., and Kunchukuttan, A. (2020). A survey of multilingual neural machine translation. ACM Comput. Surv., 53(5).
Ferreira-Mello, R., Andre, M., Pinheiro, A., Costa, E., and Romero, C. (2019). Text mining in education. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 9(6):e1332.
Ferreira Mello, R., Fiorentino, G., Oliveira, H., Miranda, P., Rakovic, M., and Gasevic, D. (2022). Towards automated content analysis of rhetorical structure of written essays using sequential content-independent features in Portuguese. In LAK22: 12th International Learning Analytics and Knowledge Conference, pages 404–414.
Hassani, H., Beneki, C., Unger, S., Mazinani, M. T., and Yeganegi, M. R. (2020). Text mining in big data analytics. Big Data and Cognitive Computing, 4(1).
Hossin, M. and Sulaiman, M. N. (2015). A review on evaluation metrics for data classification evaluations. International journal of data mining & knowledge management process, 5(2):1.
Karlgren, J. and Cutting, D. (1994). Recognizing text genres with simple metrics using discriminant analysis.
Ke, Z. and Ng, V. (2019). Automated essay scoring: A survey of the state of the art. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 6300–6308. International Joint Conferences on Artificial Intelligence Organization.
Kessler, B., Nunberg, G., and Schütze, H. (1997). Automatic detection of text genre. arXiv preprint cmp-lg/9707002.
Kowsari, K., Jafari Meimandi, K., Heidarysafa, M., Mendu, S., Barnes, L., and Brown, D. (2019). Text classification algorithms: A survey. Information, 10(4).
Lagutina, K. and Lagutina, N. (2021). A survey of models for constructing text features to classify texts in natural language. In 2021 29th Conference of Open Innovations Association (FRUCT), pages 222–233.
Li, S., Xu, L. D., and Zhao, S. (2015). The internet of things: a survey. Information Systems Frontiers, 17(2):243–259.
McNamara, D. S., Graesser, A. C., McCarthy, P. M., and Cai, Z. (2014). Automated evaluation of text and discourse with Coh-Metrix. Cambridge University Press.
Melissourgou, M. N. and Frantzi, K. T. (2017). Genre identification based on SFL principles: The representation of text types and genres in English language teaching material. Corpus Pragmatics, 1(4):373–392.
Mustonen, S. (1965). Multiple discriminant analysis in linguistic problems. Statistical Methods in Linguistics, 4:37–44.
Onan, A. (2017). Hybrid supervised clustering based ensemble scheme for text classification. Kybernetes, 46(2):330–348.
Onan, A. (2018). An ensemble scheme based on language function analysis and feature engineering for text genre classification. Journal of Information Science, 44(1):28–47.
Oussous, A., Benjelloun, F.-Z., Ait Lahcen, A., and Belfkih, S. (2018). Big data technologies: A survey. Journal of King Saud University - Computer and Information Sciences, 30(4):431–448.
Patout, P.-A. and Cordy, M. (2019). Towards context-aware automated writing evaluation systems. In Proceedings of the 1st ACM SIGSOFT International Workshop on Education through Advanced Software Engineering and Artificial Intelligence, EASEAI 2019, pages 17–20, New York, NY, USA. Association for Computing Machinery.
Stamatatos, E., Fakotakis, N., and Kokkinakis, G. (2000). Text genre detection using common word frequencies. In COLING 2000 Volume 2: The 18th International Conference on Computational Linguistics.
Travaglia, L. C. (2002). Tipos, gêneros e subtipos textuais e o ensino de língua materna. Língua Portuguesa: uma visão em mosaico. São Paulo: EDUC, pages 201–214.
Travaglia, L. C. (2003). Tipelementos e a construção de uma teoria tipológica geral de textos. FÁVERO, Leonor Lopes; BASTOS, Neusa M. de O. Barbosa, pages 97–117.
Travaglia, L. C. (2018). Tipologia textual e ensino da língua. A ser publicado como capítulo do livro Linguística Textual e Análise da conversação (GTLAC) da ANPOLL. Uberlândia.
Wachsmuth, H. and Bujna, K. (2011). Back to the roots of genres: Text classification by language function. In Proceedings of 5th International Joint Conference on Natural Language Processing, pages 632–640.
Zhang, L., Wang, S., and Liu, B. (2018). Deep learning for sentiment analysis: A survey. WIREs Data Mining and Knowledge Discovery, 8(4):e1253.
Zhou, X., Gururajan, R., Li, Y., Venkataraman, R., Tao, X., Bargshady, G., Barua, P. D., and Kondalsamy-Chennakesavan, S. (2020). A survey on text classification and its applications. Web Intelligence, 18:205–216.
Published
2022-11-16
How to Cite
BARBOSA, Gabriel A.; BATISTA, Hyan H. N.; MIRANDA, Péricles; SANTOS, Jário; ISOTANI, Seiji; CORDEIRO, Thiago; BITTENCOURT, Ig Ibert; FERREIRA MELLO, Rafael. Machine Learning for Classification of Textual Types: A Case Study in Texts Written in Brazilian Portuguese. In: BRAZILIAN SYMPOSIUM ON COMPUTERS IN EDUCATION (SBIE), 33., 2022, Manaus. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2022. p. 920-931. DOI: https://doi.org/10.5753/sbie.2022.224769.
