A machine learning approach to literary genre classification on Portuguese texts: circumventing NLP’s standard varieties
Abstract
We evaluate and classify, both qualitatively and quantitatively, the literary genres of the BDCamões corpus. Chronicles, novels, short stories, and tales, annotated in UD, are classified with random forests and analyzed using the Brazilian Portuguese version of LIWC. Per-class results are reported as means together with a standard deviation. The per-class feature results, covering LIWC labels, parts of speech, and UD labels, highlight features with high positive and low negative values. Adapting this methodology to the fluidity and mutability of literary genres circumvents the difficulties usually encountered in NLP, yielding consistent results with few errors.
References
Altman, R. (1984). A semantic/syntactic approach to film genre. Cinema Journal, pages 6–18.
Balage Filho, P., Pardo, T. A. S., and Aluísio, S. (2013). An evaluation of the Brazilian Portuguese LIWC dictionary for sentiment analysis. In Proceedings of the 9th Brazilian Symposium in Information and Human Language Technology.
Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.
Bronckart, J.-P. (2004). Les genres de textes et leur contribution au développement psychologique. Langages, 1(153):98–108.
Crowston, K. and Kwasnik, B. H. (2003). Can document-genre metadata improve information access to large digital collections? Library Trends, 52(2):345–361.
Feldman, S., Marin, M. A., Ostendorf, M., and Gupta, M. R. (2009). Part-of-speech histograms for genre classification of text. In 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 4781–4784. IEEE.
Gottschalk, L. A. and Gleser, G. C. (1979). The measurement of psychological states through the content analysis of verbal behavior. University of California Press.
Grilo, S., Bolrinha, M., Silva, J., Vaz, R., and Branco, A. (2020). The BDCamões Collection of Portuguese Literary Documents: a Research Resource for Digital Humanities and Language Technology. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 849–854.
Karlgren, J., Bretan, I., Dewe, J., Hallberg, A., and Wolkert, N. (1998). Iterative information retrieval using fast clustering and usage-specific genres. In Eighth DELOS Workshop on User Interfaces in Digital Libraries, pages 85–92.
Lüthi, M. (1970). Once Upon a Time: On the Nature of Fairy Tales. Trans. Lee Chadeayne & Paul Gottwald. New York: Frederick Ungar Publishing Co.
Marcuschi, L. A. et al. (2002). Gêneros textuais: definição e funcionalidade. Gêneros textuais e ensino, 2:19–36.
Martins, N. S. (2008). Introdução à estilística: a expressividade na língua portuguesa, volume 71. Edusp.
Matos, T. (2021). Gêneros textuais. [link]. Online; accessed July 17th of 2021.
Monte-Serrat, D. (2017). Neurolinguistics, Language and Time: investigating the verbal art in its amplitude. International Journal of Perceptions in Public Health, 1(3):162–171.
Monte-Serrat, D. (2021). Operating language value structures in the intelligent systems. Advanced Mathematical Models & Applications, 6(1):31–44.
Monte-Serrat, D. M. and Cattani, C. (2021a). Interpretability in neural networks towards universal consistency. International Journal of Cognitive Computing in Engineering, 2:30–39.
Monte-Serrat, D. M. and Cattani, C. (2021b). The Natural Language for Artificial Intelligence. Elsevier.
Nemerov, H. (2020). Poetry. Encyclopedia Britannica. https://www.britannica.com/art/poetry. [Online; accessed 05-August-2020].
Nilan, M. S., Pomerantz, J., and Paling, S. (2001). Genres from the Bottom Up: What Has the Web Brought Us? In Proceedings of the ASIST Annual Meeting, volume 38, pages 330–339. ERIC.
Nivre, J. (2015). Towards a Universal Grammar for Natural Language Processing. In Gelbukh, A., editor, Computational Linguistics and Intelligent Text Processing, pages 3–16, Cham. Springer International Publishing.
Omar, A. (2020). Classifying literary genres: a methodological synergy of computational modelling and lexical semantics. Texto Livre: Linguagem e Tecnologia, 13(2):83–101.
Pennebaker, J. W., Boyd, R. L., Jordan, K., and Blackburn, K. (2015). The development and psychometric properties of LIWC2015. Technical report.
Pennebaker, J. W., Francis, M. E., and Booth, R. J. (2001). Linguistic Inquiry and Word Count: LIWC 2001. Mahwah: Lawrence Erlbaum Associates, 71.
Plank, B. (2011). Domain Adaptation for Parsing. PhD thesis, University of Groningen. ISBN: 978-90-367-5199-5. https://bplank.github.io/publications.html.
Plank, B. (2016). What to do about non-standard (or non-canonical) language in NLP. In Proceedings of the 13th Conference on Natural Language Processing.
Rosenberg, S. D. and Tucker, G. J. (1979). Verbal behavior and schizophrenia: The semantic dimension. Archives of General Psychiatry, 36(12):1331–1337.
Rosso, M. A. (2005). What type of page is this? Genre as Web descriptor. In Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries, page 398.
Schneuwly, B. (1997). Textual organizers and text types: Ontogenetic aspects in writing. In Processing Interclausal Relationships: Studies in the Production and Comprehension of Text, pages 245–263.
Stamatatos, E., Fakotakis, N., and Kokkinakis, G. (2000). Text genre detection using common word frequencies. In COLING 2000 Volume 2: The 18th International Conference on Computational Linguistics.
Meyer zu Eissen, S. and Stein, B. (2004). Genre classification of web pages. In Biundo, S., Frühwirth, T., and Palm, G., editors, KI 2004: Advances in Artificial Intelligence, pages 256–269, Berlin, Heidelberg. Springer Berlin Heidelberg.
Tausczik, Y. R. and Pennebaker, J. W. (2010). The Psychological Meaning of Words: LIWC and Computerized Text Analysis Methods. Journal of Language and Social Psychology, 29(1):24–54.