A Hybrid Approach for Gender Prediction from Texts in Portuguese

João Pedro M. de Morais; Luiz Henrique de Campos Merschmann

doi:10.5753/sbbd.2021.17865

João Pedro M. de Morais, Universidade Federal de Lavras (UFLA)
Luiz Henrique de Campos Merschmann, Universidade Federal de Lavras (UFLA), http://orcid.org/0000-0002-9948-2673


Abstract

Author Profiling, the field of study and research that analyzes a text to infer information about its author, has become increasingly useful in sectors such as forensics, marketing, and e-commerce. Despite growing research interest in this field, the techniques and tools in the literature that focus on Portuguese are relatively scarce compared with those available for other languages. This work therefore contributes to the field by proposing and evaluating a hybrid approach, which combines a heuristic with a classifier, to predict the gender of the author of a text written in Portuguese using only its textual content.

Keywords: author profiling, gender, text mining, NLP
