A Cascade Approach for Gender Prediction from Texts in Portuguese Language
Author profiling is a prominent research area in which computational approaches are proposed to predict authors’ characteristics from their texts. Commonly analyzed characteristics include gender, age, personality traits, and occupation. The task is of growing importance, with applications in areas such as forensics, marketing, and e-commerce. Although it has been studied extensively for some widely used languages (e.g., English), studies involving the Portuguese language still leave considerable room for improvement. This work contributes by proposing and evaluating a cascade approach to gender prediction that combines a weighted lexical approach, a heuristic, and a classifier, using only textual content written in Portuguese. The proposed approach considers both the specificities of the Portuguese language and the domain characteristics of the texts. The results show that exploiting these specificities and domain characteristics can improve gender prediction performance.
author profiling, text mining, gender prediction, Portuguese language
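The cascade idea named in the abstract can be illustrated with a minimal sketch in which each stage either commits to a label or defers to the next one. Everything below is an assumption made for illustration: the abstract does not specify the stage order, the lexicon entries and weights, the threshold, or the heuristic rule, and the names `LEXICON`, `lexicon_stage`, `heuristic_stage`, and `cascade_predict` are hypothetical. The only linguistic fact relied on is that Portuguese inflects some words by the speaker's gender (e.g., "obrigado" / "obrigada").

```python
from typing import Callable, Dict, List, Optional

# Hypothetical weighted lexicon (not from the paper):
# positive weights lean "female", negative weights lean "male".
LEXICON: Dict[str, float] = {
    "obrigada": 2.0,
    "obrigado": -2.0,
    "cansada": 1.5,
    "cansado": -1.5,
}


def lexicon_stage(tokens: List[str], threshold: float = 1.0) -> Optional[str]:
    """Sum the lexicon weights; answer only if the score is decisive."""
    score = sum(LEXICON.get(tok, 0.0) for tok in tokens)
    if score >= threshold:
        return "female"
    if score <= -threshold:
        return "male"
    return None  # defer to the next stage


def heuristic_stage(tokens: List[str]) -> Optional[str]:
    """Hypothetical rule: 'sou' ("I am") followed by a word ending in -a or -o."""
    for prev, word in zip(tokens, tokens[1:]):
        if prev == "sou" and len(word) > 3:
            if word.endswith("a"):
                return "female"
            if word.endswith("o"):
                return "male"
    return None  # defer to the next stage


def cascade_predict(text: str, fallback: Callable[[str], str]) -> str:
    """Run the stages in order; the fallback classifier decides if none commits."""
    tokens = text.lower().split()  # naive whitespace tokenization, an assumption
    for stage in (lexicon_stage, heuristic_stage):
        label = stage(tokens)
        if label is not None:
            return label
    return fallback(text)
```

In this sketch the `fallback` argument stands in for any trained classifier exposed as a function from text to label; a text with no decisive lexical or heuristic evidence is passed to it unchanged.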
How to Cite
MORAIS, João Pedro Moreira de; MERSCHMANN, Luiz Henrique de Campos. A Cascade Approach for Gender Prediction from Texts in Portuguese Language. In: BRAZILIAN SYMPOSIUM ON MULTIMEDIA AND THE WEB (WEBMEDIA), 28., 2022, Curitiba. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, p. 151-158.