A comparative study of stylometric and textual approaches to authorship attribution in school assignments

Daniel Cirne Vilas-Boas dos Santos; Cleber Zanchettin

doi:10.5753/sbie.2021.217413

Daniel Cirne Vilas-Boas dos Santos, Universidade Federal de Pernambuco
Cleber Zanchettin, Universidade Federal de Pernambuco, https://orcid.org/0000-0001-6421-9747

Abstract

The growing volume of digital documents, together with their use in assessing learning, calls for computational tools for understanding and analyzing authorship. The literature proposes distinguishing authors by writing style and keywords. However, these studies are not set in an educational context and deal mostly with English. This paper differs by exploring authorship verification on a dataset of pedagogical activities written in Portuguese. Because of the small number of examples, we use robust journalistic datasets as a reference. Our experiments show that, in restricted domains, representations based on style features outperform textual approaches, which are influenced by topic in broader corpora. The Extremely Randomized Trees model combined with the proposed style features outperformed the other models on all datasets used, reaching an average accuracy of 70% and an AUC of 0.81.

Keywords: Stylometry, Authorship attribution, NLP
