Comparison of Stylometric Attributes for Writing Authorship Identification: A Case Study of Guimarães Rosa versus Clarice Lispector

  • Raido Galina Instituto Federal do Espírito Santo
  • Diego Flores Instituto Federal do Espírito Santo
  • Karin Komati Instituto Federal do Espírito Santo

Abstract


When writers express themselves, they must choose among a wealth of options, such as which words and expressions to use or how to punctuate their writing. These choices define a writer's individual characteristics, and stylometry is the quantitative study of such writing style. This paper aims to attribute books to two writers with well-defined writing styles, Guimarães Rosa and Clarice Lispector, by means of lexical attributes found in their texts: letter frequency, word frequency, and TF-IDF. Attributes are compared using Euclidean distance, cosine similarity, and the Jaccard similarity index. The results show that using the set of words with the Jaccard similarity index made it possible to separate the books according to authorship.

Keywords: lexical attribute, Euclidean distance, cosine similarity, Jaccard similarity
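
As a rough illustration of the three comparison measures named in the abstract, the sketch below computes Euclidean distance and cosine similarity over relative-frequency profiles, and the Jaccard index over word sets. It is not the authors' implementation: the tokenization rule, the function names, and the choice to use relative frequencies are all assumptions made for this example, and TF-IDF weighting is left out.

```python
import math
import re
from collections import Counter


def words(text):
    # Assumed tokenization: lowercase runs of letters (accented letters included).
    return re.findall(r"[^\W\d_]+", text.lower())


def letter_freq(text):
    # Relative frequency of each alphabetic character.
    letters = [ch for ch in text.lower() if ch.isalpha()]
    total = len(letters)
    return {ch: n / total for ch, n in Counter(letters).items()} if total else {}


def word_freq(text):
    # Relative frequency of each word token.
    tokens = words(text)
    total = len(tokens)
    return {w: n / total for w, n in Counter(tokens).items()} if total else {}


def euclidean(p, q):
    # Distance between two sparse profiles; missing keys count as 0.
    keys = set(p) | set(q)
    return math.sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys))


def cosine(p, q):
    # Cosine similarity between two sparse profiles; 0.0 if either is empty.
    dot = sum(v * q.get(k, 0.0) for k, v in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def jaccard(a, b):
    # Size of the intersection divided by size of the union of two sets.
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Under these assumptions, two books `text_a` and `text_b` (hypothetical variables holding the full text of each book) could be compared with `jaccard(words(text_a), words(text_b))`, `cosine(word_freq(text_a), word_freq(text_b))`, or `euclidean(letter_freq(text_a), letter_freq(text_b))`. Note that Euclidean distance is a dissimilarity (smaller means closer), while cosine and Jaccard are similarities (larger means closer).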

Published
2019-10-15
GALINA, Raido; FLORES, Diego; KOMATI, Karin. Comparison of Stylometric Attributes for Writing Authorship Identification: A Case Study of Guimarães Rosa versus Clarice Lispector. In: NATIONAL MEETING ON ARTIFICIAL AND COMPUTATIONAL INTELLIGENCE (ENIAC), 16., 2019, Salvador. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2019. p. 353-364. ISSN 2763-9061. DOI: https://doi.org/10.5753/eniac.2019.9297.