Identificação de Autoria de Documentos Eletrônicos (Authorship Identification of Electronic Documents)
Abstract
The identification of authorship of electronic documents is among the demands of forensic analysis. Data compressors and the Normalized Compression Distance (NCD) are tools that can help the expert perform this task. In this work, the performance of these tools was analyzed on a dataset of 3,000 electronic documents in Brazilian Portuguese from 100 different authors. Correct attribution was achieved in more than 70% of the cases, indicating that this approach may yield promising results. The influence of the number of training documents on the results was also examined.
References
Cilibrasi, R. and Vitányi, P. M. B. (2005) "Clustering by compression". In IEEE Transactions on Information Theory, 51(4), pp. 1523–1545.
Kolmogorov, A. N. (1965) "Three approaches to the quantitative definition of information". In Problems Inform. Transmission, 1, pp. 1–7.
Li, M. and Vitányi, P. M. B. (1997) "An Introduction to Kolmogorov Complexity and Its Applications", Springer, 2nd edition.
Burrows, M. and Wheeler, D. J. (1994) "A block-sorting lossless data compression algorithm". In Technical Report 124, Digital Systems Research Center, Palo Alto.
Malyutov, M. B., Wickramasinghe, C. I. and Li, S. (2007) "Conditional Complexity of Compression for Authorship Attribution". In SFB 649 Discussion Paper No. 57, Humboldt University, Berlin.
Pinker, S. (2007) "The stuff of thought: language as a window into human nature", Viking Adult, 1st edition.
Shkarin, D. (2002) "PPM: One step to practicality". In Proceedings of the Data Compression Conference, April 2-4, IEEE Computer Society, Washington, DC, USA, pp. 202–211.
Stamatatos, E. (2009) "A survey of modern authorship attribution methods". In Journal of the American Society for Information Science and Technology, 60(3), pp. 538–556.
Varela, P. J. (2010) "O uso de atributos estilométricos na identificação da autoria de textos". Master's dissertation, Graduate Program in Applied Informatics, Pontifícia Universidade Católica do Paraná, Brazil.
Published
2012-11-19
How to Cite
OLIVEIRA JR., Walter R. de; OLIVEIRA, Luiz E. S.; JUSTINO, Edson J. R. Identificação de Autoria de Documentos Eletrônicos. In: BRAZILIAN SYMPOSIUM ON CYBERSECURITY (SBSEG), 12., 2012, Curitiba. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2012. p. 277-287. DOI: https://doi.org/10.5753/sbseg.2012.20552.
