Lince: a framework for writing-style obfuscation in text

Abstract


Anonymity is essential to the physical safety of journalists and whistleblowers who submit reports over the Internet. Approaches for achieving online anonymity exist today, but anonymous users can still be identified by their writing style. As research in natural language processing advances, classifiers have become increasingly likely to identify the author of a text correctly. At the same time, new approaches for automatically generating obfuscated text have emerged to counter adversaries of anonymity on the Internet. In this work, we aim to design a framework for text authorship obfuscation. To that end, we evaluate two approaches to authorship obfuscation and propose improvements that raise the quality of the text the obfuscators generate and make them easier for non-technical users to use. These improvements yielded an increase of up to 20% in the quality of the generated sentences while keeping the adversary's success rate below chance level.

Keywords: authorship obfuscation, privacy, natural language processing

Published
4 October 2021
FRANCO, Antônio M. R.; CUNHA, Ítalo F. S.; OLIVEIRA, Leonardo B. Lince: um arcabouço para ofuscação de estilo de escrita em texto. In: SIMPÓSIO BRASILEIRO DE SEGURANÇA DA INFORMAÇÃO E DE SISTEMAS COMPUTACIONAIS (SBSEG), 21., 2021, Belém. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2021. p. 225-238. DOI: https://doi.org/10.5753/sbseg.2021.17318.