Authorship Attribution with Temporal Data in Reddit
Abstract
Context: The convenience of smartphones has, in recent years, led to greater interaction through online social networks. Problem: Social networks can influence users both positively and negatively; one of the negative impacts is the spread of fake news. In this context, identifying the true source of a piece of information, or whether the information is true at all, becomes extremely relevant. Solution: This paper presents an approach to authorship attribution that combines text mining and temporal analysis techniques. IS Theory: This work falls under Social Network Theory, in particular user interaction through a forum network model, in which each post creates a comment thread and users may or may not reply within the thread. Method: This work is a controlled experiment that extends a previous case study, which classified among two to ten authors. The results were validated through a quantitative approach. Summary of Results: Among 10 authors, classification accuracy exceeded 97%, with the chars feature exceeding 99%; among 100 authors, all features achieved more than 70% accuracy. Contributions and Impact in the IS area: The main contribution of this work is to validate authorship attribution in a big data context, using significant features and a robust classifier model.
Keywords:
Online social media, Authorship analysis, Text mining, Temporal data
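As a rough illustration of the idea summarized in the abstract (combining textual and temporal signals to attribute a comment to an author), the sketch below builds one profile per author from character trigram counts plus the hour of day at which each comment was posted, and attributes a new comment to the most similar profile. Everything here is an assumption made for illustration: the trigram size, the hour-of-day feature, the cosine-similarity nearest-profile rule, the `created_utc` field name, and the toy comments are hypothetical, and the abstract does not say which features or classifier the authors actually used.

```python
import math
from collections import Counter
from datetime import datetime, timezone


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a lowercased text."""
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def features(text, created_utc, n=3):
    """Combine character n-gram counts with an hour-of-day indicator (assumed features)."""
    vec = Counter({("char", g): c for g, c in char_ngrams(text, n).items()})
    hour = datetime.fromtimestamp(created_utc, tz=timezone.utc).hour
    vec[("hour", hour)] += 1
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(comments):
    """Sum per-comment feature vectors into one profile per author."""
    profiles = {}
    for author, text, created_utc in comments:
        profiles.setdefault(author, Counter()).update(features(text, created_utc))
    return profiles


def predict(profiles, text, created_utc):
    """Return the author whose profile is most similar to the comment."""
    vec = features(text, created_utc)
    return max(profiles, key=lambda author: cosine(vec, profiles[author]))


# Hypothetical toy data: (author, comment text, Unix timestamp in UTC).
comments = [
    ("alice", "honestly i think the patch notes are fine", 1577869200),  # 09:00 UTC
    ("alice", "honestly the new update is fine i think", 1577872800),    # 10:00 UTC
    ("bob", "LOL no way!!! that is wild!!!", 1577916000),                # 22:00 UTC
    ("bob", "no way!!! LOL that was wild", 1577919600),                  # 23:00 UTC
]

profiles = train(comments)
print(predict(profiles, "i think it is fine honestly", 1577869200))
```

A real experiment at the scale described in the abstract (10 to 100 authors) would need far more comments per author, a held-out test set, and a proper classifier; the nearest-profile rule is used here only because it keeps the sketch short and dependency-free.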
References
A. Abbasi and H. Chen. 2005. Applying authorship analysis to extremist-group Web forum messages. IEEE Intelligent Systems 20, 5 (Sep. 2005), 67–75. https://doi.org/10.1109/MIS.2005.81
Charu C Aggarwal. 2018. Machine learning for text. Springer.
Jafar Albadarneh, Bashar Talafha, Mahmoud Al-Ayyoub, Belal Zaqaibeh, Mohammad Al-Smadi, Yaser Jararweh, and Elhadj Benkhelifa. 2015. Using Big Data Analytics for Authorship Authentication of Arabic Tweets. In Proceedings of the 8th International Conference on Utility and Cloud Computing (Limassol, Cyprus) (UCC ’15). IEEE Press, Piscataway, NJ, USA, 448–452. http://dl.acm.org/citation.cfm?id=3233397.3233483
Hosein Azarbonyad, Mostafa Dehghani, Maarten Marx, and Jaap Kamps. 2015. Time-Aware Authorship Attribution for Short Text Streams. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval (Santiago, Chile) (SIGIR ’15). ACM, New York, NY, USA, 727–730. https://doi.org/10.1145/2766462.2767799
R. Banga and P. Mehndiratta. 2017. Authorship attribution for textual data on online social networks. In 2017 Tenth International Conference on Contemporary Computing (IC3). 1–7. https://doi.org/10.1109/IC3.2017.8284311
Jason Baumgartner. 2019. Pushshift Reddit comments. https://files.pushshift.io/reddit/comments/
Guilherme Ramos Casimiro and Luciano Antonio Digiampietri. 2020. Authorship Attribution using data from Reddit forum. In XVI Brazilian Symposium on Information Systems. 1–8.
S. H. H. Ding, B. C. M. Fung, F. Iqbal, and W. K. Cheung. 2019. Learning Stylometric Representations for Authorship Analysis. IEEE Transactions on Cybernetics 49, 1 (Jan 2019), 107–121. https://doi.org/10.1109/TCYB.2017.2766189
S. E. M. El Bouanani and I. Kassou. 2013. Using lexicometry and vocabulary analysis techniques to detect a signature for web profile. In 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2013). 1494–1498. https://doi.org/10.1145/2492517.2558568
Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep learning. MIT Press.
Oren Halvani, Christian Winter, and Lukas Graner. 2017. On the Usefulness of Compression Models for Authorship Verification. In Proceedings of the 12th International Conference on Availability, Reliability and Security (Reggio Calabria, Italy) (ARES ’17). ACM, New York, NY, USA, Article 54, 10 pages. https://doi.org/10.1145/3098954.3104050
Moshe Koppel, Jonathan Schler, Shlomo Argamon, and Eran Messeri. 2006. Authorship Attribution with Thousands of Candidate Authors. In Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (Seattle, Washington, USA) (SIGIR ’06). ACM, New York, NY, USA, 659–660. https://doi.org/10.1145/1148170.1148304
R. Layton, P. Watters, and R. Dazeley. 2010. Authorship Attribution for Twitter in 140 Characters or Less. In 2010 Second Cybercrime and Trustworthy Computing Workshop. 1–8. https://doi.org/10.1109/CTC.2010.17
Hoi Le and Reihaneh Safavi-Naini. 2018. On De-anonymization of Single Tweet Messages. In Proceedings of the Fourth ACM International Workshop on Security and Privacy Analytics (Tempe, AZ, USA) (IWSPA ’18). ACM, New York, NY, USA, 8–14. https://doi.org/10.1145/3180445.3180451
Tatiana Litvinova, Olga Litvinova, and Polina Panicheva. 2019. Authorship Attribution of Russian Forum Posts with Different Types of N-gram Features. In Proceedings of the 2019 3rd International Conference on Natural Language Processing and Information Retrieval (Tokushima, Japan) (NLPIR 2019). ACM, New York, NY, USA, 9–14. https://doi.org/10.1145/3342827.3342834
Brendan O'Connor, Ramnath Balasubramanyan, Bryan R Routledge, and Noah A Smith. 2010. From tweets to polls: Linking text sentiment to public opinion time series. In Fourth International AAAI Conference on Weblogs and Social Media.
C. Perez, B. Birregah, R. Layton, M. Lemercier, and P. Watters. 2013. REPLOT: Retrieving profile links on Twitter for suspicious networks detection. In 2013 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM 2013). 1307–1314. https://doi.org/10.1145/2492517.2500234
S. Petrasova, N. Khairova, and W. Lewoniewski. 2018. Building the Semantic Similarity Model for Social Network Data Streams. In 2018 IEEE Second International Conference on Data Stream Mining Processing (DSMP). 21–24. https://doi.org/10.1109/DSMP.2018.8478480
S. R. Pillay and T. Solorio. 2010. Authorship attribution of web forum posts. In 2010 eCrime Researchers Summit. 1–7. https://doi.org/10.1109/ecrime.2010.5706693
S. E. Seker, K. Al-Naami, and L. Khan. 2013. Author attribution on streaming data. In 2013 IEEE 14th International Conference on Information Reuse Integration (IRI). 497–503. https://doi.org/10.1109/IRI.2013.6642511
M. Spitters, F. Klaver, G. Koot, and M. v. Staalduinen. 2015. Authorship Analysis on Dark Marketplace Forums. In 2015 European Intelligence and Security Informatics Conference. 1–8. https://doi.org/10.1109/EISIC.2015.47
Efstathios Stamatatos. 2009. A survey of modern authorship attribution methods. Journal of the American Society for Information Science and Technology 60, 3 (2009), 538–556.
M. Sultana, P. Polash, and M. Gavrilova. 2017. Authorship recognition of tweets: A comparison between social behavior and linguistic profiles. In 2017 IEEE International Conference on Systems, Man, and Cybernetics (SMC). 471–476. https://doi.org/10.1109/SMC.2017.8122650
R. H. R. Tan and F. S. Tsai. 2010. Authorship Identification for Online Text. In 2010 International Conference on Cyberworlds. 155–162. https://doi.org/10.1109/CW.2010.50
J. Yan and S. J. Matthews. 2016. Applying clustering algorithms to determine authorship of Chinese Twitter messages. In 2016 IEEE MIT Undergraduate Research Technology Conference (URTC). 1–4. https://doi.org/10.1109/URTC.2016.8361150
Published
16/05/2022
How to Cite
CASIMIRO, Guilherme Ramos; DIGIAMPIETRI, Luciano Antonio. Authorship Attribution with Temporal Data in Reddit. In: SIMPÓSIO BRASILEIRO DE SISTEMAS DE INFORMAÇÃO (SBSI), 18., 2022, Curitiba. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2022.