Identification of Brazilian sexual predators in textual conversations on the internet through machine learning
Keywords: PAN-2012, Sexual Predator Identification, Machine Learning, Convolutional Neural Networks, Support Vector Machine, Decision Tree, Naïve Bayes, Random Forests, Social Networks, Chats
Nowadays, a large number of children and adolescents use social applications. These applications are easy to access and provide benefits and opportunities. At the same time, however, they expose users to various risks, including predatory sexual activity, whose purposes include obtaining child pornography, extortion, and sexual abuse. The present work has three main objectives: (i) to create a data set of textual conversations in Brazilian Portuguese containing real predatory sexual activity; (ii) to perform a statistical analysis of the data set created; and (iii) to carry out an experimental evaluation, on this data set, of the machine learning algorithms most popular in this research domain. The evaluation uses the F1 measure as its basis. The results of contributions (i) and (ii) enable new studies on the problem of identifying sexual predators in textual conversations in Brazilian Portuguese. The results of contribution (iii) show that Support Vector Machines performed best among the algorithms considered, reaching 89.87%.
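For readers unfamiliar with the F1 measure used as the basis of the evaluation, the following is a minimal sketch of how it is computed for a binary task: the harmonic mean of precision and recall on the positive class. The function name, the label encoding (1 for a predatory conversation, 0 otherwise), and the toy labels are assumptions made for illustration; they are not taken from the data set or from the authors' code.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)  # true positives
    fp = sum(1 for t, p in pairs if t != positive and p == positive)  # false positives
    fn = sum(1 for t, p in pairs if t == positive and p != positive)  # false negatives
    if tp == 0:
        # No correct positive prediction: F1 is taken as 0 by convention.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical labels (assumption): 1 = predatory conversation, 0 = regular one.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
score = f1_score(y_true, y_pred)
print(f"F1 = {score:.4f}")
```

Because F1 ignores true negatives, it is a common choice when the positive class is rare, as predatory conversations typically are relative to regular ones.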