Evolution of threats in Dark Web and Surface Web forums: a study based on topic modeling and time series
Abstract
This work investigates the temporal evolution of discussions about cyber threats in Dark Web and Surface Web forums between 2015 and 2024. By analyzing over 52,000 posts with text preprocessing and Latent Dirichlet Allocation (LDA) topic modeling, the study identifies trends, seasonal patterns, differences between the two environments, and dynamics within online communities. The analysis showed that Surface Web forums exhibited high topic variability. In contrast, the Portuguese-speaking Dark Web was dominated by the commercialization of personal data, while the English-speaking Dark Web consistently maintained technical offensive topics, such as phishing and malware creation.
References
Avanzi, B., Tan, X., Taylor, G., and Wong, B. (2023). On the evolution of data breach reporting patterns and frequency in the United States: a cross-state analysis. arXiv preprint arXiv:2310.04786.
Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022.
Cascavilla, G. (2025). The rise of cybercrime and cyber-threat intelligence: Perspectives and challenges from law enforcement. IEEE Security & Privacy, 23(1):17–26.
Cimpanu, C. (2020). University of Utah pays $457,000 to ransomware gang. Accessed: 12-04-2023.
Crawly (2021). O que é crawler e como funcionam os robôs para coleta de dados. Accessed: 25-10-2024.
de Jesus Filho, S. A. (2024). Identificação de posts maliciosos na dark web utilizando aprendizado de máquina supervisionado. Master's thesis, Universidade Federal de Uberlândia, Uberlândia, Brazil. Advisor: Rodrigo Sanches Miani.
Fu, T., Abbasi, A., and Chen, H.-c. (2010). A focused crawler for dark web forums. JASIST, 61:1213–1231.
Hickman, L., Thapa, S., Tay, L., Cao, M., and Srinivasan, P. (2022). Text preprocessing for text mining in organizational research: Review and recommendations. Organizational Research Methods, 25(1):114–146.
Kavallieros, D., Myttas, D., Kermitsis, E., Lissaris, E., Giataganas, G., and Darra, E. (2021). Understanding the Dark Web, pages 3–26. Springer International Publishing, Cham.
Koloveas, P., Chantzios, T., Alevizopoulou, S., Skiadopoulos, S., and Tryfonopoulos, C. (2021). inTIME: A machine learning-based framework for gathering and leveraging web data to cyber-threat intelligence. Electronics, 10(7).
Kühn, P., Wittorf, K., and Reuter, C. (2024). Navigating the shadows: Manual and semi-automated evaluation of the dark web for cyber threat intelligence. IEEE Access, 12:118903–118922.
Labs, F. (2024). Pesquisa de ameaças da Fortinet descobre que os cibercriminosos estão explorando novas vulnerabilidades do setor 43% mais rápido do que no 1º semestre de 2023. Accessed: 13-04-2025.
Liakos, P., Ntoulas, A., Labrinidis, A., and Delis, A. (2015). Focused crawling for the hidden web. World Wide Web, 19.
Najork, M. (2009). Web Crawler Architecture, pages 3462–3465. Springer US, Boston, MA.
Nunes, E., Diab, A., Gunn, A., Marin, E., Mishra, V., Paliath, V., Robertson, J., Shakarian, J., Thart, A., and Shakarian, P. (2016). Darknet and deepnet mining for proactive cybersecurity threat intelligence. In 2016 IEEE Conference on Intelligence and Security Informatics (ISI), pages 7–12.
Rahman, M. R., Hezaveh, R. M., and Williams, L. (2023). What are the attackers doing now? automating cyberthreat intelligence extraction from text on pace with the changing threat landscape: A survey. ACM Comput. Surv., 55(12).
Sapienza, A., Bessi, A., Damodaran, S., Shakarian, P., Lerman, K., and Ferrara, E. (2017). Early warnings of cyber threats in online discussions. In 2017 IEEE International Conference on Data Mining Workshops (ICDMW), pages 667–674.
Sarkar, S., Almukaynizi, M., Shakarian, J., and Shakarian, P. (2018). Predicting enterprise cyber incidents using social network analysis on the darkweb hacker forums.
Sun, N., Ding, M., Jiang, J., Xu, W., Mo, X., Tai, Y., and Zhang, J. (2023). Cyber threat intelligence mining for proactive cybersecurity defense: A survey and new perspectives. IEEE Communications Surveys & Tutorials, 25(3):1748–1774.
Syed, S. and Spruit, M. (2017). Full-text or abstract? examining topic coherence scores using latent dirichlet allocation. In 2017 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 165–174.
Tong, Z. and Zhang, H. (2016). A text mining research based on LDA topic modelling. In International conference on computer science, engineering and information technology, pages 201–210.
Wagner, T. D., Mahbub, K., Palomar, E., and Abdallah, A. E. (2019). Cyber threat intelligence sharing: Survey and research directions. Computers & Security, 87:101589.
Řehůřek, R. (2024). What is gensim. Accessed: 27-04-2025.
Published
2025-09-01
How to Cite
PEREIRA, Miguel Henrique de Brito; JESUS FILHO, Sebastião Alves de; GABRIEL, Paulo Henrique Ribeiro; MIANI, Rodrigo Sanches. Evolution of threats in Dark Web and Surface Web forums: a study based on topic modeling and time series. In: BRAZILIAN SYMPOSIUM ON CYBERSECURITY (SBSEG), 25., 2025, Foz do Iguaçu/PR. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 401-416. DOI: https://doi.org/10.5753/sbseg.2025.11375.
