Comparative Evaluation of Class Balancing Strategies for Depression Detection in Reddit Posts
Abstract
Real-world health datasets are usually imbalanced, which makes it difficult to build effective predictive models. This work evaluates the impact of balancing techniques on the detection of signs of depression in Reddit posts, using the eRisk 2017 dataset. Four sampling strategies and a baseline without resampling were tested, applied to Random Forest (RF), XGBoost, and GLM models. The metrics analyzed were sensitivity, specificity, and AUC-ROC. Partial undersampling reached an AUC-ROC of 0.72 with GLM (lambda = 1), while the combination with oversampling obtained 0.68 with RF (mtry = 126). The results reinforce the importance of balancing techniques for prediction in mental health.
References
Abdulsadig, R. S. and Rodriguez-Villegas, E. (2024). A comparative study in class imbalance mitigation when working with physiological signals. Front. Digit. Health, pages 1–11.
Almeida, H., Briand, A., and Meurs, M.-J. (2017). Detecting Early Risk of Depression from Social Media User-generated Content. pages 1–12.
Araf, I., Idri, A., and Chairi, I. (2024). Cost-sensitive learning for imbalanced medical data: a review. Artificial Intelligence Review, 57(80):1–72.
de Oliveira Melo, W. E. and Cortes, O. A. C. (2021). Utilizando Análise de Sentimentos e SVM na Classificação de Tweets Depressivos. XI Computer on the Beach.
dos Santos, H. G. (2018). Comparação da performance de algoritmos de machine learning para a análise preditiva em saúde pública e medicina. PhD thesis, Universidade de São Paulo (USP), São Paulo.
ERISK (2017). eRisk 2017: Early risk prediction on the Internet: experimental foundations. [link].
Errecalde, M. L., Villegas, P., Funez, D. G., Uelay, J. G., and Cagnina, L. C. (2017). Temporal Variation of Terms as concept space for early risk prediction. pages 1–12.
Farías-Anzaldúa, A. A., y Gómez, M. M., López-Monroy, A. P., and González-Gurrola, L. C. (2017). UACH-INAOE participation at eRisk2017. pages 1–8.
Gentili, E., Franchini, G., Zese, R., Alberti, M., Ferrara, M., Domenicano, I., and Grassi, L. (2024). Machine learning from real data: A mental health registry case study. Computer Methods and Programs in Biomedicine Update, 5(100132):1–10.
Islam, M. R., Kabir, M. A., Ahmed, A., Kamal, A. R. M., Wang, H., and Ulhaq, A. (2018). Depression detection from social network data using machine learning techniques. Health Information Science and Systems, 6(8):1–12.
Junior, E. S. S., de Melo, J. A. B., da Silva, A. P., de A. Silva, T., de C. Chaves, A. P., de Souza, A. F., de S. G. júnior, J., and do N. Santana, S. (2022). Depression among adolescents who frequently use social networks: a literature review. 8(3):18838–18851.
Kim, J., Lee, D., and Park, E. (2021). Machine Learning for Mental Health in Social Media: Bibliometric Study. Journal of Medical Internet Research, 23(3):1–17.
Lin, W.-J. and Chen, J. J. (2012). Class-imbalanced classifiers for high-dimensional data. Oxford University Press, 14(1):13–26.
Losada, D. E., Crestani, F., and Parapar, J. (2017). CLEF 2017 eRisk Overview: Early Risk Prediction on the Internet: Experimental Foundations. pages 1–18.
Malam, I. A., Arziki, M., Bellazrak, M. N., Benamara, F., Kaidi, A. E., Es-Saghir, B., He, Z., Housni, M., Moriceau, V., Mothe, J., and Ramiandrisoa, F. (2017). IRIT at e-Risk. pages 1–7.
Mena, L. J., Orozco, E. E., Felix, V. G., Ostos, R., Melgarejo, J., and Maestre, G. E. (2012). Machine learning approach to extract diagnostic and prognostic thresholds: application in prognosis of cardiovascular mortality. Computational and Mathematical Methods in Medicine, pages 1–6.
Morais, E. A. M. and Ambrósio, A. P. L. (2007). Mineração de Textos. [link].
Mussio, R. A. P. (2019). A geração Z e suas respostas comportamental e emotiva nas redes sociais virtuais. European psychiatry, 3(3):204–217.
Nadaraja, R. and Yazdanifard, R. (2014). Social Media Marketing: Advantages and Disadvantages. Social Media Marketing, pages 1–10.
OPAS (2017). Com depressão no topo da lista de causas de problemas de saúde, OMS lança a campanha “Vamos conversar”. [link].
Ostojic, D., Lalousis, P. A., Donohoe, G., and Morris, D. W. (2024). The challenges of using machine learning models in psychiatric research and clinical practice. European Neuropsychopharmacology, 88:53–65.
Rahman, R. A., Omar, K., Noah, S. A. M., and Danuri, M. S. N. M. (2018). A Survey on Mental Health Detection in Online Social Network. International Journal on Advanced Science Engineering Information Technology, 8(4-2):1431–1436.
Sadeque, F., Xu, D., and Bethard, S. (2017). UArizona at the CLEF eRisk 2017 Pilot Task: Linear and Recurrent Models for Early Depression Detection. pages 1–9.
Sampasa-Kanyinga, H. and Hamilton, H. (2015). Social networking sites and mental health problems in adolescents: The mediating role of cyberbullying victimization. European psychiatry, 30(8):1021–1027.
Souza, K. and da Cunha, M. X. C. (2019). Impacts of the use of virtual social networks on adolescents’ mental health: A systematic review of literature. Revista Educação, Psicologia e Interfaces, 3(3):204–217.
Trotzek, M., Koitka, S., and Friedrich, C. M. (2017). Linguistic Metadata Augmented Classifiers at the CLEF 2017 Task for Early Detection of Depression - FHDO Biomedical Computer Science Group (BCSG). pages 1–17.
Villatoro-Tello, E., de-la Rosa, G. R., and Jiménez-Salazar, H. (2017). UAM’s participation at CLEF eRisk 2017 task: Towards modelling depressed bloggers. pages 1–9.
Visentini, C., Cassidy, M., Bird, V. J., and Priebe, S. (2018). Social networks of patients with chronic depression: A systematic review. Journal of Affective Disorders, 241:571–578.
WHO (2023). Depressive disorder (depression). [link].
Zhang, D., Yin, C., Zeng, J., Yuan, X., and Zhang, P. (2020). Combining structured and unstructured data for predictive models: a deep learning approach. BMC Medical Informatics and Decision Making, 20(280):1–11.
Published
29 September 2025
How to Cite
FONTENELE, Thallyson G. M. C.; SOUZA, Bruno Feres de. Comparative Evaluation of Class Balancing Strategies for Depression Detection in Reddit Posts. In: ENCONTRO NACIONAL DE INTELIGÊNCIA ARTIFICIAL E COMPUTACIONAL (ENIAC), 22., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 1682-1693. ISSN 2763-9061. DOI: https://doi.org/10.5753/eniac.2025.13879.
