On Text Preprocessing for Early Detection of Depression on Social Media

  • José Figueredo (UEFS)
  • Rodrigo Calumby (UEFS)


Depression is a serious public health challenge. Many people who suffer from it turn to social media for information or relief, and the text they produce can support research in this field. However, this raw text is not always suitable for direct use in machine learning. Hence, a comparative analysis of different preprocessing techniques was performed to assess their impact on the effectiveness of early depression detection on social media. The results show that preprocessing increases prediction effectiveness. Moreover, mapping emoticons to the emotion words they represent was decisive not only for improving the model's effectiveness but also for maintaining a balance between different evaluation measures.
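The emoticon mapping mentioned in the abstract can be illustrated with a minimal sketch. The lexicon `EMOTICON_TO_WORD`, the function names `map_emoticons` and `preprocess`, and the extra steps (URL removal, lowercasing) are assumptions made for illustration only; the abstract does not specify which emoticons, emotion words, or additional preprocessing steps the authors used.

```python
import re

# Hypothetical emoticon lexicon: the paper's actual mapping is not given
# in the abstract, so these pairs are illustrative assumptions.
EMOTICON_TO_WORD = {
    ":)": "happy",
    ":-)": "happy",
    ":(": "sad",
    ":-(": "sad",
    ":'(": "crying",
    ":D": "joyful",
    ";)": "playful",
}

# Try longer emoticons first so a three-character emoticon is never
# shadowed by a shorter alternative.
_EMOTICON_PATTERN = re.compile(
    "|".join(
        re.escape(e)
        for e in sorted(EMOTICON_TO_WORD, key=len, reverse=True)
    )
)

_URL_PATTERN = re.compile(r"https?://\S+")


def map_emoticons(text: str) -> str:
    """Replace each known emoticon with its emotion word."""
    replaced = _EMOTICON_PATTERN.sub(
        lambda m: f" {EMOTICON_TO_WORD[m.group(0)]} ", text
    )
    # Collapse the extra whitespace introduced by the padding.
    return " ".join(replaced.split())


def preprocess(text: str) -> str:
    """Strip URLs, map emoticons, then lowercase (assumed pipeline)."""
    text = _URL_PATTERN.sub(" ", text)  # remove URLs before emoticon matching
    text = map_emoticons(text)          # must run before lowercasing (":D")
    return " ".join(text.lower().split())


if __name__ == "__main__":
    print(preprocess("Feeling down today :( http://example.com"))
```

Emoticons are mapped before lowercasing because some of them, such as `:D`, are case-sensitive. A plain substring match like this one can also fire inside ordinary text (for example a colon followed by a capital D), so a real pipeline would likely need stricter token boundaries.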


How to Cite

FIGUEREDO, José; CALUMBY, Rodrigo. On Text Preprocessing for Early Detection of Depression on Social Media. In: SIMPÓSIO BRASILEIRO DE COMPUTAÇÃO APLICADA À SAÚDE (SBCAS), 20., 2020, Evento Online. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2020. p. 84-95. ISSN 2763-8952. DOI: https://doi.org/10.5753/sbcas.2020.11504.