Synthetic Data for Mental Health: A Comparative Analysis of LLMs, BERT, and Copy-Based Augmentation

  • Matheus Yasuo Ribeiro Utino USP
  • Elton H. Matsushima UFF
  • Aline Paes UFF
  • Paulo Mann UFRJ

Abstract


Depression screening through social media has emerged as a promising research avenue; however, the scarcity of annotated data remains a significant barrier to effective model training. In this work, we evaluate several textual data augmentation strategies for screening users for depression from Brazilian Portuguese Instagram posts. We explore three techniques of increasing complexity: simple post duplication, contextual word substitution using BERT-based models, and synthetic post generation via Large Language Models (LLMs), the last both with and without modulation by psychometric data from the Beck Depression Inventory-II (BDI-II). Experiments were conducted under both Single-Instance Learning (SIL) and Multiple-Instance Learning (MIL) frameworks, using multilingual sentence embeddings and an XGBoost classifier. Results reveal statistically significant differences among augmentation strategies, with LLM-based generation without BDI-II modulation achieving the highest performance. Contextual substitution proved to be a competitive and computationally efficient alternative. In contrast, psychometric modulation reduced model effectiveness, suggesting that artificially aligning emotional tone may compromise data quality. These findings underscore the importance of semantically coherent augmentation for sensitive applications in mental health. Code and supplementary material: https://github.com/Matheusutino/depression-data-augmentation.
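
As a rough illustration of the simplest strategy named in the abstract (post duplication) and of how SIL and MIL differ in what counts as a training instance, the following minimal Python sketch may help. It is not the authors' implementation: the function names, the `toy_embed` stand-in for a multilingual sentence encoder, the reading of SIL as "each post inherits the user's label", and the use of mean pooling for the MIL bag are all assumptions made for illustration. The classifier step (XGBoost in the paper) is omitted.

```python
import random
from statistics import fmean


def duplicate_posts(posts, factor, seed=0):
    """Copy-based augmentation: keep the originals and append randomly
    re-sampled copies until the list is `factor` times its original size."""
    rng = random.Random(seed)
    extra = [rng.choice(posts) for _ in range(len(posts) * (factor - 1))]
    return list(posts) + extra


def toy_embed(text, dim=4):
    """Hypothetical stand-in for a sentence encoder (assumption): a
    deterministic character-based vector, used only to keep the sketch
    self-contained."""
    vec = [0.0] * dim
    for i, ch in enumerate(text):
        vec[i % dim] += ord(ch) / 1000.0
    return vec


def sil_instances(user_posts, label):
    """SIL (assumed reading): every post is its own training instance
    and inherits the user-level label."""
    return [(toy_embed(p), label) for p in user_posts]


def mil_instance(user_posts, label):
    """MIL (simplified assumption): the user's posts form one bag, pooled
    here by an element-wise mean into a single user-level vector."""
    embeddings = [toy_embed(p) for p in user_posts]
    pooled = [fmean(column) for column in zip(*embeddings)]
    return pooled, label


if __name__ == "__main__":
    posts = ["hoje foi um dia difícil", "não consegui dormir", "saudades"]
    augmented = duplicate_posts(posts, factor=2)
    print(len(sil_instances(augmented, label=1)))  # one instance per post
    print(len(mil_instance(augmented, label=1)[0]))  # one pooled vector per user
```

In this sketch, duplication changes the number of SIL instances directly, while under MIL it only shifts the pooled user vector toward the re-sampled posts; whether the paper's pipeline behaves the same way depends on its actual aggregation, which the abstract does not specify.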

Published
29/09/2025
UTINO, Matheus Yasuo Ribeiro; MATSUSHIMA, Elton H.; PAES, Aline; MANN, Paulo. Synthetic Data for Mental Health: A Comparative Analysis of LLMs, BERT, and Copy-Based Augmentation. In: BRAZILIAN CONFERENCE ON INTELLIGENT SYSTEMS (BRACIS), 35., 2025, Fortaleza/CE. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 393-408. ISSN 2643-6264.