Multimodal Summarization of Clinical Dialogues in Digital Primary Care: Integrating Text Messages and Audio
Abstract
Instant messaging platforms in digital health have increased the volume of interactions, making the management and retrieval of clinical information a central challenge in digital primary care. Although automatic summarization of text-based dialogues with Large Language Models (LLMs) has been explored, a substantial portion of these exchanges occurs through audio messages. In this work, we propose a multimodal pipeline that integrates speech and text for LLM-based dialogue summarization. We investigate (i) how to automatically extract clinically relevant information from audio messages of varying quality and (ii) how this integration affects summary quality. The methodology was developed using 706 real-world audio messages, a manually annotated dataset, and classifiers that filter out inadequate transcriptions. Results show that incorporating audio messages enriches the summaries by increasing contextualization and the level of clinical detail.
References
Anibal, J., Huth, Wood, B., et al. (2025). Voice EHR: introducing multimodal audio data for health. Frontiers in Digital Health.
Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: synthetic minority over-sampling technique. J. Artif. Int. Res.
Esquivel, P., Gill, K., Goldberg, M., Sundaram, S. A., Morris, L., and Ding, D. (2024). Voice assistant utilization among the disability community for independent living: A rapid review of recent evidence. Human Behavior and Emerging Technologies.
Ferreira, A. A., Rocha, L., et al. (2025). A comprehensive qualitative analysis of patient dialogue summarization using large language models applied to noisy, informal, non-English real-world data. Scientific Reports.
Hone, T., Rasella, D., Barreto, M. L., Majeed, A., and Millett, C. (2017). Association between expansion of primary healthcare and racial inequalities in mortality amenable to primary care in Brazil: a national longitudinal analysis. PLoS Medicine.
Keszthelyi, D., Gaudet-Blavignac, C., Bjelogrlic, M., and Lovis, C. (2023). Patient information summarization in clinical settings: Scoping review. JMIR Medical Informatics.
Liu, S., McCoy, A. B., Wright, A., et al. (2024). Leveraging large language models for generating responses to patient messages-a subjective analysis. JAMIA.
Published
2026-06-01
How to Cite
REIS, Davi; FERREIRA, Anderson A.; CUNHA, Washington; MACUL, Victor; NETO, Olivio; ALMEIDA, Jussara; ROCHA, Leonardo; GONÇALVES, Marcos André. Multimodal Summarization of Clinical Dialogues in Digital Primary Care: Integrating Text Messages and Audio. In: BRAZILIAN SYMPOSIUM ON COMPUTING APPLIED TO HEALTH (SBCAS), 26., 2026, Ouro Preto/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2026. p. 1367-1372. ISSN 2763-8952. DOI: https://doi.org/10.5753/sbcas.2026.21379.
