Call2Instruct: Automated Pipeline for Generating Q&A Datasets from Call Center Recordings for LLM Fine-Tuning

  • Alex Echeverria (UFG)
  • Sávio Salvarino Teles de Oliveira (UFG)
  • Fernando Marques Federson (UFG)

Abstract


The adaptation of Large Language Models (LLMs) to specialized domains critically depends on high-quality instructional datasets. A significant bottleneck exists in generating Question-Answer (Q&A) datasets from noisy, unstructured sources such as call center audio recordings. This work presents Call2Instruct, a novel end-to-end automated pipeline that integrates five sequential modules: audio processing (diarization, noise suppression, ASR), text processing (normalization, anonymization), semantic extraction and vectorization, Q&A dataset generation using embedding-based similarity matching, and dataset validation through LLM fine-tuning.
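The abstract does not detail how the embedding-based similarity matching works, so the following is only a minimal sketch of one plausible reading: each customer question is paired with the agent answer whose embedding is closest by cosine similarity, and pairs below a threshold are discarded. The function names (`cosine`, `match_pairs`), the threshold of 0.8, and the toy three-dimensional vectors are all assumptions made for illustration and are not taken from the paper; in practice the vectors would come from a sentence-embedding model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def match_pairs(questions, answers, threshold=0.8):
    """Pair each question with its most similar answer, if above threshold.

    questions, answers: lists of (text, embedding) tuples.
    The threshold value is a hypothetical default, not the paper's setting.
    """
    pairs = []
    for q_text, q_vec in questions:
        best_text, best_score = None, -1.0
        for a_text, a_vec in answers:
            score = cosine(q_vec, a_vec)
            if score > best_score:
                best_text, best_score = a_text, score
        # Keep only confident matches so noisy turns do not enter the dataset.
        if best_text is not None and best_score >= threshold:
            pairs.append(
                {"question": q_text, "answer": best_text, "score": best_score}
            )
    return pairs


# Toy embeddings invented for illustration only.
questions = [
    ("How do I reset my password?", [0.9, 0.1, 0.0]),
    ("What are your opening hours?", [0.0, 0.2, 0.9]),
]
answers = [
    ("Use the 'Forgot password' link on the login page.", [0.8, 0.2, 0.1]),
    ("Your invoice is sent by email each month.", [0.1, 0.9, 0.1]),
]
dataset = match_pairs(questions, answers)
```

With these toy vectors, only the password question finds an answer above the threshold; the opening-hours question has no sufficiently similar answer and is left out, which is the filtering behavior a threshold is meant to provide.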

Keywords: LLM Fine-Tuning, Call Center Recordings, Q&A Dataset Generation, Automated Pipeline, Large Language Models
Published
04/12/2025
ECHEVERRIA, Alex; OLIVEIRA, Sávio Salvarino Teles de; FEDERSON, Fernando Marques. Call2Instruct: Automated Pipeline for Generating Q&A Datasets from Call Center Recordings for LLM Fine-Tuning. In: ESCOLA REGIONAL DE INFORMÁTICA DE GOIÁS (ERI-GO), 13., 2025, Luziânia/GO. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 388-389. DOI: https://doi.org/10.5753/erigo.2025.17118.