Call2Instruct: Automated Pipeline for Generating Q&A Datasets from Call Center Recordings for LLM Fine-Tuning
Abstract
Adapting Large Language Models (LLMs) to specialized domains depends critically on high-quality instructional datasets. Generating Question-Answer (Q&A) datasets from noisy, unstructured sources such as call center audio recordings remains a significant bottleneck. This work presents Call2Instruct, a novel end-to-end automated pipeline that integrates five sequential modules: audio processing (diarization, noise suppression, ASR), text processing (normalization, anonymization), semantic extraction and vectorization, Q&A dataset generation using embedding-based similarity matching, and dataset validation through LLM fine-tuning.
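To make the embedding-based similarity matching step concrete, the following is a minimal sketch, not the Call2Instruct implementation. It assumes that customer questions and agent answers have already been embedded as fixed-length vectors, and it pairs each question with its most similar answer under cosine similarity. The function names, the toy vectors, and the 0.5 threshold are illustrative assumptions, not values from this work.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_questions_to_answers(
    question_vecs: List[Vector],
    answer_vecs: List[Vector],
    threshold: float = 0.5,  # hypothetical cutoff, chosen for illustration
) -> List[Tuple[int, int, float]]:
    """Pair each question with its most similar answer.

    Returns (question_index, answer_index, score) tuples and drops
    questions whose best score falls below the threshold.
    """
    pairs = []
    for qi, q in enumerate(question_vecs):
        best_idx, best_score = -1, -2.0  # cosine similarity is never below -1
        for ai, a in enumerate(answer_vecs):
            score = cosine_similarity(q, a)
            if score > best_score:
                best_idx, best_score = ai, score
        if best_idx >= 0 and best_score >= threshold:
            pairs.append((qi, best_idx, best_score))
    return pairs


# Toy 2-dimensional "embeddings"; real embeddings would come from an encoder model.
questions = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
answers = [[0.9, 0.1], [0.1, 0.9]]
for q_idx, a_idx, score in match_questions_to_answers(questions, answers):
    print(f"question {q_idx} -> answer {a_idx} (cosine = {score:.3f})")
```

In this toy run the third question has no sufficiently similar answer and is discarded, which illustrates how a similarity threshold can filter out low-confidence pairs before they enter the dataset.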
