Diversity in Data for Speech Processing in Brazilian Portuguese

Giovana Meloni Craveiro (USP); Julio Cesar Galdino (USP)

Abstract

Striving to comply with AI ethical guidelines is essential when developing and testing AI systems, in order to ensure safe and trustworthy applications. However, these guidelines can be too general. The analysis presented here concerns the ethical principle of diversity and discusses its application to the field of speech processing, using the task of prosodic segmentation of spontaneous speech as a case study. In particular, it covers the relevance of including diverse speaker profiles and regional variants in the data used for training and developing AI applications, in the context of Brazilian Portuguese (BP). The contributions of this study are: (i) a discussion of how the diversity principle applies to corpora for speech applications, considering some relevant aspects and the process we formulated to select a diverse sample of speakers for our corpus; (ii) a literature review of the corpora currently available for the task of prosodic segmentation of spontaneous speech in BP, focused on the diversity of the data; (iii) a publicly available speech corpus containing 2 h 32 min 15 s of spontaneous speech audio in BP, together with revised transcriptions annotated with automatic prosodic segmentation, designed to cover diversity of age, gender, and accent (17 Brazilian states). The corpus is available in our GitHub repository, https://github.com/nilc-nlp/MuPe-Diversidades/, under the CC BY-NC-ND 4.0 license.