Aroeira: A Curated Corpus for the Portuguese Language with a Large Number of Tokens
Abstract
The emphasis on constructing extensive datasets for training large language models (LLMs) has recently increased, and the current literature predominantly features datasets for high-resource languages such as English and Chinese. However, high-quality corpora for Portuguese remain notably scarce. To address this limitation, we propose Aroeira, a curated corpus explicitly designed for training large language models in Portuguese, with a focus on Brazilian Portuguese. The Aroeira Corpus consists of 100 GB of text from various internet platforms, processed through a comprehensive pipeline to ensure high quality. The pipeline handles downloading, text extraction, language identification, application of quality and bias filters, and storage, all tailored to the Portuguese language. The resulting corpus contains 35.3 million documents and over 15.1 billion tokens, surpassing the largest previously available corpus in this domain.
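As a rough illustration of the pipeline stages named in the abstract, the sketch below chains text extraction, language identification, and a quality filter over pages that are assumed to be already downloaded. It is not the authors' implementation: the stopword list, the thresholds, and all function names (`extract_text`, `looks_portuguese`, `passes_quality`, `run_pipeline`) are hypothetical, and the bias filter and storage stages are omitted because the abstract gives no detail on them.

```python
import html
import re
from dataclasses import dataclass
from typing import Iterable, List, Tuple

# Hypothetical stopword set: a toy stand-in for a real language identifier.
PT_STOPWORDS = {"de", "que", "não", "para", "com", "uma", "os", "em", "por", "do", "da", "é"}


@dataclass
class Document:
    url: str
    text: str


def extract_text(raw_html: str) -> str:
    """Strip tags and collapse whitespace (crude stand-in for real text extraction)."""
    no_tags = re.sub(r"<[^>]+>", " ", raw_html)
    return re.sub(r"\s+", " ", html.unescape(no_tags)).strip()


def looks_portuguese(text: str, min_ratio: float = 0.15) -> bool:
    """Toy language check: share of words found in the stopword set (assumed threshold)."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return False
    hits = sum(1 for w in words if w in PT_STOPWORDS)
    return hits / len(words) >= min_ratio


def passes_quality(text: str, min_words: int = 5) -> bool:
    """Toy quality filter: drop very short documents (assumed threshold)."""
    return len(text.split()) >= min_words


def run_pipeline(pages: Iterable[Tuple[str, str]]) -> List[Document]:
    """Apply extraction, language identification, and quality filtering in order."""
    kept: List[Document] = []
    for url, raw in pages:
        text = extract_text(raw)
        if looks_portuguese(text) and passes_quality(text):
            kept.append(Document(url, text))
    return kept


if __name__ == "__main__":
    sample = [
        ("a", "<p>O corpus é uma coleção de textos em português para treinar modelos.</p>"),
        ("b", "<p>The quick brown fox jumps over the lazy dog today.</p>"),
    ]
    for doc in run_pipeline(sample):
        print(doc.url, doc.text)
```

A production pipeline would replace each toy function with a dedicated component, such as a trained language identifier, while keeping the same stage order.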
Published
November 17, 2024
How to Cite
LIRA, Thiago et al. Aroeira: A Curated Corpus for the Portuguese Language with a Large Number of Tokens. In: BRAZILIAN CONFERENCE ON INTELLIGENT SYSTEMS (BRACIS), 13., 2024, Belém/PA. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 185-199. ISSN 2643-6264.