Multilingual Extractive Summarization: Investigating State-of-the-Art Methods for English and Brazilian Portuguese
Abstract
Automatic Text Summarization (ATS) is a Natural Language Processing (NLP) task essential for handling large volumes of information. ATS can be classified into two main types: extractive and abstractive. Extractive summarization selects sentences or phrases directly from the source text(s), while abstractive summarization generates new sentences that try to capture the original meaning of the source text(s). This paper describes our efforts to perform extractive single-document summarization in multilingual contexts. Although various summarization methods, such as PreSumm and HiStruct+, have shown promising results on English corpora like CNN/DM, there is a significant gap in applying these methods to other languages, especially Brazilian Portuguese. Additionally, these summarizers were evaluated with traditional metrics like ROUGE, which has limitations as it primarily measures superficial text overlap. To fill these gaps, we evaluate the effectiveness of these state-of-the-art methods on the CSTNews corpus (with news texts in Brazilian Portuguese) employing ROUGE and the recent BLANC metric, which measures how much the generated summary aids a pre-trained language model (like BERT) in understanding the document. Our contributions include the results and comparison of adapted models, the discussion of the BLANC metric in contrast to ROUGE, and the expansion of resources available to the Portuguese and multilingual NLP community.
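To make the remark about ROUGE measuring superficial text overlap concrete, the following is a minimal sketch of a unigram-overlap score in the spirit of ROUGE-1. It is not the evaluation code used in the paper: the function name `rouge1`, the whitespace tokenization, and the example sentences are all assumptions chosen for illustration, and real ROUGE implementations typically add stemming and other normalization.

```python
from collections import Counter


def rouge1(candidate: str, reference: str) -> dict:
    """Simplified ROUGE-1: unigram-overlap precision, recall and F1.

    Hypothetical helper for illustration; tokens are lowercased and
    split on whitespace, with no stemming or stop-word handling.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each token counts at most as often as it
    # appears in both the candidate and the reference.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if overlap == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Illustrative sentences (invented, not taken from CSTNews or CNN/DM).
reference = "the government approved the new budget"
paraphrase = "lawmakers passed a fresh spending plan"
near_copy = "the government approved the budget"

paraphrase_scores = rouge1(paraphrase, reference)
near_copy_scores = rouge1(near_copy, reference)
print(paraphrase_scores)
print(near_copy_scores)
```

In this sketch the paraphrase shares no words with the reference and therefore scores zero despite conveying a similar meaning, while the near copy scores highly. This is the kind of limitation that motivates complementing ROUGE with a metric such as BLANC.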
Published
17/11/2024
How to Cite
JORGE, Germano Antonio Zani; BEZERRA, Davi Alves; XAVIER, Clarissa Castellã; PARDO, Thiago Alexandre Salgueiro. Multilingual Extractive Summarization: Investigating State-of-the-Art Methods for English and Brazilian Portuguese. In: BRAZILIAN CONFERENCE ON INTELLIGENT SYSTEMS (BRACIS), 13., 2024, Belém/PA. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 212-223.
ISSN 2643-6264.