Multilingual Extractive Summarization: Investigating State-of-the-Art Methods for English and Brazilian Portuguese

Germano Antonio Zani Jorge; Davi Alves Bezerra; Clarissa Castellã Xavier; Thiago Alexandre Salgueiro Pardo

Germano Antonio Zani Jorge USP
Davi Alves Bezerra USP
Clarissa Castellã Xavier USP / SiDi
Thiago Alexandre Salgueiro Pardo USP

Abstract

Automatic Text Summarization (ATS) is a Natural Language Processing (NLP) task essential for handling large volumes of information. ATS methods fall into two main types: extractive and abstractive. Extractive summarization selects sentences or phrases directly from the source text(s), while abstractive summarization generates new sentences that try to capture the meaning of the source text(s). This paper describes our efforts to perform extractive single-document summarization in multilingual contexts. Although summarization methods such as PreSumm and HiStruct+ have shown promising results on English corpora like CNN/DM, there remains a significant gap in applying these methods to other languages, especially Brazilian Portuguese. Additionally, these summarizers were evaluated with traditional metrics like ROUGE, which is limited because it primarily measures surface-level text overlap. To fill these gaps, we evaluate these state-of-the-art methods on the CSTNews corpus (news texts in Brazilian Portuguese) using both ROUGE and the more recent BLANC metric, which measures how much a generated summary helps a pre-trained language model (like BERT) understand the document. Our contributions include the results and comparison of the adapted models, a discussion of the BLANC metric in contrast to ROUGE, and the expansion of resources available to the Portuguese and multilingual NLP community.