CLS, Averaging, or Layer Combinations? Embedding Strategies for Text Classification Across BERT Variants
Abstract
The literature offers inconsistent recommendations on the best way to extract embeddings from BERT variants when building a text classifier. Some studies suggest averaging the first and last layers; others claim the CLS embedding alone is sufficient; still others report that averaging the last layer yields good results. To resolve these conflicting suggestions, we conducted a comparative empirical evaluation on text classification benchmarks (R8, SST2, Movie Review, Overruling, R52, TREC6, Snippets, and Ohsumed) using well-known BERT variants, including BERT, RoBERTa, DistilBERT, DeBERTa-v3, MPNet, and ModernBERT. We evaluated every combination of dataset and model with CLS embeddings, averaged first+last-layer embeddings, and averaged last-layer embeddings, and assessed the significance of the differences with ANOVA and paired t-tests. Based on these comparative experiments, the main contribution is a set of findings that clarify which embedding-extraction approach improves text classification results, which can inform both machine learning practitioners and research in this field.
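The three embedding strategies compared in the abstract can be sketched as follows. This is a minimal NumPy sketch, assuming `hidden_states` is the per-layer list of token embeddings that a Hugging Face encoder returns when called with `output_hidden_states=True` (index 0 being the input-embedding layer); the function names are illustrative, not taken from the paper's code.

```python
import numpy as np

def cls_embedding(hidden_states):
    """CLS strategy: the first token's vector from the last layer."""
    return hidden_states[-1][0]

def mean_pool(layer, attention_mask):
    """Average token vectors of one layer, ignoring padding positions."""
    mask = attention_mask[:, None].astype(layer.dtype)
    return (layer * mask).sum(axis=0) / mask.sum()

def last_layer_mean(hidden_states, attention_mask):
    """Last-layer averaging: mean over non-padding tokens of the final layer."""
    return mean_pool(hidden_states[-1], attention_mask)

def first_last_mean(hidden_states, attention_mask):
    """First+Last strategy: average the first and last transformer layers,
    then mean-pool over non-padding tokens."""
    combined = (hidden_states[1] + hidden_states[-1]) / 2.0
    return mean_pool(combined, attention_mask)
```

Each function returns a fixed-size sentence vector that can then be fed to any downstream classifier (e.g. logistic regression), which is the setting the paper's comparison assumes.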
Published
September 29, 2025
How to Cite
PERIN, Eliton Luiz Scardin; SOUZA, Mariana Caravanti de; COSTA, Anderson Bessa; MATSUBARA, Edson Takashi. CLS, Averaging, or Layer Combinations? Embedding Strategies for Text Classification Across BERT Variants. In: BRAZILIAN CONFERENCE ON INTELLIGENT SYSTEMS (BRACIS), 35., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 239-254. ISSN 2643-6264.
