The Role of Aggregation Functions on Transformers and ViTs Self-Attention for Classification

  • Joelson Sartori FURG
  • Rodrigo de Bem FURG
  • Graçaliz Dimuro FURG
  • Giancarlo Lucca UCPel

Abstract


Aggregation functions are mathematical operations that combine or summarize a set of values into a single representative value. They play a crucial role in the attention mechanisms of Transformer neural networks. However, the Transformers' default aggregation, based on matrix multiplication, may have limitations in certain classification scenarios: it may struggle with the complexity of the information present in the input data, resulting in lower accuracy and efficiency. Considering this issue, the present work aims to replace the traditional matrix multiplication operation used in the classical attention mechanism with alternative and more general aggregation functions. To validate the new aggregation methods on the attention mechanism, we conducted experiments on two datasets, the recently proposed Google American Sign Language (ASL) Fingerspelling Recognition and the well-known CIFAR-10, performing time series and image classification, respectively. Results shed light on the role of aggregation functions in classification with Transformers, demonstrating promising outcomes and potential for further improvements.
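The idea the abstract describes can be illustrated with a minimal sketch: in standard scaled dot-product attention, values are aggregated by a matrix multiplication (a weighted sum over the value vectors); making that aggregation a pluggable function allows alternatives to be swapped in. The `weighted_max` alternative below is a hypothetical example of such a replacement, not necessarily one of the functions studied in the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, aggregate):
    # Attention weights as in standard scaled dot-product attention.
    w = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # shape (n_queries, n_keys)
    # The aggregation of the values is a pluggable function instead of
    # being hard-wired as the matrix product w @ V.
    return aggregate(w, V)

# Default Transformer behavior: weighted sum, equivalent to w @ V.
def weighted_sum(w, V):
    return w @ V

# Hypothetical alternative aggregation: a weighted maximum over values.
def weighted_max(w, V):
    return (w[:, :, None] * V[None, :, :]).max(axis=1)

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
out_sum = attention(Q, K, V, weighted_sum)  # shape (4, 8)
out_max = attention(Q, K, V, weighted_max)  # shape (4, 8)
```

With `weighted_sum`, the function reproduces classical attention exactly; any other aggregation changes only how the attention weights and value vectors are combined, leaving the rest of the mechanism intact.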
Published
06 Nov. 2023
SARTORI, Joelson; BEM, Rodrigo de; DIMURO, Graçaliz; LUCCA, Giancarlo. The Role of Aggregation Functions on Transformers and ViTs Self-Attention for Classification. In: CONFERENCE ON GRAPHICS, PATTERNS AND IMAGES (SIBGRAPI), 36., 2023, Rio Grande/RS. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 97-102.