BeSIM: A Benchmark for Evaluating the Interpretation of Social Interactions in Brazil Using Multimodal Large Language Models
Abstract
The rapid development of Multimodal Large Language Models (MLLMs) has expanded the possibilities for automatically understanding complex real-world scenarios. However, their ability to interpret social interactions in videos remains underexplored. This study introduces BeSIM, a benchmark designed to evaluate the competence of MLLMs in interpreting such interactions among Brazilians, based on the APRACE taxonomy for categorizing key elements of social interaction. A set of 22 videos was collected from YouTube, yielding 110 multiple-choice questions aligned with these categories. The results show that models such as Gemini 2.5 Pro achieve higher accuracy on BeSIM than on generalist benchmarks like Video-MME, reaching up to 90% accuracy on BeSIM. We also conducted a qualitative analysis to discuss possible factors explaining the models' successes and failures. These findings indicate that, when properly evaluated, MLLMs demonstrate great potential for interpreting human interactions. Code and data are available at https://github.com/M4Falcao/BeSIM.
