BeSIM: A Benchmark for Evaluating the Interpretation of Social Interactions in Brazil Using Multimodal Large Language Models

  • Mateus Souza Falcão (UFPE)
  • Francisco Paulo Magalhães Simões (UFPE)
  • Vitória Sofia Vieira dos Santos (UFRPE)

Abstract


The rapid development of Multimodal Large Language Models (MLLMs) has expanded the possibilities for automatically understanding complex real-world scenarios. However, their ability to interpret social interactions in videos remains underexplored. This study introduces BeSIM, a benchmark designed to evaluate MLLMs' competence in interpreting such interactions among Brazilians, based on the APRACE taxonomy for categorizing key elements of social interaction. A set of 22 videos was collected from YouTube, yielding 110 multiple-choice questions aligned with these categories. The results show that models such as Gemini 2.5 Pro surpass their own performance on generalist benchmarks like Video-MME, reaching up to 90% accuracy on BeSIM. We also conducted a qualitative analysis discussing possible factors behind the models' successes and failures. These findings indicate that, when properly evaluated, MLLMs demonstrate great potential for interpreting human interactions. Code and data are available at https://github.com/M4Falcao/BeSIM.
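For readers unfamiliar with how multiple-choice benchmarks of this kind are typically scored, the sketch below shows a minimal accuracy computation over a question set. It is an illustration only, not the authors' evaluation code: the file name `questions.json`, its fields, and the `ask_model` callback are hypothetical placeholders.

```python
import json
from typing import Callable


def evaluate_mcq(questions_path: str,
                 ask_model: Callable[[str, list[str]], str]) -> float:
    """Score a model on multiple-choice questions and return overall accuracy.

    Assumes (hypothetically) that `questions_path` points to a JSON list of
    items shaped like {"question": str, "options": [str, ...], "answer": "A"},
    and that `ask_model` maps a question and its options to a letter choice.
    """
    with open(questions_path, encoding="utf-8") as f:
        items = json.load(f)

    correct = 0
    for item in items:
        prediction = ask_model(item["question"], item["options"])
        if prediction.strip().upper() == item["answer"].strip().upper():
            correct += 1
    return correct / len(items)


# Example usage with a trivial placeholder "model" that always answers "A":
# accuracy = evaluate_mcq("questions.json", lambda q, opts: "A")
# print(f"Accuracy: {accuracy:.1%}")
```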

Keywords: Analytical models, Visualization, Accuracy, Video on demand, Large language models, Taxonomy, Psychology, Benchmark testing, Web sites, Videos
Published
30/09/2025
FALCÃO, Mateus Souza; SIMÕES, Francisco Paulo Magalhães; SANTOS, Vitória Sofia Vieira dos. BeSIM: A Benchmark for Evaluating the Interpretation of Social Interactions in Brazil Using Multimodal Large Language Models. In: CONFERENCE ON GRAPHICS, PATTERNS AND IMAGES (SIBGRAPI), 38., 2025, Salvador/BA. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 391-396.