Here comes the SAM: bringing light to black box models applied to video content
Abstract
This paper introduces a model-agnostic approach to improving the explainability of black-box video models by integrating advanced segmentation techniques. Leveraging the Segment Anything Model 2 (SAM 2) to create coherent spatio-temporal segments, we adapt a LIME-inspired framework to generate more intuitive local surrogate explanations. Our method extracts meaningful regions within video frames, providing clearer insight into the model's decision-making process. Experimental results demonstrate that better segmentation yields more faithful and interpretable explanations, highlighting the benefits of this generalizable strategy for a wide range of video-based classification and detection tasks.
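To make the pipeline concrete, the sketch below illustrates a LIME-style local surrogate fit over spatio-temporal segments. It is a minimal illustration under stated assumptions, not the paper's implementation: the names segment_ids, predict_fn, and explain_video are hypothetical, and the segment map is assumed to come from propagating SAM 2 masks across frames.

# Minimal sketch of a LIME-style local surrogate over spatio-temporal segments.
# All names here (segment_ids, predict_fn, explain_video) are illustrative
# assumptions, not the paper's API. `segment_ids` is a (T, H, W) integer array
# of segment labels, e.g. obtained by propagating SAM 2 masks across frames;
# `predict_fn` maps a batch of videos (N, T, H, W, C) to class probabilities.
import numpy as np
from sklearn.linear_model import Ridge

def explain_video(video, segment_ids, predict_fn, target_class,
                  n_samples=200, kernel_width=0.25, seed=0):
    """Return one importance weight per spatio-temporal segment."""
    rng = np.random.default_rng(seed)
    seg_labels = np.unique(segment_ids)

    # Binary interpretable features: which segments are kept in each sample.
    z = rng.integers(0, 2, size=(n_samples, len(seg_labels)))
    z[0] = 1  # first sample is the unperturbed video

    baseline = video.mean(axis=(0, 1, 2))  # per-channel "gray-out" value
    preds = np.empty(n_samples)
    for i in range(n_samples):
        perturbed = video.copy()
        removed = np.isin(segment_ids, seg_labels[z[i] == 0])  # (T, H, W) mask
        perturbed[removed] = baseline  # occlude the segments switched off
        preds[i] = predict_fn(perturbed[None])[0, target_class]

    # Weight samples by proximity to the original video (fewer removals = closer).
    d = 1.0 - z.mean(axis=1)
    weights = np.exp(-(d ** 2) / kernel_width ** 2)

    # Weighted ridge regression as the local surrogate; its coefficients are
    # the per-segment contributions to the target-class score.
    surrogate = Ridge(alpha=1.0).fit(z, preds, sample_weight=weights)
    return dict(zip(seg_labels.tolist(), surrogate.coef_))

Each returned coefficient estimates how much keeping a given segment contributes to the target-class score, which is how removal-based surrogate explanations are typically read; the same loop works with any segmenter, which is what makes the quality of the segments the decisive factor.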
