TransferAttn: Transferable-guided Attention for Video Domain Adaptation

  • André Sacilotti, USP
  • Nicu Sebe, University of Trento
  • Jurandy Almeida, UFSCar

Abstract

Unsupervised domain adaptation (UDA) in videos is a challenging task that remains underexplored compared to image-based UDA. Although vision transformers (ViT) achieve state-of-the-art performance in many computer vision tasks, their use in video domain adaptation has received little attention. Our key idea is to use the transformer layers as a feature encoder and to incorporate spatial and temporal transferability relationships into the attention mechanism. We develop a Transferable-guided Attention (TransferAttn) framework that exploits the capacity of the transformer to adapt cross-domain knowledge across different backbones. To improve the transferability of ViT, we introduce a novel and effective module, named Domain Transferable-guided Attention Block (DTAB), which compels the ViT to focus on the spatio-temporal transferability relationship among video frames by replacing the self-attention mechanism with a transferability attention mechanism. Experiments conducted on the UCF-HMDB and Kinetics-NEC-Drone datasets, with different backbones such as I3D and STAM, show that TransferAttn outperforms state-of-the-art approaches. We also demonstrate that DTAB yields performance gains when applied to other ViT-based methods for video UDA.
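Since the abstract describes DTAB as replacing self-attention with a transferability attention mechanism, the following minimal PyTorch sketch illustrates one way such a block could be realized. It is an assumption-laden illustration, not the paper's exact DTAB: per-token transferability is estimated here from the entropy of a small patch-level domain discriminator (in the spirit of the TVT approach listed in the references below), and the class TransferabilityAttention and its domain_disc head are hypothetical names.

    import torch
    import torch.nn as nn

    class TransferabilityAttention(nn.Module):
        """Hypothetical sketch: self-attention re-weighted by per-token
        transferability, estimated as the entropy of a patch-level domain
        discriminator (an assumption; the paper's DTAB may differ)."""

        def __init__(self, dim: int, num_heads: int = 8):
            super().__init__()
            assert dim % num_heads == 0
            self.num_heads = num_heads
            self.head_dim = dim // num_heads
            self.scale = self.head_dim ** -0.5
            self.qkv = nn.Linear(dim, dim * 3)
            self.proj = nn.Linear(dim, dim)
            self.domain_disc = nn.Linear(dim, 2)  # source vs. target logits

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            B, N, C = x.shape  # (batch, frame/patch tokens, channels)

            # Transferability per token: high discriminator entropy means
            # the token is hard to tell apart across domains, i.e., it is
            # more transferable.
            p = self.domain_disc(x).softmax(dim=-1)            # (B, N, 2)
            entropy = -(p * p.clamp_min(1e-8).log()).sum(-1)   # (B, N)
            transferability = entropy / torch.log(torch.tensor(2.0))  # [0, 1]

            qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
            q, k, v = qkv.permute(2, 0, 3, 1, 4)               # (B, H, N, Hd)

            attn = (q @ k.transpose(-2, -1)) * self.scale      # (B, H, N, N)
            attn = attn.softmax(dim=-1)
            # Shift attention mass toward transferable key tokens, then
            # renormalize so rows still sum to one.
            attn = attn * transferability[:, None, None, :]
            attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-8)

            out = (attn @ v).transpose(1, 2).reshape(B, N, C)
            return self.proj(out)

As a usage sketch, TransferabilityAttention(dim=768)(torch.randn(2, 16, 768)) attends over 16 frame tokens. The design intuition is that weighting keys by discriminator entropy pushes attention toward domain-invariant tokens, matching the abstract's goal of encoding spatio-temporal transferability in the attention mechanism.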

References

V. da Costa, G. Zara, P. Rota, T. Oliveira-Santos, N. Sebe, V. Murino, and E. Ricci, “Unsupervised domain adaptation for video transformers in action recognition,” in ICPR, 2022, pp. 1258–1265.

Y. Kong and Y. Fu, “Human action recognition and prediction: A survey,” Int. J. Comput. Vis., vol. 130, no. 5, pp. 1366–1401, 2022.

K. Wang, D. Zhang, Y. Li, R. Zhang, and L. Lin, “Cost-effective active learning for deep image classification,” IEEE Trans. Circuits Syst. Video Technol., vol. 27, no. 12, pp. 2591–2600, 2017.

Y. Xu, H. Cao, Z. Chen, X. Li, L. Xie, and J. Yang, “Video unsupervised domain adaptation with deep learning: A comprehensive survey,” CoRR, vol. abs/2211.10412, 2022.

P. Wei, L. Kong, X. Qu, Y. Ren, Z. Xu, J. Jiang, and X. Yin, “Unsupervised video domain adaptation for action recognition: A disentanglement perspective,” in NeurIPS, 2023.

A. Dasgupta, C. V. Jawahar, and K. Alahari, “Overcoming label noise for source-free unsupervised video domain adaptation,” CoRR, vol. abs/2311.18572, 2023.

P. Chen, Y. Gao, and A. J. Ma, “Multi-level attentive adversarial learning with temporal dilation for unsupervised video domain adaptation,” in WACV, 2022, pp. 776–785.

M.-H. Chen, Z. Kira, G. Alregib, J. Yoo, R. Chen, and J. Zheng, “Temporal attentive alignment for large-scale video domain adaptation,” in ICCV, 2019, pp. 6320–6329.

J. Choi, G. Sharma, M. Chandraker, and J.-B. Huang, “Unsupervised and semi-supervised domain adaptation for action recognition from drones,” in WACV, 2020, pp. 1706–1715.

J. Li, L. Zhu, and Z. Du, Unsupervised Domain Adaptation - Recent Advances and Future Perspectives. Springer, 2024.

J. Yang, J. Liu, N. Xu, and J. Huang, “TVT: Transferable vision transformer for unsupervised domain adaptation,” in WACV, 2023, pp. 520–530.

G. Sharir, A. Noy, and L. Zelnik-Manor, “An image is worth 16x16 words, what is a video worth?” CoRR, vol. abs/2103.13915, 2021.

N. Tishby and N. Zaslavsky, “Deep learning and the information bottleneck principle,” in IEEE Information Theory Workshop (ITW’15), 2015, pp. 1–5.

J. Selva, A. S. Johansen, S. Escalera, K. Nasrollahi, T. B. Moeslund, and A. Clapés, “Video transformers: A survey,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 11, pp. 12922–12943, 2023.

K. Soomro, A. R. Zamir, and M. Shah, “UCF101: A dataset of 101 human actions classes from videos in the wild,” CoRR, vol. abs/1212.0402, 2012.

H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre, “HMDB: A large video database for human motion recognition,” in ICCV, 2011, pp. 2556–2563.

J. Carreira, E. Noland, A. Banki-Horvath, C. Hillier, and A. Zisserman, “A short note about kinetics-600,” CoRR, vol. abs/1808.01340, 2018.
Published
30/09/2024
SACILOTTI, André; SEBE, Nicu; ALMEIDA, Jurandy. TransferAttn: Transferable-guided Attention for Video Domain Adaptation. In: WORKSHOP DE TRABALHOS DA GRADUAÇÃO - CONFERENCE ON GRAPHICS, PATTERNS AND IMAGES (SIBGRAPI), 37., 2024, Manaus/AM. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 155-158. DOI: https://doi.org/10.5753/sibgrapi.est.2024.31663.
