Hierarchical Vision Transformer Using Shifted Windows for Skeleton-Based Action Recognition
Abstract
Skeleton-based human action recognition has gained significant attention due to the increasing accessibility of skeleton data. In this work, we propose a method for skeleton-based action recognition that leverages a hierarchical vision transformer with shifted windows, known as the Swin Transformer, combined with a self-supervised learning strategy inspired by Simple Masked Image Modeling (SimMIM). The Swin Transformer restricts self-attention to local windows and enables cross-window information exchange through a shift mechanism. This design scales to longer skeleton sequences while balancing local and global context modeling and reducing feature redundancy. Given the limited availability of labeled data, especially for training large transformer-based models, we incorporate a self-supervised pre-training stage. This pre-training follows the SimMIM strategy, in which masked patches of raw skeleton sequences are reconstructed using a one-layer linear head and an L1 loss, encouraging the model to capture meaningful motion patterns even from partially visible data. Our method achieves competitive performance on publicly available datasets, including NTU RGB+D 60 and NTU RGB+D 120.
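The shifted-window mechanism mentioned above can be illustrated on a toy 1-D token sequence: attention is restricted to fixed-size windows, and a cyclic shift between layers makes the next layer's windows straddle the previous boundaries so information can cross them. This is a minimal sketch with hypothetical sizes (`L`, `W`), not the paper's implementation:

```python
import numpy as np

# Toy 1-D token sequence (e.g., frames of a skeleton sequence); sizes are illustrative.
L, W = 12, 4                         # sequence length, window size
tokens = np.arange(L)

# Regular window partition: self-attention would be computed within each row only.
windows = tokens.reshape(L // W, W)
# windows: [[0,1,2,3], [4,5,6,7], [8,9,10,11]]

# Shifted partition: cyclically shift by W // 2 before partitioning, so each new
# window mixes tokens from two adjacent regular windows.
shifted = np.roll(tokens, -(W // 2))
shifted_windows = shifted.reshape(L // W, W)
# shifted_windows: [[2,3,4,5], [6,7,8,9], [10,11,0,1]]
```

Alternating regular and shifted partitions across layers is what lets window-local attention still propagate information globally over depth.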
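The SimMIM-style pre-training objective, reconstructing masked patches of the raw sequence through a one-layer linear head under an L1 loss, can be sketched as follows. All shapes, the zero "mask token", and the single-linear-map stand-in for the encoder are illustrative assumptions, not the authors' architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy skeleton sequence: T frames, J joints, C coordinates per joint
# (J = 25 matches NTU RGB+D, the rest are hypothetical sizes).
T, J, C = 8, 25, 3
patches = rng.normal(size=(T, J * C))    # one "patch" per frame

# Randomly mask a fraction of the patches (SimMIM-style).
mask_ratio = 0.5
mask = rng.random(T) < mask_ratio        # True = masked

# Stand-in encoder and one-layer linear prediction head (random weights).
D = 16                                   # hypothetical embedding dimension
W_enc = rng.normal(size=(J * C, D)) * 0.1
W_head = rng.normal(size=(D, J * C)) * 0.1

visible = patches.copy()
visible[mask] = 0.0                      # masked patches replaced by a zero token

features = visible @ W_enc               # "encoder": a single linear map here
recon = features @ W_head                # linear head predicts raw patch values

# L1 loss computed only on the masked patches, as in SimMIM.
l1_loss = np.abs(recon[mask] - patches[mask]).mean()
print(f"masked-patch L1 loss: {l1_loss:.4f}")
```

Restricting the loss to masked positions forces the model to infer motion content from the visible context rather than copy its input.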
Keywords:
Training, Computer vision, Solid modeling, Three-dimensional displays, Scalability, Computer architecture, Self-supervised learning, Transformers, Skeleton, Data models
Published
30/09/2025
How to Cite
COSMI FILHO, Luiz Carlos; SAMATELO, Jorge Leonid Aching; VASSALLO, Raquel Frizera. Hierarchical Vision Transformer Using Shifted Windows for Skeleton-Based Action Recognition. In: CONFERENCE ON GRAPHICS, PATTERNS AND IMAGES (SIBGRAPI), 38., 2025, Salvador/BA. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 116-121.
