Hierarchical Vision Transformer Using Shifted Windows for Skeleton-Based Action Recognition
Abstract
Skeleton-based human action recognition has gained significant attention due to the increasing accessibility of skeleton data. In this work, we propose a method for skeleton-based action recognition that leverages a hierarchical vision transformer with shifted windows, known as the Swin Transformer, combined with a self-supervised learning strategy inspired by Simple Masked Image Modeling (SimMIM). The Swin Transformer restricts self-attention to local windows and enables cross-window information exchange through a shift mechanism. This design scales to longer skeleton sequences while balancing local and global context modeling and reducing feature redundancy. Given the limited availability of labeled data, especially for training large transformer-based models, we incorporate a self-supervised pre-training stage. This pre-training follows the SimMIM strategy, in which masked patches of raw skeleton sequences are reconstructed using a one-layer linear head and an L1 loss, encouraging the model to capture meaningful motion patterns even from partially visible data. Our method achieves competitive performance on publicly available datasets, including NTU RGB+D 60 and NTU RGB+D 120.
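The shifted-window mechanism mentioned above can be illustrated on a toy 1-D token sequence: attention is restricted to fixed-size windows, and a cyclic shift between layers makes the next layer's windows straddle the previous boundaries so information can cross them. This is a minimal sketch with hypothetical sizes (`L`, `W`), not the paper's implementation:

```python
import numpy as np

# Toy 1-D token sequence (e.g., frames of a skeleton sequence); sizes are illustrative.
L, W = 12, 4                         # sequence length, window size
tokens = np.arange(L)

# Regular window partition: self-attention would be computed within each row only.
windows = tokens.reshape(L // W, W)
# windows: [[0,1,2,3], [4,5,6,7], [8,9,10,11]]

# Shifted partition: cyclically shift by W // 2 before partitioning, so each new
# window mixes tokens from two adjacent regular windows.
shifted = np.roll(tokens, -(W // 2))
shifted_windows = shifted.reshape(L // W, W)
# shifted_windows: [[2,3,4,5], [6,7,8,9], [10,11,0,1]]
```

Alternating regular and shifted partitions across layers is what lets window-local attention still propagate information globally over depth.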
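The SimMIM-style pre-training objective, reconstructing masked patches of the raw sequence through a one-layer linear head under an L1 loss, can be sketched as follows. All shapes, the zero "mask token", and the single-linear-map stand-in for the encoder are illustrative assumptions, not the authors' architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy skeleton sequence: T frames, J joints, C coordinates per joint
# (J = 25 matches NTU RGB+D, the rest are hypothetical sizes).
T, J, C = 8, 25, 3
patches = rng.normal(size=(T, J * C))    # one "patch" per frame

# Randomly mask a fraction of the patches (SimMIM-style).
mask_ratio = 0.5
mask = rng.random(T) < mask_ratio        # True = masked

# Stand-in encoder and one-layer linear prediction head (random weights).
D = 16                                   # hypothetical embedding dimension
W_enc = rng.normal(size=(J * C, D)) * 0.1
W_head = rng.normal(size=(D, J * C)) * 0.1

visible = patches.copy()
visible[mask] = 0.0                      # masked patches replaced by a zero token

features = visible @ W_enc               # "encoder": a single linear map here
recon = features @ W_head                # linear head predicts raw patch values

# L1 loss computed only on the masked patches, as in SimMIM.
l1_loss = np.abs(recon[mask] - patches[mask]).mean()
print(f"masked-patch L1 loss: {l1_loss:.4f}")
```

Restricting the loss to masked positions forces the model to infer motion content from the visible context rather than copy its input.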
Keywords:
Training, Computer vision, Solid modeling, Three-dimensional displays, Scalability, Computer architecture, Self-supervised learning, Transformers, Skeleton, Data models
Published
30/09/2025
How to Cite
COSMI FILHO, Luiz Carlos; SAMATELO, Jorge Leonid Aching; VASSALLO, Raquel Frizera. Hierarchical Vision Transformer Using Shifted Windows for Skeleton-Based Action Recognition. In: CONFERENCE ON GRAPHICS, PATTERNS AND IMAGES (SIBGRAPI), 38., 2025, Salvador/BA. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 116-121.
