In recent years, supervised skeleton-based action recognition has achieved notable results. However, these methods rely on labeled data, which is both resource-intensive and time-consuming to obtain. Self-supervised methods require no data labels and have therefore attracted considerable interest within the academic community. Masked Autoencoders are a self-supervised learning paradigm, and Siamese networks are a common structure in computer vision tasks, so combining the two is natural. However, most existing research applies these methods to image or video tasks, and relatively little attention has been given to skeleton-based action recognition. Moreover, current methods tend to ignore differences in how the same action appears from different viewpoints, which limits their spatial representation capability. To address this, we introduce a Siamese Masked Autoencoder framework for skeleton-based action representation learning, named SiamMVMAE. To encourage the model to capture action features across viewpoints, both the original skeleton sequences and their rotation-augmented counterparts are fed as independent inputs to the Siamese branches. These inputs are then processed by a transformer encoder and decoder, enabling effective learning of action representations. Experiments on the NTU-RGB+D 60, NTU-RGB+D 120, and PKU-MMD benchmark datasets show that our method is highly competitive with existing approaches.
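To make the idea concrete, the sketch below illustrates one plausible form of Siamese masked autoencoding over skeleton sequences: a weight-shared transformer encoder/decoder reconstructs both the original sequence and a rotation-augmented copy. It is a minimal illustration under assumptions, not the authors' implementation; the class and function names (SiamMVMAESketch, random_rotation), the per-frame tokenization, the BERT-style mask-token replacement, and the tensor layout (batch, frames, joints, 3) are all hypothetical choices made for exposition.

```python
# Minimal, hypothetical sketch of Siamese masked autoencoding on skeletons.
# All design details here are assumptions for illustration only.
import math
import random

import torch
import torch.nn as nn
import torch.nn.functional as F


def random_rotation(x):
    """Rotate 3D joint coordinates about the vertical axis by a random angle.

    x: (batch, frames, joints, 3) -- assumed layout.
    """
    theta = random.uniform(0.0, 2.0 * math.pi)
    c, s = math.cos(theta), math.sin(theta)
    rot = x.new_tensor([[c, 0.0, s],
                        [0.0, 1.0, 0.0],
                        [-s, 0.0, c]])
    return x @ rot.T


class SiamMVMAESketch(nn.Module):
    """Weight-shared (Siamese) masked autoencoder over per-frame skeleton tokens."""

    def __init__(self, num_joints=25, dim=256, depth=4, mask_ratio=0.75):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.embed = nn.Linear(num_joints * 3, dim)        # one token per frame
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, depth)
        dec_layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerEncoder(dec_layer, 2)
        self.head = nn.Linear(dim, num_joints * 3)

    def reconstruct(self, x):
        b, t, v, c = x.shape
        tokens = self.embed(x.reshape(b, t, v * c))
        # Replace a random fraction of frame tokens with a learned mask token
        # (a simplified, BERT-style stand-in for MAE masking).
        keep = torch.rand(b, t, 1, device=x.device) > self.mask_ratio
        tokens = torch.where(keep, tokens, self.mask_token.expand(b, t, -1))
        latent = self.encoder(tokens)
        return self.head(self.decoder(latent)).reshape(b, t, v, c)

    def forward(self, x):
        # Two branches share all weights: the original sequence and a rotated copy.
        x_rot = random_rotation(x)
        loss_orig = F.mse_loss(self.reconstruct(x), x)
        loss_rot = F.mse_loss(self.reconstruct(x_rot), x_rot)
        return loss_orig + loss_rot


# Usage example with dummy data: 8 sequences of 64 frames with 25 joints.
if __name__ == "__main__":
    model = SiamMVMAESketch()
    loss = model(torch.randn(8, 64, 25, 3))
    loss.backward()
```

The two-branch loss above is only one way to couple the views; for instance, a cross-view reconstruction or contrastive term between the branches would be an equally plausible instantiation of the multi-view idea described in the abstract.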