Frequency-Semantic Enhanced Variational Autoencoder for Zero-Shot Skeleton-based Action Recognition (ICCV 2025)

¹University of North Carolina at Charlotte, Department of Computer Science
²Northern Illinois University, Department of Computer Science
³Center for Research in Computer Vision, University of Central Florida
[Figure: FS-VAE method overview]

Abstract

Zero-shot skeleton-based action recognition aims to develop models capable of identifying actions beyond the categories encountered during training. Previous approaches have primarily focused on aligning visual and semantic representations but have often overlooked the importance of fine-grained action patterns in the semantic space (e.g., the hand movements in drinking water and brushing teeth). To address this limitation, we propose a Frequency-Semantic Enhanced Variational Autoencoder (FS-VAE), which explores skeleton semantic representation learning through frequency decomposition. FS-VAE consists of three key components: 1) a frequency-based enhancement module with high- and low-frequency adjustments that enriches the learning of skeletal semantics and improves the robustness of zero-shot action recognition; 2) a semantic-based action description with multilevel alignment that captures both local details and global correspondence, bridging the semantic gap and compensating for the inherent loss of information in skeleton sequences; 3) a calibrated cross-alignment loss that enables valid skeleton-text pairs to counterbalance ambiguous ones, mitigating discrepancies and ambiguities between skeleton and text features and thereby ensuring robust alignment.

Key Contributions

🔄 Frequency Enhanced Module

Employs the Discrete Cosine Transform (DCT) to decompose skeleton motions into high- and low-frequency components, allowing adaptive feature enhancement that improves semantic representation learning in zero-shot skeleton-based action recognition (ZSSAR).
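The frequency split described above can be illustrated with a small, self-contained sketch. It is not the FS-VAE implementation: it applies an orthonormal DCT-II to a single joint-coordinate trajectory over time, separates the coefficients at a hand-picked index, and recombines the two bands with fixed gains. The names `split_frequencies`, `enhance`, `cutoff`, `low_gain`, and `high_gain` are hypothetical, and the fixed scalar gains stand in for whatever adaptive adjustment the module actually learns.

```python
import math


def dct_matrix(n):
    """Orthonormal DCT-II basis; row k is the k-th cosine basis over n frames."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * t + 1) * k / (2 * n))
                     for t in range(n)])
    return rows


def dct(signal, basis):
    """Project a temporal signal onto the cosine basis."""
    return [sum(b * x for b, x in zip(row, signal)) for row in basis]


def idct(coeffs, basis):
    """Inverse transform; the basis is orthonormal, so its inverse is its transpose."""
    n = len(coeffs)
    return [sum(basis[k][t] * coeffs[k] for k in range(n)) for t in range(n)]


def split_frequencies(signal, cutoff):
    """Split one trajectory into low- and high-frequency parts.

    `cutoff` (assumed, hand-picked) is the first coefficient index treated
    as high frequency.
    """
    basis = dct_matrix(len(signal))
    coeffs = dct(signal, basis)
    low = [c if k < cutoff else 0.0 for k, c in enumerate(coeffs)]
    high = [c if k >= cutoff else 0.0 for k, c in enumerate(coeffs)]
    return idct(low, basis), idct(high, basis)


def enhance(signal, cutoff, low_gain=1.0, high_gain=1.0):
    """Recombine the two bands with fixed gains (an assumption, not learned)."""
    low, high = split_frequencies(signal, cutoff)
    return [low_gain * l + high_gain * h for l, h in zip(low, high)]


# Hypothetical x-coordinate of one joint over 8 frames: slow drift plus jitter.
trajectory = [0.0, 0.4, 1.1, 0.9, 1.6, 2.2, 1.9, 2.5]
smoothed = enhance(trajectory, cutoff=3, low_gain=1.0, high_gain=0.5)
```

Because the two bands partition the DCT coefficients, they sum back to the original trajectory, so gains of 1.0 leave the motion unchanged. For real skeleton tensors, a library routine such as `scipy.fft.dct` with `norm="ortho"` applied along the time axis would be the more practical choice.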

🎯 Semantic-based Action Description

Introduces a dual-level description, comprising a Local action Description (LD) and a Global action Description (GD), that enriches the semantic information available to the model and improves its performance.

⚖️ Calibrated Cross-Alignment Loss

Addresses modality gaps and skeleton ambiguities by dynamically balancing positive and negative pair contributions, ensuring robust alignment between semantic embeddings and skeleton features.
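The idea of weighting positive and negative pair contributions can be sketched as follows. This is not the calibrated cross-alignment loss from the paper, whose exact form is not given here; it is a generic margin-based alignment loss over cosine similarities, with static weights in place of the dynamic balancing described above. The function name and the parameters `pos_weight`, `neg_weight`, and `margin` are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def weighted_alignment_loss(skeleton_feats, text_feats,
                            pos_weight=1.0, neg_weight=0.5, margin=0.2):
    """Illustrative loss: skeleton_feats[i] is paired with text_feats[i].

    Matched pairs are pulled together (1 - similarity); mismatched pairs are
    penalized only when their similarity exceeds `margin`. The two terms are
    averaged separately and combined with fixed weights (an assumption).
    """
    n = len(skeleton_feats)
    pos_term, neg_term = 0.0, 0.0
    for i in range(n):
        for j in range(n):
            sim = cosine(skeleton_feats[i], text_feats[j])
            if i == j:
                pos_term += 1.0 - sim
            else:
                neg_term += max(0.0, sim - margin)
    pos_term /= n
    neg_term /= max(1, n * (n - 1))
    return pos_weight * pos_term + neg_weight * neg_term


# Toy features: two classes with orthogonal embeddings in each modality.
skeleton = [[1.0, 0.0], [0.0, 1.0]]
text = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = weighted_alignment_loss(skeleton, text)
```

Averaging the positive and negative terms separately keeps the many mismatched pairs in a batch from swamping the few matched ones, which is one simple way to keep the two contributions comparable.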

Method Architecture

FS-VAE Framework Overview

Our FS-VAE framework integrates frequency domain analysis with semantic understanding to achieve robust zero-shot action recognition. The frequency enhancement module preserves essential motion patterns while mitigating noise, the semantic-based descriptions bridge the gap between visual and textual modalities, and the calibrated loss ensures robust alignment even with noisy skeleton data.

[Figure: FS-VAE architecture]

Experimental Results

NTU-60 (55/5 split): 86.9% (+3.6% over best baseline)

NTU-60 (48/12 split): 57.2% (+7.4% over best baseline)

NTU-120 (110/10 split): 74.4% (+2.4% over best baseline)

NTU-120 (96/24 split): 62.5% (+1.8% over best baseline)

Zero-Shot Learning Results

[Figure: zero-shot learning results]

Generalized Zero-Shot Learning Results

[Figure: generalized zero-shot learning results]

Citation

@article{wu2025frequencysemanticenhancedvariationalautoencoder,
  title={Frequency-Semantic Enhanced Variational Autoencoder for Zero-Shot Skeleton-based Action Recognition},
  author={Wenhan Wu and Zhishuai Guo and Chen Chen and Hongfei Xue and Aidong Lu},
  year={2025},
  eprint={2506.22179},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.22179},
  note={arXiv preprint arXiv:2506.22179}
}

Acknowledgments

We thank the anonymous reviewers for their valuable feedback. This work was supported by research grants from the University of North Carolina at Charlotte and by collaborations with Northern Illinois University and the University of Central Florida.