Employs the Discrete Cosine Transform (DCT) to decompose skeleton motions into high- and low-frequency components, allowing adaptive feature enhancement that improves semantic representation learning in zero-shot skeleton-based action recognition (ZSSAR).
Introduces a dual-level description, comprising a Local action Description (LD) and a Global action Description (GD), that enriches the semantic information available to the model and improves its performance.
Addresses modality gaps and skeleton ambiguities with a calibrated loss that dynamically balances the contributions of positive and negative pairs, ensuring robust alignment between semantic embeddings and skeleton features.
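The frequency decomposition described above can be illustrated with a minimal sketch. It applies an orthonormal DCT along the time axis of a single joint coordinate, scales the low- and high-frequency coefficients separately, and transforms back. The function names, the `cutoff` index, and the fixed `low_gain` and `high_gain` values are assumptions made for illustration; the adaptive enhancement in FS-VAE may be defined differently.

```python
import math

def dct(x):
    """Orthonormal DCT-II of a 1-D sequence (e.g. one joint coordinate over time)."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def idct(c):
    """Inverse of dct() above (orthonormal DCT-III)."""
    n = len(c)
    out = []
    for t in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(
            c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (t + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(s)
    return out

def enhance(trajectory, cutoff, low_gain=1.0, high_gain=1.0):
    """Scale low- and high-frequency DCT coefficients separately.

    Hypothetical helper: coefficients with index < cutoff count as low
    frequency. The gains are fixed here, not learned or adaptive.
    """
    coeffs = dct(trajectory)
    scaled = [c * (low_gain if k < cutoff else high_gain)
              for k, c in enumerate(coeffs)]
    return idct(scaled)

# A slow drift with a small frame-to-frame jitter on top.
trajectory = [0.0, 1.1, 1.9, 3.1, 3.9, 5.1, 5.9, 7.1]
smoothed = enhance(trajectory, cutoff=3, high_gain=0.0)
print([round(v, 3) for v in smoothed])
```

Setting `high_gain` below 1 suppresses fine-grained jitter, while raising `low_gain` emphasizes the coarse motion trend. Setting both gains to 1 reproduces the input, which is a quick check that the transform pair is consistent.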
Our FS-VAE framework integrates frequency-domain analysis with semantic understanding to achieve robust zero-shot skeleton-based action recognition. The frequency enhancement module preserves essential motion patterns while mitigating noise, the semantic-based descriptions bridge the gap between the visual and textual modalities, and the calibrated loss keeps the two modalities aligned even with noisy skeleton data.
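To make the idea of balancing positive and negative pairs concrete, here is a generic weighted alignment loss over cosine similarities. It is a sketch under stated assumptions, not the calibrated loss from the paper: the function names, the weights `w_pos` and `w_neg`, and the `margin` are hypothetical, and the weights are plain arguments here, not dynamically adjusted.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def weighted_alignment_loss(skeleton, pos_text, neg_texts,
                            w_pos=1.0, w_neg=1.0, margin=0.0):
    """Hypothetical loss: pull the matching pair together, push others apart.

    skeleton  : skeleton feature vector
    pos_text  : semantic embedding of the correct action description
    neg_texts : non-empty list of embeddings of other action descriptions
    w_pos, w_neg : relative weight of the positive and negative terms
    """
    # Positive term: zero when the matching pair is perfectly aligned.
    pos_term = 1.0 - cosine(skeleton, pos_text)
    # Negative term: penalize mismatched pairs whose similarity exceeds the margin.
    neg_term = sum(max(0.0, cosine(skeleton, t) - margin)
                   for t in neg_texts) / len(neg_texts)
    return w_pos * pos_term + w_neg * neg_term

skeleton = [1.0, 0.0]
loss = weighted_alignment_loss(skeleton, [0.8, 0.6], [[0.0, 1.0], [0.6, 0.8]],
                               w_pos=1.0, w_neg=0.5)
print(round(loss, 3))
```

Raising `w_neg` relative to `w_pos` puts more pressure on separating confusable actions, while lowering it lets the positive pair dominate, which may help when skeleton features are noisy.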
+3.6% over best baseline
+7.4% over best baseline
+2.4% over best baseline
+1.8% over best baseline
@article{wu2025frequencysemanticenhancedvariationalautoencoder,
  title={Frequency-Semantic Enhanced Variational Autoencoder for Zero-Shot Skeleton-based Action Recognition},
  author={Wenhan Wu and Zhishuai Guo and Chen Chen and Hongfei Xue and Aidong Lu},
  year={2025},
  eprint={2506.22179},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2506.22179},
  note={arXiv preprint arXiv:2506.22179}
}
We thank the anonymous reviewers for their valuable feedback. This work was supported by research grants from the University of North Carolina at Charlotte and by collaborations with Northern Illinois University and the University of Central Florida.