Yingcun Wang1, Rong Wang2, Shizhong Liu3, and Zhi Zeng4
1School of Information Engineering, Weifang Vocational College, Weifang 261031, China
2School of International Business, Weifang Vocational College, Weifang 262737, China
3Weifang Vocational College, Weifang 261031, China
4Shandong Qiaotong Tianxia Network Technology Co., Ltd, Weifang, 261061, China
Received: May 2, 2025
Accepted: July13, 2025
Publication Date: April 2, 2026
The overall framework of DAUfomer, containing multi-granularity feature extraction, uncertainty-driven spatial-temporal aggregation, proactive semantic enhancement
Copyright The Author(s). This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are cited.
Download Citation: BibTeX | https://doi.org/10.6180/jase.202604_29(4).0009
Current action recognition research excessively relies on deterministic feature correlation mechanisms, struggling to address core challenges including spatiotemporal heterogeneity, action category ambiguity, and cross-frame semantic discontinuity prevalent in video stream data. To this end, this study proposes the Deep Robust Recognition Transformer (DAUfomer), to reconstruct action recognition paradigms through three synergistic modules. Multi-granularity feature extraction module employs Transformer with dual attentions to extract low-dimensional and high-information-density spatiotemporal features from high-dimensional video streams, preserving local motion details while establishing global contextual correlations. Uncertainty-driven spatial-temporal aggregation module innovatively constructs a hybrid Gaussian-Dirichlet distribution model, transforming deterministic spatiotemporal attention into a probabilistic learnable Bayesian network. This enables dynamic adaptation to data distribution shifts through latent space uncertainty quantification. Proactive semantic enhancement architecture breaks traditional causal constraints in temporal modeling by designing a bidirectional temporal distillation mechanism. It leverages latent semantic cues from future frames to construct cross-frame attention correlation graphs, enhancing current action features via gated recurrent unit-based spatiotemporal context refinement. Finally, extensive results on two real world datasets, especially on the MSDanceAction dataset with a 4.48% ACC improvement compared to the second-best result, verify that DAUfomer conducts a new standard baseline in the action recognition task.
Keywords: Multimedia action recognition; probability-driven aggregation; proactive semantic enhancement
- [1] Y. Zhong, L. Chen, C. Dan, and A. Rezaeipanah, (2022) “A systematic survey of data mining and big data analysis in internet of things” The Journal of Supercomputing 78(17): 18405–18453.
- [2] P. Li, J. Gao, J. Zhang, S. Jin, and Z. Chen, (2022) “Deep Reinforcement Clustering” IEEE Transactions on Multimedia: DOI: 10.1109/TMM.2022.3233249.
- [3] M. Korban, P. Youngs, and S. T. Acton, (2023) “A multi-modal transformer network for action detection” Pattern Recognition 142: 109713.
- [4] J. Gao, M. Liu, P. Li, J. Zhang, and Z. Chen, (2024) “Deep Multiview Adaptive Clustering With Semantic Invariance” IEEE Transactions on Neural Networks and Learning Systems 35(9): 12965–12978. DOI: 10.1109/TNNLS.2023.3265699.
- [5] J. Gao, M. Liu, P. Li, A. A. Laghari, A. R. Javed, N. Victor, and T. R. Gadekallu, (2023) “Deep Incomplete Multiview Clustering via Information Bottleneck for Pattern Mining of Data in Extreme-Environment IoT” IEEE Internet of Things Journal 11(16): 26700–26712. DOI: 10.1109/JIOT.2023.325272.
- [6] Z. Lin. “Dance movement recognition method based on convolutional neural network”. In: 2023 4th International Conference on Computer Vision, Image and Deep Learning (CVIDL). 2023, 255–258.
- [7] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard, and Q. V. Le. “Mnasnet: Platform-aware neural architecture search for mobile”. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019, 2820–2828.
- [8] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le. “Learning transferable architectures for scalable image recognition”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2018, 8697–8710.
- [9] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L.-J. Li, L. Fei-Fei, A. Yuille, J. Huang, and K. Murphy. “Progressive neural architecture search”. In: Proceedings of the European conference on computer vision (ECCV). 2018, 19–34.
- [10] J. Lee, M. Lee, D. Lee, and S. Lee. “Hierarchically decomposed graph convolutional networks for skeleton-based action recognition”. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023, 10444–10453.
- [11] J. Lee, M. Lee, D. Lee, and S. Lee. “Hierarchically decomposed graph convolutional networks for skeleton-based action recognition”. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023, 10444–10453.
- [12] F. Zhu and R. Zhu, (2021) “Dance Action Recognition and Pose Estimation Based on Deep Convolutional Neural Network.” Traitement Du Signal 38(2): DOI: 10.18280/ts.380233.
- [13] Z. Zhan, X. Mao, H. Liu, and S. Yu, (2025) “STGL: Self-Supervised Spatio-Temporal Graph Learning for Traffic Forecasting” Journal of Artificial Intelligence Research 2(1): 1–8. DOI: 10.70891/JAIR.2025.040001.
- [14] W. Zhang and J. Wang, (2024) “English Text Sentiment Analysis Network based on CNN and U-Net” Journal of Science and Engineering 1(1): 13–18. DOI: 10.70891/JSE.2024.100009.
- [15] B. Li, Y. Zhao, S. Zhelun, and L. Sheng. “Danceformer: Music conditioned 3d dance generation with parametric motion transformer”. In: Proceedings of the AAAI Conference on Artificial Intelligence. 36. 2. 2022, 1272–1279. DOI: 10.1609/aaai.v36i2.20014.
