Real-time video violence detection using shallow convolutional neural networks

2026

2026-05-24

Nanfei Jiang

Investigation Department, Beijing Police College, Beijing, 102202, China

Received: January 2, 2026
Accepted: March 13, 2026
Publication Date: May 24, 2026

The suggested method’s detailed sequence for detecting video violence

Copyright The Author(s). This is an open access article distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are cited.

DOI: http://dx.doi.org/10.6180/jase.202609_32.053


Detecting violent content in real-time video has become vital as digital content management and public safety take on the highest priority. This research proposes a novel and effective hybrid framework, the Shallow Lightweight Convolutional Attention-based Unified Temporal Network (SLCAUT-Net), designed to accurately identify violent actions in video streams while sustaining real-time processing capability. To extract spatial features, SLCAUT-Net uses a shallow convolutional neural network (SCNN) backbone, combined with motion-based features derived from optical flow and refined through a temporal attention mechanism. The proposed architecture uses a minimal number of convolutional layers to ensure fast inference while still capturing the temporal dependencies and motion patterns that are critical to distinguishing violent from non-violent behavior. A dataset from Kaggle, consisting of video clips of real violent and non-violent scenarios, is used for training and evaluation. Frame differencing, data augmentation, and lightweight attention modules are employed to increase robustness and reduce overfitting. Experimental assessments show that SLCAUT-Net attains competitive accuracy (97%) while operating efficiently on low-resource devices with low latency. Combining temporal attention and motion cues with a shallow framework offers a novel solution to the challenges of violence detection in dynamic, unconstrained conditions. The research highlights the potential of hybrid shallow networks in real-time video surveillance and safety-critical applications.

Keywords: SLCAUT-Net, violence detection, shallow convolutional neural networks, real-time video analysis, computer vision, temporal attention.
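
Two of the ingredients named in the abstract, frame differencing and temporal attention, can be illustrated with a minimal NumPy sketch. This is not the SLCAUT-Net implementation: the function names, the flattened motion maps standing in for SCNN features, and the scoring vector `w` (which would be learned in practice) are all assumptions made for illustration.

```python
import numpy as np

def frame_differences(frames):
    """Absolute difference between consecutive grayscale frames.

    frames: array of shape (T, H, W); returns an array of shape (T-1, H, W).
    """
    frames = np.asarray(frames, dtype=np.float32)
    return np.abs(frames[1:] - frames[:-1])

def temporal_attention_pool(features, w):
    """Softmax-weighted average of per-frame feature vectors.

    features: (T, D) per-frame features.
    w: (D,) scoring vector (hypothetical; a trained model would learn it).
    Returns the pooled (D,) feature and the (T,) attention weights.
    """
    scores = features @ w
    scores = scores - scores.max()      # subtract max for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum()   # softmax over the time axis
    return weights @ features, weights

# Toy usage on a random 8-frame clip of 16x16 pixels (synthetic data).
rng = np.random.default_rng(0)
clip = rng.random((8, 16, 16))
motion = frame_differences(clip)            # shape (7, 16, 16)
feats = motion.reshape(len(motion), -1)     # stand-in for SCNN features, (7, 256)
w = rng.standard_normal(feats.shape[1])     # assumed scoring vector
pooled, attn = temporal_attention_pool(feats, w)
```

In this sketch the attention weights let frames with stronger scored motion dominate the clip-level feature, which is the general idea behind weighting frames by temporal relevance before classification.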
