Spatiotemporal Transformer Networks for Real-Time Video-Based Anomaly Detection in Smart City Surveillance
DOI: https://doi.org/10.17051/NJSIP/01.04.06

Keywords: Video Signal Processing, Spatiotemporal Transformer, Real-Time Anomaly Detection, Smart City Surveillance, Edge Computing, Self-Attention, Deep Learning, Video Analytics

Abstract
The rapid growth of smart city infrastructure has produced widespread, continuous video streams from densely deployed surveillance systems, making effective real-time anomaly detection critically important. This work develops a new Spatiotemporal Transformer Network (STTN) that combines modern signal and image processing techniques with transformer-based deep learning to improve the efficiency of urban safety monitoring. The proposed architecture uses spatiotemporal patch embedding, multi-head self-attention, and a hybrid scheme of supervised learning and self-training to capture intricate inter-frame dependencies at low latency. In signal processing terms, the model performs patch-wise feature extraction, temporal sequence modeling, and sliding-window inference over continuous video streams. Experimental validation on the UCSD Pedestrian and ShanghaiTech Campus benchmarks and a proprietary SmartCitySurv dataset demonstrates state-of-the-art accuracy (AUC of 96.7% on UCSD Pedestrian and 90.3% on ShanghaiTech Campus), lower false-alarm rates, and inference speeds suitable for edge deployment. Attention-based visualizations further improve interpretability and can support human-in-the-loop decision-making. This work connects advanced deep architectures with fundamental video signal processing, delivering a scalable, interpretable, and real-time approach to next-generation smart city video surveillance.
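To make the abstract's pipeline concrete, the following is a minimal PyTorch sketch of the three components it names: spatiotemporal patch embedding, multi-head self-attention over space-time tokens, and sliding-window inference over a buffered stream. This is not the authors' implementation; the module names (SpatiotemporalPatchEmbed, STTNSketch), clip length, patch size, embedding dimension, and scoring head are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpatiotemporalPatchEmbed(nn.Module):
    """Splits each frame into non-overlapping patches, projects them to
    embedding vectors, and adds a learned positional embedding that covers
    both spatial location and temporal index (all sizes are assumed)."""
    def __init__(self, frames=8, img_size=224, patch=16, dim=256):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        n_patches = (img_size // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, frames * n_patches, dim))

    def forward(self, clip):                      # clip: (B, T, 3, H, W)
        b = clip.size(0)
        x = self.proj(clip.flatten(0, 1))         # (B*T, dim, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)          # (B*T, N, dim)
        x = x.reshape(b, -1, x.size(-1))          # (B, T*N, dim)
        return x + self.pos

class STTNSketch(nn.Module):
    """Hypothetical STTN: patch embedding, a transformer encoder whose
    multi-head self-attention spans all space-time tokens, and a small
    head that maps the pooled representation to an anomaly score."""
    def __init__(self, dim=256, heads=8, depth=4):
        super().__init__()
        self.embed = SpatiotemporalPatchEmbed(dim=dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, 1)

    def forward(self, clip):
        tokens = self.encoder(self.embed(clip))   # attention across space-time
        return torch.sigmoid(self.head(tokens.mean(dim=1)))  # (B, 1) score

# Sliding-window inference over a buffered stream (window 8, stride 4):
model = STTNSketch().eval()
stream = torch.randn(1, 64, 3, 224, 224)          # 64 buffered RGB frames
with torch.no_grad():
    scores = [model(stream[:, s:s + 8]).item()
              for s in range(0, 64 - 8 + 1, 4)]   # one score per window
```

Overlapping windows with a stride smaller than the window length, as sketched above, are one common way to keep per-frame latency low while still giving every frame temporal context; the stride of 4 here is an assumption, not a value reported in the paper.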