Multi-Task Deep Neural Network for Simultaneous Audio Event Detection and Localization in Smart Surveillance Systems

Authors

  • Md. Abbas, Faculty of Engineering, Ain Shams University & Arab Academy for Science and Technology, Cairo, Egypt
  • Andrés Revera, Facultad de Ingeniería, Universidad Andrés Bello, Santiago, Chile

DOI:

https://doi.org/10.17051/NJSIP/01.02.10

Keywords:

Multi-task deep neural network (MT-DNN), audio event detection (AED), sound source localization (SSL), smart surveillance systems, spectral–temporal feature extraction, attention mechanisms in audio processing.

Abstract

Smart surveillance systems with embedded audio intelligence provide substantial situational awareness in settings where visual observation is impractical because of occluded views, low-light conditions, or privacy requirements. This paper proposes a new Multi-Task Deep Neural Network (MT-DNN) architecture that performs two tasks simultaneously, Audio Event Detection (AED) and Sound Source Localization (SSL), on multi-channel audio recordings. In contrast to traditional single-task models, which use separate pipelines for detection and localization, the proposed architecture combines a shared spectral–spatial encoder with task-specific attention-enhanced heads, enabling efficient feature reuse and improving cross-task generalization. The encoder uses convolutional layers to extract local spectral patterns, bidirectional recurrent units to model temporal context, and attention mechanisms to selectively weight discriminative features. AED is treated as a multi-class classification task, while SSL is formulated as a regression task that predicts source azimuth and elevation angles; both are jointly optimized with a weighted composite loss. A thorough evaluation on the UrbanSound8K dataset for AED and the TAU Spatial Sound Events 2021 dataset for SSL shows that the MT-DNN achieves an AED accuracy of 93.1 percent and an SSL mean angular error of 4.2 degrees, improvements of 6 percent and 12 percent, respectively, over comparable single-task baselines. Furthermore, the model uses 25 percent fewer parameters and has lower inference latency, making it suitable for real-time edge deployment on embedded surveillance devices. These results highlight the potential of multi-task learning for building resource-efficient multimodal surveillance systems, and leave room for future integration with vision-based analytics for richer event understanding.
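The architecture described above can be sketched in PyTorch as a shared convolutional-recurrent encoder with attention pooling feeding two task heads, trained with a weighted composite loss. This is an illustrative reconstruction from the abstract only: all layer sizes, the choice of GRU units, the attention-pooling form, and the loss weights `alpha`/`beta` are assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class MTDNN(nn.Module):
    """Hypothetical MT-DNN sketch: shared spectral-spatial encoder,
    AED classification head, SSL azimuth/elevation regression head."""

    def __init__(self, n_channels=4, n_mels=64, n_classes=10):
        super().__init__()
        # Shared encoder: conv layers extract local spectral patterns
        self.conv = nn.Sequential(
            nn.Conv2d(n_channels, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d((2, 1)),  # pool frequency axis, keep time resolution
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        feat = 64 * (n_mels // 4)
        # Bidirectional recurrent units model temporal context
        self.rnn = nn.GRU(feat, 128, batch_first=True, bidirectional=True)
        # Simple attention pooling selectively weights time frames
        self.attn = nn.Linear(256, 1)
        # Task-specific heads
        self.aed_head = nn.Linear(256, n_classes)  # multi-class AED
        self.ssl_head = nn.Linear(256, 2)          # azimuth, elevation regression

    def forward(self, x):                  # x: (batch, channels, mels, frames)
        h = self.conv(x)                   # (batch, 64, mels//4, frames)
        b, c, f, t = h.shape
        h = h.permute(0, 3, 1, 2).reshape(b, t, c * f)
        h, _ = self.rnn(h)                 # (batch, frames, 256)
        w = torch.softmax(self.attn(h), dim=1)
        z = (w * h).sum(dim=1)             # attention-pooled embedding
        return self.aed_head(z), self.ssl_head(z)

def composite_loss(aed_logits, aed_target, ssl_pred, ssl_target,
                   alpha=1.0, beta=0.5):
    # Weighted composite loss: cross-entropy for detection + MSE for angles;
    # the weights alpha and beta are illustrative assumptions.
    ce = nn.functional.cross_entropy(aed_logits, aed_target)
    mse = nn.functional.mse_loss(ssl_pred, ssl_target)
    return alpha * ce + beta * mse
```

Sharing the encoder is what yields the parameter savings the abstract reports: both heads read the same pooled embedding, so only the two small linear heads are task-specific.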

Published

2025-02-15

Section

Articles