Data-Efficient Audio Event Detection with Few-Shot Learning Paradigms
DOI:
https://doi.org/10.17051/NJSAP/01.03.07
Keywords:
Audio Event Detection (AED); Few-Shot Learning (FSL); Prototypical Networks; Data-Efficient Learning; Log-Mel Spectrogram; Metric-Based Learning; Self-Supervised Audio; Environmental Sound Classification; Low-Resource Audio Recognition; Meta-Learning.
Abstract
Audio Event Detection (AED) is a critical component of intelligent systems across applications such as public surveillance, smart home automation, and assistive technologies. Conventional AED systems depend on supervised deep learning models that require large amounts of labeled training data, which is often infeasible because annotating audio is labour-intensive and time-consuming. To address this problem, this paper presents a new data-efficient AED framework based on Few-Shot Learning (FSL) paradigms that enables effective detection with a minimal amount of annotated data. The proposed system adopts a metric-based methodology built on convolutional prototypical networks trained via episodic learning, allowing it to learn a generalised embedding space from only a few examples per class. Data augmentation (pitch shifting, noise injection, and time stretching) increases training variability and robustness, while transfer learning initialises the model with semantic prior knowledge from pre-trained audio feature extractor networks. To ensure flexibility and scalability, optimization-based FSL approaches are also investigated, calibrating the model to new classes with minimal gradient updates. The model is evaluated on two popular benchmark datasets, ESC-50 and UrbanSound8K, under both 1-shot and 5-shot classification settings. Experiments demonstrate that the proposed technique substantially outperforms established AED baselines and recent meta-learning models, achieving over 74% accuracy in the 5-shot scenario, a significant improvement over state-of-the-art baselines with fewer than 20 examples per class. t-SNE visualizations show clear class-wise separation in the embedding space, confirming that the model discriminates between a wide variety of audio events. This paper demonstrates how FSL can reduce data dependence in AED tasks and thereby enable reliable, adaptive audio recognition in real-world low-resource settings. The proposed methodology offers a framework for scalable AED solutions capable of recognising rare, novel, and underrepresented audio events given limited labeled information.
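To make the metric-based core of the abstract concrete, the following minimal sketch computes class prototypes as the mean of support-set embeddings and scores queries by negative squared Euclidean distance to each prototype, trained episodically. The `ConvEncoder` architecture, tensor shapes, and the 5-way 1-shot configuration are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvEncoder(nn.Module):
    """Small CNN mapping a log-mel spectrogram (1 x mels x frames) to an embedding."""
    def __init__(self, embed_dim: int = 64):
        super().__init__()
        def block(c_in, c_out):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(),
                nn.MaxPool2d(2),
            )
        self.features = nn.Sequential(block(1, 64), block(64, 64), block(64, embed_dim))
        self.pool = nn.AdaptiveAvgPool2d(1)  # global pooling gives a fixed-size embedding

    def forward(self, x):
        return self.pool(self.features(x)).flatten(1)

def episode_loss(encoder, support, query, n_way, k_shot):
    """One episodic step for an N-way K-shot task.

    support: (n_way * k_shot, 1, mels, frames), samples grouped by class
    query:   (n_way * n_query, 1, mels, frames), grouped the same way
    """
    z_s = encoder(support)                             # (N*K, D)
    z_q = encoder(query)                               # (N*Q, D)
    protos = z_s.view(n_way, k_shot, -1).mean(dim=1)   # (N, D): per-class prototypes
    logits = -torch.cdist(z_q, protos).pow(2)          # nearer prototype -> larger logit
    labels = torch.arange(n_way).repeat_interleave(z_q.size(0) // n_way)
    return F.cross_entropy(logits, labels)

# Toy 5-way 1-shot episode on random stand-ins for log-mel inputs.
encoder = ConvEncoder()
support = torch.randn(5 * 1, 1, 64, 128)
query = torch.randn(5 * 4, 1, 64, 128)
loss = episode_loss(encoder, support, query, n_way=5, k_shot=1)
loss.backward()  # in training, an optimizer step follows each sampled episode
```

In a full pipeline, each training iteration would sample a fresh N-way episode from the base classes so the embedding space generalises to unseen event categories.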
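The waveform-level augmentations named in the abstract (pitch shifting, noise injection, time stretching) can be sketched with librosa, followed by the log-mel front end. The parameter ranges below (±2 semitones, roughly 20 dB SNR noise, 0.9x to 1.1x stretch) and the 64-band mel setting are assumptions for illustration, not values reported by the paper.

```python
import numpy as np
import librosa

def augment(y: np.ndarray, sr: int, rng: np.random.Generator) -> np.ndarray:
    """Apply one randomly chosen augmentation to a mono waveform."""
    choice = rng.integers(3)
    if choice == 0:
        # Pitch shift by up to +/- 2 semitones.
        y = librosa.effects.pitch_shift(y, sr=sr, n_steps=rng.uniform(-2, 2))
    elif choice == 1:
        # Additive Gaussian noise at roughly 20 dB SNR.
        noise = rng.standard_normal(len(y))
        y = y + noise * (np.std(y) / (10 ** (20 / 20)))
    else:
        # Time-stretch between 0.9x and 1.1x speed.
        y = librosa.effects.time_stretch(y, rate=rng.uniform(0.9, 1.1))
    return y

def log_mel(y: np.ndarray, sr: int, n_mels: int = 64) -> np.ndarray:
    """Log-mel spectrogram, the input representation used throughout."""
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel, ref=np.max)

rng = np.random.default_rng(0)
y, sr = librosa.load(librosa.ex("trumpet"), sr=22050)  # librosa's bundled example clip
feat = log_mel(augment(y, sr, rng), sr)
```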
Published
2025-05-18
Section
Articles
How to Cite
[1]
Charpe Prasanjeet Prabhakar and Gaurav Tamrakar, “Data-Efficient Audio Event Detection with Few-Shot Learning Paradigms”, National Journal of Speech and Audio Processing, pp. 54–61, May 2025, doi: 10.17051/NJSAP/01.03.07.