Deep Learning-Driven Speech and Audio Processing: Advances in Noise Reduction, Speech Enhancement, and Real-Time Voice Analytics
DOI:
https://doi.org/10.17051/NJSAP/01.04.02
Keywords:
Deep Learning, Speech Processing, Noise Reduction, Speech Enhancement, Real-Time Voice Analytics, Embedded AI, Audio Signal Processing.
Abstract
Rapid progress in deep learning has transformed speech and audio processing. Deep models now deliver substantial noise reduction, improved speech enhancement, and real-time voice analytics in applications such as smart assistants, hearing aids, web-based call centers, and teleconferencing systems. This paper examines recent advances in algorithms and deployment methods that address both accuracy and efficiency. Current architectures, including convolutional neural networks (CNNs), recurrent neural networks (RNNs), transformer-based models, and self-supervised learning frameworks, are discussed in terms of their ability to handle non-stationary noise, preserve speech intelligibility, and support low-latency processing. The proposed evaluation approach uses benchmark datasets such as CHiME, LibriSpeech, and VoiceBank-DEMAND to measure signal-to-noise ratio (SNR) gain, perceptual evaluation of speech quality (PESQ) scores, and word error rate (WER) reduction. Experimental results show that transformer-based enhancement models achieve up to +11.5 dB SNR improvement and significant PESQ gains on embedded platforms under a processing latency requirement of less than 50 ms. The paper also investigates quantization and pruning as ways to obtain lightweight models that can be deployed in low-resource settings. It closes with a discussion of open problems, including domain adaptation, robustness across multilingual corpora, and ethical considerations, and outlines future research directions for closing the gap between laboratory results and real-world AI-based speech and audio systems.
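As a reading aid for the SNR figures quoted above, the following minimal sketch shows one common way to compute SNR gain: the SNR of the enhanced signal minus the SNR of the noisy input, both measured against the clean reference. The paper's exact evaluation protocol is not stated in the abstract, so this definition, the function names `snr_db` and `snr_gain_db`, and the toy signals are all assumptions made for illustration.

```python
import math


def snr_db(clean, estimate):
    """SNR in dB of `estimate`, treating (clean - estimate) as the noise."""
    signal_power = sum(s * s for s in clean)
    noise_power = sum((s - e) ** 2 for s, e in zip(clean, estimate))
    if noise_power == 0:
        return math.inf  # estimate matches the reference exactly
    return 10.0 * math.log10(signal_power / noise_power)


def snr_gain_db(clean, noisy, enhanced):
    """SNR improvement in dB: output SNR minus input SNR (assumed definition)."""
    return snr_db(clean, enhanced) - snr_db(clean, noisy)


# Toy data (hypothetical): a sine wave with an alternating-sign disturbance.
clean = [math.sin(2 * math.pi * 5 * n / 100) for n in range(100)]
noisy = [s + 0.5 * (-1) ** n for n, s in enumerate(clean)]
# Stand-in for an enhancement model that cuts the disturbance amplitude tenfold.
enhanced = [s + 0.05 * (-1) ** n for n, s in enumerate(clean)]

gain = snr_gain_db(clean, noisy, enhanced)
print(f"SNR gain: {gain:.2f} dB")
```

In this toy case the residual noise amplitude drops by a factor of 10, so its power drops by a factor of 100, which corresponds to a gain of about 20 dB. Real evaluations would compute the same quantity over time-aligned speech recordings rather than synthetic signals.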