Robust Audio Signal Enhancement Using Hybrid Spectral-Temporal Deep Learning Models in Noisy Environments
Keywords:
Audio signal enhancement, deep learning, CNN, Bi-GRU, spectral-temporal modeling, speech quality, noisy environments
Abstract
Audio signal enhancement is essential in many applications, from telecommunications to hearing-assistive devices operating in difficult noisy settings. In this paper, we propose a hybrid spectral-temporal deep learning model that combines convolutional and recurrent neural networks for robust audio signal enhancement. The model takes log-magnitude spectrograms as its spectral representation of the audio and captures temporal dependencies with bidirectional gated recurrent units (Bi-GRU). It uses a multi-stage architecture in which the CNN extracts spatial features from the spectrogram and the Bi-GRU models temporal continuity. Evaluated on the VoiceBank-DEMAND and TIMIT datasets, with noise added artificially at SNR levels of 0 dB, 5 dB, and 10 dB, the proposed model substantially improves signal quality, achieving PESQ scores of up to 3.21 and STOI improvements of up to 0.26 over classical and recent deep learning baselines. These results indicate that hybrid deep learning is effective in real-world noisy conditions.
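The evaluation relies on clean recordings corrupted with noise at fixed SNR levels. The abstract does not say how the mixing was done, so the following is only a minimal sketch of one common approach: scale the noise so that the ratio of clean power to noise power matches the target SNR, then add it to the clean signal. The function names (`power`, `mix_at_snr`, `measured_snr_db`), the 16 kHz sample rate, and the synthetic sine tone and Gaussian noise that stand in for speech and recorded noise are all assumptions made for illustration, not details taken from the paper.

```python
import math
import random


def power(signal):
    """Mean power (mean of squared samples) of a sequence of floats."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(clean, noise, snr_db):
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then add it.

    Hypothetical helper: returns the noisy mixture and the scaled noise.
    """
    if len(clean) != len(noise):
        raise ValueError("clean and noise must have the same length")
    noise_power = power(noise)
    if noise_power == 0.0:
        raise ValueError("noise must not be silent")
    # SNR_dB = 10 * log10(P_clean / P_noise)
    # => target P_noise = P_clean / 10^(SNR_dB / 10)
    target_noise_power = power(clean) / (10.0 ** (snr_db / 10.0))
    gain = math.sqrt(target_noise_power / noise_power)
    scaled = [gain * n for n in noise]
    noisy = [c + n for c, n in zip(clean, scaled)]
    return noisy, scaled


def measured_snr_db(clean, scaled_noise):
    """SNR in dB between a clean signal and the noise that was added to it."""
    return 10.0 * math.log10(power(clean) / power(scaled_noise))


# Synthetic stand-ins (assumptions): a 440 Hz tone for speech, Gaussian noise for the noise recording.
rng = random.Random(0)
sample_rate = 16000  # assumed sample rate, not stated in the abstract
clean = [math.sin(2 * math.pi * 440 * t / sample_rate) for t in range(sample_rate)]
noise = [rng.gauss(0.0, 1.0) for _ in range(sample_rate)]

# The three SNR levels named in the abstract.
for snr_db in (0, 5, 10):
    noisy, scaled = mix_at_snr(clean, noise, snr_db)
    print(snr_db, round(measured_snr_db(clean, scaled), 6))
```

In practice the clean and noise signals would be loaded from the datasets, and the noise segment would be cropped or looped to match the utterance length before mixing.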