Transformer-Based End-to-End Speech Recognition for Noisy Real-World Environments

Authors

  • T.G. Zengeni, Dept. of Electrical Engineering, University of Zimbabwe, Harare, Zimbabwe
  • M.P. Bates, Dept. of Electrical Engineering, University of Zimbabwe, Harare, Zimbabwe

DOI:

https://doi.org/10.17051/NJSAP/01.04.01

Keywords:

Speech Recognition, Transformers, End-to-End Models, Noisy Environments, Self-Attention, Data Augmentation, Word Error Rate (WER), Robust ASR

Abstract

Recent advances in automatic speech recognition (ASR) have achieved strong performance in controlled environments, yet deployment in noisy real-world conditions remains difficult because acoustic interference, reverberation, and non-stationary background noise vary widely. This paper proposes a robust end-to-end speech recognition framework that leverages Transformer models to deliver accurate transcription under such challenging acoustic conditions. Unlike the widespread hybrid HMM-DNN and RNN-based systems, the proposed system employs a self-attention-based encoder-decoder architecture tailored to capture long-range dependencies and contextual interactions, both of which are essential for recognition in the presence of noise. To further improve robustness, the framework combines noise-robust pre-processing with extensive data augmentation, including spectrum augmentation and mixing with real-world noise sources during training. We evaluate the model extensively on standard noisy speech benchmarks such as CHiME-4 and Aurora-4 across a range of signal-to-noise ratios (SNRs) and acoustic conditions. The Transformer-based ASR system achieves relative word error rate (WER) reductions of up to 32% over state-of-the-art RNN and hybrid HMM-DNN baselines, and it maintains competitive accuracy even at low SNRs. Ablation experiments further show that data augmentation and the self-attention mechanism are both essential to the performance gains observed under adverse conditions.
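The abstract names two augmentation strategies, spectrum augmentation and mixing with real-world noise sources, without detailing the paper's implementation. As an illustration only, a minimal NumPy sketch of both techniques might look like the following (function names, masking parameters, and defaults are hypothetical, not taken from the paper):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Mix a noise clip into a speech waveform at a target SNR in dB.

    A generic sketch: tile/trim the noise to the speech length, then scale
    it so that 10*log10(speech_power / noise_power) equals snr_db.
    """
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12  # guard against silent noise
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

def spec_augment(spec, max_f=27, max_t=100, rng=None):
    """SpecAugment-style masking on a (n_mels, n_frames) log-mel spectrogram.

    Zeroes one random frequency band (width <= max_f) and one random
    time span (length <= max_t); hypothetical defaults.
    """
    rng = rng or np.random.default_rng()
    spec = spec.copy()
    n_mels, n_frames = spec.shape
    f = int(rng.integers(0, max_f + 1))          # frequency-mask width
    f0 = int(rng.integers(0, max(1, n_mels - f)))
    spec[f0 : f0 + f, :] = 0.0
    t = int(rng.integers(0, min(max_t, n_frames) + 1))  # time-mask length
    t0 = int(rng.integers(0, max(1, n_frames - t)))
    spec[:, t0 : t0 + t] = 0.0
    return spec
```

During training, each utterance would typically be mixed with a randomly chosen noise clip at a randomly sampled SNR before feature extraction, with the masking applied to the resulting spectrogram.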
These findings demonstrate the transformative potential of attention-based models for robust speech recognition and support their deployment in real-world settings such as smart assistants, mobile devices, and embedded systems. The study also opens a path for future research on lightweight Transformer variants and multi-modal speech recognition architectures, with the ultimate goal of enabling reliable, real-time speech understanding in increasingly dynamic and acoustically crowded environments.
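The reported gains are expressed as relative word error rate (WER) reductions. WER is the word-level Levenshtein (edit) distance between reference and hypothesis transcripts, normalized by the reference length; a relative reduction of 32% means (WER_base - WER_new) / WER_base = 0.32. A minimal, self-contained sketch of the metric (not the paper's scoring tool):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits to turn ref[:i] into hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(1, len(ref))
```

For example, one deleted word against a six-word reference yields a WER of 1/6, and relative improvement between two systems is computed from their absolute WERs as above.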


Published

2025-10-16

Section

Articles

How to Cite

[1]
T.G. Zengeni and M.P. Bates, “Transformer-Based End-to-End Speech Recognition for Noisy Real-World Environments”, National Journal of Speech and Audio Processing, pp. 1–8, Oct. 2025, doi: 10.17051/NJSAP/01.04.01.