Transformer-Based Architectures for Robust Speech Recognition and Natural Language Understanding in Noisy and Multilingual Environments
DOI: https://doi.org/10.17051/NJSAP/01.04.05

Keywords:
Transformer, Speech Recognition, Natural Language Understanding, Multilingual, Noise Robustness, Conformer, Self-Supervised Learning.

Abstract
Transformer architectures have substantially advanced automatic speech recognition (ASR) and natural language understanding (NLU), achieving state-of-the-art performance across diverse languages and difficult acoustic conditions. This paper examines how transformer variants, including Conformer models and self-supervised models such as wav2vec 2.0, have been adapted for robust speech processing in noisy and multilingual settings. Our configuration combines data augmentation, domain adaptation, and cross-lingual learning to improve generalization and noise robustness. Experiments on benchmark multilingual speech corpora and real-world noisy datasets show that transformer-based models significantly outperform conventional recurrent and convolutional neural networks, yielding lower word error rates (WER) and higher semantic accuracy. The results demonstrate the effectiveness of self-attention mechanisms and convolutional augmentations in capturing both long-range and local dependencies in the speech signal. Finally, the paper presents key open challenges and directions for future research, including low-latency inference techniques, model compression for edge deployment, and ethical concerns in multilingual speech and language applications. This thorough study can help advance efficient, high-quality, and scalable transformer-based speech and language systems that adapt well to real-world contexts.
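To make the interplay of self-attention (long-range context) and convolution (local patterns) concrete, the sketch below shows a minimal Conformer-style block in PyTorch. The module name, the hyperparameters (dim=256, heads=4, kernel=15), and the simplified two-sub-block layout (full Conformer blocks also include half-step feed-forward modules and relative positional encoding) are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ConformerBlockSketch(nn.Module):
    """Minimal sketch of a Conformer-style block: self-attention captures
    long-range dependencies; a depthwise convolution captures local ones.
    Hyperparameters are illustrative, not taken from the paper."""
    def __init__(self, dim=256, heads=4, kernel=15):
        super().__init__()
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv_norm = nn.LayerNorm(dim)
        # Pointwise expansion + GLU gate, then a depthwise conv over time.
        self.pointwise_in = nn.Conv1d(dim, 2 * dim, kernel_size=1)
        self.depthwise = nn.Conv1d(dim, dim, kernel_size=kernel,
                                   padding=kernel // 2, groups=dim)
        self.pointwise_out = nn.Conv1d(dim, dim, kernel_size=1)

    def forward(self, x):                      # x: (batch, time, dim)
        # Self-attention sub-block with residual connection.
        h = self.attn_norm(x)
        h, _ = self.attn(h, h, h)
        x = x + h
        # Convolution sub-block with residual connection.
        h = self.conv_norm(x).transpose(1, 2)  # (batch, dim, time) for Conv1d
        h = nn.functional.glu(self.pointwise_in(h), dim=1)
        h = self.pointwise_out(self.depthwise(h)).transpose(1, 2)
        return x + h

# Usage: a batch of 2 utterances, 100 frames of 256-dim acoustic features.
x = torch.randn(2, 100, 256)
print(ConformerBlockSketch()(x).shape)         # torch.Size([2, 100, 256])
```

The design choice this illustrates is the one the abstract credits for the gains: attention alone models distant context well but is comparatively weak on fine-grained local acoustic structure, which the depthwise convolution supplies at little extra cost.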