Explainable Artificial Intelligence in Speech and Audio Processing: Enhancing Interpretability, Fostering Trust, and Addressing Deployment Challenges
DOI:
https://doi.org/10.17051/NJSAP/01.03.06

Keywords:
Explainable AI, Speech Recognition, Audio Processing, Interpretability, Trust, Model Explainability, Deployment Challenges

Abstract
The rapid adoption of deep learning in speech and audio processing has delivered significant performance improvements across applications such as automatic speech recognition (ASR), speaker verification, emotion recognition, and audio event detection. Nevertheless, the opaque, black-box nature of state-of-the-art models poses serious problems for transparency, interpretability, and user trust, especially in safety-critical and privacy-sensitive domains. Explainable Artificial Intelligence (XAI) offers a way forward by providing insight into the inner workings of these systems. This paper reviews existing XAI techniques applied to speech and audio processing, classifies them into model-specific and model-agnostic approaches, and discusses widely accepted metrics for evaluating interpretability. We investigate how explainability can increase trustworthiness, support regulatory compliance, assist system debugging, and reduce bias. Deployment challenges, including real-time interpretability under computational constraints, cross-lingual robustness, and human-machine communication, are critically assessed. We also identify open research gaps, including low-latency explanation generation, multimodal explainability, and privacy-preserving explanation mechanisms. Lastly, we lay out a roadmap for integrating XAI into next-generation speech and audio systems and for promoting the deployment of responsible, transparent, and trusted AI in both commercial and mission-critical scenarios. Three case studies demonstrate the effectiveness of XAI in detecting bias, validating feature explanations, and diagnosing errors in ASR, emotion recognition, and audio event classification.