Multimodal Audio–Visual Fusion for Enhanced Conversational AI and Human–Computer Interaction
DOI: https://doi.org/10.17051/NJSAP/01.02.09

Keywords: Multimodal Fusion, Audio–Visual Speech Recognition, Conversational AI, Human–Computer Interaction, Cross-Modal Attention

Abstract
Multimodal learning, and in particular the combination of audio and visual signals, holds strong promise for transforming conversational artificial intelligence (AI) and human–computer interaction (HCI). Current conversational systems generally rely on audio-only pipelines for speech processing and natural language understanding, which lack robustness in environments with distracting noise and highly dynamic visual conditions. In this paper we introduce a multimodal audio–visual fusion framework that integrates speech and visual cues in a common architecture to improve speech perception, emotion recognition, and situational context understanding in interactive systems. The model combines Convolutional Neural Network (CNN)-based feature extraction for visual information, a Transformer-based acoustic encoder for speech representation, and a cross-modal attention mechanism that fuses temporal and spatial information dynamically. The approach was evaluated on three benchmark datasets (GRID, CREMA-D, and LRS3) covering both controlled and real-world settings. The results show a 17.3% reduction in Word Error Rate (WER) and a 12.8% improvement in emotion classification accuracy relative to unimodal baselines. In addition, the system remains resilient to acoustic interference and visual occlusions, delivering robust performance across a wide range of scenarios. These findings suggest that the proposed framework can be deployed in forthcoming conversational systems such as virtual assistants, telepresence robots, and assistive technologies, and that its plug-and-play design makes it a scalable backbone for future multimodal AI.
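To make the described architecture concrete, the following is a minimal PyTorch sketch of the fusion pattern the abstract outlines: a CNN over per-frame visual crops, a Transformer encoder over acoustic features, and a cross-modal attention layer in which audio tokens attend to visual tokens. All layer sizes, the class count, and the residual fusion step are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class CrossModalFusion(nn.Module):
    """Sketch of audio-visual fusion: audio queries attend to visual keys/values."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, n_classes: int = 40):
        super().__init__()
        # Acoustic encoder: small Transformer stack over audio frame features.
        enc_layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                               batch_first=True)
        self.audio_encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Visual encoder: lightweight CNN over per-frame face/lip crops (illustrative).
        self.visual_cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, d_model),
        )
        # Cross-modal attention: audio as queries, visual tokens as keys/values.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.classifier = nn.Linear(d_model, n_classes)  # n_classes is hypothetical

    def forward(self, audio_feats, video_frames):
        # audio_feats: (B, T_a, d_model) pre-extracted acoustic features
        # video_frames: (B, T_v, 3, H, W) cropped video frames
        B, T_v = video_frames.shape[:2]
        a = self.audio_encoder(audio_feats)                  # (B, T_a, d)
        v = self.visual_cnn(video_frames.flatten(0, 1))      # (B*T_v, d)
        v = v.view(B, T_v, -1)                               # (B, T_v, d)
        fused, _ = self.cross_attn(query=a, key=v, value=v)  # (B, T_a, d)
        return self.classifier(fused + a)                    # residual fusion

# Usage with random tensors; shapes are illustrative only.
model = CrossModalFusion()
audio = torch.randn(2, 100, 256)        # 2 clips, 100 acoustic frames
video = torch.randn(2, 25, 3, 64, 64)   # 2 clips, 25 video frames
logits = model(audio, video)            # (2, 100, 40)
```

In this sketch the audio stream drives the queries so that visual evidence reweights the acoustic representation frame by frame, which is one common way to realize the cross-modal attention fusion described above.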