Multimodal Audio–Visual Fusion for Enhanced Conversational AI and Human–Computer Interaction

Authors

  • Wai Cheng Lau, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Malaysia
  • H. Fratlin, Department of Electrical and Computer Engineering, Ben-Gurion University, Beer Sheva, Israel

DOI:

https://doi.org/10.17051/NJSAP/01.02.09

Keywords:

Multimodal Fusion, Audio–Visual Speech Recognition, Conversational AI, Human–Computer Interaction, Cross-Modal Attention.

Abstract

Multimodal learning, and in particular the fusion of audio and visual signals, holds considerable promise for transforming conversational artificial intelligence (AI) and human–computer interaction (HCI). Current conversational systems rely largely on audio-only pipelines for speech processing and natural language understanding, which limits their robustness in environments with distracting noise or highly dynamic visual conditions. In this paper we introduce a multimodal audio–visual fusion framework that jointly processes speech data and visual cues to improve speech perception, emotion recognition, and situational context understanding in interactive systems. The model combines Convolutional Neural Networks (CNNs) for visual feature extraction, a Transformer-based acoustic encoder for speech representation, and a cross-modal attention mechanism that dynamically fuses temporal and spatial information. The approach was evaluated on three benchmark datasets (GRID, CREMA-D, and LRS3) covering both synthetic and real-world settings. The results show a 17.3% reduction in Word Error Rate (WER) and a 12.8% improvement in emotion classification accuracy relative to unimodal baselines. In addition, the system remains resilient to acoustic interference and visual occlusions, delivering robust performance across a wide range of scenarios. These findings suggest that the proposed framework can be deployed in forthcoming conversational systems such as virtual assistants, telepresence robots, and assistive technologies, and that its "plug-and-play" design can serve as a scalable backbone for future multimodal AI.
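For readers who want a concrete picture of the architecture outlined above, the sketch below illustrates the general pattern in PyTorch: a CNN visual branch, a Transformer-based acoustic encoder, and a cross-modal attention layer in which the audio stream attends to the visual frame sequence. The layer sizes, input shapes, and module names are illustrative assumptions chosen for exposition; they do not reproduce the authors' implementation.

# Minimal sketch of CNN visual features + Transformer acoustic encoder
# + cross-modal attention fusion. Dimensions are assumed, not the paper's.
import torch
import torch.nn as nn

class AudioVisualFusion(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        # Visual branch: small CNN over per-frame crops (assumed 64x64 grayscale).
        self.visual_cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, d_model),
        )
        # Acoustic branch: Transformer encoder over projected audio features
        # (assumed 80-dimensional filterbank frames).
        self.audio_proj = nn.Linear(80, d_model)
        self.audio_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True),
            num_layers=2,
        )
        # Cross-modal attention: audio time steps query the visual frame sequence.
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.fusion_head = nn.Linear(2 * d_model, d_model)

    def forward(self, audio_feats, video_frames):
        # audio_feats: (B, T_audio, 80); video_frames: (B, T_video, 1, 64, 64)
        b, t_v = video_frames.shape[:2]
        vis = self.visual_cnn(video_frames.flatten(0, 1)).view(b, t_v, -1)
        aud = self.audio_encoder(self.audio_proj(audio_feats))
        attended, _ = self.cross_attn(query=aud, key=vis, value=vis)
        # Concatenate each audio step with its visually attended context.
        return self.fusion_head(torch.cat([aud, attended], dim=-1))

if __name__ == "__main__":
    model = AudioVisualFusion()
    fused = model(torch.randn(2, 100, 80), torch.randn(2, 25, 1, 64, 64))
    print(fused.shape)  # torch.Size([2, 100, 256])

The fused representation could then feed task-specific heads (speech recognition, emotion classification); how the paper's framework actually structures those heads is not specified on this page.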

Published

2025-04-18

Section

Articles

How to Cite

[1] Wai Cheng Lau and H. Fratlin, "Multimodal Audio–Visual Fusion for Enhanced Conversational AI and Human–Computer Interaction", National Journal of Speech and Audio Processing, pp. 68–73, Apr. 2025, doi: 10.17051/NJSAP/01.02.09.