Multimodal Emotion Recognition for Human–Robot Interaction Using Speech, Facial Dynamics, and Physiological Signals
DOI:
https://doi.org/10.17051/NJSAP/01.03.02

Keywords:
Multimodal emotion recognition, Human–Robot Interaction (HRI), Speech emotion analysis, Facial dynamics, Physiological signal processing, Cross-attention fusion, Deep learning.

Abstract
Speech emotion recognition is an important research area that contributes significantly to socially intelligent Human–Robot Interaction (HRI), since vocal communication carries rich paralinguistic information that supplements semantic content. However, speech-based systems can perform poorly in real-world conditions because of background noise, microphone variability, and differences in speaking style. To address these shortcomings, we introduce a multimodal emotion recognition system in which speech processing forms the core and is complemented by facial dynamics and physiological measures to increase reliability and precision. The speech channel uses a CNN-BiLSTM pipeline to extract spectral-temporal and prosodic features from Mel-spectrograms, which retain strong discriminative power even in noisy environments. A 3D-CNN analyzes facial expressions, while Electrodermal Activity (EDA), Electrocardiogram (ECG), and Photoplethysmography (PPG) signals are modeled with a Temporal Convolutional Network (TCN). These modalities are aligned and integrated by a Transformer-based cross-attention fusion mechanism that exploits their complementary strengths to offset individual weaknesses. Experiments on the IEMOCAP, SEMAINE, and AMIGOS datasets show an improvement of 7–12 points in weighted F1-score over unimodal baselines in speech-based HRI scenarios, with and without noise, occlusion, or missing modalities, attesting to the usefulness of the approach for emotion-sensitive HRI.
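
To make the fusion stage concrete, the sketch below illustrates one plausible reading of the Transformer-based cross-attention described above, with speech as the query stream and the facial and physiological streams providing keys and values. It is a minimal illustration, not the authors' implementation: the embedding size, head count, pooling, and four-class output head are assumptions, and the three encoders (CNN-BiLSTM, 3D-CNN, TCN) are stood in for by random token sequences.

```python
# Minimal sketch (assumed details, not the paper's code) of a cross-attention
# fusion block over three modality streams, implemented with PyTorch.
import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_classes=4):
        super().__init__()
        # Speech acts as the query; face and physiology supply keys/values,
        # mirroring the speech-centric design described in the abstract.
        self.attn_face = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_phys = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)
        self.classifier = nn.Sequential(
            nn.Linear(3 * d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, n_classes),
        )

    def forward(self, speech, face, phys):
        # Each input: (batch, seq_len, d_model) token sequence from its encoder.
        s2f, _ = self.attn_face(speech, face, face)  # speech attends to face
        s2p, _ = self.attn_phys(speech, phys, phys)  # speech attends to physiology
        fused = torch.cat([
            self.norm(speech).mean(dim=1),  # pooled speech representation
            self.norm(s2f).mean(dim=1),     # speech-to-face attended features
            self.norm(s2p).mean(dim=1),     # speech-to-physiology attended features
        ], dim=-1)
        return self.classifier(fused)

# Example forward pass with random embeddings standing in for encoder outputs.
model = CrossAttentionFusion()
speech = torch.randn(8, 50, 256)    # e.g. CNN-BiLSTM outputs over Mel frames
face = torch.randn(8, 30, 256)      # e.g. 3D-CNN clip embeddings
phys = torch.randn(8, 100, 256)     # e.g. TCN outputs over EDA/ECG/PPG windows
logits = model(speech, face, phys)  # (8, n_classes)
```

Because the attended features are pooled and concatenated rather than replacing the speech stream, a missing or degraded modality degrades only part of the fused representation, which is consistent with the robustness to occlusion and lost modalities reported above.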