Multimodal Fusion Techniques for Emotion Recognition Using Audio, Visual, and Physiological Signals
DOI:
https://doi.org/10.17051/NJSIP/01.03.04

Keywords:
Multimodal Signal Processing, Time–Frequency Analysis, Feature Fusion, Adaptive Filtering, Emotion Recognition, Physiological Signal Processing, Audio-Visual Processing, Deep Learning, Affective Computing, Robust Classification

Abstract
Reliable emotion recognition depends on advanced signal and image processing techniques to acquire, synchronize, and merge heterogeneous multimodal data. Unimodal methods are often sensitive to noise, occlusions, or information loss in their single channel. This paper examines multimodal signal processing techniques that combine audio, visual, and physiological information using new feature extraction pipelines and adaptive fusion algorithms. The proposed hybrid deep-learning fusion framework integrates temporal and spatial representations by combining synchronized time-frequency features, statistical descriptors, and deep embeddings, with adaptive weighting operations that mitigate domain-specific noise. Experiments on benchmark datasets show that the framework achieves higher accuracy, F1-score, and robustness under realistic conditions than unimodal and conventional fusion methods. Through improved signal preprocessing, feature-level integration, and classifier decision fusion, this work contributes to scalable, real-time, noise-robust multimodal systems applicable to healthcare, adaptive interfaces, and affective computing.
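As a rough illustration of the adaptive-weighting idea described in the abstract, the minimal sketch below shows one common way to fuse per-modality embeddings with learned, input-dependent modality weights before classification. All names, dimensions, and the class count are hypothetical placeholders, not details taken from the paper.

```python
import torch
import torch.nn as nn

class AdaptiveWeightedFusion(nn.Module):
    """Hypothetical sketch: fuse audio, visual, and physiological embeddings
    with learned, per-sample modality weights (adaptive weighting)."""

    def __init__(self, dims=(128, 256, 64), fused_dim=128, n_classes=7):
        super().__init__()
        # Project each modality's embedding into a shared space.
        self.projections = nn.ModuleList([nn.Linear(d, fused_dim) for d in dims])
        # Gating network scores each modality from the concatenated input,
        # so noisy modalities can be down-weighted sample by sample.
        self.gate = nn.Linear(sum(dims), len(dims))
        self.classifier = nn.Linear(fused_dim, n_classes)

    def forward(self, audio, visual, physio):
        feats = [audio, visual, physio]
        # Per-sample modality weights, summing to 1 across modalities.
        weights = torch.softmax(self.gate(torch.cat(feats, dim=-1)), dim=-1)
        projected = torch.stack(
            [proj(f) for proj, f in zip(self.projections, feats)], dim=1
        )  # shape: (batch, n_modalities, fused_dim)
        fused = (weights.unsqueeze(-1) * projected).sum(dim=1)
        return self.classifier(fused)

# Example with random tensors standing in for precomputed modality embeddings.
model = AdaptiveWeightedFusion()
logits = model(torch.randn(4, 128), torch.randn(4, 256), torch.randn(4, 64))
print(logits.shape)  # torch.Size([4, 7])
```

The gating step is what makes the fusion "adaptive": a corrupted modality (e.g., an occluded face) can receive a small weight for that sample while the remaining modalities still drive the prediction.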