Lip-Reading-Guided Speech Enhancement via Self-Aligning Cross-Attention Networks
DOI: https://doi.org/10.17051/NJSAP/01.04.03

Keywords: Lip-reading, Speech enhancement, Audio-visual fusion, Self-aligning cross-attention networks, Multimodal speech processing, Temporal alignment, Deep learning, Noise-robust ASR

Abstract
Speech enhancement in noise is of great importance for communication in real-world scenarios such as teleconferencing, hearing aids, and automatic speech recognition (ASR). Although recent audio-visual speech processing methods have shown that visual information from a speaker's lip movements can substantially improve noise robustness, temporal misalignment between the audio and video streams remains a key performance bottleneck. In this paper, a novel Lip-Reading-Guided Speech Enhancement architecture is presented, based on Self-Aligning Cross-Attention Networks (SACAN), which dynamically synchronizes and fuses multimodal features to recover clearer speech. The visual stream is processed by a spatio-temporal convolutional encoder to capture discriminative lip-movement features, while the audio stream is encoded as log-mel spectrograms to obtain time-frequency representations. These features are aligned frame-wise by a bidirectional self-aligning cross-attention mechanism that mitigates distortions caused by latency and articulation mismatches between the modalities. A U-Net-based enhancement network decodes the fused representation into a clean speech spectrogram, which is then reconstructed into a waveform via the inverse short-time Fourier transform. Experiments are conducted on the GRID and LRS3-TED datasets under three realistic noise conditions (babble, street, and cafe) at various signal-to-noise ratio (SNR) levels. Quantitative evaluations show that SACAN achieves a PESQ gain of 0.41, a STOI gain of 0.05, and a 17.3% reduction in word error rate (WER) over state-of-the-art audio-only enhancement baselines. Subjective listening tests further confirm improved speech naturalness and intelligibility. The results demonstrate the value of cross-modal temporal alignment for robust multimodal speech enhancement and its feasibility for real-time deployment in adverse communication conditions.
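The abstract does not give implementation details, so the following is only a minimal sketch of how the bidirectional self-aligning cross-attention fusion described above might be realized. It assumes PyTorch, and the feature dimensions, layer configuration, resampling step, and fusion layer are illustrative assumptions rather than the authors' design.

```python
# Minimal sketch (not the authors' code): bidirectional cross-attention fusion of
# audio and visual feature streams, assuming PyTorch. Dimensions and the final
# fusion/resampling choices are illustrative assumptions.
import torch
import torch.nn as nn


class BidirectionalCrossAttentionFusion(nn.Module):
    """Frame-wise alignment of audio and visual features via cross-attention.

    Each stream attends to the other, so small temporal offsets between the
    modalities are absorbed by the attention weights rather than requiring
    an explicit hard alignment.
    """

    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        # Audio frames query the visual sequence, and vice versa.
        self.audio_to_video = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.video_to_audio = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(d_model)
        self.norm_v = nn.LayerNorm(d_model)
        # Fuse the audio stream with visually aligned context for a downstream decoder.
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, audio_feats: torch.Tensor, video_feats: torch.Tensor) -> torch.Tensor:
        # audio_feats: (batch, T_audio, d_model), e.g. projected log-mel frames
        # video_feats: (batch, T_video, d_model), e.g. lip-ROI spatio-temporal features
        aligned_video, _ = self.audio_to_video(query=audio_feats,
                                               key=video_feats,
                                               value=video_feats)
        aligned_audio, _ = self.video_to_audio(query=video_feats,
                                               key=audio_feats,
                                               value=audio_feats)
        audio_ctx = self.norm_a(audio_feats + aligned_video)  # audio enriched by lip cues
        video_ctx = self.norm_v(video_feats + aligned_audio)  # video enriched by audio cues

        # Bring the video-rate context to the audio frame rate before fusion;
        # linear interpolation here is an assumption, not taken from the paper.
        video_ctx = nn.functional.interpolate(
            video_ctx.transpose(1, 2), size=audio_ctx.shape[1],
            mode="linear", align_corners=False).transpose(1, 2)
        return self.fuse(torch.cat([audio_ctx, video_ctx], dim=-1))


if __name__ == "__main__":
    # Illustrative shapes only: audio at ~100 frames/s vs. video at ~25 frames/s.
    audio = torch.randn(2, 200, 256)
    video = torch.randn(2, 50, 256)
    fused = BidirectionalCrossAttentionFusion()(audio, video)
    print(fused.shape)  # torch.Size([2, 200, 256])
```

In this sketch the fused sequence stays at the audio frame rate, so it could feed a spectrogram-domain enhancement decoder such as the U-Net mentioned in the abstract; the actual SACAN fusion and decoding details are defined in the paper itself.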