IIANet: An Intra- and Inter-Modality Attention Network for Audio-Visual Speech Separation

Kai Li, Runxuan Yang, Fuchun Sun, Xiaolin Hu

ICML 2024



Abstract

Recent research has made significant progress in designing fusion modules for audio-visual speech separation. However, existing methods predominantly fuse auditory and visual features at a single temporal scale, without selective attention mechanisms, which is in sharp contrast with the brain. To address this, we propose a novel model called the intra- and inter-attention network (IIANet), which leverages attention mechanisms for efficient audio-visual feature fusion. IIANet consists of two types of attention blocks: intra-attention (IntraA) and inter-attention (InterA) blocks, where the InterA blocks are distributed at the top, middle, and bottom of IIANet. Inspired by the way the human brain selectively focuses on relevant content at various temporal scales, these blocks preserve the ability to learn modality-specific features while enabling the extraction of different semantics from audio-visual features. Comprehensive experiments on three standard audio-visual separation benchmarks (LRS2, LRS3, and VoxCeleb2) demonstrate the effectiveness of IIANet, which outperforms previous state-of-the-art methods while maintaining comparable inference time. In particular, the fast version of IIANet (IIANet-fast) requires only 7% of CTCNet's MACs and runs 40% faster than CTCNet on CPUs while achieving better separation quality, showing the great potential of attention mechanisms for efficient and effective multimodal fusion.
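To make the inter-modality fusion idea concrete, here is a minimal numpy sketch of cross-modal attention in which audio frames attend over (slower-rate) visual frames and the attended visual context is fused back into the audio stream. This is an illustrative simplification, not the exact InterA block of IIANet; the function name, feature shapes, and frame rates are assumptions for the example.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def inter_modality_attention(audio, visual):
    """Cross-modal attention sketch: audio queries, visual keys/values.

    audio:  (Ta, C) audio features, Ta frames with C channels
    visual: (Tv, C) visual features, Tv frames (typically Tv << Ta)
    Returns fused audio features of shape (Ta, C).
    """
    C = audio.shape[1]
    scores = audio @ visual.T / np.sqrt(C)  # (Ta, Tv) similarity scores
    attn = softmax(scores, axis=-1)         # each audio frame weights visual frames
    return audio + attn @ visual            # residual fusion of attended context

rng = np.random.default_rng(0)
audio = rng.standard_normal((80, 64))   # e.g. 80 audio frames, 64 channels
visual = rng.standard_normal((20, 64))  # e.g. 20 video frames at a lower rate
fused = inter_modality_attention(audio, visual)
print(fused.shape)  # (80, 64)
```

Because the attention weights are computed per audio frame, the two modalities can run at different frame rates, which is one reason attention is a natural fit for audio-visual fusion across temporal scales.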

Overview of IIANet

The architecture of IIANet's separation network

Comparison with other SOTA methods

We compare our method with other SOTA methods on the LRS2, LRS3, and VoxCeleb2 datasets. The results show that our method outperforms the other SOTA methods in terms of SI-SNRi, SDRi, and PESQ.
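For reference, SI-SNRi is the improvement in scale-invariant signal-to-noise ratio of the separated signal over the unprocessed mixture. The sketch below implements the standard SI-SNR definition (project the estimate onto the reference, then measure the residual); it illustrates the metric, not the exact evaluation code used in the paper.

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB (standard zero-mean convention)."""
    est = est - est.mean()
    ref = ref - ref.mean()
    # Project the estimate onto the reference signal.
    s_target = (np.dot(est, ref) / (np.dot(ref, ref) + eps)) * ref
    e_noise = est - s_target
    return 10 * np.log10((np.dot(s_target, s_target) + eps)
                         / (np.dot(e_noise, e_noise) + eps))

def si_snr_improvement(est, ref, mix):
    """SI-SNRi: gain of the estimate over the unprocessed mixture."""
    return si_snr(est, ref) - si_snr(mix, ref)

rng = np.random.default_rng(0)
ref = rng.standard_normal(16000)    # 1 s of "target speech" at 16 kHz
noise = rng.standard_normal(16000)  # interfering signal
mix = ref + noise                   # mixture fed to the separator
est = ref + 0.1 * noise             # a hypothetical good estimate
print(si_snr_improvement(est, ref, mix))  # positive: separation helped
```

The scale invariance means rescaling the estimate does not change the score, so the metric rewards separation quality rather than output gain.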

[Figure: quantitative comparison results]

Visualisation of different model results

We visualise the results of different models on the LRS2 dataset. The results show that our method can separate the target speaker's voice from the background speech more effectively.

[Figure: spectrogram visualisations]

Real-World Multi-Speaker Videos

We evaluated various audio-visual source separation (AVSS) methods, including VisualVoice and CTCNet, in real-world scenarios derived from multi-speaker videos collected from YouTube. The results show that IIANet outperforms the other separation models, generating higher-quality separated audio.

Demo 1

Original
IIANet SPK1 (Ours)
IIANet SPK2 (Ours)
VisualVoice SPK1
VisualVoice SPK2
CTCNet SPK1
CTCNet SPK2

Demo 2

Original
IIANet SPK1 (Ours)
IIANet SPK2 (Ours)
VisualVoice SPK1
VisualVoice SPK2
CTCNet SPK1
CTCNet SPK2

Demo 3

Original
IIANet SPK1 (Ours)
IIANet SPK2 (Ours)
VisualVoice SPK1
VisualVoice SPK2
CTCNet SPK1
CTCNet SPK2