A Fast and Lightweight Model for Causal Audio-Visual Speech Separation

Wendi Sang1,*, Kai Li2,*, Runxuan Yang2, Jianqiang Huang1,†, Xiaolin Hu2,†

1Qinghai University      2Tsinghua University

*Equal contribution      †Corresponding authors

ECAI 2025



Abstract

Audio-visual speech separation (AVSS) aims to extract a target speech signal from a mixed signal by leveraging both auditory and visual (lip movement) cues. However, most existing AVSS methods have complex architectures and rely on future context, operating offline, which renders them unsuitable for real-time applications. We propose a novel streaming AVSS model, named Swift-Net, which enhances the causal processing capabilities required for real-time applications. Swift-Net adopts a lightweight visual feature extraction module and an efficient fusion module for audio-visual integration. Additionally, Swift-Net employs grouped SRUs to integrate historical information across different feature spaces, thereby improving the utilization of historical information. We further propose a causal transformation template to facilitate the conversion of non-causal AVSS models into causal counterparts. Experiments on three standard benchmark datasets (LRS2, LRS3, and VoxCeleb2) demonstrate that, under causal conditions, Swift-Net achieves outstanding performance in cocktail-party scenarios, highlighting its potential for processing speech in such complex environments.
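The grouped-SRU idea can be illustrated with a short sketch. The PyTorch snippet below is our own illustrative reconstruction, not the authors' implementation: it implements the single-layer SRU recurrence (Lei et al., 2018), whose hidden state depends only on past frames and is therefore causal, plus a wrapper that splits the channel dimension into groups, each processed by its own SRU. All names (`MinimalSRU`, `GroupedSRU`, `num_groups`) are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class MinimalSRU(nn.Module):
    """Single-layer Simple Recurrent Unit (Lei et al., 2018); causal by construction."""
    def __init__(self, dim: int):
        super().__init__()
        # One matmul yields the candidate, forget-gate, and reset-gate pre-activations.
        self.proj = nn.Linear(dim, 3 * dim, bias=False)
        self.v_f = nn.Parameter(torch.zeros(dim))  # peephole weights, forget gate
        self.v_r = nn.Parameter(torch.zeros(dim))  # peephole weights, reset gate
        self.b_f = nn.Parameter(torch.zeros(dim))
        self.b_r = nn.Parameter(torch.zeros(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim); the recurrence only ever reads past states.
        B, T, D = x.shape
        cand, f_pre, r_pre = self.proj(x).chunk(3, dim=-1)
        c = x.new_zeros(B, D)
        outs = []
        for t in range(T):
            f = torch.sigmoid(f_pre[:, t] + self.v_f * c + self.b_f)
            r = torch.sigmoid(r_pre[:, t] + self.v_r * c + self.b_r)
            c = f * c + (1.0 - f) * cand[:, t]
            outs.append(r * c + (1.0 - r) * x[:, t])  # highway connection
        return torch.stack(outs, dim=1)

class GroupedSRU(nn.Module):
    """Splits channels into groups and runs an independent SRU per group."""
    def __init__(self, dim: int, num_groups: int = 4):
        super().__init__()
        assert dim % num_groups == 0
        self.num_groups = num_groups
        self.srus = nn.ModuleList(MinimalSRU(dim // num_groups) for _ in range(num_groups))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        chunks = x.chunk(self.num_groups, dim=-1)
        return torch.cat([sru(c) for sru, c in zip(self.srus, chunks)], dim=-1)

# Usage: a 4-group SRU over an 8-frame, 64-channel feature sequence.
feats = torch.randn(2, 8, 64)
print(GroupedSRU(64, num_groups=4)(feats).shape)  # torch.Size([2, 8, 64])
```

Grouping keeps each recurrence narrow, so the per-frame cost stays low while different channel groups can specialize to different feature subspaces.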

Overview of Swift-Net


The overall architecture of Swift-Net's separation network.


Figure 1: The structural diagram of the FTGSBlock.




Figure 2: The structural diagram of the LightVidBlock.




Figure 3: The structural diagram of the SAFBlock.

Comparison with Other Causal Models

We compare Swift-Net with other causal methods on the LRS2, LRS3, and VoxCeleb2 datasets. Swift-Net consistently outperforms them in terms of SI-SNRi and SDRi. All compared models follow causal design strategies to strictly ensure causality.
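One common causal design strategy is sketched below under our own assumptions; this illustrates left-padded convolution in general, not the paper's exact transformation template. Symmetric convolution padding is replaced with left-only padding so that the output at frame t never sees future frames, and a simple perturbation test verifies causality.

```python
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """Conv1d that pads only on the left, so output frame t sees frames <= t."""
    def __init__(self, channels: int, kernel_size: int, dilation: int = 1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation  # left-only padding amount
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); output has the same length as the input.
        return self.conv(nn.functional.pad(x, (self.pad, 0)))

# Causality check: changing future frames must not affect past outputs.
torch.manual_seed(0)
conv = CausalConv1d(channels=16, kernel_size=3, dilation=2).eval()
x = torch.randn(1, 16, 20)
x_future = x.clone()
x_future[..., 10:] = torch.randn_like(x_future[..., 10:])
with torch.no_grad():
    same_past = torch.allclose(conv(x)[..., :10], conv(x_future)[..., :10])
print(same_past)  # True: outputs before t=10 are unchanged
```

The same perturbation test applies to any layer of a converted model, which makes it a convenient unit test when turning a non-causal architecture into a causal one.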

[Figure: SI-SNRi and SDRi comparison with other causal methods on LRS2, LRS3, and VoxCeleb2]
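For reference, SI-SNRi is the improvement in scale-invariant SNR over the unprocessed mixture, and SDRi is the analogous improvement in signal-to-distortion ratio. Below is a minimal single-channel PyTorch sketch of SI-SNR and SI-SNRi; it is our own simplified version, not the official evaluation code.

```python
import torch

def si_snr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SNR in dB for 1-D signals (zero-meaned internally)."""
    est = est - est.mean()
    ref = ref - ref.mean()
    # Project the estimate onto the reference to remove any scaling.
    target = (torch.dot(est, ref) / (torch.dot(ref, ref) + eps)) * ref
    noise = est - target
    return 10 * torch.log10(target.pow(2).sum() / (noise.pow(2).sum() + eps))

def si_snri(est, ref, mix):
    """Improvement over using the raw mixture as the estimate."""
    return si_snr(est, ref) - si_snr(mix, ref)

# Toy usage with a synthetic mixture: the mixture baseline scores 0 dB by construction.
ref = torch.randn(16000)
mix = ref + 0.5 * torch.randn(16000)
print(f"SI-SNRi of the raw mixture: {si_snri(mix, ref, mix).item():.1f} dB")  # 0.0 dB
```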

Real-World Multi-Speaker Videos

We evaluated several audio-visual speech separation (AVSS) methods, including AV-TFGridNet and RTFSNet, on real-world multi-speaker YouTube videos. In these demos, Swift-Net produces cleaner separated speech than the other models.

Demo 1

Original
Swift-Net SPK1 (Ours)
Swift-Net SPK2 (Ours)
AV-TFGridNet SPK1
AV-TFGridNet SPK2
RTFSNet SPK1
RTFSNet SPK2

Demo 2

Original
Swift-Net SPK1 (Ours)
Swift-Net SPK2 (Ours)
AV-TFGridNet SPK1
AV-TFGridNet SPK2
RTFSNet SPK1
RTFSNet SPK2

Demo 3

Original
Swift-Net SPK1 (Ours)
Swift-Net SPK2 (Ours)
AV-TFGridNet SPK1
AV-TFGridNet SPK2
RTFSNet SPK1
RTFSNet SPK2

Demo 4

Original
Swift-Net SPK1 (Ours)
Swift-Net SPK2 (Ours)
AV-TFGridNet SPK1
AV-TFGridNet SPK2
RTFSNet SPK1
RTFSNet SPK2