¹Qinghai University  ²Tsinghua University  *Equal contribution  †Corresponding authors
ECAI 2025 |
Audio-visual speech separation (AVSS) aims to extract a target speech signal from a mixture by leveraging both auditory and visual (lip movement) cues. However, most existing AVSS methods have complex architectures and rely on future context, operating offline, which makes them unsuitable for real-time applications. We propose Swift-Net, a novel streaming AVSS model that strengthens the causal processing capabilities required for real-time use. Swift-Net adopts a lightweight visual feature extraction module and an efficient fusion module for audio-visual integration. In addition, Swift-Net employs grouped SRUs to integrate historical information across different feature spaces, improving how efficiently past context is used. We further propose a causal transformation template that facilitates converting non-causal AVSS models into causal counterparts. Experiments on three standard benchmark datasets (LRS2, LRS3, and VoxCeleb2) demonstrate that, under causal conditions, Swift-Net achieves outstanding performance in cocktail-party scenarios, highlighting its potential for processing speech in such complex acoustic environments.
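To make the grouped-recurrence idea concrete, the sketch below splits the feature dimension into groups and processes each group with its own unidirectional Simple Recurrent Unit (Lei et al., 2018), so every output frame depends only on current and past frames. This is a minimal PyTorch sketch under stated assumptions; the class names (`SRUCell`, `GroupedSRU`) and the group count are illustrative and do not reproduce the exact FTGSBlock implementation.

```python
# Minimal sketch of grouped causal recurrence: the channel dimension is split
# into groups, and each group keeps its own recurrent history. Names and
# hyperparameters are illustrative, not the paper's implementation.
import torch
import torch.nn as nn


class SRUCell(nn.Module):
    """Single-layer SRU (Lei et al., 2018), processed strictly left-to-right."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, 3 * dim)        # produces W x_t, W_f x_t, W_r x_t
        self.v_f = nn.Parameter(torch.zeros(dim))  # peephole weight for forget gate
        self.v_r = nn.Parameter(torch.zeros(dim))  # peephole weight for reset gate
        self.b_f = nn.Parameter(torch.zeros(dim))
        self.b_r = nn.Parameter(torch.zeros(dim))

    def forward(self, x):                          # x: (batch, time, dim)
        z, f_in, r_in = self.proj(x).chunk(3, dim=-1)
        c = torch.zeros_like(x[:, 0])              # internal state c_0 = 0
        outputs = []
        for t in range(x.size(1)):                 # causal: only past frames influence c
            f = torch.sigmoid(f_in[:, t] + self.v_f * c + self.b_f)
            r = torch.sigmoid(r_in[:, t] + self.v_r * c + self.b_r)
            c = f * c + (1.0 - f) * z[:, t]        # state update
            outputs.append(r * c + (1.0 - r) * x[:, t])  # highway-style output
        return torch.stack(outputs, dim=1)


class GroupedSRU(nn.Module):
    """Split channels into groups; each group summarizes history in its own subspace."""

    def __init__(self, dim: int, num_groups: int = 4):
        super().__init__()
        assert dim % num_groups == 0
        self.cells = nn.ModuleList(SRUCell(dim // num_groups) for _ in range(num_groups))

    def forward(self, x):                          # x: (batch, time, dim)
        chunks = x.chunk(len(self.cells), dim=-1)
        return torch.cat([cell(c) for cell, c in zip(self.cells, chunks)], dim=-1)
```

Because each group carries its own recurrent state, past context is summarized in several independent feature subspaces rather than in a single monolithic state, which is the motivation behind integrating historical information across different feature spaces.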
Figure 1: The structural diagram of FTGSBlock.
Figure 2: The structural diagram of LightVidBlock.
Figure 3: The structural diagram of SAFBlock.
We compare our approach with other methods on the LRS2, LRS3, and VoxCeleb2 datasets. The results show that our method consistently outperforms competing causal methods in terms of SI-SNRi and SDRi. All compared models are implemented following causal design strategies to strictly ensure causality.
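To make the causality requirement concrete, the sketch below shows one simple way to check that a model's past outputs do not depend on future inputs: perturb the input after a chosen time index and verify that all outputs before that index are unchanged. This is a minimal PyTorch sketch with a toy stand-in model; it is illustrative rather than the exact verification protocol used in our experiments.

```python
# Minimal causality check: change only the "future" part of the input and
# confirm the "past" part of the output stays identical. The toy model is a
# placeholder for any causal separation network under test.
import torch
import torch.nn as nn


def check_causality(model: nn.Module, batch: int = 2, time: int = 16000,
                    tol: float = 1e-6) -> bool:
    model.eval()
    x = torch.randn(batch, time)
    cut = time // 2
    x_perturbed = x.clone()
    x_perturbed[:, cut:] += torch.randn(batch, time - cut)  # perturb only future samples
    with torch.no_grad():
        y = model(x)
        y_perturbed = model(x_perturbed)
    # Outputs before the cut must not depend on inputs after the cut.
    return torch.allclose(y[:, :cut], y_perturbed[:, :cut], atol=tol)


class ToyCausal(nn.Module):
    """Toy causal model: each output sample depends only on current and past inputs."""

    def forward(self, x):
        return torch.cumsum(x, dim=-1) / 1000.0


if __name__ == "__main__":
    print("causal:", check_causality(ToyCausal()))  # expected: True
```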
We also evaluated multiple AVSS methods, including AV-TFGridNet and RTFSNet, on real-world multi-speaker YouTube videos. The results indicate that Swift-Net achieves superior separation quality compared with the other models.