TIGER: Time-frequency Interleaved Gain Extraction and Reconstruction for Efficient Speech Separation
arXiv 2024
Mohan Xu*
Kai Li*
Guo Chen
Xiaolin Hu
Tsinghua University
* These authors contributed equally
[arXiv 📝]
[code ⚙️]
[EchoSet 🖼️]

TIGER is a lightweight speech separation model that extracts key acoustic features through frequency band-splitting, multi-scale modeling, and full-frequency-frame modeling.


Abstract

In recent years, much speech separation research has focused primarily on improving model performance. However, for low-latency speech processing systems, computational efficiency is equally critical. In this paper, we propose a speech separation model with significantly reduced parameter count and computational cost: the Time-Frequency Interleaved Gain Extraction and Reconstruction Network (TIGER). TIGER leverages prior knowledge to divide the spectrum into frequency bands and compresses the frequency information. We employ a multi-scale selective attention (MSA) module to extract contextual features and introduce a full-frequency-frame attention (F³A) module to capture both temporal and frequency contextual information. Additionally, to evaluate speech separation models more realistically in complex acoustic environments, we introduce a novel dataset called EchoSet. This dataset includes noise and more realistic reverberation (e.g., accounting for object occlusions and material properties), with speech from two speakers overlapping at random proportions. Experimental results show that TIGER significantly outperforms the state-of-the-art (SOTA) model TF-GridNet on the EchoSet dataset in both inference speed and separation quality, while reducing the number of parameters by 94.3% and the MACs by 95.3%. These results indicate that, by using frequency band-split and interleaved modeling structures, TIGER achieves a substantial reduction in parameters and computational cost while maintaining high performance. Notably, TIGER is the first speech separation model with fewer than 1 million parameters that achieves performance close to the SOTA model.
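The abstract says TIGER divides the spectrum into frequency bands using prior knowledge, but it does not give the band layout. The snippet below is only a minimal sketch of the general band-split idea, assuming narrower bands at low frequencies and wider ones higher up. The function names `split_bands` and `group_frame`, the 321-bin spectrum, and the `PLAN` widths are all hypothetical choices for illustration, not the authors' configuration.

```python
def split_bands(n_bins, plan):
    """Return contiguous (start, end) bin ranges covering [0, n_bins).

    `plan` is a list of (band_width, max_bin) rules: bands of `band_width`
    bins are emitted while they fit below `max_bin`, then the next rule
    takes over. Any bins left at the end form one final band.
    """
    bands = []
    start = 0
    for width, max_bin in plan:
        limit = min(max_bin, n_bins)
        while start + width <= limit:
            bands.append((start, start + width))
            start += width
    if start < n_bins:
        bands.append((start, n_bins))
    return bands


def group_frame(frame, bands):
    """Slice one spectrogram frame (a list of per-bin values) into sub-bands."""
    return [frame[lo:hi] for lo, hi in bands]


# Hypothetical layout (illustrative numbers only): 321 frequency bins,
# 1-bin bands up to bin 40, 4-bin bands up to bin 120, 10-bin bands after.
N_BINS = 321
PLAN = [(1, 40), (4, 120), (10, 320)]

bands = split_bands(N_BINS, PLAN)
frame = list(range(N_BINS))          # stand-in for one frame of a spectrogram
sub_bands = group_frame(frame, bands)
```

With this made-up plan, the 321 bins are grouped into 81 sub-bands, so any later per-band processing operates on far fewer units along the frequency axis than per-bin processing would.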

Overall pipeline of the model architecture of TIGER and its modules.

Results on Speech Separation

Performance comparisons of TIGER and other existing separation models on Libri2Mix, LRS2-2Mix, and EchoSet. Bold marks the best result, and italics mark the second-best result.

Efficiency comparisons of TIGER and other models.
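To put the reductions quoted in the abstract (94.3% fewer parameters and 95.3% fewer MACs than TF-GridNet) in perspective, the short sketch below converts a percentage reduction into a shrink factor. The helper names are hypothetical, and only the two percentages come from this page.

```python
def remaining_fraction(reduction_percent):
    """Fraction of the baseline cost that is left after the reduction."""
    return 1.0 - reduction_percent / 100.0


def shrink_factor(reduction_percent):
    """How many times smaller the reduced cost is than the baseline."""
    return 1.0 / remaining_fraction(reduction_percent)


params_factor = shrink_factor(94.3)  # roughly 17.5x fewer parameters
macs_factor = shrink_factor(95.3)    # roughly 21.3x fewer MACs
```

In other words, the quoted reductions correspond to a model that keeps about 5.7% of the baseline's parameters and about 4.7% of its MACs.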

Results on Cinematic Sound Separation

Comparison of the performance and efficiency of cinematic sound separation models on DnR. ‘*’ marks results taken from the original DnR paper.

Speech Separation Demo

EchoSet Sample I

Audio clips: Mixture, Ground Truth, TF-GridNet, TIGER

EchoSet Sample II

Audio clips: Mixture, Ground Truth, TF-GridNet, TIGER

EchoSet Sample III

Audio clips: Mixture, Ground Truth, TF-GridNet, TIGER

Cinematic Sound Demo

Deadpool: Dialog, Music, Effect
The Wandering Earth: Dialog, Music, Effect
Captain America: Dialog, Music, Effect

Acknowledgements

The website template was borrowed from Colorful Image Colorization and Nerfies. Thank you (.❛ ᴗ ❛.).