Existing causal speech separation models exhibit a significant performance gap compared to non-causal models, owing to the difficulty of retaining historical information. To address this issue, we introduce a causal speech separation method, the Time-Frequency Attention Cache Memory (TFACM) model, which captures spatio-temporal relationships across the time and frequency dimensions using an LSTM layer and an attention mechanism. Specifically, an LSTM layer captures relative positions along the frequency dimension, while causal modeling in the time dimension uses both local and global representations, with a cache memory (CM) module introduced to store historical information. Additionally, a causal attention refinement (CAR) module optimizes the time-dimension representation, yielding finer-grained features. Experimental results on public datasets demonstrate that TFACM significantly outperforms existing methods in speech separation performance, showcasing its robustness in complex environments.
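The core idea of the cache memory module — letting each incoming frame attend only to itself and a bounded buffer of past frames — can be illustrated with a minimal single-head sketch. This is a hypothetical toy illustration, not the paper's implementation: the class name, projection shapes, and eviction policy are all assumptions made for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class CacheMemoryAttention:
    """Toy causal attention over a bounded cache of past frames.

    Hypothetical sketch of the cache-memory (CM) idea: each frame
    attends only to the cached history and itself, so no future frames
    are used (causality), and the fixed cache bounds memory cost.
    """

    def __init__(self, dim, cache_size):
        self.cache_size = cache_size
        rng = np.random.default_rng(0)
        # Single-head projections; the real model's shapes are unknown.
        self.Wq = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.Wk = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.Wv = rng.standard_normal((dim, dim)) / np.sqrt(dim)
        self.cache = []  # past frames, most recent last

    def step(self, x):
        # Attention context = cached history plus the current frame.
        ctx = np.stack(self.cache + [x])          # (t, dim)
        q = x @ self.Wq                           # (dim,)
        k = ctx @ self.Wk                         # (t, dim)
        v = ctx @ self.Wv                         # (t, dim)
        attn = softmax(k @ q / np.sqrt(len(q)))   # (t,) weights over history
        out = attn @ v                            # (dim,) refined frame
        # Update the cache, evicting the oldest frame when full.
        self.cache.append(x)
        if len(self.cache) > self.cache_size:
            self.cache.pop(0)
        return out
```

Processing a stream frame by frame with `step` keeps latency constant per frame, which is the property a causal separator needs; the bounded cache trades long-range history for fixed memory.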
[Audio demo table: separated outputs for SPK A and SPK B from Ground Truth, TFACM, TF-GridNet-Causal, DPRNN, SKiM, and ReSepFormer; audio players not reproducible in text.]
Acknowledgements