SonicSim: A customizable simulation platform
for speech models in moving sound source scenarios

Kai Li, Wendi Sang, Chang Zeng, Runxuan Yang, Guo Chen, Xiaolin Hu


Introduction

We introduce SonicSim, a synthetic toolkit designed to generate highly customizable data for moving sound sources. SonicSim is built on Habitat-sim, an embodied AI simulation platform, and supports parameter adjustment at three levels (scene, microphone, and source), which allows it to generate more diverse synthetic data. Using SonicSim, we constructed SonicSet, a benchmark dataset for moving sound sources, from the LibriSpeech dataset, the Freesound Dataset 50k (FSD50K), the Free Music Archive (FMA), and 90 scenes from Matterport3D. SonicSet is intended for evaluating speech separation and enhancement models.
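
The three adjustment levels named above can be pictured as a nested configuration. The following is a minimal sketch, not SonicSim's actual API: every class, field, and default value here (`SceneConfig`, `MicConfig`, `SourceConfig`, `SimulationConfig`, `path_length`, the 16 kHz sample rate, the scene ID, and the file name) is a hypothetical chosen only for illustration.

```python
import math
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, float, float]  # (x, y, z) position, assumed to be in metres


@dataclass
class SceneConfig:
    """Scene-level parameters (hypothetical fields)."""
    scene_id: str             # e.g. an identifier of a 3D scene
    sample_rate: int = 16000  # assumed default, not taken from the source


@dataclass
class MicConfig:
    """Microphone-level parameters (hypothetical fields)."""
    position: Point
    num_channels: int = 1


@dataclass
class SourceConfig:
    """Source-level parameters (hypothetical fields)."""
    audio_path: str
    trajectory: List[Point]   # waypoints the moving source passes through


@dataclass
class SimulationConfig:
    """Bundles the three levels into one simulation description."""
    scene: SceneConfig
    microphone: MicConfig
    sources: List[SourceConfig] = field(default_factory=list)


def path_length(trajectory: List[Point]) -> float:
    """Total straight-line distance travelled along the waypoints."""
    return sum(math.dist(a, b) for a, b in zip(trajectory, trajectory[1:]))


# Illustrative values only; none of these come from SonicSet.
config = SimulationConfig(
    scene=SceneConfig(scene_id="example_scene"),
    microphone=MicConfig(position=(0.0, 0.0, 1.5)),
    sources=[
        SourceConfig(
            audio_path="speech.flac",
            trajectory=[(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)],
        )
    ],
)
```

Keeping each level in its own structure means a scene can be swapped while the microphone and source settings stay fixed, which is one plausible way to vary data along a single axis at a time.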

SonicSim Platform

SonicSet Datasets

Overview of the SonicSet dataset
An automatic simulation pipeline for moving sound sources

Example

3D Room Map
Speech Movement Trajectory
Speech Movement Trajectory 1
Speech Movement Trajectory 2
Speech Movement Trajectory 3
Music Background
Noise Background

Leaderboard

We compare state-of-the-art (SOTA) methods on the SonicSet dataset.

Speech Separation (only two speakers, noise environment)

| Model | SI-SNR | SDR | NB-PESQ | WB-PESQ | STOI | MOS_NOISE | MOS_REVERB | MOS_SIG | MOS_OVRL | WER (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| Conv-TasNet | 4.81 | 7.13 | 2.00 | 1.46 | 0.73 | 2.45 | 3.04 | 2.30 | 2.10 | 53.82 |
| DPRNN | 4.87 | 6.65 | 2.17 | 1.63 | 0.77 | 2.54 | 3.28 | 2.47 | 2.11 | 47.81 |
| DPTNet | 11.51 | 13.00 | 2.82 | 2.35 | 0.87 | 3.00 | 3.15 | 2.68 | 2.32 | 28.13 |
| SuDoRM-RF | 8.01 | 9.70 | 2.47 | 1.98 | 0.81 | 2.95 | 3.26 | 2.63 | 2.25 | 35.61 |
| A-FRCNN | 9.17 | 10.63 | 2.70 | 2.16 | 0.84 | 2.98 | 3.24 | 2.72 | 2.32 | 35.44 |
| TDANet | 9.27 | 11.00 | 2.72 | 2.22 | 0.85 | 3.05 | 3.22 | 2.74 | 2.36 | 30.46 |
| SKIM | 7.23 | 8.78 | 2.34 | 1.86 | 0.79 | 2.65 | 3.23 | 2.47 | 2.11 | 38.92 |
| BSRNN | 9.10 | 10.86 | 2.82 | 2.26 | 0.85 | 2.93 | 3.11 | 2.84 | 2.45 | 29.86 |
| TF-GridNet | 15.38 | 16.81 | 3.58 | 3.08 | 0.93 | 3.11 | 3.10 | 2.91 | 2.49 | 12.04 |
| Mossformer | 14.72 | 15.97 | 3.02 | 2.67 | 0.91 | 3.11 | 3.24 | 2.76 | 2.39 | 21.10 |
| Mossformer2 | 14.84 | 16.09 | 3.17 | 2.83 | 0.91 | 3.20 | 3.21 | 2.78 | 2.40 | 19.51 |

Speech Separation (only two speakers, music environment)

| Model | SI-SNR | SDR | NB-PESQ | WB-PESQ | STOI | MOS_NOISE | MOS_REVERB | MOS_SIG | MOS_OVRL | WER (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| Conv-TasNet | 4.12 | 5.38 | 1.84 | 1.42 | 0.65 | 1.98 | 3.53 | 2.21 | 1.81 | 63.21 |
| DPRNN | 4.37 | 5.73 | 1.98 | 1.50 | 0.73 | 2.47 | 3.28 | 2.45 | 2.07 | 51.33 |
| DPTNet | 11.69 | 12.80 | 2.67 | 2.13 | 0.84 | 2.91 | 3.14 | 2.54 | 2.23 | 29.05 |
| SuDoRM-RF | 6.84 | 8.34 | 2.15 | 1.66 | 0.77 | 2.80 | 3.28 | 2.48 | 2.12 | 41.37 |
| A-FRCNN | 7.59 | 9.32 | 2.52 | 2.00 | 0.82 | 2.94 | 3.24 | 2.67 | 2.29 | 33.82 |
| TDANet | 7.00 | 8.68 | 2.26 | 1.71 | 0.79 | 2.71 | 3.25 | 2.58 | 2.19 | 37.16 |
| SKIM | 6.00 | 7.42 | 2.23 | 1.75 | 0.77 | 2.63 | 3.29 | 2.44 | 2.10 | 42.82 |
| BSRNN | 6.96 | 8.66 | 2.36 | 1.76 | 0.79 | 2.54 | 3.13 | 2.79 | 2.32 | 41.73 |
| TF-GridNet | 14.37 | 15.69 | 3.45 | 2.84 | 0.91 | 3.31 | 3.15 | 2.96 | 2.58 | 14.43 |
| Mossformer | 11.80 | 13.17 | 2.82 | 2.26 | 0.86 | 3.05 | 3.28 | 2.61 | 2.25 | 26.64 |
| Mossformer2 | 11.12 | 12.34 | 2.62 | 2.09 | 0.83 | 2.87 | 3.31 | 2.55 | 2.20 | 32.65 |

Speech Enhancement (noise environment)

| Model | SI-SNR | SDR | NB-PESQ | WB-PESQ | STOI | MOS_NOISE | MOS_REVERB | MOS_SIG | MOS_OVRL | WER (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| DCCRN | 8.41 | 11.29 | 2.81 | 2.17 | 0.87 | 2.94 | 3.01 | 2.80 | 2.39 | 21.78 |
| Fullband | 7.82 | 8.34 | 3.05 | 2.34 | 0.89 | 3.30 | 3.04 | 2.95 | 2.54 | 22.04 |
| FullSubNet | 9.48 | 11.92 | 3.19 | 2.48 | 0.90 | 3.24 | 3.05 | 2.98 | 2.54 | 20.01 |
| Fast-FullSubNet | 8.14 | 8.71 | 3.13 | 2.41 | 0.90 | 3.31 | 3.05 | 2.99 | 2.58 | 21.13 |
| FullSubNet+ | 8.93 | 11.07 | 3.06 | 2.35 | 0.89 | 3.12 | 2.97 | 2.91 | 2.47 | 20.73 |
| TaylorSENet | 10.11 | 12.67 | 3.07 | 2.45 | 0.89 | 2.72 | 3.01 | 2.65 | 2.22 | 21.61 |
| GaGNet | 10.01 | 12.78 | 3.12 | 2.48 | 0.89 | 2.77 | 3.05 | 2.64 | 2.23 | 21.40 |
| G2Net | 9.82 | 12.22 | 3.03 | 2.39 | 0.89 | 2.78 | 3.00 | 2.64 | 2.22 | 22.02 |
| Inter-SubNet | 10.34 | 12.87 | 3.32 | 2.61 | 0.91 | 3.39 | 3.10 | 3.05 | 2.62 | 18.83 |
| SudoRMRF | 11.28 | 13.35 | 2.75 | 2.20 | 0.87 | 3.64 | 2.88 | 2.80 | 1.88 | 93.54 |

Speech Enhancement (music environment)

| Model | SI-SNR | SDR | NB-PESQ | WB-PESQ | STOI | MOS_NOISE | MOS_REVERB | MOS_SIG | MOS_OVRL | WER (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| DCCRN | 11.56 | 11.98 | 2.72 | 2.00 | 0.85 | 3.30 | 3.51 | 2.94 | 2.59 | 25.13 |
| Fullband | 10.07 | 11.098 | 2.80 | 2.02 | 0.86 | 3.13 | 2.99 | 2.88 | 2.46 | 25.27 |
| FullSubNet | 11.60 | 12.31 | 3.10 | 2.22 | 0.88 | 3.34 | 3.08 | 3.05 | 2.63 | 20.82 |
| Fast-FullSubNet | 10.36 | 11.24 | 2.93 | 2.08 | 0.87 | 3.22 | 3.03 | 2.93 | 2.51 | 24.98 |
| FullSubNet+ | 10.64 | 11.50 | 2.80 | 1.99 | 0.86 | 3.02 | 2.93 | 2.82 | 2.38 | 24.11 |
| TaylorSENet | 12.18 | 13.04 | 3.06 | 2.33 | 0.88 | 2.76 | 2.92 | 2.65 | 2.24 | 23.46 |
| GaGNet | 12.20 | 13.17 | 2.95 | 2.27 | 0.87 | 2.78 | 2.86 | 2.64 | 2.21 | 23.36 |
| G2Net | 12.14 | 13.13 | 3.00 | 2.32 | 0.88 | 2.80 | 2.88 | 2.64 | 2.23 | 22.96 |
| Inter-SubNet | 12.07 | 13.01 | 3.15 | 2.28 | 0.88 | 3.34 | 3.11 | 3.04 | 2.64 | 20.07 |
| SudoRMRF | 12.99 | 13.86 | 2.61 | 2.01 | 0.85 | 3.91 | 2.80 | 2.98 | 1.93 | 88.72 |
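
For readers unfamiliar with the SI-SNR column, the sketch below shows the commonly used definition of scale-invariant signal-to-noise ratio in plain Python. It is an illustration of the standard formula, not the evaluation code behind the leaderboard; the function name `si_snr`, the zero-mean normalization step, and the `eps` stabilizer are assumptions, since the exact variant used for these tables is not stated here.

```python
import math
from typing import Sequence


def si_snr(estimate: Sequence[float], target: Sequence[float], eps: float = 1e-8) -> float:
    """Scale-invariant SNR in dB between an estimated and a reference signal."""
    if len(estimate) != len(target):
        raise ValueError("estimate and target must have the same length")
    n = len(target)

    # Remove the mean of each signal (a common convention, assumed here).
    mean_e = sum(estimate) / n
    mean_t = sum(target) / n
    e = [x - mean_e for x in estimate]
    t = [x - mean_t for x in target]

    # Project the estimate onto the target; the projection is the "signal" part.
    dot = sum(a * b for a, b in zip(e, t))
    target_energy = sum(b * b for b in t)
    scale = dot / (target_energy + eps)
    projection = [scale * b for b in t]

    # Whatever the projection does not explain is treated as noise.
    residual = [a - p for a, p in zip(e, projection)]

    signal_power = sum(p * p for p in projection)
    noise_power = sum(r * r for r in residual)
    return 10.0 * math.log10((signal_power + eps) / (noise_power + eps))


# Toy signals: the noise is orthogonal to the target and 10x smaller in amplitude.
target = [1.0, -1.0, 1.0, -1.0]
noise = [1.0, 1.0, -1.0, -1.0]
estimate = [s + 0.1 * v for s, v in zip(target, noise)]
score = si_snr(estimate, target)
```

Because the estimate is projected onto the target before the ratio is taken, rescaling the estimate by a constant leaves the score unchanged, which is why the metric is called scale-invariant. Higher values indicate a cleaner estimate.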

Efficiency Metrics

Speech Separation Models

| Model | Params (M) | MACs (G/s) | CPU Inference (1s, ms) | GPU Inference (1s, ms) | Inference GPU Memory (1s, MB) | Backward GPU (1s, ms) | Backward GPU Memory (1s, MB) |
|---|---|---|---|---|---|---|---|
| Conv-TasNet | 5.62 | 10.23 | 71.67 | 8.59 | 134.34 | 42.34 | 647.22 |
| DPRNN | 2.72 | 43.79 | 379.49 | 15.88 | 285.49 | 38.57 | 1757.00 |
| DPTNet | 2.80 | 53.37 | 481.37 | 20.04 | 20.67 | 58.28 | 3120.22 |
| SuDoRM-RF | 2.72 | 4.60 | 87.81 | 17.83 | 138.94 | 68.40 | 1058.76 |
| A-FRCNN | 6.13 | 81.20 | 102.22 | 36.19 | 157.20 | 128.40 | 1141.86 |
| TDANet | 2.33 | 9.13 | 169.47 | 32.88 | 145.56 | 89.62 | 3064.75 |
| SKIM | 5.92 | 21.92 | 245.98 | 10.54 | 273.07 | 38.62 | 1083.77 |
| BSRNN | 25.97 | 123.10 | 577.11 | 59.78 | 135.48 | 184.26 | 2349.62 |
| TF-GridNet | 14.43 | 525.68 | 1525.98 | 64.59 | 615.04 | 165.55 | 6687.60 |
| Mossformer | 42.10 | 85.54 | 473.74 | 49.71 | 163.68 | 153.84 | 4385.91 |
| Mossformer2 | 55.74 | 112.67 | 830.66 | 93.33 | 163.52 | 297.07 | 5617.39 |

Speech Enhancement Models

| Model | Params (M) | MACs (G/s) | CPU Inference (1s, ms) | GPU Inference (1s, ms) | Inference GPU Memory (1s, MB) | Backward GPU (1s, ms) | Backward GPU Memory (1s, MB) |
|---|---|---|---|---|---|---|---|
| DCCRN | 3.67 | 14.38 | 98.42 | 5.81 | 30.42 | 35.42 | 124.66 |
| Fullband | 6.05 | 0.39 | 5.98 | 1.99 | 23.01 | 10.21 | 73.39 |
| FullSubNet | 5.64 | 30.87 | 58.46 | 3.66 | 144.21 | 15.25 | 491.20 |
| Fast-FullSubNet | 6.84 | 4.14 | 12.33 | 4.63 | 26.75 | 20.12 | 111.45 |
| FullSubNet+ | 8.66 | 31.11 | 110.44 | 9.50 | 147.02 | 37.40 | 521.49 |
| TaylorSENet | 5.40 | 6.15 | 70.96 | 26.84 | 139.33 | 76.63 | 329.40 |
| GaGNet | 5.95 | 1.66 | 66.72 | 29.72 | 129.59 | 84.05 | 226.49 |
| G2Net | 7.39 | 2.85 | 98.29 | 47.56 | 130.33 | 162.51 | 291.98 |
| Inter-SubNet | 2.29 | 36.71 | 78.81 | 4.40 | 216.91 | 14.59 | 725.93 |
| SudoRMRF | 2.70 | 2.12 | 42.43 | 11.42 | 8.52 | 52.59 | 293.44 |
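
One way to read the inference-time columns is as a real-time factor (RTF): processing time divided by audio duration. The sketch below assumes that "1s" in the column headers means the timings were measured on 1 second of input audio, which the page does not state explicitly. The helper name `real_time_factor` is hypothetical, and the three timings are copied from the CPU Inference columns above.

```python
def real_time_factor(inference_ms: float, audio_seconds: float = 1.0) -> float:
    """Processing time divided by audio duration; below 1.0 means faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return (inference_ms / 1000.0) / audio_seconds


# CPU inference times in ms, taken from the efficiency tables above.
cpu_inference_ms = {
    "Conv-TasNet": 71.67,
    "TF-GridNet": 1525.98,
    "Fullband": 5.98,
}

for model, ms in cpu_inference_ms.items():
    rtf = real_time_factor(ms)
    verdict = "faster than real time" if rtf < 1.0 else "slower than real time"
    print(f"{model}: RTF = {rtf:.3f} ({verdict})")
```

Under this assumption, TF-GridNet's 1525.98 ms of CPU time per second of audio gives an RTF of about 1.53, so it would not keep up with a live stream on that CPU, while its 64.59 ms GPU time gives an RTF of about 0.06.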