End-to-end multimodal speech separation in time domain

Speech Samples

The model is evaluated on BBC videos from the Oxford-BBC Lip Reading Sentences 2 (LRS2) dataset [1].

Below are some examples of interactive multimodal speech separation. Hover the mouse over a speaker's lips to hear that speaker's separated speech.

Mixture input | Conv-TasNet | AV-Model-23mix | AV-Model-2spk | Ground truth

References

[1] Afouras T, Chung J S, Senior A, et al. Deep audio-visual speech recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.