Speech Separation for an Unknown Number of Speakers Using Transformers With Encoder-Decoder Attractors

Srikanth Raj Chetupalli and E. A. P. Habets

Submitted to INTERSPEECH 2022.

Abstract

Speaker-independent speech separation of single-channel mixtures with an unknown number of speakers, in the waveform domain, is considered in this paper. To deal with the unknown number of sources, we incorporate an encoder-decoder attractor (EDA) module [2] into a speech separation network. The neural network architecture consists of a trainable encoder-decoder pair and a masking network. The masking network in the proposed approach is inspired by the transformer-based SepFormer separation system [1] and contains a dual-path block and a triple-path block, each modelling both short-time and long-time dependencies in the signal. The EDA module first summarises the dual-path block output using an LSTM encoder and then generates one attractor vector per speaker in the mixture using an LSTM decoder. The attractors are combined with the dual-path block output to generate speaker channels, which are processed jointly by the triple-path block to predict the masks. Furthermore, a linear-sigmoid layer, with the attractors as input, predicts a binary output that serves as a stopping criterion for attractor generation. The proposed approach is evaluated on the WSJ0-mix datasets with mixtures of up to five speakers. State-of-the-art results are obtained in both separation quality and speaker counting for all mixture types.
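To make the attractor mechanism concrete, below is a minimal PyTorch sketch of an EDA module as described in the abstract: an LSTM encoder summarises the dual-path block output, an LSTM decoder seeded with the encoder's final state emits one attractor per step, and a linear-sigmoid head scores attractor existence. All names, dimensions, and the fixed maximum of five speakers are illustrative assumptions, not the paper's exact configuration.

    # Minimal EDA sketch (assumed names/dimensions, not the paper's exact setup).
    import torch
    import torch.nn as nn

    class EncoderDecoderAttractor(nn.Module):
        def __init__(self, feat_dim: int = 256, max_speakers: int = 5):
            super().__init__()
            self.encoder = nn.LSTM(feat_dim, feat_dim, batch_first=True)
            self.decoder = nn.LSTM(feat_dim, feat_dim, batch_first=True)
            self.existence = nn.Linear(feat_dim, 1)  # linear-sigmoid head
            self.max_speakers = max_speakers

        def forward(self, dual_path_out: torch.Tensor):
            # dual_path_out: (batch, frames, feat_dim), output of the dual-path block.
            # Summarise the sequence; the final hidden/cell state seeds the decoder.
            _, (h, c) = self.encoder(dual_path_out)
            batch, _, feat_dim = dual_path_out.shape
            # Feeding zero vectors step by step yields one attractor per speaker.
            zeros = dual_path_out.new_zeros(batch, self.max_speakers, feat_dim)
            attractors, _ = self.decoder(zeros, (h, c))  # (batch, max_speakers, feat_dim)
            # Existence probability per attractor, from the linear-sigmoid layer.
            probs = torch.sigmoid(self.existence(attractors)).squeeze(-1)
            return attractors, probs

    # Usage: eda = EncoderDecoderAttractor(); att, p = eda(torch.randn(1, 500, 256))

At inference time, attractors whose existence probability falls below a threshold (0.5 is the common choice in EDA-based diarization [2]) are discarded, so the number of speaker channels adapts to the mixture.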

Audio Examples

The audio examples below are outputs of the proposed architecture trained in three different ways:

  • SepEDA2: architecture trained on WSJ0-2mix.
  • SepEDA2/3: architecture trained on WSJ0-2mix and WSJ0-3mix together.
  • SepEDA[2-5]: architecture initialised with the SepEDA2 model and fine-tuned on the WSJ0-[2,3,4,5]mix datasets.

SepEDA2 represents the fixed two-speaker scenario, SepEDA2/3 represents the more practical scenario with up to three speakers, and SepEDA[2-5] represents the more general multi-speaker scenario with up to five speakers.

Note: Please use Chrome or Safari for optimal audio/video synchronisation. If you encounter playback problems, please reload the page.

SepEDA2 model

Two speaker mixture example

SepEDA2/3 model

Two speaker mixture example

Three speaker mixture example

SepEDA[2-5] model

Four speaker mixture example

Five speaker mixture example

References

[1] C. Subakan, M. Ravanelli, S. Cornell, M. Bronzi, and J. Zhong, "Attention is all you need in speech separation," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021.

[2] S. Horiguchi, Y. Fujita, S. Watanabe, Y. Xue, and K. Nagamatsu, "End-to-end speaker diarization for an unknown number of speakers with encoder-decoder based attractors," in Interspeech, 2020, pp. 269–273.

[3] Y. Luo, Z. Chen, and T. Yoshioka, "Dual-path RNN: Efficient long sequence modeling for time-domain single-channel speech separation," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 46–50.

[4] Y. Luo and N. Mesgarani, "Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256–1266, 2019.