Speech Separation for an Unknown Number of Speakers Using Transformers With Encoder-Decoder Attractors

Srikanth Raj Chetupalli and E. A. P. Habets

Submitted to INTERSPEECH 2022.

Abstract

Speaker-independent speech separation of single-channel mixtures with an unknown number of speakers, in the waveform domain, is considered in this paper. To deal with the unknown number of sources, we incorporate an encoder-decoder attractor (EDA) module [2] into a speech separation network. The neural network architecture consists of a trainable encoder-decoder pair and a masking network. The masking network in the proposed approach is inspired by the transformer-based SepFormer separation system [1] and contains a dual-path block and a triple-path block, each modelling both short-time and long-time dependencies in the signal. The EDA module first summarises the dual-path block output using an LSTM encoder and then generates one attractor vector per speaker in the mixture using an LSTM decoder. The attractors are combined with the dual-path block output to generate speaker channels, which are processed jointly by the triple-path block to predict the masks. Furthermore, a linear-sigmoid layer, with the attractors as input, predicts a binary output that serves as a stopping criterion for attractor generation. The proposed approach is evaluated on the WSJ0-mix datasets with mixtures of up to five speakers. State-of-the-art results are obtained in both separation quality and speaker counting for all mixture types.
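To make the attractor mechanism concrete, below is a minimal PyTorch sketch of an EDA module as described in the abstract: an LSTM encoder summarises the dual-path block output, an LSTM decoder seeded with the encoder's final state emits one attractor per step, and a linear-sigmoid head scores attractor existence. All names, dimensions, and the fixed maximum of five speakers are illustrative assumptions, not the paper's exact configuration.

    # Minimal EDA sketch (assumed names/dimensions, not the paper's exact setup).
    import torch
    import torch.nn as nn

    class EncoderDecoderAttractor(nn.Module):
        def __init__(self, feat_dim: int = 256, max_speakers: int = 5):
            super().__init__()
            self.encoder = nn.LSTM(feat_dim, feat_dim, batch_first=True)
            self.decoder = nn.LSTM(feat_dim, feat_dim, batch_first=True)
            self.existence = nn.Linear(feat_dim, 1)  # linear-sigmoid head
            self.max_speakers = max_speakers

        def forward(self, dual_path_out: torch.Tensor):
            # dual_path_out: (batch, frames, feat_dim), output of the dual-path block.
            # Summarise the sequence; the final hidden/cell state seeds the decoder.
            _, (h, c) = self.encoder(dual_path_out)
            batch, _, feat_dim = dual_path_out.shape
            # Feeding zero vectors step by step yields one attractor per speaker.
            zeros = dual_path_out.new_zeros(batch, self.max_speakers, feat_dim)
            attractors, _ = self.decoder(zeros, (h, c))  # (batch, max_speakers, feat_dim)
            # Existence probability per attractor, from the linear-sigmoid layer.
            probs = torch.sigmoid(self.existence(attractors)).squeeze(-1)
            return attractors, probs

    # Usage: eda = EncoderDecoderAttractor(); att, p = eda(torch.randn(1, 500, 256))

At inference time, attractors whose existence probability falls below a threshold (0.5 is the common choice in EDA-based diarization [2]) are discarded, so the number of speaker channels adapts to the mixture.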

Audio Examples

The audio examples below are outputs of the proposed architecture trained in three different ways:

  • SepEDA2: architecture trained on WSJ0-2mix.
  • SepEDA2/3: architecture trained on WSJ0-2mix and WSJ0-3mix together.
  • SepEDA[2-5]: architecture initialised with the SepEDA2 model and fine-tuned on the WSJ0-[2,3,4,5]mix datasets.

SepEDA2 represents the fixed two-speaker scenario, SepEDA2/3 represents the more practical scenario with up to three speakers, and SepEDA[2-5] represents the more general multi-speaker scenario with up to five speakers.

Note: Please use Chrome or Safari for optimal audio/video synchronisation. If you encounter playback problems, please reload the page.

SepEDA2 model

Two speaker mixture example

SepEDA2/3 model

Two speaker mixture example

Three speaker mixture example

SepEDA[2-5] model

Four speaker mixture example

Five speaker mixture example

References

[1] C. Subakan, M. Ravanelli, S. Cornell, M. Bronzi, and J. Zhong, "Attention is all you need in speech separation," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021.

[2] S. Horiguchi, Y. Fujita, S. Watanabe, Y. Xue, and K. Nagamatsu, "End-to-end speaker diarization for an unknown number of speakers with encoder-decoder based attractors," in Interspeech, 2020, pp. 269–273.

[3] Y. Luo, Z. Chen, and T. Yoshioka, "Dual-path RNN: Efficient long sequence modeling for time-domain single-channel speech separation," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 46–50.

[4] Y. Luo and N. Mesgarani, "Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256–1266, 2019.