Time-Frequency Masking Based Online Multi-Channel Speech Enhancement With Convolutional Recurrent Neural Networks

Soumitro Chakrabarty and Emanuël Habets

Published in the IEEE Journal of Selected Topics in Signal Processing, Vol. 13, Issue 4, pp. 787-799, Aug. 2019.

Abstract

The paper presents a time-frequency (T-F) masking based online multi-channel speech enhancement approach that uses a convolutional recurrent neural network (CRNN) to estimate the mask. The magnitude and phase components of the short-time Fourier transform (STFT) coefficients for multiple time frames are provided as input, so that the network can discriminate between the directional speech source and the noise components based on the spatial characteristics of the individual signals as well as their spectro-temporal structure. In contrast to most speech enhancement methods that utilize multi-channel data, the proposed method does not require information about the location of the desired speech source. The estimation of two different masks, the ideal ratio mask (IRM) and the ideal binary mask (IBM), is discussed, along with two different approaches for incorporating the mask to obtain the desired signal. In the first approach, the mask is applied directly as a real-valued gain to a reference microphone signal, whereas in the second approach, the mask is used as an activity indicator for the recursive update of the power spectral density (PSD) matrices used within a beamformer. The performance of the proposed system, with the two estimated masks used within the two enhancement approaches, is evaluated with both simulated and measured room impulse responses (RIRs). The results show that the IBM is better suited as an indicator for the PSD updates, while direct application of the IRM as a real-valued gain leads to a larger improvement in terms of short-time objective intelligibility (STOI). Analysis of the performance of the proposed system also demonstrates its robustness to different angular positions of the speech source.
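The two ways of using a mask described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the masks here are oracle masks computed from known speech and noise STFTs (in the paper they are estimated by the CRNN), the IRM and IBM formulas are common textbook definitions assumed for illustration, and the function names, the 0 dB threshold, and the smoothing factor `alpha` are hypothetical. The beamformer that would consume the PSD matrices is not shown.

```python
import numpy as np

def ideal_ratio_mask(speech_stft, noise_stft, eps=1e-12):
    """Assumed IRM definition: sqrt of speech power over total power, in [0, 1]."""
    s_pow = np.abs(speech_stft) ** 2
    n_pow = np.abs(noise_stft) ** 2
    return np.sqrt(s_pow / (s_pow + n_pow + eps))

def ideal_binary_mask(speech_stft, noise_stft, threshold_db=0.0, eps=1e-12):
    """Assumed IBM definition: 1 where the local SNR exceeds threshold_db, else 0."""
    s_pow = np.abs(speech_stft) ** 2
    n_pow = np.abs(noise_stft) ** 2
    local_snr_db = 10.0 * np.log10((s_pow + eps) / (n_pow + eps))
    return (local_snr_db > threshold_db).astype(float)

def apply_mask_as_gain(mask, reference_stft):
    """First approach: real-valued gain applied to the reference microphone STFT."""
    return mask * reference_stft

def update_psd(psd, frame, activity, alpha=0.9):
    """Second approach: one recursive PSD update, gated per frequency bin.

    psd:      (n_freq, n_mics, n_mics) current PSD matrices
    frame:    (n_freq, n_mics) multi-channel STFT coefficients of one time frame
    activity: (n_freq,) values in [0, 1]; 0 leaves the PSD matrix unchanged
    alpha:    hypothetical smoothing factor of the recursive average
    """
    inst = frame[:, :, None] * frame[:, None, :].conj()  # outer product y y^H
    weight = activity[:, None, None] * (1.0 - alpha)
    return (1.0 - weight) * psd + weight * inst

# Toy data standing in for real STFTs: (n_freq, n_frames, n_mics).
rng = np.random.default_rng(0)
n_freq, n_frames, n_mics = 5, 4, 3
shape = (n_freq, n_frames, n_mics)
speech = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
noise = rng.standard_normal(shape) + 1j * rng.standard_normal(shape)
mixture = speech + noise

# Masks are computed at the reference microphone (index 0 here).
irm = ideal_ratio_mask(speech[..., 0], noise[..., 0])
ibm = ideal_binary_mask(speech[..., 0], noise[..., 0])

# Approach 1: direct masking of the reference microphone signal.
enhanced = apply_mask_as_gain(irm, mixture[..., 0])

# Approach 2: the mask selects which bins update the speech or noise PSD.
speech_psd = np.zeros((n_freq, n_mics, n_mics), dtype=complex)
noise_psd = np.zeros((n_freq, n_mics, n_mics), dtype=complex)
for t in range(n_frames):
    frame = mixture[:, t, :]
    speech_psd = update_psd(speech_psd, frame, ibm[:, t])
    noise_psd = update_psd(noise_psd, frame, 1.0 - ibm[:, t])
```

With a binary mask, each T-F bin updates exactly one of the two PSD matrices, which gives some intuition for why a hard activity decision can suit the PSD updates, whereas a soft mask is a natural fit when used directly as a gain.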

Sound Example - Simulated Setup

To highlight how the performance of the methods differs across scenarios, this section presents audio examples for the simulated setup. Both examples are from the same room.

Experimental Setup:

  • Room size: 9 m x 7 m x 3 m

  • Source-microphone distance: 1.7 m

  • Reverberation time: 0.7 s

  • Source DOA: 10 degrees

  • Babble noise with iSNR = 0 dB

  • Microphone self-noise with iSNR = 20 dB
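As a rough illustration of the two noise levels listed above, the sketch below scales a noise signal so that it reaches a target input SNR (iSNR) relative to the speech and then forms the mixture. It assumes that iSNR is the ratio of average speech power to average noise power over the whole signal at one microphone; how the levels were actually set for these examples is not stated here. The white-noise signals and the helper names `scale_noise_to_snr` and `snr_db` are placeholders, not part of the original setup.

```python
import numpy as np

def scale_noise_to_snr(speech, noise, target_snr_db):
    """Scale `noise` so that the speech-to-noise power ratio equals target_snr_db."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    target_noise_power = speech_power / (10.0 ** (target_snr_db / 10.0))
    return noise * np.sqrt(target_noise_power / noise_power)

def snr_db(signal, noise):
    """Power ratio between `signal` and `noise` in dB."""
    return 10.0 * np.log10(np.mean(signal ** 2) / np.mean(noise ** 2))

# Placeholder signals: white noise stands in for the reverberant speech,
# the babble noise, and the microphone self-noise.
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)
babble = rng.standard_normal(16000)
self_noise = rng.standard_normal(16000)

babble_scaled = scale_noise_to_snr(speech, babble, 0.0)        # iSNR = 0 dB
self_noise_scaled = scale_noise_to_snr(speech, self_noise, 20.0)  # iSNR = 20 dB

mixture = speech + babble_scaled + self_noise_scaled
```

At 0 dB the babble noise carries as much power as the speech, while the self-noise at 20 dB carries one hundredth of it, so the babble dominates the degradation in this scenario.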

Mask based beamformers

Methods with direct application of mask