W. Mack, S. Deng and E. A. P. Habets
Acoustic parameters such as the direct-to-reverberation ratio (DRR) can be used in audio processing algorithms, e.g., for dereverberation, or in audio augmented reality. Often, the DRR is not available and has to be estimated blindly from recorded audio signals. State-of-the-art DRR estimation is achieved by deep neural networks (DNNs), which directly map a feature representation of the acquired signals to the DRR. Motivated by the equality of the signal-to-reverberation ratio and the (channel-based) DRR under certain conditions, we formulate single-channel DRR estimation as the task of extracting two signal components from the recorded audio. The DRR is then obtained by inserting the estimated signals into the definition of the DRR. The extraction is performed using time-frequency masks. The masks are estimated by a DNN trained end-to-end to minimize the mean-squared error between the estimated and the oracle DRR. We conduct experiments with different pre-processing and mask estimation schemes. The proposed method outperforms state-of-the-art single- and multi-channel methods on the ACE challenge data corpus.
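The last step described above, turning two masked signal components into a DRR value, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `drr_from_masks`, the constant example masks, and the choice to compute the energies directly in the STFT domain are all assumptions. In the paper, the masks come from a trained DNN, which is not reproduced here.

```python
import numpy as np


def drr_from_masks(mix_stft, direct_mask, reverb_mask, eps=1e-12):
    """Return a DRR estimate in dB from a mixture STFT and two masks.

    mix_stft:     complex array (n_bins, n_frames), STFT of the recording
    direct_mask:  real array, same shape, mask for the direct component
    reverb_mask:  real array, same shape, mask for the reverberant component
    All names are hypothetical; eps only guards against division by zero.
    """
    direct = direct_mask * mix_stft   # estimated direct component
    reverb = reverb_mask * mix_stft   # estimated reverberant component
    e_direct = np.sum(np.abs(direct) ** 2)
    e_reverb = np.sum(np.abs(reverb) ** 2)
    return 10.0 * np.log10((e_direct + eps) / (e_reverb + eps))


# Stand-in data: a random "STFT" and constant masks (assumed, not learned).
rng = np.random.default_rng(0)
mix = rng.standard_normal((257, 100)) + 1j * rng.standard_normal((257, 100))
direct_mask = np.full(mix.shape, 0.8)
reverb_mask = np.full(mix.shape, 0.4)

drr_db = drr_from_masks(mix, direct_mask, reverb_mask)
print(f"Estimated DRR: {drr_db:.2f} dB")
```

With constant masks the energy ratio reduces to (0.8 / 0.4)^2 = 4, so the sketch returns about 6.02 dB regardless of the input, which makes it easy to check by hand.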
S. Deng, W. Mack and E. A. P. Habets
The reverberation time, T60, is an important acoustic parameter in speech and acoustic signal processing. Often, the T60 is unknown and blind estimation from a single-channel measurement is required. State-of-the-art T60 estimation is achieved by a convolutional neural network (CNN), which maps a feature representation of the speech to the T60. The temporal input length of the CNN is fixed. Time-varying scenarios, e.g., robot audition, require continuous online T60 estimation, which is computationally expensive with the CNN. We propose to use a convolutional recurrent neural network (CRNN) for blind T60 estimation, as it combines the parameter efficiency of CNNs with the online estimation capability of recurrent neural networks and, in contrast to CNNs, can process time sequences of variable length. We evaluated the proposed CRNN on the "Acoustic Characterization of Environments Challenge" dataset for different input lengths. Our proposed method outperforms the state-of-the-art CNN approach even for shorter inputs, at the cost of more trainable parameters.
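To illustrate why a convolutional front end followed by a recurrence can process inputs of any length and emit an estimate at every frame, here is a toy sketch in NumPy. It is not the architecture from the paper: the function name `toy_crnn`, the layer sizes, the plain tanh recurrence (instead of, e.g., an LSTM or GRU), and the random untrained weights are all assumptions made for illustration, so the outputs are not meaningful T60 values.

```python
import numpy as np


def toy_crnn(features, conv_w, w_in, w_rec, w_out):
    """Run a toy causal CRNN and return one output per input frame.

    features: (n_frames, n_bins) feature sequence of arbitrary length
    conv_w:   (k, n_bins, n_channels) causal convolution over k frames
    w_in:     (n_channels, n_hidden) input weights of the recurrence
    w_rec:    (n_hidden, n_hidden) recurrent weights
    w_out:    (n_hidden,) linear read-out to a scalar estimate
    """
    n_frames, n_bins = features.shape
    k = conv_w.shape[0]
    # Zero-pad the past so that frame t only sees frames t-k+1 .. t.
    padded = np.vstack([np.zeros((k - 1, n_bins)), features])
    h = np.zeros(w_rec.shape[0])  # recurrent state carried across frames
    estimates = []
    for t in range(n_frames):
        window = padded[t:t + k]
        conv_out = np.maximum(0.0, np.einsum("kb,kbc->c", window, conv_w))
        h = np.tanh(conv_out @ w_in + h @ w_rec)
        estimates.append(float(h @ w_out))  # online estimate at frame t
    return np.array(estimates)


# Random, untrained weights (assumed sizes) and two inputs of different length.
rng = np.random.default_rng(0)
n_bins, k, n_channels, n_hidden = 20, 3, 8, 16
conv_w = 0.1 * rng.standard_normal((k, n_bins, n_channels))
w_in = 0.1 * rng.standard_normal((n_channels, n_hidden))
w_rec = 0.1 * rng.standard_normal((n_hidden, n_hidden))
w_out = 0.1 * rng.standard_normal(n_hidden)

long_input = rng.standard_normal((120, n_bins))
short_input = long_input[:50]

out_long = toy_crnn(long_input, conv_w, w_in, w_rec, w_out)
out_short = toy_crnn(short_input, conv_w, w_in, w_rec, w_out)
print(out_long.shape, out_short.shape)
```

The same weights handle both the 50-frame and the 120-frame input, and because the processing is causal, the first 50 outputs for the long input match the outputs for the short one. A fixed-length CNN would instead have to be re-run on a sliding window to produce a comparable stream of estimates.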
[1] W. Mack, S. Deng, and E. A. P. Habets, "Single-Channel Blind Direct-to-Reverberation Ratio Estimation Using Masking," Proc. Interspeech, pp. 5066-5070, 2020.
[2] S. Deng, W. Mack, and E. A. P. Habets, "Online Blind Reverberation Time Estimation Using CRNNs," Proc. Interspeech, pp. 5061-506, 2020.