Better Together: Voice Activity Detection and Dialogue Separation for Audio Personalization in TV

M. Torcoli and E. A. P. Habets

Abstract

In TV services, dialogue-level personalization is key to meeting user preferences and needs. When dialogue and background sounds are not separately available from the production stage, Dialogue Separation (DS) can estimate them to enable personalization. DS was shown to provide clear benefits for the end user. Still, the estimated signals are not perfect, and some leakage can be introduced. This is undesired, especially during passages without dialogue. We propose to combine DS and Voice Activity Detection (VAD), both recently proposed for TV audio. When their combination suggests dialogue inactivity, background components leaking into the dialogue estimate are reassigned to the background estimate. A clear improvement in audio quality is shown for dialogue-free signals, without performance drops when dialogue is active. A post-processed VAD estimate with improved detection accuracy is also generated. It is concluded that DS and VAD can improve each other and are better used together.
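
As a rough illustration of the reassignment step described above, the following Python sketch applies a frame-wise rule: where the dialogue-activity decision indicates no dialogue, the content of the dialogue estimate is treated as background leakage and moved to the background estimate. The function name, the framing, and the 0.5 threshold are illustrative assumptions; the paper's actual combination logic is not reproduced here.

```python
import numpy as np

def reassign_leakage(dialogue_est, background_est, vad_prob, threshold=0.5):
    """Frame-wise leakage reassignment (illustrative sketch, not the paper's exact rule).

    dialogue_est, background_est : (num_frames, frame_len) DS output signals.
    vad_prob : (num_frames,) dialogue-activity probabilities from the VAD.
    threshold : hypothetical decision threshold (assumed, not from the paper).
    """
    inactive = vad_prob < threshold          # frames judged dialogue-free
    dialogue_out = dialogue_est.copy()
    background_out = background_est.copy()
    # In dialogue-free frames, the dialogue estimate contains only leakage:
    # add it back to the background estimate and silence the dialogue track.
    background_out[inactive] += dialogue_est[inactive]
    dialogue_out[inactive] = 0.0
    # The thresholded decision doubles as a post-processed VAD estimate.
    return dialogue_out, background_out, ~inactive
```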

Audio Examples

Two test scenarios are considered. The first contains music and effects only (MUSFX). The second contains dialogue mixed over a musical background (MIX).

The goal of both the baseline DS and the proposed method is to attenuate the non-dialogue parts by a constant 15 dB.
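
For reference, a constant 15 dB attenuation corresponds to a linear gain of 10^(-15/20) ≈ 0.178 applied to the background estimate before remixing it with the dialogue estimate. A minimal sketch, with all names assumed:

```python
import numpy as np

def remix(dialogue_est, background_est, attenuation_db=15.0):
    """Remix DS outputs with a constant background attenuation (sketch)."""
    gain = 10.0 ** (-attenuation_db / 20.0)  # 15 dB -> ~0.178 linear gain
    return dialogue_est + gain * background_est
```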

MUSFX: Music and effects only

Baseline DS introduces some distortions although no dialogue is present. The proposed method significantly improves the perceived quality: the music sounds as it does in the input signal, and the only difference is the overall level.

MIX: Dialogue over music

Baseline DS successfully attenuates the background music, so the dialogue can be followed more easily. As desired, the proposed method introduces no perceptible difference with respect to baseline DS.

(Audio excerpt taken from this interview about MPEG-H Dialog+)