Ünal Ege Gaznepoğlu and Nils Peters
Accepted for publication at ICASSP 2025.
Disentanglement-based speaker anonymization involves decomposing speech into a semantically meaningful representation, altering the speaker embedding, and resynthesizing a waveform using a neural vocoder. State-of-the-art systems of this kind are known to remove emotion information. Possible reasons include mode collapse in GAN-based vocoders, unintended modeling and modification of emotions through speaker embeddings, or excessive sanitization of the intermediate representation (IR). In this paper, we conduct a comprehensive evaluation of a state-of-the-art speaker anonymization system to understand the underlying causes. We conclude that the main reason is the lack of emotion-related information in the IR. The speaker embeddings also have a high impact, if they are learned in a generative context. The vocoder’s out-of-distribution performance has a smaller impact. Additionally, we discovered that synthesis artifacts increase spectral kurtosis, biasing emotion recognition evaluation towards classifying utterances as angry. Therefore, we conclude that reporting unweighted average recall alone for emotion recognition performance is suboptimal.
Note: In this figure, the original IEMOCAP utterances exhibit an unusual spectral kurtosis distribution, where angry samples have the lowest median kurtosis and sad samples have the highest, contrary to the literature. However, the reconstructions of experiment None-T1 demonstrate the expected ordering (from lowest to highest: sad, neutral, happy, angry). We interpret this as a sign of IEMOCAP's noisy nature interfering with the acoustic features.
We computed acoustic features that correlate well with emotions for each IEMOCAP_test and Libri_test utterance. While we visualize the aggregated summaries as a violin plot (see Fig. 4 in the paper), we think that observing the behavior for a few utterances might provide additional insights. Please find listening samples and acoustic feature visualizations (over time) for four utterances from each dataset below.
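For illustration, the snippet below is a minimal sketch (not necessarily our exact extraction pipeline) of how a per-frame spectral kurtosis curve can be computed and plotted over time for a single utterance, using librosa and scipy; the file name is a placeholder.

```python
import librosa
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import kurtosis

# Placeholder file name; any mono 16 kHz WAV works.
wav, sr = librosa.load("utterance.wav", sr=16000)

# Magnitude spectrogram: one kurtosis value per STFT frame,
# computed over the frequency bins of that frame.
S = np.abs(librosa.stft(wav, n_fft=512, hop_length=160))
spec_kurtosis = kurtosis(S, axis=0, fisher=True)

times = librosa.frames_to_time(np.arange(S.shape[1]), sr=sr, hop_length=160)
plt.plot(times, spec_kurtosis)
plt.xlabel("Time (s)")
plt.ylabel("Spectral kurtosis")
plt.show()
```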
We used Silero VAD (https://github.com/snakers4/silero-vad) to detect speech-active segments. Even though we tuned its parameters (min_speech_duration_ms=100, threshold=0.35), the VAD could not identify any regions with speech activity for some utterances.
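For reference, the following is a minimal sketch of how Silero VAD can be invoked with these parameters via torch.hub; the file name is a placeholder and the snippet is not necessarily the exact script we ran.

```python
import torch

# Load the pre-trained Silero VAD model and its helper utilities from torch.hub.
model, utils = torch.hub.load('snakers4/silero-vad', 'silero_vad')
get_speech_timestamps, _, read_audio, _, _ = utils

# Placeholder file name; Silero VAD expects mono audio at 16 kHz (or 8 kHz).
wav = read_audio('utterance.wav', sampling_rate=16000)

# Parameters tuned as described above.
speech_timestamps = get_speech_timestamps(
    wav, model,
    sampling_rate=16000,
    threshold=0.35,
    min_speech_duration_ms=100,
)
print(speech_timestamps)  # empty list if no speech-active region was found
```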
Description: Mild background noise, intended speaker sighing
Description: Prominent background noise, intended speaker sighing and breathing in
Description: Mild background noise, intended speaker breathing in
Description: Prominent background noise, intended speaker sighing
Description: Interfering speaker (male), intended speaker shouting 'Marry you again, huh!' with a high pitch
Description: Interfering speaker, intended speaker shouting 'Swine!'
Description: Interfering speaker, intended speaker laughing
Description: Prominent background noise, intended speaker sighing
Description: Prominent background noise, intended speaker sighing
Description: Interfering speaker (male), intended speaker saying 'Look!' in a very high pitch
Description: Prominent background noise, intended speaker whispering 'Yeah'
Description: Interfering speaker at the beginning, intended speaker laughing
Description: Mild background noise, intended speaker chuckling
Description: Prominent background noise, intended speaker lightly whispering 'Or not'
Description: Prominent background noise, intended speaker saying 'Duh' very lightly
Description: Prominent background noise, audible interfering speaker (female), intended speaker saying 'Ssh'
@inproceedings{2025_gaznepoglu_why,
author = {Ünal Ege Gaznepoğlu and Nils Peters},
booktitle = {Proc. {IEEE} Intl. Conf. on Acoustics, Speech and Signal Processing ({ICASSP})},
title = {Why disentanglement-based speaker anonymization systems fail at preserving emotions?},
keywords = {speaker anonymization, neural vocoders, speaker embeddings, speech foundation models},
year = {2025}}