Why disentanglement-based speaker anonymization systems fail at preserving emotions?

Ünal Ege Gaznepoğlu and Nils Peters

Accepted for publication at ICASSP 2025.

Abstract

Disentanglement-based speaker anonymization involves decomposing speech into a semantically meaningful representation, altering the speaker embedding, and resynthesizing a waveform using a neural vocoder. State-of-the-art systems of this kind are known to remove emotion information. Possible reasons include mode collapse in GAN-based vocoders, unintended modeling and modification of emotions through speaker embeddings, or excessive sanitization of the intermediate representation (IR). In this paper, we conduct a comprehensive evaluation of a state-of-the-art speaker anonymization system to understand the underlying causes. We conclude that the main reason is the lack of emotion-related information in the IR. The speaker embeddings also have a high impact if they are learned in a generative context. The vocoder’s out-of-distribution performance has a smaller impact. Additionally, we discovered that synthesis artifacts increase spectral kurtosis, biasing emotion recognition evaluation towards classifying utterances as angry. Therefore, we conclude that reporting unweighted average recall alone for emotion recognition performance is suboptimal.

Supplementary material: IEMOCAP-test acoustic feature breakdown per emotion class (click to enlarge)

Note: In this figure, the original IEMOCAP utterances have an unusual spectral kurtosis distribution: angry samples have the lowest median kurtosis and sad samples the highest, contrary to the literature. However, the reconstructions of experiment None-T1 demonstrate the correct ordering (from lowest to highest: sad, neutral, happy, angry). We interpret this as a sign of IEMOCAP's noisy nature interfering with the acoustic features.
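As a reference for how the feature discussed above can be obtained: one common way to estimate spectral kurtosis is to take the kurtosis of each short-time magnitude spectrum and average over frames. The sketch below is a minimal illustration of this idea only; the paper's exact feature extraction pipeline may differ, and the frame length, hop size, and window are our own arbitrary choices.

```python
import numpy as np
from scipy.stats import kurtosis

def spectral_kurtosis(x, n_fft=512, hop=256):
    """Average Pearson kurtosis of the short-time magnitude spectra of x."""
    # Slice the signal into overlapping, Hann-windowed frames
    frames = [x[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(x) - n_fft, hop)]
    # Magnitude spectrum of each frame
    mags = np.abs(np.fft.rfft(frames, axis=1))
    # Kurtosis across frequency bins, averaged over frames
    return float(np.mean(kurtosis(mags, axis=1, fisher=False)))
```

Intuitively, a signal whose energy is concentrated in few frequency bins (e.g. a pure tone, or tonal synthesis artifacts) yields a peaky magnitude spectrum and hence high kurtosis, while broadband noise yields a flatter spectrum and low kurtosis.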

cover_distributions_detailed

Supplementary material: randomly selected sound samples and corresponding acoustic feature visualizations

We computed acoustic features that correlate well with emotions for each IEMOCAP_test and Libri_test utterance. While we visualize the aggregated summaries as violin plots (see Fig. 4 in the paper), observing the behavior of a few individual utterances may provide additional insights. Below, please find listening samples and acoustic feature visualizations (over time) for four utterances from each dataset.

Acoustic features for libri-test | IEMOCAP-test samples (click to enlarge)

cover_samples_libri cover_samples

IEMOCAP-test listening samples

Ses02F_script03_1_F022.wav

Ses02F_script03_2_F016.wav

Ses03M_script02_1_M026.wav

Ses05M_impro03_F013.wav

Libri-test listening samples

237-126133-0022.wav

5105-28241-0018.wav

6930-81414-0017.wav

908-157963-0020.wav

Supplementary material: IEMOCAP_test utterances with issues

We used Silero VAD (https://github.com/snakers4/silero-vad) to detect speech-active segments. Even though we tuned its parameters (min_speech_duration_ms=100, threshold=0.35), the VAD could not identify any regions with speech activity in some utterances.
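To build intuition for why such utterances are hard, the sketch below implements a simple energy-based detector. This is deliberately not Silero VAD (which is a neural model scoring speech-likeness, not just energy); the function name and thresholds are our own assumptions. The analogy is that sighs, breaths, and whispers sit close to the noise floor, so few or no frames clear an activity threshold.

```python
import numpy as np

def energy_vad(x, sr=16000, frame_ms=30, threshold_db=-35.0):
    """Return (start, end) sample indices of contiguous regions whose
    per-frame RMS energy exceeds threshold_db relative to the peak."""
    n = int(sr * frame_ms / 1000)
    peak = np.max(np.abs(x)) + 1e-12
    segs, active, start = [], False, 0
    for i in range(0, len(x) - n + 1, n):
        rms = np.sqrt(np.mean(x[i:i + n] ** 2))
        rms_db = 20 * np.log10(rms / peak + 1e-12)
        if rms_db > threshold_db and not active:
            active, start = True, i       # segment begins
        elif rms_db <= threshold_db and active:
            active = False
            segs.append((start, i))       # segment ends
    if active:
        segs.append((start, len(x)))
    return segs
```

On a clean utterance with one loud region this recovers a single segment; on a recording that is mostly background noise with faint breathy speech, the energy contrast collapses and the detector (like a tuned neural VAD) can come up empty.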

Ses05F_impro06_F015.wav

Description: Mild background noise, intended speaker sighing

Ses01F_script02_2_M043.wav

Description: Prominent background noise, intended speaker sighing and breathing in

Ses02M_impro02_F003.wav

Description: Mild background noise, intended speaker breathing in

Ses05F_impro06_F027.wav

Description: Prominent background noise, intended speaker sighing

Ses04M_script03_2_F050.wav

Description: Interfering speaker (male), intended speaker shouting 'Marry you again, huh!' with a high pitch

Ses04M_script03_2_F052.wav

Description: Interfering speaker, intended speaker shouting 'Swine!'

Ses04F_impro03_F040.wav

Description: Interfering speaker, intended speaker laughing

Ses05F_impro06_F024.wav

Description: Prominent background noise, intended speaker sighing

Ses05F_impro06_F004.wav

Description: Prominent background noise, intended speaker sighing

Ses04F_impro03_F002.wav

Description: Interfering speaker (Male), intended speaker saying 'Look!' in a very high pitch

Ses02M_impro06_F018.wav

Description: Prominent background noise, intended speaker whispering 'Yeah'

Ses04F_impro07_F065.wav

Description: Interfering speaker at the beginning, intended speaker laughing

Ses04M_impro04_F017.wav

Description: Mild background noise, intended speaker chuckling

Ses05M_script02_2_F038.wav

Description: Prominent background noise, intended speaker lightly whispering 'Or not'

Ses01M_impro06_M027.wav

Description: Prominent background noise, intended speaker saying 'Duh' very lightly.

Ses01M_script02_2_M052.wav

Description: Prominent background noise, audible interfering speaker (Female), intended speaker saying 'Ssh'.

Paper (click to enlarge)

cover

@inproceedings{2025_gaznepoglu_why,
author = {Ünal Ege Gaznepoğlu and Nils Peters},
booktitle = {Proc. {IEEE} Intl. Conf. on Acoustics, Speech and Signal Processing ({ICASSP})},
title = {Why disentanglement-based speaker anonymization systems fail at preserving emotions?},
keywords = {speaker anonymization, neural vocoders, speaker embeddings, speech foundation models},
year = {2025}}