Kishor Kayyar Lakshminarayana, Frank Zalkow, Christian Dittmar, Nicola Pia, Emanuël A.P. Habets
Yet to be published
Neural text-to-speech (TTS) systems, such as Tacotron [2], FastSpeech, and ForwardTacotron [1], have enabled the synthesis of highly natural speech. A key feature of TTS systems is speaker adaptation: synthesizing speech with the voice characteristics of a target speaker, such as timbre, pitch, and prosody. Traditional approaches to speaker adaptation require extensive speaker-specific data and full retraining, making them resource-intensive. Recent methods address these limitations by enabling faster adaptation without full retraining. They fall into two categories: fine-tuning-based methods, which adapt models using small amounts of speaker data, and zero-shot methods, which generalize to unseen speakers without additional training, using only a single reference sample at inference time. This work compares these two approaches to target speaker adaptation across several metrics. We show that fine-tuning-based methods achieve speaker adaptation quality comparable to zero-shot methods, at slightly lower naturalness but significantly lower computational complexity.
Below, we compare different state-of-the-art TTS methods across several parameters. The encoder here refers to the text-to-token model, and the decoder to the token-to-speech model. The real-time factor (RTF) was measured on an NVIDIA RTX 2080 GPU with 12 GB memory as the ratio of generated speech duration to synthesis time, so higher is better. Training times of the zero-shot models are those reported by the original authors using A100 (80 GB) GPUs, whereas the AdapterMix and ForwardTacotron times were measured on RTX 2080 (12 GB) GPUs; using A100s for the latter would slightly reduce training times. HierSpeech++ was trained using multiple GPUs (e.g., eight A6000 GPUs for LibriTTS); the minimum training time tabulated here is an estimate assuming four A100 GPUs for four days.
| Model | Type | Encoder Params (M) | Decoder Params (M) | Training Data (hrs) | Inference Speed (RTF ↑) | Training Time (GPU hrs ↓) |
|---|---|---|---|---|---|---|
| HierSpeech++ [3] | Zero-shot | 224 | 105 | 2800 | 7.07 | >400 |
| MaskGCT [4] | Zero-shot | 695 | 353 | 100K | 0.057 | >800 |
| FireRedTTS [5] | Zero-shot | 400 | 112 | 250K | <0.001 | NA |
| F5-TTS [6] | Zero-shot | 336 | 14 | 100K | 0.58 | >800 |
| AdapterMix [7] | Fine-tuning | 144 | 53 | 45 | 18.83 | 224 + 8 |
| ForwardTacotron [8] | Traditional | 46 | 10 | 60 | 43 | 36 |
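As a concrete illustration of how the RTF values above are defined, the following sketch times a synthesis call and computes the duration-to-time ratio. The `dummy_tts` stand-in and its sample rate are hypothetical placeholders; the actual measurements used the models in the table on an RTX 2080 GPU.

```python
import time

def measure_rtf(synthesize, text, sample_rate):
    """Real-time factor as generated speech duration / synthesis time.

    Higher is better: RTF > 1 means synthesis runs faster than real time.
    `synthesize` is any TTS callable returning raw audio samples.
    """
    start = time.perf_counter()
    waveform = synthesize(text)  # e.g., a list or array of audio samples
    elapsed = time.perf_counter() - start
    duration = len(waveform) / sample_rate
    return duration / elapsed

# Toy stand-in: "synthesis" that sleeps 0.1 s and returns 1 s of silence at 16 kHz,
# so the resulting RTF is roughly 10.
def dummy_tts(text):
    time.sleep(0.1)
    return [0.0] * 16000

rtf = measure_rtf(dummy_tts, "Hello world.", sample_rate=16000)
```

With this convention, an RTF above 1 (e.g., ForwardTacotron's 43) indicates faster-than-real-time synthesis.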
Audio Examples
Here, we share a few samples synthesized using the models listed above for two target speakers, along with the corresponding text. AdapterMix was fine-tuned using twenty minutes of target speaker data. The ForwardTacotron versions utilized all available data for the target speaker: 7 hours for 'Ian' and 23 hours for '92'.
Both sentences were synthesized for each speaker:

1. "There are no particular "rainy" and "dry" seasons: the amount of rain stays roughly the same throughout the year."
2. "When returning home after living abroad, you've adapted to the new culture and lost some of your habits from your home culture."

| Model | Ian (M), Sent. 1 | Ian (M), Sent. 2 | 92 (F), Sent. 1 | 92 (F), Sent. 2 |
|---|---|---|---|---|
| HierSpeech++ | (audio) | (audio) | (audio) | (audio) |
| MaskGCT | (audio) | (audio) | (audio) | (audio) |
| FireRedTTS | (audio) | (audio) | (audio) | (audio) |
| F5-TTS | (audio) | (audio) | (audio) | (audio) |
| AdapterMix | (audio) | (audio) | (audio) | (audio) |
| ForwardTacotron Retrain [8] | (audio) | (audio) | (audio) | (audio) |
| ForwardTacotron Finetune | (audio) | (audio) | (audio) | (audio) |
We employ noise augmentation during fine-tuning to address low-resource data scenarios, similar to our prior work [9]. Additionally, we freeze selected parameters during fine-tuning, thereby avoiding the computation of their gradients and updates, which reduces computational cost and improves parameter efficiency. Some audio samples synthesized using models trained with these approaches are provided below.
| Fine-tuning Subset | Ian (M) | 92 (F) |
|---|---|---|
| Text | "Each temple had an open temple courtyard and then an inner sanctuary that only the priests could enter." | "For example, one might say that the motor car necessarily leads to the development of roads." |
| 7200s | (audio) | (audio) |
| 7200s + NoiseAug. | (audio) | (audio) |
| 7200s + NoiseAug. (Postnet-1L)¹ | (audio) | (audio) |
| 1200s | (audio) | (audio) |
| 1200s + 5 × NoiseAug. | (audio) | (audio) |
¹Parameter-efficient version with only 13.34M trainable parameters, whereas the other models had 23.03M trainable parameters.
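As a simplified illustration of the noise augmentation mentioned above, the sketch below adds white Gaussian noise to a waveform at a target signal-to-noise ratio (SNR). This is an assumption for illustration only; the actual augmentation pipeline of [9] may differ (noise type, level randomization, placement in training, etc.).

```python
import numpy as np

def add_noise_at_snr(speech, snr_db, rng=None):
    """Add white Gaussian noise to a waveform at a target SNR in dB.

    The noise is scaled so that
    10 * log10(speech_power / scaled_noise_power) == snr_db.
    """
    rng = np.random.default_rng(rng)
    noise = rng.standard_normal(len(speech))
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example: one second of a 100 Hz tone at 16 kHz, augmented at 20 dB SNR.
t = np.arange(16000) / 16000
clean = 0.5 * np.sin(2 * np.pi * 100 * t)
noisy = add_noise_at_snr(clean, snr_db=20.0, rng=0)
```

During low-resource fine-tuning, such augmented copies of the target-speaker data can be mixed with the clean recordings to increase the effective amount of training material.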
@misc{Schaefer20_ForwardTacotron_Github,
  author = {Christian Sch{\"a}fer and Ollie McCarthy and contributors},
  title = {{ForwardTacotron}},
  howpublished = {\url{https://github.com/as-ideas/ForwardTacotron}},
  year = {2020}
}
@inproceedings{shen_natural_2018,
title = {Natural {TTS} synthesis by conditioning wavenet on mel spectrogram predictions},
author = {Jonathan Shen and Ruoming Pang and Ron J. Weiss and Mike Schuster and Navdeep Jaitly and Zongheng Yang and Zhifeng Chen and Yu Zhang and Yuxuan Wang and R. J. Skerry-Ryan and Rif A. Saurous and Yannis Agiomyrgiannakis and Yonghui Wu},
booktitle = {Proc. {IEEE} Intl. Conf. on Acoustics, Speech and Signal Processing
(ICASSP)},
year = 2018,
pages = {4779--4783},
isbn = {978-1-5386-4658-8},
doi = {10.1109/ICASSP.2018.8461368},
}
@article{lee2025_itnnls_hierspeech++,
author={Lee, Sang-Hoon and Choi, Ha-Yeong and Kim, Seung-Bin and Lee, Seong-Whan},
journal={IEEE Transactions on Neural Networks and Learning Systems},
title={HierSpeech++: Bridging the Gap Between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-Shot Speech Synthesis},
year={2025},
volume={36},
number={10},
pages={18422-18436},
doi={10.1109/TNNLS.2025.3584944}
}
@article{wang2024_arxiv_maskgct,
title={MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer},
author={Wang, Yuancheng and Zhan, Haoyue and Liu, Liwei and Zeng, Ruihong and Guo, Haotian and Zheng, Jiachen and Zhang, Qiang and Zhang, Xueyao and Zhang, Shunsi and Wu, Zhizheng},
journal={arXiv preprint arXiv:2409.00750},
year={2024}
}
@article{guo2025_arxiv_fireredtts,
title={{FireRedTTS-1S}: An upgraded streamable foundation text-to-speech system},
author={Guo, Hao-Han and Hu, Yao and Shen, Fei-Yu and Tang, Xu and Wu, Yi-Chen and Xie, Feng-Long and Xie, Kun},
journal={arXiv preprint arXiv:2503.20499},
year={2025}
}
@article{chen2024_arxiv_f5tts,
title={{F5-TTS}: A fairytaler that fakes fluent and faithful speech with flow matching},
author={Chen, Yushen and Niu, Zhikang and Ma, Ziyang and Deng, Keqi and Wang, Chunhui and Zhao, Jian and Yu, Kai and Chen, Xie},
journal={arXiv preprint arXiv:2410.06885},
year={2024}
}
@inproceedings{mehrish23_interspeech_adaptermix,
author={Ambuj Mehrish and Abhinav {Ramesh Kashyap} and Li Yingting and Navonil Majumder and Soujanya Poria},
title={{AdapterMix: Exploring the efficacy of mixture of adapters for low-resource TTS adaptation}},
year=2023,
booktitle={Proc. Interspeech},
pages={4284--4288},
doi={10.21437/Interspeech.2023-1568}
}
@inproceedings{ZalkowEtAl23_AudioLabs_Blizzard,
address = {Grenoble, France},
author = {Frank Zalkow and Paolo Sani and Michael Fast and Judith Bauer and Mohammad Joshaghani and Kishor Kayyar and Emanu{\"e}l A. P. Habets and Christian Dittmar},
booktitle = {Proceedings of the Blizzard Challenge Workshop},
pages = {63--68},
title = {The {AudioLabs} System for the {B}lizzard {C}hallenge 2023},
year = {2023}
}
@inproceedings{kayyar25_icassp_lowres,
author={Kishor Kayyar and Frank Zalkow and Christian Dittmar and Nicola Pia and Emanu{\"e}l A. P. Habets},
booktitle={Proc. {IEEE} Intl. Conf. on Acoustics, Speech and Signal Processing
(ICASSP)},
title={Low-Resource Text-to-Speech Synthesis Using Noise-Augmented Training of ForwardTacotron},
year={2025},
pages={1--5},
doi={10.1109/ICASSP49660.2025.10890686}
}