Kishor Kayyar Lakshminarayana, Frank Zalkow, Christian Dittmar, Nicola Pia, Emanuël A.P. Habets
Yet to be published
Neural text-to-speech (TTS) systems, such as Tacotron [2], FastSpeech, and ForwardTacotron [1], have enabled the synthesis of highly natural speech. A key feature of TTS systems is speaker adaptation: synthesizing speech with the voice characteristics of a target speaker, such as timbre, pitch, and prosody. Traditional approaches to speaker adaptation require extensive speaker-specific data and full retraining, making them resource-intensive. Recent methods address these limitations by enabling faster adaptation without full retraining. They fall into two categories: fine-tuning-based methods, which adapt models using small amounts of speaker data, and zero-shot methods, which generalize to unseen speakers without additional training, using only a single reference sample at inference time. This work compares these two approaches to target speaker adaptation across several metrics. We show that fine-tuning-based methods achieve speaker adaptation quality comparable to zero-shot methods, at slightly lower naturalness but significantly lower computational complexity.
Below, we compare different state-of-the-art TTS methods across several parameters. The encoder here refers to the text-to-token model, and the decoder to the token-to-speech model. The real-time factor (RTF) was measured on an NVIDIA RTX 2080 GPU with 12 GB memory as the ratio of generated speech duration to synthesis time, so higher is better. Training times of the zero-shot models are those reported by the original authors using A100 (80 GB) GPUs, whereas the AdapterMix and ForwardTacotron times were measured on RTX 2080 (12 GB) GPUs; using A100s for the latter would slightly reduce training times. HierSpeech++ was trained using multiple GPUs (e.g., eight A6000 GPUs for LibriTTS); the minimum training time tabulated here is an estimate assuming four A100 GPUs for four days.
| Model | Type | Encoder Params (M) | Decoder Params (M) | Training Data (hrs) | Inference Speed (RTF ↑) | Training Time (GPU hrs ↓) |
|---|---|---|---|---|---|---|
| HierSpeech++ [3] | Zero-shot | 224 | 105 | 2800 | 7.07 | >400 |
| MaskGCT [4] | Zero-shot | 695 | 353 | 100K | 0.057 | >800 |
| FireRedTTS [5] | Zero-shot | 400 | 112 | 250K | <0.001 | NA |
| F5-TTS [6] | Zero-shot | 336 | 14 | 100K | 0.58 | >800 |
| AdapterMix [7] | Fine-tuning | 144 | 53 | 45 | 18.83 | 224 + 8 |
| ForwardTacotron [8] | Traditional | 46 | 10 | 60 | 43 | 36 |
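As a concrete illustration of how the RTF values above are defined, the following sketch times a synthesis call and computes the duration-to-time ratio. The `dummy_tts` stand-in and its sample rate are hypothetical placeholders; the actual measurements used the models in the table on an RTX 2080 GPU.

```python
import time

def measure_rtf(synthesize, text, sample_rate):
    """Real-time factor as generated speech duration / synthesis time.

    Higher is better: RTF > 1 means synthesis runs faster than real time.
    `synthesize` is any TTS callable returning raw audio samples.
    """
    start = time.perf_counter()
    waveform = synthesize(text)  # e.g., a list or array of audio samples
    elapsed = time.perf_counter() - start
    duration = len(waveform) / sample_rate
    return duration / elapsed

# Toy stand-in: "synthesis" that sleeps 0.1 s and returns 1 s of silence at 16 kHz,
# so the resulting RTF is roughly 10.
def dummy_tts(text):
    time.sleep(0.1)
    return [0.0] * 16000

rtf = measure_rtf(dummy_tts, "Hello world.", sample_rate=16000)
```

With this convention, an RTF above 1 (e.g., ForwardTacotron's 43) indicates faster-than-real-time synthesis.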
Audio Examples
Here, we share a few samples synthesized using the models listed above for two target speakers, along with the corresponding text. AdapterMix was fine-tuned using twenty minutes of target speaker data. The ForwardTacotron versions utilized all available data for the target speaker: 7 hours for 'Ian' and 23 hours for '92'.
Both sentences were synthesized for each speaker:

1. "There are no particular "rainy" and "dry" seasons: the amount of rain stays roughly the same throughout the year."
2. "When returning home after living abroad, you've adapted to the new culture and lost some of your habits from your home culture."

| Model | Ian (M), Sent. 1 | Ian (M), Sent. 2 | 92 (F), Sent. 1 | 92 (F), Sent. 2 |
|---|---|---|---|---|
| HierSpeech++ | (audio) | (audio) | (audio) | (audio) |
| MaskGCT | (audio) | (audio) | (audio) | (audio) |
| FireRedTTS | (audio) | (audio) | (audio) | (audio) |
| F5-TTS | (audio) | (audio) | (audio) | (audio) |
| AdapterMix | (audio) | (audio) | (audio) | (audio) |
| ForwardTacotron Retrain [8] | (audio) | (audio) | (audio) | (audio) |
| ForwardTacotron Finetune | (audio) | (audio) | (audio) | (audio) |
We employ noise augmentation during fine-tuning to address low-resource data scenarios, similar to our prior work [9]. Additionally, we freeze selected parameters during fine-tuning, thereby avoiding the computation of their gradients and updates, which reduces computational cost and improves parameter efficiency. Some audio samples synthesized using models trained with these approaches are provided below.
| Fine-tuning Subset | Ian (M) | 92 (F) |
|---|---|---|
| Text | "Each temple had an open temple courtyard and then an inner sanctuary that only the priests could enter." | "For example, one might say that the motor car necessarily leads to the development of roads." |
| 7200s | (audio) | (audio) |
| 7200s + NoiseAug. | (audio) | (audio) |
| 7200s + NoiseAug. (Postnet-1L)¹ | (audio) | (audio) |
| 1200s | (audio) | (audio) |
| 1200s + 5 × NoiseAug. | (audio) | (audio) |
¹Parameter-efficient version with only 13.34M trainable parameters, whereas the other models had 23.03M trainable parameters.
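As a simplified illustration of the noise augmentation mentioned above, the sketch below adds white Gaussian noise to a waveform at a target signal-to-noise ratio (SNR). This is an assumption for illustration only; the actual augmentation pipeline of [9] may differ (noise type, level randomization, placement in training, etc.).

```python
import numpy as np

def add_noise_at_snr(speech, snr_db, rng=None):
    """Add white Gaussian noise to a waveform at a target SNR in dB.

    The noise is scaled so that
    10 * log10(speech_power / scaled_noise_power) == snr_db.
    """
    rng = np.random.default_rng(rng)
    noise = rng.standard_normal(len(speech))
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Example: one second of a 100 Hz tone at 16 kHz, augmented at 20 dB SNR.
t = np.arange(16000) / 16000
clean = 0.5 * np.sin(2 * np.pi * 100 * t)
noisy = add_noise_at_snr(clean, snr_db=20.0, rng=0)
```

During low-resource fine-tuning, such augmented copies of the target-speaker data can be mixed with the clean recordings to increase the effective amount of training material.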
@misc{Schaefer20_ForwardTacotron_Github,
  author = {Christian Sch{\"a}fer and Ollie McCarthy and contributors},
  title = {{ForwardTacotron}},
  howpublished = {\url{https://github.com/as-ideas/ForwardTacotron}},
  year = {2020}
}
@inproceedings{shen_natural_2018,
title = {Natural {TTS} synthesis by conditioning wavenet on mel spectrogram predictions},
author = {Jonathan Shen and Ruoming Pang and Ron J. Weiss and Mike Schuster and Navdeep Jaitly and Zongheng Yang and Zhifeng Chen and Yu Zhang and Yuxuan Wang and R. J. Skerry-Ryan and Rif A. Saurous and Yannis Agiomyrgiannakis and Yonghui Wu},
booktitle = {Proc. {IEEE} Intl. Conf. on Acoustics, Speech and Signal Processing
(ICASSP)},
year = 2018,
pages = {4779--4783},
isbn = {978-1-5386-4658-8},
doi = {10.1109/ICASSP.2018.8461368},
}
@article{lee2025_itnnls_hierspeech++,
author={Lee, Sang-Hoon and Choi, Ha-Yeong and Kim, Seung-Bin and Lee, Seong-Whan},
journal={IEEE Transactions on Neural Networks and Learning Systems},
title={HierSpeech++: Bridging the Gap Between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-Shot Speech Synthesis},
year={2025},
volume={36},
number={10},
pages={18422-18436},
doi={10.1109/TNNLS.2025.3584944}
}
@article{wang2024_arxiv_maskgct,
title={MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer},
author={Wang, Yuancheng and Zhan, Haoyue and Liu, Liwei and Zeng, Ruihong and Guo, Haotian and Zheng, Jiachen and Zhang, Qiang and Zhang, Xueyao and Zhang, Shunsi and Wu, Zhizheng},
journal={arXiv preprint arXiv:2409.00750},
year={2024}
}
@article{guo2025_arxiv_fireredtts,
title={{FireRedTTS-1S}: An upgraded streamable foundation text-to-speech system},
author={Guo, Hao-Han and Hu, Yao and Shen, Fei-Yu and Tang, Xu and Wu, Yi-Chen and Xie, Feng-Long and Xie, Kun},
journal={arXiv preprint arXiv:2503.20499},
year={2025}
}
@article{chen2024_arxiv_f5tts,
title={{F5-TTS}: A fairytaler that fakes fluent and faithful speech with flow matching},
author={Chen, Yushen and Niu, Zhikang and Ma, Ziyang and Deng, Keqi and Wang, Chunhui and Zhao, Jian and Yu, Kai and Chen, Xie},
journal={arXiv preprint arXiv:2410.06885},
year={2024}
}
@inproceedings{mehrish23_interspeech_adaptermix,
author={Ambuj Mehrish and Abhinav {Ramesh Kashyap} and Li Yingting and Navonil Majumder and Soujanya Poria},
title={{AdapterMix: Exploring the efficacy of mixture of adapters for low-resource TTS adaptation}},
year=2023,
booktitle={Proc. Interspeech},
pages={4284--4288},
doi={10.21437/Interspeech.2023-1568}
}
@inproceedings{ZalkowEtAl23_AudioLabs_Blizzard,
address = {Grenoble, France},
author = {Frank Zalkow and Paolo Sani and Michael Fast and Judith Bauer and Mohammad Joshaghani and Kishor Kayyar and Emanu{\"e}l A. P. Habets and Christian Dittmar},
booktitle = {Proceedings of the Blizzard Challenge Workshop},
pages = {63--68},
title = {The {AudioLabs} System for the {B}lizzard {C}hallenge 2023},
year = {2023}
}
@inproceedings{kayyar25_icassp_lowres,
author={Kishor Kayyar and Frank Zalkow and Christian Dittmar and Nicola Pia and Emanu{\"e}l A. P. Habets},
booktitle={Proc. {IEEE} Intl. Conf. on Acoustics, Speech and Signal Processing
(ICASSP)},
title={Low-Resource Text-to-Speech Synthesis Using Noise-Augmented Training of ForwardTacotron},
year={2025},
pages={1--5},
doi={10.1109/ICASSP49660.2025.10890686}
}