Target Speaker Adaptation in Text-to-Speech Synthesis: A Comparison of Efficient Fine-Tuning and Zero-Shot Methods

Kishor Kayyar Lakshminarayana, Frank Zalkow, Christian Dittmar, Nicola Pia, Emanuël A.P. Habets

Yet to be published

Abstract

Neural text-to-speech (TTS) systems, such as Tacotron [2], FastSpeech, and ForwardTacotron [1], have enabled the synthesis of highly natural speech. A key feature of TTS systems is speaker adaptation, which allows synthesizing speech with the voice characteristics of a target speaker, such as timbre, pitch, and prosody. Traditional approaches to this require extensive speaker-specific data and full retraining, making them resource-intensive. Recent methods address these limitations by enabling faster adaptation without full retraining. They fall into two categories: fine-tuning-based methods, which adapt models using small amounts of speaker data, and zero-shot methods, which generalize to unseen speakers without additional training, using only a single reference sample at inference. This work compares these two families of target speaker adaptation methods across different metrics. We show that fine-tuning-based methods achieve speaker adaptation quality similar to zero-shot methods at slightly lower naturalness but significantly lower computational complexity.

Comparison of Neural TTS Methods

We compare state-of-the-art TTS methods across several parameters below. The encoder refers to the text-to-token model and the decoder to the token-to-speech model. The real-time factor (RTF) was measured on an NVIDIA RTX 2080 GPU with 12 GB memory as the ratio of generated speech duration to synthesis time, so higher values mean faster synthesis. Training times of the zero-shot models are taken from the original authors, who used A100 (80 GB) GPUs, whereas the AdapterMix and ForwardTacotron times were measured on RTX 2080 GPUs (12 GB); using A100s for the latter would slightly reduce their training times. HierSpeech++ was trained using multiple GPUs (e.g., eight A6000 GPUs for LibriTTS); the minimum estimated training time tabulated here assumes four A100 GPUs for four days.

| Model | Type | Encoder Params (M) | Decoder Params (M) | Training Data (hrs) | Inference Speed (RTF ↑) | Training Time (GPU hrs ↓) |
|---|---|---|---|---|---|---|
| HierSpeech++ [3] | Zero-shot | 224 | 105 | 2800 | 7.07 | >400 |
| MaskGCT [4] | Zero-shot | 695 | 353 | 100K | 0.057 | >800 |
| FireRedTTS [5] | Zero-shot | 400 | 112 | 250K | <0.001 | NA |
| F5-TTS [6] | Zero-shot | 336 | 14 | 100K | 0.58 | >800 |
| AdapterMix [7] | Fine-tuning | 144 | 53 | 45 | 18.83 | 224 + 8 |
| ForwardTacotron [8] | Traditional | 46 | 10 | 60 | 43 | 36 |
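As a concrete illustration of the RTF definition used above (generated speech duration divided by synthesis time, higher is faster), a minimal measurement sketch could look as follows; `synthesize` is a hypothetical stand-in for any of the models, not an API of the systems listed:

```python
import time

def real_time_factor(synthesize, text, sample_rate):
    """RTF = generated speech duration / synthesis time (higher = faster)."""
    start = time.perf_counter()
    audio = synthesize(text)  # expected to return a sequence of samples
    elapsed = time.perf_counter() - start
    duration = len(audio) / sample_rate
    return duration / elapsed
```

An RTF above 1 means the model synthesizes faster than real time; for example, an RTF of 43 means one second of speech takes roughly 23 ms to generate.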

Audio Examples

Here, we share a few samples synthesized using the models listed above for two target speakers, along with the corresponding text. AdapterMix was fine-tuned using twenty minutes of target speaker data. The ForwardTacotron versions used all available data for the target speaker: 7 hours for 'Ian' and 23 hours for '92'.


For both target speakers, Ian (M) and 92 (F), each model synthesized the following two sentences:

1. "There are no particular "rainy" and "dry" seasons: the amount of rain stays roughly the same throughout the year."
2. "When returning home after living abroad, you've adapted to the new culture and lost some of your habits from your home culture."

Models: HierSpeech++, MaskGCT, FireRedTTS, F5-TTS, AdapterMix, ForwardTacotron Retrain [8], and ForwardTacotron Finetune.

Efficient Fine-tuning with ForwardTacotron

We employ noise augmentation during fine-tuning to address low-resource data scenarios, similar to our prior work [9]. Additionally, we freeze selected parameters during fine-tuning, thereby avoiding the computation of their gradients and updates, which reduces computational cost and improves parameter efficiency. Some audio samples synthesized using models trained with these approaches are provided below.
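A rough sketch of the two ideas, assuming a PyTorch model whose parameter names include a `postnet` prefix (the actual ForwardTacotron module names and the augmentation procedure of [9] may differ):

```python
import torch

def freeze_all_but(model, trainable_prefix="postnet"):
    """Freeze every parameter except those under the given submodule
    prefix; frozen parameters receive no gradients, saving compute."""
    for name, param in model.named_parameters():
        param.requires_grad = name.startswith(trainable_prefix)

def noise_augment(waveform, snr_db=20.0):
    """Add Gaussian noise to a training waveform at the given SNR,
    a simple form of noise augmentation for low-resource data."""
    signal_power = waveform.pow(2).mean()
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    return waveform + torch.randn_like(waveform) * noise_power.sqrt()
```

After freezing, only the remaining trainable parameters (e.g., `filter(lambda p: p.requires_grad, model.parameters())`) are handed to the optimizer, which is what makes the fine-tuning parameter-efficient.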


For each fine-tuning subset, the sentence synthesized for Ian (M) was "Each temple had an open temple courtyard and then an inner sanctuary that only the priests could enter." and for 92 (F) "For example, one might say that the motor car necessarily leads to the development of roads."

Fine-tuning subsets:
- 7200s
- 7200s + NoiseAug.
- 7200s + NoiseAug. (Postnet-1L)¹
- 1200s
- 1200s + 5 × NoiseAug.

¹Parameter-efficient version with only 13.34 M trainable parameters, whereas the other models had 23.03 M trainable parameters.

References

  1. Christian Schäfer, Ollie McCarthy, and contributors
    ForwardTacotron
    https://github.com/as-ideas/ForwardTacotron, 2020.
    @misc{Schaefer20_ForwardTacotron_Github,
    author = {Christian Schäfer and Ollie McCarthy and contributors},
    howpublished = {\url{https://github.com/as-ideas/ForwardTacotron}},
    journal = {GitHub repository},
    publisher = {GitHub},
    title = {{ForwardTacotron}},
    year = {2020}
    }
  2. Jonathan Shen, Ruoming Pang, Ron J. Weiss, Mike Schuster, Navdeep Jaitly, Zongheng Yang, Zhifeng Chen, Yu Zhang, Yuxuan Wang, RJ Skerry-Ryan, Rif A. Saurous, Yannis Agiomyrgiannakis, and Yonghui Wu
    Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions
    In Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP): 4779–4783, 2018. DOI
    @inproceedings{shen_natural_2018,
    title = {Natural {TTS} Synthesis by Conditioning {WaveNet} on {Mel} Spectrogram Predictions},
    author = {Jonathan Shen and Ruoming Pang and Ron J. Weiss and Mike Schuster and Navdeep Jaitly and Zongheng Yang and Zhifeng Chen and Yu Zhang and Yuxuan Wang and RJ Skerry-Ryan and Rif A. Saurous and Yannis Agiomyrgiannakis and Yonghui Wu},
    booktitle = {Proc. {IEEE} Intl. Conf. on Acoustics, Speech and Signal Processing
    (ICASSP)},
    year = 2018,
    pages = {4779--4783},
    isbn = {978-1-5386-4658-8},
    doi = {10.1109/ICASSP.2018.8461368},
    }
  3. Sang-Hoon Lee, Ha-Yeong Choi, Seung-Bin Kim, and Seong-Whan Lee
    HierSpeech++: Bridging the Gap Between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-Shot Speech Synthesis
    IEEE Transactions on Neural Networks and Learning Systems, 36(10): 18422–18436, 2025. DOI
    @article{lee2025_itnnls_hierspeech++,
    author={Lee, Sang-Hoon and Choi, Ha-Yeong and Kim, Seung-Bin and Lee, Seong-Whan},
    journal={IEEE Transactions on Neural Networks and Learning Systems},
    title={HierSpeech++: Bridging the Gap Between Semantic and Acoustic Representation of Speech by Hierarchical Variational Inference for Zero-Shot Speech Synthesis},
    year={2025},
    volume={36},
    number={10},
    pages={18422-18436},
    doi={10.1109/TNNLS.2025.3584944}
    }
  4. Yuancheng Wang, Haoyue Zhan, Liwei Liu, Ruihong Zeng, Haotian Guo, Jiachen Zheng, Qiang Zhang, Xueyao Zhang, Shunsi Zhang, and Zhizheng Wu
    MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer
    arXiv preprint arXiv:2409.00750, 2024.
    @article{wang2024_arxiv_maskgct,
    title={MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec Transformer},
    author={Wang, Yuancheng and Zhan, Haoyue and Liu, Liwei and Zeng, Ruihong and Guo, Haotian and Zheng, Jiachen and Zhang, Qiang and Zhang, Xueyao and Zhang, Shunsi and Wu, Zhizheng},
    journal={arXiv preprint arXiv:2409.00750},
    year={2024}
    }
  5. Hao-Han Guo, Yao Hu, Fei-Yu Shen, Xu Tang, Yi-Chen Wu, Feng-Long Xie, and Kun Xie
    FireRedTTS-1S: An upgraded streamable foundation text-to-speech system
    arXiv preprint arXiv:2503.20499, 2025.
    @article{guo2025_arxiv_fireredtts,
    title={{FireRedTTS-1S}: An upgraded streamable foundation text-to-speech system},
    author={Guo, Hao-Han and Hu, Yao and Shen, Fei-Yu and Tang, Xu and Wu, Yi-Chen and Xie, Feng-Long and Xie, Kun},
    journal={arXiv preprint arXiv:2503.20499},
    year={2025}
    }
  6. Yushen Chen, Zhikang Niu, Ziyang Ma, Keqi Deng, Chunhui Wang, Jian Zhao, Kai Yu, and Xie Chen
    F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching
    arXiv preprint arXiv:2410.06885, 2024.
    @article{chen2024_arxiv_f5tts,
    title={{F5-TTS}: A fairytaler that fakes fluent and faithful speech with flow matching},
    author={Chen, Yushen and Niu, Zhikang and Ma, Ziyang and Deng, Keqi and Wang, Chunhui and Zhao, Jian and Yu, Kai and Chen, Xie},
    journal={arXiv preprint arXiv:2410.06885},
    year={2024}
    }
  7. Ambuj Mehrish, Abhinav Ramesh Kashyap, Li Yingting, Navonil Majumder, and Soujanya Poria
    AdapterMix: Exploring the efficacy of mixture of adapters for low-resource TTS adaptation
    In Proc. Interspeech: 4284–4288, 2023. DOI
    @inproceedings{mehrish23_interspeech_adaptermix,
    author={Ambuj Mehrish and Abhinav {Ramesh Kashyap} and Li Yingting and Navonil Majumder and Soujanya Poria},
    title={{AdapterMix: Exploring the efficacy of mixture of adapters for low-resource TTS adaptation}},
    year=2023,
    booktitle={Proc. Interspeech},
    pages={4284--4288},
    doi={10.21437/Interspeech.2023-1568}
    }
  8. Frank Zalkow, Paolo Sani, Michael Fast, Judith Bauer, Mohammad Joshaghani, Kishor Kayyar, Emanuël A. P. Habets, and Christian Dittmar
    The AudioLabs System for the Blizzard Challenge 2023
    In Proceedings of the Blizzard Challenge Workshop: 63–68, 2023.
    @inproceedings{ZalkowEtAl23_AudioLabs_Blizzard,
    address = {Grenoble, France},
    author = {Frank Zalkow and Paolo Sani and Michael Fast and Judith Bauer and Mohammad Joshaghani and Kishor Kayyar and Emanu{\"e}l A. P. Habets and Christian Dittmar},
    booktitle = {Proceedings of the Blizzard Challenge Workshop},
    pages = {63--68},
    title = {The {AudioLabs} System for the {B}lizzard {C}hallenge 2023},
    year = {2023}
    }
  9. Kishor Kayyar, Frank Zalkow, Christian Dittmar, Nicola Pia, and Emanuël A. P. Habets
    Low-Resource Text-to-Speech Synthesis Using Noise-Augmented Training of ForwardTacotron
    In Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing (ICASSP): 1–5, 2025. DOI
    @inproceedings{kayyar25_icassp_lowres,
    author={Kishor Kayyar and Frank Zalkow and Christian Dittmar and Nicola Pia and Emanu{\"e}l A. P. Habets},
    booktitle={Proc. {IEEE} Intl. Conf. on Acoustics, Speech and Signal Processing
    (ICASSP)},
    title={Low-Resource Text-to-Speech Synthesis Using Noise-Augmented Training of ForwardTacotron},
    year={2025},
    pages={1-5},
    doi={10.1109/ICASSP49660.2025.10890686}
    }