Kishor Kayyar Lakshminarayana, Christian Dittmar, Nicola Pia, Emanuël A.P. Habets
Presented at the 15th ITG Conference on Speech Communication, Aachen, Germany, 20-22 September 2023
Several non-autoregressive methods for fast and efficient text-to-speech synthesis have been proposed. Most of them use a duration predictor to estimate the temporal sequence of phonemes in the speech. This prediction is typically based on the input phoneme sequence alone, in a speaker-independent fashion. The resulting constant speech pace across speakers is unnatural, since every human has a characteristic speaking rate. This paper proposes an extension of the multi-speaker ForwardTacotron that learns this aspect through trainable speaker embeddings. Across multiple speakers, the durations of speech synthesized by the proposed model are much closer to those of speech synthesized by a baseline auto-regressive model. The proposed extension yields marginal improvements in intelligibility, as measured by an automated semantically unpredictable sentence test. Furthermore, a listening test shows that speech rhythm does not play a significant role in perceptual quality assessment.
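As a rough illustration of the core idea, the sketch below shows one way a duration predictor can be conditioned on a trainable speaker embedding so that predicted phoneme durations vary per speaker. This is a minimal sketch, not the authors' implementation: the module names, layer choices, dimensions, and the concatenation of the speaker embedding with the phoneme encodings are all assumptions.

```python
import torch
import torch.nn as nn

class SpeakerConditionedDurationPredictor(nn.Module):
    """Toy duration predictor conditioned on a learned speaker embedding.

    Illustrative only: the GRU, layer sizes, and the concatenation
    strategy are assumptions, not the paper's exact architecture.
    """

    def __init__(self, num_speakers: int, phoneme_dim: int = 256,
                 speaker_dim: int = 64, hidden_dim: int = 256):
        super().__init__()
        # Trainable per-speaker embedding, learned jointly with the model.
        self.speaker_embedding = nn.Embedding(num_speakers, speaker_dim)
        self.rnn = nn.GRU(phoneme_dim + speaker_dim, hidden_dim,
                          batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden_dim, 1)

    def forward(self, phoneme_enc: torch.Tensor,
                speaker_id: torch.Tensor) -> torch.Tensor:
        # phoneme_enc: (batch, num_phonemes, phoneme_dim)
        # speaker_id:  (batch,)
        spk = self.speaker_embedding(speaker_id)  # (batch, speaker_dim)
        # Broadcast the speaker embedding to every phoneme position.
        spk = spk.unsqueeze(1).expand(-1, phoneme_enc.size(1), -1)
        x, _ = self.rnn(torch.cat([phoneme_enc, spk], dim=-1))
        # One non-negative duration (in frames) per phoneme.
        return torch.relu(self.proj(x)).squeeze(-1)


# Usage: predict per-phoneme durations for two different speakers.
pred = SpeakerConditionedDurationPredictor(num_speakers=10)
phonemes = torch.randn(2, 20, 256)            # dummy phoneme encodings
durations = pred(phonemes, torch.tensor([0, 3]))
print(durations.shape)                        # torch.Size([2, 20])
```

Because the speaker embedding is part of the duration predictor's input, the same phoneme sequence can yield different durations for different speaker IDs, which is the behavior the proposed extension aims for.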