This is the accompanying website for the following paper:
@inproceedings{ZalkowGMHD23_EvalAlignmentTTS_ICASSP, author = {Frank Zalkow and Prachi Govalkar and Meinard M{\"u}ller and Emanu{\"e}l A.\ P.\ Habets and Christian Dittmar}, title = {Evaluating Speech--Phoneme Alignment and Its Impact on Neural Text-To-Speech Synthesis}, booktitle = {Proceedings of the {IEEE} International Conference on Acoustics, Speech, and Signal Processing ({ICASSP})}, address = {Rhodes Island, Greece}, year = {2023}, pages = {}, doi = {10.1109/ICASSP49357.2023.10097248}, url-pdf = {https://ieeexplore.ieee.org/document/10097248}, url-details = {https://www.audiolabs-erlangen.de/resources/NLUI/2023-ICASSP-eval-alignment-tts}, }
In recent years, the quality of text-to-speech (TTS) synthesis has vastly improved due to deep-learning techniques, with parallel architectures, in particular, providing excellent synthesis quality at fast inference speed. Training these models usually requires speech recordings, corresponding phoneme-level transcripts, and the temporal alignment of each phoneme to the utterances. Since manually creating such fine-grained alignments requires expert knowledge and is time-consuming, it is common practice to estimate them using automatic speech–phoneme alignment methods. In the literature, either the estimation methods' accuracy or their impact on the TTS system's synthesis quality is evaluated. In this study, we perform experiments with five state-of-the-art speech–phoneme aligners and evaluate their output with objective and subjective measures. As our main result, we show that small alignment errors (below 75 ms) do not decrease the synthesis quality, which implies that the alignment error may not be the crucial factor when choosing an aligner for TTS training.
Our implementation of the RAD system was inspired by the parallel model described by Badlani et al. [1], which extends and generalizes RAD-TTS [2]. We use a few 1-D convolutional blocks each for the phonetic encoder (processing phoneme embeddings) and the spectral encoder (processing log-scaled mel-spectral frames). An alignment matrix is then computed from the negative Euclidean distances between all pairs of phonetic and spectral encodings. We then apply the forward-sum loss to this matrix, as described by Shih et al. [2]. As proposed by the original authors, we also use an additional loss term that increases the alignment values around a prior located along the main diagonal. Furthermore, we employ a binarization loss term, which is added once the model has converged and a second training phase begins. The table below summarises the architecture of the model; a code sketch of the alignment objective follows the table.
| | Layer | Output | Activation | Parameters |
|---|---|---|---|---|
| Phonetic Encoder | Input | (N, V) | | |
| | Embedding (512) | (N, 512) | | 512 · V |
| | 1D-Conv (3) | (N, 512) | lReLU | 786432 |
| | 1D-BatchNorm | (N, 512) | | 1024 |
| | 1D-Conv (1) | (N, 512) | lReLU | 262144 |
| | 1D-BatchNorm | (N, 512) | | 1024 |
| Spectral Encoder | Input | (N, M) | | |
| | 1D-Conv (3) | (N, 512) | lReLU | 1536 · M |
| | 1D-BatchNorm | (N, 512) | | 1024 |
| | 1D-Conv (3) | (N, 512) | lReLU | 786432 |
| | 1D-BatchNorm | (N, 512) | | 1024 |
| | 1D-Conv (1) | (N, 512) | lReLU | 262144 |
| | 1D-BatchNorm | (N, 512) | | 1024 |
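For illustration, the following PyTorch sketch shows how such an alignment matrix and the forward-sum loss can be computed from the two encoder outputs. It is a minimal sketch under our own naming and shape conventions: the `temperature` parameter is an illustrative detail, and the diagonal-prior and binarization loss terms described above are omitted.

```python
import torch
import torch.nn.functional as F


def forward_sum_alignment_loss(phon_enc, spec_enc, temperature=1.0):
    """Soft speech-phoneme alignment from the two encoder outputs.

    phon_enc: (num_phonemes, C) phonetic encoder output
    spec_enc: (num_frames, C) spectral encoder output
    Returns the forward-sum loss and the log-attention matrix.
    """
    # Pairwise squared Euclidean distances between all frame/phoneme pairs;
    # a smaller distance means a higher alignment score.
    dist = torch.cdist(spec_enc.unsqueeze(0), phon_enc.unsqueeze(0)).squeeze(0) ** 2

    # Log-attention over phonemes for every frame (the matrix to which the
    # diagonal prior and, later, the binarization term would be applied).
    log_attn = F.log_softmax(-dist / temperature, dim=1)

    # Forward-sum loss via CTC: the target sequence visits every phoneme index
    # 1..num_phonemes exactly once, in order; index 0 acts as the CTC blank.
    num_frames, num_phonemes = dist.shape
    scores = F.pad(-dist / temperature, (1, 0), value=-1e4)  # prepend blank column
    log_probs = F.log_softmax(scores, dim=1).unsqueeze(1)    # (num_frames, 1, num_phonemes + 1)
    targets = torch.arange(1, num_phonemes + 1).unsqueeze(0)
    loss = F.ctc_loss(log_probs, targets,
                      input_lengths=torch.tensor([num_frames]),
                      target_lengths=torch.tensor([num_phonemes]),
                      blank=0, zero_infinity=True)
    return loss, log_attn
```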
We were inspired by the idea of Teytaut and Roebel [3] to temporally stabilize the CTC-based posteriogram using a spectral decoder. We simplified all other aspects of their model, e.g., by removing the phonetic attention mechanism. In this section, we describe the simplified model, highlighting the changes compared to the original model [3].

The input to the model is only a mel spectrogram (without any scaling, side information, or delta features as used in [3]). A 2-layer bidirectional LSTM with 128 units (instead of 512 units as in [3]) directly processes the input (without prior convolutional blocks as in [3]). Similar to [3], a linear layer reduces the feature dimension to the size of the CTC posteriogram (the size of the phoneme alphabet plus the additional blank symbol). Unlike [3], we do not use a recurrent network for the spectral decoder but three fully connected (dense) layers. We think a recurrent network might compensate for temporal inaccuracies in the posteriogram, which would undermine the decoder's aim of stabilizing the posteriogram temporally. Thus, limiting the receptive field of the decoder should be beneficial for that aim. The table below summarises the architecture of the model. For training, we weight the reconstruction loss with 1.0 instead of 0.1.
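As a rough illustration, the following PyTorch sketch instantiates such a simplified encoder–decoder. Layer sizes follow the table further below; class and variable names, the position of the blank symbol, and all other implementation details are our own assumptions for this sketch, not details taken from [3] or from our exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimplifiedCTCAligner(nn.Module):
    """Minimal sketch of the simplified CTC aligner (illustrative names)."""

    def __init__(self, num_mels, num_phonemes, blank_index=0):
        super().__init__()
        self.blank_index = blank_index
        # Encoder: two bidirectional LSTM layers with 128 units per direction
        self.blstm1 = nn.LSTM(num_mels, 128, bidirectional=True, batch_first=True)
        self.blstm2 = nn.LSTM(256, 128, bidirectional=True, batch_first=True)
        # Linear layer mapping to the phoneme alphabet plus the CTC blank symbol
        self.to_posteriogram = nn.Linear(256, num_phonemes + 1)
        # Decoder: three frame-wise dense layers reconstructing the mel input;
        # no recurrence, so its temporal receptive field is a single frame
        self.decoder = nn.Sequential(
            nn.Linear(num_phonemes, 256), nn.LeakyReLU(),
            nn.Linear(256, 256), nn.LeakyReLU(),
            nn.Linear(256, num_mels),
        )

    def forward(self, mel):
        # mel: (batch, num_frames, num_mels), unscaled mel spectrogram
        hidden, _ = self.blstm1(mel)
        hidden = F.leaky_relu(hidden)
        hidden, _ = self.blstm2(hidden)
        hidden = F.leaky_relu(hidden)
        posteriogram = torch.softmax(self.to_posteriogram(hidden), dim=-1)
        # Remove the blank symbol before feeding the decoder
        keep = [i for i in range(posteriogram.shape[-1]) if i != self.blank_index]
        mel_rec = self.decoder(posteriogram[..., keep])
        return posteriogram, mel_rec
```

During training, a CTC loss on the log-posteriogram would be combined with a reconstruction loss (e.g., a mean squared error) between `mel_rec` and the input mel spectrogram, the latter weighted by 1.0 as noted above.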
To retrieve the final alignment from the CTC posteriogram, we employ a different procedure compared to [3]. First, we remove the blank symbol's probability from the posteriogram and ℓ1-normalize the rows, similar to [4]. Then, we apply a dynamic programming procedure (similar to the Viterbi algorithm) to find a probability-maximizing monotonic alignment; a code sketch of this step is given after the table below.
| | Layer | Output | Activation | Parameters |
|---|---|---|---|---|
| Encoder | Input | (N, M) | | |
| | Bi-LSTM | (N, 256) | lReLU | 1024 · (M + 130) |
| | Bi-LSTM | (N, 256) | lReLU | 395264 |
| | Dense | (N, V + 1) | softmax | 257 · (V + 1) |
| Decoder | Input | (N, V + 1) | | |
| | Remove blank | (N, V) | | |
| | Dense | (N, 256) | lReLU | (V + 1) · 256 |
| | Dense | (N, 256) | lReLU | 65792 |
| | Dense | (N, M) | linear | 257 · M |
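The following NumPy sketch illustrates the alignment-extraction step described above. The function name and the exact dynamic-programming formulation (each frame either stays on the current phoneme or advances to the next one) are our illustrative choices.

```python
import numpy as np


def extract_alignment(posteriogram, phoneme_seq):
    """Monotonic, probability-maximizing frame-to-phoneme alignment.

    posteriogram: (T, V) array with the blank removed and rows l1-normalized
    phoneme_seq:  sequence of K phoneme indices of the utterance
    Returns an array of length T mapping every frame to a position 0..K-1.
    """
    T = posteriogram.shape[0]
    K = len(phoneme_seq)
    eps = np.finfo(float).tiny
    # Log-probability of observing phoneme position k at frame t
    log_p = np.log(posteriogram[:, phoneme_seq] + eps)  # (T, K)

    # dp[t, k]: best cumulative log-probability of being at position k at frame t
    dp = np.full((T, K), -np.inf)
    came_from_prev = np.zeros((T, K), dtype=bool)
    dp[0, 0] = log_p[0, 0]
    for t in range(1, T):
        for k in range(min(K, t + 1)):
            stay = dp[t - 1, k]
            advance = dp[t - 1, k - 1] if k > 0 else -np.inf
            if advance > stay:
                dp[t, k] = advance + log_p[t, k]
                came_from_prev[t, k] = True
            else:
                dp[t, k] = stay + log_p[t, k]

    # Backtrack from the last phoneme at the last frame
    path = np.zeros(T, dtype=int)
    k = K - 1
    for t in range(T - 1, -1, -1):
        path[t] = k
        if came_from_prev[t, k]:
            k -= 1
    return path
```

The frame indices at which `path` advances to the next position then give the estimated phoneme boundaries.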
T6B06201847: "You have neither of you any doubt as to your son's guilt?"
T6B06202613: "Very good. Now, Mister Wilson?"
T6B06202664: "Good God! What a week she must have spent!"
T6B06202849: "Did Lady Brackenstall say that screw was used?"
T6B06202850: "Are any of your people tinsmiths?"
T6B06313689: A very warm welcome to you and your family.
T6B06314479: six eight seven six seven three
T6B06324701: Mangel-wurzels are grown chiefly as cattle feed.
T6B06325330: The chewing-gum tasted spearminty.
T6B06335464: Do you not think it strange that these judgments are made?
We thank all participants of our listening test. Furthermore, we thank Alexander Adami for fruitful discussions on the listening test design and its evaluation. Parts of this work have been supported by the SPEAKER project (FKZ 01MK20011A), funded by the German Federal Ministry for Economic Affairs and Climate Action. In addition, this work was supported by the Free State of Bavaria in the DSAI project. The International Audio Laboratories Erlangen are a joint institution of the Friedrich–Alexander-Universität Erlangen-Nürnberg (FAU) and Fraunhofer Institute for Integrated Circuits IIS. The authors gratefully acknowledge the technical support and HPC resources provided by the Erlangen National High Performance Computing Center (NHR@FAU) of the FAU.
@article{BadlaniEtAl21_TTSAlignment_arXiv, title = {One {TTS} Alignment To Rule Them All}, author = {Rohan Badlani and Adrian {\L{}}a{\'n}cucki and Kevin J. Shih and Rafael Valle and Wei Ping and Bryan Catanzaro}, journal = {CoRR}, year = {2021}, volume = {abs/2108.10447}, eprinttype = {arXiv}, eprint = {2108.10447}, }
@inproceedings{ShihEtAl21_RADTTS_ICML, author = {Kevin J. Shih and Rafael Valle and Rohan Badlani and Adrian {\L{}}a{\'n}cucki and Wei Ping and Bryan Catanzaro}, title = {{RAD-TTS}: {P}arallel Flow-Based {TTS} with Robust Alignment Learning and Diverse Synthesis}, booktitle = {Proceedings of the International Conference on Machine Learning ({ICML}) Workshop on Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models}, year = {2021}, }
@inproceedings{TeytautRoebel21_PhonemeAudioAlignment_Interspeech, author = {Yann Teytaut and Axel Roebel}, title = {Phoneme-to-Audio Alignment with Recurrent Neural Networks for Speaking and Singing Voice}, booktitle = {Proceedings of the Annual Conference of the International Speech Communication Association (Interspeech)}, pages = {61--65}, address = {Brno, Czech Republic}, year = {2021}, }
@article{ZalkowMueller_CTC_TASLP, author = {Frank Zalkow and Meinard M{\"{u}}ller}, title = {{CTC}-Based Learning of Chroma Features for Score-Audio Music Retrieval}, journal = {{IEEE}/{ACM} Transactions on Audio, Speech, and Language Processing}, year = {2021}, volume = {29}, pages = {2957--2971}, doi = {10.1109/TASLP.2021.3110137}, }