Sebastian Strahl, Yigitcan Özer, Hans-Ulrich Berendes, Meinard Müller
This website is related to the following publication:
@inproceedings{StrahlOBM25_TextAlignSynthesis_SMC,
author = {Sebastian Strahl and Yigitcan {\"O}zer and Hans-Ulrich Berendes and Meinard M{\"u}ller},
title = {Hearing Your Way Through Music Recordings: {A} Text Alignment and Synthesis Approach},
booktitle = {Proceedings of the Sound and Music Computing Conference ({SMC})},
pages = {65--72},
address = {Graz, Austria},
year = {2025},
doi = {10.5281/zenodo.15839750},
url-pdf = {https://audiolabs-erlangen.de/content/05_fau/professor/00_mueller/03_publications/2025_StrahlOBM_TextAlignSynthesis_SMC_ePrint.pdf},
url-details = {https://audiolabs-erlangen.de/resources/MIR/2025-SMC-TextAlignSynth},
url-code = {https://github.com/groupmm/textalignsynth},
}
Annotations related to musical events such as chord labels, measure numbers, or structural descriptions are typically provided in textual format within or alongside a score-based representation of a piece. However, following these annotations while listening to a recording can be challenging without additional visual or auditory display. In this paper, we introduce an approach for enriching the listening experience by mixing music recordings with synthesized text annotations. Our approach aligns text annotations from a score-based timeline to the timeline of a specific recording and then utilizes text-to-speech synthesis to acoustically superimpose them onto the recording. We describe a processing pipeline for implementing this approach, allowing users to customize settings such as speaking language, speed, speech positioning, and loudness. Case studies include synthesizing text comments on measure positions in Schubert songs, chord annotations for Beatles songs, structural elements of Beethoven piano sonatas, and leitmotif occurrences in Wagner operas. Beyond these specific examples, our aim is to highlight the broader potential of speech-based auditory display. This approach offers valuable tools for researchers seeking a deeper understanding of datasets and their annotations, for evaluating music information retrieval algorithms, or for educational purposes in instrumental training, music-making, and aural training.
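To make the pipeline more concrete, the following Python sketch illustrates the three steps described above: mapping annotation positions from a score-based timeline to the recording's timeline, synthesizing each annotation as speech, and superimposing the speech onto the recording. It is a minimal illustration, not the authors' implementation (see the linked repository for that); the precomputed warping path, the gTTS engine, and the gain parameter are assumptions made only for this sketch.

# Minimal illustration of the pipeline (not the authors' implementation; see
# the linked repository). Assumptions: a precomputed warping path from music
# synchronization, gTTS as a stand-in text-to-speech engine, and a simple
# additive mix with a global gain.
import numpy as np
import soundfile as sf
import librosa
from gtts import gTTS


def map_score_to_audio_time(score_times, warping_path):
    """Map positions given on the score-based timeline (e.g., seconds of a
    synchronized reference version) to the recording's timeline by linear
    interpolation over a warping path of (score_time, audio_time) pairs."""
    wp = np.asarray(warping_path, dtype=float)
    return np.interp(score_times, wp[:, 0], wp[:, 1])


def synthesize_and_mix(recording_path, annotations, warping_path,
                       out_path="mix.wav", lang="en", gain=0.5):
    """annotations: list of (score_time_in_seconds, text) pairs."""
    x, sr = librosa.load(recording_path, sr=None, mono=True)
    audio_times = map_score_to_audio_time([t for t, _ in annotations],
                                          warping_path)
    for (_, text), t_audio in zip(annotations, audio_times):
        # Text-to-speech synthesis of the annotation text
        gTTS(text=text, lang=lang).save("_tts_tmp.mp3")
        speech, _ = librosa.load("_tts_tmp.mp3", sr=sr, mono=True)
        # Acoustically superimpose the spoken comment at the aligned position
        start = min(int(round(t_audio * sr)), len(x))
        end = min(start + len(speech), len(x))
        x[start:end] += gain * speech[:end - start]
    sf.write(out_path, x, sr)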
Using case studies, we demonstrate how the proposed processing pipeline can be applied in the field of music information retrieval (MIR). Here, we use it to sonify measure number annotations for a music recording, announcing the measure numbers acoustically as the recording plays. The example is taken from the Schubert Winterreise Dataset [1].
@article{WeissZAMKVG21_WinterreiseDataset_ACM-JOCCH,
author = {Christof Wei{\ss} and Frank Zalkow and Vlora Arifi-M{\"u}ller and Meinard M{\"u}ller and Hendrik Vincent Koops and Anja Volk and Harald Grohganz},
title = {{S}chubert {W}interreise Dataset: {A} Multimodal Scenario for Music Analysis},
journal = {{ACM} Journal on Computing and Cultural Heritage ({JOCCH})},
volume = {14},
number = {2},
pages = {25:1--18},
year = {2021},
doi = {10.1145/3429743},
url-demo = {https://doi.org/10.5281/zenodo.4122060}
}
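For the measure-number scenario above, the annotation texts can be generated directly from the measure positions. The following usage sketch builds (position, text) pairs and feeds them to the synthesize_and_mix function from the sketch above; the file names, CSV column names, and the identity warping path are hypothetical and only serve to illustrate the call.

# Hypothetical usage for announcing measure numbers (file names, column names,
# and the identity warping path are illustrative assumptions).
import pandas as pd

measures = pd.read_csv("measure_annotations.csv")  # assumed columns: "measure", "start"
annotations = [(row["start"], f"measure {int(row['measure'])}")
               for _, row in measures.iterrows()]

# If the measure positions are already given on the recording's own timeline,
# an identity warping path suffices; otherwise, use a path obtained from
# music synchronization (e.g., dynamic time warping on chroma features).
identity_path = [(0.0, 0.0), (600.0, 600.0)]

synthesize_and_mix("recording.wav", annotations, identity_path,
                   out_path="recording_with_measure_announcements.wav",
                   lang="en", gain=0.4)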
This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Grant No. 500643750 (MU 2686/15-1). The authors are with the International Audio Laboratories Erlangen, a joint institution of the Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU) and the Fraunhofer Institute for Integrated Circuits IIS.