AudioLabs - Multi-Voice Intonation Adaptation via Gradient Descent

This is the accompanying website to the article

Simon Schwär and Meinard Müller
Multi-Voice Intonation Adaptation via Gradient Descent
IEEE Transactions on Audio, Speech and Language Processing (TASLPRO), 34: 2491–2503, 2026.

@article{SchwaerM26_IntonationGradientDescent_TASLPRO,
author    = {Simon Schw{\"a}r and Meinard M{\"u}ller},
title     = {Multi-Voice Intonation Adaptation via Gradient Descent},
journal   = {IEEE Transactions on Audio, Speech and Language Processing ({TASLPRO})},
year      = {2026},
volume    = {34},
pages     = {2491--2503},
doi       = {10.1109/TASLPRO.2026.3675812},
url-pdf   = {https://ieeexplore.ieee.org/document/11447422},
}

Abstract

Intonation in ensemble performances on instruments with flexible tuning involves a complex interaction between musicians shaped by musical context, acoustic conditions, and each performer's perception and preferences. In post-production, it is often desirable to compensate for unintended deviations while preserving expressive fluctuations and musically meaningful interaction between individual tracks and voices. In this paper, we formulate multi-voice intonation adaptation as a cost minimization problem, making three main contributions. First, we introduce a differentiable cost function that explicitly balances the adherence of each voice to an equal-tempered pitch grid and the harmonic fit between voices using sensory dissonance. Second, based on this differentiable cost function, we derive a gradient descent adaptation algorithm that produces smooth, time-varying pitch-shift curves without requiring score or note-level information. We show how a small set of interpretable hyperparameters, including initialization, stopping criterion, step size, and momentum, allows for a controlled trade-off between the compensation of unintended intonation deviations and the preservation of expressive fluctuations. Third, we evaluate our method on string, wind, and vocal quartet multi-track recordings through objective and subjective experiments, demonstrating quality comparable to a commercial pitch-correction baseline while offering particular advantages in handling intonation drift and modeling interactions between voices. Beyond these results, the focus of this work is conceptual, making a typically heuristic post-production task transparent and controllable through an explicit cost-based optimization framework.
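To make the idea of a cost that trades off grid adherence against harmonic fit more concrete, the following is a minimal sketch and not the cost function from the article. It assumes a cosine-shaped penalty for the deviation from a 12-TET grid referenced to 440 Hz, a pairwise roughness term using the constants commonly quoted for Sethares' fit to the Plomp-Levelt curves, harmonic partials with 1/k amplitudes, and a hypothetical weight `w_grid`. How the hyperparameter $w$ used on this page maps onto the two terms is not stated here, so `w_grid` should not be read as identical to $w$. The absolute value in the roughness term leaves a kink where partials coincide, which this sketch does not smooth, and the gradient is approximated by central differences.

```python
import numpy as np

A4_HZ = 440.0  # assumed reference pitch for the 12-TET grid


def grid_cost(f_hz):
    """Smooth, 100-cent-periodic penalty: 0 on the grid, 1 at a quarter tone."""
    cents = 1200.0 * np.log2(np.asarray(f_hz, dtype=float) / A4_HZ)
    return 0.5 * (1.0 - np.cos(2.0 * np.pi * cents / 100.0))


def tone_dissonance(f1, f2):
    """Roughness of two pure tones (Sethares-style Plomp-Levelt fit, assumed)."""
    f_lo = np.minimum(f1, f2)
    s = 0.24 / (0.0207 * f_lo + 18.96)
    diff = np.abs(f2 - f1)
    return np.exp(-3.51 * s * diff) - np.exp(-5.75 * s * diff)


def total_cost(p_cents, f0_hz, w_grid=0.5, n_partials=6):
    """Hypothetical combined cost for one frame, given pitch shifts in cents."""
    shifts = np.asarray(p_cents, dtype=float)
    f = np.asarray(f0_hz, dtype=float) * 2.0 ** (shifts / 1200.0)
    k = np.arange(1, n_partials + 1)
    partials = f[:, None] * k[None, :]               # shape (voices, partials)
    amps = np.broadcast_to(1.0 / k, partials.shape)  # assumed 1/k spectrum
    diss = 0.0
    for i in range(len(f)):
        for j in range(i + 1, len(f)):
            fi, fj = partials[i][:, None], partials[j][None, :]
            ai, aj = amps[i][:, None], amps[j][None, :]
            diss += np.sum(ai * aj * tone_dissonance(fi, fj))
    return w_grid * np.sum(grid_cost(f)) + (1.0 - w_grid) * diss


def numerical_gradient(cost, p, eps=1e-3):
    """Central-difference gradient of cost with respect to each voice's shift."""
    p = np.asarray(p, dtype=float)
    grad = np.zeros_like(p)
    for v in range(p.size):
        step = np.zeros_like(p)
        step[v] = eps
        grad[v] = (cost(p + step) - cost(p - step)) / (2.0 * eps)
    return grad
```

For a chord with F0 values `f0`, `numerical_gradient(lambda p: total_cost(p, f0), np.zeros(len(f0)))` gives the direction in which each voice's pitch shift would change the cost.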

Synthetic Example Cadence

The synthetic example cadence illustrates various aspects of ensemble intonation and includes a global intonation drift (all voices collectively end the cadence around 50 cents lower), note-level intonation deviations (e.g., the second note in the soprano is slightly higher than the first), and local expressive intonation (e.g., vibrato).
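The three components described above can be thought of as additive offsets in cents on top of the nominal pitch. The sketch below generates such an F0 trajectory for a single voice. It is an illustration only: the function name `synth_f0`, the linear drift shape, the frame rate, and the vibrato rate and extent are assumptions and are not taken from the article.

```python
import numpy as np


def synth_f0(midi_pitches, note_dur_s=1.0, frame_rate_hz=100.0,
             total_drift_cents=-50.0, note_offsets_cents=None,
             vibrato_rate_hz=5.5, vibrato_extent_cents=15.0):
    """Return (time, F0 in Hz) for a note sequence with drift, offsets, vibrato."""
    n_per_note = int(round(note_dur_s * frame_rate_hz))
    n_frames = n_per_note * len(midi_pitches)
    t = np.arange(n_frames) / frame_rate_hz
    if note_offsets_cents is None:
        note_offsets_cents = np.zeros(len(midi_pitches))

    # Nominal 12-TET pitch in cents relative to MIDI pitch 0.
    nominal = np.repeat(np.asarray(midi_pitches, dtype=float), n_per_note) * 100.0
    # Global drift: assumed linear from 0 to total_drift_cents.
    drift = np.linspace(0.0, total_drift_cents, n_frames)
    # Note-level deviations: one constant offset per note.
    offsets = np.repeat(np.asarray(note_offsets_cents, dtype=float), n_per_note)
    # Local expressive intonation: sinusoidal vibrato.
    vibrato = vibrato_extent_cents * np.sin(2.0 * np.pi * vibrato_rate_hz * t)

    cents = nominal + drift + offsets + vibrato
    f0_hz = 440.0 * 2.0 ** ((cents - 6900.0) / 1200.0)  # MIDI 69 = A4 = 440 Hz
    return t, f0_hz
```

For example, `synth_f0([72, 74], note_offsets_cents=[0.0, 10.0])` yields a two-note line whose second note sits slightly high, superimposed on the downward drift and vibrato.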

The accompanying figure shows (a) the sheet music and chord annotations, (b) the original fundamental frequency (F0) trajectories for each voice (S: soprano, A: alto, T: tenor, B: bass), and (c) the pitch-shift curves for each voice obtained with our intonation adaptation method (hyperparameters $w = 0.33$, $L = 1$, $\mu = 50$, $\beta = 0.9$), with the goal of counteracting the global drift while retaining local expressive intonation.

You can listen to this example cadence here:

Hyperparameter Settings with the Synthetic Example Cadence

$L \in \{ 1, 10, 100 \}$ with fixed $w = 1$, $\mu = 50$, $\beta = 0$, $p^{(0)}(n) = 0$

$p^{(0)}(n) = 0$ (zero init) vs. $p^{(0)}(n) = p^{(L)}(n-1)$ (prev init) with fixed $w = 1$, $L = 1$, $\mu = 50$, $\beta = 0$

$\mu \in \{ 5, 50, 500 \}$ with fixed $w = 1$, $L = 1$, $\beta = 0$, $p^{(0)}(n) = p^{(L)}(n-1)$

$\beta \in \{ 0, 0.9, 0.99 \}$ with fixed $w = 1$, $L = 1$, $\mu = 50$, $p^{(0)}(n) = p^{(L)}(n-1)$
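The role of these settings can be sketched with a small frame-wise update loop. The sketch reads $L$ as the number of gradient steps per frame, $\mu$ as the step size, and $\beta$ as the momentum coefficient, which matches the list of hyperparameters in the abstract but is an interpretation rather than a statement of the published algorithm. It further assumes a classical momentum update, a momentum buffer that carries over between frames, and a grid-only stand-in cost with an analytic gradient; the function names are hypothetical.

```python
import numpy as np


def grid_gradient(dev_cents):
    """Gradient of the stand-in cost 0.5 * (1 - cos(2*pi*dev/100)) w.r.t. dev."""
    return (np.pi / 100.0) * np.sin(2.0 * np.pi * dev_cents / 100.0)


def adapt(dev_cents, n_iter=1, step=50.0, momentum=0.0, prev_init=True):
    """Compute pitch-shift curves (cents) for deviations of shape (frames, voices).

    n_iter    -> L   (gradient steps per frame)
    step      -> mu  (step size)
    momentum  -> beta
    prev_init -> p^(0)(n) = p^(L)(n-1) if True, else p^(0)(n) = 0
    """
    dev = np.asarray(dev_cents, dtype=float)
    n_frames, n_voices = dev.shape
    shifts = np.zeros_like(dev)
    p = np.zeros(n_voices)
    velocity = np.zeros(n_voices)  # assumed to persist across frames
    for n in range(n_frames):
        if not prev_init:
            p = np.zeros(n_voices)  # zero init: restart from no shift
        for _ in range(n_iter):
            grad = grid_gradient(dev[n] + p)
            velocity = momentum * velocity - step * grad
            p = p + velocity
        shifts[n] = p
    return shifts
```

Under these assumptions, a single step per frame with zero initialization only ever moves a small distance from zero, whereas initializing from the previous frame lets the shift accumulate over time and follow slow trends such as drift. Calling `adapt(dev, n_iter=1, step=50.0, momentum=0.9)` corresponds to the kind of setting listed for the synthetic example.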

Application Examples

The following three case studies of multi-voice intonation adaptation demonstrate the flexibility of the cost minimization approach in different scenarios.

12-TET vs. JI-like Intonation in ChoraleBricks

In the ChoraleBricks recording process, instruments were recorded separately, without any real-time interaction between musicians to coordinate intonation. By using multi-voice intonation adaptation, we can simulate such an interaction in post-production and modify the intonation of each individual track so that they can be combined to form a plausible ensemble sound.
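As a reminder of how far just intonation (JI) intervals lie from their 12-TET counterparts, the following short calculation converts a few frequency ratios to cents. The selection of intervals is illustrative and not taken from the article.

```python
import math


def ratio_to_cents(ratio):
    """Convert a frequency ratio to an interval size in cents."""
    return 1200.0 * math.log2(ratio)


# Interval name -> (JI frequency ratio, size in 12-TET semitones)
INTERVALS = {
    "major third": (5 / 4, 4),
    "perfect fifth": (3 / 2, 7),
    "harmonic seventh": (7 / 4, 10),
}


def ji_offsets():
    """Offset of each JI interval from its 12-TET size, in cents."""
    return {name: ratio_to_cents(ratio) - 100.0 * semitones
            for name, (ratio, semitones) in INTERVALS.items()}


for name, offset in ji_offsets().items():
    print(f"{name}: {offset:+.1f} cents relative to 12-TET")
```

The JI major third is about 13.7 cents narrower than in 12-TET, the perfect fifth about 2 cents wider, and the harmonic seventh about 31.2 cents narrower.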

Adapting Global Intonation Drift in Dagstuhl ChoirSet

The recording of a vocal quartet singing the motet Locus Iste by Anton Bruckner contains a global intonation drift, where the musicians end the performance around 100 cents lower than they started. Using hyperparameters $w = 1$, $L = 1$, $\mu = 5$, $\beta = 0.9$, we can counteract this drift while preserving the local intonation of the singers. This could be beneficial, for example, in post-production, where a sound engineer might want to combine multiple takes into one coherent performance while still retaining as much of the original performance as possible.
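A pitch-shift curve in cents acts on an F0 trajectory as a time-varying frequency ratio. The toy example below builds a 100-cent downward drift by hand and compensates it with a matching slow shift curve. The curve is constructed for illustration and is not the output of the adaptation method; the function names are hypothetical.

```python
import numpy as np


def apply_shift(f0_hz, shift_cents):
    """Apply a pitch-shift curve (cents) to an F0 trajectory (Hz)."""
    f0 = np.asarray(f0_hz, dtype=float)
    shift = np.asarray(shift_cents, dtype=float)
    return f0 * 2.0 ** (shift / 1200.0)


def deviation_cents(f0_hz, ref_hz):
    """Deviation of an F0 trajectory from a reference frequency, in cents."""
    return 1200.0 * np.log2(np.asarray(f0_hz, dtype=float) / ref_hz)


# Toy voice: nominally A4, drifting linearly down by 100 cents.
n_frames = 500
drift = np.linspace(0.0, -100.0, n_frames)
f0_drifting = apply_shift(np.full(n_frames, 440.0), drift)

# A slowly varying shift curve that mirrors the drift restores the reference.
f0_corrected = apply_shift(f0_drifting, -drift)
```

Because the compensating curve varies slowly, any fast fluctuation that were present in the original trajectory, such as vibrato, would pass through unchanged.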

Modifying Global and Local Intonation in Virtuoso Strings

The excerpt from String Quartet Op. 74 No. 1 by Joseph Haydn contains two chords (D major and D major with added 7th). With the expressive intonation in the string ensemble performance (e.g., including notes with and without vibrato), we can compare the effects of a hyperparameter setting that targets only global adaptation towards JI-like intonation ($\mu = 50$, $\beta = 0.9$, $L = 1$) and a setting that enables a stronger local adaptation ($\mu = 50$, $\beta = 0.0$, $L = 100$).

Listening Test

Test Item 1: Synthetic Example Cadence

Test Item 2: Dagstuhl ChoirSet

Test Item 3: Virtuoso Strings

Test Item 4: ChoraleBricks

Code

A Python implementation of this intonation adaptation method is available on GitHub.

References

William A. Sethares
Tuning, Timbre, Spectrum, Scale
Springer, ISBN: 1-85233-797-4, 1998.

@book{Sethares98_sound_BOOK,
author    = {William A. Sethares},
title     = {Tuning, Timbre, Spectrum, Scale},
year      = {1998},
isbn      = {1-85233-797-4},
address   = {London},
publisher = {Springer},
}

Sebastian Rosenzweig, Helena Cuesta, Christof Weiß, Frank Scherbaum, Emilia Gómez, and Meinard Müller
Dagstuhl ChoirSet: A Multitrack Dataset for MIR Research on Choral Singing
Transactions of the International Society for Music Information Retrieval (TISMIR), 3(1): 98–110, 2020.

@article{RosenzweigCWSGM20_DCS_TISMIR,
author    = {Sebastian Rosenzweig and Helena Cuesta and Christof Wei{\ss} and Frank Scherbaum and Emilia G{\'o}mez and Meinard M{\"u}ller},
title     = {{D}agstuhl {ChoirSet}: {A} Multitrack Dataset for {MIR} Research on Choral Singing},
journal   = {Transactions of the International Society for Music Information Retrieval ({TISMIR})},
volume    = {3},
number    = {1},
year      = {2020},
pages     = {98--110},
publisher = {Ubiquity Press},
doi       = {10.5334/tismir.48},
url-pdf   = {2020_RosenzweigCWSGM_DagstuhlChoirSet_TISMIR_ePrint.pdf},
url-demo  = {https://www.audiolabs-erlangen.de/resources/MIR/2020-DagstuhlChoirSet}
}

Maciej Tomczak, Min Susan Li, and Massimiliano Di Luca
Virtuoso Strings: A Dataset of String Ensemble Recordings and Onset Annotations for Timing Analysis
In Late-Breaking Demos of the International Society for Music Information Retrieval Conference (ISMIR), 2023.

@inproceedings{TomczakLL_VirtuosoStrings_ISMIR-LBD,
author      = {Maciej Tomczak and Min Susan Li and Massimiliano Di Luca},
title       = {{Virtuoso Strings}: A Dataset of String Ensemble Recordings and Onset Annotations for Timing Analysis},
booktitle   = {Late-Breaking Demos of the International Society for Music Information Retrieval Conference ({ISMIR})},
address     = {Milano, Italy},
year        = {2023}
}

Stefan Balke, Axel Berndt, and Meinard Müller
ChoraleBricks: A Modular Multi-track Dataset for Wind Music Research
Transactions of the International Society for Music Information Retrieval (TISMIR), 8(1): 39–54, 2025.

@article{BalkeBM25_ChoraleBricks_TISMIR,
author  = {Stefan Balke and Axel Berndt and Meinard M{\"u}ller},
title   = {{ChoraleBricks}: A Modular Multi-track Dataset for Wind Music Research},
journal = {Transactions of the International Society for Music Information Retrieval ({TISMIR})},
volume  = {8},
number  = {1},
pages   = {39--54},
year    = {2025},
doi     = {10.5334/tismir.252},
}

International Audio Laboratories Erlangen