Ünal Ege Gaznepoğlu, Frank Zalkow, Mohammad Joshaghani, Emanuël A. P. Habets, Nils Peters, Christian Dittmar
Submitted to Proc. IWAENC, 2026.
Recently, neural vocoders utilizing time-frequency representations have been approaching the state-of-the-art quality of time-domain neural vocoders. Vocos is a notable example due to its computational efficiency, but its audio quality lags behind that of time-domain vocoders, and the reasons remain debated. In this study, we therefore revisit Vocos from a phase reconstruction perspective. First, we quantify the gap between time-domain and time-frequency-domain vocoders using bandlimited mel spectrograms as inputs. Then, via an ablation study, we verify that the Vocos architecture is effective for magnitude modeling, but less so for phase. Finally, we adapt the Vocos backbone to predict phase differences, a precursor for phase reconstruction, and identify that the 1D convolutional layers hinder their accurate prediction. Our findings indicate that future research needs to focus on inductive biases that allow the architecture to better model the time-frequency structure of speech signals, without sacrificing support for arbitrary input representations.
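To make the notion of phase differences more concrete, here is a minimal NumPy sketch. It assumes the differences are wrapped finite differences of the STFT phase along the time and frequency axes. This page does not state the exact definition used in the paper, so treat that as an assumption. The function names (`wrap_phase`, `phase_differences`, `integrate_time`) are hypothetical and not taken from the Vocos code base.

```python
import numpy as np


def wrap_phase(phi):
    """Wrap angles in radians to the principal interval [-pi, pi)."""
    return (phi + np.pi) % (2.0 * np.pi) - np.pi


def phase_differences(phase):
    """Wrapped finite differences of an STFT phase array.

    phase: array of shape (n_bins, n_frames), in radians.
    Returns (d_time, d_freq): differences between adjacent frames
    and between adjacent frequency bins (assumed definition).
    """
    d_time = wrap_phase(np.diff(phase, axis=1))
    d_freq = wrap_phase(np.diff(phase, axis=0))
    return d_time, d_freq


def integrate_time(phase0, d_time):
    """Rebuild the phase (modulo 2*pi) from the first frame and the
    time-direction differences by cumulative summation."""
    cum = np.cumsum(d_time, axis=1)
    full = np.concatenate([phase0[:, None], phase0[:, None] + cum], axis=1)
    return wrap_phase(full)


# Toy example with random phases standing in for a real STFT.
rng = np.random.default_rng(0)
phase = rng.uniform(-np.pi, np.pi, size=(5, 8))
d_time, d_freq = phase_differences(phase)
rebuilt = integrate_time(phase[:, 0], d_time)
```

In this toy setting, integrating exact time differences recovers the phase up to multiples of 2π. With predicted differences, any prediction error accumulates along the summation path, which is one reason accurate phase-difference prediction matters for reconstruction.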
Below are the audio samples. For each item, the first five stimuli are used in the listening test. The remaining stimuli (i.e., those after the black separator) are included for reference and comparison.
@inproceedings{2026_gaznepoglu_revisiting,
author = {Ünal Ege Gaznepoğlu and Frank Zalkow and Mohammad Joshaghani and Emanuël A. P. Habets and Nils Peters and Christian Dittmar},
booktitle = {Submitted to Proc. Int. Workshop on Acoustic Signal Enhancement (IWAENC)},
title = {Revisiting {Vocos}: That phasiness business in time-frequency neural vocoding},
keywords = {neural vocoders, phase reconstruction},
year = {2026}}