AudioLabs - Revisiting Vocos: That phasiness business in time-frequency neural vocoding

Ünal Ege Gaznepoğlu, Frank Zalkow, Mohammad Joshaghani, Emanuël A. P. Habets, Nils Peters, Christian Dittmar

Submitted to Proc. IWAENC, 2026.

Abstract

Recently, neural vocoders utilizing time-frequency representations have been approaching the state-of-the-art quality of time-domain neural vocoders. Vocos is a notable example due to its computational efficiency, but its audio quality lags behind that of time-domain vocoders, and the reasons remain debated. In this study, we therefore revisit Vocos from a phase reconstruction perspective. First, we quantify the gap between time-domain and time-frequency-domain vocoders using bandlimited mel spectrograms as inputs. Next, via an ablation study, we verify that the Vocos architecture is effective for magnitude modeling, but less so for phase. We then adapt the Vocos backbone to predict phase differences, a precursor for phase reconstruction, and find that the 1D convolutional layers hinder their accurate prediction. Our findings indicate that future research needs to focus on inductive biases that allow the architecture to better model the time-frequency structure of speech signals without sacrificing support for arbitrary input representations.
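To make the notion of "phase differences" more concrete, here is a minimal sketch in plain Python. The abstract does not state how the phase differences are defined, so this example assumes wrapped finite differences of the STFT phase along time and along frequency. The function names (`stft`, `wrap`, `phase_differences`) and the parameters (`n_fft`, `hop`) are hypothetical and chosen only for illustration; they are not taken from the paper or from the Vocos code base.

```python
import cmath
import math


def stft(signal, n_fft=64, hop=16):
    """Naive Hann-windowed STFT (assumed setup); returns frames[t][k], k = 0..n_fft//2."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        seg = [signal[start + n] * window[n] for n in range(n_fft)]
        frames.append([
            sum(seg[n] * cmath.exp(-2j * math.pi * k * n / n_fft) for n in range(n_fft))
            for k in range(n_fft // 2 + 1)
        ])
    return frames


def wrap(phi):
    """Wrap an angle to the principal interval around zero, [-pi, pi)."""
    return (phi + math.pi) % (2 * math.pi) - math.pi


def phase_differences(frames):
    """Wrapped phase differences along time (dt) and along frequency (df)."""
    phase = [[cmath.phase(x) for x in frame] for frame in frames]
    n_bins = len(phase[0])
    dt = [[wrap(phase[t][k] - phase[t - 1][k]) for k in range(n_bins)]
          for t in range(1, len(phase))]
    df = [[wrap(phase[t][k + 1] - phase[t][k]) for k in range(n_bins - 1)]
          for t in range(len(phase))]
    return dt, df


# Usage: a tone centered on bin 5 advances its phase by 2*pi*5*hop/n_fft per frame,
# which wraps to roughly pi/2 for n_fft=64 and hop=16.
tone = [math.cos(2 * math.pi * 5 * n / 64) for n in range(128)]
dt, df = phase_differences(stft(tone))
print(dt[0][5])
```

For a stationary tone, the time difference at the tone's bin is constant across frames, which is one reason such differences can be easier to describe than the raw phase itself. Whether the paper uses this definition or a related one is not stated on this page.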

Audio Samples

Below are the audio samples. For each item, the first five stimuli were used in the listening test. The remaining stimuli (i.e., those after the black separator) are included for reference and comparison.

TestItem_001_FEMALE_REF.wav

TestItem_002_FEMALE_REF.wav

TestItem_003_FEMALE_REF.wav

TestItem_004_FEMALE_REF.wav

TestItem_005_FEMALE_REF.wav

TestItem_006_FEMALE_REF.wav

TestItem_007_FEMALE_REF.wav

TestItem_001_MALE_REF.wav

TestItem_002_MALE_REF.wav

TestItem_003_MALE_REF.wav

TestItem_004_MALE_REF.wav

TestItem_005_MALE_REF.wav

TestItem_006_MALE_REF.wav

TestItem_007_MALE_REF.wav

Paper (contains some minor fixes to the original submission)

@inproceedings{2026_gaznepoglu_revisiting,
author = {Ünal Ege Gaznepoğlu and Frank Zalkow and Mohammad Joshaghani and Emanuël A. P. Habets and Nils Peters and Christian Dittmar},
booktitle = {Submitted to Proc. Int. Workshop on Acoustic Signal Enhancement (IWAENC)},
title = {Revisiting {Vocos}: That phasiness business in time-frequency neural vocoding},
keywords = {neural vocoders, phase reconstruction},
year = {2026}}

International Audio Laboratories Erlangen