Benchmarking Neural Speech Codec Intelligibility with SITool

This is the accompanying website for the following paper:

  1. Anna Leschanowsky, Kishor Kayyar Lakshminarayana, Anjana Rajasekhar, Lyonel Behringer, Ibrahim Kilinc, Guillaume Fuchs, and Emanuël A. P. Habets
    Benchmarking Neural Speech Codec Intelligibility with SITool
    In Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH): 5488–5492, 2025.
    @inproceedings{LeschanowskyKRBKFH25_SpeechCodecIntSITool_Interspeech,
    author      = {Anna Leschanowsky and Kishor Kayyar Lakshminarayana and Anjana Rajasekhar and Lyonel Behringer and Ibrahim Kilinc and Guillaume Fuchs and Emanu{\"e}l A.\ P. Habets},
    title       = {Benchmarking Neural Speech Codec Intelligibility with {SITool}},
    booktitle   = {Proceedings of the Conference of the International Speech Communication Association (INTERSPEECH)},
    address     = {Rotterdam, The Netherlands},
    year        = {2025},
    pages       = {5488--5492},
    doi         = {10.21437/Interspeech.2025-984},
    url-pdf     = {https://www.isca-archive.org/interspeech_2025/leschanowsky25_interspeech.pdf},
    }

Abstract

Speech intelligibility assessment is essential for evaluating neural speech codecs, yet most evaluation efforts focus on overall quality rather than intelligibility. Only a few publicly available tools exist for conducting standardized intelligibility tests, like the Diagnostic Rhyme Test (DRT) and Modified Rhyme Test (MRT). We introduce the Speech Intelligibility Toolkit for Subjective Evaluation (SITool), a Flask-based web application for conducting DRT and MRT in laboratory and crowdsourcing settings. We use SITool to benchmark 13 neural and traditional speech codecs, analyzing phoneme-level degradations and comparing subjective DRT results with objective intelligibility metrics. Our findings show that, while neural speech codecs can outperform traditional ones in subjective intelligibility, only STOI and ESTOI – not WER – significantly correlate with subjective results, although they struggle to capture gender and wordlist-specific variations observed in subjective evaluations.
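
As an illustration of the comparison described in the abstract, the following minimal Python sketch scores two-alternative DRT responses with the standard chance correction, 100 * (right - wrong) / total, and correlates the per-condition scores with an objective metric. The condition names, response counts, and STOI-like values are invented placeholders, not results from the paper. The use of Pearson correlation is an assumption, and the analysis in the paper may differ.

```python
from math import sqrt


def drt_score(n_correct: int, n_wrong: int) -> float:
    """Chance-corrected DRT score in percent for a two-alternative test.

    Pure guessing (half right, half wrong) maps to 0, all correct to 100.
    """
    total = n_correct + n_wrong
    if total == 0:
        raise ValueError("at least one response is required")
    return 100.0 * (n_correct - n_wrong) / total


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    return s_xy / sqrt(s_xx * s_yy)


# Hypothetical data: (correct, wrong) response counts per condition.
responses = {"cond_A": (95, 5), "cond_B": (80, 20), "cond_C": (60, 40)}
# Hypothetical objective scores (for example STOI) for the same conditions.
objective = {"cond_A": 0.90, "cond_B": 0.80, "cond_C": 0.55}

conditions = sorted(responses)
subjective = [drt_score(*responses[c]) for c in conditions]
metric = [objective[c] for c in conditions]

for cond, score in zip(conditions, subjective):
    print(f"{cond}: DRT score = {score:.1f}")
print(f"Pearson r = {pearson(subjective, metric):.3f}")
```

With real data, the per-condition scores would typically be computed separately per listener, talker gender, and word list before aggregating, since those are the variations the abstract reports as hard for objective metrics to capture.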

Code

SITool is open source and available at https://github.com/audiolabs/SITool.

Audio Samples

Below are a few of the samples used in the intelligibility evaluations.
The original audio files were obtained from "Crowdsourced Multilingual Speech Intelligibility Testing" by Laura Lechler and Kamil Wojcicki, available under the Creative Commons Attribution Share Alike 4.0 International license.
We processed these audio files with various speech codecs. We also added cough sounds to some of the audio files. The cough sounds were taken from "Dataset of sounds of symptoms associated with respiratory sickness" by Shyamal Patel, Adan Rivas, and Dimitrios Psaltos, available under the "CC-By Attribution 4.0 International" license.
These samples are shared under license.
Condition    Peak (M)    Teak (F)    Jaws (M)    Gauze (F)
Original
Anchor
Codec2 700 bps
Codec2 2400 bps
AMR 4750 bps
AMRWB 6600 bps
EVS 8000 bps
LPCnet 1600 bps
Lyra 3200 bps
DAC-S 1500 bps
SNAC 980 bps
Mimi 550 bps
Mimi 1100 bps
SemantiCodec 340 bps
SemantiCodec 680 bps