Author: Gerald Schuller
Co-Author: Jürgen Herre
The goal of a high compression ratio in perceptual audio coding has historically led to the use of transforms with a large block size or filter banks with many bands (i.e. high frequency resolution). Such spectral decompositions are well suited to obtaining high coding gains for the mostly stationary parts of music signals. On the other hand, due to the so-called uncertainty principle, a high frequency resolution implies a low temporal resolution of the time/frequency representation. Thus, a high frequency resolution results in poor control over the temporal shape of the quantization noise in the decoded audio signal, which may lead to coding artifacts for less stationary audio material. This can be perceived, for instance, as reverberation or echoiness in speech signals, or as pre-echoes at the "attack" portions of transient signals such as castanets. More background information on this effect can be found in the section on the Pre-Echo phenomenon.
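The trade-off can be made concrete with a little arithmetic. The following is a minimal sketch, assuming an MDCT-style filter bank whose window spans twice as many samples as it has bands, and assuming a sampling rate of 44.1 kHz (the rate is not stated in the text and is used here only for illustration). The function name `mdct_resolution` is hypothetical.

```python
def mdct_resolution(n_bands, sample_rate_hz=44100.0):
    """Return (window length in ms, band spacing in Hz) for an n_bands filter bank.

    Assumes a window of 2 * n_bands samples and bands that evenly divide
    the range from 0 Hz to half the sampling rate.
    """
    window_samples = 2 * n_bands
    window_ms = 1000.0 * window_samples / sample_rate_hz
    band_spacing_hz = sample_rate_hz / (2.0 * n_bands)
    return window_ms, band_spacing_hz


for n_bands in (2048, 1024, 512, 256):
    window_ms, spacing_hz = mdct_resolution(n_bands)
    print(f"{n_bands:5d} bands: window {window_ms:5.1f} ms, "
          f"band spacing {spacing_hz:6.2f} Hz")
```

Under these assumptions the product of window duration and band spacing is constant: halving the band spacing doubles the time span over which quantization noise can spread.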
To obtain better control over the temporal shaping of the quantization noise, a number of techniques were proposed over time:
To demonstrate the perceptual quality of artifacts caused by temporal smearing of quantization noise, as they may appear in perceptual audio coding, several sound excerpts were generated using a MATLAB program. The signals illustrate the effect of quantization noise that is injected into the signal's spectral coefficients for various frequency resolutions (number of filter bank bands) and overall noise levels. This resembles the behavior of a simple perceptual audio coder that is not equipped with any of the precautions for controlling the temporal shape of the coding distortion discussed previously.
The signals were generated using the Modified Discrete Cosine Transform (MDCT) [6] with a sine window; the MDCT is used in many of today's coding schemes. Two parameters were varied across the different signal versions: the frequency resolution (the number of filter bank bands, and with it the window size) and the overall level of the injected noise.
The signals are based on a speech recording ("German Male Speech") from the SQAM CD of the European Broadcasting Union (EBU). This recording has proven to be critical for many coding schemes and was therefore used in many official listening tests for coder evaluation. It is recommended to listen to these samples over headphones, since room reverberation might otherwise mask the artifacts.
Play Original Signal (unprocessed)

| Filter bank configuration | Processed (0 dB) | Processed (-3 dB) | Processed (-6 dB) | Processed (-9 dB) |
|---|---|---|---|---|
| 2048 filter bank bands, window size: 4096 samples (92.9 ms) | Play | Play | Play | Play |
| 1024 filter bank bands, window size: 2048 samples (46.4 ms) | Play | Play | Play | Play |
| 512 filter bank bands, window size: 1024 samples (23.2 ms) | Play | Play | Play | Play |
| 256 filter bank bands, window size: 512 samples (11.6 ms) | Play | Play | Play | Play |
The following effects can be noticed when listening to the sound excerpts: The temporal "smearing" of the distortion introduces a reverberant quality into the speech signal which increases significantly with the length of the filter bank window. For large window sizes, the effect is even audible at rather small distortion levels. Accordingly, proper use of additional measures for preventing temporal unmasking is of high importance for audio coders with a high frequency resolution / number of filter bank channels.
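The experiment described above can be approximated with the following sketch. It is not the MATLAB program used for the sound excerpts. It is a direct (slow, matrix-based) MDCT with a sine window, written under stated assumptions: the function names are hypothetical, and the noise model (uniform noise scaled by each coefficient's magnitude and by a gain in dB) is an assumption chosen only so that the noise roughly follows the signal spectrum.

```python
import numpy as np


def sine_window(n_bands):
    """Sine window of length 2 * n_bands (satisfies the TDAC condition)."""
    n = np.arange(2 * n_bands)
    return np.sin(np.pi * (n + 0.5) / (2 * n_bands))


def mdct_basis(n_bands):
    """Cosine basis of shape (n_bands, 2 * n_bands) for a direct MDCT."""
    n = np.arange(2 * n_bands)
    k = np.arange(n_bands)
    return np.cos(np.pi / n_bands * np.outer(k + 0.5, n + 0.5 + n_bands / 2))


def noisy_mdct_roundtrip(x, n_bands, noise_db=None, seed=0):
    """Analyse x with an MDCT, optionally add noise to the coefficients,
    and resynthesise by windowed overlap-add.

    noise_db=None gives plain analysis/synthesis (perfect reconstruction).
    The noise model is an assumption, not taken from the original program.
    """
    x = np.asarray(x, dtype=float)
    rng = np.random.default_rng(seed)
    win = sine_window(n_bands)
    basis = mdct_basis(n_bands)

    # Pad so that every input sample is covered by two overlapping windows.
    pad_end = (-len(x)) % n_bands
    xp = np.concatenate([np.zeros(n_bands), x, np.zeros(pad_end + n_bands)])
    y = np.zeros_like(xp)

    for start in range(0, len(xp) - n_bands, n_bands):
        block = xp[start:start + 2 * n_bands]
        coeffs = basis @ (win * block)                      # forward MDCT
        if noise_db is not None:
            gain = 10.0 ** (noise_db / 20.0)
            noise = rng.uniform(-1.0, 1.0, n_bands)
            coeffs = coeffs + gain * np.abs(coeffs) * noise  # inject noise
        # Inverse MDCT, synthesis window, overlap-add.
        y[start:start + 2 * n_bands] += (2.0 / n_bands) * win * (basis.T @ coeffs)

    return y[n_bands:n_bands + len(x)]
```

Because the noise is added per block in the frequency domain, its time-domain footprint is the full synthesis window. Feeding in a signal that is silent and then starts abruptly shows the effect: with a large number of bands, the error appears well before the onset, while with few bands it stays close to it.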
To demonstrate the character of this coding artifact with a real coder, the number of subbands in an audio coder was artificially fixed to 1024, so that it could not switch to its 128-band mode. This increases the coding artifacts and makes them more audible. The test signal again consists of German male speech, a typical test signal where artifacts are easily produced and detected. The signal has been coded at a sampling rate of 32 kHz and a bit rate of 64 kb/s (for the stereo file). Again, it is recommended to listen to these samples over headphones, since real room reverberation might otherwise mask the artifacts:
It should be easy to hear that the encoded and decoded signal exhibits more "reverberation". This is because perceptual audio coders use a psycho-acoustic model to spectrally shape the quantization noise, so that the noise spectrum resembles that of the audio signal. Because of the lack of temporal control in the 1024-band mode, this noise also appears just before and after the sound elements of speech, where it sounds like room echoes.
[1] B. Edler: "Codierung von Audiosignalen mit überlappender Transformation und adaptiven Fensterfunktionen", Frequenz, Vol. 43, pp. 252-256, 1989
[2] J. Herre, J. D. Johnston: "Enhancing the Performance of Perceptual Audio Coders by Using Temporal Noise Shaping (TNS)", 101st AES Convention, Los Angeles 1996, Preprint 4384
[3] T. Vaupel: "Ein Beitrag zur Transformationscodierung von Audiosignalen unter Verwendung der Methode der 'Time Domain Aliasing Cancellation (TDAC)' und einer Signalkompandierung im Zeitbereich", PhD Thesis, Universität-Gesamthochschule Duisburg, Germany, 1991
[4] B. Edler, C. Faller, G. Schuller: "Perceptual Audio Coding Using a Time-Varying Linear Pre- and Post-Filter", 109th AES Convention, Los Angeles 2000, Preprint 5274
[5] A. Biswas, P. Hedelin, L. Villemoes, V. Melkote: "Temporal Noise Shaping with Companding", Interspeech 2018, Hyderabad, India, September 2-6, 2018, Proceedings, pp. 3548-3552
[6] J. Princen, A. Johnson, A. Bradley: "Subband/Transform Coding Using Filter Bank Designs Based on Time Domain Aliasing Cancellation", IEEE ICASSP 1987, pp. 2161-2164
Note: Some of the audio source excerpts have been taken from the SQAM CD [Cat. No. 422204-2] by kind permission of the European Broadcasting Union (EBU)