Author: Jürgen Herre
Although spatial perception is far from being fully understood, it is known that directional localization of sounds depends on the evaluation of so-called spatial cues by the human auditory system (see e.g. Blauert's book on spatial hearing [1]), the most important cues being
Consequently, the fidelity of the stereo image of a coded signal depends on the coder's ability to preserve these critical cues appropriately.
For coding of high-quality stereophonic (or multi-channel) audio signals at low bit rates, joint coding techniques have proven to be extremely valuable. On the one hand, they provide mechanisms to account for binaural psychoacoustic effects; on the other hand, they can reduce the required bit rate for the stereophonic signal significantly below the bit rate needed for separate coding of the input channels.
Currently, the most common joint stereo coding techniques are Mid/Side (M/S) stereo coding [2] and Intensity Stereo coding [3] [4]. While the first method can account for binaural masking effects and achieve a certain amount of signal-dependent coding gain, the intensity stereo method provides a high potential for bit saving. Coders based on the intensity stereo principle have been described in the past for stereophonic and multi-channel coding under various names (e.g. "dynamic crosstalk" [5], "channel coupling" [6]).
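The sum/difference idea behind M/S coding can be illustrated with a minimal sketch. The 1/2 normalization and the function names are assumptions made for this illustration; actual coders apply the transform to spectral coefficients and typically switch it on and off adaptively.

```python
def ms_encode(left, right):
    """Sum/difference transform: M = (L + R) / 2, S = (L - R) / 2."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side


def ms_decode(mid, side):
    """Inverse transform: L = M + S, R = M - S."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right


def energy(x):
    """Sum of squared sample values."""
    return sum(v * v for v in x)


# Two nearly identical channels (hypothetical values):
# almost all of the energy ends up in the mid signal.
left = [1.0, -0.5, 0.25, 0.75]
right = [0.9, -0.5, 0.25, 0.70]
mid, side = ms_encode(left, right)
left_rec, right_rec = ms_decode(mid, side)
```

Without quantization the transform is invertible, so any bit-rate benefit comes from the side signal carrying little energy when the two channels are similar.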
Intensity stereo exploits the fact that the perception of high frequency sound components (e.g. above 4 kHz) mainly relies on the analysis of their energy-time envelopes [1] rather than the waveform itself. Thus, it is assumed sufficient to code the envelope of such a signal instead of its waveform. This is done by transmitting one common set of spectral coefficients ("carrier signal") that is shared among several audio channels instead of separate sets for each particular one. In the decoder, the carrier signal is scaled independently for each signal channel to match its original average envelope (or signal energy) for the respective coder frame. The scaling information is calculated and transmitted once for each group of spectral coefficients (scalefactor band). Effectively, the stereo image is recreated at the decoder side by a pan-pot-like operation for each spectral coder band.
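The carrier-plus-scaling principle described above can be sketched as follows. This is only an illustration under simplifying assumptions: the carrier is taken as the plain sum of both channels, two gains per scalefactor band are kept (real bitstreams usually transmit the carrier in one channel plus a position value), no quantization is applied, and all names and band edges are hypothetical.

```python
import math


def band_energy(coeffs, start, stop):
    """Energy of the spectral coefficients in [start, stop)."""
    return sum(c * c for c in coeffs[start:stop])


def intensity_encode(left, right, band_edges):
    """Return one shared carrier and a (left, right) gain pair per band."""
    carrier = [l + r for l, r in zip(left, right)]
    gains = []
    for start, stop in zip(band_edges[:-1], band_edges[1:]):
        e_c = band_energy(carrier, start, stop)
        if e_c == 0.0:
            # Channels cancel in the sum; this sketch simply mutes the band.
            gains.append((0.0, 0.0))
            continue
        g_l = math.sqrt(band_energy(left, start, stop) / e_c)
        g_r = math.sqrt(band_energy(right, start, stop) / e_c)
        gains.append((g_l, g_r))
    return carrier, gains


def intensity_decode(carrier, gains, band_edges):
    """Scale the shared carrier per band to recreate both channels."""
    left = [0.0] * len(carrier)
    right = [0.0] * len(carrier)
    bands = zip(gains, band_edges[:-1], band_edges[1:])
    for (g_l, g_r), start, stop in bands:
        for k in range(start, stop):
            left[k] = g_l * carrier[k]
            right[k] = g_r * carrier[k]
    return left, right


# Four spectral coefficients per channel, split into two bands.
left = [1.0, -0.5, 0.2, 0.1]
right = [0.5, 0.25, -0.4, 0.2]
band_edges = [0, 2, 4]
carrier, gains = intensity_encode(left, right, band_edges)
left_dec, right_dec = intensity_decode(carrier, gains, band_edges)
```

Each decoded band has the same energy as the original band, but within a band the two decoded channels differ only by a constant factor, which corresponds to the pan-pot-like behavior mentioned above.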
Some typical stereo image loss problems:
As a consequence of the intensity stereo coding / decoding process, all output signals reconstructed from a single carrier are scaled versions of each other, i.e. they have the same envelope fine structure for the duration of the coded block (e.g. 10-20 ms). This does not present a major problem for stationary signals or signals having similar envelope fine structures in the intensity stereo coded channels.
For transient signals with dissimilar envelopes in different channels, however, the original distribution of the envelope onsets between the coded channels cannot be recovered. Figures 2 and 3 show an example of this situation: In a stereophonic recording of an applauding audience, the individual envelopes will be very different in the right and left channel because the distinct clapping events happen at different times in the two channels (see Figure 2 for left and right channel envelopes).
After the intensity stereo encoding / decoding process, the fine time structure of the signals is mostly the same in both channels, as can be seen in Figure 3. In particular, there is a structural "cross-talk" between the channels, such that perceptually important signal onsets propagate to the opposite channel (e.g. L->R at 15 ms, R->L at 47 ms, L->R at 57 ms, L->R at 70 ms).
Consequently, the stereo image quality of the intensity stereo coded / decoded signal will decrease significantly in such cases. The spatial impression tends to narrow down and the perceived stereo image collapses into the center position. For critical signals, like the applauding audience example, the achieved quality can no longer be considered acceptable.
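The onset cross-talk can be reproduced with a toy example. The sketch below assumes a single coder block treated as one band, so that scaling all spectral coefficients by one gain is equivalent to scaling the time-domain block (the transform is linear; window overlap is ignored). The signal values and names are hypothetical.

```python
import math

block_len = 16
left = [0.0] * block_len
right = [0.0] * block_len
left[3] = 1.0     # "clap" in the left channel only
right[11] = 0.8   # later "clap" in the right channel only


def energy(x):
    """Sum of squared sample values."""
    return sum(v * v for v in x)


def onsets(x, threshold=1e-9):
    """Indices of all samples with non-negligible amplitude."""
    return [n for n, v in enumerate(x) if abs(v) > threshold]


# One shared carrier for the block, one energy-matching gain per channel.
carrier = [l + r for l, r in zip(left, right)]
g_l = math.sqrt(energy(left) / energy(carrier))
g_r = math.sqrt(energy(right) / energy(carrier))
left_dec = [g_l * c for c in carrier]
right_dec = [g_r * c for c in carrier]

print(onsets(left), onsets(right))          # [3] [11]
print(onsets(left_dec), onsets(right_dec))  # [3, 11] [3, 11]
```

The block energies of both channels are preserved, yet each decoded channel now contains the onset that originally belonged only to the other channel.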
The following sound excerpts illustrate the discussed effects:
The provided example sound files demonstrate the original applause recording as well as three sound examples with increasing deficiencies in stereo imaging quality, as would be produced by a coder without a proper control of the intensity stereo coding mechanism. Please observe the increasing loss of people applauding in the outer left and outer right seats as well as the overall lack of spatial impression and distinct reproduction of the single clap events.
Parametric coding of stereo or multi-channel signals (“Spatial Audio Coding”) is a generalization of the Intensity Stereo Coding concept and emerged successfully shortly after the year 2000. Like bandwidth extension, it contributed significantly to the state-of-the-art in offering good quality spatial audio even at very low bitrates.
Codecs with parametric coding of two or more channels generally reduce the original material to a mono (or stereo) downmix and compact parametric side information that represents the most salient perceptual aspects of the spatial sound image, including Inter-Channel Level/Intensity Differences (ICLDs/ICIDs), Inter-Channel Phase/Time Differences (ICPDs/ICTDs) and Inter-Channel Coherence/Correlation (ICC). In contrast to traditional intensity stereo coding, parametric stereo/multi-channel coding is applied to the full audio bandwidth. Like other parametric coding approaches, it is not waveform-preserving, i.e. it does not attempt to reproduce the original waveforms but produces output that sounds similar to the original signals.
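As a rough illustration of how such cues might be estimated for one parameter band, the sketch below works on complex subband samples (for instance from a filterbank or short-time Fourier transform). The exact definitions, sign conventions, and time/frequency resolution differ between schemes; the formulas and names used here are assumptions for illustration, not those of a particular standard.

```python
import cmath
import math


def spatial_cues(left_band, right_band, eps=1e-12):
    """Estimate ICLD (dB), ICPD (rad) and ICC for one parameter band.

    Inputs are sequences of complex subband samples belonging to the
    same band and time segment in the left and right channel.
    """
    e_l = sum(abs(x) ** 2 for x in left_band)
    e_r = sum(abs(x) ** 2 for x in right_band)
    # Cross-spectrum between the channels, summed over the band.
    cross = sum(l * r.conjugate() for l, r in zip(left_band, right_band))
    icld_db = 10.0 * math.log10((e_l + eps) / (e_r + eps))
    icpd = cmath.phase(cross)  # positive: left leads right (assumed convention)
    icc = abs(cross) / math.sqrt((e_l + eps) * (e_r + eps))
    return icld_db, icpd, icc


# Right channel = left channel at half amplitude, delayed in phase by pi/4.
left_band = [1 + 0j, 0 + 1j, -1 + 0j]
rotation = cmath.exp(-1j * math.pi / 4)
right_band = [0.5 * x * rotation for x in left_band]
icld_db, icpd, icc = spatial_cues(left_band, right_band)
# icld_db is about 6.02 dB, icpd is about pi/4, icc is about 1.0
```

In this example the channels are fully coherent, so the level and phase differences alone describe their relation; for ambient or reverberant material the ICC value would drop below 1.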
The first widely successful parametric stereo/multi-channel coding techniques were Binaural Cue Coding (BCC) [7] and Parametric Stereo (PS) [8], the latter being used in the MPEG-4 High-Efficiency AAC v2 (HE-AAC v2) codec [9]. MPEG Surround [10] provides a further generalization of this concept supporting efficient coding from stereo up to 3D audio formats, such as 7.1+4H, i.e. a 7.1 setup with 4 additional height speakers. Together with bandwidth extension, this technique allows good audio quality even at very low bitrates (e.g. 32 kbit/s for a stereo signal, and below 64 kbit/s for 5.1).
Parametric coding of stereo/multi-channel audio includes the following steps:
Other generalizations of the concept followed later, such as Spatial Audio Object Coding (SAOC) [11] that provides efficient parametric coding of several object (rather than channel) signals.
Generally speaking, parametric multi-channel processing does not produce fully transparent output but perceptually convincing results. Still, for very critical signals, spatial (and other) artifacts can be introduced, which are most audible when listening over headphones:
The examples demonstrate the following versions:
The first example is a stereo signal with two talkers, panned to left and right sides, respectively. In this case, merely synthesizing ICLDs already results in a signal reconstruction that is perceptually very similar to the original. However, a residual amount of spatial instability remains, and additional sound that was not present in the original tends to be perceived in the middle of the stereo image ("phantom whisperer"). Adding ICC synthesis improves on these aspects (as demonstrated by the ICLD+ICC Coded version). However, it also has the tendency of introducing other artifacts, as can be heard in the exaggerated ICLD Coded + ICC0 version.
The second and third examples are pop music excerpts that utilize not only panning, but also stereophonic instruments and reverb. In this case, merely synthesizing ICLD results in a signal sounding narrower and with less room ambience than the original. Adding ICC synthesis widens the stereo image. Without properly controlled ICC synthesis, the resulting stereo image is too wide and blurry compared to the original stereo signal. With properly coded ICLD + ICC synthesis, the spaciousness of the original stereo signal can be restored.
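The difference between ICLD-only synthesis and ICLD + ICC synthesis can be made concrete with a simplified decoder-side sketch for one band. It assumes a mono downmix and a decorrelated signal of equal energy that is orthogonal to the downmix; real decoders derive the latter with all-pass-like decorrelation filters and use more elaborate mixing rules. The mixing rule, the gain normalization, and all names below are assumptions made for illustration.

```python
import math


def synthesize_band(downmix, decorrelated, icld_db, icc):
    """Recreate left/right band signals from downmix plus decorrelated signal."""
    c = 10.0 ** (icld_db / 20.0)                 # amplitude ratio left/right
    g_l = math.sqrt(2.0) * c / math.sqrt(1.0 + c * c)
    g_r = math.sqrt(2.0) / math.sqrt(1.0 + c * c)
    # Mixing angle chosen so that the output coherence equals cos(2 * alpha).
    alpha = 0.5 * math.acos(max(-1.0, min(1.0, icc)))
    ca, sa = math.cos(alpha), math.sin(alpha)
    left = [g_l * (ca * s + sa * d) for s, d in zip(downmix, decorrelated)]
    right = [g_r * (ca * s - sa * d) for s, d in zip(downmix, decorrelated)]
    return left, right


def measure(left, right):
    """Return (ICLD in dB, normalized correlation) of two real band signals."""
    e_l = sum(x * x for x in left)
    e_r = sum(x * x for x in right)
    cross = sum(l * r for l, r in zip(left, right))
    return 10.0 * math.log10(e_l / e_r), cross / math.sqrt(e_l * e_r)


downmix = [1.0, 0.5, -0.5, -1.0]
# Hand-picked stand-in for a decorrelator output: same energy, orthogonal.
decorrelated = [0.5, -1.0, -1.0, 0.5]

# ICLD only (icc = 1.0): both outputs are scaled copies of the downmix.
left_a, right_a = synthesize_band(downmix, decorrelated, 6.0, 1.0)
# ICLD + ICC: the decorrelated signal lowers the inter-channel correlation.
left_b, right_b = synthesize_band(downmix, decorrelated, 6.0, 0.6)
icld_a, icc_a = measure(left_a, right_a)
icld_b, icc_b = measure(left_b, right_b)
```

With an ICC of 1.0 the decorrelated signal is not used at all, which corresponds to the narrower, ICLD-only rendering; choosing the ICC too low would mix in more decorrelated signal than the original contained, in line with the overly wide and blurry image described above.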
[1] J. Blauert, "Spatial Hearing", MIT Press, 1983
[2] J. D. Johnston, A. J. Ferreira: "Sum-Difference Stereo Transform Coding", IEEE ICASSP 1992, pp. 569-571
[3] R.G.v.d. Waal, R.N.J. Veldhuis, "Subband Coding of Stereophonic Digital Audio Signals", IEEE ICASSP 1991, pp. 3601-3604
[4] J. Herre, K. Brandenburg, D. Lederer, "Intensity Stereo Coding", 96th AES Convention, Amsterdam 1994, Preprint #3799
[5] G. Stoll, G. Theile, S. Nielsen, A. Silzle, M. Link, R. Sedlmayer, A. Breford, "Extension of ISO/MPEG-Audio Layer II to Multi-Channel Coding: The Future Standard for Broadcasting, Telecommunication, and Multimedia Applications", 94th AES Convention, Berlin 1993, Preprint #3550
[6] Mark Davis, "The AC-3 Multichannel Coder", 95th AES Convention, New York October 1993, Preprint #3774
[7] C. Faller and F. Baumgarte, “Binaural Cue Coding—Part II: Schemes and Applications,” IEEE Trans. Speech Audio Processing, vol. 11 (2003 Nov.).
[8] E. Schuijers, J. Breebaart, H. Purnhagen, J. Engdegård: “Low complexity parametric stereo coding”, Proc. 116th AES convention, Berlin, Germany, 2004, Paper 6073.
[9] J. Herre, M. Dietz: "Standards in a Nutshell: MPEG-4 High-Efficiency AAC Coding", IEEE Signal Processing Magazine, Volume 25, Issue 3, pp. 137-142, May 2008.
[10] J. Herre, K. Kjörling, J. Breebaart, C. Faller, S. Disch, H. Purnhagen, J. Koppens, J. Hilpert, J. Rödén, W. Oomen, K. Linzmeier, K. S. Chong: "MPEG Surround – The ISO/MPEG Standard for Efficient and Compatible Multichannel Audio Coding", Journal of the AES, Vol. 56, No. 11, November 2008, pp. 932-955
[11] J. Herre, H. Purnhagen, J. Koppens, O. Hellmuth, J. Engdegård, J. Hilpert, L. Villemoes, L. Terentiev, C. Falch, A. Hölzer, M. L. Valero, B. Resch, H. Mundt, and H. Oh: "MPEG Spatial Audio Object Coding – The ISO/MPEG Standard for Efficient Coding of Interactive Audio Scenes", Journal of the AES, Vol. 60, No. 9, September 2012, pp. 655-673