Author: Jürgen Herre
Co-Author: Sascha Dick
While the idea of bandwidth extension (BWE) had been explored for speech coding before, it emerged for perceptual audio coding around the year 2000 and has proven to be an extremely useful coding approach that can provide high reproduced audio bandwidth even at very low bitrates. Where conventional audio codecs without BWE have to limit the transmitted audio bandwidth due to the limited number of available bits, BWE-enabled codecs add some compact side information to the bitstream that allows the decoder to synthesize a perceptually convincing high-frequency (HF) portion from the transmitted low-frequency (LF) part. In this way, full-bandwidth reconstruction can be achieved from a transmitted audio bandwidth as low as 4 kHz. Like Perceptual Noise Substitution, bandwidth extension is an efficient coding method that is not waveform-preserving.
The first widely successful BWE technique is called Spectral Band Replication (SBR) [1] and was used in the MPEG-4 High-Efficiency AAC (HE-AAC) codec [2]. SBR uses a QMF-type filterbank to ‘transpose’ (copy) LF content into the HF part of the output signal. This constitutes a computationally and conceptually simple but powerful BWE scheme that greatly extended the usability of perceptual audio coding towards low bitrates (say, below 48 kbit/s down to 20 kbit/s per channel). SBR/BWE processing includes the following steps:
Other, more advanced schemes for BWE followed later. Enhanced SBR (eSBR) [3] improves the reconstruction of tonal, harmonic signals by preserving the continuation of the harmonic structure. Intelligent Gap Filling (IGF) [4] is a semi-parametric coding scheme that operates in the MDCT domain and adaptively combines waveform-preserving coding with parametric BWE as well as noise substitution techniques similar to Perceptual Noise Substitution. Thus, perceptually important HF parts can still be coded in a waveform-preserving manner, while perceptually less important portions can be parametrically reconstructed even in the lower frequency ranges. Advanced Spectral Extension (A-SPX) [5] is employed, for example, as the BWE tool in the proprietary AC-4 codec. A-SPX also operates in the QMF domain and performs BWE conceptually like SBR. However, unlike SBR and similar to IGF, A-SPX allows for very flexible interleaving of waveform-coded elements with parametrically coded elements, thus circumventing one of the fundamental limitations of SBR, namely the inability to accurately reconstruct important tonal or transient components at high frequencies.
Generally speaking, BWE processing does not produce fully transparent output, but it yields perceptually convincing results when used carefully (e.g. to recreate frequencies above 8 kHz). When used more aggressively (at crossover frequencies of 4 kHz or even below), artifacts can be introduced in the HF part.
This section includes schematic drawings as well as spectrograms based on natural signals to illustrate the artifacts.
Roughness or Noisiness
When the original HF part is rather tonal, but the reconstructed HF part is more noise-like (e.g. due to copying up noise-like sections from the LF part or applying noise substitution techniques when not appropriate), artifacts that result in a rough or noisy sound can occur. Figure 3 illustrates this on the spectrogram of a very tonal signal (accordion), for which the frequencies above 4 kHz have been artificially substituted by noise. It is evident that, even though the energy is maintained, the tonal structure of the spectrogram is lost.
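The observation that energy is maintained while tonal structure is lost can be made concrete with a small sketch. It is not taken from any codec: the 64-bin band, the partial spacing, and the helper names `spectral_flatness` and `substitute_noise` are assumptions chosen for illustration, and spectral flatness (geometric mean divided by arithmetic mean of the power spectrum) is used here as one common, simple tonality indicator.

```python
import math
import random

def spectral_flatness(power):
    """Geometric mean / arithmetic mean of a power spectrum.

    Values near 1 indicate a flat, noise-like band; values near 0 a tonal one.
    """
    eps = 1e-12
    log_mean = sum(math.log(p + eps) for p in power) / len(power)
    return math.exp(log_mean) / (sum(power) / len(power) + eps)

def substitute_noise(power, rng):
    """Replace a band by random-power bins rescaled to the same total energy."""
    # Assumption: bin powers of Gaussian noise are modeled as exponentially distributed.
    noise = [rng.expovariate(1.0) for _ in power]
    gain = sum(power) / sum(noise)
    return [gain * n for n in noise]

rng = random.Random(0)
# Hypothetical tonal HF band: 64 bins with a strong partial in every 8th bin.
tonal_hf = [1.0 if k % 8 == 0 else 1e-4 for k in range(64)]
noisy_hf = substitute_noise(tonal_hf, rng)

energy_before = sum(tonal_hf)
energy_after = sum(noisy_hf)
flatness_before = spectral_flatness(tonal_hf)
flatness_after = spectral_flatness(noisy_hf)

print(f"energy:   {energy_before:.4f} -> {energy_after:.4f}")
print(f"flatness: {flatness_before:.4f} -> {flatness_after:.4f}")
```

In this toy case the band energy is unchanged by construction, while the flatness rises sharply, which mirrors what the spectrogram shows: the same energy, but no tonal lines.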
Exaggerated tonality
When the original HF part is rather noisy but the transmitted LF part is tonal, the reconstructed HF part becomes too tonal, which can lead to "buzzing" or "sawing" artifacts. Figure 5 illustrates the spectrogram of a pop music signal, where frequencies above 4 kHz have been reconstructed by repeatedly copying up the LF part. Note how a prominent tonal line slightly above 4 kHz is introduced due to copying up a fundamental frequency from the bass section (and also at the borders of the repeated copy-up parts).
Harmonicity mismatch
When both the LF and HF parts are rather tonal, but the harmonic structure of the reconstructed signal does not match the original signal, artifacts similar to those of exaggerated tonality can occur. Moreover, tones that are placed too closely together near the crossover frequency can lead to modulation and beating artifacts. Figure 7 illustrates the spectrogram of a tonal, harmonic signal (accordion), where frequencies above 4 kHz have been reconstructed by repeatedly copying up the LF part. Note how the overall tonality is still preserved in the HF part, but the pattern does not match the harmonic continuation of the LF part as seen in the original signal.
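A short numeric sketch shows why a plain copy-up breaks the harmonic grid. It assumes an idealized harmonic tone with a hypothetical fundamental of 300 Hz, a 4 kHz crossover, and a copy-up modeled as a pure frequency shift by the crossover frequency; the function names are illustrative and real codecs choose their patch boundaries differently.

```python
def copy_up_partials(f0, crossover, top):
    """Shift the LF partials of a harmonic tone upward in blocks of `crossover` Hz.

    Returns the LF partials (below `crossover`) and the copied HF partials (below `top`).
    """
    lf = [k * f0 for k in range(1, int(crossover // f0) + 1) if k * f0 < crossover]
    hf = []
    shift = crossover
    while shift < top:
        hf += [f + shift for f in lf if f + shift < top]
        shift += crossover
    return lf, hf

def grid_offset(freq, f0):
    """Distance in Hz from `freq` to the nearest integer multiple of f0."""
    r = freq % f0
    return min(r, f0 - r)

# Assumed example values, not taken from the accordion recording.
f0, crossover, top = 300.0, 4000.0, 8000.0
lf, hf = copy_up_partials(f0, crossover, top)
offsets = [grid_offset(f, f0) for f in hf]

print("highest LF partial:", lf[-1])
print("first copied partials:", hf[:3])
print("offset from harmonic grid:", set(offsets))
```

With these values the copied partials land at 4300, 4600, ... Hz, each 100 Hz away from the original harmonic grid at 4200, 4500, ... Hz. The spacing between partials is preserved, so the HF part stays tonal, but it no longer continues the harmonic series of the LF part.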
Inappropriate Temporal Shaping
When the temporal shaping of the reconstructed signal does not match that of the original signal, pre-echo-like artifacts can occur. Figure 9 illustrates the spectrogram of a transient signal (glockenspiel), where the frequencies above 4 kHz have been substituted by noise, matching the average energy in frames of 21 ms length. Besides the loss of tonality, the sharp attacks are temporally smeared in the HF part, resembling pre-echoes with a high-pass characteristic.
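The smearing effect of frame-wise energy matching can be sketched in a few lines. This is an envelope-only toy, not the processing used for the figure: the 48 kHz sampling rate, the decaying burst, and the helper names are assumptions. Noise is generated per 21 ms frame and scaled so that its frame energy equals that of the original, which spreads energy evenly across each frame.

```python
import math
import random

FS = 48000                    # assumed sampling rate in Hz
FRAME = round(0.021 * FS)     # 21 ms frame length in samples

def frame_energies(x, n):
    """Energy of x in consecutive frames of n samples."""
    return [sum(v * v for v in x[i:i + n]) for i in range(0, len(x), n)]

def noise_with_frame_energy(x, n, rng):
    """Noise whose energy matches x in each frame of n samples (stationary within a frame)."""
    out = []
    for i in range(0, len(x), n):
        frame = x[i:i + n]
        noise = [rng.gauss(0.0, 1.0) for _ in frame]
        target = sum(v * v for v in frame)
        gain = math.sqrt(target / sum(v * v for v in noise))
        out += [gain * v for v in noise]
    return out

# Hypothetical transient: silence, then a decaying burst that starts mid-frame.
onset = FRAME + FRAME // 2
x = [0.0] * onset + [math.exp(-t / 200.0) for t in range(4 * FRAME - onset)]
y = noise_with_frame_energy(x, FRAME, random.Random(1))

# Energy located between the frame start and the true onset.
pre_onset_original = sum(v * v for v in x[FRAME:onset])
pre_onset_substituted = sum(v * v for v in y[FRAME:onset])

print("pre-onset energy, original:   ", pre_onset_original)
print("pre-onset energy, substituted:", round(pre_onset_substituted, 2))
```

The original has no energy before the onset, whereas the substituted signal carries a substantial share of the frame energy ahead of the attack. Shorter frames or a finer temporal envelope would reduce this effect at the cost of more side information.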
For the following sound examples, intentionally degraded signals were generated that simulate clearly audible artifacts, in order to sensitize listeners to more subtle coder-induced artifacts of the same nature. To demonstrate noisiness, the HF spectrum is replaced by noise of the same spectral envelope. To demonstrate exaggerated tonality and harmonicity mismatch, the HF spectrum is replaced by a scaled version of the LF spectrum. Further details on the simulation of coding artifacts for demonstration and research purposes can be found in [6].
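As a rough illustration of replacing the HF spectrum by a scaled version of the LF spectrum, the sketch below operates on a list of power-spectrum bins. It is a minimal model under stated assumptions, not the procedure from [6]: the function name `copy_up`, the crossover bin, the band size used for envelope matching, and the test spectrum are all hypothetical.

```python
def copy_up(power, crossover_bin, band_size):
    """Replace bins at and above crossover_bin by LF bins.

    The LF region is repeated as often as needed, and each HF band of
    band_size bins is rescaled to the energy of the original HF band.
    """
    out = list(power[:crossover_bin])
    n = len(power)
    for start in range(crossover_bin, n, band_size):
        stop = min(start + band_size, n)
        # Source bins: wrap around the LF region so the copy-up repeats.
        src = [power[(k - crossover_bin) % crossover_bin] for k in range(start, stop)]
        target = sum(power[start:stop])
        src_energy = sum(src)
        gain = target / src_energy if src_energy > 0 else 0.0
        out += [gain * s for s in src]
    return out

# Hypothetical spectrum: tonal LF part (peak every 4th bin), flat noise-like HF part.
power = [1.0 if k % 4 == 0 else 0.01 for k in range(16)] + [0.05] * 48
out = copy_up(power, crossover_bin=16, band_size=8)

band_energy_original = [sum(power[s:s + 8]) for s in range(16, 64, 8)]
band_energy_copied = [sum(out[s:s + 8]) for s in range(16, 64, 8)]
hf_peak = max(out[16:])

print("band energies preserved:",
      all(abs(a - b) < 1e-9 for a, b in zip(band_energy_original, band_energy_copied)))
print("largest HF bin: original 0.05, after copy-up", round(hf_peak, 3))
```

The band energies, and thus the coarse spectral envelope, are preserved, but the formerly flat HF part now contains pronounced peaks inherited from the LF part. This is the mechanism behind the exaggerated tonality described above.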
Accordion: The accordion has a high tonality with a dense harmonic structure. Improper BWE can lead to different artifacts here: If the reconstructed HF part is too noise-like, the harmonic structure is lost and the noise can be perceived as an additional signal, which may sound like "air additionally leaking out of the accordion". Alternatively, if the tonality is preserved but the harmonic structure of the LF part is not properly continued by the reconstructed HF part, the accordion's timbre may sound too harsh, resembling a resonant or metallic sound. However, if the crossover frequency is high enough (e.g. 8 kHz), the ear cannot resolve the harmonic fine structure, resulting in acceptable quality.
Glockenspiel: The glockenspiel also has high tonality, but not a dense harmonic structure like the accordion. Additionally, the glockenspiel has a distinct temporal structure with clear attacks at the onset of each note. As the temporal structure of the attacks is smeared in the reconstructed versions, artifacts from inappropriate temporal shaping are audible, making the attacks sound less sharp and "wet". Moreover, replacing the tonal lines in the HF part by noise is perceived as an additional signal, similar to the artifacts demonstrated for the accordion. If the sparse tonal lines in the HF part are reconstructed from the much denser structure in the LF part, ringing artifacts are audible that may sound as if loose parts were resonating in the glockenspiel.
Funky: This pop music signal is more complex than the isolated accordion and glockenspiel. It has tonal components in the LF part, whereas the HF part is rather noise-like due to the reverb and decaying hi-hats. Therefore, HF noise substitution works rather well above reasonable crossover frequencies (e.g. 8 kHz). However, the copy-up process introduces tonality from the LF part into the HF part, resulting in a resonant or sizzling sound.
[1] P. Ekstrand: "Bandwidth extension of audio signals by spectral band replication", in Proc. 1st IEEE Benelux Workshop on Model based Processing and Coding of Audio (MPCA-2002), Leuven, Belgium, Nov. 15, 2002, pp. 73–79.
[2] J. Herre, M. Dietz: "Standards in a Nutshell: MPEG-4 High-Efficiency AAC Coding", IEEE Signal Processing Magazine, Vol. 25(3), pp. 137–142, May 2008.
[3] F. Nagel, S. Disch: "A harmonic bandwidth extension method for audio codecs", in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '09), Taipei, Taiwan, 19–24 April 2009.
[4] S. Disch, A. Niedermeier, C.R. Helmrich, C. Neukam, K. Schmidt, R. Geiger, J. Lecomte, F. Ghido, F. Nagel, B. Edler: "Intelligent Gap Filling in Perceptual Transform Coding of Audio", in Proc. AES 141st Convention, October 2016.
[5] K. Kjörling et al.: "AC-4 – the next generation audio codec", in Proc. AES 140th Convention, Paris, France, June 4–7, 2016, preprint #9491.
[6] S. Dick, N. Schinkel-Bielefeld, S. Disch: "Generation and Evaluation of Isolated Audio Coding Artifacts", in Proc. AES 143rd Convention, October 2017.