Bandwidth Extension

Author: Jürgen Herre
Co-Author: Sascha Dick

Background: Bandwidth Extension

While the idea of bandwidth extension (BWE) had been explored for speech coding earlier, it emerged in perceptual audio coding around the year 2000 and has proven to be an extremely useful coding approach that can provide high reproduced audio bandwidth even at very low bitrates. Where conventional audio codecs without BWE have to limit the transmitted audio bandwidth because too few bits are available, BWE-enabled codecs add compact side information to the bitstream that allows the decoder to synthesize a perceptually convincing high-frequency (HF) portion from the transmitted low-frequency (LF) part. In this way, full-bandwidth reconstruction can be achieved from a transmitted audio bandwidth as low as 4 kHz. Like Perceptual Noise Substitution, bandwidth extension is an efficient coding method that is not waveform-preserving.

The first widely successful BWE technique is called Spectral Band Replication (SBR) [1] and was used in the MPEG-4 High-Efficiency AAC (HE-AAC) codec [2]. SBR uses a QMF-type filterbank to ‘transpose’ (copy) LF content into the HF part of the output signal. This constitutes a computationally and conceptually simple but powerful BWE scheme that greatly extended the usability of perceptual audio coding towards low bitrates (say, from below 48 kbit/s down to 20 kbit/s per channel). SBR/BWE processing includes the following steps:

Extraction of the properties of the original signal high-frequency part (most notably its time/frequency envelope, but also other properties such as tonality information)
Compact quantization, coding and transmission of this information as additional bitstream side information
Transmission of the original LF signal part using a conventional audio codec approach
Regeneration of the (not transmitted) HF signal part by deriving it from the transmitted LF components using the bitstream side information. SBR uses a QMF filterbank and a copy-up operation (‘transposing’) into the target frequency range for this purpose.
Adjustment of the properties of this raw HF content to match the properties of the original HF signal (e.g. its time/frequency envelope, tonality, …)
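The steps above can be sketched in a few lines of Python. This is only a minimal illustration under strong simplifying assumptions: it works on a single FFT frame instead of the QMF filterbank used by SBR, it transmits just one energy value per HF band as side information, and it ignores tonality adjustment. The function names (`band_edges`, `encode_hf_envelope`, `decode_bwe`), the band layout, and the test signal are hypothetical and not taken from any standard.

```python
import numpy as np

def band_edges(start_bin, stop_bin, n_bands):
    """Split [start_bin, stop_bin) into n_bands contiguous bands (hypothetical layout)."""
    return np.linspace(start_bin, stop_bin, n_bands + 1).astype(int)

def encode_hf_envelope(spectrum, xover_bin, n_bands=8):
    """Encoder side: keep the LF bins and describe the HF part only by band energies."""
    edges = band_edges(xover_bin, len(spectrum), n_bands)
    energies = np.array([np.sum(np.abs(spectrum[a:b]) ** 2)
                         for a, b in zip(edges[:-1], edges[1:])])
    return spectrum[:xover_bin].copy(), energies

def decode_bwe(lf, energies, n_bins):
    """Decoder side: copy the LF bins upwards, then rescale each HF band."""
    xover_bin = len(lf)
    out = np.zeros(n_bins, dtype=complex)
    out[:xover_bin] = lf
    # Copy-up ("transposition"): tile the LF bins repeatedly into the HF range.
    reps = int(np.ceil((n_bins - xover_bin) / xover_bin))
    out[xover_bin:] = np.tile(lf, reps)[: n_bins - xover_bin]
    # Envelope adjustment: match the energy of each band to the transmitted value.
    edges = band_edges(xover_bin, n_bins, len(energies))
    for a, b, target in zip(edges[:-1], edges[1:], energies):
        raw = np.sum(np.abs(out[a:b]) ** 2)
        if raw > 0:
            out[a:b] *= np.sqrt(target / raw)
    return out

# Usage: one frame of a 1 kHz tone plus weak noise, crossover at 4 kHz (assumed values).
fs, n = 32000, 1024
t = np.arange(n) / fs
rng = np.random.default_rng(0)
x = np.sin(2 * np.pi * 1000 * t) + 0.1 * rng.standard_normal(n)
X = np.fft.rfft(x * np.hanning(n))
xover_bin = int(round(4000 * n / fs))

lf, env = encode_hf_envelope(X, xover_bin)
Y = decode_bwe(lf, env, len(X))
```

In this sketch the decoder output `Y` has exactly the transmitted band energies above the crossover, but its fine structure is that of the copied LF bins, which is why the result is not waveform-preserving.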

Figure 1: Illustration of bandwidth extension

Other, more advanced BWE schemes followed later. Enhanced SBR (eSBR) [3] improves the reconstruction of tonal, harmonic signals by preserving the continuation of the harmonic structure. Intelligent Gap Filling (IGF) [4] is a semi-parametric coding scheme that operates in the MDCT domain and adaptively combines waveform-preserving coding with parametric BWE as well as noise substitution techniques similar to Perceptual Noise Substitution. Thus, perceptually important HF parts can still be coded in a waveform-preserving manner, while perceptually less important portions can be parametrically reconstructed even in the lower frequency ranges. Advanced Spectral Extension (A-SPX) [5] is employed, for example, as the BWE tool in the proprietary AC-4 codec. A-SPX also operates in the QMF domain and performs BWE conceptually like SBR. However, unlike SBR and similar to IGF, A-SPX allows very flexible interleaving of waveform-coded elements with parametrically coded elements, thus circumventing one of the fundamental limitations, namely the inability to accurately reconstruct important tonal or transient components at high frequencies.

Artifacts Introduced by Improper Bandwidth Extension

Generally speaking, BWE processing does not produce fully transparent output, but it yields perceptually convincing results when used carefully (e.g. to recreate frequencies above 8 kHz). When used more aggressively (at crossover frequencies of 4 kHz or even below), artifacts can be introduced in the HF part.

This section includes schematic drawings as well as interactive spectrograms based on natural signals to illustrate the artifacts. Click on the spectrograms to toggle view between original and degraded version.

Roughness or Noisiness:

When the original HF part is rather tonal, but the reconstructed HF part is more noise-like (e.g. due to copying up noise-like sections from the LF part or applying noise substitution techniques when not appropriate), artifacts that result in a rough or noisy sound can occur. Figure 3 illustrates this on the spectrogram of a very tonal signal (accordion), for which the frequencies above 4 kHz have been artificially substituted by noise. It is evident that, even though the energy is maintained, the tonal structure of the spectrogram is lost.

Figure 2: Noisiness due to bandwidth extension

Figure 3: Spectrogram illustrating noisiness in "Accordion"
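The effect described above, where the energy is maintained but the tonal structure is lost, can be reproduced numerically. The following sketch is an illustration under assumptions of our own: it uses a synthetic harmonic tone rather than the accordion recording, replaces all bins above 4 kHz by noise with the same total HF energy, and uses spectral flatness (geometric mean divided by arithmetic mean of the power spectrum) as a simple stand-in for a tonality measure. The function names are hypothetical.

```python
import numpy as np

def spectral_flatness(power):
    """Geometric / arithmetic mean of a power spectrum (near 0: tonal, near 1: noise-like)."""
    power = np.maximum(power, 1e-12)  # avoid log(0)
    return np.exp(np.mean(np.log(power))) / np.mean(power)

def substitute_hf_by_noise(spectrum, xover_bin, rng):
    """Replace all bins from xover_bin upwards by noise with the same total HF energy."""
    out = spectrum.copy()
    n_hf = len(spectrum) - xover_bin
    noise = rng.standard_normal(n_hf) + 1j * rng.standard_normal(n_hf)
    target = np.sum(np.abs(spectrum[xover_bin:]) ** 2)
    noise *= np.sqrt(target / np.sum(np.abs(noise) ** 2))
    out[xover_bin:] = noise
    return out

# Synthetic harmonic tone: f0 = 500 Hz with 24 harmonics (assumed test signal).
fs, n = 32000, 2048
t = np.arange(n) / fs
x = sum(np.sin(2 * np.pi * k * 500 * t) / k for k in range(1, 25))
X = np.fft.rfft(x * np.hanning(n))
xover_bin = int(round(4000 * n / fs))

Y = substitute_hf_by_noise(X, xover_bin, np.random.default_rng(1))
hf_orig = np.abs(X[xover_bin:]) ** 2
hf_noise = np.abs(Y[xover_bin:]) ** 2

print("HF energy    original / substituted:", hf_orig.sum(), hf_noise.sum())
print("HF flatness  original / substituted:",
      spectral_flatness(hf_orig), spectral_flatness(hf_noise))
```

The two energies agree, while the flatness of the substituted HF part is far higher than that of the original, which mirrors what the spectrogram shows.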

Exaggerated tonality

When the original HF part is rather noisy but the transmitted LF part is tonal, the reconstructed HF part can become too tonal, which can lead to "buzzing" or "sawing" artifacts. Figure 5 illustrates the spectrogram of a pop music signal, where frequencies above 4 kHz have been reconstructed by repeatedly copying up the LF part. Note how a prominent tonal line slightly above 4 kHz is introduced by copying up a fundamental frequency from the bass section (further tonal lines appear at the borders of the repeated copy-up parts).

Figure 5: Spectrogram illustrating exaggerated tonality in pop music item "Funky"

Harmonicity mismatch

When both the LF and HF parts are rather tonal, but the harmonic structure of the reconstructed signal does not match the original signal, artifacts similar to those of exaggerated tonality can occur. Moreover, tones that are placed too closely together near the crossover frequency can lead to modulation and beating artifacts. Figure 7 illustrates the spectrogram of a tonal, harmonic signal (accordion), where frequencies above 4 kHz have been reconstructed by repeatedly copying up the LF part. Note how the overall tonality is still preserved in the HF part, but the pattern does not match the harmonic continuation of the LF part as seen in the original signal.

Figure 7: Spectrogram illustrating harmonicity mismatch in "Accordion"
Note how the positions of the tonal lines in the HF part shift.
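A short calculation shows why copied partials leave the harmonic grid. The sketch below assumes the simplest possible patching, in which the LF range is shifted upwards by integer multiples of the crossover frequency; real codecs may patch differently. A partial at k·f0 then lands at k·f0 + m·f_x, which is a harmonic of f0 only if m·f_x is itself a multiple of f0. The function names and the example fundamentals are hypothetical.

```python
import numpy as np

def copied_partials(f0, xover_hz, f_max):
    """Frequencies of LF harmonics after repeated copy-up by multiples of xover_hz."""
    lf = np.arange(f0, xover_hz, f0)      # harmonics below the crossover
    patches = []
    shift = xover_hz
    while shift < f_max:
        patch = lf + shift
        patches.append(patch[patch < f_max])
        shift += xover_hz
    return np.concatenate(patches)

def distance_to_harmonic_grid(freqs, f0):
    """Distance in Hz from each frequency to the nearest multiple of f0."""
    return np.abs(freqs - f0 * np.round(freqs / f0))

xover = 4000.0

# f0 = 440 Hz: 4000 Hz is not a multiple of 440 Hz, so every copied partial is off-grid.
hf_440 = copied_partials(440.0, xover, 8000.0)
print(hf_440[:3], distance_to_harmonic_grid(hf_440, 440.0)[:3])

# f0 = 500 Hz: 4000 Hz = 8 * 500 Hz, so the copied partials happen to stay on the grid.
hf_500 = copied_partials(500.0, xover, 8000.0)
print(hf_500[:3], distance_to_harmonic_grid(hf_500, 500.0)[:3])
```

For f0 = 440 Hz the first patch places every partial 40 Hz above the nearest true harmonic, which corresponds to the shifted tonal lines in the figure; for f0 = 500 Hz the mismatch vanishes only because the crossover happens to be a multiple of the fundamental.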

Inappropriate Temporal Shaping

When the temporal shaping of the reconstructed signal does not match that of the original signal, pre-echo-like artifacts can occur. Figure 9 illustrates the spectrogram of a transient signal (glockenspiel), where the frequencies above 4 kHz have been substituted by noise that matches the average energy in frames of 21 ms length. Besides the loss of tonality, the sharp attacks are temporally smeared in the HF part, resembling pre-echoes with a high-pass characteristic.

Figure 9: Spectrogram illustrating inappropriate temporal shaping in "Glockenspiel"
Note how the attacks are smeared (in addition to the increased noisiness).
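The smearing follows directly from matching only the average energy per frame: the noise energy is spread evenly over the frame, including the part before the attack. The sketch below illustrates this for a synthetic decaying tone whose onset falls in the middle of a 21 ms frame. The sampling rate, the test signal, and the function name `framewise_noise` are assumptions made for this example.

```python
import numpy as np

def framewise_noise(x, frame_len, rng):
    """Noise with the same energy as x in each frame, spread evenly within the frame."""
    y = np.zeros_like(x)
    for start in range(0, len(x), frame_len):
        seg = x[start:start + frame_len]
        noise = rng.standard_normal(len(seg))
        noise *= np.sqrt(np.sum(seg ** 2) / np.sum(noise ** 2))
        y[start:start + frame_len] = noise
    return y

fs = 48000                                # assumed sampling rate
frame_len = int(round(0.021 * fs))        # 21 ms frames
onset = frame_len + frame_len // 2        # attack in the middle of the second frame

t = np.arange(4 * frame_len) / fs
x = np.zeros(len(t))
x[onset:] = (np.exp(-200.0 * (t[onset:] - t[onset]))
             * np.sin(2 * np.pi * 6000 * t[onset:]))

y = framewise_noise(x, frame_len, np.random.default_rng(2))

# Energy in the second frame before the attack: zero in the original,
# clearly non-zero in the noise-substituted version (the "pre-echo").
pre_orig = np.sum(x[frame_len:onset] ** 2)
pre_noise = np.sum(y[frame_len:onset] ** 2)
print("pre-onset energy  original / substituted:", pre_orig, pre_noise)
```

Shorter frames, or an explicit temporal envelope within the frame, would reduce the energy that appears ahead of the attack in this sketch.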

Sound examples

For the following sound examples, intentionally degraded signals were generated that simulate clearly audible artifacts, in order to sensitize listeners to more subtle coder-induced artifacts of the same nature. To demonstrate noisiness, the HF spectrum is replaced by noise with the same spectral envelope. To demonstrate exaggerated tonality and harmonicity mismatch, the HF spectrum is replaced by a scaled version of the LF spectrum. Further details on the simulation of coding artifacts for demonstration and research purposes can be found in [6].

Accordion: The accordion has high tonality with a dense harmonic structure. Improper BWE can lead to different artifacts here: If the reconstructed HF part is too noise-like, the harmonic structure is lost and the noise can be perceived as an additional signal, which may sound like "air additionally leaking out of the accordion". Alternatively, if the tonality is preserved but the harmonic structure of the LF part is not properly continued by the reconstructed HF part, the accordion's timbre may sound too harsh, resembling a resonant or metallic sound. However, if the crossover frequency is high enough (e.g. 8 kHz), the ear cannot resolve the harmonic fine structure, resulting in acceptable quality.

Play Accordion Original: Original
Play Accordion Noisy 4kHz: HF sounds too noise-like, clearly audible noise
Play Accordion Noisy 8kHz: Noisiness more subtle, closer to real-world artifacts
Play Accordion Harmonics 4kHz: Harmonic structure not matched, sounds harsh or metallic
Play Accordion Harmonic 8kHz: Only HF harmonic mismatch, reaching acceptable quality

Glockenspiel: The glockenspiel also has high tonality, but not a dense harmonic structure like the accordion. Additionally, the glockenspiel has a distinct temporal structure with clear attacks at the onset of each note. As the temporal structure of the attacks is smeared in the reconstructed versions, artifacts from inappropriate temporal shaping are audible, making the attacks sound less sharp and "wet". Moreover, replacing the tonal lines in the HF part by noise is perceived as an additional signal, similar to the artifacts demonstrated for the accordion. If the sparse tonal lines in the HF part are reconstructed from the much denser structure in the LF part, ringing artifacts are audible that may sound as if loose parts were resonating in the glockenspiel.

Play Glockenspiel Original: Original
Play Glockenspiel Noisy 4kHz: HF too noisy, smeared, noisy transients
Play Glockenspiel Noisy 8kHz: Noisiness more subtle, transients still slightly "wet"
Play Glockenspiel Harmonics 4kHz: Ringing/resonance on transients, sounds slightly out-of-tune
Play Glockenspiel Harmonic 8kHz: More subtle effect, closer to real-world artifacts

Funky: This pop music signal is more complex than the isolated accordion and glockenspiel. It has tonal components in the LF part, but the HF part is rather noise-like due to the reverb and decaying hi-hats. Therefore, HF noise substitution works rather well above reasonable crossover frequencies (e.g. 8 kHz). The copy-up process, however, introduces tonality from the LF part into the HF part, resulting in a resonant or sizzling sound.

Play Funky Original: Original
Play Funky Noisy 4kHz: HF substituted by noise, sounds slightly washed out
Play Funky Noisy 8kHz: Noisiness still audible, but reaching acceptable quality
Play Funky Tonality 4kHz: HF part too tonal, sizzling and resonant sound
Play Funky Tonality 8kHz: More subtle tonality exaggeration, lower quality than noise substitution

References

[1] P. Ekstrand: "Bandwidth extension of audio signals by spectral band replication", in Proc. 1st IEEE Benelux Workshop on Model based Processing and Coding of Audio (MPCA-2002), Leuven, Belgium, Nov. 15, 2002, pp. 73–79.
[2] J. Herre, M. Dietz: "Standards in a Nutshell: MPEG-4 High-Efficiency AAC Coding", IEEE Signal Processing Magazine, Vol. 25(3), pp. 137-142, May 2008.
[3] F. Nagel, S. Disch: "A harmonic bandwidth extension method for audio codecs", in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP '09), Taipei, Taiwan, 19–24 April 2009.
[4] S. Disch, A. Niedermeier, C.R. Helmrich, C. Neukam, K. Schmidt, R. Geiger, J. Lecomte, F. Ghido, F. Nagel, B. Edler: "Intelligent Gap Filling in Perceptual Transform Coding of Audio", in Proc. AES 141st Convention, October 2016.
[5] K. Kjörling et al.: "AC-4 – the next generation audio codec", in Proc. AES 140th Convention, Paris, France, June 4–7, 2016, preprint #9491.
[6] S. Dick, N. Schinkel-Bielefeld, S. Disch: "Generation and Evaluation of Isolated Audio Coding Artifacts", Proc. AES 143rd Convention, October 2017.