In this notebook, we summarize the various variants for computing and interpreting a discrete STFT, while fixing the conventions we use throughout the FMP notebooks (if not specified otherwise explicitly). For details on the STFT, we refer to Section 2.1.4 and Section 2.5.3 of [Müller, FMP, Springer 2015].
Let $x=(x(0),x(1), \ldots x(L-1))^\top \in \mathbb{R}^L$ be a discrete-time signal of length $L\in\mathbb{N}$. Furthermore, let $F_\mathrm{s}$ be the sample rate. Then, we associate to $x$ the vector $t=(t(0),t(1), \ldots t(L-1))^\top \in \mathbb{R}^L$ of physical time positions (given in seconds) defined by
$$ t(n) = \frac{n}{F_\mathrm{s}} $$for $n\in[0:L-1]$. In other words:
The following code shows how to visualize a waveform using a physical time axis specified in seconds.
import os
import sys
import numpy as np
from matplotlib import pyplot as plt
import matplotlib
import librosa
import librosa.display
import IPython.display as ipd
sys.path.append('..')
import libfmp.b
import libfmp.c2
%matplotlib inline
# Load wav
fn_wav = os.path.join('..', 'data', 'C2', 'FMP_C2_F05c_C4_violin.wav')
Fs = 11025
x, Fs = librosa.load(fn_wav, sr=Fs)
ipd.Audio(x, rate=Fs)
L = x.shape[0]
t_wav = np.arange(L) / Fs
x_duration = L / Fs
print('t[0] = %0.4f, t[-1] = (L-1)/Fs = %0.4f, Fs = %0.0f, L = %0.0f, dur_x=%0.4f'
% (t_wav[0], t_wav[-1], Fs, L, x_duration))
ipd.display(ipd.Audio(x, rate=Fs))
plt.figure(figsize=(6, 2))
plt.plot(t_wav, x, color='gray')
plt.xlim([t_wav[0], t_wav[-1]])
plt.xlabel('Time (seconds)')
plt.tight_layout()
The librosa
function librosa.display.waveplot
also visualizes the waveform with a physical time axis (given in seconds). Note that, rather than showing the samples of the waveform, this function plots a symmetric amplitude envelope. It offers many more functionalities such as plotting stereo signals.
plt.figure(figsize=(6, 2))
librosa.display.waveplot(x, sr=Fs, color='gray')
plt.tight_layout()
A function to plot a waveform is also included in libfmp
, which is illustrated by the following code cell.
libfmp.b.plot_signal(x, Fs, figsize=(6, 2));
When considering a windowed section of the signal, we adopt a centered view, where the center of the window is used as reference to relate to the physical domain. In particular, when computing an STFT, we extend the signal to the left by applying zero padding of half the window length. More precisely, let $w:[0:N-1]\to\mathbb{R}$ be a window function of even window length $N\in\mathbb{N}$, and let $H\in\mathbb{N}$ be the hop size. Then we put $N/2$ zero values in front defining:
$$ \tilde{x}=(0,\ldots,0,x(0),x(1), \ldots x(L-1))^\top \in \mathbb{R}^{L+N/2} $$and
\begin{eqnarray} \mathcal{X}(m,k):= \sum_{n=0}^{N-1} \tilde{x}(n+mH)w(n)\mathrm{exp}(-2\pi ikn/N). \end{eqnarray}Furthermore, we use the convention that the frame index $m$ is associated to the physical time position
$$ T_\mathrm{coef}(m) := \frac{m\cdot H}{F_\mathrm{s}} $$given in seconds. In particular, the following holds:
When $x$ and $w$ are real-valued, the upper half of the frequency coefficients are redundant. Therefore, only the coefficients $k\in[0:K]$ with $K=N/2$ are used. In particular, the index $k=N/2$ corresponds to the Nyquist frequency $\omega=F_\mathrm{s}/2$. Furthermore, the index $k$ corresponds to the frequency
\begin{equation} F_\mathrm{coef}(k) := \frac{k\cdot F_\mathrm{s}}{N} \end{equation}given in Hertz.
The following code shows how to use the librosa
function librosa.stft
to implement these conventions. The parameter setting center=True
activates the centered view, while pad_mode='constant'
switches to the zero-padding mode. Furthermore, the code shows how to implement the conversion functions $T_\mathrm{coef}$ and $F_\mathrm{coef}$, once using the formulas above and once using the librosa
built-in functions. Note that for odd window sizes $N$, there may be different conventions for rounding. In practice, one typically uses even window sizes (in particular, being a power of two in view of the FFT algorithm).
N = 256
H = 64
color = 'gray_r'
X = librosa.stft(x, n_fft=N, hop_length=H, win_length=N, window='hann', pad_mode='constant', center=True)
Y = np.log(1 + 100 * np.abs(X) ** 2)
T_coef = np.arange(X.shape[1]) * H / Fs
T_coef_librosa = librosa.frames_to_time(np.arange(X.shape[1]), sr=Fs, hop_length=H)
print('Computation of T_coef agrees:', np.allclose(T_coef, T_coef_librosa))
K = N // 2
F_coef = np.arange(K+1) * Fs / N
F_coef_librosa = librosa.fft_frequencies(sr=Fs, n_fft=N)
print('Computation of F_coef agrees:', np.allclose(F_coef, F_coef_librosa))
plt.figure(figsize=(6, 3))
extent = [T_coef[0], T_coef[-1], F_coef[0], F_coef[-1]]
plt.imshow(Y, cmap=color, aspect='auto', origin='lower', extent=extent)
plt.xlabel('Time (seconds)')
plt.ylabel('Frequency (Hz)')
plt.colorbar()
plt.tight_layout()
plt.show()
To adopt a centered view also in the visualization, one has to adjust the left and right margin by half a frame length, and the lower and upper margin by half a bin width. This is demonstrated by the following code example. For large spectrograms, however, these small adjustments in the visualization are not relevant.
plt.figure(figsize=(6, 3))
extent = [T_coef[0] - (H / 2) / Fs, T_coef[-1] + (H / 2) / Fs,
F_coef[0] - (Fs / N) / 2, F_coef[-1] + (Fs / N) / 2]
plt.imshow(Y, cmap=color, aspect='auto', origin='lower', extent=extent)
plt.xlim([T_coef[0], T_coef[-1]])
plt.ylim([F_coef[0], F_coef[-1]])
plt.xlabel('Time (seconds)')
plt.ylabel('Frequency (Hz)')
plt.colorbar()
plt.tight_layout()
Finally, as an alternative, the following code example shows how to use the librosa
-function librosa.display.specshow
to visualize a spectrogram. This function offers many options to modify the visualization.
plt.figure(figsize=(6, 3))
librosa.display.specshow(Y, y_axis='linear', x_axis='time', sr=Fs, hop_length=H, cmap=color)
plt.colorbar()
plt.tight_layout()
Functions for computing and visualizing the STFT are also part of libfmp
. Furthermore, we included in libfmp
the function stft_convention_fmp
, which is a wrapper of the librosa
implementation with specific settings as used in the FMP notebooks. Note that the STFT functions may deviate in terms of the padding strategy (in particular, the padding on right side of the signal) and data types used. The following code cell shows how to use the libfmp
-functions.
w = np.hanning(N)
X = libfmp.c2.stft(x, w, H, zero_padding=0, only_positive_frequencies=True)
Y = np.log(1 + 100 * np.abs(X) ** 2)
print('=== Using libfmp.c2.stft ===')
print('Y.shape = (%d, %d), Y.dtype = %s'%(Y.shape[0], Y.shape[1], Y.dtype))
libfmp.b.plot_matrix(Y, Fs=Fs/H, Fs_F=N/Fs);
plt.show()
Y, T_coef, F_coef= libfmp.c2.stft_convention_fmp(x, Fs, N, H, mag=True, gamma=100)
print('=== Using libfmp.c2.stft_convention_fmp ===')
print('Y.shape = (%d, %d), Y.dtype = %s'%(Y.shape[0], Y.shape[1], Y.dtype))
libfmp.b.plot_matrix(Y, Fs=Fs/H, Fs_F=N/Fs);
plt.show()
We discussed in the FMP notebook on frequency grid density how to refine the frequency grid by suitably padding the windowed sections. The librosa
function librosa.stft
implements this idea by means of the two parameters n_fft
(corresponding to $L$, the size of the padded section) and win_length
(corresponding to $N$, the size of the windowed section). Using this padding variant with $L$ (instead of $N$), one needs to adjust the computation of the function $F_\mathrm{coef}$ as follows:
for $k\in [0:K]$ with $K=L/2$. As before, to avoid rounding issues, it is advisable to choose an even number $L$ (possibly, being a power of two).
N = 256
L = 512
H = 64
color = 'gray_r'
X = librosa.stft(x, n_fft=L, hop_length=H, win_length=N, window='hann', pad_mode='constant', center=True)
Y = np.log(1 + 100 * np.abs(X) ** 2)
T_coef = np.arange(0, X.shape[1]) * H / Fs
K = L // 2
F_coef = np.arange(K + 1) * Fs / L
F_coef_librosa = librosa.fft_frequencies(sr=Fs, n_fft=L)
print('Computation of F_coef agrees:', np.allclose(F_coef, F_coef_librosa))
print('Y.shape = (%d,%d)'%(Y.shape[0],Y.shape[1]))
plt.figure(figsize=(6, 3))
extent = [T_coef[0], T_coef[-1], F_coef[0], F_coef[-1]]
plt.imshow(Y, cmap=color, aspect='auto', origin='lower', extent=extent)
plt.xlabel('Time (seconds)')
plt.ylabel('Frequency (Hz)')
plt.colorbar()
plt.tight_layout()
Zero padding is also possible with the STFT function from libfmp
. However, instead of passing $L$ to the function, it has an argument zero_padding
, which denotes number $L-N$ of samples to be padded.
w = np.hanning(N)
X = libfmp.c2.stft(x, w, H, zero_padding=N, only_positive_frequencies=True)
Y = np.log(1 + 100 * np.abs(X) ** 2)
print('Y.shape = (%d,%d)'%(Y.shape[0],Y.shape[1]))
libfmp.b.plot_matrix(Y, Fs=Fs/H, Fs_F=(N+N)/Fs);
There are many further aspects that are crucial when computing and working with the STFT. Among others, we cover in the FMP notebooks the following related topics: