In this notebook, we introduce the inverse DFT and then show how a signal can be recovered from its STFT under relatively mild conditions on the windowing process. In most parts, we closely follow Section Section 2.4.2 and Section 8.1.2.1 of [Müller, FMP, Springer 2015].
Given a vector $x\in \mathbb{C}^N$ of length $N\in\mathbb{N}$, the discrete Fourier transform is defined by the matrix–vector product
$$ X = \mathrm{DFT}_N \cdot x $$with the DFT matrix $\mathrm{DFT}_N \in \mathbb{C}^{N\times N}$ given by
$$ \mathrm{DFT}_N(n, k) = \mathrm{exp}(-2 \pi i k n / N) $$for $n\in[0:N-1]$ and $k\in[0:N-1]$. The DFT is invertible in the sense that the vector $x$ can be recovered from its spectral vector $X$. The inverse DFT is again specified by a matrix–vector product
$$ x = \mathrm{DFT}_N^{-1} \cdot X, $$where $\mathrm{DFT}_N^{-1}$ denotes the inverse of the DFT matrix $\mathrm{DFT}_N$. It is a well-known fact that this inverse is given by
$$ \mathrm{DFT}_N^{-1}(n, k) = \frac{1}{N}\mathrm{exp}(2 \pi i k n / N) $$for $n\in[0:N-1]$ and $k\in[0:N-1]$. In other words, the inverse essentially coincides with the DFT matrix up to some normalizing factor and complex conjugation. In the following code cell, we generate the DFT matrix and its inverse. Furthermore, we show that the two matrices are indeed inverse to each other (up to some numerical issues due to rounding in floating point arithmetic). To this end, we measure the entry-wise difference between $\mathrm{DFT}_N \cdot \mathrm{DFT}_N^{-1}$ and the the identity matrix $I_N\in \mathbb{R}^{N\times N}$ as well as between $\mathrm{DFT}_N^{-1} \cdot\mathrm{DFT}_N$ and $I_N$.
import os
import numpy as np
import scipy.signal
import librosa
from numba import jit
from matplotlib import pyplot as plt
import sys
sys.path.append('..')
import libfmp.c2
%matplotlib inline
@jit(nopython=True)
def generate_matrix_dft_inv(N, K):
"""Generates an IDFT (inverse discrete Fourier transfrom) matrix
Notebook: C2/C2_STFT-Inverse.ipynb
Args:
N (int): Number of samples
K (int): Number of frequency bins
Returns:
dft (np.ndarray): The IDFT matrix
"""
dft = np.zeros((K, N), dtype=np.complex128)
for n in range(N):
for k in range(K):
dft[k, n] = np.exp(2j * np.pi * k * n / N) / N
return dft
N = 32
dft_mat = libfmp.c2.generate_matrix_dft(N, N)
dft_mat_inv = generate_matrix_dft_inv(N, N)
I = np.eye(N)
A = np.dot(dft_mat, dft_mat_inv)
B = np.dot(dft_mat_inv, dft_mat)
plt.figure(figsize=(11, 3))
plt.subplot(1, 3, 1)
plt.title(r'$I_N$ for $N = %d$'%N)
plt.imshow(I, origin='lower', cmap='seismic', aspect='equal')
plt.colorbar()
plt.subplot(1, 3, 2)
plt.title(r'$|I_N - \mathrm{DFT}_N \cdot \mathrm{DFT}_N^{-1}|$')
plt.imshow(np.abs(I-A), origin='lower', cmap='seismic', aspect='equal')
plt.colorbar()
plt.subplot(1, 3, 3)
plt.title(r'$|I_N - \mathrm{DFT}_N^{-1} \cdot \mathrm{DFT}_N|$')
plt.imshow(np.abs(I-B), origin='lower', cmap='seismic', aspect='equal')
plt.colorbar();
plt.tight_layout()
Recall that the DFT can be computed efficiently using the FFT algorithm. Basically the same algorithm can be also applied to compute the inverse DFT. In the following, we will use the implementations numpy.fft.fft
and numpy.fft.ifft
.
Next, we discuss the problem of inverting the discrete short-time Fourier transform (STFT). As usual, let $x:\mathbb{Z}\to\mathbb{R}$ be a discrete-time signal and $\mathcal{X}$ its STFT. Furthermore, let $w:[0:N-1]\to\mathbb{R}$ denote the underlying real-valued discrete window function of length $N\in\mathbb{N}$ and $H\in\mathbb{N}$ the hop size parameter. For notational convenience, we extend the window function to $w:\mathbb{Z}\to\mathbb{R}$ using zero padding. Let $x_n:\mathbb{Z}\to\mathbb{R}$ be the windowed signal defined by
\begin{equation} x_n(r):=x(r+nH)w(r) \end{equation}for $r\in\mathbb{Z}$. Then, the STFT coefficients $\mathcal{X}(n,k)$ for $k\in[0:N-1]$ are obtained via
$$ (\mathcal{X}(n,0),\ldots, \mathcal{X}(n,N-1))^\top = \mathrm{DFT}_N \cdot (x_n(0),\ldots, x_n(N-1))^\top. $$Since $\mathrm{DFT}_N$ is an invertible matrix, we can reconstruct the windowed signal $x_n$ from the STFT by
$$ (x_n(0),\ldots x_n(N-1))^\top = \mathrm{DFT}_N^{-1} \cdot (\mathcal{X}(n,0),\ldots, \mathcal{X}(n,N-1))^\top $$and $x_n(r)=0$ for $r\in\mathbb{Z}\setminus[0,N-1]$. To obtain the samples $x(r)$ of the original signal, one has to reverse the windowing process. We now show that this is possible under relatively mild conditions on the windowing process. Let us consider the superposition over all suitably shifted versions of windowed sections of the signal:
$$ \sum_{n\in\mathbb{Z}} x_n(r-nH) = \sum_{n\in\mathbb{Z}} x(r-nH+nH)w(r-nH) = x(r)\sum_{n\in\mathbb{Z}} w(r-nH). $$Therefore, one can recover the samples $x(r)$ via
$$ x(r) = \frac{\sum_{n\in\mathbb{Z}} x_n(r-nH)}{\sum_{n\in\mathbb{Z}} w(r-nH)} $$as long as the condition
$$ \sum_{n\in\mathbb{Z}} w(r-nH)\not= 0 $$is satisfied for $r\in\mathbb{Z}$. This overall approach is based on the so-called overlap–add technique, where the overlapping reconstructed windowed sections are simply overlaid and added up (and then normalized to compensate for the windowing).
It is not hard to find window functions along with hop sizes that satisfy the above condition. For example, when the window function $w:[0:N-1]\to\mathbb{R}$ is strictly positive and the hop size is smaller than or equal to the window length, then the sum over the time-shifted windows is always positive. Often, one chooses a window function and a hop size such that the stronger condition
$$ \sum_{n\in\mathbb{Z}} w(r-nH) = 1 $$for all $r\in\mathbb{Z}$ is fulfilled. In this case, one also says that the time-shifted window functions define a partition of unity of the discrete time axis $\mathbb{Z}$. For example, one obtains a partition of unity when using the squared sinusoidal as window $w:\mathbb{Z}\to\mathbb{R}$ defined by
\begin{equation} w(r):= \left\{ \begin{array}{cl} \sin(\pi r/N)^2 & \,\mbox{if}\,\,\, r\in[0:N-1],\\ 0 & \,\mbox{otherwise} , \end{array} \right. \end{equation}and a hop size of $H=N/2$. Note that the property of being a partition of unity not only depends on the window function itself but also on the hop size parameter. The following figure shows various window functions of length $N$ and their time-shifted versions using a hop size $H$. The sum of the shown time-shifted versions is indicated by the thick red curve.
def plot_sum_window(w, H, L, title='', figsize=(5, 1.5)):
N = len(w)
M = np.floor((L - N) / H).astype(int) + 1
w_sum = np.zeros(L)
plt.figure(figsize=figsize)
for m in range(M):
w_shifted = np.zeros(L)
w_shifted[m * H:m * H + N] = w
plt.plot(w_shifted, 'k')
w_sum = w_sum + w_shifted
plt.plot(w_sum, 'r', linewidth=3)
plt.xlim([0, L-1])
plt.ylim([0, 1.1*np.max(w_sum)])
plt.title(title)
plt.tight_layout()
plt.show()
return w_sum
L = 256
N = 64
H = N//2
w_type = 'triang'
w = scipy.signal.get_window(w_type, N)
plot_sum_window(w, H, L, title='Triangular window, H = N/2');
H = N//2
w_type = 'hann'
w = scipy.signal.get_window(w_type, N)
plot_sum_window(w, H, L, title='Hann window, H = N/2');
H = 3*N//8
w_type = 'hann'
w = scipy.signal.get_window(w_type, N)
plot_sum_window(w, H, L, title='Hann window, H = 3N/8');
H = N//4
w = scipy.signal.gaussian(N, std=8)
plot_sum_window(w, H, L, title='Gaussian window, H = N/4');
In the following code cell, we provide a pair of implementations for the STFT and its inverse. Here are some notes one needs to keep in mind that are of practical relevance:
def istft_basic(X, w, H, L):
"""Compute the inverse of the basic discrete short-time Fourier transform (ISTFT)
Notebook: C2/C2_STFT-Inverse.ipynb
Args:
X (np.ndarray): The discrete short-time Fourier transform
w (np.ndarray): Window function
H (int): Hopsize
L (int): Length of time signal
Returns:
x (np.ndarray): Time signal
"""
N = len(w)
M = X.shape[1]
x_win_sum = np.zeros(L)
w_sum = np.zeros(L)
for m in range(M):
x_win = np.fft.ifft(X[:, m])
# Avoid imaginary values (due to floating point arithmetic)
x_win = np.real(x_win)
x_win_sum[m * H:m * H + N] = x_win_sum[m * H:m * H + N] + x_win
w_shifted = np.zeros(L)
w_shifted[m * H:m * H + N] = w
w_sum = w_sum + w_shifted
# Avoid division by zero
w_sum[w_sum == 0] = np.finfo(np.float32).eps
x_rec = x_win_sum / w_sum
return x_rec, x_win_sum, w_sum
L = 256
t = np.arange(L) / L
omega = 4
x = np.sin(2 * np.pi * omega * t * t)
N = 64
H = 3 * N // 8
w_type = 'hann'
w = scipy.signal.get_window(w_type, N)
X = libfmp.c2.stft_basic(x, w=w, H=H)
x_rec, x_win_sum, w_sum = istft_basic(X, w=w, H=H, L=L)
plt.figure(figsize=(8, 3))
plt.plot(x, color=[0, 0, 0], linewidth=4, label='Original signal')
plt.plot(x_win_sum, 'b', label='Summed windowed signals')
plt.plot(w_sum, 'r', label='Summed windows')
plt.plot(x_rec, color=[0.8, 0.8, 0.8], linestyle=':', linewidth=4, label='Reconstructed signal')
plt.xlim([0,L-1])
plt.legend(loc='lower left')
plt.tight_layout()
plt.show()
We have seen above that the samples $x(r)$ can be recovered from a signal's windowed sections via the formula
$$ x(r) = \frac{\sum_{n\in\mathbb{Z}} x_n(r-nH)}{\sum_{n\in\mathbb{Z}} w(r-nH)} $$in the case that the denominator is nonzero for $r\in\mathbb{Z}$. In practice, when reconstructing a time-domain signal from an STFT, one often has to deal with unwanted or deliberate modification in the STFT domain. This topic will be covered in the FMP notebook on signal reconstruction. In view of such modifications, it turns out that the following formula yields a better reconstruction of the samples $x(r)$:
$$ x(r) = \frac{\sum_{n\in\mathbb{Z}} w(r-nH)x_n(r-nH)}{\sum_{n\in\mathbb{Z}} w(r-nH)^2}, $$where one again assumes that the denominator is nonzero for $r\in\mathbb{Z}$. In the case that there are no modifications in the STFT domain, both reconstruction formulas yield exactly the original samples. The second strategy is also used in the function librosa.istft
.
The python package LibROSA offers the two functions librosa.stft
and librosa.istft
for computing the STFT and its inverse. In the following, we test these two functions. In particular, we want to demonstrate that it is important to compensate for possible padding, e.g., when using the centered variant of the STFT. As an example, we use the following violin sound.
def print_plot(x, x_rec):
print('Number of samples of x: ', x.shape[0])
print('Number of samples of x_rec:', x_rec.shape[0])
if x.shape[0] == x_rec.shape[0]:
print('Signals x and x_inv agree:', np.allclose(x, x_rec))
plt.figure(figsize=(6, 2))
plt.plot(x-x_rec, color='red')
plt.xlim([0, x.shape[0]])
plt.title('Differences between x and x_rec')
plt.xlabel('Time (samples)');
plt.tight_layout()
plt.show()
else:
print('Number of samples of x and x_rec does not agree.')
fn_wav = os.path.join('..', 'data', 'C2', 'FMP_C2_F05c_C4_violin.wav')
Fs = 11025
x, Fs = librosa.load(fn_wav, sr=Fs)
N = 4096
H = 2048
L = x.shape[0]
print('=== Centered Case ===')
X = librosa.stft(x, n_fft=N, hop_length=H, win_length=N, window='hann', pad_mode='constant', center=True)
x_rec = librosa.istft(X, hop_length=H, win_length=N, window='hann', center=True, length=L)
print('stft: center=True; istft: center=True')
print_plot(x, x_rec)
print('=== Non-Centered Case ===')
X = librosa.stft(x, n_fft=N, hop_length=H, win_length=N, window='hann', pad_mode='constant', center=False)
x_rec = librosa.istft(X, hop_length=H, win_length=N, window='hann', center=False, length=L)
print('stft: center=False; istft: center=False')
print_plot(x, x_rec)
print('=== Centered vs. Non-Centered Case ===')
X = librosa.stft(x, n_fft=N, hop_length=H, win_length=N, window='hann', pad_mode='constant', center=True)
x_rec = librosa.istft(X, hop_length=H, win_length=N, window='hann', center=False, length=L)
print('stft: center=True; istft: center=False')
print_plot(x, x_rec)
plt.figure(figsize=(6, 2))
plt.plot(x, color='black')
plt.xlim([0, x.shape[0]])
plt.plot(x_rec, color='gray')
plt.title('Signal x (black) and x_rec (gray)')
plt.xlabel('Time (samples)');
plt.tight_layout()
plt.show()
We also provide an implementation of the inverse FFT, the STFT, and the inverse STFT in our library libfmp
. These implementations also include padding options and realize the centric view as used in the FMP notebooks (similar to librosa
). In the following code cell, we call these libfmp
function.
import sys
sys.path.append('..')
from libfmp.c2 import fft, stft, ifft, istft
L = 256
t = np.arange(L)/L
omega = 4
x = np.sin(2 * np.pi * omega * t*t)
X = fft(x)
x_rec = np.real(ifft(X))
plt.figure(figsize=(8, 2))
plt.plot(x, color='k', linewidth=4, label='Original waveform')
plt.plot(x_rec, color=[0.8, 0.8, 0.8], linestyle=':', linewidth=4, label='Reconstructed with inverse FFT')
plt.xlim([0,L-1])
plt.xlabel('Time index')
plt.ylabel('Amplitude')
plt.legend()
plt.tight_layout()
N = 64
H = 3 * N // 8
w = scipy.signal.get_window('hann', N)
X_stft = stft(x, w=w, H=H)
x_stft_rec = istft(X_stft, w=w, H=H, L=L)
plt.figure(figsize=(8, 2))
plt.plot(x, color='k', linewidth=4, label='Original waveform')
plt.plot(x_stft_rec, color=[0.8, 0.8, 0.8], linestyle=':', linewidth=4, label='Reconstructed with inverse STFT')
plt.xlim([0,L-1])
plt.xlabel('Time index')
plt.ylabel('Amplitude')
plt.legend()
plt.tight_layout()
In this notebook, we have seen that a time-domain signal can be perfectly reconstructed from an STFT under mild conditions on the window function and the hopsize parameter (one only requires $\sum_{n\in\mathbb{Z}} w(r-nH)\not= 0$ for $r\in\mathbb{Z}$). In practice, one often modifies an STFT and then tries to reconstruct a time-domain signal from the modified STFT. Such a reconstruction, however, is generally not possible and involves some unanticipated pitfalls. For a further discussion of this topic, we refer to the FMP notebook on signal reconstruction.