Processing math: 100%
FMP AudioLabs
C8

Instantaneous Frequency Estimation


Following Section 8.2.1 of [Müller, FMP, Springer 2015], we introduce in this notebook the notation of instantaneous frequency and show how it can be estimated.

Introduction¶

In many music processing tasks, the first step is to convert the audio signal into a time–frequency representation using an STFT, which introduces a linear frequency grid. In the following we introduce a technique referred to as instantaneous frequency estimation. The estimates, which are derived by looking at the STFT's phase information, allows for improving the frequency quantization introduced by the STFT. We have seen alternative approaches in the FMP notebook on frequency grid density and in the FMP notebook on frequency interpolation.

We now summarize some properties of the discrete STFT, while fixing some notation. Let x denote the given music signal sampled at a rate of Fs Hertz. Furthermore, let X be its STFT using a suitable window function of length N∈N and hop size H∈N. Recall that, for the Fourier coefficient X(n,k), the frame index n∈Z is associated to the physical time

Tcoef(n):=nâ‹…HFs

(given in seconds) and the frequency index k∈[0:N/2] corresponds to the frequency

Fcoef(k):=kâ‹…FsN

(given in Hertz). In particular, the discrete STFT introduces a linear sampling of the frequency axis with a resolution of Fs/N Hz. This resolution may not suffice to accurately capture certain time–frequency patterns (e.\,g., continuously changing patterns due to vibrato or glissando). Furthermore, because of the logarithmic perception of frequency, the linear sampling of the frequency axis becomes particularly problematic for the low-frequency part of the spectrum. Increasing the frequency resolution by simply increasing the window length N is not a viable solution, since this process decreases the temporal resolution. In the following, we discuss a technique for obtaining an enhanced frequency estimation by exploiting the phase information encoded in the complex-valued STFT.

Instantaneous Frequency¶

In order to explain this technique, let us start by recalling the main ideas of expressing and measuring frequency. As our prototypical oscillations, we consider complex-valued exponential functions of the form

expω,φ:R→C,expω(t):=exp(2πi(ωt−φ))

for a frequency parameter ω∈R (measured in Hz) and a phase parameter φ (measured in normalized radians with 1 corresponding to an angle of 360∘). In the case φ=0, we set

expω:=expω,0.

Uniformly increasing the time parameter t, the exponential function describes a circular motion around the unit circle. When projected onto the real and imaginary axes, this yields two sinusoidal motions (described by a cosine and a sine function). Thinking of the circular motion as a uniformly rotating wheel, the frequency parameter ω corresponds to the number of revolutions per unit time (in our case, the duration of one second). In other words, the frequency can be interpreted as the rate of rotation. Based on this interpretation, one can associate a frequency value with a rotating wheel for arbitrary time intervals [t1,t2] with t1<t2. To this end, one measures the angular position φ1 at time t1 and the angular position φ2 at time t2. This is illustrated by the following figure:

FMP_C8_F12

The frequency is then defined as the change φ2−φ1 in angular position divided by the length t2−t1 of the time interval. In the limit case, when the time interval becomes arbitrarily small, one obtains the instantaneous frequency ωt1 given by

ωt1:=limt2→t1φ2−φ1t2−t1.

Phase Prediction Error¶

For the moment, let us assume a time-continuous perspective, fixing a frequency value ω∈R and two time instances, say t1∈R and t2∈R. Later, we will choose specific values that are related to the STFT parameters. Correlating the signal x with a windowed version of the analysis function expω, one positioned at t1 and one at t2, we obtain two complex Fourier coefficients. Let φ1 and φ2 be the phases of these two coefficients, respectively. In the case that the signal x contains a strong frequency component of frequency ω, the two phases φ1 and φ2 should be consistent in the following way: A rotation of frequency ω that assumes the angular position φ1 at time position t1 should have the phase

φPred:=φ1+ω⋅Δt

at time t2, where Δt:=t2−t1. Therefore, in the case that the signal x behaves similarly to the function expω, one should have φ2≈φPred. In the case that the signal x oscillates slightly slower than expω, the phase increment from time instance t1 to instance t2 for the signal x is less than the one for the prototype oscillation expω. As a result, the phase φ2 measured at t2 is less than the predicted phase φPred. In the third case that x oscillates slightly faster than expω, the phase φ2 is larger than the predicted phase φPred. The three cases are illustrated by the following figure:

FMP_C8_F13

To measure the difference between φ2 and φPred, we introduce the prediction error defined by

φErr:=Ψ(φ2−φPred).

In this definition, Ψ:R→[−0.5,0.5] is the principal argument function, which maps phase differences into the range [−0.5,0.5] by adding or subtracting a suitable integer value, thus avoiding discontinuities when computing phase differences. The prediction error can be used to correct the frequency value ω to obtain a refined frequency estimate IF(ω) for the signal x:

IF(ω):=ω+φErrΔt.

This value (being a bit sloppy in word use) is also called the instantaneous frequency (IF), which can be thought of an adjustment of the initial frequency ω. Strictly speaking, rather than referring to a single time instance, the instantaneous frequency refers—in this case—to an entire time interval [t1,t2]. In practice, however, this interval is typically chosen to be very small (on the order of a couple of milliseconds).

Improving the STFT Frequency Resolution¶

We now apply the concept of instantaneous frequency for improving the frequency resolution of a discrete STFT. Using the polar coordinate representation, a Fourier coefficient X(n,k)∈C can be written as

X(n,k)=|X(n,k)|exp(2πiφ(n,k))

with the phase φ(n,k)∈[0,1). For the prototype oscillation, we use the frequency determined by the frequency parameter k∈[0:N/2]:

ω=Fcoef(k)=k⋅FsN.

Furthermore, the two time instances are determined by the positions of the previous frame and the current frame:

t1=Tcoef(n−1)=(n−1)⋅HFsandt2=Tcoef(n)=n⋅HFs.

Finally, the measured phases at these time instances are the ones obtained by the STFT:

φ1=φ(n−1,k)andφ2=φ(n,k).

From this, using the above equations and doing some rearrangements, we obtain the instantaneous frequency

FIFcoef(k,n):=IF(ω)=(k+κ(k,n))⋅FsN

where the bin offset κ(k,n) is calculated as

κ(k,n)=NH⋅Ψ(φ(n,k)−φ(n−1,k)−k⋅HN).

Implementation¶

In the next code cell, we provide an implementation for the above procedure that estimate the instantaneous frequency (IF) for all time-frequency bins of a given STFT.

  • The input of the procedure consists of the sampling rate Fs, the window length N, the hops size H, and a complex-valued STFT matrix X of dimension K (number of frequency bins) and L (number of frames).

  • The output is a real-valued matrix F_coef_IF that contains the estimates of the IFs for each time-frequency bin. We apply padding so that F_coef_IF is also a (K × L)-matrix

  • In the implementation, the IF is computed simultaneously for all time-frequency bins using matrix operations.

In [1]:
import numpy as np
import os, sys, librosa
from scipy import signal
from matplotlib import pyplot as plt
import IPython.display as ipd
from numba import jit

sys.path.append('..')
import libfmp.b

%matplotlib inline

@jit(nopython=True)
def principal_argument(v):
    """Principal argument function

    | Notebook: C6/C6S1_NoveltyPhase.ipynb, see also
    | Notebook: C8/C8S2_InstantFreqEstimation.ipynb

    Args:
        v (float or np.ndarray): Value (or vector of values)

    Returns:
        w (float or np.ndarray): Principle value of v
    """
    w = np.mod(v + 0.5, 1) - 0.5
    return w

@jit(nopython=True)
def compute_if(X, Fs, N, H):
    """Instantenous frequency (IF) estamation

    | Notebook: C8/C8S2_InstantFreqEstimation.ipynb, see also
    | Notebook: C6/C6S1_NoveltyPhase.ipynb

    Args:
        X (np.ndarray): STFT
        Fs (scalar): Sampling rate
        N (int): Window size in samples
        H (int): Hop size in samples

    Returns:
        F_coef_IF (np.ndarray): Matrix of IF values
    """
    phi_1 = np.angle(X[:, 0:-1]) / (2 * np.pi)
    phi_2 = np.angle(X[:, 1:]) / (2 * np.pi)

    K = X.shape[0]
    index_k = np.arange(0, K).reshape(-1, 1)
    # Bin offset (FMP, Eq. (8.45))
    kappa = (N / H) * principal_argument(phi_2 - phi_1 - index_k * H / N)
    # Instantaneous frequencies (FMP, Eq. (8.44))
    F_coef_IF = (index_k + kappa) * Fs / N

    # Extend F_coef_IF by copying first column to match dimensions of X
    F_coef_IF = np.hstack((np.copy(F_coef_IF[:, 0]).reshape(-1, 1), F_coef_IF))

    return F_coef_IF

Visualization of IF Values¶

We now introduce a visualization that provides some deeper insights into the IF estimation procedure. As example, we consider a recording or a single note C4 played on a piano. This note corresponds to MIDI pitch p=60 having a center frequency of 261.6 Hz. The following figure shows the waveform and a spectrogram representation of the audio signal.

In [2]:
fn_wav = os.path.join('..', 'data', 'C8', 'FMP_C8_NoteC4_Piano.wav')
x, Fs = librosa.load(fn_wav)

N = 2048
H = 64
X = librosa.stft(x, n_fft=N, hop_length=H, win_length=N, window='hann')
Y = np.log(1+ 10*np.abs(X))
K = X.shape[0]
L = X.shape[1]
ylim = [0,1500]

fig, ax = plt.subplots(2, 2, gridspec_kw={'width_ratios': [1, 0.05], 
                                          'height_ratios': [1, 2]}, figsize=(5, 4))        

libfmp.b.plot_signal(x, Fs, ax=ax[0,0])
ax[0,1].set_axis_off()
libfmp.b.plot_matrix(Y, Fs=Fs/H, Fs_F=N/Fs, ax=[ax[1,0],ax[1,1]], colorbar=True)
ax[1,0].set_ylim(ylim)
plt.tight_layout()

The output of the IF estimation is a frequency value FIFcoef(k,n)∈R (given in Hertz) for k∈[0:K−1] and n∈[0:L−1]. This value is a refinement of the (frame-independent) frequency value Fcoef(k) that is associated to the kth Fourier coefficient of the STFT. In the following figure, we show the (K×N)-matrices given by Fcoef and by FIFcoef in a color-coded form. Furthermore, we also visualize the difference

FIFcoef−Fcoef=κ⋅NFs,

which is the bin offset specified in Hertz. Note that opposed to visualizing a spectrogram (were the magnitudes are color-coded), this time we visualize frequency values (given in Hertz).

In [3]:
def plot_IF(F_coef_X, F_coef_IF, ylim, figsize=(10,3)):
    plt.figure(figsize=figsize)
    ax = plt.subplot(1,3,1)
    libfmp.b.plot_matrix(F_coef_X, Fs=Fs/H, Fs_F=N/Fs, cmap='PuOr', clim=ylim, ax=[ax],
            title=r'$F_\mathrm{coef}$')
    plt.ylim(ylim)
    ax = plt.subplot(1,3,2)
    libfmp.b.plot_matrix(F_coef_IF, Fs=Fs/H, Fs_F=N/Fs, cmap='PuOr', clim=ylim, ax=[ax],
            title=r'$F_\mathrm{coef}^\mathrm{IF}$')
    plt.ylim(ylim)
    ax = plt.subplot(1,3,3)
    libfmp.b.plot_matrix(F_coef_IF-F_coef_X, Fs=Fs/H, Fs_F=N/Fs, cmap='seismic', ax=[ax],
            title=r'$F_\mathrm{coef}^\mathrm{IF}-F_\mathrm{coef}$')
    plt.ylim(ylim)
    plt.tight_layout()
    
F_coef_IF = compute_if(X, Fs, N, H)
F_coef = np.arange(K) * Fs / N
F_coef_X = np.ones((K,L)) * F_coef.reshape(-1,1)
plot_IF(F_coef_X, F_coef_IF, ylim)    

In this visualization, the matrices Fcoef and FIFcoef look quite similar. However, in the visualization of the difference one can clearly see the adjustment of frequency values in regions along the fundamental frequency of the note C4 (261.6 Hz) and its harmonics. Frequency values just below a harmonic are adjusted upwards (red color) and frequency values just above a harmonic are corrected downwards (blue color). In the next figure, we zoom into a neighborhood of the fundamental frequency 261.6 Hz, while adjusting the color coding for frequency values. This visualization indicates that the IF estimation procedure assigns all frequency coefficients in a neighborhood of 261.6 Hz to exactly that frequency. We will show in the FMP notebook on salience representations how this reassignment procedure can be used to overcome the limitations of the linear frequencies grid introduced by the discrete STFT.

In [4]:
# Compute a neighborhood around center frequency of C4 (p=60)
pitch = 60
freq_fund = 2 ** ((pitch - 69) / 12) * 440
freq_0 = freq_fund-150
freq_1 = freq_fund+150
ylim = [freq_0, freq_1]

plot_IF(F_coef_X, F_coef_IF, ylim)  

Dependency on Hop Size¶

Note that the quality of the estimated instantaneous frequency depends, among other parameters, on the length

Δt=t2−t1=Tcoef(n)−Tcoef(n−1)=H/Fs

between two subsequent frames. Therefore, when applied to the discrete STFT, it is beneficial to use a small hop size H. This is illustrated by the next figures. On the downside, using a small hop size increases the computational cost for calculating the discrete STFT.

In [5]:
import time

def compute_plot_IF(x, N, H):
    
    # Compute STFT
    start = time.time()
    X = librosa.stft(x, n_fft=N, hop_length=H, win_length=N, window='hann')
    Y = np.log(1+ 10*np.abs(X))
    end = time.time()
    print('Runtime (STFT): %.8f seconds' % (end - start))    
    K = X.shape[0]
    L = X.shape[1]
    
    # Compute IF
    start = time.time()
    F_coef_IF = compute_if(X, Fs, N, H)
    end = time.time()
    print('Runtime (IF): %.8f seconds' % (end - start))    
    
    # Plot
    F_coef = np.arange(K) * Fs / N
    F_coef_X = np.ones((K,L)) * F_coef.reshape(-1,1)
    plot_IF(F_coef_X, F_coef_IF, ylim) 
    plt.show()
    
N, H = 2048, 64
print('Instanteneous frequency estimation using N=%d and H=%d'%(N,H))
compute_plot_IF(x, N, H)

N, H = 2048, 256
print('Instanteneous frequency estimation using N=%d and H=%d'%(N,H))
compute_plot_IF(x, N, H)

N, H = 2048, 1024
print('Instanteneous frequency estimation using N=%d and H=%d'%(N,H))
compute_plot_IF(x, N, H)
Instanteneous frequency estimation using N=2048 and H=64
Runtime (STFT): 0.02593803 seconds
Runtime (IF): 0.08570385 seconds
Instanteneous frequency estimation using N=2048 and H=256
Runtime (STFT): 0.01310396 seconds
Runtime (IF): 0.02418208 seconds
Instanteneous frequency estimation using N=2048 and H=1024
Runtime (STFT): 0.00340986 seconds
Runtime (IF): 0.00579691 seconds
Acknowledgment: This notebook was created by Meinard Müller and Sebastian Rosenzweig.
C0 C1 C2 C3 C4 C5 C6 C7 C8