Following Section 8.2 and Section 8.2.3.3 of [Müller, FMP, Springer 2015], we cover in this notebook two related problems referred to as melody extraction and melody separation. The overall approach was suggested by Salamon and Gómez:
When asked to describe a specific song, we are often able to sing or hum the main melody. In general terms, a melody may be defined as a linear succession of musical tones expressing a particular musical idea. Because of the special arrangement of tones, a melody is perceived as a coherent entity, which gets stuck in a listener's head as the most memorable element of a song. As the original Greek term melōidía (meaning "singing" or "chanting") implies, a melody is often performed by a human voice. Oftentimes, the melody constitutes the leading element in a composition, appearing in the foreground, while the accompaniment is in the background. Sometimes melody and accompaniment may even be played on a single instrument such as a guitar or a piano. In any case, the melody typically stands out in one way or another. For example, the melody often comprises the higher notes in a musical composition, while the accompaniment consists of the lower notes. Or the melody is played by some instrument with a characteristic timbre. In some performances, the notes of a melody may feature easily discernible time–frequency patterns such as vibrato, tremolo, or glissando (i.e., a continuous glide from one pitch to another). As with many other concepts in music processing, the notion of melody remains rather vague. In music processing, one often considers a restricted scenario. In the FMP notebook on F0 tracking, for example, we make the following simplifying assumptions:
Based on these assumptions, the melody extraction problem can be regarded as the following signal processing task: Given a recording, the objective is to automatically estimate the sequence of predominant F0-values that correspond to the notes played by the lead voice or instrument. This is exactly the problem we discussed in the FMP notebook on F0 tracking. The basic procedure can be summarized as follows:
The melody extraction problem is challenging even in the restricted case that the simplifying assumptions are valid.
Continuing our Freischütz example, the following figure shows a spectrogram representation as well as the F0-trajectory of the melody performed by a soprano singer. Note the following:
The singer is not always active. This is also indicated by the activity pattern plotted below the spectrogram.
At some time points, the spectral bin with the highest magnitude may belong to the accompaniment. This is one reason for errors in melody estimation.
Some of the higher harmonics may contain more energy than the fundamental frequency, which may lead to octave errors in melody estimation.
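These error sources can be illustrated with a naive baseline that simply picks the frequency bin of highest magnitude in each frame. The following minimal sketch is only meant as an illustration (the function naive_f0_estimate and its parameters fmin and fmax are hypothetical and not part of libfmp); it is not the salience-based procedure discussed in the FMP notebook on F0 tracking. Wherever the accompaniment or a higher harmonic dominates a frame, this frame-wise argmax deviates from the true F0-trajectory, producing exactly the confusions and octave errors mentioned above.

import numpy as np
import librosa

def naive_f0_estimate(x, Fs, N=2048, H=512, fmin=100.0, fmax=2000.0):
    """Crude F0 guess: strongest STFT bin per frame within [fmin, fmax] (illustration only)."""
    X = librosa.stft(x, n_fft=N, hop_length=H, win_length=N, pad_mode='constant')
    Y = np.abs(X)
    F_coef = np.arange(X.shape[0]) * Fs / N
    band = (F_coef >= fmin) & (F_coef <= fmax)
    Y_band = np.where(band[:, None], Y, 0)          # restrict search to the band
    f0 = F_coef[np.argmax(Y_band, axis=0)]          # strongest bin per frame
    T_coef = np.arange(X.shape[1]) * H / Fs
    return np.column_stack((T_coef, f0))

# Frames dominated by the accompaniment or by a higher harmonic of the singer
# show up as outliers or octave jumps in this crude estimate.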
import numpy as np
import os, sys, librosa
from scipy import signal
from scipy import ndimage
from scipy.interpolate import interp1d
import matplotlib
from matplotlib import pyplot as plt
from matplotlib.colors import LinearSegmentedColormap, BoundaryNorm
from matplotlib import gridspec
import IPython.display as ipd
from numba import jit
import pandas as pd
from collections import OrderedDict
sys.path.append('..')
import libfmp.b
import libfmp.c3
import libfmp.c8
%matplotlib inline
def plot_STFT_F0_activity(Y, traj, figsize=(7, 4), ylim=[0, 4000], cmap='gray_r'):
    fig, ax = plt.subplots(2, 2, gridspec_kw={'width_ratios': [10, 0.2],
                                              'height_ratios': [10, 1]}, figsize=figsize)
    fig_im, ax_im, im = libfmp.b.plot_matrix(Y, Fs=Fs/H, Fs_F=N/Fs,
                                             ax=[ax[0, 0], ax[0, 1]], title='', cmap=cmap)
    ax[0, 0].set_ylim(ylim)
    traj_plot = traj[traj[:, 1] > 0, :]
    ax[0, 0].plot(traj_plot[:, 0], traj_plot[:, 1], color='r', markersize=4, marker='.', linestyle='')
    # F0 activity
    activity = np.array(traj[:, 1] > 0).reshape(1, -1)
    im = ax[1, 0].imshow(activity, aspect='auto', cmap='bwr', clim=[-1, 1])
    ax[1, 0].set_xticks([])
    ax[1, 0].set_yticks([])
    ax[1, 1].axis('off')
    plt.tight_layout()
    return fig, ax
# Load audio
fn_wav = os.path.join('..', 'data', 'C8', 'FMP_C8_F10_Weber_Freischuetz-06_FreiDi-35-40.wav')
Fs = 22050
x, Fs = librosa.load(fn_wav, sr=Fs)
ipd.Audio(x, rate=Fs)
# Computation of STFT
N = 2048
H = N//4
X = librosa.stft(x, n_fft=N, hop_length=H, win_length=N, pad_mode='constant')
Y = np.log(1 + np.abs(X))
Fs_feature = Fs/H
T_coef = np.arange(X.shape[1]) / Fs_feature
freq_res = Fs / N
F_coef = np.arange(X.shape[0]) * freq_res
# Load F0 trajectory and resample it to fit to X
fn_traj = os.path.join('..', 'data', 'C8', 'FMP_C8_F10_Weber_Freischuetz-06_FreiDi-35-40_F0-user-Book.csv')
traj_df = libfmp.b.read_csv(fn_traj)
traj = traj_df.values
Fs_traj = 1/(traj[1,0]-traj[0,0])
traj_X_values = interp1d(traj[:,0], traj[:,1], kind='nearest', fill_value='extrapolate')(T_coef)
traj_X = np.hstack((T_coef[:,None], traj_X_values[:,None]))
# Visualization
figsize=(7,4)
ylim = [0, 4000]
fig, ax = plot_STFT_F0_activity(Y, traj_X, figsize=figsize, ylim=ylim)
plt.show()
# Sonification
soni_mono = libfmp.c8.sonify_trajectory_with_sinusoid(traj_X, len(x), Fs)
soni_stereo = np.vstack((x.reshape(1,-1), soni_mono.reshape(1,-1))) # left: x, right: sonification
ipd.Audio(soni_stereo, rate=Fs)
In our simplified scenario, the objective of melody extraction is to estimate the sequence of F0-values that correspond to the main melody. A related task is referred to as melody separation, which aims at decomposing a music signal into a melody component that captures the main melodic voice and an accompaniment component that captures the remaining acoustic events. One application of melody separation is the automated generation of a Karaoke version for a given song, where the main melodic voice is to be removed from the song's original recording. More technically speaking, the objective of melody separation is to decompose a given signal $x$ into a melody component $x^\mathrm{Mel}$ and an accompaniment component $x^\mathrm{Acc}$ such that
\begin{equation}
x = x^\mathrm{Mel} + x^\mathrm{Acc}.
\end{equation}

One general approach is to first apply a melody extraction algorithm to derive the F0-trajectory of the main melody. Based on these F0-values and their harmonics, one can construct a binary mask for the melodic component, while the binary mask of the accompaniment is defined to be the complement. Using masking techniques as described in the FMP notebook on HPS (see Section 8.1.1.2 of [Müller, FMP, Springer 2015]), one then derives the two masked STFTs $\mathcal{X}^\mathrm{Mel}$ and $\mathcal{X}^\mathrm{Acc}$. From this, one obtains the signals $x^\mathrm{Mel}$ and $x^\mathrm{Acc}$ by applying signal reconstruction techniques (see Section 8.1.2 of [Müller, FMP, Springer 2015]). We now implement this melody separation procedure and apply it to the Freischütz example.
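Since the accompaniment mask is defined as the complement of the melody mask, the two masked STFTs add up exactly to the original STFT, so that the reconstructed components add up (up to reconstruction error) to the input signal. The following minimal sketch checks this identity numerically on random toy data (an illustration only; it is not part of the separation pipeline).

# Numerical check of the masking identity X = X * M_mel + X * M_acc on toy data
import numpy as np

rng = np.random.default_rng(0)
X_toy = rng.normal(size=(8, 5)) + 1j * rng.normal(size=(8, 5))  # toy "STFT"
mask_mel_toy = (rng.random((8, 5)) > 0.5).astype(float)         # arbitrary binary mask
mask_acc_toy = 1.0 - mask_mel_toy                               # complementary mask

assert np.allclose(X_toy, X_toy * mask_mel_toy + X_toy * mask_acc_toy)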
There are many ways to define a binary mask for a given F0-trajectory. Taking harmonics into account, we provide in the following code cell two variants:
The first variant uses the parameter tol_bin (given in frequency bins), which defines a frequency neighborhood of fixed size around each frequency bin.
The second variant uses the parameter tol_cent (given in cents), which defines a neighborhood of frequency-dependent size around each frequency bin. More precisely, the neighborhood size increases linearly with the center frequency (so that it remains constant when measured in cents).
In the following figure, we show the binary masks for the melody (as well as their complements for the accompaniment) for both variants.
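To get a feeling for the cent-based tolerance, the following short computation (an illustration, independent of the mask functions below) converts a tolerance of ±80 cents into Hz for several center frequencies. The absolute width of the neighborhood grows with the center frequency, while its width in cents stays constant.

# Width of a +/- 80-cent neighborhood in Hz for several center frequencies
import numpy as np

tol_cent = 80
tol_factor = np.power(2, tol_cent / 1200)   # frequency ratio corresponding to 80 cents
for freq in [220.0, 440.0, 880.0, 1760.0]:
    lower, upper = freq / tol_factor, freq * tol_factor
    print('%7.1f Hz: [%7.1f, %7.1f] Hz, width = %5.1f Hz' % (freq, lower, upper, upper - lower))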
def convert_trajectory_to_mask_bin(traj, F_coef, n_harmonics=1, tol_bin=0):
    """Computes binary mask from F0 trajectory

    Notebook: C8/C8S2_MelodyExtractSep.ipynb

    Args:
        traj (np.ndarray): F0 trajectory (time in seconds in 1st column, frequency in Hz in 2nd column)
        F_coef (np.ndarray): Frequency axis
        n_harmonics (int): Number of harmonics (Default value = 1)
        tol_bin (int): Tolerance in frequency bins (Default value = 0)

    Returns:
        mask (np.ndarray): Binary mask
    """
    # Compute STFT bin for trajectory
    traj_bin = np.argmin(np.abs(F_coef[:, None] - traj[:, 1][None, :]), axis=0)
    K = len(F_coef)
    N = traj.shape[0]
    max_idx_harm = np.max([K, np.max(traj_bin) * n_harmonics])
    mask_pad = np.zeros((max_idx_harm.astype(int) + 1, N))
    for h in range(n_harmonics):
        # Mark the bins of the harmonics 1, 2, ..., n_harmonics of the F0-trajectory
        mask_pad[traj_bin * (1 + h), np.arange(N)] = 1
    mask = mask_pad[1:K+1, :]
    if tol_bin > 0:
        smooth_len = 2 * tol_bin + 1
        mask = ndimage.maximum_filter1d(mask, smooth_len, axis=0, mode='constant', cval=0, origin=0)
    return mask
def convert_trajectory_to_mask_cent(traj, F_coef, n_harmonics=1, tol_cent=0.0):
    """Computes binary mask from F0 trajectory

    Notebook: C8/C8S2_MelodyExtractSep.ipynb

    Args:
        traj (np.ndarray): F0 trajectory (time in seconds in 1st column, frequency in Hz in 2nd column)
        F_coef (np.ndarray): Frequency axis
        n_harmonics (int): Number of harmonics (Default value = 1)
        tol_cent (float): Tolerance in cents (Default value = 0.0)

    Returns:
        mask (np.ndarray): Binary mask
    """
    K = len(F_coef)
    N = traj.shape[0]
    mask = np.zeros((K, N))
    freq_res = F_coef[1] - F_coef[0]
    tol_factor = np.power(2, tol_cent/1200)
    F_coef_upper = F_coef * tol_factor
    F_coef_lower = F_coef / tol_factor
    F_coef_upper_bin = (np.ceil(F_coef_upper / freq_res)).astype(int)
    F_coef_upper_bin[F_coef_upper_bin > K-1] = K-1
    F_coef_lower_bin = (np.floor(F_coef_lower / freq_res)).astype(int)
    for n in range(N):
        for h in range(n_harmonics):
            freq = traj[n, 1] * (1 + h)
            freq_bin = np.round(freq / freq_res).astype(int)
            if freq_bin < K:
                idx_upper = F_coef_upper_bin[freq_bin]
                idx_lower = F_coef_lower_bin[freq_bin]
                mask[idx_lower:idx_upper+1, n] = 1
    return mask
figsize = (12, 2.5)
# Binary masks for melody and accompaniment using bins
tol_bin = 8
mask_mel_bin = convert_trajectory_to_mask_bin(traj_X, F_coef,
n_harmonics=30, tol_bin=tol_bin)
mask_acc_bin = np.ones(mask_mel_bin.shape) - mask_mel_bin
plt.figure(figsize=figsize)
ax = plt.subplot(121)
title = r'Binary mask for melody (tol_bin = %d)'%tol_bin
fig_im, ax_im, im = libfmp.b.plot_matrix(mask_mel_bin, Fs=Fs_feature, ax=[ax],
Fs_F=1/freq_res, figsize=figsize, title=title);
plt.ylim(ylim);
ax = plt.subplot(122)
title=r'Binary mask for accompaniment (tol_bin = %d)'%tol_bin
fig, ax, im = libfmp.b.plot_matrix(mask_acc_bin, Fs=Fs_feature, ax=[ax],
Fs_F=1/freq_res, figsize=figsize, title=title);
plt.ylim(ylim);
plt.tight_layout()
# Binary masks for melody and accompaniment using cents
tol_cent = 80
mask_mel_cent = convert_trajectory_to_mask_cent(traj_X, F_coef, n_harmonics=30, tol_cent=tol_cent)
mask_acc_cent = np.ones(mask_mel_cent.shape) - mask_mel_cent
plt.figure(figsize=figsize)
ax = plt.subplot(121)
title=r'Binary mask for melody (tol_cent = %d)'%tol_cent
fig_im, ax_im, im = libfmp.b.plot_matrix(mask_mel_cent, Fs=Fs_feature, ax=[ax],
Fs_F=1/freq_res, figsize=figsize, title=title);
plt.ylim(ylim);
ax = plt.subplot(122)
title=r'Binary mask for accompaniment (tol_cent = %d)'%tol_cent
fig, ax, im = libfmp.b.plot_matrix(mask_acc_cent, Fs=Fs_feature, ax=[ax],
Fs_F=1/freq_res, figsize=figsize, title=title);
plt.ylim(ylim);
plt.tight_layout()
Next, we construct the masked STFTs $\mathcal{X}^\mathrm{Mel}$ and $\mathcal{X}^\mathrm{Acc}$. From these, we obtain the time-domain signals $x^\mathrm{Mel}$ and $x^\mathrm{Acc}$ by applying signal reconstruction techniques. Again, we present the results for both masking variants.
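The reconstruction step relies on the inverse STFT. As a quick sanity check (a minimal sketch that reuses the variables x, X, N, and H computed above for the Freischütz example), one can verify that without any masking the inverse STFT recovers the input signal up to numerical error.

# Sanity check of the STFT/ISTFT round trip used for reconstruction
# (assumes x, X, N, H from the cells above)
x_check = librosa.istft(X, hop_length=H, win_length=N, window='hann',
                        center=True, length=x.size)
print('Maximal reconstruction error:', np.max(np.abs(x - x_check)))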
def compute_plot_display_mel_acc(Fs, X, N, H, mask_mel, figsize=figsize):
    mask_acc = np.ones(mask_mel.shape) - mask_mel
    X_mel = X * mask_mel
    X_acc = X * mask_acc
    x_mel = librosa.istft(X_mel, hop_length=H, win_length=N, window='hann', center=True, length=x.size)
    x_acc = librosa.istft(X_acc, hop_length=H, win_length=N, window='hann', center=True, length=x.size)

    plt.figure(figsize=figsize)
    ax = plt.subplot(121)
    fig_im, ax_im, im = libfmp.b.plot_matrix(np.log(1 + 10 * np.abs(X_mel)), Fs=Fs_feature, ax=[ax],
                                             Fs_F=1/freq_res, figsize=figsize,
                                             title='Masked STFT for melody')
    plt.ylim(ylim)
    ax = plt.subplot(122)
    fig, ax, im = libfmp.b.plot_matrix(np.log(1 + 10 * np.abs(X_acc)), Fs=Fs_feature, ax=[ax],
                                       Fs_F=1/freq_res, figsize=figsize,
                                       title='Masked STFT for accompaniment')
    plt.ylim(ylim)
    plt.show()

    libfmp.b.audio_player_list([x_mel, x_acc], [Fs, Fs], width=350,
                               columns=['Reconstructed signal for melody', 'Reconstructed signal for accompaniment'])
print('Melody separation (tol_bin = %d)'%tol_bin)
compute_plot_display_mel_acc(Fs, X, N, H, mask_mel_bin, figsize=figsize)
print('Melody separation (tol_cent = %d)'%tol_cent)
compute_plot_display_mel_acc(Fs, X, N, H, mask_mel_cent, figsize=figsize)
mask_mel_joint = np.logical_or(mask_mel_bin, mask_mel_cent)
print('Melody separation (joint mask using "logical or" operator)')
compute_plot_display_mel_acc(Fs, X, N, H, mask_mel_joint, figsize=figsize)
Assuming that an F0-trajectory is given for the melody, the overall procedure for melody-accompaniment separation is summarized in the following code cell. As an example, we use an excerpt of the song "Stop Messing With Me" by Sven Bornemark, where the melody is performed by a male singer. As input to the separation procedure, we use the constraint-based F0-trajectory as computed at the end of the FMP notebook on fundamental frequency tracking. Besides presenting the separated component signals, we also provide the ideal melody and accompaniment tracks, which are also available for this song.
def separate_melody_accompaniment(x, Fs, N, H, traj, n_harmonics=10, tol_cent=50.0):
    """F0-based melody-accompaniment separation

    Notebook: C8/C8S2_MelodyExtractSep.ipynb

    Args:
        x (np.ndarray): Audio signal
        Fs (scalar): Sampling frequency
        N (int): Window size in samples
        H (int): Hopsize in samples
        traj (np.ndarray): F0 trajectory (time in seconds in 1st column, frequency in Hz in 2nd column)
        n_harmonics (int): Number of harmonics (Default value = 10)
        tol_cent (float): Tolerance in cents (Default value = 50.0)

    Returns:
        x_mel (np.ndarray): Reconstructed audio signal for melody
        x_acc (np.ndarray): Reconstructed audio signal for accompaniment
    """
    # Compute STFT
    X = librosa.stft(x, n_fft=N, hop_length=H, win_length=N, pad_mode='constant')
    Fs_feature = Fs / H
    T_coef = np.arange(X.shape[1]) / Fs_feature
    freq_res = Fs / N
    F_coef = np.arange(X.shape[0]) * freq_res

    # Adjust trajectory to the time axis of the STFT
    traj_X_values = interp1d(traj[:, 0], traj[:, 1], kind='nearest', fill_value='extrapolate')(T_coef)
    traj_X = np.hstack((T_coef[:, None], traj_X_values[:, None]))

    # Compute binary masks
    mask_mel = convert_trajectory_to_mask_cent(traj_X, F_coef, n_harmonics=n_harmonics, tol_cent=tol_cent)
    mask_acc = np.ones(mask_mel.shape) - mask_mel

    # Compute masked STFTs
    X_mel = X * mask_mel
    X_acc = X * mask_acc

    # Reconstruct signals
    x_mel = librosa.istft(X_mel, hop_length=H, win_length=N, window='hann', center=True, length=x.size)
    x_acc = librosa.istft(X_acc, hop_length=H, win_length=N, window='hann', center=True, length=x.size)

    return x_mel, x_acc
# Load audio
fn_wav = os.path.join('..', 'data', 'C8', 'FMP_C8_Audio_Bornemark_StopMessingWithMe-Excerpt_SoundCloud_mix.wav')
Fs = 22050
x, Fs = librosa.load(fn_wav, sr=Fs)
x_duration = len(x)/Fs
fn_wav_track_mel = os.path.join('..', 'data', 'C8', 'FMP_C8_Audio_Bornemark_StopMessingWithMe-Excerpt_SoundCloud_vocals.wav')
x_track_mel, Fs = librosa.load(fn_wav_track_mel, sr=Fs)
fn_wav_track_acc = os.path.join('..', 'data', 'C8', 'FMP_C8_Audio_Bornemark_StopMessingWithMe-Excerpt_SoundCloud_accomp.wav')
x_track_acc, Fs = librosa.load(fn_wav_track_acc, sr=Fs)
# Read constraint regions
constraint_region_df = libfmp.b.read_csv(os.path.join('..', 'data', 'C8', 'FMP_C8_F11_SvenBornemark_StopMessingWithMe_CR.csv'))
constraint_region = constraint_region_df.values
# Compute trajectory
traj, Z, T_coef, F_coef_hertz, F_coef_cents = libfmp.c8.compute_traj_from_audio(x, Fs=Fs,
constraint_region=constraint_region, gamma=0.1)
N = 2048
H = N//4
x_mel, x_acc = separate_melody_accompaniment(x, Fs=Fs, N=N, H=H, traj=traj, n_harmonics=30, tol_cent=50)
libfmp.b.audio_player_list([x_mel, x_acc], [Fs, Fs], width=350,
columns=['Reconstructed signal for melody', 'Reconstructed signal for accompaniment'])
libfmp.b.audio_player_list([x_track_mel, x_track_acc], [Fs, Fs], width=350,
columns=['Ideal signal for melody', 'Ideal signal for accompaniment'])
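Since the ideal melody and accompaniment tracks are available, one can get a rough impression of how close the mask-based reconstructions come to them, for example via a simple signal-to-noise ratio. The following sketch (the helper snr_db is hypothetical) is only meant as a quick plausibility check, not as a proper source separation evaluation, which would use dedicated metrics such as SDR.

# Rough plausibility check: SNR of the reconstructed components with respect to
# the ideal tracks (sketch only; not a proper source separation evaluation)
def snr_db(x_ref, x_est):
    L = min(len(x_ref), len(x_est))           # align lengths defensively
    x_ref, x_est = x_ref[:L], x_est[:L]
    return 10 * np.log10(np.sum(x_ref ** 2) / np.sum((x_ref - x_est) ** 2))

print('SNR melody:        %.2f dB' % snr_db(x_track_mel, x_mel))
print('SNR accompaniment: %.2f dB' % snr_db(x_track_acc, x_acc))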
In this notebook, we presented a procedure for decomposing a music recording into a melody component and an accompaniment component, where we assumed that the F0-trajectory of the melody is given. Obviously, in this simplistic voice separation scenario, the requirements on the accuracy of the extracted F0-values are very high. Furthermore, nonharmonic properties of the singing voice such as fricative and plosive components (coming from consonants) have not been considered. Modeling and separating such nonharmonic components from sound mixtures generally constitutes a hard research question. Recent approaches for singing voice separation (or, more generally speaking, source separation approaches) are mostly based on deep learning techniques using large amounts of training examples. Here are some useful links: