This notebook accompanies the following paper:
The paper describes an approach for training a neural network with weakly aligned score–audio pairs to compute an enhanced chroma representation. See the paper for more details.
This notebook shows how to apply the neural network model described in the paper to compute the enhanced chroma representation. The repository contains several model variants, resulting from different training–validation splits. It also contains two public-domain audio excerpts, listed in the following table.
Composer | Work | Performer | Description | Audio
---|---|---|---|---
Beethoven | Symphony no. 5, op. 67 | Davis High School Symphony Orchestra | First movement, first theme | `audio/Beethoven_Op067-01_DavidHighSchool.wav`
Beethoven | Piano Sonata no. 2, op. 2 no. 2 | Paul Pitman | First movement, second theme | `audio/Beethoven_Op002-2-01_Pitman.wav`
We start by importing some Python packages and setting some paths. You may change the variables `cur_model_id` and `cur_audio_path` to select a different model variant or a different audio file.
import numpy as np
from matplotlib import pyplot as plt
import librosa
import IPython.display as ipd
import ctc_chroma
model_ids = ['train123valid4',
             'train234valid5',
             'train345valid1',
             'train451valid2',
             'train512valid3',
             'train1234valid5',
             'train2345valid1',
             'train3451valid2',
             'train4512valid3',
             'train5123valid4']
audio_paths = ['audio/Beethoven_Op002-2-01_Pitman.wav',
               'audio/Beethoven_Op067-01_DavidHighSchool.wav']
cur_model_id = model_ids[2]
cur_audio_path = audio_paths[1]
Next, we load the model and its weights, compute the input representation (HCQT), and apply the model. As a baseline for comparison with the enhanced chroma representation, we also compute a conventional chroma representation from the CQT that is part of the network's input (the HCQT slice for $h=1$).
# Load the selected model variant with its pre-trained weights
model = ctc_chroma.models.get_model(cur_model_id)
# Compute the HCQT input representation (25 features per second) and normalize it
hcqt, times, freqs = ctc_chroma.features.compute_hcqt_median(cur_audio_path, feature_rate=25)
hcqt_norm = librosa.util.normalize(hcqt.T, norm=2, fill=True, axis=1)
# Apply the model; the output has 13 classes (12 chroma classes plus the CTC blank)
probabilities_ctc = model.predict(hcqt_norm[np.newaxis, :, :, :])[0, :, :].T
# Enhanced chroma: drop the blank class and normalize each frame
chroma_ctc = librosa.util.normalize(probabilities_ctc[:-1, :], norm=2, fill=True, axis=0)
# Baseline chroma obtained by pooling the HCQT slice for h=1 (the CQT)
bins_per_octave, n_octaves, harmonics, sr, fmin, hop_length = ctc_chroma.features.get_hcqt_params()
chroma_cqt = librosa.feature.chroma_cqt(C=hcqt_norm[:, :, 1].T, fmin=fmin, bins_per_octave=bins_per_octave)
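As an optional check, the following minimal sketch prints the shapes of the computed representations. The interpretation of the axes and the 25 Hz feature rate are inferred from the code above, not stated by the repository.
print('HCQT input:', hcqt_norm.shape)               # assumed layout: (frames, bins, harmonics)
print('Model output:', probabilities_ctc.shape)     # (13, frames): 12 chroma classes plus CTC blank
print('Enhanced chroma:', chroma_ctc.shape)         # (12, frames)
print('Baseline chroma:', chroma_cqt.shape)         # (12, frames)
print('Duration in seconds:', chroma_ctc.shape[1] / 25)  # assuming a feature rate of 25 Hz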
Finally, the next code cell provides an audio player for the music recording and visualizes several matrices. First, we display a slice of the HCQT representation (corresponding to the fundamental, $h=1$). Second, we display a chroma representation obtained from this HCQT slice; it is not used by the neural network but shown for comparison. Third, we display the output of the network. Fourth, we visualize the enhanced chroma representation computed from the network output.
def subplot_imshow(ax, x, title, ymin, ymax, ylabel, yticks=None, yticklabels=None):
    ax.set_title(title)
    im = ax.imshow(x, aspect='auto', origin='lower', cmap='gray_r',
                   extent=[0, x.shape[1] / 25, ymin, ymax])
    ax.set_ylabel(ylabel)
    plt.colorbar(im, ax=ax)
    if yticks is not None and yticklabels is not None:
        ax.set_yticks(np.array(yticks) + 0.5)
        ax.set_yticklabels(yticklabels.split())
ipd.display(ipd.Audio(filename=cur_audio_path))
fig, ax = plt.subplots(4, 1, figsize=(12, 10), sharex=True)
midi_min = librosa.hz_to_midi(fmin)
subplot_imshow(ax[0], hcqt_norm[:, :, 1].T, 'HCQT Slice for $h=1$', midi_min, midi_min + n_octaves * 12,
               'Pitch')
subplot_imshow(ax[1], chroma_cqt, 'Chroma Pooling of HCQT Slice for $h=1$', 0, 12,
               'Chroma', [0, 2, 4, 5, 7, 9, 11], 'C D E F G A B')
subplot_imshow(ax[2], probabilities_ctc, 'Model output (CTC probabilities)', 0, 13,
               'Chroma (with $\\epsilon$)', [0, 2, 4, 5, 7, 9, 11, 12], 'C D E F G A B $\\epsilon$')
subplot_imshow(ax[3], chroma_ctc, 'Chroma representation based on model output', 0, 12,
               'Chroma', [0, 2, 4, 5, 7, 9, 11], 'C D E F G A B')
ax[3].set_xlabel('Time (seconds)')
plt.tight_layout()
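As an optional follow-up, the enhanced chroma representation can be stored for later use, for example in synchronization or retrieval experiments. The following sketch writes the feature matrix together with the frame times returned by the feature computation (assumed to match the chroma frames) to a NumPy file; the output folder and file naming are only examples and not part of the repository.
import os
os.makedirs('output', exist_ok=True)  # hypothetical output folder
basename = os.path.splitext(os.path.basename(cur_audio_path))[0]
out_path = os.path.join('output', basename + '_chroma_ctc.npz')
np.savez(out_path, chroma_ctc=chroma_ctc, times=times)  # feature matrix and frame times
print('Saved enhanced chroma to', out_path)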