On this website, you can find various demos, datasets, and source code that have been made publicly available by my research group and our partners.
Three-voiced funeral songs from Svaneti in North-West Georgia (also referred to as Zär) are believed to represent one of Georgia’s oldest preserved forms of collective music-making. Throughout a Zär performance, the singers often jointly and intentionally drift upwards in pitch. Furthermore, the singers tend to use pitch slides at the beginning and end of sung notes. As part of a study on interactive computational tools for tonal analysis of Zär, we compiled a dataset from the previously annotated audio material, which we release under an open-source license for research purposes.
We introduce a differentiable cost measure by adapting and combining existing principles for measuring intonation. In particular, our measure consists of two terms, representing a tonal aspect (the proximity to a tonal grid) and a harmonic aspect (the perceptual dissonance between salient frequencies). In an experiment, we demonstrate the potential of our approach for the task of intonation adaptation of amateur choral music using recordings from a publicly available multitrack dataset.
Choir singers typically practice their choral parts individually in preparation for joint rehearsals. In recent years, applications that support such individual rehearsals, e.g., with sing-along and score-following functionalities, have become popular. In this work, we present a web-based interface with real-time intonation feedback for choir rehearsal preparation. The interface combines several open-source tools that have been developed by the MIR community.
The Schubert Winterreise Dataset (SWD) is a multimodal dataset comprising various representations and annotations of Franz Schubert's 24-song cycle Winterreise. The primary material (raw data) consists of textual representations of the songs' lyrics, music scores in image, symbolic, and MIDI format, as well as audio recordings of nine performances. The secondary material (annotations) comprises information on musical measure positions in sheet music images and audio recordings as well as analyses of chords, local keys, global keys, and structural parts.
Musical themes are essential elements in Western classical music. In this paper, we present the Musical Theme Dataset (MTD), a multimodal dataset inspired by “A Dictionary of Musical Themes” by Barlow and Morgenstern from 1948. For a subset of 2067 themes of the printed book, we created several digital representations of the musical themes. Beyond graphical sheet music, we provide symbolic music encodings, audio snippets of music recordings, alignments between the symbolic and audio representations, as well as detailed metadata on the composer, work, recording, and musical characteristics of the themes. In addition to the data, we also make several parsers and web-based interfaces available to access and explore the different modalities and their relations through visualizations and sonifications. These interfaces also include computational tools, bridging the gap between the original dictionary and music information retrieval (MIR) research. The dataset is of relevance for various subfields and tasks in MIR, such as cross-modal music retrieval, music alignment, optical music recognition, music transcription, and computational musicology.
Many music information retrieval tasks involve the comparison of a symbolic score representation with an audio recording. A typical strategy is to compare score–audio pairs based on a common mid-level representation, such as chroma features. Several recent studies demonstrated the effectiveness of deep learning models that learn task-specific mid-level representations from temporally aligned training pairs. However, in practice, there is often a lack of strongly aligned training data, in particular for real-world scenarios. In our study, we use weakly aligned score–audio pairs for training, where only the beginning and end of a score excerpt are annotated in an audio recording, without aligned correspondences in between. To exploit such weakly aligned data, we employ the Connectionist Temporal Classification (CTC) loss to train a deep learning model for computing an enhanced chroma representation. We then apply this model to a cross-modal retrieval task, where we aim at finding relevant audio recordings of Western classical music, given a short monophonic musical theme in symbolic notation as a query. We present systematic experiments that show the effectiveness of the CTC-based model for this theme-based retrieval task.
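To make the idea of weakly aligned training more concrete, the following sketch shows how a CTC loss can be combined with a small sequence model in PyTorch. The network architecture, feature dimensions, and class layout are illustrative assumptions and not the model used in our study.

```python
import torch
import torch.nn as nn

# Minimal sketch (not our actual implementation): train a small network to map
# an audio feature sequence to frame-wise pitch-class probabilities using the
# CTC loss, supervised only by the (unaligned) pitch-class sequence of the score.

class ChromaEnhancer(nn.Module):
    def __init__(self, n_bins=120, n_classes=13):  # 12 pitch classes + CTC blank
        super().__init__()
        self.rnn = nn.GRU(n_bins, 64, batch_first=True, bidirectional=True)
        self.out = nn.Linear(128, n_classes)

    def forward(self, x):                    # x: (batch, time, n_bins)
        h, _ = self.rnn(x)
        return self.out(h).log_softmax(-1)   # (batch, time, n_classes)

model = ChromaEnhancer()
ctc = nn.CTCLoss(blank=12)                   # treat index 12 as the blank symbol
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Dummy weakly aligned pair: 500 audio frames, 30 score pitch classes (0..11).
features = torch.randn(1, 500, 120)
targets = torch.randint(0, 12, (1, 30))

optimizer.zero_grad()
log_probs = model(features).permute(1, 0, 2)   # CTC expects (time, batch, classes)
loss = ctc(log_probs, targets,
           input_lengths=torch.tensor([500]),
           target_lengths=torch.tensor([30]))
loss.backward()
optimizer.step()
```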
Choral singing is a central part of musical cultures across the world, yet many facets of this widespread form of polyphonic singing are still to be explored. Music information retrieval (MIR) research on choral singing benefits from multitrack recordings of the individual singing voices. However, only a few multitrack datasets of polyphonic singing are publicly available. In this paper, we present Dagstuhl ChoirSet (DCS), a multitrack dataset of a cappella choral music designed to support MIR research on choral singing. The dataset includes recordings of an amateur vocal ensemble performing two choir pieces in full choir and quartet settings. The audio data was recorded during an MIR seminar at Schloss Dagstuhl using different close-up microphones to capture the individual singers’ voices.
The analysis of recorded audio material using computational methods has received increased attention in ethnomusicological research. We present a curated dataset of traditional Georgian vocal music for computational musicology. The corpus is based on historic tape recordings of three-voice Georgian songs performed by the former master chanter Artem Erkomaishvili. In this article, we give a detailed overview of the audio material, transcriptions, and annotations contained in the dataset. Beyond its importance for ethnomusicological research, this carefully organized and annotated corpus constitutes a challenging scenario for music information retrieval tasks such as fundamental frequency estimation, onset detection, and score-to-audio alignment. The corpus is publicly available and accessible through score-following web-players.
Tempo and beat are fundamental properties of music. Many tempo estimation procedures suffer from so-called tempo octave errors, where a tempo octave refers to the relation between two tempi that differ by a factor of two (half or double the value). More generally, there is often a confusion between tempi that differ by integer multiples (tempo harmonics) or integer fractions (tempo subharmonics). In our contributions, we present different post-processing procedures for mitigating the effect of tempo confusion. Furthermore, we present a novel tempo estimation approach based on convolutional neural networks (CNNs). When computing a tempo value, previous approaches typically proceed in two steps. In the first step, the audio signal is converted into a novelty representation, which indicates note onset candidates. In the second step, a periodicity analysis is performed on the novelty representation to derive a tempo value. Rather than following this two-step strategy, our procedure directly outputs a tempo estimate from a given time-frequency representation of the input audio signal (using patches of 11.9 seconds duration). We conducted extensive experiments to validate and compare this novel procedure with previous approaches. To this end, we not only used existing datasets, but also created several novel datasets comprising different music genres such as rock, pop, dance music, country, Latin, and electronic music.
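For illustration, the following sketch implements the classical two-step baseline described above (novelty curve followed by a periodicity analysis) using librosa; it is not the proposed CNN-based approach, and the chosen tempo range is an assumption.

```python
import numpy as np
import librosa

# Classical two-step tempo estimation: novelty curve, then periodicity analysis.

y, sr = librosa.load(librosa.example('brahms'))
hop = 512

# Step 1: novelty representation indicating note onset candidates.
novelty = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop)

# Step 2: periodicity analysis via autocorrelation of the novelty curve.
ac = librosa.autocorrelate(novelty, max_size=novelty.size)
feature_rate = sr / hop                      # novelty frames per second
lags = np.arange(1, ac.size)                 # skip lag 0
bpm = 60.0 * feature_rate / lags             # convert lag (frames) to tempo (BPM)

# Restrict to a plausible tempo range to reduce tempo octave confusion.
valid = (bpm >= 40) & (bpm <= 240)
tempo_estimate = bpm[valid][np.argmax(ac[1:][valid])]
print(f'Estimated tempo: {tempo_estimate:.1f} BPM')
```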
Nonnegative matrix factorization (NMF) is a family of methods widely used for information retrieval across domains including text, images, and audio. Within music processing, NMF has been used for tasks such as transcription, source separation, and structure analysis. Prior work has shown that initialization and constrained update rules can drastically improve the chances of NMF converging to a musically meaningful solution. Along these lines we present the NMF toolbox, containing MATLAB and Python implementations of conceptually distinct NMF variants—in particular, this paper gives an overview for two algorithms. The first variant, called nonnegative matrix factor deconvolution (NMFD), extends the original NMF algorithm to the convolutive case, enforcing the temporal order of spectral templates. The second variant, called diagonal NMF, supports the development of sparse diagonal structures in the activation matrix. Our toolbox contains several demo applications and code examples to illustrate its potential and functionality. By providing MATLAB and Python code on a documentation website under a GNU-GPL license, as well as including illustrative examples, our aim is to foster research and education in the field of music processing.
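As a minimal illustration of the kind of algorithm contained in the toolbox, the following Python sketch implements plain NMF with multiplicative updates; for the NMFD and diagonal NMF variants, please refer to the toolbox itself.

```python
import numpy as np

def simple_nmf(V, rank, n_iter=200, eps=1e-9, seed=0):
    """Factorize a nonnegative matrix V (e.g., a magnitude spectrogram)
    into templates W and activations H such that V is approximately W @ H."""
    rng = np.random.default_rng(seed)
    K, N = V.shape
    W = rng.random((K, rank)) + eps
    H = rng.random((rank, N)) + eps
    for _ in range(n_iter):
        # Multiplicative updates preserve nonnegativity (and zeros) by construction.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

V = np.abs(np.random.randn(1025, 400))   # stand-in for a magnitude spectrogram
W, H = simple_nmf(V, rank=8)
print('Reconstruction error:', np.linalg.norm(V - W @ H))
```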
From the 19th century on, several composers of Western opera made use of leitmotifs (short musical ideas referring to semantic entities such as characters, places, items, or feelings) for guiding the audience through the plot and illustrating the events on stage. A prime example of this compositional technique is Richard Wagner’s four-opera cycle Der Ring des Nibelungen. Across its different occurrences in the score, a leitmotif may undergo considerable musical variations. Additionally, the concrete leitmotif instances in an audio recording are subject to acoustic variability. Our paper approaches the task of classifying such leitmotif instances in audio recordings. As our main contribution, we conduct a case study on a dataset covering 16 recorded performances of the Ring with annotations of ten central leitmotifs, leading to 2403 occurrences and 38448 instances in total. We build a neural network classification model and evaluate its ability to generalize across different performances and leitmotif occurrences. Our findings demonstrate the possibilities and limitations of leitmotif classification in audio recordings and pave the way towards the fully automated detection of leitmotifs in music recordings.
In score following, one main goal is to highlight measure positions in sheet music synchronously to audio playback. Such applications require alignments between sheet music and audio representations. These alignments can often be computed automatically when the sheet music is given in a symbolically encoded music format. However, sheet music is often available only in the form of digitized scans, in which case the automated computation of accurate alignments still poses many challenges. In this contribution, we present various semi-automatic tools for solving the subtask of determining bounding boxes (given in pixels) of measure positions in digital scans of sheet music—a task that is extremely tedious when done manually.
In this paper, we consider a cross-modal retrieval scenario of Western classical music. Given a short monophonic musical theme in symbolic notation as query, the objective is to find relevant audio recordings in a database. A major challenge of this retrieval task is the possible difference in the degree of polyphony between the monophonic query and the music recordings. Previous studies for popular music addressed this issue by performing the cross-modal comparison based on predominant melodies extracted from the recordings. For Western classical music, however, this approach is problematic since the underlying assumption of a single predominant melody is often violated. Instead of extracting the melody explicitly, another strategy is to perform the cross-modal comparison directly on the basis of melody-enhanced salience representations. As the main contribution of this paper, we evaluate several conceptually different salience representations for our cross-modal retrieval scenario. Our extensive experiments, whose results have been made available on a website, are based on more than 2000 musical themes and 100 hours of audio recordings.
Due to the complex nature of the human voice, the computational analysis of polyphonic vocal music recordings constitutes a challenging scenario. Development and evaluation of automated music processing methods often rely on multitrack recordings comprising one or several tracks per voice. However, recording singers separately is neither always possible, nor is it generally desirable. As a consequence, producing clean recordings of individual voices for computational analysis is problematic. In this context, one may use throat microphones which capture the vibrations of a singer's throat, thus being robust to other surrounding acoustic sources. In this contribution, we sketch the potential of such microphones for music information retrieval tasks such as melody extraction. Furthermore, we report on first experiments conducted in the course of a recent project on computational ethnomusicology, where we use throat microphones to analyze traditional three-voice Georgian vocal music.
Music can be represented in many different ways. In particular, audio and sheet music renditions are of high importance in Western classical music. For choral music, a sheet music representation typically consists of several parts (for the individual singing voice sections) and possibly an accompaniment. Within a choir rehearsal scenario, there are various tasks that can be supported by techniques developed in music information retrieval (MIR). For example, it may be helpful for a singer if both the audio and sheet music modalities are presented synchronously—a task commonly known as score following. Furthermore, listening to individual parts of choral music can be very instructive for practicing. The listening experience can be enhanced by switching between the audio tracks of a suitable multitrack recording. In this contribution, we introduce a web-based interface that integrates score-following and track-switching functionalities, built upon existing web technologies.
Redrumming or drum replacement is used to substitute or enhance the drum hits in a song with one-shot drum sounds obtained from an external collection or database. In an ideal setting, this is done on multitrack audio, where one or more tracks are dedicated exclusively to drums and percussion. However, most non-professional producers and DJs only have access to mono or stereo downmixes of the music they work with. Motivated by this scenario, as well as previous work on decomposition techniques for audio signals, we propose a step towards enabling full-fledged redrumming with mono downmixes.
In magnetic resonance imaging (MRI), a patient is exposed to beat-like knocking sounds, often interrupted by periods of silence, which are caused by pulsing currents of the MRI scanner. In order to increase the patient's comfort, one strategy is to play back ambient music to induce positive emotions and to reduce stress during the MRI scanning process. To create an overall acceptable acoustic environment, one idea is to adapt the music to the locally periodic acoustic MRI noise. Motivated by this scenario, we consider in this contribution the general problem of adapting a given music recording to fulfill certain temporal constraints. More concretely, the constraints are given by a reference time axis with specified time points (e.g., the time positions of the MRI scanner's knocking sounds). Then, the goal is to temporally modify a suitable music recording such that its beat positions align with the specified time points. As one technical contribution, we model this alignment task as an optimization problem with the objective to fulfill the constraints while avoiding strong local distortions in the music. Furthermore, we introduce an efficient algorithm based on dynamic programming for solving this task. Based on the computed alignment, we use existing time-scale modification procedures for locally adapting the music recording. To illustrate the outcome of our procedure, we discuss representative synthetic and real-world examples, which can be accessed via an interactive website. In particular, these examples indicate the potential of automated methods for noise beautification within the MRI application scenario.
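The following Python sketch illustrates how such a beat-to-constraint alignment can be phrased as a dynamic programming problem; the cost function (penalizing local stretch factors that deviate from one) is a simplified assumption and not the exact formulation of the paper.

```python
import numpy as np

# Simplified sketch: assign each reference time point a beat of the music
# recording such that the assignment is strictly increasing and local stretch
# factors stay close to 1, i.e., local distortions are kept small.

def align_beats_to_reference(ref_points, beats, lam=1.0):
    M, N = len(ref_points), len(beats)
    D = np.full((M, N), np.inf)          # D[i, j]: best cost if ref i maps to beat j
    P = np.full((M, N), -1, dtype=int)
    D[0, :] = 0.0
    for i in range(1, M):
        d_ref = ref_points[i] - ref_points[i - 1]
        for j in range(i, N):
            for k in range(i - 1, j):
                stretch = (beats[j] - beats[k]) / d_ref
                cost = D[i - 1, k] + lam * abs(np.log(stretch))  # penalize deviation from 1
                if cost < D[i, j]:
                    D[i, j], P[i, j] = cost, k
    # Backtrack the optimal monotonic assignment.
    j = int(np.argmin(D[M - 1]))
    path = [j]
    for i in range(M - 1, 0, -1):
        j = P[i, j]
        path.append(j)
    return list(reversed(path))          # beat index assigned to each reference point

ref = np.array([0.0, 1.0, 2.0, 3.0, 4.0])                          # e.g., knocking times (s)
beats = np.array([0.1, 0.6, 1.15, 1.7, 2.2, 2.8, 3.3, 3.9, 4.4])   # detected beats (s)
print(align_beats_to_reference(ref, beats))
```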
In Western popular music, drums and percussion are an important means to emphasize and shape the rhythm, often defining the musical style. If computers were able to analyze the drum part in recorded music, it would enable a variety of rhythm-related music processing tasks. Especially the detection and classification of drum sound events by computational methods is considered to be an important and challenging research problem in the broader field of Music Information Retrieval. Over the last two decades, several authors have attempted to tackle this problem under the umbrella term Automatic Drum Transcription (ADT). This paper presents a comprehensive review of ADT research, including a thorough discussion of the task-specific challenges, categorization of existing techniques, and evaluation of several state-of-the-art systems. To provide more insights on the practice of ADT systems, we focus on two families of ADT techniques, namely methods based on Nonnegative Matrix Factorization and Recurrent Neural Networks. We explain the methods' technical details and drum-specific variations and evaluate these approaches on publicly available datasets with a consistent experimental setup. Finally, the open issues and under-explored areas in ADT research are identified and discussed, providing future directions in this field.
This paper addresses the separation of drums from music recordings, a task closely related to harmonic-percussive source separation (HPSS). In previous works, two families of algorithms have been prominently applied to this problem. They are based either on local filtering and diffusion schemes, or on global low-rank models. In this paper, we propose to combine the advantages of both paradigms. To this end, we use a local approach based on Kernel Additive Modeling (KAM) to extract an initial guess for the percussive and harmonic parts. Subsequently, we use Non-Negative Matrix Factorization (NMF) with soft activation constraints as a global approach to jointly enhance both estimates. As an additional contribution, we introduce a novel constraint for enhancing percussive activations and a scheme for estimating the percussive weight of NMF components. Throughout the paper, we use a real-world music example to illustrate the ideas behind our proposed method. Finally, we report promising BSS Eval results achieved with the publicly available test corpora ENST-Drums and QUASI, which contain isolated drum and accompaniment tracks.
Website for Dataset and Annotations
The analysis of recorded audio sources has become increasingly important in ethnomusicological research. Such audio material may contain important cues on performance practice, information that is often lost in manually generated symbolic music transcriptions. In collaboration with Frank Scherbaum (Universität Potsdam, Germany), we considered a musically relevant audio collection that consists of more than 100 three-voice polyphonic Georgian chants. These songs were performed by Artem Erkomaishvili (1887–1967)—one of the last representatives of the master chanters of Georgian music—in a three-stage recording process. This website provides the segment annotations as well as the F0 annotations for each of the songs in a simple CSV format. Furthermore, visualizations and sonifications of the F0 trajectories are provided.
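As a starting point for working with the data, the following sketch loads and plots an F0 trajectory from a CSV file; the file name and column layout are assumptions, so please consult the dataset documentation for the actual format.

```python
import numpy as np
import matplotlib.pyplot as plt

# Sketch for working with the CSV annotations (file name and column layout are
# assumptions). Here we assume an F0 file with one time-stamp column (seconds)
# and one frequency column (Hz).

f0 = np.genfromtxt('GCH_009_voice1_f0.csv', delimiter=',')  # hypothetical file name
time, freq = f0[:, 0], f0[:, 1]

# Plot the F0 trajectory, masking unvoiced frames (encoded as 0 Hz in this sketch).
voiced = freq > 0
plt.plot(time[voiced], freq[voiced], '.', markersize=2)
plt.xlabel('Time (s)')
plt.ylabel('F0 (Hz)')
plt.title('F0 trajectory of the first voice')
plt.show()
```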
DJs and producers of sample-based electronic dance music (EDM) use breakbeats as an essential building block and rhythmic foundation for their artistic work. The practice of reusing and resequencing sampled drum breaks critically influenced modern musical genres such as hip hop, drum'n'bass, and jungle. While EDM artists have primarily sourced drum breaks from funk, soul, and jazz recordings from the 1960s to 1980s, they can potentially be sampled from music of any genre. In this paper, we introduce and formalize the task of automatically finding suitable drum breaks in music recordings. By adapting an approach previously used for singing voice detection, we establish a first baseline for drum break detection. Besides a quantitative evaluation, we discuss benefits and limitations of our procedure by considering a number of challenging examples.
A typical micro-rhythmic trait of jazz performances is their swing feel. According to several studies, uneven eighth notes contribute decisively to this perceived quality. In this paper we analyze the swing ratio (beat-upbeat ratio) implied by the drummer on the ride cymbal. Extending previous work, we propose a new method for semi-automatic swing ratio estimation based on pattern recognition in onset sequences. As a main contribution, we introduce a novel time-swing ratio representation called swingogram, which locally captures information related to the swing ratio over time. Based on this representation, we propose to track the most plausible trajectory of the swing ratio of the ride cymbal pattern over time via dynamic programming. We show how this kind of visualization leads to interesting insights into the peculiarities of jazz musicians improvising together.
Music with its many representations can be seen as a multimedia scenario: There exist a number of different media objects (e.g., video recordings, lyrics, or sheet music) besides the actual music recording, which describe the music in different ways. In the course of digitization efforts, many of these media objects are nowadays publicly available on the Internet. However, the media objects are usually accessed individually without using their musical relationships. Using these relationships could open up new ways of navigating and interacting with the music. In this work, we model these relationships by taking the opera Die Walküre by Richard Wagner as a case study. As a first step, we describe the opera as a multimedia scenario and introduce the considered media objects. By using manual annotations, we establish mutual relationships between the media objects. These relationships are then modeled in a database schema. Finally, we present a web-based demonstrator which offers several ways of navigation within the opera recordings and allows for accessing the media objects in a user-friendly way.
To learn an instrument, many people acquire the necessary sensorimotor and musical skills by imitating their teachers. In the case of studying jazz improvisation, the student needs to learn fundamental harmonic principles. In this work, we indicate the potential of incorporating computer-assisted methods in jazz piano lessons. In particular, we present a web-based tool that offers easy interaction with the provided multimedia content. This tool enables the student to revise the lesson's content at an individual tempo with the help of recorded and annotated examples.
Electronic Music (EM) is a popular family of genres which has increasingly received attention as a research subject in the field of MIR. A fundamental structural unit in EM are loops – audio fragments whose length can span several seconds. The devices commonly used to produce EM, such as sequencers and digital audio workstations, impose a musical structure in which loops are repeatedly triggered and overlaid. This particular structure allows new perspectives on well-known MIR tasks. In this paper we first review a prototypical production technique for EM from which we derive a simplified model. We then use our model to illustrate approaches for the following task: given a set of loops that were used to produce a track, decompose the track by finding the points in time at which each loop was activated. To this end, we repurpose established MIR techniques such as fingerprinting and non-negative matrix factor deconvolution.
Harmonic-percussive separation is a technique that splits music recordings into harmonic and percussive components—it can be used as a preprocessing step to facilitate further tasks like key detection (harmonic component) or drum transcription (percussive component). In this demo, we propose a cascaded harmonic-residual-percussive (HRP) procedure yielding a mid-level feature to analyze musical phenomena like percussive event density, timbral changes, and homogeneous structural segments.
In this work, we focus on transcribing walking bass lines, which provide clues for revealing the actual played chords in jazz recordings. Our transcription method is based on a deep neural network (DNN) that learns a mapping from a mixture spectrogram to a salience representation that emphasizes the bass line. Furthermore, using beat positions, we apply a late-fusion approach to obtain beat-wise pitch estimates of the bass line. First, our results show that this DNN-based transcription approach outperforms state-of-the-art transcription methods for the given task. Second, we found that an augmentation of the training set using pitch shifting improves the model performance. Finally, we present a semi-supervised learning approach where additional training data is generated from predictions on unlabeled datasets.
Retrieving short monophonic queries in music recordings is a challenging research problem in Music Information Retrieval (MIR). In jazz music, given a solo transcription, one retrieval task is to find the corresponding (potentially polyphonic) recording in a music collection. Many conventional systems approach such retrieval tasks by first extracting the predominant F0-trajectory from the recording, then quantizing the extracted trajectory to musical pitches and finally comparing the resulting pitch sequence to the monophonic query. In this paper, we introduce a data-driven approach that avoids the hard decisions involved in conventional approaches: Given pairs of time-frequency (TF) representations of full music recordings and TF representations of solo transcriptions, we use a DNN-based approach to learn a mapping for transforming a "polyphonic" TF representation into a "monophonic" TF representation. This transform can be considered as a kind of solo voice enhancement. We evaluate our approach within a jazz solo retrieval scenario and compare it to a state-of-the-art method for predominant melody extraction.
In view of applying Music Information Retrieval (MIR) techniques for music production, our goal is to extract high-quality component signals from drum solo recordings (so-called breakbeats). Specifically, we employ audio source separation techniques to recover sound events from the drum sound mixture that correspond to the individual drum strokes. Our separation approach is based on an informed variant of Non-Negative Matrix Factor Deconvolution (NMFD) that has been proposed and applied to drum transcription and separation in earlier works. In this article, we systematically study the suitability of NMFD and the impact of audio- and score-based side information in the context of drum separation. In the case of imperfect decompositions, we observe different cross-talk artifacts appearing during the attack and the decay segment of the extracted drum sounds. Based on these findings, we propose and evaluate two extensions to the core technique. The first extension is based on applying a cascaded NMFD decomposition while retaining selected side information. The second extension is a time-frequency selective restoration approach using a dictionary of single note drum sounds. For all our experiments, we use a publicly available dataset consisting of multitrack drum recordings and corresponding annotations that allows us to evaluate the source separation quality. Using this test set, we show that our proposed methods can lead to an improved quality of the component signals.
The automated analysis of vibrato in complex music signals is a highly challenging task. A common strategy is to proceed in a two-step fashion. First, a fundamental frequency (F0) trajectory for the musical voice that is likely to exhibit vibrato is estimated. In a second step, the trajectory is then analyzed with respect to periodic frequency modulations. As a major drawback, however, such a method cannot recover from errors made in the inherently difficult first step, which severely limits the performance during the second step. In this work, we present a novel vibrato analysis approach that avoids the first error-prone F0-estimation step. Our core idea is to perform the analysis directly on a signal's spectrogram representation where vibrato is evident in the form of characteristic spectro-temporal patterns. We detect and parameterize these patterns by locally comparing the spectrogram with a predefined set of vibrato templates. Our systematic experiments indicate that this approach is more robust than F0-based strategies.
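The following sketch illustrates the general idea of template-based vibrato detection by correlating a spectrogram with a sinusoidal frequency-modulation template; the input file, vibrato rate, and extent are illustrative assumptions rather than the settings used in our experiments.

```python
import numpy as np
import librosa
from scipy.signal import correlate2d

# Illustrative sketch of template-based vibrato detection.

y, sr = librosa.load('vocal_recording.wav')          # hypothetical input file
S = np.abs(librosa.stft(y, n_fft=2048, hop_length=256))
frame_rate = sr / 256

# Build a small binary template tracing one vibrato period (assumed 6 Hz rate,
# +/- 2 frequency bins extent) in the time-frequency plane.
rate_hz, extent_bins = 6.0, 2
t = np.arange(int(round(frame_rate / rate_hz)))      # frames per vibrato period
traj = np.round(extent_bins * np.sin(2 * np.pi * rate_hz * t / frame_rate)).astype(int)
template = np.zeros((2 * extent_bins + 1, len(t)))
template[traj + extent_bins, np.arange(len(t))] = 1.0

# Correlate the (log-compressed) spectrogram with the template; high values
# indicate spectro-temporal patterns that resemble the modeled vibrato.
response = correlate2d(np.log1p(S), template, mode='same')
peak_bin, peak_frame = np.unravel_index(np.argmax(response), response.shape)
print('Strongest vibrato-like pattern around',
      f'{peak_frame / frame_rate:.2f} s, frequency bin {peak_bin}')
```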
Harmonic–percussive–residual (HPR) sound separation is a useful preprocessing tool for applications such as pitched instrument transcription or rhythm extraction. In this demo, we show results from a novel method that uses the structure tensor—a mathematical tool known from image processing—to calculate predominant orientation angles in the magnitude spectrogram. This orientation information can be used to distinguish between harmonic, percussive, and residual signal components, even in the case of frequency modulated signals.
Given a music recording, the objective of music structure analysis is to identify important structural elements and to temporally segment the recording according to these elements. As an important technical tool, the concept of self-similarity matrices is of fundamental importance in computational music structure analysis. In this demo, you can find many examples of such matrices for recordings of the "Winterreise" (Winter Journey). This song cycle, which consists of 24 songs for single voice (usually sung by a tenor or baritone) accompanied by a piano, was composed by Franz Schubert in 1827 (D 911, op. 89).
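The matrices shown in this demo can be approximated with a few lines of Python; the following sketch computes a basic chroma-based self-similarity matrix (without the enhancement strategies used for the figures), with an assumed input file name.

```python
import numpy as np
import librosa
import matplotlib.pyplot as plt

# Minimal sketch of a chroma-based self-similarity matrix (SSM).

y, sr = librosa.load('schubert_winterreise_song.wav')     # hypothetical file name
chroma = librosa.feature.chroma_cqt(y=y, sr=sr, hop_length=4096)
chroma = librosa.util.normalize(chroma, norm=2, axis=0)   # unit-norm feature vectors

ssm = chroma.T @ chroma                                   # cosine similarity matrix

plt.imshow(ssm, origin='lower', cmap='gray_r', aspect='equal')
plt.xlabel('Time (frames)')
plt.ylabel('Time (frames)')
plt.title('Chroma-based self-similarity matrix')
plt.show()
```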
The Single Microphone Switcher is a demo for exploring the recordings of the individual microphones used for the Freischütz Multitrack Dataset. It sketches how the microphones were positioned in the room, and the interface provides the possibility to listen to the individual microphone recordings. Furthermore, the instrument activation matrix provides a visualization showing which instruments are active (black) or inactive (white) at the current playback position.
When recording a live musical performance, the different voices, such as the instrument groups or soloists of an orchestra, are typically recorded in the same room simultaneously, with at least one microphone assigned to each voice. However, it is difficult to acoustically shield the microphones. In practice, each one contains interference from every other voice. In this paper, we aim to reduce these interferences in multi-channel recordings to recover only the isolated voices. Following the recently proposed Kernel Additive Modeling framework, we present a method that iteratively estimates both the power spectral density of each voice and the corresponding strength in each microphone signal. With this information, we build an optimal Wiener filter, strongly reducing interferences. The trade-off between distortion and separation can be controlled by the user through the number of iterations of the algorithm. Furthermore, we present a computationally efficient approximation of the iterative procedure. Listening tests demonstrate the effectiveness of the method.
A swarm of bees buzzing “Let it be” by the Beatles or the wind gently howling the romantic “Gute Nacht” by Schubert – these are examples of audio mosaics as we want to create them. Given a target and a source recording, the goal of audio mosaicing is to generate a mosaic recording that conveys musical aspects (like melody and rhythm) of the target, using sound components taken from the source. In this work, we propose a novel approach for automatically generating audio mosaics with the objective to preserve the source’s timbre in the mosaic. Inspired by algorithms for non-negative matrix factorization (NMF), our idea is to use update rules to learn an activation matrix that, when multiplied with the spectrogram of the source recording, resembles the spectrogram of the target recording. However, when applying the original NMF procedure, the resulting mosaic does not adequately reflect the source’s timbre. As our main technical contribution, we propose an extended set of update rules for the iterative learning procedure that supports the development of sparse diagonal structures in the activation matrix. We show how these structures better retain the source’s timbral characteristics in the resulting mosaic.
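To give a rough idea of the procedure, the following sketch combines a standard NMF activation update with a diagonal smoothing step; the actual update rules proposed in the paper differ, and the file names are placeholders.

```python
import numpy as np
import librosa
from scipy.ndimage import convolve

# Simplified sketch of NMF-based mosaicing with diagonally enhanced activations.

target, sr = librosa.load('target.wav')      # hypothetical file names
source, _ = librosa.load('source.wav', sr=sr)

V = np.abs(librosa.stft(target))             # target spectrogram (to be approximated)
W = np.abs(librosa.stft(source))             # source frames act as fixed templates
H = np.random.rand(W.shape[1], V.shape[1]) + 1e-9
kernel = np.eye(15)                          # diagonal smoothing kernel (15 frames)

for _ in range(30):
    # Multiplicative update for H with W kept fixed (Euclidean cost).
    H *= (W.T @ V) / (W.T @ (W @ H) + 1e-9)
    # Promote diagonal structures, i.e., contiguous runs of source frames.
    H = convolve(H, kernel, mode='constant')

mosaic_mag = W @ H
mosaic = librosa.griffinlim(mosaic_mag)      # rough phase reconstruction for listening
```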
The problem of extracting singing voice from music recordings has received increasing research interest in recent years. Many proposed decomposition techniques are based on one of the following two strategies. The first approach is to directly decompose a given music recording into one component for the singing voice and one for the accompaniment by exploiting knowledge about specific characteristics of singing voice. Procedures following the second approach disassemble the recording into a large set of fine-grained components, which are classified and reassembled afterwards to yield the desired source estimates. In this paper, we propose a novel approach that combines the strengths of both strategies. We first apply different audio decomposition techniques in a cascaded fashion to disassemble the music recording into a set of mid-level components. This decomposition is fine enough to model various characteristics of singing voice, but coarse enough to keep an explicit semantic meaning of the components. These properties allow us to directly reassemble the singing voice and the accompaniment from the components. Our objective and subjective evaluations show that this strategy can compete with state-of-the-art singing voice separation algorithms and yields perceptually appealing results.
In recent years, methods to decompose an audio signal into a harmonic and a percussive component have received a lot of interest and are frequently applied as a processing step in a variety of scenarios. One problem is that the computed components are often not of purely harmonic or percussive nature but also contain noise-like sounds that are neither clearly harmonic nor percussive. Furthermore, depending on the parameter settings, one often can observe a leakage of harmonic sounds into the percussive component and vice versa. In this paper, we present two extensions to a state-of-the-art harmonic-percussive separation procedure to target these problems. First, we introduce a separation factor parameter into the decomposition process that allows for tightening separation results and for enforcing the components to be clearly harmonic or percussive. As a second contribution, inspired by the classical sines+transients+noise (STN) audio model, this novel concept is exploited to add a third residual component to the decomposition which captures the sounds that lie in between the clearly harmonic and percussive sounds of the audio signal.
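A similar harmonic-percussive-residual decomposition can be obtained with librosa, where the margin parameter plays a role comparable to the separation factor described above; this is an analogous implementation, not the code used in the paper.

```python
import numpy as np
import librosa

# Harmonic-percussive-residual decomposition using librosa's HPSS implementation.

y, sr = librosa.load(librosa.example('brahms'))
S = librosa.stft(y)

# With margin > 1, only clearly harmonic/percussive bins are assigned;
# everything else ends up in the residual component.
H, P = librosa.decompose.hpss(S, margin=3.0)
R = S - H - P

y_harmonic = librosa.istft(H, length=len(y))
y_percussive = librosa.istft(P, length=len(y))
y_residual = librosa.istft(R, length=len(y))
```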
A major problem in time-scale modification (TSM) of music signals is that percussive transients are often perceptually degraded. To prevent this degradation, some TSM approaches try to explicitly identify transients in the input signal and to handle them in a special way. However, such approaches are problematic for two reasons. First, errors in the transient detection have an immediate influence on the final TSM result and, second, a perceptually transparent preservation of transients is by no means a trivial task. In this paper, we present a TSM approach that handles transients implicitly by first separating the signal into a harmonic component as well as a percussive component which typically contains the transients. While the harmonic component is modified with a phase vocoder approach using a large frame size, the noise-like percussive component is modified with a simple time-domain overlap-add technique using a short frame size, which preserves the transients to a high degree without any explicit transient detection.
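The following sketch illustrates the underlying idea with standard Python tools: the signal is split into harmonic and percussive components, the harmonic part is stretched with a phase vocoder, and the percussive part with a naive time-domain overlap-add; it is a conceptual approximation, not the proposed algorithm.

```python
import numpy as np
import librosa

def ola_stretch(x, rate, frame=1024, hop_out=512):
    """Naive OLA: copy short windowed frames onto a stretched time axis."""
    hop_in = int(round(hop_out * rate))
    window = np.hanning(frame)
    n_frames = max(1, (len(x) - frame) // hop_in)
    y = np.zeros(n_frames * hop_out + frame)
    norm = np.zeros_like(y)
    for i in range(n_frames):
        seg = x[i * hop_in: i * hop_in + frame] * window
        y[i * hop_out: i * hop_out + frame] += seg
        norm[i * hop_out: i * hop_out + frame] += window
    return y / np.maximum(norm, 1e-6)

x, sr = librosa.load(librosa.example('brahms'))
rate = 0.8                                          # slow down to 80% of the original tempo
x_h, x_p = librosa.effects.hpss(x)                  # harmonic and percussive components
y_h = librosa.effects.time_stretch(x_h, rate=rate)  # phase vocoder on the harmonic part
y_p = ola_stretch(x_p, rate)                        # simple OLA on the percussive part
n = min(len(y_h), len(y_p))
y = y_h[:n] + y_p[:n]                               # recombine the stretched components
```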
The separation of different sound sources from polyphonic music recordings constitutes a complex task since one has to account for different musical and acoustical aspects. In recent years, various score-informed procedures have been suggested where musical cues such as pitch, timing, and track information are used to support the source separation process. In this paper, we discuss a framework for decomposing a given music recording into notewise audio events which serve as elementary building blocks. In particular, we introduce an interface that employs the additional score information to provide a natural way for a user to interact with these audio events. By simply selecting arbitrary note groups within the score, a user can access, modify, or analyze corresponding events in a given audio recording. In this way, our framework not only opens up new ways for audio editing applications, but also serves as a valuable tool for evaluating and better understanding the results of source separation algorithms.
Within the Freischütz Digital project, three numbers (No. 6, 8, and 9) of the opera "Der Freischütz" have been produced by the Erich-Thienhaus-Institute (HfM Detmold). The main purpose for the recording sessions was to produce royalty free audio material that can be used for demonstration and research purposes. The recording was carried out by the Tonmeister students Stefan Antonin (No. 6), Florian Bitzer (No. 8), and Matthias Kieslich (No. 9) under the supervision of Prof. Dipl-Tonm. Bernhard Güttler, and Prof. Dipl-Tonm. Michael Sandner. Besides a professional stereo mix of the three numbers, the dataset provides the raw multitrack recordings from the individual microphones as well as individual group mixes that emphasize different voices or instrument sections.
The TSM toolbox has been developed by Jonathan Driedger and Meinard Müller. It contains MATLAB implementations of various classical time-scale modification (TSM) algorithms like OLA, WSOLA, and the phase vocoder. Furthermore, the toolbox also provides the code for a recently proposed TSM algorithm based on a combination of the classical algorithms as well as harmonic-percussive source separation (HPSS). Finally, it also includes a wrapper function that allows calling the commercial state-of-the-art TSM algorithm élastique by zPlane directly from MATLAB. To show how the algorithms can be applied and to give some demo applications, the toolbox also includes several demo scripts and additional code examples. The MATLAB implementations provided on this website are published under the terms of the General Public License (GPL).
The SM Toolbox has been developed by Meinard Müller, Nanzhu Jiang, Peter Grosche, and Harald G. Grohganz. It contains MATLAB implementations for computing and enhancing similarity matrices in various ways. Furthermore, the toolbox includes a number of additional tools for parsing, navigation, and visualization synchronized with audio playback. Also, it contains code for a recently proposed audio thumbnailing procedure that demonstrates the applicability and importance of enhancement concepts. The MATLAB implementations provided on this website are published under the terms of the General Public License (GPL).
Website for the Chroma Toolbox, Mirror at MPII
The Chroma Toolbox has been developed by Meinard Müller and Sebastian Ewert. It contains MATLAB implementations for extracting various types of novel pitch-based and chroma-based audio features. The MATLAB implementations provided on this website are published under the terms of the General Public License (GPL). A general overview of the chroma toolbox is given in [ME_ISMIR2011].
Website for the Tempogram Toolbox, Mirror at MPII
The Tempogram Toolbox has been developed by Peter Grosche and Meinard Müller. It contains MATLAB implementations for extracting various types of recently proposed tempo and pulse related audio representations, see [GM_IEEE-TASLP2011]. These representations are particularly designed to reveal useful information even for music with weak note onset information and changing tempo. The MATLAB implementations provided on this website are published under the terms of the General Public License (GPL).
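An analogous tempogram computation is also available in Python; the following sketch uses librosa (not the MATLAB toolbox itself) to compute an autocorrelation tempogram from an onset novelty curve.

```python
import librosa
import librosa.display
import matplotlib.pyplot as plt

# Autocorrelation tempogram derived from an onset novelty curve.

y, sr = librosa.load(librosa.example('brahms'))
hop = 512
novelty = librosa.onset.onset_strength(y=y, sr=sr, hop_length=hop)
tempogram = librosa.feature.tempogram(onset_envelope=novelty, sr=sr, hop_length=hop)

librosa.display.specshow(tempogram, sr=sr, hop_length=hop,
                         x_axis='time', y_axis='tempo')
plt.title('Autocorrelation tempogram')
plt.show()
```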
On this website, you find results and audio examples for a score-informed source separation procedure as described in [EM_ICASSP2012]. Based on non-negative matrix factorization (NMF), the main idea of this approach is to impose score-based constraints on both the template and the activation side. Using such double constraints results in musically meaningful decompositions similar to parametric approaches, while being computationally less demanding and easier to implement.
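The key mechanism behind the double constraints can be illustrated in a few lines: zeros placed in the initialized templates and activations remain zero under multiplicative updates. The construction of the constraint matrices below is schematic and not the model of [EM_ICASSP2012].

```python
import numpy as np

def score_informed_nmf(V, W_init, H_init, n_iter=100, eps=1e-9):
    """NMF with multiplicative updates; zeros in W_init/H_init stay zero."""
    W, H = W_init.copy(), H_init.copy()
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

K, N, R = 1025, 400, 4                       # frequency bins, frames, note events
V = np.abs(np.random.randn(K, N))            # stand-in for a magnitude spectrogram

# Score-based constraints: allow energy only at the expected partials (W) and
# only within the expected note durations (H); everything else is set to zero.
W0 = np.zeros((K, R))
H0 = np.zeros((R, N))
expected_partials = {0: [50, 100, 150], 1: [60, 120, 180],
                     2: [67, 134, 201], 3: [75, 150, 225]}   # schematic bin indices
note_intervals = {0: (0, 120), 1: (100, 220), 2: (200, 320), 3: (300, 400)}
for r in range(R):
    W0[expected_partials[r], r] = 1.0
    H0[r, slice(*note_intervals[r])] = 1.0

W, H = score_informed_nmf(V, W0, H0)
```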
The objective evaluation and comparison of various techniques is crucial for the scientific progress in applied fields such as music information retrieval. Here, the availability of common datasets is of foremost importance. One important goal of our collaboration is to generate royalty free music data without any copyright restrictions, which is freely available for research purposes. On this website we supply various types of music data recorded at the HFM. This data is referred to as Saarland Music Data (SMD). Besides usual music recordings (SMD Western Music), we also supply MIDI-audio pairs (SMD MIDI-Audio Piano Music). These pairs, which have been generated by using hybrid acoustic/digital pianos (Disklavier), constitute valuable ground truth material for various MIR tasks.
Website for SyncRWC, Mirror at MPII
On this web page, we supply MIDI-audio synchronization results obtained from the synchronization procedure described in [EMG_ICASSP2009]. Providing these results, our goal is to establish a platform for comparing and discussing various synchronization and alignment procedures. In general terms, music synchronization denotes a procedure which, for a given position in one representation of a piece of music, determines the corresponding position within another representation. Depending upon the respective data formats, one distinguishes between various synchronization tasks. For example, audio-audio synchronization refers to the task of time aligning two different audio recordings of a piece of music. These alignments can be used to jump freely between different interpretations, thus affording efficient and convenient audio browsing. The goal of MIDI-audio synchronization is to coordinate MIDI note events with audio data. The result can be regarded as an automated annotation of the audio recording with available MIDI data.
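The following sketch outlines an analogous chroma-plus-DTW synchronization pipeline in Python; it is not the exact procedure of [EMG_ICASSP2009], and the file names are placeholders (the MIDI version is assumed to be available as synthesized audio).

```python
import numpy as np
import librosa

# Chroma-based music synchronization with dynamic time warping (DTW).

hop = 1024
y1, sr = librosa.load('performance.wav')            # hypothetical file names
y2, _ = librosa.load('midi_synthesized.wav', sr=sr)

C1 = librosa.feature.chroma_stft(y=y1, sr=sr, hop_length=hop)
C2 = librosa.feature.chroma_stft(y=y2, sr=sr, hop_length=hop)

# DTW yields an optimal warping path linking frames of the two versions.
D, wp = librosa.sequence.dtw(X=C1, Y=C2, metric='cosine')
wp_seconds = np.asarray(wp)[::-1] * hop / sr         # path from start to end, in seconds

# Each row gives a pair of corresponding time positions (performance, MIDI audio).
print(wp_seconds[:5])
```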
Website for the Mocap Database HDM05
It is the objective of our motion capture database HDM05 to supply free motion capture data for research purposes. HDM05 contains more than three hours of systematically recorded and well-documented motion capture data in the C3D as well as in the ASF/AMC data format. Furthermore, HDM05 contains more than 70 motion classes with 10 to 50 realizations each, executed by various actors. The HDM05 database has been designed and set up under the direction of Meinard Müller, Tido Röder, Michael Clausen, Bernhard Eberhardt, Björn Krüger, and Andreas Weber. The motion capturing has been conducted in the year 2005 at the Hochschule der Medien (HDM), Stuttgart, Germany, supervised by Bernhard Eberhardt.