
Audio Content Analysis

Alexander Lerch
Georgia Institute of Technology, Atlanta, GA
[email protected]

1 Introduction

Audio signals contain a wealth of information: just by listening to an audio signal, we are able to infer a variety of properties. For example, a speech signal not only transports the textual information, but might also reveal information about the speaker (gender, age, accent, etc.) and the recording environment (e.g., indoors vs. outdoors). In a music signal we might identify the instruments playing, the musical structure or musical genre, the melody, harmonies, and tonality, the projected emotion, and characteristics of the performance as well as the proficiency of the performers. An audio signal can contain and transport a wide variety of content beyond these examples; the field of Audio Content Analysis (ACA) aims at creating and using algorithms for automatically extracting this information from the raw (digital) audio signal lerch_introduction_2012 , enabling us to sort, categorize, segment, and visualize the audio signal based on its content. Use cases include applications such as content-based automatic playlist generation and music recommendation systems, computer-assisted music production and editing, and intelligent music tutoring systems identifying mistakes and areas of improvement for young instrumentalists.

This chapter gives an overview of ACA techniques and applications. While the processing of speech signals is covered in Chapter LABEL:X, this chapter focuses on music signals, where ACA is often referred to as Music Information Retrieval (MIR) schedl_music_2014 ; burgoyne_music_2015 , although the latter additionally encompasses the analysis and generation of symbolic (non-audio) music data such as musical scores.

1.1 Audio content

In order to understand what constitutes the content of a signal, it is important to identify its main sources. In music recordings, the content can be traced back to three origins:

  • Composition: while in western classical music, this would be a written full score, in other musical styles it might be a lead sheet or simply a musical idea. The content related to the composition allows us to recognize different renderings of the same song or symphony as being the same piece. In most genres of western music, this encompasses musical elements such as melody, harmony, and rhythm.

  • Music performance: the performance realizes the composition in a unique acoustic rendition, the actual music that can be perceived. A performance communicates the explicit information from the composition but also interprets and adds to this information. This happens through variations of tempo, timing, dynamics, and playing techniques.

  • Music production: since the input of an analysis system is usually an audio recording, the choices made during the recording as well as the editing and processing can significantly influence the final result maempel_musikaufnahmen_2011 . Microphone positioning, filtering, and editing are examples of the production teams’ impact on the final recording.

An analogy can again be found for speech recordings: the composition corresponds to the text, the performance to the actual speech, and the production to the recording and audio processing.

From a musical point of view, audio content can be categorized into timing (tempo and rhythm), dynamics, pitch, and timbre. The combination of specific characteristics and variations across these categories can convey higher-level content such as musical genre or conveyed emotion. In addition to the musical content alone, a recording might also contain additional content such as song lyrics.

1.2 Generalized audio content analysis system

A digital audio signal is represented as a series of numbers, so-called samples (see Chap. LABEL:X). Direct inspection of these samples (44100 samples per second in the case of CD audio quality) does not necessarily allow the observer to draw conclusions about the audio content. A longer-term context, however, might still give some insights: Fig. 1 shows the common waveform representation (sample values over time) of three different audio signals: a string quartet recording, a pop production, and speech.

Refer to caption
Figure 1: Waveform representation of three different audio recordings: speech (left), string quartet (middle), pop (right)

The general shape of the waveform envelope already enables an observer to differentiate between these signals by deriving descriptive properties related to dynamic range, fluctuations, and pauses. This can be modeled algorithmically: first, one or more descriptors which capture general properties of the audio signal can be defined (for instance, a measure for how often and how much the waveform envelope changes), and second, a mapping from different descriptor ranges to audio signal classes can be found heuristically (for instance, thresholding a descriptor for classifying the signal as pop). These two processing steps form a simplified model of a generalized system for ACA as shown in Fig. 2: the first stage extracts descriptors, also commonly referred to as audio features, and the second stage infers the desired content information from the features. These two stages have some parallels to two steps in human information processing, namely perception and cognition.

Figure 2: General processing stages of a system for audio content analysis
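
To make the two stages of this simplified model concrete, the following minimal Python sketch (assuming numpy and a mono signal array x; the block size and the threshold are arbitrary placeholders, not recommended settings) computes a single descriptor of envelope fluctuation and maps it heuristically to a class label:

```python
import numpy as np

def envelope_fluctuation(x, block_size=2048):
    """Toy descriptor: how much the block-wise envelope changes over time."""
    n_blocks = len(x) // block_size
    env = np.array([np.max(np.abs(x[i * block_size:(i + 1) * block_size]))
                    for i in range(n_blocks)])
    return np.std(np.diff(env))  # large for fluctuating envelopes, small for steady ones

def classify(x, threshold=0.05):
    """Toy inference: map the descriptor to a class label via a heuristic threshold."""
    return "speech" if envelope_fluctuation(x) > threshold else "music"
```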

Feature extraction

The Feature Extraction stage has two main objectives: first, it reduces the overall amount of data to be processed, leading to a compact representation of the content. For example, some audio classification systems use only dozens or hundreds of feature values to describe a complete song instead of millions of samples. Second, it focuses on the relevant aspects of the content, stripping away irrelevant and possibly redundant information. If, for example, a system is supposed to detect the fundamental frequencies of an audio signal, the feature set should probably be invariant to loudness and timbre. It is important to note that features do not necessarily have to be musically meaningful or even interpretable by humans; it suffices that they contain the relevant information needed for the inference to provide a correct mapping to the human-understandable target meta data. This is particularly true for most state-of-the-art systems: while for a long time experts carefully designed features capturing specific content information (such as the envelope variation in the simple example above), the last decade has seen a new generation of data-driven systems which automatically learn such features from data humphrey_feature_2013 . Prominent examples of these modern systems are neural networks which can automatically learn a compressed representation of the input data as features.

The input of the majority of ACA systems is either the waveform as discussed above or some kind of spectrogram representation. A spectrogram is a pseudo-3D plot that, compared to the waveform, often gives more insight into the frequency content of the signal by plotting the magnitudes of many overlapping Short-Time Fourier Transforms (STFTs) over time.
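
A magnitude spectrogram of this kind can, for instance, be sketched in a few lines of Python (assuming numpy and scipy, a mono signal x, and a sample rate fs; block and hop sizes are example values):

```python
import numpy as np
from scipy.signal import stft

def magnitude_spectrogram(x, fs, block_size=2048, hop_size=512):
    """Columns are magnitude spectra of overlapping, windowed blocks of samples."""
    f, t, X = stft(x, fs=fs, nperseg=block_size, noverlap=block_size - hop_size)
    return f, t, np.abs(X)   # shape of np.abs(X): (num_bins, num_blocks)
```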

Refer to caption
Figure 3: Waveform (top) and spectrogram (bottom) view of a single-voiced saxophone recording of the jazz standard ’Summertime’

Figure 3 shows the first 24 bars of a single saxophone playing the jazz standard ’Summertime’; the top visualizes the common waveform representation (x-axis: time, y-axis: amplitude) and the bottom displays the spectrogram (x-axis: time, y-axis: frequency, color: amplitude). Each column of the spectrogram is one magnitude spectrum of a short block of samples, with low values colored blue and high values colored yellow. While the phrases are easily identifiable in both representations, the spectrogram also shows that (i) it is a recording of a single monophonic instrument (clear spectral structure of fundamental frequency and harmonics at integer multiples), (ii) it is an instrument that allows vibrato (see seconds 3 and 18), (iii) it is an instrument with a significant number of harmonics (number and strength of “parallel” lines), and that (iv) the melody can be directly derived from the visualization by identifying the lowest frequency of the harmonic series and mapping it to musical pitch.

Inference stage

The second stage of the ACA system, the inference, takes the extracted features and maps them into a domain both usable and comprehensible by humans. In other words, it interprets the feature data and maps it to meaningful meta-data, such as a class label for the example in Fig. 1 (speech, chamber music, pop) or the pitch of a melody. This inference system can be an expert-defined algorithm, a classifier, or a regression algorithm. The more condensed and the more meaningful the features are, the less powerful the inference algorithm has to be and vice versa: a raw feature representation close to the audio sample requires a sophisticated inference approach.

It is important to note that no machine learning system will work with one hundred percent accuracy on all but the most trivial tasks. Every machine model of data and its patterns will always be imperfect, just like a human annotating data, and it is the goal of scientists and engineers to minimize the number of errors.

2 Music transcription

Music transcription systems estimate (explicit or implicit) score information from the audio recording. Broadly defined, music transcription encompasses a variety of extraction tasks covering, for instance, melody, chords, musical key, musical instruments, rhythm, time signature, and structure. The basics of music transcription systems will be explained here by introducing simple example approaches to various tasks.

2.1 Musical key detection

Estimating the key of a musical audio signal has multiple applications, ranging from large-scale musicological studies to enabling DJs to automatically identify songs with tonal compatibility for mash-ups. Key detection systems could work very reliably if they could simply utilize a transcribed version of all notes with pitch and length, as the key is mostly defined by the pitch content of a piece. Practically, however, this is not necessarily the most robust approach as, on the one hand, the transcription of pitches from polyphonic audio remains challenging and error prone and, on the other hand, key detection approaches can work reasonably well without requiring such detailed information pauws_musical_2004 ; chuan_polyphonic_2005 ; izmirli_template_2005 .

Pitch chroma

The most common feature used for key detection is the so-called average pitch chroma, an easy-to-extract, octave-independent approximation of the pitch content in the audio. Its use was first proposed in the context of chord detection fujishima_realtime_1999 . Similar to many other low-level features, the pitch chroma is usually derived from an STFT computed on the blocked audio signal. For the pitch chroma computation, each frequency bin is mapped to the closest musical pitch, e.g., A4. Then, all the magnitudes belonging to the same pitch class (e.g., A3, A4, A5) are accumulated over the octaves, resulting in a 12-dimensional pitch class vector (C, C#, D, D#, etc.) per audio block. The result is shown in Fig. 4 (bottom); it is similar to the spectrogram in the sense that it is a pseudo-3D plot with time on the x-axis, however, the y-axis now represents the 12 pitch classes C to B.

Refer to caption
Figure 4: Spectrogram (top) and pitch chroma (bottom) view of a single-voiced saxophone recording of the jazz standard ’Summertime’

The pitch chroma is a fitting feature for detecting the musical key as it focuses on the tonal content, reduces timbre and rhythm interference, and removes key-irrelevant octave information gomez_tonal_2006 ; muller_information_2007 .
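
A simplified pitch chroma computation for a single magnitude spectrum might be sketched as follows (assuming numpy; mag_spec is one column of the magnitude spectrogram, fs the sample rate, and block_size the STFT size; practical implementations typically restrict the frequency range and add further smoothing):

```python
import numpy as np

def pitch_chroma(mag_spec, fs, block_size, f_ref=440.0):
    """Fold one magnitude spectrum into 12 pitch classes (index 0 = C, 9 = A)."""
    chroma = np.zeros(12)
    freqs = np.arange(1, len(mag_spec)) * fs / block_size   # skip the DC bin
    midi = 69 + 12 * np.log2(freqs / f_ref)                 # MIDI pitch of each bin
    pitch_class = np.round(midi).astype(int) % 12
    for pc in range(12):
        chroma[pc] = mag_spec[1:][pitch_class == pc].sum()  # accumulate over octaves
    return chroma / (chroma.sum() + 1e-12)                  # normalize
```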

Inference

Under the simplifying assumption that the key will not change over the course of the piece of music (an assumption mostly valid for some genres such as rock and pop but invalid for others, e.g., classical music), the average pitch chroma of the whole piece gives a good approximation of the overall pitch content of the piece. Figure 5 displays an average pitch chroma extracted from a pop song in key D Major; it shows that the pitches D and A are the most prominent (salient and often occurring) and that unlikely pitch classes such as G# are less salient.

Refer to caption
Figure 5: Average pitch chroma of a pop song in the key of D Major

A simple key detection system uses such an extracted pitch chroma and infers the estimated key by comparing it to previously defined pitch class distribution templates, also referred to as key profiles. Examples of such templates, as shown in Fig. 6, can be derived from music knowledge (diatonic), listening experiments on tonality (Krumhansl) krumhansl_cognitive_1990 , or data (Temperley) temperley_tonal_2007 . One disadvantage of the diatonic profile is that it is identical between a major key and its relative (aeolian) minor key; the other two key profiles solve this issue by allowing different weights for different scale degrees.

Refer to caption
Figure 6: Three common normalized key profile templates (C major)

As the key profile of, for instance, C# Major can be assumed to be identical to the C Major key profile but shifted by one semi-tone, only two 12-dimensional templates are required for the basic task of detecting major and minor keys. The final key estimate is the key that minimizes the distance between the extracted average pitch chroma and the (shifted) major and minor key profile templates.
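
A corresponding inference step can be sketched as follows (assuming numpy; the binary diatonic templates used here could be replaced by the Krumhansl or Temperley profiles, and the Euclidean distance is only one of several possible distance measures):

```python
import numpy as np

# Binary diatonic templates (C major and C natural minor); Krumhansl or
# Temperley profiles could be substituted for more discriminative weights.
MAJOR = np.array([1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1], dtype=float)
MINOR = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0], dtype=float)

def detect_key(avg_chroma):
    """Return (root, mode) of the closest of the 24 shifted key profiles; root 0 = C."""
    chroma = avg_chroma / (np.sum(avg_chroma) + 1e-12)
    best, best_dist = None, np.inf
    for mode, profile in (("major", MAJOR), ("minor", MINOR)):
        for root in range(12):                       # shift the profile to each root
            template = np.roll(profile, root)
            template = template / np.sum(template)
            dist = np.linalg.norm(chroma - template)
            if dist < best_dist:
                best, best_dist = (root, mode), dist
    return best
```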

Results

Modern key detection systems achieve correct detection rates of approximately 60–80 %, depending on the data used for evaluation. The most common errors are confusions with the relative key (e.g., G Major vs. E Minor), the closely related keys (e.g., D Major vs. A Major), and the parallel key (e.g., A Major vs. A Minor), which is easily understandable due to the significantly overlapping pitch content (cf. https://www.music-ir.org/mirex/wiki/2019:Audio_Key_Detection_Results, last accessed 01/14/2020).

2.2 Monophonic fundamental frequency detection

The estimation of the varying fundamental frequency f_0 from a single-voiced audio recording enables a variety of applications, ranging from pitch correction through automatic accompaniment systems to karaoke systems. It is generally considered to be a solved problem, although standard systems sometimes still lack the robustness required by specific applications. Fundamental frequency detection systems detect repeating patterns in tonal, quasi-periodic components of the signal. Most approaches build on the basic assumption that the single-voiced signal is a weighted superposition of multiple sinusoids with frequencies at integer multiples of the fundamental frequency. Once the fundamental frequency is detected, it can be easily mapped to a musical pitch.

One of the most intuitive ways of approaching fundamental frequency detection is to measure the distance between successive zero crossings or between local maxima (or minima) to estimate the fundamental period length, the inverse of which is the fundamental frequency f_0. While this is a simple way of approaching the problem, the results are too unreliable to be practically usable, especially for signals with a large number of harmonics.

Over the past few decades, a variety of approaches to fundamental frequency detection have been proposed; the following sections present representative approaches for two analysis domains, the time domain and the frequency domain.

Auto correlation function

Nearly every pitched signal is periodic with its fundamental period length. An established way of detecting this periodicity or self-similarity is the Auto Correlation Function (ACF). Assuming that the fundamental frequency of a signal does not significantly change within a short block of samples of length 𝒩, the periodicity can be found by multiplying this block of samples x(n) with a shifted version of itself x(n+η) and summing the result for each η:

r_\mathrm{xx}(\eta) = \sum_{n=0}^{\mathcal{N}} x(n) \cdot x(n+\eta). \qquad (1)

The result will be maximal at a shift of η = 0 (maximum self-similarity), but it will also show local maxima at multiples of the fundamental period length. This is due to the high similarity of neighboring periods of the periodic signal. Thus, the ACF indicates the similarity per shift η. The shift of the first local maximum is therefore an indicator of the fundamental period length in samples. The ACF is often normalized so that r_xx(0) = 1. Figure 7 visualizes this function (bottom) for an example signal (top) for η ≥ 0.

Refer to caption
Figure 7: Normalized auto correlation function (bottom) of a periodic audio input (top)

The ACF has been a popular pitch detection algorithm for many decades rabiner_use_1977 and related and modified algorithms are still used mauch_pyin_2014 .
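
A minimal ACF-based estimate of the fundamental frequency of one block might look as follows (assuming numpy; x_block is one block of samples, fs the sample rate, and the search range is an arbitrary example):

```python
import numpy as np

def acf_f0(x_block, fs, f_min=50.0, f_max=2000.0):
    """Estimate the fundamental frequency of one block via the autocorrelation function."""
    acf = np.correlate(x_block, x_block, mode="full")[len(x_block) - 1:]
    acf = acf / (acf[0] + 1e-12)                # normalize so that r_xx(0) = 1
    eta_min = int(fs / f_max)                   # smallest lag of interest
    eta_max = int(fs / f_min)                   # largest lag of interest
    eta = eta_min + np.argmax(acf[eta_min:eta_max])
    return fs / eta                             # period length in samples -> frequency
```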

Harmonic product spectrum

Intuitively, the frequency domain is a fitting domain to find the fundamental frequency as it is a relatively sparse representation of the frequency content of each block. Simply picking the location of the maximum of the STFT, however, does not lead to a reliable fundamental frequency estimate, as the loudest tonal component of natural sounds is often one of the higher harmonics instead of the fundamental frequency itself. The Harmonic Product Spectrum (HPS) is a way to address this challenge by taking advantage of the comb-like structure of the spectrum of a periodic sound in the frequency domain noll_pitch_1969 , as the magnitude spectrum of a periodic sound will show local maxima at the location of the fundamental frequency as well as at its integer multiples. The HPS is computed iteratively: first, the magnitude spectrum is decimated by keeping only every second value so that the length of the spectrum is halved. Whatever the fundamental frequency is, the location of the second harmonic now has the same index as the fundamental frequency in the original spectrum. Multiplying those two spectra will increase the value at the fundamental frequency (multiplication of local maxima) and decrease the values at other locations of the spectrum (lower magnitudes). This process is iteratively repeated while decimating the spectrum by factors of 3, 4, 5, etc., with the maximum at the fundamental frequency getting more and more pronounced as higher harmonics are multiplied. The location of the maximum of this HPS is, then, the estimated fundamental frequency. Figure 8 illustrates the decimated spectra and the resulting HPS.

Refer to caption
Figure 8: 4th order Harmonic Product Spectrum and decimated spectra (bottom) for the signal shown on top

The HPS is a simple way of detecting the fundamental frequency in the frequency domain, but it faces some inherent challenges. First, the spectral frequency resolution is often insufficient especially for signals at low frequencies. Second, it might fail for signals in which one harmonic has low magnitude. Third, fundamental frequencies that fall between bin frequencies might not be detected. But even given these issues, the HPS is a clever and intriguing way of using the harmonic structure in the frequency domain to detect the fundamental frequency.
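
Despite these limitations, the core computation is compact; a sketch (assuming numpy; mag_spec is one magnitude spectrum, fs the sample rate, and block_size the STFT size) might look as follows:

```python
import numpy as np

def hps_f0(mag_spec, fs, block_size, order=4):
    """Harmonic Product Spectrum: multiply the spectrum with its decimated versions."""
    hps = mag_spec.copy()
    for k in range(2, order + 1):
        decimated = mag_spec[::k]              # keep only every k-th bin
        hps = hps[:len(decimated)] * decimated
    k0 = np.argmax(hps[1:]) + 1                # ignore the DC bin
    return k0 * fs / block_size                # bin index -> frequency in Hz
```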

2.3 Polyphonic fundamental frequency detection

While fundamental frequency detection for single-voiced recordings is considered a solved problem, a multi-voiced, polyphonic signal still poses challenges to state-of-the-art systems benetos_automatic_2013 . Historically, this task was initially approached with methods that can be summarized under the term Iterative Subtraction: first, a monophonic pitch detector is applied to the signal to detect the most salient fundamental frequency. Second, this frequency is stored as a candidate and all its harmonics are removed from the signal. Third, this process is repeated until the termination criteria are met. Termination criteria can include, for example, a maximum number of voices or a minimum remaining energy in the signal. Approaches utilizing iterative subtraction were reasonably successful in the 2000s and have been applied to both time domain signals cheveigne_multiple_1999 and frequency domain signals klapuri_multiple_2003 .

Later, Non-Negative Matrix Factorization (NMF) became widely used for polyphonic fundamental frequency detection smaragdis_non-negative_2003 , and was especially successful when applied to instruments with quantized frequencies such as the piano bertin_blind_2007 . NMF attempts to approximate one non-negative matrix (usually the spectrogram) by the product of two matrices which are randomly initialized and iteratively updated based on the distance between the target spectrogram and the result of the multiplication. After convergence, the two matrices can be interpreted as the template dictionary, containing all basic sound components that might contribute to the final spectrogram, and the activations, indicating how strongly each template is present at each point in time. In the case of the piano, the templates would be all the different piano pitches and the activations would indicate the volume of each pitch over time.
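
Using an off-the-shelf implementation, the factorization itself can be sketched in a few lines (assuming numpy and scikit-learn, and a non-negative magnitude spectrogram with frequency bins as rows; 88 templates would roughly correspond to the piano example above):

```python
import numpy as np
from sklearn.decomposition import NMF

def nmf_templates_activations(mag_spectrogram, num_templates=88):
    """Factorize a magnitude spectrogram V (bins x frames) into V ~ W @ H."""
    model = NMF(n_components=num_templates, init="random", random_state=0, max_iter=500)
    W = model.fit_transform(mag_spectrogram)   # spectral templates (bins x templates)
    H = model.components_                      # activations over time (templates x frames)
    return W, H
```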

Unsurprisingly, the majority of modern systems for polyphonic pitch detection are based on deep neural networks benetos_automatic_2019 . Most of them work with a spectrogram-like input representation. The achieved accuracies vary, depending on the data set used for evaluation, between 40 and 70%. These and more results can be found on the website of the so-called MIREX (Music Information Retrieval Evaluation eXchange; cf. https://www.music-ir.org/mirex/wiki/2019:Multiple_Fundamental_Frequency_Estimation_\%26_Tracking_Results_-_MIREX_Dataset, last accessed 01/14/2020), an annual event comparing the performance of state-of-the-art systems for a variety of Music Information Retrieval tasks on a variety of data and metrics.

2.4 Musical structure identification

Most music is inherently formally organized and structured into various hierarchical levels, starting from groups of notes, bars, and phrases to sections and movements. The detection of musical structure can enable intelligent listening applications, capable of jumping to specific segments such as chorus or verse, or large-scale musicological analyses. The way humans infer structure is based on three main characteristics paulus_state_2010 , (i) novelty and contrast, meaning something new and possibly unexpected happens, (ii) homogeneity, meaning that similar parts tend to be grouped together, and (iii) repetition, meaning that the recognition of a repeated segment indicates a structural segment. All of these characteristics can be represented through a variety of musical elements including but not limited to harmony, rhythm, melody, instrumentation, tempo, and dynamics. Therefore, a system for automatic structure detection will extract one or more features representing these musical elements and compute an intermediate representation of the feature data. The most common intermediate representation for a structure detection system is a so-called Self-Similarity Matrix (SSM). The SSM is an intuitive way of indicating and visualizing structure. It is computed by extracting a meaningful feature to investigate, for example, the pitch chroma for tonal content. Then, a pairwise similarity is computed between each short-time pitch chroma and all others, leading to a self-similarity matrix as shown in Fig. 9.

Refer to caption
Figure 9: Self-Similarity Matrix for Michael Jackson’s “Bad”

The resulting SSM is symmetric; both axes of the SSM indicate time and the similarity between (t_1, t_2) equals the similarity between (t_2, t_1). The diagonal indicates the maximum of self-similarity (red). Blocks of constant red color indicate areas of high homogeneity such as a held chord or a short repeated pattern, and blue vertical or horizontal lines indicate low similarity to all other points (often rests and pauses).
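
A sketch of the SSM computation from a sequence of pitch chroma vectors, using the cosine similarity and assuming numpy, might look as follows:

```python
import numpy as np

def self_similarity_matrix(chroma_frames):
    """Pairwise cosine similarity between all short-time pitch chroma vectors.

    chroma_frames: array of shape (num_frames, 12), one chroma vector per block.
    """
    norm = np.linalg.norm(chroma_frames, axis=1, keepdims=True) + 1e-12
    normalized = chroma_frames / norm
    return normalized @ normalized.T           # (num_frames, num_frames), symmetric
```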

The SSM shown in Fig. 9 was computed from Michael Jackson’s pop song “Bad.” We can clearly identify several structural elements of the song: the intro stops at about 19 s, the bridge at 60–69 s is followed by a chorus (69–86 s) and the same constellation is repeated (as indicated by the line parallel to the diagonal) at 120 s, the high homogeneity of the instrumental is indicated by the red box from 145–170 s, and the song ends with four repetitions of the chorus starting at 180 s, visualized by the high similarity with diagonal structure at the end.

There are several ways to use the SSM to algorithmically infer the musical structure. One way is to use image processing techniques. For instance, the start of a new segment might be indicated by peaks in the output of a checkerboard kernel slid along the diagonal, a form of high-pass filtering foote_automatic_2000 . Repetitions, indicated by lines parallel to the diagonal, might be detected with edge detection techniques on the SSM dannenberg_music_2008 .
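
The checkerboard-kernel idea can be sketched as follows (assuming numpy and an SSM as computed above; the kernel size is a free parameter, and the subsequent peak picking is omitted):

```python
import numpy as np

def checkerboard_novelty(ssm, kernel_size=32):
    """Slide a checkerboard kernel along the SSM diagonal; peaks suggest segment boundaries."""
    half = kernel_size // 2
    sign = np.kron(np.array([[1, -1], [-1, 1]]), np.ones((half, half)))
    novelty = np.zeros(ssm.shape[0])
    for i in range(half, ssm.shape[0] - half):
        patch = ssm[i - half:i + half, i - half:i + half]
        novelty[i] = np.sum(patch * sign)      # high when two homogeneous blocks meet
    return novelty
```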

The formal evaluation of structure detection systems remains a challenge: human annotators of musical structure, even if in agreement, tend to annotate different structural levels. For instance, what might just be the two segment structure (A) (B) for one annotator could easily be annotated by another as (a b) (c d), which makes an evaluation by simply matching the ground truth impractical, given that both annotations are correct and the system might detect either one.

3 Music performance analysis and assessment

Since the ultimate goal of most music transcription tasks is the extraction of a score-like description from an audio file, the performance information is often discarded. There is, however, a notable difference between two renditions of the same sonata by two pianists or between realizations of a jazz standard by different performers palmer_music_1997 . Music Performance Analysis (MPA) deals with extracting this performance information from audio recordings. What constitutes performance information varies from genre to genre, but the performance parameters usually considered to be most impactful are dynamics and tempo lerch_software-based_2009 , although expressive variations of pitch (vibrato, intonation) and playing techniques can play important roles as well.

3.1 Dynamics

Musical dynamics are closely related to loudness, although the absolute perceived loudness is not the only cue to indicate musical dynamics weinzierl_sound_2018 . A forte passage on a recording is, for example, easily identifiable even at low reproduction volume. Therefore, measures of acoustic intensity are outperformed by humans judging musical dynamics nakamura_communication_1987 . Even so, it is a valid simplification to reduce the extraction of musical dynamics to extracting simple features representing the energy of a signal as described in Sect. 4.2.

3.2 Tempo and beat detection

The tempo of a performance has been shown to convey structural information of the music to the listener palmer_mapping_1989 . Moreover, tempo and its variation have been linked to the perception of projected emotion juslin_cue_2000 . The tempo of a piece of music is set by a train of quasi-equidistant pulses, the so-called tactus lerdahl_generative_1983 . The tactus indicates the beats at which listeners will clap or tap their foot with the music. It is important to note that the tactus is a perceptual concept: while it is determined by groups and accents of note events, the pulse of a tactus may or may not fall on a note event, and a note event may or may not fall on a pulse. The frequency of a tactus is usually in the general range of 1–3 Hz, corresponding to 60–180 BPM (Beats Per Minute) fraisse_time_1978 .

Automatically extracting the tempo and the beats from audio recordings enables applications in music recommendation and systems for playlist generation as well as in creative usages such as DJ software.

As the pulse train is periodically repeating, the concept of detecting the tempo shows some similarity to the task of fundamental frequency detection as both tasks focus on estimating the period length of a periodic input signal. The main difference is the time scale of this detection: while for fundamental frequency detection the period lengths of interest range from approximately 0.3–30 ms, a typical beat period length is approximately in the range of 300–1000 ms.

Novelty function

Due to these different time scales and because the series of samples carries a large amount of unrelated information, a beat analysis is usually applied to an intermediary representation, the so-called novelty function lerch_introduction_2012 . The novelty function is a time-series of values that has local maxima at positions where “something new is happening” such as the onset of a new note or a drum hit and is low at times when nothing new begins such as during a held note. While early systems attempted to extract this novelty function from the time domain envelope itself schloss_automatic_1985 , it can be extracted from a variety of representations or features, including tonal representations such as the pitch chroma and low level features describing spectral shape such as MFCCs or the spectral centroid (see Sect. 4.2) lykartsis_beat_2015 . The common process for extracting the novelty function involves (i) extracting the feature, (ii) computing the derivative over time, (iii) truncating negative values to zero, and (iv) smoothing the result with a lowpass filter.
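
These four steps can be sketched, for example, with the magnitude spectrogram as the underlying representation, yielding a spectral-flux-style novelty function (assuming numpy; the smoothing length is an arbitrary example):

```python
import numpy as np

def novelty_spectral_flux(mag_spectrogram, smoothing_length=5):
    """Novelty function following the four steps above (one spectrum per column)."""
    diff = np.diff(mag_spectrogram, axis=1)           # (ii) derivative over time
    diff = np.maximum(diff, 0.0)                      # (iii) truncate negative values
    novelty = diff.sum(axis=0)                        # sum the positive changes per block
    kernel = np.ones(smoothing_length) / smoothing_length
    return np.convolve(novelty, kernel, mode="same")  # (iv) smooth (moving-average lowpass)
```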

If the extracted novelty function works as intended, the local maxima indicate note onsets; therefore, picking the peaks of this function is referred to as onset detection bello_tutorial_2005 . Figure 10 shows an example of a novelty function and picked onsets.

Refer to caption
Figure 10: Rectified audio envelope (top) and novelty function with picked onsets (bottom)

Tempo induction

While the novelty function gives an indication of note events or other musical events, there is no direct mapping from these events to tempo since, as mentioned above, note events do not necessarily fall on beat events and vice versa. However, either (or both) the novelty function or the series of detected onsets are useful representations for inferring the tempo. More specifically, the periodicity of the novelty function allows the tempo to be estimated.

To give an example of an early system for tempo induction, Scheirer proposed to use a bank of resonance filters scheirer_tempo_1998 . Each filter is tuned to one possible tempo, for example, 120 BPM, and the resonance frequency of the filter with the highest output energy is the most likely tempo. This means that the real tempo can only be detected if it is close to a filter frequency; thus, the number of filters combined with the overall range determines the possible tempo resolution.

Other simple ways of detecting the tempo include an autocorrelation analysis of the novelty function gouyon_beat_2003 or picking the maximum of the Inter-Onset-Interval histogram dixon_beat_1999 .
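
An autocorrelation-based tempo estimate operating on the novelty function might be sketched as follows (assuming numpy; feature_rate is the number of novelty values per second, and the tempo range is an example restriction):

```python
import numpy as np

def tempo_from_novelty(novelty, feature_rate, bpm_min=60, bpm_max=180):
    """Estimate the tempo as the most salient periodicity of the novelty function."""
    acf = np.correlate(novelty, novelty, mode="full")[len(novelty) - 1:]
    lag_min = int(round(feature_rate * 60.0 / bpm_max))   # lag of the fastest tempo
    lag_max = int(round(feature_rate * 60.0 / bpm_min))   # lag of the slowest tempo
    lag = lag_min + np.argmax(acf[lag_min:lag_max])
    return 60.0 * feature_rate / lag                      # beat period -> BPM
```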

An example of a state-of-the-art system for tempo induction is based on recurrent neural networks and yields results in the range of 50–90 %, depending on the dataset bock_accurate_2015 .

Beat detection

Knowledge of the tempo indicates the period length of the “foot-tapping rate,” i.e., the distance between the pulses; however, it does not imply the actual beat locations, sometimes referred to as the beat phase. Thus, beat detection systems aim at detecting the beat locations from the novelty function.

One of the first systems proposed to detect beats based on oscillators with adaptive parameters spawned a whole class of beat tracking systems: here, a pulse generator predicts beat locations which are then compared to actual onset times and strengths; depending on the distance between onset and beat as well as their positions, the pulse generator parameters are adapted to optimize the estimated fit with future onsets large_beat_1995 . Both beat phase and beat distance are adapted and estimated simultaneously. The advantage of these oscillator-based systems is that they are capable of real-time processing; their main disadvantage is slow adaptation to sudden tempo changes.

3.3 Performance assessment

Instead of extracting individual performance parameters, the goal of performance assessment is the estimation of overall ratings for a performance, taking into account parameters spanning the domains pitch (vibrato rates, intonation, tuning and temperament), dynamics (accents, tension), and timbre (playing techniques, instrument settings). This is a task of considerable commercial interest as it enables musically intelligent training software.

Performance assessment is a ubiquitous aspect of music pedagogy: regular feedback from teachers improves the students’ playing and auditions are used to place students in ensembles. It is, however, seen as a highly subjective and aesthetically challenging task with considerable disagreement on assessments between educators thompson_evaluating_2003 ; wesolowski_examining_2016 . An automatic system for assessment of performances would provide objective, reliable, and repeatable feedback to the student during practice sessions and increase accessibility of affordable instrument education. Generally, the structure of a performance assessment system resembles the basic structure of an audio content analysis system: features describing the performance are extracted and then used in the inference stage to estimate one or more ratings.

The features might be simple standard features as used in other content analysis systems knight_potential_2011 or designed specifically for the task of performance assessment (e.g., pitch and timing stability) nakano_automatic_2006 ; abeser_automatic_2014 ; wu_towards_2016 . Some systems also incorporate musical score information into the feature computation devaney_automatically_2011 ; bozkurt_dataset_2017 ; vidwans_objective_2017 .

The general trend in content analysis from feature design towards feature learning can also be observed in studies on performance assessment lerch_music_2019 , albeit somewhat more slowly. One of the reasons for this reluctance is the non-interpretability of learned features; an educational setting requires not only an accurate assessment but also an explanation of the reasons for that assessment (and possibly a strategy to improve the performance).

The success rate of tools for automatic performance assessment still leaves room for improvement. Most of the presented systems either work well only for very select data knight_potential_2011 or have comparably low prediction accuracies vidwans_objective_2017 ; wu_towards_2016 , rendering them unusable in most practical scenarios.

4 Music identification and categorization

A significant part of research in the area of ACA is concerned with the categorization of audio data, e.g., into musical genres. This group of tasks is —from a technical point of view— related to the estimation of similarity between different music files as well as the estimation of the emotion in a recording of music. Before we go into details on how to automatically categorize music, we will first look into music identification through audio fingerprinting, which was the first MIR technology with large scale adoption by consumers in the early 2000s.

4.1 Audio fingerprinting

Audio fingerprinting aims at representing a recording in a small and unique so-called fingerprint (also: perceptual hash) in order to look up this recording in a previously prepared database and map it to the stored meta data. In contrast to many of the other systems presented in this chapter, fingerprinting is not concerned with extracting musical meaning from audio but solely with identifying a recording unambiguously. The two main applications of audio fingerprinting are (i) the automated monitoring of broadcast stations for independent supervision of the broadcast songs in order to verify broadcasting rights, and (ii) end consumer services allowing the identification of audio in order to provide meta data such as artist name, song title, or album art cano_audio_2005 . The design of an audio fingerprinting system has to solve multiple inherent core problems cano_review_2005 . On the one hand, the fingerprint has to be small in size to be transmitted and searched for efficiently; on the other hand, it has to be unique in order to identify one specific song in a database of possibly millions of songs. Furthermore, it has to be robust against quality degradations of the audio signal, as a user might record the audio of a song playing over speakers in a noisy environment. Last but not least, the fingerprint extraction has to be efficient enough to run on embedded and mobile devices.

A fingerprinting system has two main processing steps: the fingerprint extraction (often on a mobile device at the client side) and the identification in a database (usually on the server side), see Fig. 11 bottom. It can only identify recordings which were used for the database creation, see Fig. 11 top.

Figure 11: Flowchart of a general fingerprinting system

Fingerprint extraction

As the goal of audio fingerprinting is the identification of a specific recording (not a song, meaning it is supposed to differentiate between, e.g., a studio and a live version of the same song), the fingerprint does not have to contain musical information but can focus on the raw content of the audio. A simple predecessor of modern fingerprinting systems, proposed for the identification of commercials in broadcast streams, used segments of the time domain envelope as a fingerprint lourens_detection_1990 . Nowadays, the fingerprint is usually derived from the STFT. Several approaches for fingerprint extraction exist. Two prominent approaches are (i) to encode spectral bands, more specifically the changes of band energy over both time and frequency, in a binary format haitsma_highly_2002 , and (ii) to identify salient peaks of the spectrogram and encode their relative locations to each other wang_industrial_2003 . The resulting size of the extracted fingerprint depends on the system: the first system, for example, represents three seconds of audio with 8 Kbit haitsma_highly_2002 .
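
A toy sub-fingerprint extraction in the spirit of approach (i), assuming numpy and a matrix of (for example, logarithmically spaced) band energies over time, might be sketched as follows; actual systems add many refinements:

```python
import numpy as np

def binary_fingerprint(band_energies):
    """Toy binary sub-fingerprints derived from band energy changes.

    band_energies: array of shape (num_bands, num_frames).
    Each bit encodes whether the energy difference between neighboring bands
    increases or decreases from one frame to the next.
    """
    d_freq = np.diff(band_energies, axis=0)   # difference across bands
    d_time = np.diff(d_freq, axis=1)          # change of that difference over time
    return (d_time > 0).astype(np.uint8)      # shape: (num_bands - 1, num_frames - 1)
```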

Fingerprint identification

After successful extraction, the fingerprint of an unknown query signal has to be compared with a large number of previously extracted fingerprints in a database. Since this database can be large, this comparison has to be as efficient as possible in order to minimize processing time. Fingerprint systems use multiple tricks to speed up the lookup process, including hash lookup tables in which database entries are referenced by their fingerprint content or the use of fast-to-compute distance measures such as the Hamming distance hamming_error_1950 . Reordering the database entries according to their popularity can also decrease the average lookup time significantly.
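
A brute-force lookup based on the Hamming distance can be sketched as follows (assuming numpy and binary fingerprints of identical shape; production systems replace the linear scan with hash-based indexing):

```python
import numpy as np

def hamming_distance(fp_a, fp_b):
    """Number of differing bits between two binary fingerprints of equal shape."""
    return int(np.count_nonzero(fp_a != fp_b))

def identify(query_fp, database):
    """Return the ID of the closest database entry.

    database: dict mapping recording IDs to binary fingerprints.
    """
    return min(database, key=lambda rec_id: hamming_distance(query_fp, database[rec_id]))
```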

4.2 Music genre classification

Music Genre Classification (MGC) is a classical machine learning task that used to be one of the most popular tasks in the field of MIR. It is obviously useful to describe and categorize large collections of music to enable music discovery or content-based music recommendation systems. Despite its initial popularity and clear demand from users for classification systems, the task has some fundamental inherent problems that concern the subjective, inconsistent, and somewhat arbitrary definition of music genre. For example, there is disagreement on genre taxonomies and what they are based on (geography: “Indian music,” instrumentation: “symphonic music,” epoch: “baroque,” technical requirements: “barbershop,” or use of the music: “Christmas songs”) pachet_taxonomy_2000 . Despite these issues, MGC systems can be “successfully” developed as long as they adhere to one consistent and coherent definition of genre. The performance of the system strongly depends on (i) the features and the information they are able to transport, (ii) the data that the machine learning system is being trained on, and (iii) the capabilities of the machine learning system or classifier itself.

Feature examples

Even though the definition of genre is problematic, as pointed out above, it is obvious to the observant listener that the identification of specific genres may require information related to timbre, pitch, rhythm and tempo, and dynamics. Traditional systems with custom-designed features therefore use a variety of different features. Hundreds and possibly thousands of different features have been investigated over the years for their performance in MGC systems. Interestingly enough, the most successful features have been shown to be very simple features describing the spectral shape and the intensity of the signal.

Many of these well-known low-level features are extracted from short blocks of audio samples, resulting in a time series of values for each feature. In a next step, features are often aggregated over time by computing, for example, their mean or their standard deviation. This means that a complete audio file is ultimately represented by one vector with a length of one or two (mean and standard deviation) times the number of features.

Spectral shape (timbre) features

Features describing the spectral shape of a signal are widely used. The spectral shape significantly influences our perception of the timbre of a sound helmholtz_lehre_1870 . Most of these features are extracted from the local magnitude spectrum (one column of the spectrogram as shown in Fig. 3). Here, only two commonly used features which have proven useful in analysis tasks will be presented as representatives of a large number of common features.

  • Spectral Centroid: The spectral centroid is the center of mass of a magnitude spectrum, i.e., the magnitude-weighted mean frequency. That means that low frequency signals will have a low centroid while substantial high frequency components and noise will increase the centroid. Despite its technical definition, the spectral centroid has been shown to be strongly correlated with the perceptual sound attribute brightness mcadams_perceptual_1995 ; caclin_acoustic_2005 (see the code sketch after this list).

  • Mel Frequency Cepstral Coefficients: The Mel Frequency Cepstral Coefficients (MFCCs) describe the overall shape of the spectrum in terms of cosine-shaped basis functions on a logarithmic (Mel-scaled) frequency axis. To compute them, the magnitude spectrum is first split into frequency bands spaced according to the Mel scale, which models human frequency perception stevens_scale_1937 . Then, the logarithm of the band magnitudes is taken and the bands are transformed into the cepstral domain with a Discrete Cosine Transform (DCT). The resulting DCT coefficients are the MFCCs. In contrast to the spectral centroid, the MFCCs are a multidimensional feature similar to the pitch chroma (Sec. 2.1), often representing each spectrum with 13 or more values.

    The algorithm for the MFCC computation is an interesting mixture of ideas from perception, signal processing (cepstrum), and data compression (DCT) davis_comparison_1980 . While the interpretability of the MFCCs is limited and their perceptual meaning is questionable, they have proven surprisingly useful in a wide variety of tasks logan_mel_2000 ; jensen_evaluation_2006 ; heittola_musical_2009 and are nowadays considered a standard baseline feature.
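
The spectral centroid referenced in the list above can be sketched in a few lines (assuming numpy, a single magnitude spectrum mag_spec, a sample rate fs, and the STFT size block_size):

```python
import numpy as np

def spectral_centroid(mag_spec, fs, block_size):
    """Magnitude-weighted mean frequency of one magnitude spectrum, in Hz."""
    freqs = np.arange(len(mag_spec)) * fs / block_size
    return np.sum(freqs * mag_spec) / (np.sum(mag_spec) + 1e-12)
```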

Intensity features

Intensity features model the dynamics or loudness of a recording. The two feature examples given below represent two simple and common features.

  • Envelope: The simplest way to describe the envelope of a signal per block is to find the absolute maximum per block, resulting in the overall shape of the waveform. This is somewhat related to the slightly more complicated Peak Programme Meter (PPM) that is used in recording studios.

  • Root Mean Square: The Root Mean Square (RMS) is a standard way of computing the intensity of a signal. It is the so-called effective value of the signal, computed as the square root of the mean of the squared sample values per block. For long block sizes, it can be an efficient measure of long-term level changes. A common pre-processing step is to filter the signal before computing the RMS to take into account the sensitivity of the human ear at different frequencies for different level ranges (compare A-weighted or C-weighted sound pressure level measures); a code sketch follows this list.
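
A block-wise RMS computation (without the optional weighting filter mentioned above) can be sketched as follows, assuming numpy and a mono input signal x; block and hop sizes are example values:

```python
import numpy as np

def rms(x, block_size=2048, hop_size=1024):
    """Block-wise root mean square of a time-domain signal."""
    starts = range(0, len(x) - block_size + 1, hop_size)
    return np.array([np.sqrt(np.mean(x[s:s + block_size] ** 2)) for s in starts])
```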

Additional features

Musical genre is so broadly defined that features representing characteristics from many categories can be meaningful. Therefore, numerous and diverse custom-designed features have been investigated for this task. These features include, for example, stereo features representing the width and form of the stereo image tzanetakis_stereo_2007 , pitch content features representing the variety and ranges of pitches tzanetakis_pitch_2002 , and tempo and rhythm features describing tempo variation, beat strength, and rhythmic complexity dixon_classification_2003 ; burred_hierarchical_2004 . Many of these features are, unlike the features introduced above, extracted from longer windows of data. For instance, attempts at describing the rhythmic content need a context window of approx. 5 s or more to be meaningful. As pointed out above, the era of feature design is nowadays considered a thing of the past (except for problems with only small amounts of available data) and modern features are often learned directly from the spectrogram. Common feature learning approaches are based, for example, on neural networks lee_unsupervised_2009 ; hamel_learning_2010 or dictionary learning mairal_online_2009 ; wu_assessment_2018 .

Classification

The simplest, most intuitive classifier is just a threshold: if a feature value is higher than a threshold ϵ, choose class 1, otherwise class 2. A modern data-driven system derives this threshold from the data itself and generalizes to a multi-dimensional space with possibly non-linear decision boundaries. In other words, it learns which combinations of feature values are common for each class and how to differentiate between classes given these feature values. Figure 12 shows a so-called scatter plot for two classes represented by two features (a two-dimensional feature space). It can be seen that for this data, the RMS feature seems to work slightly better at separating the two classes speech and music than the spectral centroid. This visualization also emphasizes the importance of the so-called training data set; if the estimated thresholds are based on data that are not representative, the classifier will not perform well.

Refer to caption
Figure 12: A music/speech dataset visualized in a two-dimensional feature space (x-axis: average spectral centroid, y-axis: standard deviation of rms)

Another example of a basic classifier is the Nearest-Neighbor classifier fix_discriminatory_1951 . During training, it stores the location of each data point in the feature space with its corresponding class label. When queried with a new and unseen feature vector, the distance to every single training vector is computed; the final result is the class label of the closest vector, the nearest neighbor. Other classifiers model the distribution of the data, avoiding storing each individual data point and allowing for simple generalization of the data. One popular classifier that models the data with Gaussian distributions is the Gaussian Mixture Model duda_pattern_2000 . More modern approaches maximize the separation between classes by mapping the data into high-dimensional spaces (Support Vector Machine boser_training_1992 ) or form a so-called ensemble of many simple classifiers to yield a more robust majority vote (Random Forest breiman_random_2001 ). Most state-of-the-art classifiers are, if the amount of training data allows it, based on (deep) neural networks goodfellow_deep_2016 .
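
As an illustration of such a data-driven classifier, the following sketch trains a nearest-neighbor classifier on a small, hypothetical two-dimensional feature set and classifies an unseen feature vector (assuming numpy and scikit-learn; the numbers are arbitrary examples, not measured data):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical training data: one row per file, columns = aggregated features
# (e.g., mean spectral centroid in Hz, standard deviation of the RMS).
X_train = np.array([[1800.0, 0.02], [3200.0, 0.09], [1500.0, 0.03], [2900.0, 0.11]])
y_train = np.array(["music", "speech", "music", "speech"])

classifier = KNeighborsClassifier(n_neighbors=1)   # nearest-neighbor classifier
classifier.fit(X_train, y_train)

X_unseen = np.array([[3000.0, 0.08]])
print(classifier.predict(X_unseen))                # -> ['speech'] for this toy data
```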

4.3 Music emotion recognition

An analysis task for which many consumers seem to see an intuitive need is the extraction of the emotional content of a recording, as a majority of users search for and categorize music by describing the emotional content of the music kim_categories_2002 . The task definition of Music Emotion Recognition (MER) suffers from similar, possibly more severe, restrictions in terms of subjectivity and noisiness of human-labeled data. The main issues are (i) the question of whether to estimate the conveyed emotion or the elicited emotion, i.e., a subject might perceive the music transporting a specific emotion, but might feel a completely different emotion themselves meyer_emotion_1956 ; zentner_emotions_2008 , and (ii) the unclear definition of what emotions or moods are actually inherent to music listening, e.g., can music trigger basic emotions such as fear and anger scherer_why_2003 ; scherer_which_2004 ; zentner_emotions_2008 ? Due to the (commercially) appealing applications of MER, however, these inherent problems have not stopped researchers from targeting this task yang_machine_2012 . Despite many studies investigating the emotional content and impact of music, however, the link between musical parameters and emotions remains largely unknown, and the audio features which could directly describe emotional content are unknown as well. Therefore, the features and approaches commonly chosen for MER are similar to genre classification as described in Sect. 4.2. In addition to such classification approaches, some methods aim at estimating emotions not by sorting them into distinct classes but by locating them as coordinates in a two-dimensional valence-arousal plane as proposed by Russell russel_circumplex_1980 . In this case, the machine learning system is not choosing one of multiple output categories but estimates two values (valence and arousal). This class of machine learning algorithms is referred to as regression algorithms.

5 Current challenges

Looking at the historic development in audio content analysis as well as the currently pressing research issues, three main overall challenges can be identified, (i) the amount of data available for training complex machine learning systems, (ii) the predictability of modern systems and interpretability of the results, and (iii) the inherently abstract musical language and the largely unclear relations of musical concepts and perceptual meaning.

5.1 Training data

Machine learning systems are data-driven, meaning that they learn the most likely mapping between input (features) and output (classes) from a set of data. Thus, in order to train a system well, there exist important requirements on data. First, the training data have to be representative. That means that, on the one hand, the possible variability of the input data should be covered completely to enable the system to learn. A music genre can, for instance, not be properly represented by only one band as the system might learn to distinguish the band but not the genre. On the other hand, the distribution of songs per class should reflect the expected distribution so as to not bias the system towards majority classes. Second, the training data should not be noisy, meaning the labels should be consistent and unambiguous. This is, as mentioned above, a problem especially for MGC because of the issues with the definition of music genre. Third, the training data have to be sufficient. The more complex a task is, the more complex a system needs to be, and the more training data is required. Without a sufficient amount of data, a complex system will not be able to generalize a model from the training data and thus will “overfit,” meaning that the system works very well on the data it has been trained on but poorly on unseen data duda_pattern_2000 . The amount of training data becomes a crucial issue for machine learning approaches and systems based on Deep Neural Networks which have shown superior performance at nearly all tasks in audio content analysis but require large amounts of data for training.

Although there is a vast amount of music data easily accessible, not all of these data can be directly used for training a machine learning system. A system for transcribing drum events from popular music, for instance, needs expert annotations precisely marking each drum hit in terms of instrument and timing (and possibly the playing technique). Marking these individual hits is, however, a very time-consuming and tedious task so that the increasing requirement for data due to the increasing system complexity usually outpaces the annotation of new data by human annotators. Given the multitude of annotations needed for various content analysis tasks, it is likely that the gap between available annotated data and required amount of training data will widen with increasing system complexity. This will result in a growing need of systems and approaches addressing this challenge, as is reflected by an increasing research interest in this problem from various angles. Current approaches include:

  • data augmentation and synthetic data: without sufficient annotated data, the machine learning engineer can “cheat” by virtually increasing the amount of training data either with synthesized data (e.g., through synthesis of MIDI data) or by processing existing audio data with irrelevant transformations, for instance, pitch shifting for music genre classification or segmenting the longer file into shorter segments mignot_analysis_2019 ;

  • transfer learning: although data might be scarce for one task, it might be available for related tasks; therefore, the idea of transfer learning is to take an internal representation of a system trained for one task with abundant data and use this (hopefully powerful) representation as a feature input to a more simple classifier that can be trained with significantly less data choi_transfer_2017 ;

  • weakly labeled data: annotating audio data with high accuracy is a tedious and time-consuming task; it becomes significantly less demanding if high time accuracy is not required, e.g., when only labeling the presence of a musical instrument in a snippet of audio without pinpointing its exact occurrence time(s); this, however, requires modifying existing machine learning approaches to deal with the weak time accuracy gururani_attention_2019 ;

  • self-supervised systems and unsupervised systems which tackle the data challenge in MIR in a different way by exploring possibilities of training systems with unlabeled data; it is, for example, possible to train a complex system with high data requirements with unlabeled data by utilizing the outputs of pre-existing systems for these unlabeled data as training targets for the new system wu_labeled_2018 or by utilizing synthesis iteratively to learn from unlabeled data choi_deep_2019 .

As the field of machine learning evolves rapidly, it is difficult to predict if any of the methods above will become standard approaches. It is clear, however, that the lack of large-scale training data will continue to impact the progress and methods applied to audio content analysis.

5.2 Interpretability and Understandability

The increasing complexity of machine learning systems not only affects the required amount of training data, but also leads to highly complex systems which do not allow easy interpretation of internal states, easy analysis of results, or understanding of influencing factors lipton_mythos_2018 . In traditional feature-driven designs, each feature had at the minimum a technical connotation allowing an expert to connect system outputs to somewhat interpretable inputs. As modern systems learn features from the data, it becomes increasingly difficult to derive this link, thus limiting the interpretability and control over system behavior. This restricts possibilities to tweak system outputs for specific inputs and could increase the likelihood of the system producing unexpected results especially for unseen data. Recent years have seen this problem partly being addressed in the field of audio content analysis through (i) visualization that gives insights into the networks’ internal states or intermediate representations, diagnoses the embedded space, or disentangles the complex internal representations into interpretable graphs zhang_visual_2018 , (ii) enforcing interpretable latent spaces or intermediate representations that are humanly understandable through regularization hadjeres_glsr-vae_2017 ; brunner_midi-vae_2018 , and (iii) analysis and transformation of latent spaces to interpretable spaces adel_discovering_2018 .

5.3 Perceptual meaning

A challenge unrelated to the technical progress but somewhat emphasized by the engineering-driven methodologies in the field is the question of the perceptual relevance of the results of ACA systems. The problems of data annotation in the context of musical genre and emotion have already been discussed in Sects. 4.2 and 4.3; there are, however, other fundamental issues when it comes to “musical meaning.” While, for example, the task of extracting drum onset times from audio is clearly defined, it is less clear what higher-level musical meaning can be derived from it. It is not easy to answer questions such as: is there a generalized way of describing rhythm and rhythmic properties, and can specific rhythmic properties and/or timing variations be mapped to specific affective responses in humans? A major obstacle impeding MIR research is the inability to successfully isolate (and therefore understand) the various score-based and performance characteristics that contribute to the music listening experience. The listener, however, has to be the ultimate judge of the usefulness of any audio content analysis system lerch_music_2019 .

6 Outlook

Audio Content Analysis is an emerging research field facing interesting challenges and enabling a wide range of future applications. We are already seeing new applications and emerging companies building on the advances made in the field.

6.1 Music education

The idea of utilizing technology to assist music (performance) education is not new. Seashore pointed out the value of scientific observation of music performances for improving learning as early as the 1930s seashore_psychology_1938 . One of the earliest studies exploring the potential of computer-assisted techniques in the music classroom was carried out by Allvin allvin_computer-assisted_1971 . Although his work was focused on using technology for providing individualized instruction and developing aural and visual aids to facilitate learning, it also highlighted the potential of using ACA techniques such as pitch detection to perform error analysis in a musical performance and provide constructive feedback to the learners through (semi-)automated music tutoring software. Such software aims at supplementing teachers by providing students with insights and interactive feedback by analyzing and assessing the audio of practice sessions. The ultimate goals of an interactive music tutor are to highlight problematic parts of the students’ performance, provide a concise yet easily understandable analysis, give specific and understandable feedback on how to improve, and individualize the curriculum depending on the students’ mistakes and general progress.

Tools for performance assessment typically evaluate one or more performance parameters, usually related to the accuracy of the performance in terms of pitch and timing wu_towards_2016 ; vidwans_objective_2017 ; pati_assessment_2018 ; luo_detection_2015 or the quality of sound (timbre) knight_potential_2011 ; romani_picas_real-time_2015 . Various (commercial) solutions with a similar set of goals are already available. These systems adopt different approaches, ranging from traditional music classroom settings to games targeting a playful learning experience. Examples of tutoring applications are SmartMusic (MakeMusic, Inc., https://www.smartmusic.com, last accessed 04/11/2019), Yousician (Yousician Oy, https://www.yousician.com, last accessed 04/11/2019), Music Prodigy (The Way of H, Inc., dba Music Prodigy, http://www.musicprodigy.com, last accessed 04/11/2019), and SingStar (Sony Interactive Entertainment, http://www.singstar.com, last accessed 04/11/2019).
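
As a simplified illustration of how such a tool might quantify pitch accuracy (this is not the method of any specific product or study cited above), the following sketch estimates the fundamental frequency of a recorded practice excerpt with the pYIN implementation in librosa and reports the fraction of voiced frames that lie within a tolerance of a reference pitch sequence given in MIDI numbers; the file path, reference notes, and tolerance are placeholder assumptions.

    import numpy as np
    import librosa

    def pitch_accuracy(path, reference_midi, tol_semitones=0.5):
        """Toy metric: fraction of voiced frames within tolerance of any reference note."""
        y, sr = librosa.load(path, sr=None, mono=True)
        f0, voiced, _ = librosa.pyin(y,
                                     fmin=librosa.note_to_hz('C2'),
                                     fmax=librosa.note_to_hz('C7'),
                                     sr=sr)
        midi = librosa.hz_to_midi(f0[voiced])        # estimated pitches of voiced frames, in MIDI numbers
        if midi.size == 0:
            return 0.0                               # nothing voiced detected
        ref = np.asarray(reference_midi, dtype=float)
        # distance of each estimated frame to the closest reference note
        dist = np.min(np.abs(midi[:, None] - ref[None, :]), axis=1)
        return float(np.mean(dist <= tol_semitones))

    # score = pitch_accuracy("practice_take.wav", reference_midi=[60, 62, 64, 65, 67])  # placeholders

A real tutoring system would additionally align the recording to the score in time and model timing and timbre, but the basic ingredients (pitch tracking plus comparison against a reference) remain the same.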

6.2 Music production

Knowledge of the audio content enables the improvement of music production tools in various dimensions. The most obvious enhancement concerns productivity and efficiency: the better a software tool understands the details of incoming audio streams or files, the better it can adapt, for instance, by applying default gain and equalization parameters reiss_applications_2018 or by suggesting compatible recordings from a library. Systems might support editors by automatically splicing multiple recordings from one session without artifacts or by selecting error-free takes from a set of recordings. Modern processing methods allow for subtle or dramatic timing and pitch variations in high quality (e.g., Antares Autotune, https://autotune.com, and zplane elastique, https://licensing.zplane.de/technology#elastique, both last accessed 01/14/2020); controlling them with musically relevant, content-adaptive intelligence could streamline music production in unprecedented ways.

Modern tools also enhance the creative possibilities of the production process. For example, creating harmonically meaningful background choirs by analyzing the lead vocals and the harmony track is already technically feasible today (zplane vielklang, https://vielklang.zplane.de, last accessed 01/14/2020). Knowing and possibly separating the sound sources in a recording could enable new ways of modifying or morphing different sounds to create new soundscapes and auditory scenes. Many more scenarios are conceivable in which audio analysis will impact the production process, although the multifaceted character of the field makes it difficult to predict specific use cases.

6.3 Music distribution and consumption

Audio analysis has already started to transform consumer-facing industries such as streaming services, with audio-based music recommendation and playlist generation systems using an in-depth understanding of the musical content knees_intelligent_2019 . This is not only the case for end consumers themselves: there is also an industry need for automatically identifying music and creating playlists that conform to a company’s brand image herzog_predicting_2017 .

In the near future, we can expect the rise of creative music discovery and listening applications that enable listeners not only to choose content but also to interact with the content itself. This could include, for example, adjusting the gain of individual voices, replacing instruments or vocalists, or interactively changing the musical arrangement.

6.4 Generative music

An important outcome of being able to extract machine-interpretable content information from audio data is the possibility of feeding these data to generative algorithms. The automatic composition and rendition of music is emerging as a challenging yet popular research direction briot_deep_2020 , gaining interest from both research institutions and industry. While larger questions concerning the capabilities and restrictions of computational creativity as well as the aesthetic evaluation of algorithmically generated music remain largely unanswered, practical applications such as generating background music for user videos and commercial advertisements are currently a focus of many researchers. The interactive and adaptive generation of soundtracks for video games as well as the individualized generation of license-free music content for streaming are additional long-term goals of considerable commercial interest.

References

  • (1) Abeßer, J., Hasselhorn, J., Grollmisch, S., Dittmar, C., Lehmann, A.: Automatic Competency Assessment of Rhythm Performances of Ninth-grade and Tenth-grade Pupils. In: Proceedings of the International Computer Music Conference (ICMC). Athens (2014)
  • (2) Adel, T., Ghahramani, Z., Weller, A.: Discovering Interpretable Representations for Both Deep Generative and Discriminative Models. In: Proceedings of the International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research, pp. 50–59. Stockholm (2018). URL http://proceedings.mlr.press/v80/adel18a.html
  • (3) Allvin, R.L.: Computer-Assisted Music Instruction: A Look at the Potential. Journal of Research in Music Education 19(2) (1971). URL http://www.jstor.org/stable/3343819
  • (4) Böck, S., Krebs, F., Widmer, G.: Accurate Tempo Estimation Based on Recurrent Neural Networks and Resonating Comb Filters. In: Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pp. 625–631. Malaga, Spain (2015). URL http://ismir2015.uma.es/articles/196_Paper.pdf
  • (5) Bello, J.P., Daudet, L., Abdallah, S., Duxbury, C., Davies, M., Sandler, M.B.: A tutorial on onset detection in music signals. Transactions on Speech and Audio Processing 13(5), 1035–1047 (2005). DOI 10.1109/TSA.2005.851998. URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1495485
  • (6) Benetos, E., Dixon, S., Duan, Z., Ewert, S.: Automatic Music Transcription: An Overview. IEEE Signal Processing Magazine 36(1), 20–30 (2019). DOI 10.1109/MSP.2018.2869928
  • (7) Benetos, E., Dixon, S., Giannoulis, D., Kirchhoff, H., Klapuri, A.: Automatic music transcription: challenges and future directions. Journal of Intelligent Information Systems 41(3), 407–434 (2013). DOI 10.1007/s10844-013-0258-3. URL http://link.springer.com/10.1007/s10844-013-0258-3
  • (8) Bertin, N., Badeau, R., Richard, G.: Blind Signal Decompositions for Automatic Transcription of Polyphonic Music: NMF and K-SVD on the Benchmark. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 1, pp. I–65–I–68. IEEE, Honolulu (2007). DOI 10.1109/ICASSP.2007.366617. ISSN: 2379-190X
  • (9) Boser, B.E., Guyon, I.M., Vapnik, V.N.: A Training Algorithm for Optimal Margin Classifiers. In: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, COLT ’92, pp. 144–152. Association for Computing Machinery, New York, NY, USA (1992). DOI 10.1145/130385.130401. URL https://doi.org/10.1145/130385.130401
  • (10) Bozkurt, B., Baysal, O., Yüret, D.: A Dataset and Baseline System for Singing Voice Assessment. In: Proceedings of the International Symposium on CMMR. Matosinhos (2017)
  • (11) Breiman, L.: Random Forests. Machine Learning 45(1), 5–32 (2001). DOI 10.1023/A:1010933404324. URL https://doi.org/10.1023/A:1010933404324
  • (12) Briot, J.P., Hadjeres, G., Pachet, F.D.: Deep Learning Techniques for Music Generation. Computational Synthesis and Creative Systems. Springer International Publishing, Cham (2020). DOI 10.1007/978-3-319-70163-9. URL http://link.springer.com/10.1007/978-3-319-70163-9
  • (13) Brunner, G., Konrad, A., Wang, Y., Wattenhofer, R.: MIDI-VAE: Modeling Dynamics and Instrumentation of Music with Applications to Style Transfer. In: Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pp. 747–754. Paris, France (2018). URL http://ismir2018.ircam.fr/doc/pdfs/204_Paper.pdf
  • (14) Burgoyne, J.A., Fujinaga, I., Downie, J.S.: Music Information Retrieval. In: S. Schreibman, R. Siemens, J. Unsworth (eds.) A New Companion to Digital Humanities, pp. 213–228. John Wiley & Sons, Ltd (2015). DOI 10.1002/9781118680605.ch15. URL http://onlinelibrary.wiley.com/doi/10.1002/9781118680605.ch15/summary
  • (15) Burred, J.J., Lerch, A.: Hierarchical Automatic Audio Signal Classification. Journal of the Audio Engineering Society (JAES) 52(7/8), 724–739 (2004). URL http://www.musicinformatics.gatech.edu/wp-content_nondefault/uploads/2016/10/Burred-and-Lerch-2004-Hierarchical-Automatic-Audio-Signal-Classification.pdf
  • (16) Caclin, A., McAdams, S., Smith, B.K., Winsberg, S.: Acoustic correlates of timbre space dimensions: A confirmatory study using synthetic tones. Journal of the Acoustical Society of America (JASA) 118(1), 471–482 (2005). DOI 10.1121/1.1929229. URL http://link.aip.org/link/JASMAN/v118/i1/p471/s1&Agg=doi
  • (17) Cano, P., Batlle, E., Gomez, E., Gomes, L.D.C.T., Bonnet, M.: Audio Fingerprinting: Concepts And Applications. In: S.K. Halgamuge, L. Wang (eds.) Computational Intelligence for Modelling and Prediction, pp. 233–245. Springer, Berlin (2005)
  • (18) Cano, P., Batlle, E., Kalker, T., Haitsma, J.: A Review of Audio Fingerprinting. The Journal of VLSI Signal Processing-Systems for Signal, Image, and Video Technology 41(3), 271–284 (2005). DOI 10.1007/s11265-005-4151-3. URL http://www.springerlink.com/index/10.1007/s11265-005-4151-3
  • (19) Cheveigné, A.D., Kawahara, H.: Multiple period estimation and pitch perception model. Speech Communication 27, 175–185 (1999)
  • (20) Choi, K., Cho, K.: Deep Unsupervised Drum Transcription. In: Proceedings of the International Society for Music Information Retrieval Conference (ISMIR). Delft, Netherlands (2019)
  • (21) Choi, K., Fazekas, G., Sandler, M.B., Cho, K.: Transfer Learning for Music Classification and Regression Tasks. In: Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pp. 141–149. Suzhou, China (2017). URL https://ismir2017.smcnus.org/wp-content/uploads/2017/10/12_Paper.pdf
  • (22) Chuan, C.H., Chew, E.: Polyphonic Audio Key Finding Using the Spiral Array CEG Algorithm. In: International Conference on Multimedia and Expo, pp. 21–24. IEEE (2005). DOI 10.1109/ICME.2005.1521350. URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1521350
  • (23) Dannenberg, R.B., Goto, M.: Music Structure Analysis from Acoustic Signals. In: D. Havelock, S. Kuwano, M. Vorländer (eds.) Handbook of Signal Processing in Acoustics, pp. 305–331. Springer, New York, NY (2008). DOI 10.1007/978-0-387-30441-0˙21. URL https://doi.org/10.1007/978-0-387-30441-0_21
  • (24) Davis, S.B., Mermelstein, P.: Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. Transactions on Acoustics, Speech, and Signal Processing 28(4), 357–366 (1980). DOI 10.1109/TASSP.1980.1163420. URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1163420
  • (25) Devaney, J., Mandel, M.I., Ellis, D.P., Fujinaga, I.: Automatically extracting performance data from recordings of trained singers. Psychomusicology: Music, Mind and Brain 21(1-2), 108–136 (2011). DOI 10.1037/h0094008
  • (26) Dixon, S.: A Beat Tracking System for Audio Signals. In: Proceedings of the Conference on Mathematical and Computational Methods in Music. Vienna (1999)
  • (27) Dixon, S., Pampalk, E., Widmer, G.: Classification of Dance Music by Periodicity Patterns. In: Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pp. 1–7 (2003)
  • (28) Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification, 2nd edn. John Wiley & Sons, Inc., New York (2000)
  • (29) Fix, E., Hodges, J.L.: Discriminatory Analysis - Nonparametric Discrimination: Consistency Properties. Tech. rep., University of California Berkeley (1951). URL https://apps.dtic.mil/docs/citations/ADA800276
  • (30) Foote, J.T.: Automatic audio segmentation using a measure of audio novelty. In: Proceedings of the International Conference on Multimedia and Expo (ICME), pp. 452–455. New York (2000)
  • (31) Fraisse, P.: Time and Rhythm Perception. In: E.C. Carterette, M.P. Friedman (eds.) Perceptual Coding, pp. 203–254. Academic Press (1978). DOI 10.1016/B978-0-12-161908-4.50012-7. URL http://www.sciencedirect.com/science/article/pii/B9780121619084500127
  • (32) Fujishima, T.: Realtime Chord Recognition of Musical Sound: a System Using Common Lisp Music. In: Proceedings of the International Computer Music Conference (ICMC) (1999)
  • (33) Gómez, E.: Tonal Description of Polyphonic Audio for Music Content Processing. INFORMS Journal on Computing 18(3), 294–304 (2006). DOI 10.1287/ijoc.1040.0126. URL http://pubsonline.informs.org/doi/abs/10.1287/ijoc.1040.0126
  • (34) Goodfellow, I., Bengio, Y., Courville, A.: Deep learning. MIT press (2016)
  • (35) Gouyon, F., Herrera, P.: A Beat Induction Method for Musical Audio Signals. In: Proceedings of the 4th European Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS). London (2003)
  • (36) Gururani, S., Sharma, M., Lerch, A.: An Attention Mechanism for Music Instrument Recognition. In: Proceedings of the International Society for Music Information Retrieval Conference (ISMIR). International Society for Music Information Retrieval (ISMIR), Delft (2019). URL http://www.musicinformatics.gatech.edu/wp-content_nondefault/uploads/2019/07/Gururani-et-al.-2019-An-Attention-Mechanism-for-Music-Instrument-Recogn.pdf
  • (37) Hadjeres, G., Nielsen, F., Pachet, F.: GLSR-VAE: Geodesic latent space regularization for variational autoencoder architectures. In: Proceedings of the IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1–7. IEEE, Honolulu (2017). DOI 10.1109/SSCI.2017.8280895
  • (38) Haitsma, J., Kalker, T.: A Highly Robust Audio Fingerprinting System. In: Proceedings of the International Society for Music Information Retrieval Conference (ISMIR). Paris (2002)
  • (39) Hamel, P., Eck, D.: Learning Features from Music Audio with Deep Belief Networks. In: Proceedings of the International Society for Music Information Retrieval Conference (ISMIR). Utrecht, Netherlands (2010). URL https://zenodo.org/record/1414970
  • (40) Hamming, R.W.: Error detecting and error correcting codes. The Bell System Technical Journal 29(2), 147–160 (1950). DOI 10.1002/j.1538-7305.1950.tb00463.x
  • (41) Heittola, T., Klapuri, A., Virtanen, T.: Musical Instrument Recognition in Polyphonic Audio Using Source-Filter Model for Sound Separation. In: Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pp. 327–332 (2009). URL http://www.researchgate.net/profile/Tuomas_Virtanen2/publication/220723588_Musical_Instrument_Recognition_in_Polyphonic_Audio_Using_Source-Filter_Model_for_Sound_Separation/links/0fcfd508e3cf10b707000000.pdf
  • (42) Helmholtz, H.V.: Die Lehre von den Tonempfindungen als physiologische Grundlage für die Theorie der Musik, 3rd edn. Vieweg, Braunschweig (1870)
  • (43) Herzog, M., Lepa, S., Steffens, J., Schoenrock, A., Egermann, H.W.: Predicting Musical Meaning in Audio Branding Scenarios. In: Proceedings of the Conference of the European Society for Cognitive Science of Music (ESCOM). Ghent, Belgium (2017). URL http://eprints.whiterose.ac.uk/116600/
  • (44) Humphrey, E.J., Bello, J.P., LeCun, Y.: Feature learning and deep architectures: new directions for music informatics. Journal of Intelligent Information Systems 41(3), 461–481 (2013). DOI 10.1007/s10844-013-0248-5. URL https://doi.org/10.1007/s10844-013-0248-5
  • (45) Izmirli, Ö.: Template based key finding from audio. In: Proceedings of the International Computer Music Conference (ICMC). Barcelona (2005)
  • (46) Jensen, J.H., Christensen, M.G., Murthi, M.N., Jensen, S.H.: Evaluation of MFCC estimation techniques for music similarity. In: Proceedings of the XIV. European Signal Processing Conference (EUSIPCO). Florence (2006)
  • (47) Juslin, P.N.: Cue Utilization in Communication of Emotion in Music Performance: Relating Performance to Perception. Journal of Experimental Psychology 26(6), 1797–1813 (2000)
  • (48) Kim, J.Y., Belkin, N.J.: Categories of Music Description and Search Terms and Phrases Used by Non-Music Experts. In: Proceedings of the International Society for Music Information Retrieval Conference (ISMIR). Paris (2002). URL http://ismir2002.ismir.net/proceedings/02-FP07-2.pdf
  • (49) Klapuri, A.P.: Multiple Fundamental Frequency Estimation Based on Harmonicity and Spectral Smoothness. Transactions on Speech and Audio Processing 11(6), 804–816 (2003). DOI 10.1109/TSA.2003.815516. URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1255467
  • (50) Knees, P., Schedl, M., Goto, M.: Intelligent User Interfaces for Music Discovery: The Past 20 Years and What’s to Come. In: Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pp. 44–53. Delft, Netherlands (2019). URL http://archives.ismir.net/ismir2019/paper/000003.pdf
  • (51) Knight, T., Upham, F., Fujinaga, I.: The potential for automatic assessment of trumpet tone quality. In: Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pp. 573–578. Miami (2011)
  • (52) Krumhansl, C.L.: Cognitive Foundations of Musical Pitch. Oxford University Press, New York (1990)
  • (53) Large, E.W.: Beat Tracking with a Nonlinear Oscillator. In: Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI). Montreal (1995)
  • (54) Lee, H., Pham, P., Largman, Y., Ng, A.Y.: Unsupervised feature learning for audio classification using convolutional deep belief networks. In: Y. Bengio, D. Schuurmans, J.D. Lafferty, C.K.I. Williams, A. Culotta (eds.) Advances in Neural Information Processing Systems 22, pp. 1096–1104. Curran Associates, Inc. (2009). URL http://papers.nips.cc/paper/3674-unsupervised-feature-learning-for-audio-classification-using-convolutional-deep-belief-networks.pdf
  • (55) Lerch, A.: Software-Based Extraction of Objective Parameters from Music Performances. GRIN Verlag, München (2009). DOI 10.14279/depositonce-2025
  • (56) Lerch, A.: An Introduction to Audio Content Analysis: Applications in Signal Processing and Music Informatics. Wiley-IEEE Press, Hoboken (2012). URL http://ieeexplore.ieee.org/xpl/bkabstractplus.jsp?bkn=6266785
  • (57) Lerch, A., Arthur, C., Pati, A., Gururani, S.: Music Performance Analysis: A Survey. In: Proceedings of the International Society for Music Information Retrieval Conference (ISMIR). International Society for Music Information Retrieval (ISMIR), Delft (2019). URL http://www.musicinformatics.gatech.edu/wp-content_nondefault/uploads/2019/06/Lerch-et-al.-2019-Music-Performance-Analysis-A-Survey.pdf
  • (58) Lerdahl, F., Jackendorf, R.: A Generative Theory of Tonal Music. MIT Press, Cambridge (1983)
  • (59) Lipton, Z.C.: The Mythos of Model Interpretability. Queue 16(3), 31–57 (2018). DOI 10.1145/3236386.3241340. URL https://doi.org/10.1145/3236386.3241340
  • (60) Logan, B.: Mel Frequency Cepstral Coefficients for Music Modeling. In: Proceedings of the International Society for Music Information Retrieval Conference (ISMIR). Plymouth (2000)
  • (61) Lourens, J.G.: Detection and Logging Advertisements using its Sound. Transactions on Broadcasting 36(3), 231–233 (1990). DOI 10.1109/11.59850. URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=59850
  • (62) Luo, Y.J.: Detection of Common Mistakes in Novice Violin Playing. In: Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pp. 316–322. Malaga (2015). URL http://ismir2015.uma.es/articles/197_Paper.pdf
  • (63) Lykartsis, A., Lerch, A.: Beat Histogram Features for Rhythm-based Musical Genre Classification Using Multiple Novelty Functions. In: Proceedings of the International Conference on Digital Audio Effects (DAFX). Trondheim, Norway (2015). URL http://www.musicinformatics.gatech.edu/wp-content_nondefault/uploads/2015/12/DAFx-15_submission_42-1.pdf
  • (64) Maempel, H.J.: Musikaufnahmen als Datenquellen der Interpretationsanalyse. In: H. von Lösch, S. Weinzierl (eds.) Gemessene Interpretation - Computergestützte Aufführungsanalyse im Kreuzverhör der Disziplinen, Klang und Begriff, pp. 157–171. Schott, Mainz (2011)
  • (65) Mairal, J., Bach, F., Ponce, J., Sapiro, G.: Online Dictionary Learning for Sparse Coding. In: Proceedings of the 26th Annual International Conference on Machine Learning, ICML ’09, pp. 689–696. Association for Computing Machinery, New York, NY, USA (2009). DOI 10.1145/1553374.1553463. URL https://doi.org/10.1145/1553374.1553463
  • (66) Mauch, M., Dixon, S.: PYIN: A fundamental frequency estimator using probabilistic threshold distributions. In: International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 659–663. Florence, Italy (2014). DOI 10.1109/ICASSP.2014.6853678. ISSN: 2379-190X
  • (67) McAdams, S., Winsberg, S., Donnadieu, S., Soete, G.D., Krimphoff, J.: Perceptual scaling of synthesized musical timbres: Common dimensions, specificities, and latent subject classes. Psychological Research 58(3), 177–192 (1995). DOI 10.1007/BF00419633. URL http://link.springer.com/article/10.1007/BF00419633
  • (68) Meyer, L.B.: Emotion and Meaning in Music. University of Chicago Press, Chicago (1956)
  • (69) Mignot, R., Peeters, G.: An Analysis of the Effect of Data Augmentation Methods: Experiments for a Musical Genre Classification Task. Transactions of the International Society for Music Information Retrieval 2(1), 97–110 (2019). DOI 10.5334/tismir.26. URL http://transactions.ismir.net/article/10.5334/tismir.26/
  • (70) Müller, M.: Information Retrieval for Music and Motion. Springer, Berlin (2007)
  • (71) Nakamura, T.: The communication of dynamics between musicians and listeners through musical performance. Perception & Psychophysics 41(6), 525–533 (1987)
  • (72) Nakano, T., Goto, M., Hiraga, Y.: An automatic singing skill evaluation method for unknown melodies using pitch interval accuracy and vibrato features. Rn 12, 1 (2006). URL https://staff.aist.go.jp/t.nakano/PAPER/INTERSPEECH2006nakano.pdf
  • (73) Noll, A.M.: Pitch Determination of Human Speech by the Harmonic Product Spectrum, the Harmonic Sum Spectrum, and a Maximum Likelihood Estimate. In: Proceedings of the Symposium on Computer Processing in Communications, vol. 19, pp. 779–797. Polytechnic Press of the University of Brooklyn, Brooklyn (1969)
  • (74) Pachet, F., Cazaly, D.: A Taxonomy of Musical Genres. In: Proceedings of the Conference on Content-Based Multimedia Information Access. Paris (2000)
  • (75) Palmer, C.: Mapping Musical Thought to Musical Performance. Journal of Experimental Psychology: Human Perception and Performance 15(2), 331–346 (1989)
  • (76) Palmer, C.: Music Performance. Annual Review of Psychology 48, 115–138 (1997)
  • (77) Pati, K.A., Gururani, S., Lerch, A.: Assessment of Student Music Performances Using Deep Neural Networks. Applied Sciences 8(4), 507 (2018). DOI 10.3390/app8040507. URL http://www.mdpi.com/2076-3417/8/4/507/pdf
  • (78) Paulus, J., Muller, M., Klapuri, A.P.: State of the Art Report: Audio-Based Music Structure Analysis. In: Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pp. 625–636. Utrecht, Netherlands (2010). URL http://ismir2010.ismir.net/proceedings/ismir2010-107.pdf
  • (79) Pauws, S.: Musical key extraction from audio. In: Proceedings of the International Society for Music Information Retrieval Conference (ISMIR). Barcelona (2004)
  • (80) Rabiner, L.R.: On the use of autocorrelation analysis for pitch detection. IEEE Transactions on Acoustics, Speech, and Signal Processing 25(1), 24–33 (1977). DOI 10.1109/TASSP.1977.1162905
  • (81) Reiss, J.D., Brandtsegg, Ø.: Applications of Cross-Adaptive Audio Effects: Automatic Mixing, Live Performance and Everything in Between. Frontiers in Digital Humanities 5 (2018). DOI 10.3389/fdigh.2018.00017. URL https://www.frontiersin.org/articles/10.3389/fdigh.2018.00017/full
  • (82) Romani Picas, O., Rodriguez, H.P., Dabiri, D., Tokuda, H., Hariya, W., Oishi, K., Serra, X.: A Real-Time System for Measuring Sound Goodness in Instrumental Sounds. In: Proceedings of the Audio Engineering Society Convention, vol. 138. Warsaw (2015)
  • (83) Russell, J.A.: A Circumplex Model of Affect. Journal of Personality and Social Psychology 39(6), 1161–1178 (1980). DOI 10.1037/h0077714
  • (84) Schedl, M., Gómez, E., Urbano, J.: Music Information Retrieval: Recent Developments and Applications. Foundations and Trends® in Information Retrieval 8(2-3), 127–261 (2014). DOI 10.1561/1500000042. URL http://www.nowpublishers.com/article/Details/INR-042
  • (85) Scheirer, E.D.: Tempo and beat analysis of acoustic musical signals. Journal of the Acoustical Society of America (JASA) 103(1), 588–601 (1998)
  • (86) Scherer, K.R.: Why Music does not Produce Basic Emotions: Pleading for a new Approach to Measuring the Emotional Effects of Music. In: Proceedings of the Stockholm Music Acoustics Conference (SMAC). Stockholm (2003)
  • (87) Scherer, K.R.: Which Emotions Can be Induced by Music? What Are the Underlying Mechanisms? And How Can We Measure Them? Journal of New Music Research 33(3), 239–251 (2004). DOI 10.1080/0929821042000317822. URL http://www.informaworld.com/openurl?genre=article&doi=10.1080/0929821042000317822&magic=crossref||D404A21C5BB053405B1A640AFFD44AE3
  • (88) Schloss, W.A.: On the Automatic Transcription of Percussive Music – From Acoustic Signal to High-Level Analysis. Dissertation, Stanford University, Center for Computer Research in Music and Acoustics (CCRMA), Stanford (1985)
  • (89) Seashore, C.E.: Psychology of Music. McGraw-Hill, New York (1938)
  • (90) Smaragdis, P., Brown, J.C.: Non-Negative Matrix Factorization for Polyphonic Music Transcription. In: Proceedings of the Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, New Paltz (2003). DOI 10.1109/ASPAA.2003.1285860. URL http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1285860
  • (91) Stevens, S.S., Volkmann, J., Newman, E.: A Scale for the Measurement of the Psychological Magnitude Pitch. Journal of the Acoustical Society of America (JASA) 8(3), 185–190 (1937). DOI 10.1121/1.1915893. URL http://link.aip.org/link/?JAS/8/185/1&Agg=doi
  • (92) Temperley, D.: The Tonal Properties of Pitch-Class Sets : Tonal Implication, Tonal Ambiguity, and Tonalness. Computing in Musicology 15, 24–38 (2007)
  • (93) Thompson, S., Williamon, A.: Evaluating Evaluation: Musical Performance Assessment as a Research Tool. Music Perception: An Interdisciplinary Journal 21(1), 21–41 (2003). DOI 10.1525/mp.2003.21.1.21. URL https://mp.ucpress.edu/content/21/1/21
  • (94) Tzanetakis, G., Ermolinskyi, A., Cook, P.: Pitch Histograms in Audio and Symbolic Music Information Retrieval. In: Proceedings of the International Society for Music Information Retrieval Conference (ISMIR). Paris (2002)
  • (95) Tzanetakis, G., Jones, R., McNally, K.: Stereo panning features for classifying recording production style. In: Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), pp. 441–444. Vienna (2007). URL http://webhome.cs.uvic.ca/gtzan/work/pubs/ismir07gtzanE.pdf
  • (96) Vidwans, A., Gururani, S., Wu, C.W., Subramanian, V., Swaminathan, R.V., Lerch, A.: Objective descriptors for the assessment of student music performances. In: Proceedings of the AES Conference on Semantic Audio. Audio Engineering Society (AES), Erlangen (2017). URL http://www.musicinformatics.gatech.edu/wp-content_nondefault/uploads/2017/06/Vidwans-et-al_2017_Objective-descriptors-for-the-assessment-of-student-music-performances.pdf
  • (97) Wang, A.: An Industrial Strength Audio Search Algorithm. In: Proceedings of the International Society for Music Information Retrieval Conference (ISMIR). Washington (2003). URL http://www.cs.northwestern.edu/pardo/courses/casa/papers/AnIndustrial-StrengthAudioSearchAlgorithm.pdf
  • (98) Weinzierl, S., Lepa, S., Schultz, F., Detzner, E., von Coler, H., Behler, G.: Sound power and timbre as cues for the dynamic strength of orchestral instruments. The Journal of the Acoustical Society of America 144(3), 1347–1355 (2018). DOI 10.1121/1.5053113. URL http://asa.scitation.org/doi/10.1121/1.5053113
  • (99) Wesolowski, B.C., Wind, S.A., Engelhard, G.: Examining Rater Precision in Music Performance Assessment: An Analysis of Rating Scale Structure Using the Multifaceted Rasch Partial Credit Model. Music Perception: An Interdisciplinary Journal 33(5), 662–678 (2016). DOI 10.1525/mp.2016.33.5.662. URL https://mp.ucpress.edu/content/33/5/662
  • (100) Wu, C.W., Gururani, S., Laguna, C., Pati, A., Vidwans, A., Lerch, A.: Towards the Objective Assessment of Music Performances. In: Proceedings of the International Conference on Music Perception and Cognition (ICMPC), pp. 99–103. San Francisco (2016). URL http://www.icmpc.org/icmpc14/proceedings.html
  • (101) Wu, C.W., Lerch, A.: Assessment of Percussive Music Performances with Feature Learning. International Journal of Semantic Computing 12(3), 315–333 (2018). DOI 10.1142/S1793351X18400147. URL http://www.musicinformatics.gatech.edu/wp-content_nondefault/uploads/2018/09/ws-ijsc_cw_submission.pdf
  • (102) Wu, C.W., Lerch, A.: From Labeled to Unlabeled Data – On the Data Challenge in Automatic Drum Transcription. In: Proceedings of the International Society for Music Information Retrieval Conference (ISMIR). Paris (2018). URL http://www.musicinformatics.gatech.edu/wp-content_nondefault/uploads/2018/06/Wu-and-Lerch-From-Labeled-to-Unlabeled-Data-On-the-Data-Chal.pdf
  • (103) Yang, Y.H., Chen, H.H.: Machine Recognition of Music Emotion: A Review. ACM Transactions on Intelligent Systems and Technology (TIST) 3(3) (2012). DOI 10.1145/2168752.2168754. URL https://doi.org/10.1145/2168752.2168754
  • (104) Zentner, M., Grandjean, D., Scherer, K.R.: Emotions Evoked by the Sound of Music: Characterization, Classification, and Measurement. Emotion 8(4), 494–521 (2008). DOI 10.1037/1528-3542.8.4.494. URL http://www.ncbi.nlm.nih.gov/pubmed/18729581
  • (105) Zhang, Q.s., Zhu, S.c.: Visual interpretability for deep learning: a survey. Frontiers of Information Technology & Electronic Engineering 19(1), 27–39 (2018). DOI 10.1631/FITEE.1700808. URL https://doi.org/10.1631/FITEE.1700808

Index

  • Audio Content Analysis §1
  • Audio Feature §1.2
  • Audio Fingerprinting §4.1
  • Auto Correlation Function §2.2
  • Beat detection §3.2
  • Descriptor §1.2
  • Features §4.2
  • Fundamental frequency detection
  • Harmonic Product Spectrum §2.2
  • Inference §1.2
  • Mel Frequency Cepstral Coefficients 2nd item
  • MIREX §2.3
  • Music Emotion Recognition §4.3
  • Music Genre Classification §4.2
  • Music Information Retrieval §1
  • Music performance analysis §3
  • Music transcription §2
  • Musical Key Detection §2.1
  • Novelty function §3.2
  • Onset detection §3.2
  • Pitch chroma §2.1
  • Root Mean Square 2nd item
  • Self-Similarity Matrix §2.4
  • Short Time Fourier Transform §1.2
  • Spectral Centroid 1st item
  • Spectrogram §1.2
  • Tempo Induction §3.2
  • Waveform §1.2