
Melody Structure Transfer Network: Generating Music with Separable Self-Attention

Ning Zhang, Junchi Yan
Department of Computer Science and Engineering, Shanghai Jiao Tong University
{ningz, yanjunchi}@sjtu.edu.cn
Abstract

Symbolic music generation has attracted increasing attention, while most methods focus on generating short pieces (mostly fewer than 8 bars, and up to 32 bars). Generating long music calls for effective expression of coherent music structure. Despite their success on long sequences, self-attention architectures still struggle with long-term music, as it requires additional care for the subtle music structure. In this paper, we propose to transfer the structure of training samples for new music generation, and develop a novel separable self-attention based model which enables the learning and transferring of structure embeddings. We show that our transfer model can generate music sequences (up to 100 bars) with interpretable structures, which bear similar structures and composition techniques to the template music from the training set. Extensive experiments show its ability to generate music with the target structure and good diversity. The generated 3,000 sets of music are uploaded as supplemental material.

1 Introduction

Algorithmic composition has a long-standing history since the early stochastic composition system ‘Atree’ [?]. Handcrafted grammars and rules have been used to regularize the generation of music of different styles and structures [??]. Developing these rules is nontrivial, as music theory itself contains many rules, and the factors contributing to the music art of different genres are difficult to specify. The rules also vary with composers’ styles and music forms. [?] builds a four-part J.S. Bach chorale composition system according to over 350 handcrafted rules. [?] extends it by learning from the score corpus of a specific composer and creating its own grammar and rules. However, algorithmic composition still remains an open problem [?].

Recent deep music generation models [??] can generate quality short pieces. For long sequences, however, the generated pieces suffer from unclear structure. A complete music piece usually consists of more than one track of long note sequences, which has clear and complex horizontal structures (along the time axis) and vertical structures (between multiple instruments or tracks). The horizontal structures are related to the music form, which consists of three levels of structure: sub-phrase, phrase, and section. The sub-phrase contains the basic ideas (motives), which are then developed into phrases and sections according to certain composition techniques [?].

The rhythm, theme and emotion of a music piece evolve over time with the extension of sub-phrases and phrases. The organization of music elements into music pieces follows many rigorous rules. A small violation, e.g. one beat too many or too few, can wreck the structure [?].

Deep music generation approaches can be categorized by the form of music representation. One category is the piano roll based models. The piano roll represents symbolic music as images of shape $P\times T\times I$, where $P$, $T$, $I$ denote the number of pitches, time steps and instruments, respectively. Some works [???] build models to generate piano rolls. Limited by the image characteristics, these methods generate music of short length (less than 16 bars). Methods falling into the second category treat music as event sequences, including note-based sequences and frame-based sequences. Music generation can then be viewed as sequence generation, and various sequence models are used to describe the joint distribution of the music sequence. Many works utilize RNNs to model music sequences [??], and the length can be up to 512 tokens [?]. Recently, self-attention architectures have demonstrated their superiority on long-term sequence processing. Methods [???] apply self-attention architectures (transformers) [?] to music sequence generation, and [?] has been shown to generate music of 2048 tokens.

Structure coherence matters for long-sequence music. Though [?] claims that it can generate music with better structure than [?], it only qualitatively shows a sample with more repeated patterns. Other works [??] also stress the problem of music structure in music generation. They try to enhance structure by increasing the probability that the model generates repeated elements. However, more repeated patterns do not mean better structure. Human compositions repeat and vary the motives to express certain emotions. [?] also proposes to transfer the structure from a template music piece to generate a new piece. However, it requires designing constraints and training a model for every template piece, which limits its applicability.

In this paper, we present a model which can transfer the structures of template music pieces to generate new pieces. Different from [?], in which the model is designed with hand-crafted constraints to learn the structure of one specific music piece of fewer than 32 bars, our model can automatically learn and transfer the structures of all pieces (up to 100 bars) in the training set. We make two important observations to motivate our approach:

1) Music structure transfer from real music to generated pieces can be label-free and (potentially) efficient, yet it has not been well studied in the literature;

2) By examining transformer based music generation models, we find that the learned self-attention matrix closely relates to music structure. In other words, transferring self-attention relations can help transfer music structure. This motivates us to explore a better use of the self-attention mechanism.

In this paper, we present novel separable self-attention based models that encode the structure information of real music pieces into embeddings, and these embeddings can be transferred to the generation process to achieve realistic long music generation, as shown in Fig. 1. Besides, different from other transformer based models [???] which adopt note-based event sequence representations, we utilize a frame-based event sequence to represent the music scores, which simplifies the control of the metrical structure of music. Moreover, we introduce key signature tokens to control the tonality of the generated music. Experiments show that our method can transfer the structure of templates to the generation of new pieces. The trained model can develop any given motif into a new piece using composition techniques similar to those of the templates. Also, the model can generate diverse pieces, rather than simply memorizing the sequences of the training data. The contributions of this work are:

1) Differing from [??], which only try to increase the probability of repeated patterns, we propose to transfer the structure of training samples for music generation, especially for long sequences. Unlike [?], our method can transfer the structures of all the training pieces and does not rely on hand-crafted constraints.

2) We develop a transferable self-attention mechanism to achieve effective structure transfer even for very long music. It separates the computation of query, key and value to enable flexible structure transfer. Note that [?] expects the self-attention to learn all the structures from the training set. In contrast, our tailored self-attention supports learning a structure embedding from each training piece and transferring it effectively to the generated samples.

3) We propose several quantitative metrics based on music theory to measure the rhythm structure similarity and interval structure similarity, as well as the diversity, of generated music. Extensive quantitative experiments show the effectiveness of the proposed structure transfer network in generating new music pieces (up to 100 bars). Besides, we quantitatively show that the generated pieces are of good diversity, rather than simply memorizing the training pieces.

Figure 1: Overview of Melody Structure Transfer Network (MSTN).

2 Preliminaries and Related Works

Music Generation by Sequence Models

Existing music generation models can be categorized into piano roll based models and event sequence based models. We briefly describe the latter, which are related to our work. In these models, music scores are represented as an event sequence $\mathbf{s}=[s_{1},s_{2},\dots,s_{t},\dots,s_{n}]$, where $s_{i}$ is the event token at time step $i$. Its joint distribution can be factorized as $p_{\theta}(\mathbf{s})=p_{\theta}(s_{1})\,p_{\theta}(s_{2}|s_{1})\cdots p_{\theta}(s_{n}|s_{n-1}\dots s_{1})$.
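For illustration, the log-likelihood implied by this factorization is simply the sum of per-step conditional log-probabilities. The sketch below is a minimal NumPy illustration; `next_token_logits` is a hypothetical stand-in for any autoregressive model $p_{\theta}$, not an interface defined in this paper.

```python
import numpy as np

def sequence_log_prob(tokens, next_token_logits):
    """Log-likelihood of an event sequence under the factorization
    p(s) = prod_t p(s_t | s_1 ... s_{t-1})."""
    log_p = 0.0
    for t in range(len(tokens)):
        logits = next_token_logits(tokens[:t])        # conditioned on the prefix
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                          # softmax over the vocabulary
        log_p += np.log(probs[tokens[t]])
    return log_p

# Toy usage: a uniform "model" over a vocabulary of 8 event tokens.
uniform = lambda prefix: np.zeros(8)
print(sequence_log_prob([3, 1, 4, 1], uniform))       # 4 * log(1/8)
```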

Various sequence models have been developed to describe the conditional probabilities, from the pioneering music generation system CONCERT [?] to the more recent Celtic system [?] and PerformanceRNN [?], which employ RNNs to generate symbolic music. Meanwhile, self-attention based models, e.g. Music Transformer [?] and MuseNet [?], have also been devised to generate long-sequence music.

These models are typically trained with the teacher-forcing strategy [?] to predict the next token. Sampling is performed from the conditional distribution step by step, so sequences of arbitrary length can be generated. However, the sequence models receive no extra input related to music structure or other information, resulting in poor structure of the generated samples. Though self-attention models often outperform LSTM models for generating long-term music [?], whereby the motives can be repeated along the whole sequence, their generated results still suffer from unclear structure. In conclusion, the problem of long-term music generation remains open and far from resolved.
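The step-by-step (ancestral) sampling procedure mentioned above can be sketched as follows; the `next_token_logits` stub, the temperature parameter and the seed are illustrative assumptions rather than details specified by the cited models.

```python
import numpy as np

def sample_sequence(next_token_logits, prime, length, temperature=1.0, seed=0):
    """Generate tokens one step at a time from p(s_t | s_<t)."""
    rng = np.random.default_rng(seed)
    seq = list(prime)
    for _ in range(length):
        logits = np.asarray(next_token_logits(seq), dtype=float) / temperature
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        seq.append(int(rng.choice(len(probs), p=probs)))   # sample the next event token
    return seq

# Continuation-style usage: extend a short prime with 16 more tokens.
uniform = lambda prefix: np.zeros(8)
print(sample_sequence(uniform, prime=[3, 1, 4, 1], length=16))
```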

Self-attention Mechanisms

Transformer [?] based models mostly directly employ stacked self-attention blocks (like GPT and GPT-2 [?]) to generate music auto-regressively.

The attention layer first calculates three matrices, queries ($Q$), keys ($K$) and values ($V$), from the input embedding sequence $X^{i}=(x_{0},x_{1},x_{2},\dots,x_{L})$, where $x_{t}\in\mathcal{R}^{D}$ for time step $t$, $i$ is the index of the transformer block and $L$ is the length of the input sequence: $Q=XW_{Q}$, $K=XW_{K}$ and $V=XW_{V}$. Here $W_{Q}$, $W_{K}$ and $W_{V}$ are $D\times D$ matrices. Then $Q$, $K$ and $V$ are $L\times D$ matrices, each of which is split into $H$ attention heads of size $L\times D_{h}$, where $D_{h}=\frac{D}{H}$. The multi-head mechanism allows the model to focus on different parts of the history. The self-attention is computed as:

$$Z=\mathrm{Attention}(Q,K,V)=\mathrm{Softmax}\left(\frac{QK^{\top}}{\sqrt{D_{h}}}\right)V \tag{1}$$

The subsequent layers transform $Z$ to get each block’s output $X^{i+1}$ (the input to the next block). Initially, $X^{0}=E_{x}+E_{pos}$, where $E_{x}$ is the embedding sequence of input tokens and $E_{pos}$ is the positional encoding. An upper triangular mask ensures that queries cannot attend to keys later in the sequence.
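As a concrete reference, a minimal single-head NumPy sketch of the masked scaled dot-product attention in Eq. 1 is given below (the multi-head variant simply splits $Q$, $K$, $V$ into $H$ slices of width $D_{h}=D/H$; the toy shapes are illustrative).

```python
import numpy as np

def masked_self_attention(X, W_q, W_k, W_v):
    """Single-head causal self-attention (Eq. 1). X has shape (L, D)."""
    L, D = X.shape
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(D)                     # (L, L) attention logits; D_h = D for one head
    mask = np.triu(np.ones((L, L), dtype=bool), k=1)
    scores[mask] = -np.inf                            # queries cannot attend to future keys
    att = np.exp(scores - scores.max(axis=-1, keepdims=True))
    att /= att.sum(axis=-1, keepdims=True)            # row-wise softmax
    return att @ V, att                               # output Z and the attention matrix

# Toy usage: L = 5 tokens, D = 4 dims.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
Z, att = masked_self_attention(X, *(rng.normal(size=(4, 4)) for _ in range(3)))
print(att.round(2))                                   # lower-triangular attention weights
```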

3 The Proposed Approach

3.1 Connecting Self-attention and Structure

Self-attention can be viewed as a generalization of the self-similarity matrix (SSM) [?]. The SSM can reflect the self-similarity structure of a music piece, and has been used for music segmentation and motif discovery [??], and music generation [?]. However, the SSM can only capture repetition patterns. According to composition theory [?], there are more than 10 kinds of techniques, e.g. transposition, compression, expansion and mirroring, for developing a motif into a melody.

We further identify the connections between self-attention and sequence structure. The multi-head self-attention mechanism allows modelling of multi-level dependencies in input sequences [?]. Each position in the decoder attends to all positions in the decoder up to (and including) that position. This is somewhat similar to the composition process of humans, where a prime motif (a short sequence of several notes) is first devised and then developed into a long sequence via repetitions and variations. These repetitions and variations occur at multiple levels (sub-phrases, phrases and sections) and form the multi-level structures. The multi-level dependencies learned by the self-attention mechanism are closely related to these multi-level structures.

Seeing this fundamental connection between self-attention and sequence structure, we argue that the dependencies learned by the self-attention mechanism can be used to guide the generation of new music pieces, and that realistic structures can be preserved during generation by dependency transfer.

3.2 Dependency Transfer via Attention

Now the problem becomes how to transfer the learned dependencies from real music to generation. Specifically, the generation of token $s_{t}$ at step $t$ should attend to similar positions as in the real piece. To achieve this goal, we want the calculation of self-attention to be expressed as:

$$Z=\mathrm{Attention}(Q,K,V)=Att_{M}\cdot V=\mathbf{f}(dep,\mathbf{g}(s))\cdot V \tag{2}$$

where $Z$ is the output hidden state of a self-attention layer, and $Att_{M}$ is the attention matrix at this position. $s$ denotes the past tokens in the sequence, $dep$ denotes the structure-dependency-related variables for transfer, and $\mathbf{f}(\cdot)$, $\mathbf{g}(\cdot)$ are transformations.

We further expect that $dep$ meets two requirements: 1) it contains rich structure-related information; 2) it is independent of the past input sequence so that it can be transferred to different generated sequences.

For the first requirement, as structure is closely related to positions, we make only $dep$ relate to the positions in order to force $dep$ to learn structure-related information; both $V$ and $\mathbf{g}(s)$ should be independent of the positions. To satisfy the second requirement, the computation of $dep$ should not involve the input tokens.

Note that directly transplanting the self-attention matrix $Att_{M}$ from a training piece to the generation phase can hardly work well, because in the original self-attention mechanism the calculation of attention in each layer depends on the input token at each time step (recall the calculation of query, key and value). Different input sequences will thus lead to different attention matrices. In this paper, we show how to implement the self-attention mechanism in Eq. 2 by separating the calculation of query, key and value. A structure embedding is introduced to help the learning of the variable $dep$.

3.3 Design of Transferable Self-attention

In the original self-attention mechanism [?], the computations of query, key and value all depend on the position embedding and the input token embedding. Here, we separate the calculation of query, key and value, and introduce a structure hidden state $h_{d}$ and a note hidden state $h_{x}$. We design two architectures to implement the separable self-attention mechanism in Eq. 2, as shown in Fig. 2. The calculations of the separable self-attention are:

$$\begin{array}{ll} q=(h_{d}+\lambda h_{x})W_{q} & k=(h_{d}+\lambda h_{x})W_{k}\\ v_{d}=h_{d}W_{d} & v_{x}=h_{x}W_{x}\\ a_{d}=Att(q,k)\,v_{d} & a_{x}=Att(q,k)\,v_{x} \end{array} \tag{3}$$

$$\begin{array}{ll} q_{x}=(h_{d}+\lambda h_{x})W_{q_{x}} & k_{x}=(h_{d}+\lambda h_{x})W_{k_{x}}\\ q_{d}=h_{d}W_{q_{d}} & k_{d}=h_{d}W_{k_{d}}\\ v_{d}=h_{d}W_{v_{d}} & v_{x}=h_{x}W_{v_{x}}\\ a_{d}=Att(q_{d},k_{d})\,v_{d} & a_{x}=Att(q_{x},k_{x})\,v_{x} \end{array} \tag{4}$$

where $Att(q,k)=\mathrm{Softmax}\left(\frac{qk^{\top}}{\sqrt{D_{h}}}\right)$.

Note that Eq. 3 and Eq. 4 specify the separable self-attention mechanism. They correspond to the architectures in Fig. 2(b) and (c), respectively (we omit the superscript $l$ for clarity). Different from the original self-attention block, where a single hidden state $h^{l}$ encodes all the information from the input, we introduce separate structure hidden states $h_{d}$ and note hidden states $h_{x}$. The $v_{x}$ and $v_{d}$ at each block are calculated from the input note hidden states and structure hidden states, respectively. In Fig. 2(b) the attention coefficients for $v_{x}$ and $v_{d}$ are the same, and $q$ and $k$ can be related to both the note states and the structure states. In Fig. 2(c) the attention coefficients for $v_{x}$ and $v_{d}$ are different: $q_{d}$ and $k_{d}$ are only related to the structure states, while $q_{x}$ and $k_{x}$ can still be related to both the note states and the structure states.
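A minimal sketch of the shared-coefficient variant (Eq. 3, Fig. 2(b)) is given below. The weight shapes and toy inputs are illustrative assumptions, and how $a_{d}$ and $a_{x}$ are merged and fed into the subsequent feed-forward layers is not shown here.

```python
import numpy as np

def causal_softmax(scores):
    """Row-wise softmax with an upper-triangular (future) mask."""
    L = scores.shape[0]
    scores = scores.copy()
    scores[np.triu(np.ones((L, L), dtype=bool), k=1)] = -np.inf
    att = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return att / att.sum(axis=-1, keepdims=True)

def separable_attention_shared(h_d, h_x, W_q, W_k, W_d, W_x, lam=0.1):
    """Shared-coefficient separable self-attention (Eq. 3, Fig. 2(b)).
    h_d: structure hidden states, h_x: note hidden states, both (L, D)."""
    D = h_d.shape[1]
    mix = h_d + lam * h_x
    q, k = mix @ W_q, mix @ W_k                    # queries/keys see structure + (scaled) note states
    v_d, v_x = h_d @ W_d, h_x @ W_x                # value streams stay separated
    att = causal_softmax(q @ k.T / np.sqrt(D))
    a_d, a_x = att @ v_d, att @ v_x                # the same coefficients weight both value streams
    return a_d, a_x

# Toy usage with L = 6 steps and D = 8 hidden units.
rng = np.random.default_rng(1)
h_d, h_x = rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
Ws = [rng.normal(size=(8, 8)) for _ in range(4)]
a_d, a_x = separable_attention_shared(h_d, h_x, *Ws)
print(a_d.shape, a_x.shape)
```

The variant of Eq. 4 (Fig. 2(c)) differs only in computing a second attention matrix from $q_{d}$, $k_{d}$ that depends on the structure states alone.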

At the first block, the original music transformer architecture [?] takes the sum of the note embedding $E_{note}$ and the position embedding $E_{pos}$ as input, i.e. $h^{0}=E_{note}+E_{pos}$. We introduce a structure embedding $E_{d}$:

$$h^{0}_{d}=E_{pos}+E_{d},\qquad h^{0}_{x}=E_{note}. \tag{5}$$

Fig. 2(b) and (c) are two exemplar implementations of Eq. 2; other architectures can also be designed to implement it.

Figure 2: (a) The original self-attention block and (b, c) the proposed separable self-attention blocks. $h^{l}_{d}$ denotes the structure hidden state of the $l^{th}$ block and $h^{l}_{x}$ denotes the note hidden state of this block. The computation processes of (b) and (c) correspond to Eq. 3 and Eq. 4, respectively (the superscript $l$ is omitted for clarity).

3.4 Structure Embedding for Transfer

The structure embedding in Eq. 5 is defined as $E_{d}=\{e_{d_{t}}\,|\,t=0,1,2,\dots,T\}$, where $e_{d_{t}}$ is a trainable vector denoting the embedding at the $t^{th}$ time step for a training piece. It is computed as

$$e_{d_{t}}=w_{t}\cdot e_{d} \tag{6}$$

where $w_{t}$ is the transform vector of size $(n_{state},)$ at the $t^{th}$ time step, and $e_{d}$ is the structure representation of a training piece. Each music piece in the original training set is randomly assigned a structure representation of size $(n_{state},)$. $n_{state}$ is the length of the hidden state in the transformer architecture.

We augment the music pieces in the original training set using the pitch transposition technique [?], i.e. transposing all pitches by $\{-6,-5,\dots,5,6\}$ steps. This augmentation changes the tonality of the original music, but not its structure. Thus the augmented versions share the same structure embedding $e_{d}$ as the original ones. The $e_{d}$ for each training piece is learned during training, and the learned $e_{d}$ can then be transferred to the generation stage.
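The following sketch illustrates Eq. 5 and Eq. 6 under one plausible reading: since $w_{t}$ and $e_{d}$ are both of size $(n_{state},)$, the product $w_{t}\cdot e_{d}$ is taken element-wise here, and all transposed copies of a piece look up the same $e_{d}$. The class and variable names are ours, for illustration only.

```python
import numpy as np

class StructureEmbedding:
    """Learnable per-piece structure code e_d, expanded to per-step
    embeddings e_{d_t} = w_t * e_d (Eq. 6), read here as an element-wise
    product; all augmented (transposed) versions of a piece reuse the
    same e_d, so the code captures structure rather than tonality."""

    def __init__(self, num_pieces, max_steps, n_state, seed=0):
        rng = np.random.default_rng(seed)
        self.e_d = rng.normal(scale=0.02, size=(num_pieces, n_state))   # one code per training piece
        self.w = rng.normal(scale=0.02, size=(max_steps, n_state))      # shared per-step transform

    def __call__(self, piece_id, T):
        return self.w[:T] * self.e_d[piece_id]          # (T, n_state) structure embedding E_d

def block_inputs(E_note, E_pos, E_d):
    """Initial hidden states of Eq. 5: h_d^0 = E_pos + E_d, h_x^0 = E_note."""
    return E_pos + E_d, E_note

# Toy usage: piece 7 and its transposed copies all share e_d[7].
emb = StructureEmbedding(num_pieces=100, max_steps=2400, n_state=256)
E_d = emb(piece_id=7, T=64)
h_d0, h_x0 = block_inputs(np.zeros((64, 256)), np.zeros((64, 256)), E_d)
print(h_d0.shape, h_x0.shape)
```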

3.5 Advantages to Conditional Transformer

One may argue that a conditional music transformer (CMT) [?] could also work for the structure transfer. In this setting, the structure embedding is used as a control code that is prepended to the music sequence embeddings, providing control over the generation process. However, the control embedding can only affect the global dependencies via its keys at that single position. Therefore, it has difficulty controlling the subtle dependencies among all the tokens along the sequence and thus leads to unstable structure transfer. We implement such a conditional transformer for comparison in our experiments.
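For contrast, a CMT-style conditioning step can be sketched as prepending the structure code as a single extra position, roughly as below (a simplified reading of control-code conditioning, not the exact CMT implementation):

```python
import numpy as np

def prepend_control_code(E_note, e_d):
    """CMT-style conditioning: the structure code occupies one extra position
    in front of the token embeddings and can only influence the rest of the
    sequence through attention to that single position."""
    return np.concatenate([e_d[None, :], E_note], axis=0)   # (L + 1, D)

conditioned = prepend_control_code(np.zeros((64, 256)), np.ones(256))
print(conditioned.shape)   # (65, 256)
```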

3.6 Data Representation for Music Generation

We adopt the frame-based event representation scheme of [?], in which time is quantized with an uneven subdivision. Unlike other works that quantize uniformly with the sixteenth note, [?] designs an uneven subdivision scheme where each beat is divided into 6 uneven ticks. This scheme allows note sequences including triplets to be represented efficiently. In addition, real note names are used so that readable sheet music can be generated.

We convert the sheet music into note token sequences using the above scheme. We introduce two additional tokens, the tonic and the mode, to represent the key signature of the sheet music. These two tokens are placed in front of the note sequence. An example of our representation is shown in Fig. 3.
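A hypothetical encoding of a one-bar melody under such a frame-based scheme is sketched below, assuming tick boundaries at {0, 1/4, 1/3, 1/2, 2/3, 3/4} of a beat and illustrative 'hold'/'rest' filler tokens; the exact vocabulary follows the referenced scheme and may differ.

```python
TICKS_PER_BEAT = [0, 1/4, 1/3, 1/2, 2/3, 3/4]   # uneven ticks: sixteenths and triplets both align

def encode_bar(notes, tonic="C", mode="major"):
    """notes: list of (pitch_name, start_beat, duration_beats) within a 4/4 bar.
    Returns [tonic, mode, frame_0, ..., frame_23]: one token per tick, with
    real note names at onsets and hypothetical 'hold'/'rest' filler tokens."""
    grid = [b + t for b in range(4) for t in TICKS_PER_BEAT]     # 24 tick positions in the bar
    frames = ["rest"] * len(grid)
    for pitch, start, dur in notes:
        onset = next(i for i, g in enumerate(grid) if g >= start - 1e-9)
        end = next((i for i, g in enumerate(grid) if g >= start + dur - 1e-9), len(grid))
        frames[onset] = pitch                                    # note name at the onset tick
        for i in range(onset + 1, end):
            frames[i] = "hold"                                   # sustain the note across ticks
    return [tonic, mode] + frames

# One-bar example: two quarter notes and a half note in C major.
print(encode_bar([("C4", 0, 1), ("E4", 1, 1), ("G4", 2, 2)]))
```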

Figure 3: Representation of a sheet music segment (reproduced from [?]). Left: a one-bar melody. Right: uneven quantization bins. Bottom: representation sequence of this one-bar melody. The two strings in blue are the tonic and mode respectively.

4 Experiments

4.1 Datasets and Metrics

We test our proposed Melody Structure Transfer Network (MSTN) on two public datasets: the Session dataset [?] and the Wikifonia dataset [?]. The Session dataset consists of monophonic folk melodies in the Scottish and Irish style taken from the Session website [?]. The original Session dataset consists of more than 48K melody pieces in ABC notation format [?]. In this paper, we take the same subset of the Session dataset as [?]: only melodies with a 4/4 time signature are considered, and pieces containing notes shorter than a sixteenth note are dropped. This results in approximately 21K melodies. The Wikifonia dataset [?] contains more than 6500 lead sheets in MusicXML format [?]. We also select a subset from these pieces according to the above criteria, which yields 3500 sheets. The chord tracks are dropped and only the melodies are kept. Both datasets are split into 90% training and 10% validation. The training pieces are augmented with the pitch transposition strategy. All pieces are encoded into the sequence representation as stated in Sec. 3.6.

To demonstrate that MSTNs can transfer the templates’ structure to the generated pieces, rather than simply memorizing the training templates, we adopt several structure related metrics [?]. We also devise several metrics to evaluate structure similarity and sample diversity. The metrics are as follows:

1) Repeat Count (RC), Repeat Duration (RD) and Repeat Onset (RO): the number of repeats corresponding to various lookback values, durations, and onsets. These three kinds of metrics are proposed in [?], where only short music fragments of 16 bars are considered. We extend these measures to capture repeat patterns over a longer term by extending the lookback values (see Fig. 4). These three metrics are calculated for duration repeats (-D) and duration-interval repeats (-DI) separately, leading to 6 metrics; see computing details in [?].

2) Pitch Distribution (PD) and Duration Distribution (DD): the statistics of the various pitches and durations. These two metrics are popular in music generation tasks for comparing the pitch and duration distribution similarity between generated music sets and training sets [?].

3) Rhythm Structure Similarity (RSS) and Interval Structure Similarity (ISS): here we adopt the self-similarity matrix (SSM), which is commonly used for music structure analysis [?]. Although the SSM is often calculated from music audio, some studies [?] apply it to sheet music. In this paper we compute the SSM for rhythm and intervals separately at the bar level. For each note in a bar, we can calculate its duration in beats and its interval from the previous note. We represent the rhythm pattern of a bar as a list of (start time, duration) tuples, and the interval pattern of a bar as a list of intervals:

$$\begin{array}{l} Rhy(j)=[(st_{1},d_{1}),(st_{2},d_{2}),\dots,(st_{m},d_{m})]\\ Intv(j)=[iv_{1},iv_{2},\dots,iv_{m}] \end{array} \tag{7}$$

where $m$ is the number of notes in the bar, $st_{l}$ denotes the start beat position of the $l^{th}$ note in the bar, and $d_{l}$ denotes its duration. $iv_{l}$ denotes the staff interval from the $(l-1)^{th}$ note to the $l^{th}$ note. If there is no previous note or the interval is inapplicable (e.g. one of the notes is a rest), the interval value is assigned as $null$.

With the rhythm and interval representation for each bar, we can calculate the rhythm and interval similarity between any two bars, and then obtain the SSM for any given music piece. We compute the rhythm similarity between two bars as the ratio of the duration of matched tuples to the total duration of a bar, and the interval similarity as the ratio of the length of the longest matching interval string shared by the two bars to the length of the longer bar:

$$\begin{array}{l} SSM_{Rhy}(i,j)=\frac{\sum_{l\in ms}d_{l}}{Dur},\quad ms=\{l\,|\,Rhy(i)(l)\in Rhy(j)\}\\[4pt] SSM_{Intv}(i,j)=\frac{\mathrm{maxmatchlength}(Intv(i),Intv(j))}{\mathrm{maxlength}(Intv(i),Intv(j))} \end{array} \tag{8}$$

where $i,j$ are the indexes of two bars and $Dur$ is the total duration of a bar; for music pieces in 4/4 time, $Dur$ is 4 beats. We thus obtain an SSM of size $L\times L$ for a music piece of $L$ bars. The SSM is symmetric, as the similarity between bars $i$ and $j$ equals that of its counterpart.

After computing the SSM for each music piece, we can evaluate the structure similarity between any two music pieces by comparing their rhythm and interval SSMs. The Rhythm and Interval Structure Similarity are calculated as the root mean square error between the corresponding SSMs.
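A sketch of the bar-level SSM and the resulting RSS/ISS computation is given below; the "longest matching string" of intervals is read here as the longest common contiguous run, and the toy bars are purely illustrative.

```python
import numpy as np

def rhythm_sim(rhy_i, rhy_j, bar_dur=4.0):
    """Eq. 8, rhythm: total duration of (start, duration) tuples of bar i
    that also appear in bar j, normalized by the bar duration."""
    return sum(d for (st, d) in rhy_i if (st, d) in rhy_j) / bar_dur

def interval_sim(iv_i, iv_j):
    """Eq. 8, interval: longest common contiguous run of intervals,
    normalized by the longer bar (our reading of 'max matching string')."""
    best = 0
    for a in range(len(iv_i)):
        for b in range(len(iv_j)):
            k = 0
            while (a + k < len(iv_i) and b + k < len(iv_j)
                   and iv_i[a + k] == iv_j[b + k] and iv_i[a + k] is not None):
                k += 1
            best = max(best, k)
    return best / max(len(iv_i), len(iv_j), 1)

def ssm(bars, sim):
    """Bar-level self-similarity matrix of one piece."""
    L = len(bars)
    return np.array([[sim(bars[i], bars[j]) for j in range(L)] for i in range(L)])

def structure_similarity(ssm_a, ssm_b):
    """RSS / ISS: root mean square error between two pieces' SSMs."""
    return float(np.sqrt(np.mean((ssm_a - ssm_b) ** 2)))

# Toy usage: a 4-bar piece where bar 3 repeats bar 1's rhythm.
rhythms = [[(0, 1), (1, 1), (2, 2)], [(0, 2), (2, 2)], [(0, 1), (1, 1), (2, 2)], [(0, 4)]]
S = ssm(rhythms, rhythm_sim)
print(S.round(2))
print(structure_similarity(S, np.eye(4)))
```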

4) Rhythm Duplicate Rate (RDR) and Interval Duplicate Rate (IDR): it is necessary to check to what extent the model generates new samples, rather than simply memorizing the training samples. It is also important to evaluate the diversity among samples generated from a given template structure. Here we propose the rhythm duplicate rate and the interval duplicate rate to evaluate the diversity between the template and its generated samples ($RDR_{TAB}$ and $IDR_{TAB}$), as well as the diversity between samples generated from the same template ($RDR_{AB}$ and $IDR_{AB}$). These duplicate rates are computed as the ratio of the number of bars with the same rhythm or intervals between two samples to the number of bars in the samples.
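One plausible reading of this duplicate rate, comparing bars position by position, is sketched below; the bar lists are assumed to come from the rhythm/interval extraction described above, and the toy values are illustrative.

```python
def duplicate_rate(bars_a, bars_b):
    """RDR / IDR: share of bar positions at which two pieces have the same
    rhythm (or interval) pattern; our reading of the bar-wise comparison."""
    n = max(len(bars_a), len(bars_b), 1)
    same = sum(1 for x, y in zip(bars_a, bars_b) if x == y)
    return same / n

# Toy usage with bar-level rhythm lists of two samples.
bars_a = [[(0, 1), (1, 1), (2, 2)], [(0, 2), (2, 2)], [(0, 4)]]
bars_b = [[(0, 1), (1, 1), (2, 2)], [(0, 4)], [(0, 4)]]
print(duplicate_rate(bars_a, bars_b))   # 2 of 3 bars coincide -> 0.67
```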

WikiFonia:
Model           RC-D    RD-D    RO-D    RC-DI   RD-DI   RO-DI   PD      DD
MT [?]          0.004   0.091   0.014   0.029   0.077   0.064   0.790   0.039
CMT [?]         0.015   0.017   0.012   0.006   0.009   0.053   0.569   0.009
MSTN-C (ours)   0.022   0.015   0.015   0.033   0.013   0.054   0.481   0.003
MSTN-U (ours)   0.008   0.016   0.010   0.021   0.014   0.048   0.707   0.009

The Session:
Model           RC-D    RD-D    RO-D    RC-DI   RD-DI   RO-DI   PD      DD
MT [?]          0.073   0.040   0.038   0.054   0.076   0.048   0.916   0.006
CMT [?]         0.008   0.011   0.009   0.009   0.026   0.028   0.943   0.006
MSTN-C (ours)   0.002   0.009   0.015   0.001   0.031   0.044   0.880   0.003
MSTN-U (ours)   0.001   0.010   0.034   0.003   0.035   0.112   0.738   0.008
Table 1: KL-divergences between the training data and the generated melodies on the Wikifonia and Session datasets.
Free Composition Mode:
                WikiFonia                  The Session
Metric          MSTN-C  MSTN-U  CMT        MSTN-C  MSTN-U  CMT
RSS             0.395   0.335   0.549      0.422   0.491   0.569
ISS             0.371   0.309   0.482      0.203   0.229   0.313
RDR_TAB         0.295   0.241   0.050      0.272   0.194   0.141
IDR_TAB         0.065   0.040   0.005      0.013   0.001   0.001
RDR_AB          0.263   0.223   0.224      0.376   0.240   0.282
IDR_AB          0.040   0.030   0.060      0.004   0.005   0.007

Continuation Mode:
                WikiFonia                  The Session
Metric          MSTN-C  MSTN-U  CMT        MSTN-C  MSTN-U  CMT
RSS             0.441   0.442   0.558      0.551   0.537   0.580
ISS             0.374   0.375   0.472      0.245   0.255   0.322
RDR_TAB         0.238   0.242   0.070      0.005   0.008   0.013
IDR_TAB         0.045   0.032   0.005      0.000   0.000   0.000
RDR_AB          0.425   0.444   0.245      0.092   0.075   0.060
IDR_AB          0.109   0.094   0.065      0.039   0.039   0.041
Table 2: Structure similarities and duplicate rates.

4.2 Evaluation Protocol

We evaluate the two versions of MSTN (i.e. MSTN-C and MSTN-U, corresponding to Fig. 2(b) and (c), respectively). We also evaluate the implemented conditional music transformer (CMT), and a baseline music transformer (MT) is implemented and evaluated on the applicable metrics for reference. All models consist of 7 layers of self-attention blocks with 8 heads and 256 hidden states. The learning rate is 2e-5 with 5 epochs of warm-up. The maximum length is set to 100 bars (2400 time steps). The models are trained for 100 epochs on Wikifonia and 50 epochs on the Session dataset. The parameter $\lambda$ in Eq. 3 and Eq. 4 is set to 0.1 after coarsely searching over several values ranging from 0.001 to 1.

For evaluation, we generate two samples for each given template, which enables computing the duplicate rates between samples generated from the same template. The structure similarities between samples and their templates are calculated separately on the two samples and then averaged. Each music piece in the training set is taken as a template. We generate samples in two modes: 1) free composition mode: the trained models generate samples freely, without any given prime; 2) continuation mode: the trained models generate subsequent sequences from a given one-bar motif. The metrics are computed on each set of generated samples, and then averaged over the whole dataset.

Figure 4: Repeat-related statistics, pitch and duration distributions. (a) Wikifonia dataset; (b) Session dataset.
Figure 5: Melodies and their SSMs of (a) template melody, (b) melody generated with baseline music transformer in continuation mode, (c) using MSTN-U in free composition mode, (d) using MSTN-U in continuation mode, (e) with CMT in free composition mode, (f) with CMT in continuation mode.

4.3 Results and Discussion

The statistics of the repeat-related metrics, as well as the pitch and duration distributions, are given in Fig. 4. The KL-divergences on these distributions between the dataset and the generated set of each model are shown in Table 1. The calculation of these KL-divergences is the same as in [?]. To better understand the performance, we also present the evaluation statistics of the baseline music transformer (MT). Table 1 shows that the samples generated by CMT and the proposed MSTNs are closer to the dataset than those of the baseline MT. This is reasonable because CMT and MSTNs have more constraints than the baseline MT during training. These statistics do not differ much between the samples generated by CMT and MSTNs.
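The exact binning and smoothing follow [?]; as a generic illustration, the KL divergence between two normalized histograms (e.g. pitch or duration counts) can be computed as below, where the additive smoothing constant is our assumption.

```python
import numpy as np

def kl_divergence(counts_p, counts_q, eps=1e-8):
    """KL(P || Q) between two histograms (e.g. pitch or duration counts
    from the training set vs. a generated set), with additive smoothing."""
    p = np.asarray(counts_p, dtype=float) + eps
    q = np.asarray(counts_q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

# Toy usage: pitch histograms over a 12-bin pitch-class vocabulary.
train_hist = [120, 30, 80, 15, 95, 70, 10, 110, 20, 60, 8, 40]
gen_hist   = [100, 25, 90, 20, 80, 75, 15, 100, 25, 55, 10, 45]
print(round(kl_divergence(train_hist, gen_hist), 4))
```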

Table 2 shows the statistics (the lower the better) on structure similarities (RSS, ISS) and duplicate rates (RDR, IDR). The samples generated by MSTNs and CMT in the free composition mode and the continuation mode are evaluated. In both modes, MSTN-C and MSTN-U generate samples whose rhythm and interval structures are more similar to the template melodies than those of CMT. This means that the proposed MSTNs can transfer the structure from the template melodies to the generated pieces well, while CMT fails to do so.

As the CMT samples have a weak relationship with the templates, their rhythm and interval duplicate rates with the templates are also lower than those of the MSTN samples. As for the duplicate rates between samples, CMT in free composition mode does not perform better than MSTNs, which means that in this mode the diversity of the proposed models is at least as good as CMT's. In continuation mode, it is also reasonable that MSTNs' samples are less diverse than CMT's, because the structure constraints of MSTNs are tighter. Given the same motif, MSTNs will try to develop the motif according to the composition techniques of the template melody, which naturally leads to similar generated samples. As for CMT, the model does not learn the structure embedding well, so it does not develop the motif according to the template's composition techniques, and thus produces more diverse samples. Hence, the lower duplicate rates of CMT samples in the continuation mode also indicate that CMT cannot transfer the structure well. Fig. 5 shows the melodies and their SSMs; the SSMs in this figure are the sum of the rhythm SSM and the interval SSM. Fig. 5(a) shows a template melody, (c) and (e) are melodies generated by MSTN-U and CMT in free composition mode, respectively, and (d) and (f) are melodies generated by these two models in continuation mode, where the first bar is the given motif. The SSMs of the MSTN-U melodies are clearly much more similar to the template's SSM, while the SSMs of the CMT melodies differ considerably from it. We can see from (d) that the given motif is developed into a melody with composition techniques very similar to the template's, where bars 9-16 are similar to bars 1-8, bars 21-24 are similar to bars 17-20, and bars 25-32 are similar to bars 9-16.

These results verify that the proposed MSTNs can transfer the structure of template melodies to generate new samples. The generated samples show good diversity in the free composition mode, and the diversity drops when the samples are generated from the same prime. MSTNs perform much better than the CMT. We upload 3000 sets of generated results in MusicXML format for the Wikifonia dataset as supplemental material. These pieces are generated with the MSTN-U model in continuation mode, where the first bar is the given motif. Each folder is named after the corresponding template in the Wikifonia dataset, and for each template two samples are generated.

5 Conclusion

We have proposed to transfer the structure of training samples for new, long music generation via a tailored self-attention mechanism. We also devise four quantitative metrics according to music theory; combined with eight existing metrics, they are used to evaluate our melody structure transfer model, with promising results. In future work, we will explore structure transfer methods for polyphonic music generation.

References

  • [1] Juan P Bello. Measuring structural similarity in music. IEEE Transactions on Audio, Speech, and Language Processing, 19(7):2013–2025, 2011.
  • [2] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 1171–1179, 2015.
  • [3] Jean-Pierre Briot, Gaëtan Hadjeres, and François-David Pachet. Deep learning techniques for music generation–a survey. arXiv preprint arXiv:1709.01620, 2017.
  • [4] Jean-Pierre Briot and François Pachet. Deep learning for music generation: challenges and directions. Neural Computing and Applications, pages 1–13, 2018.
  • [5] David Cope. The algorithmic composer, volume 16. AR Editions, Inc., 2000.
  • [6] Hao-Wen Dong, Wen-Yi Hsiao, Li-Chia Yang, and Yi-Hsuan Yang. Musegan: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
  • [7] Hao-Wen Dong and Yi-Hsuan Yang. Convolutional generative adversarial networks with binary neurons for polyphonic music generation. arXiv preprint arXiv:1804.09399, 2018.
  • [8] Kemal Ebcioğlu. An expert system for harmonizing four-part chorales. Computer Music Journal, 12(3):43–51, 1988.
  • [9] Michael Good et al. Musicxml: An internet-friendly format for sheet music. In Xml conference and expo, pages 03–04, 2001.
  • [10] http://www.wikifonia.org/. Wikifonia, 2010.
  • [11] Cheng-Zhi Anna Huang, Tim Cooijmans, Adam Roberts, Aaron Courville, and Douglas Eck. Counterpoint by convolution. In International Society for Music Information Retrieval (ISMIR), 2017.
  • [12] Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Ian Simon, Curtis Hawthorne, Noam Shazeer, Andrew M Dai, Matthew D Hoffman, Monica Dinculescu, and Douglas Eck. Music transformer: Generating music with long-term structure. 2018.
  • [13] Yu-Siang Huang and Yi-Hsuan Yang. Pop music transformer: Generating music with rhythm and harmony. arXiv preprint arXiv:2002.00212, 2020.
  • [14] Natasha Jaques, Shixiang Gu, Richard E Turner, and Douglas Eck. Tuning recurrent neural networks with reinforcement learning. 2017.
  • [15] Harsh Jhamtani and Taylor Berg-Kirkpatrick. Modeling self-repetition in music generation using generative adversarial networks. In Machine Learning for Music Discovery Workshop, ICML, 2019.
  • [16] Sanghoon Jun, Seungmin Rho, and Eenjun Hwang. Music structure analysis using self-similarity matrix and two-stage categorization. Multimedia Tools and Applications, 74(1):287–302, 2015.
  • [17] Nitish Shirish Keskar, Bryan McCann, Lav R Varshney, Caiming Xiong, and Richard Socher. Ctrl: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858, 2019.
  • [18] Stefan Lattner, Maarten Grachten, Gerhard Widmer, et al. Imposing higher-level structure in polyphonic music generation using convolutional restricted boltzmann machines and constraints. Journal of Creative Music Systems, 2:1, 2018.
  • [19] Alex McLean and Roger T Dean. The Oxford handbook of algorithmic music. Oxford University Press, 2018.
  • [20] Gabriele Medeot, Srikanth Cherla, Katerina Kosta, Matt McVicar, Samer Abdallah, Marco Selvi, Ed Newton-Rex, and Kevin Webster. Structurenet: Inducing structure in generated melodies. In ISMIR, pages 725–731, 2018.
  • [21] Michael C Mozer. Neural network music composition by prediction: Exploring the benefits of psychoacoustic constraints and multi-scale processing. Connection Science, 6(2-3):247–280, 1994.
  • [22] Gerhard Nierhaus. Algorithmic composition: paradigms of automated music generation. Springer Science & Business Media, 2009.
  • [23] Sageev Oore, Ian Simon, Sander Dieleman, Douglas Eck, and Karen Simonyan. This time with feeling: Learning expressive musical performance. Neural Computing and Applications, pages 1–13, 2018.
  • [24] Ashis Pati, Alexander Lerch, and Gaëtan Hadjeres. Learning to traverse latent spaces for musical score inpainting. In Proc. of the 20th International Society for Music Information Retrieval Conference (ISMIR), Delft, The Netherlands, 2019.
  • [25] Christine Payne. MuseNet. 2019.
  • [26] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019.
  • [27] Adam Roberts, Jesse Engel, Colin Raffel, Curtis Hawthorne, and Douglas Eck. A hierarchical latent vector model for learning long-term structure in music. In International Conference on Machine Learning, pages 4364–4373, 2018.
  • [28] Ian Simon and Sageev Oore. Performance rnn: Generating music with expressive timing and dynamics. In JMLR: Workshop and Conference Proceedings, volume 80, page 116, 2017.
  • [29] Bob Sturm, João Felipe Santos, Oded Ben-Tal, and Iryna Korshunova. Music transcription modelling and composition using deep learning. In 1st Conference on Computer Simulation of Musical Creativity, 2016.
  • [30] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in neural information processing systems, pages 5998–6008, 2017.
  • [31] Chris Walshaw. The ABC music standard 2.1. URL: http://abcnotation.com/wiki/abc:standard:v2.1, 2011.
  • [32] I. Xenakis. Formalized Music: Thought and Mathematics in Composition (Sharon Kanach, compilation and edition), 1963.