
MELONS: Generating Melody with Long-Term Structure Using Transformers and Structure Graph

Abstract

The creation of long melody sequences requires effective expression of coherent musical structure. However, there is no clear representation of musical structure. Recent works on music generation have suggested various approaches to deal with the structural information of music, but generating a full-song melody with clear long-term structure remains a challenge. In this paper, we propose MELONS, a melody generation framework based on a graph representation of music structure which consists of eight types of bar-level relations. MELONS adopts a multi-step generation method with transformer-based networks by factoring melody generation into two sub-problems: structure generation and structure conditional melody generation. Experimental results show that MELONS can produce structured melodies with high quality and rich contents.

Index Terms—  Music structure, melody generation, structure graph, transformer

1 Introduction

Music composition requires not only inspiration but also adherence to the rules of music theory. The restriction of musical structure is one of the most important rules for complete melody creation [1]. The structure of a complete melody piece is related to the musical form, which shapes the presentation of the melody along the time axis, for example, the repetition of motives, the transition between phrases, and the development between sections [2]. In melody composition, the restriction of structure makes the melodic line cohesive and comprehensible. While aesthetically appealing short melody pieces can be generated with recent deep learning based technologies [3, 4], creating long melodies with reasonable structure remains a challenge [5, 6], which forms the motivation of this work.

One of the obstacles in generating structured long music is that there is no clear approach to describe structure technically. Some works regard the structure of music as simple repetitive patterns [7, 8]. However, structure should reflect the dependencies among note sequences at different time scales, and such dependencies include not only repetitions but also flexible and versatile developments. Some researchers consider structure a latent component of music and try to use various neural networks to discover it from music data automatically [9, 10, 11], but the generated music samples do not show the reasonable and clear organization of structure that exists in human-created music. Many works use conditional information such as chord progressions [12, 13, 2] and section tags [2, 14] to control the structure of the generated music, but this requires parallel annotated data, which is not easy to collect.

There has been a long history of dividing the effort of generating structured music into different steps. Todd [15] proposed to use two cascaded networks to construct new pieces of music with hierarchical structure representations; the networks learn both well-defined and more obscure aspects of musical structure from musical examples. Later, Hornel [16] proposed the MELONET system, in which the first network learns to recognize musical structures while the second network predicts the notes. PopMNet [17] also employs two networks to generate structured pop melodies: a structure generation network that generates the structural representations, and a melody generation network that generates music sequences based on the structural representation and a chord progression. Different from [15] and [16], PopMNet represents melody structure with a graph, which consists of two typical relations between pairs of bars, known as repetition and sequence, and it utilizes adjacency matrices of the graph to model the structural variations of a long melody. Although PopMNet integrates melody structure into the generation process, the quality of the generated demos is quite limited. This is mainly due to the incomplete enumeration of the relations between bars: repetition and sequence are just two categories among all common relations. Further, the generation networks in PopMNet are based on CNNs and RNNs, which greatly limits the model's ability to generate long time-dependent sequences.

In this paper, we propose a framework named MELONS which can generate full-song melodies with long-term structure. Our contributions can be summarized as follows. First, instead of considering only simple repetition and rhythmic sequence, we construct a graph representation of melody structure with eight types of bar-level relations that appear frequently in pop songs. These relations cover most common melody composition techniques, including cadence and breath. Statistical analysis on multiple datasets from different sources demonstrates the versatility and rationality of the proposed relation types. In addition, we propose a multi-step generation method with transformer-based networks by factoring melody generation into structure generation and structure conditional melody generation. Unlike PopMNet, which is limited to a fixed 32-bar length, and other works which require annotated structural information, our method can generate full-song melodies of flexible length and is annotation free. Subjective evaluations show that our approach can generate harmonious melodies with reasonable structures.

2 Proposed Approach

Fig. 1: Examples of the relations in music. The piece of melody comes from a Chinese song named "Who are you dating tonight." Note that it shows only one example for each type of relation.
Table 1: Description of relation types. The priorities are sorted according to the similarity between the two bars in a relation.
Priority  Relation type  Description
1  Repetition  The current bar is the same as a previous bar.
2  Transposition  The current bar is the tonal transposition of a previous bar.
3  Pitch progression  Same rhythm; the similarity of the pitch sequences of the two bars is not less than 50%.
4  Rhythmic sequence  Same rhythm; the similarity of the pitch sequences of the two bars is less than 50%.
5  Melody progression  The two bars have at least 3 consecutive notes of the same pitch and rhythm.
6  Surround progression  The surround notes (notes repeated more than 3 times in a bar) rise or fall by at least five semitones between the two bars.
7  Harmonious cadence  A certain note in the current bar is a local minimum point of the note density curve (see the appendix), and its pitch belongs to the tonic chord of the music key.
8  Rest  No notes in the current bar.

In this section, we first present a brief study of pairwise relations between bars, and then introduce our proposed framework, MELONS, whose goal is to produce full-song melodies with long-term structure given a melody motif.

2.1 Proposed bar-level relations

We employ the relations between pairwise bars to represent the musical structure. We take four beats as one bar, and each beat equals a quarter note. Based on musical composition theory, we classify the bar-level relations into three categories: repetition, development [18] and cadence [19]. After statistical analysis on more than 2500 pieces of pop music, we further decompose these three categories into eight frequently occurring types of structural relations, which are defined in Table 1. Fig. 1 shows an example of the relations in a music score.

These relations cover most common techniques of pop music composition. They not only include complicated development methods, but can also reflect the progress of emotions (surround progression) and express the harmonious ending or breath of phrases (harmonious cadence). Moreover, these relations can be expressed by mathematical formulas (for more details, please refer to https://yiathena.github.io/MELONS/), which facilitates their automatic detection.

We use a directed graph with multiple types of edges, which we call a structure graph, to describe the melody structure of a song. Bars and the relations between bars are represented as nodes and edges respectively, and the types of the edges stand for the types of the relations. As shown in Table 1, the priority of relations is determined by the similarity between two bars. To avoid redundancies in the structure graph, each bar is matched with its previous bars from near to far to find the relation of the highest priority. Fig. 2 illustrates the structure graph of the melody provided in Fig. 1.
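To make this construction concrete, the following is a minimal Python sketch of the priority-based matching, under our own assumptions: the helper detect_relation(cur_bar, prev_bar) is hypothetical and is assumed to implement the rules of Table 1 (returning the matched relation type or None), ties are broken in favor of the nearest bar, and the Rest relation, which depends only on the current bar, is glossed over.

```python
# Minimal sketch (not the authors' code): build the structure-graph edges by
# matching each bar against its previous bars from near to far and keeping
# only the relation of the highest priority.
from typing import Callable, List, Optional, Tuple

PRIORITY = ["repetition", "transposition", "pitch_progression", "rhythmic_sequence",
            "melody_progression", "surround_progression", "harmonious_cadence", "rest"]

def build_structure_graph(
    bars: List[dict],
    detect_relation: Callable[[dict, dict], Optional[str]],  # hypothetical helper
) -> List[Tuple[int, int, str]]:
    """Return edges (i, j, t): current bar i relates to previous bar j with type t."""
    edges = []
    for i in range(1, len(bars)):
        best = None  # (priority_rank, j, relation_type)
        for j in range(i - 1, -1, -1):          # search previous bars from near to far
            t = detect_relation(bars[i], bars[j])
            if t is None:
                continue
            rank = PRIORITY.index(t)
            if best is None or rank < best[0]:  # keep only the highest-priority match
                best = (rank, j, t)
            if rank == 0:                       # an exact repetition cannot be beaten
                break
        if best is not None:
            edges.append((i, best[1], best[2]))
    return edges
```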

Fig. 2: The structure graph of the melody in Fig. 1. The color of each line represents the type of the relation. Note that it shows all of the relations finally kept in the structure graph.

2.2 Framework of MELONS

Fig. 3: Proposed framework of MELONS. (a) Framework of training. (b) Framework of generation.

The training and generation procedures of MELONS are shown in Fig. 3. We use the structure graph to bridge the gap between local pairwise relations and global music structure, and propose a multi-step melody generation framework to compose a full-song melody based on an 8-bar motif (which is the common length of a phrase).

Structure generation. We extract structure graphs from the original songs to construct the dataset for structure generation. Inspired by GraphRNN [20], we model the structure graph as a sequence of relations and generate the relations of a song with an auto-regressive transformer model. We use a token triple (i, j, t) to represent the edge (relation) from a previous bar j to the current bar i with relation type t. By arranging all edges of a structure graph in ascending order of the index of the current bar, we obtain a list of token triples {(i_1, j_1, t_1), (i_2, j_2, t_2), ..., (i_n, j_n, t_n)}, which is a sequence form of the structure graph. We adopt the tokenization method proposed in [21] and predict the three tokens of an edge at one time step. As MELONS deals with full-song melody generation, we append an EOS relation at the end of each structure graph to mark the end of the song. At inference time, edges in the structure graph are generated autoregressively until the EOS is predicted, so the generated structure is not limited to a fixed length. Moreover, the subsequent relations are predicted conditioned on the music structure generated so far, leading to better cohesiveness.
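As an illustration of this serialization, here is a short sketch under our own token names (not the exact vocabulary used by the CP-based tokenizer of [21]):

```python
# Illustrative serialization of a structure graph into a sequence of
# (current_bar, related_bar, relation_type) triples, ended by an EOS marker.
EOS = ("EOS", "EOS", "EOS")  # placeholder end-of-song triple

def graph_to_sequence(edges):
    """edges: list of (i, j, t); sort by the index of the current bar, append EOS."""
    return sorted(edges, key=lambda e: e[0]) + [EOS]

def sequence_to_graph(triples):
    """Inverse mapping used at inference time: stop once EOS is predicted."""
    edges = []
    for tri in triples:
        if tri == EOS:
            break
        edges.append(tri)
    return edges

# One decoding step of the structure model predicts the three tokens of one
# triple jointly, so each step adds one edge to the graph.
print(graph_to_sequence([(2, 0, "repetition"), (3, 1, "rhythmic_sequence")]))
```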

Melody generation. The melody generation part consists of an unconditional music generation module and a conditional generation module. We adopt the event-based music token representation broadly applied for music generation [10, 22, 21, 2] in both modules. The unconditional module is an auto-regressive Transformer model [21] trained on the original melodies. The conditional generation module adopts a similar architecture, but the training data is reorganized to build a conditional scenario, like CTRL [23] for conditional text generation and MuseNet [24] for music. For each relation (i, j, t) in a structure graph, we treat the melody of the i-th bar as the target and train the model to generate the i-th bar conditioned on the related j-th bar and the relation type t. The final sequence has the form {[start of the melody context], (content of the melody context), [start of the related bar], (content of the related bar), [relation type], [start of the target bar], (content of the target bar)}. The melody context here is formed by the 8 bars prior to the i-th bar. Tokens before the [start of the target bar] token are given at inference time. The three modules mentioned above are combined together in the generation procedure. The detailed generation algorithm is illustrated in Algorithm 1.
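For concreteness, the following is a minimal sketch of how one conditional training example could be assembled from a relation (i, j, t). The bracketed token names are placeholders standing in for the paper's special tokens, and each bar is assumed to be already encoded as a list of event tokens.

```python
# Sketch of the conditional training-sequence layout described above.
# Special-token names and the per-bar event encodings are placeholders.
def build_conditional_example(bar_tokens, i, j, rel_type, context_len=8):
    """bar_tokens[k] is the event-token list of the k-th bar of a song."""
    context_bars = bar_tokens[max(0, i - context_len):i]        # 8 bars before bar i
    context = [tok for bar in context_bars for tok in bar]
    return (["[start_of_melody_context]"] + context
            + ["[start_of_related_bar]"] + list(bar_tokens[j])
            + [f"[relation:{rel_type}]"]
            + ["[start_of_target_bar]"] + list(bar_tokens[i]))  # target = bar i

# At inference, everything up to and including "[start_of_target_bar]" is the
# prompt, and the target-bar tokens are generated autoregressively.
```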

Algorithm 1 Generation procedure
Input: melody motif M = {b_1, b_2, ..., b_8}
Output: full-song melody S = {b_1, b_2, ..., b_n}
  S ← M
  Extract the structure of M into relation list R
  repeat
     Predict the next relation autoregressively and append it to R
  until the EOS relation is predicted
  l ← index of the last bar in R
  for i ← 9 to l do
     Check the relation r = (i, j, t) of the i-th bar in R
     if r does not exist then
        Generate the i-th bar b_i autoregressively with the unconditional generation module
     else if relation type t is repetition then
        b_i ← b_j
     else
        Get melody context C = {b_{i-8}, b_{i-7}, ..., b_{i-1}} from S
        Generate b_i with the conditional generation module conditioned on C, b_j and the relation type t
     end if
     Append b_i to S
  end for
  return S

3 Experimental Configurations

3.1 Dataset

The datasets used in this work are POP909 [25] and Wikifonia (http://www.wikifonia.org), which are public datasets of Chinese and Western pop songs, respectively. We keep melodies in 4/4 time signature as our training data and select 901 musical scores from POP909 and 1600 scores from Wikifonia covering different pop music styles. We split the data into training and testing sets with a 9:1 ratio.

3.2 Implementation details

In MELONS, we choose the linear Transformer [26] as the backbone architecture for sequence modeling, considering its linear complexity in attention calculation. The structure generation module consists of 4 self-attention layers with 4 heads and 256 hidden states, with the inner layer size of the feed-forward part set to 1024. The melody generation modules consist of 6 self-attention layers with 8 heads and 512 hidden states, with the inner layer size set to 2048. We adopt the compound word (CP) method [21] for sequence representation and tokenization. The structure generation module employs a token set including three types of tokens: index of the current bar, index of the related bar and relation type, with embedding sizes of 128, 128 and 32, respectively. The unconditional melody generation module uses a token set (type, beat, tempo, pitch, duration) to serialize melodies, with embedding sizes of (16, 128, 64, 256, 128). In the conditional generation module, we additionally use a track token (embedding size 16) to mark the different sections and a relation type token (embedding size 32) to impose the control. The embedding sizes are chosen based on the vocabulary sizes of the different token types, and the embedded tokens describing the same object are concatenated and linearly projected to the hidden state size of the corresponding module. We use the Adam optimizer [27] with a learning rate of 10^-4 for all these models. We apply dropout with a rate of 0.1 during training.
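For quick reference, the hyperparameters reported above can be gathered into a single configuration sketch; the field names are ours, and only values stated in the text are included.

```python
# Summary of the reported hyperparameters; field names are illustrative.
STRUCTURE_MODEL = dict(
    layers=4, heads=4, hidden=256, ffn=1024,
    embeddings=dict(current_bar_index=128, related_bar_index=128, relation_type=32),
)

MELODY_MODEL = dict(  # shared by the unconditional and conditional modules
    layers=6, heads=8, hidden=512, ffn=2048,
    embeddings=dict(type=16, beat=128, tempo=64, pitch=256, duration=128),
    conditional_extra=dict(track=16, relation_type=32),  # conditional module only
)

TRAINING = dict(optimizer="Adam", learning_rate=1e-4, dropout=0.1)
```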

4 Experimental Evaluation

4.1 Objective evaluations

Table 2: Percentages of relation types in different datasets.
Relation types POP909 Wikifonia Collected MELONS
Repetition 44% 47% 50% 50%
Transposition 1% 3% 2% 1%
Pitch progression 3% 4% 3% 4%
Rhythmic sequence 6% 8% 6% 11%
Melody progression 11% 3% 6% 6%
Surround progression 5% 2% 3% 2%
Harmonious cadence 6% 5% 5% 5%
Rest 6% 8% 7% 4%
In total 81% 81% 82% 83%

In order to verify the versatility of the relations defined in Table 1, we additionally collect 1160 pop songs of different styles and regions from the Internet, and also collect thousands of melodies generated by MELONS. We automatically extract all relations in each dataset and then perform statistical analysis on them. The proportion is defined as the ratio of the number of bars with a certain relation to the total number of bars. Table 2 shows the statistical results of the proportions. We can see that the distributions of the relations across the four datasets are roughly close (correlation coefficients > 0.98 for all pairs). This demonstrates the universality and stability of our proposed structure representation.
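The proportion and correlation figures above can be reproduced with a few lines, assuming per-dataset relation counts and bar totals are available; this is a sketch of the computation, not the exact evaluation script.

```python
# Illustrative computation of relation-type proportions and the pairwise
# correlation between datasets, as described above.
from itertools import combinations
import numpy as np

def proportions(relation_counts, total_bars):
    """relation_counts: {relation_type: number of bars having that relation}."""
    return {t: c / total_bars for t, c in relation_counts.items()}

def pairwise_correlations(datasets, types):
    """datasets: {name: proportions dict}; returns Pearson r for every dataset pair."""
    result = {}
    for (a, pa), (b, pb) in combinations(datasets.items(), 2):
        va = np.array([pa.get(t, 0.0) for t in types])
        vb = np.array([pb.get(t, 0.0) for t in types])
        result[(a, b)] = float(np.corrcoef(va, vb)[0, 1])
    return result
```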

We also examine whether our melody generation network can produce target bars with the correct relations under specified conditions. After comparing the structure graphs extracted from the generated melodies with the graphs used as conditions, we obtain the ratio of correctly generated bars for each relation. Among all the relations, rhythmic sequence achieves the highest accuracy (92.88%), while transposition shows the lowest (77.33%).

4.2 Subjective evaluations

We compare the performance of MELONS with two other systems: CP-Transformer (a state-of-the-art method for unconditional music generation without control over structure) and PopMNet, under different configurations. The structure graph used as the condition for the melody generation network can either be extracted from real music (marked as R) or generated by the structure generation net (no mark). The relations in the structure graph can either be limited to the simple repetition and rhythmic sequence that PopMNet suggests (marked as B) or follow the more sophisticated relation set described in Section 2.1 (no mark). We also include human-created melodies for reference.

For the listening test, we randomly select 10 motifs from the testing set and generate sets of melodies corresponding to the above experimental systems (audio samples of the generated melodies are available at https://yiathena.github.io/MELONS/). We invite 12 listeners who are music professionals to evaluate 80 samples. The subjects are asked to rate the melodies on a five-point scale varying from very bad to excellent on the following four metrics: 1) Structure: Does the melody have a clear structure? 2) Richness: Does the melody have rich content? 3) Pleasure: Does the melody sound melodious? 4) Overall: What is the overall quality of the melody?

Table 3: The average mean opinion scores on the four evaluation metrics over all experimental groups.
Model Structure Richness Pleasure Overall
CP-Transformer 2.41 2.28 2.37 2.13
PopMNet 2.31 2.46 2.18 2.17
PopMNet-R 2.54 2.30 2.27 2.29
MELONS-B 2.77 2.60 2.39 2.34
MELONS-B-R 2.81 2.71 2.41 2.43
MELONS 3.53 3.29 3.20 3.20
MELONS-R 3.60 3.37 3.26 3.22
Human 4.54 4.34 4.34 4.39
Fig. 4: Box plots of the overall MOS scores. Black dots represent the mean score.

Table 3 shows the average mean opinion scores on the four evaluation metrics. The distributions of the results of the four metrics are relatively similar, which to a certain extent illustrates the significance of musical structure in the overall evaluation of music. Hence, we only show the box plots on the overall metric in Fig. 4. T-tests with Holm-Bonferroni correction (α = 0.05) were also conducted for all evaluation metrics.

The results demonstrate the effectiveness of our proposed approach. MELONS and MELONS-R outperform the other systems on all metrics by a large margin (p < 1e-6 for all evaluation metrics). Among all the systems, CP-Transformer receives the lowest evaluation. This indicates that melody generation can benefit remarkably from structural information. Furthermore, the performance of the MELONS systems that use the sophisticated structure graph is clearly better than that of those using the basic structure graph proposed by PopMNet, which shows the superiority of our proposed structure representation. We also see that the MELONS-B related systems are better than the PopMNet related systems, indicating that MELONS generates better melodies when using the same structural information as prior knowledge. There is still an obvious gap between the generated melodies and human-created music, leaving large room for improvement. We notice that the differences between models using the generated structure graph and those using the real structure graph are not significant. This suggests that the music structure generated by MELONS is of close quality to human-designed music forms. Thus our future work will focus more on improving the melody generation model.

5 Conclusions

MELONS is a transformer-based framework which generates melodies with long-term structure by using a structure generation net and a melody generation net. One of the key ideas is to represent structure information with a graph that consists of eight types of hierarchical relations between pairwise bars. Evaluations show the effectiveness of this representation. The subjective evaluation results suggest that MELONS can generate enjoyable melodies with clear structure and rich content.

References

  • [1] Peter Langston, “Six techniques for algorithmic music composition,” in Proceedings of the International Computer Music Conference. Citeseer, 1989, vol. 60, p. 59.
  • [2] Xueyao Zhang, Jinchao Zhang, Yao Qiu, Li Wang, and Jie Zhou, “Structure-enhanced pop music generation via harmony-aware learning,” arXiv preprint arXiv:2109.06441, 2021.
  • [3] Florian Colombo, Alexander Seeholzer, and Wulfram Gerstner, “Deep artificial composer: A creative neural network model for automated melody generation,” in International Conference on Evolutionary and Biologically Inspired Music and Art. Springer, 2017, pp. 81–96.
  • [4] Jian Wu, Changran Hu, Yulong Wang, Xiaolin Hu, and Jun Zhu, “A hierarchical recurrent neural network for symbolic melody generation,” IEEE transactions on cybernetics, vol. 50, no. 6, pp. 2749–2757, 2019.
  • [5] Jean-Pierre Briot, Gaëtan Hadjeres, and François-David Pachet, Deep learning techniques for music generation, vol. 1, Springer, 2020.
  • [6] Shulei Ji, Jing Luo, and Xinyu Yang, “A comprehensive survey on deep music generation: Multi-level representations, algorithms, evaluations, and future directions,” arXiv preprint arXiv:2011.06801, 2020.
  • [7] Elliot Waite et al., “Generating long-term structure in songs and stories,” Web blog post. Magenta, vol. 15, no. 4, 2016.
  • [8] Gabriele Medeot, Srikanth Cherla, Katerina Kosta, Matt McVicar, Samer Abdallah, Marco Selvi, Ed Newton-Rex, and Kevin Webster, “Structurenet: Inducing structure in generated melodies.,” in ISMIR, 2018, pp. 725–731.
  • [9] Adam Roberts, Jesse Engel, Colin Raffel, Curtis Hawthorne, and Douglas Eck, “A hierarchical latent vector model for learning long-term structure in music,” in International conference on machine learning. PMLR, 2018, pp. 4364–4373.
  • [10] Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Noam Shazeer, Ian Simon, Curtis Hawthorne, Andrew M Dai, Matthew D Hoffman, Monica Dinculescu, and Douglas Eck, “Music transformer,” arXiv preprint arXiv:1809.04281, 2018.
  • [11] Ning Zhang and Junchi Yan, “Melody structure transfer network: Generating music with separable self-attention,” arXiv preprint arXiv:2107.09877, 2021.
  • [12] Ke Chen, Weilin Zhang, Shlomo Dubnov, Gus Xia, and Wei Li, “The effect of explicit structure encoding of deep neural networks for symbolic music generation,” in 2019 International Workshop on Multilayer Music Representation and Processing (MMRP). IEEE, 2019, pp. 77–84.
  • [13] Zixun Guo, Makris Dimos, and Herremans Dorien, “Hierarchical recurrent neural networks for conditional melody generation with long-term structure,” arXiv preprint arXiv:2102.09794, 2021.
  • [14] Shuqi Dai, Zeyu Jin, Celso Gomes, and Roger B Dannenberg, “Controllable deep melody generation via hierarchical music structure representation,” arXiv preprint arXiv:2109.00663, 2021.
  • [15] Peter M Todd, “A connectionist approach to algorithmic composition,” Computer Music Journal, vol. 13, no. 4, pp. 27–43, 1989.
  • [16] Dominik Hörnel and Wolfram Menzel, “Learning musical structure and style with neural networks,” Computer Music Journal, vol. 22, no. 4, pp. 44–62, 1998.
  • [17] Jian Wu, Xiaoguang Liu, Xiaolin Hu, and Jun Zhu, “Popmnet: Generating structured pop music melodies using neural networks,” Artificial Intelligence, vol. 286, pp. 103303, 2020.
  • [18] Jimmy Kachulis, The Songwriter’s Workshop: Melody, Hal Leonard Corporation, 2003.
  • [19] Jimmy Kachulis, The Songwriter’s Workshop: Harmony, Hal Leonard Corporation, 2004.
  • [20] Jiaxuan You, Rex Ying, Xiang Ren, William Hamilton, and Jure Leskovec, “Graphrnn: Generating realistic graphs with deep auto-regressive models,” in International conference on machine learning. PMLR, 2018, pp. 5708–5717.
  • [21] Wen-Yi Hsiao, Jen-Yu Liu, Yin-Cheng Yeh, and Yi-Hsuan Yang, “Compound word transformer: Learning to compose full-song music over dynamic directed hypergraphs,” arXiv preprint arXiv:2101.02402, 2021.
  • [22] Yu-Siang Huang and Yi-Hsuan Yang, “Pop music transformer: Beat-based modeling and generation of expressive pop piano compositions,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 1180–1188.
  • [23] Nitish Shirish Keskar, Bryan McCann, Lav R Varshney, Caiming Xiong, and Richard Socher, “Ctrl: A conditional transformer language model for controllable generation,” arXiv preprint arXiv:1909.05858, 2019.
  • [24] Christine Payne, “Musenet,” OpenAI Blog, vol. 3, 2019.
  • [25] Ziyu Wang, Ke Chen, Junyan Jiang, Yiyi Zhang, Maoran Xu, Shuqi Dai, Xianbin Gu, and Gus Xia, “Pop909: A pop-song dataset for music arrangement generation,” arXiv preprint arXiv:2008.07142, 2020.
  • [26] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret, “Transformers are rnns: Fast autoregressive transformers with linear attention,” in International Conference on Machine Learning. PMLR, 2020, pp. 5156–5165.
  • [27] Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
Fig. 5: Examples of surround notes (marked in blue in the figure). The melodic line progresses and moves around the surround notes. The piece of melody comes from a pop song named "Shape of You".

6 Appendix

This section is a supplement to Table 1. Here we describe the definitions of some concepts that appear in the table.

Since the relation types are determined by the rhythm and pitch sequences within bars, we first explain the definitions of the rhythm and pitch sequences used in this paper. We use R_i and P_i to represent the rhythm and pitch sequence of the i-th bar, respectively:

R_i = \{r_{i,1}, r_{i,2}, \ldots, r_{i,k}, \ldots, r_{i,n}\}, \quad P_i = \{p_{i,1}, p_{i,2}, \ldots, p_{i,k}, \ldots, p_{i,n}\}    (1)

where k is the position of a note, and r_{i,k} and p_{i,k} represent the rhythm and pitch of the k-th note within the i-th bar, respectively. To express rhythm in a mathematical way, we use the relative start time of each note in a bar; in our experiments, the sixteenth note is used as the basic time unit. p_{i,k} is the MIDI index of the k-th note in the bar; for example, the index 60 represents the pitch C4 in the MIDI protocol. n is the number of notes in a bar. Taking Fig. 1 as an example, the rhythm and pitch sequences of the 14th bar and the 23rd bar are: R_{14} = {0, 2, 6, 8}, P_{14} = {77, 74, 74, 72}, R_{23} = {0, 2, 6, 8}, P_{23} = {79, 76, 74, 72}.

Pitch progression and rhythmic sequence. The difference between pitch progression and rhythmic sequence mainly lies in the similarity coefficient of the pitch sequences of the two bars. Consider a bar i and a later bar j (j > i), and let n be the number of notes in bar i. The similarity of bar i and bar j is determined by the number of notes that have the same pitch:

Similarity = \frac{\sum_{k=1}^{n}(p_{i,k} == p_{j,k})}{n}    (2)

If bar j contains fewer than n notes, the missing p_{j,k} are set to 0. If the similarity of the pitch sequences of the two bars is not less than 50% and their rhythm sequences are the same, we call the later bar a pitch progression of the earlier bar. For example, for the 14th bar and the 23rd bar in Fig. 1, their rhythms are the same (R_23 = R_14) and their similarity is 0.5, since n = 4, p_{23,3} = p_{14,3} and p_{23,4} = p_{14,4}. Thus the 23rd bar is a pitch progression of the 14th bar, because the similarity coefficient of their pitch sequences is not less than 50%. In contrast, for the 6th bar and the 2nd bar in Fig. 1, the similarity coefficient of their pitch sequences is less than 50% (R_2 = R_6, similarity = 0), so the 6th bar is a rhythmic sequence of the 2nd bar.
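A minimal sketch of this test follows, assuming rhythms are relative onsets in sixteenth-note units and pitches are MIDI numbers as defined above; the padding details are our reading of the definition.

```python
# Illustrative test distinguishing pitch progression from rhythmic sequence.
def pitch_similarity(p_i, p_j):
    """Eq. (2): fraction of positions where the two pitch sequences agree.
    If the second bar has fewer notes, the missing pitches are treated as 0."""
    n = len(p_i)
    p_j = list(p_j) + [0] * max(0, n - len(p_j))
    return sum(a == b for a, b in zip(p_i, p_j[:n])) / n

def classify(r_i, p_i, r_j, p_j):
    """Both relations require identical rhythm; the 50% threshold splits them."""
    if r_i != r_j:
        return None
    return "pitch_progression" if pitch_similarity(p_i, p_j) >= 0.5 else "rhythmic_sequence"

# Bars 14 and 23 from Fig. 1:
R14, P14 = [0, 2, 6, 8], [77, 74, 74, 72]
R23, P23 = [0, 2, 6, 8], [79, 76, 74, 72]
print(classify(R14, P14, R23, P23))  # -> pitch_progression (similarity = 0.5)
```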

Melody progression. To find the bars that are written based on melody progression, we use an algorithm similar to finding the longest common subsequence between two sequences. A common subsequence of two sequences is an ordered subsequence that occurs in both of them. Assume R_{i,j} is the longest common subsequence of the rhythms of bar i and bar j, and P_{i,j} is the longest common subsequence of the pitch sequences of bar i and bar j; then R_{i,j} and P_{i,j} can be represented as follows:

R_{i,j} = \{r_{i,k}, r_{i,k+1}, \ldots, r_{i,m}\}, \quad P_{i,j} = \{p_{i,k}, p_{i,k+1}, \ldots, p_{i,m}\}, \quad i > j,\ m \leq n    (3)

If R_{i,j} and P_{i,j} correspond to the same consecutive notes and contain at least 3 notes (m ≥ 3), the i-th bar is a melody progression of the j-th bar.
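Under this reading (at least 3 consecutive notes sharing both rhythm and pitch, as in Table 1), the detection can be sketched as follows; the note representation as (onset, pitch) pairs is our assumption.

```python
# Illustrative melody-progression test via the longest common run of notes.
def longest_common_note_run(bar_a, bar_b):
    """bar_x: list of (onset, pitch) pairs; length of the longest run of
    consecutive notes identical in both bars (O(n*m) dynamic program)."""
    n, m = len(bar_a), len(bar_b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for x in range(1, n + 1):
        for y in range(1, m + 1):
            if bar_a[x - 1] == bar_b[y - 1]:
                dp[x][y] = dp[x - 1][y - 1] + 1
                best = max(best, dp[x][y])
    return best

def is_melody_progression(bar_a, bar_b):
    return longest_common_note_run(bar_a, bar_b) >= 3
```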

Harmonious cadence. Harmonious cadence is used to represent the harmonious end or breath of a phrase. We employ a note density curve to detect the position of a harmonious cadence. The note density curve is formed by the positions and densities of the notes: the note position serves as the independent variable, and the density is a function of it. The density of a note is the number of notes in the time interval that the note belongs to; the interval is open at the note's start time and its length is set to one bar. Local minimum points of the note density curve are the locally sparsest positions in the melody, indicating the end of a phrase or a breath. Fig. 6 shows an example of the note density of the 19th bar.

Fig. 6: Example of the note density in the 19th bar; music notes on the interval boundary are not counted toward the density. The red note is a local minimum point of note density in this score.
Fig. 7: Note density curve for the score in Fig. 1.

The note density curve extracted from the melody of Fig. 1 is shown in Fig. 7. The fifth local minimum point in Fig. 7 is the red note shown in Fig. 6.

After analyzing the curve, we select, from the local minimum points, the notes whose pitches belong to the tonic chord of the music key. This idea comes from a melody-writing technique called cadence, which allows the phrase to create a sense of resolution.
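A sketch of this detection is given below, under our own assumptions: onsets in sixteenth-note units, an interval that excludes the note's own start time (so boundary notes are not counted, as in Fig. 6), strict local minima, and the C major tonic-chord pitch classes purely as an example.

```python
# Illustrative harmonious-cadence detection via a note-density curve.
BAR_LEN = 16                 # sixteenth-note units per 4/4 bar
TONIC_CHORD_PCS = {0, 4, 7}  # C, E, G -- example key of C major

def note_density(onsets):
    """Density of each note: notes starting strictly after it within one bar."""
    return [sum(1 for u in onsets if t < u <= t + BAR_LEN) for t in onsets]

def local_minima(values):
    return [k for k in range(1, len(values) - 1)
            if values[k] < values[k - 1] and values[k] < values[k + 1]]

def cadence_notes(onsets, pitches):
    """Local-minimum notes whose pitch class lies in the tonic chord."""
    dens = note_density(onsets)
    return [(onsets[k], pitches[k]) for k in local_minima(dens)
            if pitches[k] % 12 in TONIC_CHORD_PCS]
```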

Surround progression. Surround progression is used to manifest the progression of emotions in music. In rap, rock and pop music, the melody in a phrase usually moves around a certain note; an example is shown in Fig. 5.

Raising the pitch of the note around which the melody moves is the easiest way to heighten emotion. In this work, notes that repeat more than 3 times in a bar are called surround notes. The method to detect a surround progression for the current bar is as follows: first, rule out the higher-priority relation types; then, determine whether there are surround notes in the current bar; finally, search the previous bars from near to far for a bar that also contains surround notes which rise or fall by at least five semitones relative to those of the current bar.
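One possible implementation of this check is sketched below; the exact matching and counting rules (for example, reading "more than 3 times" as at least 4 occurrences) are our assumptions.

```python
# Illustrative surround-progression test between the current and a previous bar.
from collections import Counter

def surround_notes(pitches, min_repeats=4):
    """Pitches repeated more than 3 times (i.e., at least 4) within one bar."""
    return {p for p, c in Counter(pitches).items() if c >= min_repeats}

def is_surround_progression(cur_pitches, prev_pitches):
    """True if both bars contain surround notes and some pair of them differs
    by at least five semitones (a rise or fall)."""
    cur, prev = surround_notes(cur_pitches), surround_notes(prev_pitches)
    return bool(cur) and bool(prev) and any(abs(a - b) >= 5 for a in cur for b in prev)
```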