MELONS: Generating Melody with Long-Term Structure Using Transformers and Structure Graph
Abstract
The creation of long melody sequences requires effective expression of coherent musical structure. However, there is no clear representation of musical structure. Recent works on music generation have suggested various approaches to deal with the structural information of music, but generating a full-song melody with clear long-term structure remains a challenge. In this paper, we propose MELONS, a melody generation framework based on a graph representation of music structure which consists of eight types of bar-level relations. MELONS adopts a multi-step generation method with transformer-based networks by factoring melody generation into two sub-problems: structure generation and structure conditional melody generation. Experimental results show that MELONS can produce structured melodies with high quality and rich content.
Index Terms— Music structure, melody generation, structure graph, transformer
1 Introduction
Music composition requires not only inspiration but also adherence to the rules of music theory. Structural constraints are among the most important of these rules for creating a complete melody [1]. The structure of a complete melody is related to the musical form, which shapes how the melody unfolds along the time axis, for example through the repetition of motives, the transitions between phrases, and the development between sections [2]. In melody composition, structural constraints make the melodic line cohesive and comprehensible. While aesthetically appealing short melody pieces can be generated with recent deep learning technologies [3, 4], creating long melodies with reasonable structure remains a challenge [5, 6], which motivates this work.
One of the obstacles to generating structured long music is that there is no clear way to describe the structure technically. Some works regard the structure of music as simple repetitive patterns [7, 8]. However, musical structure is meant to reflect the dependencies among note sequences at different time scales, and such dependencies include not only repetitions but also flexible and versatile developments. Some researchers consider structure a latent component of music and try to use various neural networks to discover it from music data automatically [9, 10, 11], but the generated samples do not exhibit the reasonable and clear structural organization found in human-created music. Many works use conditional information such as chord progressions [12, 13, 2] and section tags [2, 14] to control the structure of the generated music, but this requires parallel annotated data, which is not easy to collect.
There has been a long history of dividing the effort of generating structured music into different steps. Todd [15] proposed using two cascaded networks to construct new pieces of music with hierarchical structure representations; the networks learn both well-defined and more obscure aspects of musical structure from musical examples. Later, Hörnel [16] proposed the MELONET system, in which the first network learns to recognize musical structures while the second network predicts the notes. PopMNet [17] also employs two networks to generate structured pop melodies: a structure generation network that produces the structural representations, and a melody generation network that generates music sequences based on the structural representation and a chord progression. Different from [15] and [16], PopMNet represents melody structure with a graph consisting of two typical relations between pairs of bars, repetition and sequence, and uses the adjacency matrices of the graph to model the structural variations of a long melody. Although PopMNet integrates melody structure into the generation process, the quality of its generated demos is quite limited. This is mainly due to the incomplete enumeration of the relations between bars: repetition and sequence are just two categories among all common relations. Further, the generation networks in PopMNet are based on CNNs and RNNs, which greatly limits their ability to model long time-dependent sequences.
In this paper, we propose a framework named MELONS that generates full-song melodies with long-term structure. Our contributions can be summarized as follows. First, instead of considering only simple repetitions and rhythmic sequences, we construct a graph representation of melody structure with eight types of bar-level relations that appear frequently in pop songs. These relations cover most prevalent melody composition techniques, including cadence and breath. Statistical analysis on multiple datasets from different sources demonstrates the versatility and rationality of our proposed relation types. In addition, we propose a multi-step generation method with transformer-based networks by factoring melody generation into structure generation and structure conditional melody generation. Unlike PopMNet, which is limited to a fixed 32-bar length, and other works that require annotated structural information, our method generates full-song melodies of flexible length and is annotation-free. Subjective evaluations show that our approach can generate harmonious melodies with reasonable structures.
2 Proposed Approach

Priority | Relation types | Description |
---|---|---|
1 | Repetition | The current bar is the same as a previous bar. |
2 | Transposition | The current bar is the tonal transposition of a previous bar. |
3 | Pitch progression | Same rhythm; the similarity of the pitch sequences of the two bars is not less than 50%. |
4 | Rhythmic sequence | Same rhythm; the similarity of the pitch sequences of the two bars is less than 50%. |
5 | Melody progression | The two bars share at least 3 consecutive notes of the same pitch and rhythm. |
6 | Surround progression | Surrounding notes (pitches repeated more than 3 times in a bar) rise or fall by at least five semitones between the two bars. |
7 | Harmonious cadence | A certain note in the current bar is a local minimum point of the note density curve (defined in the Appendix), and its pitch belongs to the tonic chord of the music key. |
8 | Rest | No notes in the current bar. |
In this section, we first present a brief study of pairwise bar-level relations and then introduce our proposed framework, MELONS, whose goal is to produce full-song melodies with long-term structure given melody motifs.
2.1 Proposed bar-level relations
We employ the relations between pairwise bars to represent the musical structure. We take four beats as one bar, with each beat equal to a quarter note. Based on musical composition theory, we classify the bar-level relations into three categories: repetition, development [18], and cadence [19]. After statistical analysis of more than 2500 pieces of pop music, we further decompose these three categories into the eight frequently occurring types of structural relations defined in Table 1. Fig. 1 shows an example of the relations in a music score.
These relations cover the most common techniques of pop music composition. They not only include complicated development methods but also reflect the progression of emotions (surround progression) and express the harmonious ending or breath of phrases (harmonious cadence). Moreover, these relations can be expressed by mathematical formulas (for more details, please refer to https://yiathena.github.io/MELONS/), which facilitates their automatic detection.
We use a directed graph with multiple types of edges, which we call a structure graph, to describe the melody structure of a song. Bars and the relations between bars are represented as nodes and edges, respectively, and the edge types correspond to the relation types. As shown in Table 1, the priority of a relation is determined by the similarity between the two bars. To avoid redundancies in the structure graph, each bar is matched with its previous bars from near to far to find the relation of the highest priority. Fig. 2 illustrates the structure graph of the melody provided in Fig. 1.
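As a concrete illustration, the following is a minimal Python sketch of this priority-based matching. It is not the authors' implementation; the relation predicates are placeholders assumed to encode the rules of Table 1.

```python
# Minimal sketch of priority-based structure-graph extraction (not the
# authors' code). The relation predicates are placeholders assumed to
# implement the rules of Table 1.

RELATION_CHECKS = [                      # ordered by priority (Table 1)
    ("Repetition", lambda cur, prev: cur == prev),
    # ("Transposition", is_transposition), ... remaining relation types
]

def extract_structure_graph(bars):
    """Return edges (current_bar, related_bar, relation_type), at most one per bar."""
    edges = []
    for i in range(len(bars)):
        best = None                      # (priority_rank, edge)
        # match the current bar with previous bars from near to far and
        # keep only the relation of the highest priority
        for j in range(i - 1, -1, -1):
            for rank, (rel, check) in enumerate(RELATION_CHECKS):
                if check(bars[i], bars[j]) and (best is None or rank < best[0]):
                    best = (rank, (i, j, rel))
        if best is not None:
            edges.append(best[1])
    return edges
```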

2.2 Framework of MELONS


The training and generation procedures of MELONS are shown in Fig. 3. We use the structure graph to bridge the gap between local pairwise relations and global music structure, and propose a multi-step melody generation framework to compose a full-song melody based on an 8-bar motif (which is the common length of a phrase).
Structure generation. We extract structure graphs from the original songs to construct the dataset for structure generation. Inspired by GraphRNN [20], we model the structure graph as a sequence of relations and generate the relations of a song with an auto-regressive transformer model. We use a token triple (current bar index, related bar index, relation type) to represent the edge (relation) from a previous bar to the current bar. By arranging all edges of a structure graph in ascending order of the index of the current bar, we obtain a list of such token triples, which is a sequence form of the structure graph. We adopt the tokenization method proposed in [21] and predict the three tokens of an edge at one time step. As MELONS deals with full-song melody generation, we append an EOS relation at the end of each structure graph to mark the end of the song. At inference time, edges in the structure graph are generated autoregressively until the EOS is predicted, so the generated structure is not limited to a fixed length. Moreover, each subsequent relation is predicted conditioned on the music structure generated so far, leading to better cohesiveness.
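For illustration, a minimal sketch of turning a structure graph into such a triple sequence could look like the following; the EOS convention and the exact token values are assumptions, only the layout follows the description above.

```python
# Sketch of serializing a structure graph into the token-triple sequence
# consumed by the structure generation transformer (assumed token layout).

def graph_to_sequence(edges, num_bars):
    """edges: list of (current_bar, related_bar, relation_type) triples."""
    # arrange all edges in ascending order of the current-bar index
    seq = sorted(edges, key=lambda e: e[0])
    seq.append((num_bars, num_bars, "EOS"))      # marks the end of the song
    return seq

# At inference time the model would emit one triple per step until it
# predicts the EOS relation, so the structure length is not fixed.
```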
Melody generation. The melody generation part consists of an unconditional music generation module and a conditional generation module. We adopt the event-based music token representation applied broadly for music generation [10, 22, 21, 2] in both modules. The unconditional module is an auto-regressive Transformer model [21] trained on the original melodies. The conditional generation module adopts a similar architecture, but the training data is reorganized to build a conditional scenario, like CTRL [23] for conditional text generation and MuseNet [24] for music. For each relation in a structure graph, we treat the melody of the current bar as the target and train the model to generate it conditioned on the related bar and the relation type. The final sequence has the form {[start of the melody context], (content of the melody context), [start of the related bar], (content of the related bar), [relation type], [start of the target bar], (content of the target bar)}. The melody context here consists of the 8 bars prior to the target bar. At inference time, all tokens before the [start of the target bar] token are given. The three modules mentioned above are combined in the generation procedure; the detailed generation algorithm is illustrated in Algorithm 1.
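As an illustration of this sequence layout, the hedged sketch below assembles one conditional training example; the bracketed token names are placeholders, not the actual vocabulary of MELONS.

```python
# Sketch of assembling one conditional training example in the layout
# described above; bracketed token names are illustrative placeholders.

def build_conditional_example(bars, current, related, relation_type):
    """bars: list of bars, each a list of event tokens."""
    context_bars = bars[max(0, current - 8):current]   # the 8 bars before the target
    seq = ["[start of the melody context]"]
    for bar in context_bars:
        seq.extend(bar)
    seq.append("[start of the related bar]")
    seq.extend(bars[related])
    seq.append(f"[relation type: {relation_type}]")
    seq.append("[start of the target bar]")
    seq.extend(bars[current])       # the target bar the model learns to generate
    return seq
```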
3 Experimental Configurations
3.1 Dataset
The datasets used in this work are POP909 [25] and Wikifonia (http://www.wikifonia.org), which are public datasets of Chinese and Western pop songs, respectively. We keep melodies in 4/4 time signature as our training data, selecting 901 musical scores from POP909 and 1600 scores from Wikifonia covering different pop music styles. The data is split into training and testing sets with a 9:1 ratio.
3.2 Implementation details
In MELONS, we choose the linear Transformer [26] as the backbone architecture for sequence modeling, considering its linear complexity in attention calculation. The structure generation module consists of 4 self-attention layers with 4 heads and 256 hidden states; the inner layer size of the feed-forward part is set to 1024. The melody generation modules consist of 6 self-attention layers with 8 heads and 512 hidden states, with the inner layer size set to 2048. We adopt the compound word (CP) method [21] for sequence representation and tokenization. The structure generation module employs a token set with three token types: index of the current bar, index of the related bar, and relation type, whose embedding sizes are 128, 128, and 32, respectively. The unconditional melody generation module uses its own token set to serialize melodies. In the conditional generation module, we additionally use dedicated tokens to mark the different sections of the input sequence and to impose the relation-type control. The embedding sizes are chosen based on the vocabulary sizes of the different token types, and the embedded tokens describing the same object are concatenated and linearly projected to the hidden size of the corresponding module. We use the Adam optimizer [27] to train all these models and apply dropout with a rate of 0.1 during training.
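For convenience, the hyperparameters reported above can be collected into small configuration objects; the class and field names below are ours, not part of the MELONS codebase.

```python
# The reported hyperparameters summarized as plain config objects; the
# class and field names are ours, not from the MELONS codebase.
from dataclasses import dataclass

@dataclass
class TransformerConfig:
    n_layers: int
    n_heads: int
    d_model: int
    d_ff: int

structure_cfg = TransformerConfig(n_layers=4, n_heads=4, d_model=256, d_ff=1024)
melody_cfg = TransformerConfig(n_layers=6, n_heads=8, d_model=512, d_ff=2048)

# token-type embedding sizes of the structure generation module
structure_embedding_sizes = {"current_bar": 128, "related_bar": 128, "relation_type": 32}
```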
4 Experimental Evaluation
4.1 Objective evaluations
Relation types | POP909 | Wikifonia | Collected | MELONS |
---|---|---|---|---|
Repetition | 44% | 47% | 50% | 50% |
Transposition | 1% | 3% | 2% | 1% |
Pitch progression | 3% | 4% | 3% | 4% |
Rhythmic sequence | 6% | 8% | 6% | 11% |
Melody progression | 11% | 3% | 6% | 6% |
Surround progression | 5% | 2% | 3% | 2% |
Harmonious cadence | 6% | 5% | 5% | 5% |
Rest | 6% | 8% | 7% | 4% |
In total | 81% | 81% | 82% | 83% |
In order to verify the versatility of the relations defined in Table 1, we additionally collect 1160 pop songs of different styles and regions from the Internet, as well as thousands of melodies generated by MELONS. We automatically extract all relations in each dataset and perform statistical analysis on them. The proportion is defined as the ratio of the number of bars with a certain relation to the total number of bars. Table 2 shows the statistical results. We can see that the distributions of the relations across the four datasets are roughly similar (with high correlation coefficients for all dataset pairs). This demonstrates the universality and stability of our proposed structure representation.
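This comparison can be reproduced directly from Table 2; the short sketch below computes the Pearson correlation between the per-relation proportions of each dataset pair, with the percentages copied from the table above.

```python
# Pearson correlations between the per-relation proportions of each
# dataset pair, with the percentages copied from Table 2.
import numpy as np

proportions = {  # Repetition, Transposition, Pitch prog., Rhythmic seq.,
                 # Melody prog., Surround prog., Harmonious cadence, Rest
    "POP909":    [44, 1, 3, 6, 11, 5, 6, 6],
    "Wikifonia": [47, 3, 4, 8, 3, 2, 5, 8],
    "Collected": [50, 2, 3, 6, 6, 3, 5, 7],
    "MELONS":    [50, 1, 4, 11, 6, 2, 5, 4],
}

names = list(proportions)
for a in range(len(names)):
    for b in range(a + 1, len(names)):
        r = np.corrcoef(proportions[names[a]], proportions[names[b]])[0, 1]
        print(f"{names[a]} vs {names[b]}: r = {r:.3f}")
```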
We also examine whether our melody generation network can produce target bars with the correct relations under specified conditions. By comparing the structure graphs extracted from the generated melodies with the graphs used as conditions, we obtain the ratio of correctly generated bars for each relation. Among all the relations, rhythmic sequence achieves the highest accuracy, while transposition shows the lowest.
4.2 Subjective evaluations
We compare the performance of MELONS with two other systems, CP-Transformer (a state-of-the-art method for unconditional music generation without structural control) and PopMNet, under different configurations. The structure graph used as the condition for the melody generation network is either extracted from real music (marked as R) or generated by the structure generation net (no mark). The relations in the structure graph are either the simple repetition and rhythmic sequence suggested by PopMNet (marked as B) or the more sophisticated relation set described in Section 2.1 (no mark). We also include human-created melodies for reference.
For the listening test, we randomly select 10 motifs from the testing set and generate sets of melodies with the above experimental systems (audio samples of generated melodies are available at https://yiathena.github.io/MELONS/). We invite 12 listeners with professional music backgrounds to evaluate 80 samples. The subjects are asked to rate the melodies on a five-point scale from very bad to excellent on the following four metrics: 1) Structure: does the melody have a clear structure? 2) Richness: does the melody have rich content? 3) Pleasure: does the melody sound melodious? 4) Overall: what is the overall quality of the melody?
Model | Structure | Richness | Pleasure | Overall |
---|---|---|---|---|
CP-Transformer | 2.41 | 2.28 | 2.37 | 2.13 |
PopMNet | 2.31 | 2.46 | 2.18 | 2.17 |
PopMNet-R | 2.54 | 2.30 | 2.27 | 2.29 |
MELONS-B | 2.77 | 2.60 | 2.39 | 2.34 |
MELONS-B-R | 2.81 | 2.71 | 2.41 | 2.43 |
MELONS | 3.53 | 3.29 | 3.20 | 3.20 |
MELONS-R | 3.60 | 3.37 | 3.26 | 3.22 |
Human | 4.54 | 4.34 | 4.34 | 4.39 |

Table 3 shows the average mean opinion scores on the four evaluation metrics. The distributions of the results across the four metrics are relatively similar, which to a certain extent illustrates the significance of musical structure in the overall evaluation of music. Hence, we only show the box plots for the overall metric in Fig. 4. T-tests with Holm-Bonferroni correction were also conducted for all evaluation metrics.
The results demonstrate the effectiveness of our proposed approach. MELONS and MELONS-R outperform the other systems on all metrics by a large margin. Among all the systems, CP-Transformer receives the lowest scores, indicating that melody generation benefits markedly from structural information. Furthermore, the MELONS systems that use the sophisticated structure graph clearly outperform those using the basic structure graph proposed by PopMNet, which shows the superiority of our proposed structure representation. We also see that the MELONS-B systems outperform the PopMNet systems, indicating that MELONS generates better melodies when using the same structural information as prior knowledge. There is still an obvious gap between generated melodies and human-created music, leaving large room for improvement. We notice that the differences between models using the generated structure graph and those using the real structure graph are not significant, suggesting that the music structure generated by MELONS is close in quality to structures designed by human composers. Thus our future work will focus more on improving the melody generation model.
5 Conclusions
MELONS is a transformer-based framework that generates melodies with long-term structure by using a structure generation net and a melody generation net. One of the key ideas is to represent structural information with a graph consisting of eight types of relations between pairwise bars. Evaluations show the effectiveness of this representation, and the subjective results suggest that MELONS can generate enjoyable melodies with clear structure and rich content.
References
- [1] Peter Langston, “Six techniques for algorithmic music composition,” in Proceedings of the International Computer Music Conference. Citeseer, 1989, vol. 60, p. 59.
- [2] Xueyao Zhang, Jinchao Zhang, Yao Qiu, Li Wang, and Jie Zhou, “Structure-enhanced pop music generation via harmony-aware learning,” arXiv preprint arXiv:2109.06441, 2021.
- [3] Florian Colombo, Alexander Seeholzer, and Wulfram Gerstner, “Deep artificial composer: A creative neural network model for automated melody generation,” in International Conference on Evolutionary and Biologically Inspired Music and Art. Springer, 2017, pp. 81–96.
- [4] Jian Wu, Changran Hu, Yulong Wang, Xiaolin Hu, and Jun Zhu, “A hierarchical recurrent neural network for symbolic melody generation,” IEEE Transactions on Cybernetics, vol. 50, no. 6, pp. 2749–2757, 2019.
- [5] Jean-Pierre Briot, Gaëtan Hadjeres, and François-David Pachet, Deep learning techniques for music generation, vol. 1, Springer, 2020.
- [6] Shulei Ji, Jing Luo, and Xinyu Yang, “A comprehensive survey on deep music generation: Multi-level representations, algorithms, evaluations, and future directions,” arXiv preprint arXiv:2011.06801, 2020.
- [7] Elliot Waite et al., “Generating long-term structure in songs and stories,” Web blog post. Magenta, vol. 15, no. 4, 2016.
- [8] Gabriele Medeot, Srikanth Cherla, Katerina Kosta, Matt McVicar, Samer Abdallah, Marco Selvi, Ed Newton-Rex, and Kevin Webster, “Structurenet: Inducing structure in generated melodies.,” in ISMIR, 2018, pp. 725–731.
- [9] Adam Roberts, Jesse Engel, Colin Raffel, Curtis Hawthorne, and Douglas Eck, “A hierarchical latent vector model for learning long-term structure in music,” in International Conference on Machine Learning. PMLR, 2018, pp. 4364–4373.
- [10] Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Noam Shazeer, Ian Simon, Curtis Hawthorne, Andrew M Dai, Matthew D Hoffman, Monica Dinculescu, and Douglas Eck, “Music transformer,” arXiv preprint arXiv:1809.04281, 2018.
- [11] Ning Zhang and Junchi Yan, “Melody structure transfer network: Generating music with separable self-attention,” arXiv preprint arXiv:2107.09877, 2021.
- [12] Ke Chen, Weilin Zhang, Shlomo Dubnov, Gus Xia, and Wei Li, “The effect of explicit structure encoding of deep neural networks for symbolic music generation,” in 2019 International Workshop on Multilayer Music Representation and Processing (MMRP). IEEE, 2019, pp. 77–84.
- [13] Zixun Guo, Makris Dimos, and Herremans Dorien, “Hierarchical recurrent neural networks for conditional melody generation with long-term structure,” arXiv preprint arXiv:2102.09794, 2021.
- [14] Shuqi Dai, Zeyu Jin, Celso Gomes, and Roger B Dannenberg, “Controllable deep melody generation via hierarchical music structure representation,” arXiv preprint arXiv:2109.00663, 2021.
- [15] Peter M Todd, “A connectionist approach to algorithmic composition,” Computer Music Journal, vol. 13, no. 4, pp. 27–43, 1989.
- [16] Dominik Hörnel and Wolfram Menzel, “Learning musical structure and style with neural networks,” Computer Music Journal, vol. 22, no. 4, pp. 44–62, 1998.
- [17] Jian Wu, Xiaoguang Liu, Xiaolin Hu, and Jun Zhu, “Popmnet: Generating structured pop music melodies using neural networks,” Artificial Intelligence, vol. 286, pp. 103303, 2020.
- [18] Jimmy Kachulis, The Songwriter’s Workshop: Melody, Hal Leonard Corporation, 2003.
- [19] Jimmy Kachulis, The Songwriter’s Workshop: Harmony, Hal Leonard Corporation, 2004.
- [20] Jiaxuan You, Rex Ying, Xiang Ren, William Hamilton, and Jure Leskovec, “GraphRNN: Generating realistic graphs with deep auto-regressive models,” in International Conference on Machine Learning. PMLR, 2018, pp. 5708–5717.
- [21] Wen-Yi Hsiao, Jen-Yu Liu, Yin-Cheng Yeh, and Yi-Hsuan Yang, “Compound word transformer: Learning to compose full-song music over dynamic directed hypergraphs,” arXiv preprint arXiv:2101.02402, 2021.
- [22] Yu-Siang Huang and Yi-Hsuan Yang, “Pop music transformer: Beat-based modeling and generation of expressive pop piano compositions,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 1180–1188.
- [23] Nitish Shirish Keskar, Bryan McCann, Lav R Varshney, Caiming Xiong, and Richard Socher, “Ctrl: A conditional transformer language model for controllable generation,” arXiv preprint arXiv:1909.05858, 2019.
- [24] Christine Payne, “Musenet,” OpenAI Blog, vol. 3, 2019.
- [25] Ziyu Wang, Ke Chen, Junyan Jiang, Yiyi Zhang, Maoran Xu, Shuqi Dai, Xianbin Gu, and Gus Xia, “Pop909: A pop-song dataset for music arrangement generation,” arXiv preprint arXiv:2008.07142, 2020.
- [26] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret, “Transformers are rnns: Fast autoregressive transformers with linear attention,” in International Conference on Machine Learning. PMLR, 2020, pp. 5156–5165.
- [27] Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.

6 Appendix
This appendix supplements Table 1 by defining some of the concepts that appear in the table.
Since the relation types are determined by the rhythm and pitch sequences within bars, we first explain the definitions of the rhythm and pitch sequences used in this paper. We use $R_i$ and $P_i$ to represent the rhythm and pitch sequences of the $i$-th bar, respectively:

$$R_i = \{r_1, r_2, \dots, r_N\}, \qquad P_i = \{p_1, p_2, \dots, p_N\} \tag{1}$$

where $n$ indexes the positions of the notes, and $r_n$ and $p_n$ represent the rhythm and pitch of the $n$-th note within the bar, respectively. To express rhythm mathematically, we use the relative start time of each note within a bar; in our experiments, the sixteenth note is used as the basic time unit. $p_n$ is the MIDI index of the $n$-th note; for example, index 60 represents the pitch C4 in the MIDI protocol. $N$ is the number of notes in a bar. Taking Fig. 1 as an example, the rhythm and pitch sequences of the 14th bar and the 23rd bar can be read directly from the score.
Pitch progression and rhythmic sequence. The difference between pitch progression and rhythmic sequence lies mainly in the similarity coefficient of the pitch sequences of the two bars. Given a bar $B_i$ and an earlier bar $B_j$ ($j < i$), where $N$ is the number of notes in $B_i$, the similarity of bars $B_i$ and $B_j$ is determined by the number of notes that have the same pitch:

$$S(B_i, B_j) = \frac{1}{N}\sum_{n=1}^{N} \mathbb{1}\!\left(p^{(i)}_n = p^{(j)}_n\right) \tag{2}$$

where $p^{(i)}_n$ denotes the pitch of the $n$-th note of bar $B_i$. If the total number of notes in bar $B_j$ is less than $N$, we set $S(B_i, B_j)$ to 0. If the similarity of the pitch sequences of two bars is not less than 50% and their rhythm sequences are the same, we call the later bar a pitch progression of the earlier bar. For example, the two example bars in Fig. 1 (the 14th and the 23rd) have the same rhythm and a similarity of 0.5, so the later bar is a pitch progression of the earlier one, since the similarity coefficient is not less than 50%. In contrast, for the 6th bar and the 2nd bar in Fig. 1, the similarity coefficient of the pitch sequences is less than 50%, so the 6th bar is a rhythmic sequence of the 2nd bar.
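A minimal sketch of this similarity coefficient, assuming each bar is given as its list of MIDI pitches aligned by rhythm position:

```python
# Sketch of the pitch-similarity coefficient of Eq. (2): the fraction of
# positions at which two same-rhythm bars carry the same pitch.

def pitch_similarity(pitches_i, pitches_j):
    """pitches_*: MIDI pitch sequences of two bars with identical rhythm."""
    n = len(pitches_i)
    if len(pitches_j) < n:                      # too few notes in the other bar
        return 0.0
    same = sum(a == b for a, b in zip(pitches_i, pitches_j))
    return same / n

# same rhythm and similarity >= 0.5  -> pitch progression
# same rhythm and similarity <  0.5  -> rhythmic sequence
```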
Melody progression. To find bars written as melody progressions, we use an algorithm similar to finding the longest common subsequence of two sequences. A common subsequence of two sequences is an ordered subsequence that is common to both. Assume $C_r$ is the longest common subsequence of the rhythm sequences of bars $B_i$ and $B_j$, and $C_p$ is the longest common subsequence of their pitch sequences; they can be represented as follows:

$$C_r = \mathrm{LCS}(R_i, R_j), \qquad C_p = \mathrm{LCS}(P_i, P_j) \tag{3}$$

If $|C_r| \geq 3$ and $|C_p| \geq 3$, and the shared notes are consecutive, bar $B_i$ is a melody progression of bar $B_j$.
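One plausible implementation of this test, reading "consecutive" as the longest shared contiguous run of (rhythm, pitch) pairs, is sketched below:

```python
# One plausible reading of the melody-progression test: the two bars share
# at least 3 consecutive notes with the same rhythm and pitch, detected as
# the longest common contiguous run of (rhythm, pitch) pairs.

def longest_common_run(notes_i, notes_j):
    """notes_*: lists of (relative_onset, pitch) pairs for two bars."""
    best = 0
    for a in range(len(notes_i)):
        for b in range(len(notes_j)):
            k = 0
            while (a + k < len(notes_i) and b + k < len(notes_j)
                   and notes_i[a + k] == notes_j[b + k]):
                k += 1
            best = max(best, k)
    return best

def is_melody_progression(notes_i, notes_j):
    return longest_common_run(notes_i, notes_j) >= 3
```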
Harmonious cadence. A harmonious cadence represents the harmonious ending or breath of a phrase. We employ a note density curve to detect its position. The note density curve is formed by the positions and densities of the notes, with note position as the independent variable. The density of a note is the number of notes in the time interval that the note belongs to; this interval is open at the note's start time and its length is one bar. Local minimum points of the note density curve are the locally sparsest positions in the melody, indicating the end of a phrase or a breath. Fig. 6 shows an example of the note density of the 19th bar.


The note density curve extracted from the melody of Fig. 1 is shown in Fig. 7. The fifth local minimum point in Fig. 7 corresponds to the red note shown in Fig. 6. After analyzing the curve, we select, from the local minimum points, the notes that belong to the tonic chord of the music key. This idea comes from a melody writing technique called cadence, which gives the phrase a sense of resolution.
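A hedged sketch of this detection step follows, assuming onsets are expressed in sixteenth-note units; the exact window convention and the C-major tonic-chord pitch classes are our assumptions.

```python
# Hedged sketch of cadence detection via the note density curve. The exact
# window convention is an assumption: each note's density counts the notes
# whose onsets fall in the one-bar interval open at that note's onset.

def note_density_curve(onsets, bar_length=16):
    """onsets: note start times in sixteenth-note units, sorted ascending."""
    return [sum(t < o <= t + bar_length for o in onsets) for t in onsets]

def cadence_candidates(onsets, pitches, tonic_pitch_classes=(0, 4, 7)):
    """Local minima of the density curve whose pitch lies in the tonic chord
    (pitch classes assume C major; adjust for other keys)."""
    density = note_density_curve(onsets)
    hits = []
    for k in range(1, len(density) - 1):
        local_min = density[k] <= density[k - 1] and density[k] <= density[k + 1]
        if local_min and pitches[k] % 12 in tonic_pitch_classes:
            hits.append(k)
    return hits
```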
Surround progression. Surround progression is used to express the progression of emotions in music. In rap, rock, and pop music, the melody of a phrase usually moves around a certain note; an example is shown in Fig. 5.
Raising the pitch of a melody that moves around a certain note is the simplest way to heighten emotion. In this work, notes whose pitch repeats more than 3 times in a bar are called surrounding notes. The method to detect a surround progression for the current bar is as follows: first, rule out the higher-priority relation types; then, determine whether there are surrounding notes in the current bar; finally, search previous bars from near to far for one that also has surrounding notes and whose surrounding notes rise or fall by at least five semitones relative to those of the current bar.
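A minimal sketch of this check, assuming each bar is given as its list of MIDI pitches:

```python
# Sketch of the surround-progression test: both bars contain a surrounding
# note (a pitch repeated more than 3 times in the bar), and the surrounding
# pitch moves up or down by at least five semitones between the bars.
from collections import Counter

def surrounding_note(pitches):
    """Return the most repeated pitch if it occurs more than 3 times, else None."""
    if not pitches:
        return None
    pitch, count = Counter(pitches).most_common(1)[0]
    return pitch if count > 3 else None

def is_surround_progression(current_pitches, previous_pitches):
    cur = surrounding_note(current_pitches)
    prev = surrounding_note(previous_pitches)
    return cur is not None and prev is not None and abs(cur - prev) >= 5
```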