MusicMamba: A Dual-Feature Modeling Approach for Generating Chinese Traditional Music with Modal Precision
Abstract
In recent years, deep learning has significantly advanced the MIDI domain, solidifying music generation as a key application of artificial intelligence. However, existing research primarily focuses on Western music and encounters challenges in generating melodies for Chinese traditional music, especially in capturing modal characteristics and emotional expression. To address these issues, we propose a new architecture, the Dual-Feature Modeling Module, which integrates the long-range dependency modeling of the Mamba Block with the global structure capturing capabilities of the Transformer Block. Additionally, we introduce the Bidirectional Mamba Fusion Layer, which integrates local details and global structures through bidirectional scanning, enhancing the modeling of complex sequences. Building on this architecture, we propose the REMI-M representation, which more accurately captures and generates modal information in melodies. To support this research, we developed FolkDB, a high-quality Chinese traditional music dataset encompassing various styles and totaling over 11 hours of music. Experimental results demonstrate that the proposed architecture excels in generating melodies with Chinese traditional music characteristics, offering a new and effective solution for music generation.
Index Terms— Music generation, music information retrieval, neural networks, deep learning, machine learning
1 Introduction
Recent advances in deep learning have significantly impacted the MIDI domain, making music generation a key application of artificial intelligence. Melody generation, a central task in music composition, involves creating musical fragments through computational models and presents more challenges than harmony generation and arrangement. A successful model must capture essential features like pitch and rhythm while producing melodies that align with specific styles and emotions. However, most existing methods, whether based on Recurrent Neural Networks [1, 2, 3, 4] or Transformer architectures [5, 6, 7, 8], struggle with the complexity and structure of melodies. For example, although the hierarchical RNN of [9] generates melodies with long-term structure, Transformers are generally better at capturing global dependencies and have demonstrated strong performance across various melody tasks [10].
Several studies have integrated music theory into the generation process. For instance, [11] introduced chord progressions for melody generation, [12] controlled polyphonic music features through chords and textures, and [13] improved beat structure representation. Additionally, [6] generated harmonious jazz melodies by adjusting harmonic and rhythmic properties, while other works have explored structured music generation using note-to-bar relationships [14] and melody skeletons [15].
Meanwhile, State Space Models (SSMs) have also advanced in modeling long-sequence dependencies, particularly in capturing global musical structures. Models like S4 [16] and S5 [17] have significantly improved parallel scanning efficiency through new state space layers. Mamba [18], a successful SSM variant, enhances parallel computation and has been applied across various fields, including visual domains with VMamba [19] and large-scale language modeling with Jamba [20]. Recognizing Mamba’s potential in sequence modeling, we applied it to symbolic music generation.
However, these methods primarily focus on Western music and struggle with generating Chinese traditional melodies. While they can produce smooth melodies, they often align with modern styles, failing to capture the unique contours and rhythms of Chinese traditional music. As shown in Figure 1, existing methods underperform in preserving the stylistic elements of Chinese music. Modes play a central role in Chinese melodies, determining note selection and arrangement, while conveying specific emotions and styles [21]. Due to significant differences in scales, pitch relationships, and modal structures between Western and Chinese music, these methods fail to capture these modal characteristics, leading to discrepancies in style and emotional expression [22]. The lack of high-quality Chinese traditional music datasets further limits their effectiveness.

To address these issues, we propose a new architecture, the Dual-Feature Modeling Module, which combines the long-range dependency modeling of the Mamba Block with the global structure capturing capability of the Transformer Block. We also designed the Bidirectional Mamba Fusion Layer, which integrates local details and global structures through bidirectional scanning, enhancing complex sequence modeling. This comprehensive architecture enables the generation of Chinese traditional music with complex structures and coherent melodies. Specifically, our contributions are:
- Mamba architecture to the MIDI domain. We applied the Mamba architecture to MIDI music generation, proposing the Dual-Feature Modeling Module, which combines the strengths of Mamba and Transformer Blocks. Through the Bidirectional Mamba Fusion Layer, we integrated local details with global structures, achieving excellent performance in long-sequence generation tasks.
- REMI-M Representation. We extended the REMI representation with REMI-M, introducing mode-related events and note type indicators, allowing the model to more accurately capture and generate modal information in melodies.
- FolkDB. We created a high-quality Chinese traditional music dataset, FolkDB, designed for studying Chinese traditional music. With over 11 hours of music covering various styles, FolkDB fills a gap in existing datasets and provides a foundation for further research.
2 Proposed Method
2.1 Problem Formulation
In melody generation, the condition sequence is typically defined as $X = \{x_1, x_2, \dots, x_m\}$ and the target sequence as $Y = \{y_1, y_2, \dots, y_n\}$, where $n > m$. The prediction of the $t$-th element in the target sequence can be expressed as $P(y_t \mid y_{<t}, X)$, where $P(\cdot)$ represents the conditional probability distribution of $y_t$ given the condition sequence and the previously generated elements. Chinese traditional music often includes various modes. For example, in pentatonic modes, if a note serves as the tonic, the following notes form a mode when they obey a specific interval relationship with it. Therefore, to generate Chinese music with modal characteristics, the target sequence can consist of multiple modes and transition notes, represented as $Y = \{M_1, T_1, M_2, T_2, \dots, M_K, T_K\}$, where $M_k$ corresponds to a subsequence of notes within a specific mode and $T_k$ represents the transition note sequence following $M_k$. The task of generating melodies with Chinese modes can ultimately be formulated as the following autoregressive problem:

$$P(Y \mid X) = \prod_{k=1}^{K} P(M_k \mid M_{<k}, T_{<k}, X)\, P(T_k \mid M_{\le k}, T_{<k}, X) \qquad (1)$$

where $\{M_1, \dots, M_K\}$ is the collection of multiple modes and $\{T_1, \dots, T_K\}$ the corresponding transition note sequences. During the step-by-step generation of notes, the mode subsequence is generated first, followed by the transition note sequence conditioned on it.
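To make the factorization in Eq. (1) concrete, the following minimal sketch (our own illustration, not the released implementation) shows how a decoder could alternate between emitting mode-defining tokens and transition-note tokens; the `model.sample_next` interface and the token names are assumptions.

```python
# Hypothetical sketch of the mode-first autoregressive factorization in Eq. (1).
# `model` is any autoregressive next-token predictor; token names are illustrative.

def generate(model, condition_tokens, max_len=512):
    sequence = list(condition_tokens)              # X: condition sequence
    generating_mode = True                         # each segment starts with mode tokens (M_k)
    while len(sequence) < max_len:
        next_token = model.sample_next(sequence)   # samples from P(y_t | y_<t, X); assumed API
        sequence.append(next_token)
        if generating_mode and next_token == "Mode-End":
            generating_mode = False                # switch to transition notes T_k
        elif not generating_mode and next_token == "Mode-Start":
            generating_mode = True                 # a new mode segment M_{k+1} begins
        if next_token == "EOS":
            break
    return sequence
```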
2.2 REMI-M Representation
In generating traditional Chinese music, mode generation is a crucial and complex component. Chinese music often features intricate modal structures, such as pentatonic and heptatonic scales, where the selection and transition of modes are vital to the style and expression of the music. However, existing music representation methods face significant limitations in capturing and generating these modal structures. Although the REMI representation [13] effectively captures rhythm, pitch, and velocity information through events such as bar, position, tempo, and note, it struggles with complex modal structures, particularly when handling the dynamic modes in Chinese music.
To address this issue, we extended REMI by introducing two new events in the REMI-M representation to explicitly describe modes:
- Note type event. Distinguishes between mode notes and transition notes, helping the model more accurately capture modal information.
- Mode-related events. Include the start, end, and type of a mode, enabling REMI-M to explicitly annotate and generate modal changes in the music.
As shown in Figure 2, the original REMI and MIDI-Like encodings result in low mode generation rates, whereas REMI-M demonstrates significant improvements, achieving substantially higher mode generation rates across all tested music lengths. These enhancements allow REMI-M to better handle complex modal structures, significantly improving the stylistic consistency and theoretical accuracy of the generated music.
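For illustration only, a REMI-M-style token stream for one bar might look like the list below. The exact event names and value ranges are our assumptions, based on REMI [13] plus the two new event families described above.

```python
# Hypothetical REMI-M token stream for a single bar (event names are illustrative).
remi_m_bar = [
    "Bar",
    "Mode-Start", "Mode-Type_Gong-Pentatonic",   # mode-related events: start + type
    "Position_1/16", "Tempo_90",
    "Note-Type_Mode",                            # note type event: mode note
    "Note-On_67", "Note-Velocity_20", "Note-Duration_4",
    "Position_9/16",
    "Note-Type_Transition",                      # note type event: transition note
    "Note-On_69", "Note-Velocity_18", "Note-Duration_2",
    "Mode-End",                                  # mode-related event: end of mode
]
```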

2.3 Model
In music generation tasks, it is crucial to capture both local melodic details and global musical structure dependencies within long contexts. To achieve this, as shown in Figure 3, we designed a hierarchical feature extraction and integration architecture named the Dual-Feature Modeling Module, which combines the long-range dependency modeling capability of the Mamba Block with the global structure capturing ability of the Transformer Block.
Dual-Feature Modeling Module. In music sequence generation, both melodic details (like note variations and modal transitions) and overall structure (such as phrases and repetition patterns) are crucial. Traditional architectures often struggle to capture these levels of features simultaneously. Let $X$ represent the input feature matrix. The Mamba Block captures melodic details and modal dependencies by computing a dot product between the mode mask and the melody tokens, generating the feature representation $F_m$. This provides essential long-range and local information for integration. The Transformer Block primarily models global structural information, processing the input melody embeddings with positional encoding to obtain the structural representation $F_t$.
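A minimal PyTorch sketch of this dual-branch design is given below. It reflects our reading of the module rather than the released implementation: the Mamba block is passed in as an arbitrary sequence-to-sequence layer (e.g., a layer from an SSM library), and the "dot product" with the mode mask is interpreted as element-wise gating of the embeddings.

```python
import torch
import torch.nn as nn

class DualFeatureModelingModule(nn.Module):
    """Sketch: parallel Mamba and Transformer branches over the same melody embeddings."""
    def __init__(self, mamba_block: nn.Module, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.mamba_block = mamba_block                        # long-range / local details -> F_m
        self.transformer_block = nn.TransformerEncoderLayer(  # global structure -> F_t
            d_model=d_model, nhead=n_heads, dim_feedforward=1024, batch_first=True)

    def forward(self, melody_emb, mode_mask, pos_enc):
        # Mamba branch: emphasize mode-relevant tokens via the mode mask
        # (element-wise gating is an assumption about the mask/token interaction).
        f_m = self.mamba_block(melody_emb * mode_mask.unsqueeze(-1))
        # Transformer branch: positional encoding added for global structure.
        f_t = self.transformer_block(melody_emb + pos_enc)
        return f_m, f_t
```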
Bidirectional Mamba Fusion Layer. To integrate the outputs of the Mamba Block and Transformer Block, we introduce the Bidirectional Mamba Fusion Layer. This layer simultaneously receives the long-range features $F_m$ generated by the Mamba Block and the global features $F_t$ generated by the Transformer Block. Through a bidirectional scanning mechanism, the forward and backward features are processed separately to obtain $F_{\mathrm{fwd}}$ and $F_{\mathrm{bwd}}$. Then, self-attention is applied to the forward and backward features to extract key information:

$$A_{\mathrm{fwd}} = \mathrm{SelfAttn}(F_{\mathrm{fwd}}), \qquad A_{\mathrm{bwd}} = \mathrm{SelfAttn}(F_{\mathrm{bwd}}) \qquad (2)$$

Next, the two directional features are concatenated and further processed by a linear layer to obtain the fused feature $F_{\mathrm{fused}}$:

$$F_{\mathrm{fused}} = \mathrm{Linear}\big([A_{\mathrm{fwd}} ; A_{\mathrm{bwd}}]\big) \qquad (3)$$

The fused feature $F_{\mathrm{fused}}$ combines the long-range dependencies and global structure of the melody, providing complete information support for generating complex and coherent music sequences. Finally, a linear layer maps the fused features to the output space, generating the final music sequence.
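The fusion step of Eqs. (2)–(3) can be sketched as follows. This is a hedged reconstruction under our notation: the bidirectional scan is implemented by running a stand-in scan layer over the sequence and its time-reversed copy, and how $F_m$ and $F_t$ are combined before scanning (here, a simple sum) is an assumption.

```python
import torch
import torch.nn as nn

class BidirectionalMambaFusionLayer(nn.Module):
    """Sketch of Eqs. (2)-(3): bidirectional scan, per-direction self-attention, concat + linear."""
    def __init__(self, scan_block: nn.Module, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        self.scan_block = scan_block                                   # e.g. a Mamba layer (stand-in)
        self.attn_fwd = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_bwd = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.proj = nn.Linear(2 * d_model, d_model)

    def forward(self, f_m, f_t):
        x = f_m + f_t                                                  # combine branch outputs (assumed)
        f_fwd = self.scan_block(x)                                     # forward scan
        x_rev = torch.flip(x, dims=[1])                                # reverse along the time axis
        f_bwd = torch.flip(self.scan_block(x_rev), dims=[1])           # backward scan, re-aligned
        a_fwd, _ = self.attn_fwd(f_fwd, f_fwd, f_fwd)                  # Eq. (2)
        a_bwd, _ = self.attn_bwd(f_bwd, f_bwd, f_bwd)
        fused = self.proj(torch.cat([a_fwd, a_bwd], dim=-1))           # Eq. (3)
        return fused
```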

3 Experiments
3.1 Implementation Details
3.1.1 Dataset
We used two datasets: the POP909 dataset [23] and a self-collected Chinese traditional music dataset (referred to as FolkDB). The POP909 dataset contains rich musical information, particularly in chords and melodies. Pre-training on this dataset allows the model to learn fundamental musical structures and elements, helping it adapt more quickly to FolkDB. Additionally, to address the lack of cultural diversity in the POP909 dataset, we compiled a dataset of approximately 300 Chinese traditional music pieces. This dataset contains about 11 hours of piano MIDI works, featuring traditional modes such as the pentatonic and heptatonic scales and showcasing the diverse styles and modal characteristics of Chinese music.
In terms of data preprocessing, since the original data consists of single-track Chinese traditional music melodies, we performed additional processing on the self-collected Chinese traditional music dataset to ensure that the model can effectively capture and generate music with Chinese cultural characteristics. The specific steps are as follows:
- Tonic Track Extraction. We employed the tonic extraction framework of the Wuyun model [15]. This framework uses a layered skeleton-guided approach, first constructing the skeleton of the melody and then extending it.
- Mode Detection and Annotation. After extracting the tonic track, we conducted mode detection on the melody using interval relationships. By analyzing the intervals between each pair of tonic notes and leveraging the knowledge-enhanced logic within the Wuyun model, we obtained the mode track for each piece of music (a simplified sketch of this interval check is given below).
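As a hypothetical illustration of the interval-based check (the actual annotation relies on Wuyun's knowledge-enhanced logic [15]), the sketch below tests whether the pitch classes of a tonic-track segment fit a pentatonic Gong-mode interval pattern relative to a candidate tonic.

```python
# Hypothetical interval-based mode check (illustrative only).
# Intervals are semitones above the candidate tonic.
PENTATONIC_GONG = {0, 2, 4, 7, 9}   # gong, shang, jue, zhi, yu degrees

def detect_pentatonic_tonic(midi_pitches):
    """Return candidate tonic pitch classes whose interval set fits the Gong pattern."""
    pitch_classes = {p % 12 for p in midi_pitches}
    candidates = []
    for tonic in sorted(pitch_classes):
        intervals = {(pc - tonic) % 12 for pc in pitch_classes}
        if intervals <= PENTATONIC_GONG:      # all notes lie on the pentatonic degrees
            candidates.append(tonic)
    return candidates

# Example: a C-D-E-G-A melody yields tonic candidate 0 (C) under the Gong pattern.
print(detect_pentatonic_tonic([60, 62, 64, 67, 69, 72]))
```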
3.1.2 Model Settings
We adopted an architecture based on the MambaBlock2 module [18], with the model's hidden dimension set to 256 and the feedforward network's intermediate layer dimension set to 1024. Additionally, we employed a GatedMLP to enhance the model's nonlinear representation capabilities. During training, we used the Adam optimizer, with an initial learning rate dynamically adjusted via a LambdaLR scheduler. The training data was processed in batches of 8 samples, each with a fixed sequence length of 512 tokens. We used the cross-entropy loss function to measure the difference between the model's predictions and the target labels, defined as follows:
$$\mathcal{L} = \mathcal{L}_{\mathrm{event}} + \alpha\,\mathcal{L}_{\mathrm{type}} + \beta\,\mathcal{L}_{\mathrm{mode}} \qquad (4)$$

Here, $\mathcal{L}_{\mathrm{event}}$ focuses on the model's ability to accurately predict musical events, while $\mathcal{L}_{\mathrm{type}}$ and $\mathcal{L}_{\mathrm{mode}}$ correspond to note type events and mode-related events, respectively. The weights $\alpha$ and $\beta$ balance the contributions of the two auxiliary losses, with both values typically ranging between 0 and 1.
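A minimal sketch of this composite objective, under our notation assumptions for Eq. (4), is shown below.

```python
import torch.nn.functional as F

def musicmamba_loss(event_logits, event_targets,
                    type_logits, type_targets,
                    mode_logits, mode_targets,
                    alpha=0.5, beta=0.5):
    """Composite cross-entropy of Eq. (4): event loss plus weighted note-type
    and mode-event losses (alpha and beta assumed to lie in [0, 1])."""
    l_event = F.cross_entropy(event_logits, event_targets)
    l_type = F.cross_entropy(type_logits, type_targets)
    l_mode = F.cross_entropy(mode_logits, mode_targets)
    return l_event + alpha * l_type + beta * l_mode
```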
| Model | Average Groove Consistency (%) | Average Style Consistency (%) | Mode Consistency (%) | Coherence (subjective) | Richness (subjective) | Style (subjective) |
|---|---|---|---|---|---|---|
| MusicTransformer [5] | 42.3 | 65.4 | 37.5 | 7.48 | 7.59 | 6.55 |
| Mamba [18] | 44.3 | 73.0 | 62.1 | 7.73 | 7.43 | 7.51 |
| MusicTransformer [5] | 45.2 | 69.7 | 53.6 | 7.06 | 7.49 | 7.67 |
| MelodyT5 [10] | 51.8 | 67.9 | — | 7.21 | 7.65 | 6.86 |
| Ours | 59.9 | 85.3 | 66.1 | 7.91 | 7.72 | 8.26 |

Table 1: Objective metrics and subjective listening test results for each model.
3.2 Objective Evaluation
3.2.1 Metric
To evaluate our music generation model, we selected the following four objective metrics: Pitch Class Entropy, Groove Consistency [24, 25, 26], Style Consistency, and Mode Consistency.
- Pitch Class Entropy. This metric reflects the diversity of the pitch distribution: higher entropy indicates a more dispersed distribution of generated notes, while lower entropy indicates a more concentrated one.
- Groove Consistency. Higher groove consistency indicates less variation in rhythm, resulting in a smoother, more stable musical flow.
- Style Consistency. A higher style consistency score indicates that the generated music aligns more closely with the expected style.
- Mode Consistency. The consistency score is determined by calculating the overlap ratio between the sets of notes in the melody and scale tracks:

$$\text{Mode Consistency} = \frac{|N_{\text{melody}} \cap N_{\text{scale}}|}{|N_{\text{melody}}|} \qquad (5)$$

where $N_{\text{melody}}$ denotes the set of melody notes, $N_{\text{scale}}$ the set of scale notes, $|N_{\text{melody}} \cap N_{\text{scale}}|$ the size of their intersection, and $|N_{\text{melody}}|$ the size of the melody note set.
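A minimal sketch of this overlap ratio, under our own formulation of Eq. (5) using pitch-class sets, follows.

```python
def mode_consistency(melody_pitches, scale_pitches):
    """Eq. (5) sketch: |melody ∩ scale| / |melody| over pitch-class sets."""
    melody = {p % 12 for p in melody_pitches}
    scale = {p % 12 for p in scale_pitches}
    if not melody:
        return 0.0
    return len(melody & scale) / len(melody)

# Example: four of the five melody pitch classes lie on the scale -> 0.8.
print(mode_consistency([60, 62, 64, 66, 67], [60, 62, 64, 67, 69]))
```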
3.2.2 Results
Before presenting the objective metric results, we examined each model's ability to restore keys. As shown in Figure 4, MusicMamba not only effectively restores the keys in the original sequence but also introduces additional key changes, whereas MusicTransformer captures some keys but is less comprehensive and diverse. These results indicate that MusicMamba is better at generating melodies with traditional Chinese music styles and can produce richer, more consistent sequences.
We conducted two sets of comparative experiments using the MusicTransformer [5] and MelodyT5 [10] models as baselines. In each experiment, we randomly generated approximately 50 songs per model and calculated the objective metrics. When evaluating the quality of generated music, we consider values closer to those of the real data to be better. As shown in Table 2, our model's generated music is closest to the real values in terms of pitch class entropy, outperforming the other models. Our model also excels in style consistency and groove consistency. Notably, in terms of mode consistency, over 70% of the music generated by our model exhibits a detectable modal structure, and more than 60% performs well on mode consistency. These metrics are reported in Table 1.

3.3 Subjective Listening Test
To evaluate the quality of the music samples generated by the model, we designed a subjective listening test. We recruited 10 music enthusiasts from social networks, each of whom plays at least one musical instrument. Each participant was asked to listen to 10 generated audio samples. They rated the samples based on three criteria: coherence, richness, and style, with scores ranging from 0 to 10. In the subjective evaluation results, MusicMamba outperformed all baseline models in coherence, richness, and style, showing the best overall performance.
4 Conclusion
This paper proposes a new architecture that combines the long-range dependency modeling capability of the Mamba Block with the global structure capturing ability of the Transformer Block. We also designed the Bidirectional Mamba Fusion Layer to effectively integrate local and global information. By introducing the REMI-M representation, we were able to more accurately capture and generate modal features in Chinese traditional music. Experimental results show that the combination of REMI-M and MusicMamba more accurately reproduces and generates specific modes in Chinese traditional music, with the generated music outperforming traditional baseline models in terms of stylistic consistency and quality. Our research provides a new direction and technical foundation for exploring more complex modes in various types of ethnic music, as well as for generating melodies with distinctive styles through the incorporation of traditional instruments.
References
- [1] Shunit Haviv Hakimi, Nadav Bhonker, and Ran El-Yaniv, “Bebopnet: Deep neural models for personalized jazz improvisations,” in Proceedings of the 21st International Society for Music Information Retrieval Conference (ISMIR), 2020.
- [2] Adam Roberts, Jesse Engel, Colin Raffel, Curtis Hawthorne, and Douglas Eck, “A hierarchical latent vector model for learning long-term structure in music,” in International conference on machine learning. PMLR, 2018, pp. 4364–4373.
- [3] Daniel D Johnson, “Generating polyphonic music using tied parallel networks,” in International conference on evolutionary and biologically inspired music and art. Springer, 2017, pp. 128–143.
- [4] Ziyu Wang, Yiyi Zhang, Yixiao Zhang, Junyan Jiang, Ruihan Yang, Junbo Zhao, and Gus Xia, “Pianotree vae: Structured representation learning for polyphonic music,” International Society for Music Information Retrieval, 2020.
- [5] Cheng-Zhi Anna Huang, Ashish Vaswani, Jakob Uszkoreit, Noam Shazeer, Curtis Hawthorne, Andrew M. Dai, Matthew D. Hoffman, and Douglas Eck, “Music transformer,” CoRR, vol. abs/1809.04281, 2018.
- [6] Vincenzo Madaghiele, Pasquale Lisena, and Raphaël Troncy, “Mingus: Melodic improvisation neural generator using seq2seq.,” in ISMIR, 2021, pp. 412–419.
- [7] Yi Ren, Jinzheng He, Xu Tan, Tao Qin, Zhou Zhao, and Tie-Yan Liu, “Popmag: Pop music accompaniment generation,” in Proceedings of the 28th ACM international conference on multimedia, 2020, pp. 1198–1206.
- [8] Ning Zhang, “Learning adversarial transformer for symbolic music generation,” IEEE Transactions on Neural Networks and Learning Systems, vol. 34, no. 4, pp. 1754–1763, 2023.
- [9] Guo Zixun, Dimos Makris, and Dorien Herremans, “Hierarchical recurrent neural networks for conditional melody generation with long-term structure,” in 2021 International Joint Conference on Neural Networks (IJCNN), 2021, pp. 1–8.
- [10] Shangda Wu, Yashan Wang, Xiaobing Li, Feng Yu, and Maosong Sun, “Melodyt5: A unified score-to-score transformer for symbolic music processing,” 2024.
- [11] Hongyuan Zhu, Qi Liu, Nicholas Jing Yuan, Chuan Qin, Jiawei Li, Kun Zhang, Guang Zhou, Furu Wei, Yuanchun Xu, and Enhong Chen, “Xiaoice band: A melody and arrangement generation framework for pop music,” in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, New York, NY, USA, 2018, KDD ’18, p. 2837–2846, Association for Computing Machinery.
- [12] Ziyu Wang, Dingsu Wang, Yixiao Zhang, and Gus Xia, “Learning interpretable representation for controllable polyphonic music generation,” ISMIR, 2020.
- [13] Yu-Siang Huang and Yi-Hsuan Yang, “Pop music transformer: Beat-based modeling and generation of expressive pop piano compositions,” in Proceedings of the 28th ACM International Conference on Multimedia, New York, NY, USA, 2020, MM ’20, p. 1180–1188, Association for Computing Machinery.
- [14] Dimitri von Rütte, Luca Biggio, Yannic Kilcher, and Thomas Hofmann, “Figaro: Controllable music generation using learned and expert features,” Proc. Int. Conf. Learn. Representations, 2023.
- [15] Kejun Zhang, Xinda Wu, Tieyao Zhang, Zhijie Huang, Xu Tan, Qihao Liang, Songruoyao Wu, and Lingyun Sun, “Wuyun: Exploring hierarchical skeleton-guided melody generation using knowledge-enhanced deep learning,” arXiv preprint arXiv:2301.04488, 2023.
- [16] Albert Gu, Karan Goel, and Christopher Re, “Efficiently modeling long sequences with structured state spaces,” in International Conference on Learning Representations, 2021.
- [17] Jimmy TH Smith, Andrew Warrington, and Scott Linderman, “Simplified state space layers for sequence modeling,” in The Eleventh International Conference on Learning Representations, 2023.
- [18] Albert Gu and Tri Dao, “Mamba: Linear-time sequence modeling with selective state spaces,” arXiv preprint arXiv:2312.00752, 2023.
- [19] Yue Liu, Yunjie Tian, Yuzhong Zhao, Hongtian Yu, Lingxi Xie, Yaowei Wang, Qixiang Ye, and Yunfan Liu, “Vmamba: Visual state space model,” arXiv preprint arXiv:2401.10166, 2024.
- [20] Opher Lieber, Barak Lenz, Hofit Bata, Gal Cohen, Jhonathan Osin, Itay Dalmedigos, Erez Safahi, Shaked Meirom, Yonatan Belinkov, Shai Shalev-Shwartz, Omri Abend, Raz Alon, Tomer Asida, Amir Bergman, Roman Glozman, Michael Gokhman, Avashalom Manevich, Nir Ratner, Noam Rozen, Erez Shwartz, Mor Zusman, and Yoav Shoham, “Jamba: A hybrid transformer-mamba language model,” 2024.
- [21] Yu Zhang, Ziya Zhou, and Maosong Sun, “Influence of musical elements on the perception of ‘Chinese style’ in music,” Cognitive Computation and Systems, vol. 4, no. 2, pp. 147–164, 2022.
- [22] Wei Hao, “A comparative study of chinese and western music,” Highlights in Art and Design, vol. 3, no. 1, pp. 80–82, 2023.
- [23] Ziyu Wang*, Ke Chen*, Junyan Jiang, Yiyi Zhang, Maoran Xu, Shuqi Dai, Guxian Bin, and Gus Xia, “Pop909: A pop-song dataset for music arrangement generation,” in Proceedings of 21st International Conference on Music Information Retrieval, ISMIR, 2020.
- [24] Shih-Lun Wu and Yi-Hsuan Yang, “The jazz transformer on the front line: Exploring the shortcomings of ai-composed music through quantitative measures,” in Proc. ISMIR, 2020, vol. 2, p. 3.
- [25] Olof Mogren, “C-rnn-gan: A continuous recurrent neural network with adversarial training,” in Constructive Machine Learning Workshop (CML) at NIPS 2016, 2016, p. 1.
- [26] Bob L. Sturm, João Felipe Santos, Oded Ben-Tal, and Iryna Korshunova, “Music transcription modelling and composition using deep learning,” ArXiv, vol. abs/1604.08723, 2016.