Late-Breaking / Demo Session Extended Abstract, ISMIR 2024 Conference

MusicGen-Chord: Advancing Music Generation through Chord Progressions and Interactive Web-UI

Abstract

MusicGen is a music generation language model (LM) that can be conditioned on textual descriptions and melodic features. We introduce MusicGen-Chord (https://replicate.com/sakemin/musicgen-chord), which extends this capability by incorporating chord progression features. This model modifies one-hot encoded melody chroma vectors into multi-hot encoded chord chroma vectors, enabling the generation of music that reflects both chord progressions and textual descriptions. Furthermore, we developed MusicGen-Remixer (https://replicate.com/sakemin/musicgen-remixer), an application utilizing MusicGen-Chord to generate remixes of input music conditioned on textual descriptions. Both models are integrated into Replicate’s web-UI using cog, facilitating broad accessibility and user-friendly, controllable interaction for creating and experiencing AI-generated music.

Figure 1: (a) MusicGen’s melodic features in a matrix of one-hot encoded chroma vectors. (b) MusicGen-Chord’s chord progression features in a matrix of multi-hot encoded chroma vectors. For example, the chromagram in (b) shows an E♭ major chord (E♭, G, B♭), followed by a G major chord (G, B, D), followed by a C minor chord (C, E♭, G), and so on.

1 Introduction

The trend in generative AI emphasizes the controllability of models, allowing users to direct and refine outputs according to their preferences. Notable examples include Stable Diffusion [1, 2], supported by interfaces like AUTOMATIC1111’s web-UI [3] and ComfyUI [4], which offer extensive user control over image generation processes. In the realm of music generation, models with enhanced controllability are emerging, offering conditions such as chord [5, 6, 7], rhythm [5, 6, 7], melody [8, 6], and style based on reference audio [9]. This paper explores the integration of controllability in music generation through the example of MusicGen-Chord.

MusicGen [8] is an auto-regressive, Transformer-based music generation model that enables user control through textual descriptions and melodic features. It processes multiple streams of compressed discrete audio representations [10] to generate high-quality, coherent, and stylistically diverse music. MusicGen-Chord extends this model by conditioning on chord progressions instead of melodies. This modification uses a matrix of multi-hot encoded chroma vectors to represent chord progression features. MusicGen-Chord was released in October 2023, and since then, several similar but more advanced studies have been introduced, such as MusiConGen [7].

To demonstrate the practical benefits of this approach, we developed MusicGen-Remixer, an application based on MusicGen-Chord. This application allows users to upload a music track, provide a textual description prompt, and generate a new background track that is remixed with the input audio. By leveraging Replicate’s web-UI and the cog [11] package, MusicGen-Remixer and MusicGen-Chord are made widely accessible on the cloud, promoting user-friendly interaction and broad accessibility for creating and experiencing AI-generated music.

2 MusicGen-Chord

MusicGen-Chord extends the original MusicGen model by shifting the conditioning target from melodies to chord progressions. The original MusicGen model uses one-hot encoded chroma vectors as the conditioning input to represent melodies (Figure 1(a)). In this approach, each vector indicates the presence of a single pitch class at a given time, which is effective for simple melodies but limited in capturing complex harmonic content.

We found that this input format can be tweaked into a multi-hot format to represent chord conditions (Figure 1(b)). These multi-hot chroma vectors can encode multiple active pitch classes in each time frame, providing a more comprehensive representation of harmonic structures. This “trick” works surprisingly well: using the pretrained MusicGen model weights, without any fine-tuning, MusicGen-Chord generates chord progressions that align with the style indicated by the prompt.
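
To make the encoding concrete, the following minimal sketch builds a 12-dimensional multi-hot chroma vector from a chord symbol. The pitch-class table and chord-interval dictionary are illustrative assumptions covering a handful of chord types, not MusicGen-Chord's internal lookup tables.

```python
import numpy as np

# Pitch-class indices (C = 0, ..., B = 11); flat spellings only, for brevity.
PITCH_CLASSES = {"C": 0, "Db": 1, "D": 2, "Eb": 3, "E": 4, "F": 5,
                 "Gb": 6, "G": 7, "Ab": 8, "A": 9, "Bb": 10, "B": 11}

# Interval patterns (semitones above the root) for a few common chord types.
CHORD_INTERVALS = {"maj": (0, 4, 7), "min": (0, 3, 7),
                   "7": (0, 4, 7, 10), "maj7": (0, 4, 7, 11), "min7": (0, 3, 7, 10)}

def chord_to_chroma(root: str, chord_type: str) -> np.ndarray:
    """Return a 12-dimensional multi-hot chroma vector for one chord."""
    chroma = np.zeros(12, dtype=np.float32)
    root_idx = PITCH_CLASSES[root]
    for interval in CHORD_INTERVALS[chord_type]:
        chroma[(root_idx + interval) % 12] = 1.0
    return chroma

# An Eb major chord activates Eb, G and Bb, as in Figure 1(b).
print(chord_to_chroma("Eb", "maj"))  # 1s at indices 3, 7, 10
```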

The interface for MusicGen-Chord accepts chord progression inputs in both audio and text formats, automatically converting them into multi-hot chroma vectors for the model. This flexibility gives users detailed control over the harmonic structure, enabling a more interactive and customizable music creation experience. For text-based inputs, users can specify chords using a simple format that includes the root and type of each chord, defined by ROOT:TYPE [12]. Each chord lasts for a single bar, with the option to place multiple chords in a bar by separating them with commas. For example: "G:maj7 D:min7,G:7 C:maj7 F:7 B:min7,Bb:7 A:min7,D:7". These text inputs are converted into chroma representations based on the input BPM value. For audio-based inputs, a chord extraction model, BTC [13], is employed to predict symbolic chords along with their timestamps, which are then represented in the chord chroma feature, ensuring effective incorporation of the harmonic content into the music generation process.
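
As an illustration of the text-to-chroma conversion, the sketch below (building on the chord_to_chroma helper sketched above) expands a ROOT:TYPE progression into a frame-level chroma matrix. The frame rate, the 4/4 bar assumption, and the even split of a bar among comma-separated chords are assumptions made for this example, not documented model parameters.

```python
import numpy as np
# Reuses PITCH_CLASSES, CHORD_INTERVALS and chord_to_chroma from the sketch above.

def text_to_chroma_matrix(progression: str, bpm: float,
                          frame_rate: float = 50.0,
                          beats_per_bar: int = 4) -> np.ndarray:
    """Convert a 'ROOT:TYPE' progression string into a (frames x 12) chroma matrix.

    Space-separated tokens last one bar each; comma-separated chords share a bar.
    frame_rate, beats_per_bar and the even split of a bar are assumptions here.
    """
    seconds_per_bar = beats_per_bar * 60.0 / bpm
    frames = []
    for bar in progression.split():
        chords = bar.split(",")
        frames_per_chord = int(round(seconds_per_bar * frame_rate / len(chords)))
        for chord in chords:
            root, chord_type = chord.split(":")
            chroma = chord_to_chroma(root, chord_type)
            frames.append(np.tile(chroma, (frames_per_chord, 1)))
    return np.concatenate(frames, axis=0)

chromagram = text_to_chroma_matrix("G:maj7 D:min7,G:7 C:maj7 F:7", bpm=120)
print(chromagram.shape)  # (400, 12) at 120 BPM and 50 frames per second
```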

3 MusicGen-Remixer

MusicGen-Remixer utilizes the features of MusicGen-Chord to enable the creation of remixed music tracks. This application allows users to upload a music track, provide a textual description prompt, and generate a new background track that is remixed with the input audio.

The process involves several steps that together ensure the generation of coherent and contextually relevant remixes (https://github.com/sakemin/musicgen-remixer); a structural code sketch follows the list:

  1. Input Music Structure Analysis: Utilizing the All-in-One [14] framework, the input music’s BPM and downbeats are detected to maintain temporal integrity.

  2. Source Separation: A neural source separation model, Demucs [15], is employed to separate vocal tracks from instrumental components, ensuring the original vocal performance is preserved.

  3. Chord Progression Feature Extraction: BTC is used to extract chord progression features from the input audio, guiding the generation of the new background track.

  4. Dynamic Time Warping: Using Py-TSMod [16], the timing of the generated track is adjusted to match the downbeats of the input audio, ensuring rhythmic consistency.

  5. Mixing: The aligned background track is mixed with the separated vocal track to produce a cohesive remixed output.
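
The following structural sketch summarizes how these steps might fit together in code. Every helper it takes (analyze_structure, separate_vocals, extract_chords, time_warp, mix) and the generate callable are hypothetical placeholders standing in for the actual All-in-One, Demucs, BTC, Py-TSMod, MusicGen-Chord and mixing calls; they are not those libraries' real APIs.

```python
# Structural sketch of the MusicGen-Remixer pipeline. All callables passed in
# are hypothetical placeholders, not the real All-in-One / Demucs / BTC /
# Py-TSMod interfaces.
def remix(input_audio, prompt, generate,
          analyze_structure, separate_vocals, extract_chords, time_warp, mix):
    # 1. Structure analysis: detect BPM and downbeat positions (All-in-One).
    bpm, downbeats = analyze_structure(input_audio)

    # 2. Source separation: keep the original vocal performance (Demucs).
    vocals, _instrumental = separate_vocals(input_audio)

    # 3. Chord progression features as multi-hot chroma (BTC).
    chord_chroma = extract_chords(input_audio)

    # Generate a new background track conditioned on the prompt and chords.
    background = generate(prompt=prompt, chord_chroma=chord_chroma, bpm=bpm)

    # 4. Time-scale modification so downbeats line up with the input (Py-TSMod).
    aligned_background = time_warp(background, target_downbeats=downbeats)

    # 5. Mix the aligned background with the preserved vocal stem.
    return mix(vocals, aligned_background)
```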

Figure 2: Replicate’s web-UI demo of MusicGen-Chord.

4 Replicate Integration

Replicate’s web-UI, combined with the cog package, provides a seamless and convenient platform for deploying AI models. The cog package can encapsulate AI models with all their dependencies, including Python packages, operating system components, and CUDA versions. This integration ensures that models are portable and easily deployable, facilitating a user-friendly interface for model interaction and management.

MusicGen-Chord and MusicGen-Remixer are integrated with Replicate through the cog package, making them widely accessible on the cloud. Users can easily interact with these models via the web interface, the API, or directly using the cog-wrapped repository (https://github.com/sakemin/cog-musicgen-chord) on local machines. This setup ensures a straightforward and accessible experience for generating and remixing AI-driven music, enhancing both usability and adaptability.
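
As a usage illustration, a hosted model on Replicate can typically be invoked from Python with the replicate client, as in the hedged sketch below. The input field names (prompt, text_chords, bpm) are assumptions chosen for illustration; the authoritative input schema is listed on the model page, and a specific version hash may need to be appended to the model identifier.

```python
# Minimal sketch of calling the hosted model through Replicate's Python client.
# Requires the `replicate` package and a REPLICATE_API_TOKEN environment variable.
# The input keys below are illustrative assumptions; see the model page
# (https://replicate.com/sakemin/musicgen-chord) for the actual schema.
import replicate

output = replicate.run(
    "sakemin/musicgen-chord",  # a version hash may need to be appended
    input={
        "prompt": "lofi hip hop with warm Rhodes piano",  # textual description
        "text_chords": "C:maj A:min F:maj G:maj",         # ROOT:TYPE chords, one bar each
        "bpm": 90,                                         # used to place chords in time
    },
)
print(output)  # URL(s) of the generated audio
```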

Replicate offers a practical solution for sharing AI model demonstrations within the MIR community. Researchers and developers can utilize Replicate as an effective tool for presenting and disseminating their work, as demonstrated by the Music Technology Group (MTG) (https://replicate.com/mtg) and others who have successfully used this platform. For example, we implemented MusiConGen [7], a recent MusicGen variant with controllable chord and rhythm features, as a cog-wrapped demo (https://replicate.com/sakemin/musicongen).

References

  • [1] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, “High-resolution image synthesis with latent diffusion models,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022. IEEE, 2022, pp. 10674–10685.
  • [2] D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rombach, “SDXL: improving latent diffusion models for high-resolution image synthesis,” in The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024, 2024.
  • [3] AUTOMATIC1111, “Stable Diffusion Web UI,” https://github.com/AUTOMATIC1111/stable-diffusion-webui, Aug. 2022.
  • [4] comfyanonymous, “ComfyUI,” https://github.com/comfyanonymous/ComfyUI.
  • [5] L. Lin, G. Xia, J. Jiang, and Y. Zhang, “Content-based controls for music large language modeling,” 2024. [Online]. Available: https://arxiv.org/abs/2310.17162
  • [6] O. Tal, A. Ziv, I. Gat, F. Kreuk, and Y. Adi, “Joint audio and symbolic conditioning for temporally controlled text-to-music generation,” 2024. [Online]. Available: https://arxiv.org/abs/2406.10970
  • [7] Y.-H. Lan, W.-Y. Hsiao, H.-C. Cheng, and Y.-H. Yang, “Musicongen: Rhythm and chord control for transformer-based text-to-music generation,” 2024. [Online]. Available: https://arxiv.org/abs/2407.15060
  • [8] J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Synnaeve, Y. Adi, and A. Défossez, “Simple and controllable music generation,” in Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023.
  • [9] S. Rouard, Y. Adi, J. Copet, A. Roebel, and A. Défossez, “Audio conditioning for music generation via discrete bottleneck features,” 2024. [Online]. Available: https://arxiv.org/abs/2407.12563
  • [10] A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, “High fidelity neural audio compression,” Transactions on Machine Learning Research, 2023. Featured Certification, Reproducibility Certification.
  • [11] Replicate, “cog,” https://github.com/replicate/cog.
  • [12] C. Harte, “Towards automatic extraction of harmony information from music signals,” Ph.D. dissertation, Queen Mary University of London, London, UK, August 2010.
  • [13] J. Park, K. Choi, S. Jeon, D. Kim, and J. Park, “A bi-directional transformer for musical chord recognition,” in Proceedings of the 20th International Society for Music Information Retrieval Conference, ISMIR 2019, Delft, The Netherlands, November 4-8, 2019, 2019, pp. 620–627.
  • [14] T. Kim and J. Nam, “All-in-one metrical and functional structure analysis with neighborhood attentions on demixed audio,” 2023 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, WASPAA, pp. 1–5, 2023.
  • [15] S. Rouard, F. Massa, and A. Défossez, “Hybrid transformers for music source separation,” in IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2023, Rhodes Island, Greece, June 4-10, 2023.   IEEE, 2023, pp. 1–5.
  • [16] S. Yong, S. Choi, and J. Nam, “PyTSMod: A Python Implementation of Time-Scale Modification Algorithms,” in Extended Abstracts for the Late-Breaking Demo Session of the 21st International Society for Music Information Retrieval Conference, ISMIR 2020, Montréal, Canada, 2020.