Mini-Omni2: Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities
Abstract
GPT-4o, an all-encompassing model, represents a milestone in the development of large multi-modal language models. It can understand visual, auditory, and textual modalities, directly output audio, and support flexible duplex interaction. Models from the open-source community often achieve some of GPT-4o's functionalities, such as visual understanding and voice chat. Nevertheless, training a unified model that incorporates all modalities is challenging due to the complexities of multi-modal data, intricate model architectures, and training processes. In this paper, we introduce Mini-Omni2, a visual-audio assistant capable of providing real-time, end-to-end voice responses to vision and audio queries. By integrating pretrained visual and auditory encoders, Mini-Omni2 maintains performance in individual modalities. We propose a three-stage training process to align modalities, allowing the language model to handle multi-modal inputs and outputs after training on a limited dataset. For interaction, we introduce a command-based interruption mechanism, enabling more flexible interaction with users. To the best of our knowledge, Mini-Omni2 is one of the closest reproductions of GPT-4o in terms of functionality, and we hope it can offer valuable insights for subsequent research.

1 Introduction
GPT-4o(Openai, 2024a) represents a milestone in the development of multi-modal large language models, particularly evident in three aspects: (1) its powerful capabilities in multi-modal question answering; (2) its ability to transcend traditional text-based input and output, enabling the understanding and generation of multi-modal content; and (3) its flexible interaction mode with interruption mechanisms, which facilitates a more natural and fluid human-computer interaction. However, the GPT-4o model is not open-sourced to the public, and its technical specifications remain undisclosed. To date, mainstream methods predominantly employ various pre-trained encoders to obtain textual outputs for specific modalities, such as visual and audio inputs, and cascade a text-to-speech (TTS) module to replicate GPT-4o's speech output, thereby simulating its multi-modal functionalities. Achieving end-to-end multi-modal understanding and output remains a challenging task.
Recently, as the capabilities of language models such as Llama3.2(meta, 2024) continue to expand, researchers have begun to explore multi-modal approaches to achieve the performance of GPT-4o. However, these efforts often focus only on specific functionalities of GPT-4o, such as vision-text understanding (LLava(Liu et al., 2024), Flamingo(Alayrac et al., 2022)), audio comprehension (Qwen2-audio(Chu et al., 2024)), multi-modal understanding (VITA(Fu et al., 2024)), and speech-to-speech dialogue (Mini-Omni(Xie and Wu, 2024), Llama-Omni(Fang et al., 2024), Moshi(Défossez et al., 2024)). Integrating the text, vision, and speech modalities in a single model remains challenging.
In our view, the current challenges in achieving interaction across three modalities involve the following aspects: (1) model capability — GPT-4o requires a unified model that comprehensively understands all modalities while maintaining robust performance across a wide range of tasks; (2) direct inference output capabilities in multi-modal contexts — our recent work Mini-Omni(Xie and Wu, 2024) addressed how to enhance the model’s streaming output abilities in audio, laying the groundwork for Mini-Omni2’s voice interaction capabilities; (3) substantial data requirements — training for GPT-4o necessitates the integration of data across visual, audio, and textual modalities, with quantities increasing exponentially compared to previous efforts; and (4) the design of flexible interaction methods — GPT-4o’s full-duplex capability is also a notable feature.
In this paper, we introduce Mini-Omni2 as a continuation of Mini-Omni, employing a single model to simulate, end to end, the visual, speech, and textual capabilities of GPT-4o, enhanced by a unique command-based interruption mechanism. Consistent with Mini-Omni, we retain Qwen2(Yang et al., 2024) as the foundational model, leveraging this compact architecture to achieve comprehensive multi-modal understanding and real-time streaming speech inference across the three modalities. Furthermore, we enable the model to receive external audio inputs in real time, simulating its "auditory" perception and controlling the speech output stream based on content semantics. The model architecture of Mini-Omni2 is illustrated in Figure 1. As an end-to-end model, we enhance data utilization efficiency and demonstrate the generalizability of the Mini-Omni2 algorithm by directly employing the classic pre-trained visual encoder CLIP(Radford et al., 2021) and the encoder component of the speech recognition model Whisper(Radford et al., 2023) as feature extractors for visual and audio inputs. The features from the pre-trained encoders and the text embeddings are concatenated to form the model’s input. Due to challenges related to understanding capabilities, we did not adopt a token-in-token-out paradigm for audio. Moreover, by utilizing a delayed parallel output approach for text and audio, the model can respond instantly with audio, similar to GPT-4o.
In Mini-Omni2, we propose an efficient training approach based on a limited amount of data, aiming to enable the model’s training methods to assist other multi-modal models in modality expansion. Thus, we avoided blindly expanding the dataset exponentially and instead sought to develop a multi-modal extension method using minimal new data. We employed a three-phase training process for modality expansion, alignment, and joint training. Initially, the Mini-Omni2 model underwent adapter training using speech recognition and image caption datasets, thereby broadening the scope of multi-modal understanding. Next, Mini-Omni2 was trained for text output in question-answering tasks across modalities, allowing the adapter-based output features to align with text embeddings for effective question answering. In the third phase, we focused on multi-modal output capability by incorporating audio output and training for auditory capabilities such as interruption.

With respect to the model’s capabilities in voice interaction, Mini-Omni2 continues to utilize the SNAC tokenizer(Siuzdak, 2024) to ensure high-quality speech output. However, based on our observations, we believe that current full-duplex training is still not sufficiently stable. Therefore, we contend that interruptions based on the semantics of the input are essential for achieving stable and flexible human-machine interaction. We enable the model to encode its received "auditory" waveform in real time using SNAC, generating tokens that allow it to control its own output during generation. As a demonstration, we construct data using the phrase "Stop Omni", employing frame-level irq and n-irq special tokens to control the generation process.
To evaluate the multi-modal interaction capabilities of Mini-Omni2, we first empirically tested its performance on traditional visual and auditory tasks, verifying that the model remains consistent with the original models on basic tasks such as image captioning and speech recognition. Next, we conducted a series of additional experiments to test the model’s response speed and performed several case studies.
In summary, we make the following contributions:
- We introduce Mini-Omni2, the first open-source multi-modal language model with capabilities in vision, speech, and text, together with an auditory interruption mechanism. To the best of our knowledge, it is one of the end-to-end models functionally closest to GPT-4o. Figure 2 shows a demo of the model as a visual voice assistant.
- We propose a novel training pipeline based on the modality expansion method of the earlier Mini-Omni. This pipeline encompasses three training phases, allowing the text model to first align its responses to multi-modal inputs and ultimately extend its outputs to the speech modality in the final phase, employing a delayed parallel generation algorithm for real-time speech output.
- We explore a command-based interruption method, utilizing streaming tokens as input and constructing training data that enables the model to control its audio output stream based on external semantic cues. All synthetic data will be open-sourced.
2 Related Work
Large Vision Language Models Vision-language models are developing rapidly, and vision was among the first modalities to be combined with large language models. The foundational work began with CLIP(Radford et al., 2021), which is also used as the vision encoder in our work. Subsequent works typically employ a vision encoder, an adapter as an intermediate layer, and a large language model as the architecture, enabling the LLM to understand and reason about visual inputs. Classic works include BLIP(Li et al., 2022), BLIP2(Li et al., 2022), Llava(Liu et al., 2024), Qwen-VL(Bai et al., 2023), Qwen2-VL(Wang et al., 2024), InstructBLIP(Dai et al., 2023), MiniGPT-4(Zhu et al., 2023), GPT-4V(Openai, 2024b) from OpenAI, Gemini(Google, 2024) from Google, and Llama-3.2(meta, 2024) from Meta. Researchers are also exploring other directions, such as higher-resolution vision encoders, as in InternLM-XComposer2-4KHD(Dong et al., 2024), and MoE-style architectures, as in CogVLM(Wang et al., 2023b). The method used in this paper is the most classical one, similar to Llava(Liu et al., 2024).
Audio Language Modeling With the further development of large multi-modal models, speech signals have also been discretized into tokens, enabling understanding and reasoning in a manner similar to text models. Important works include speech synthesis models like VALL-E(Wang et al., 2023a), music generation models like MusicGen(Copet et al., 2024), as well as voice interaction works like AudioPaLM(Rubenstein et al., 2023) and LauraGPT(Chen et al., 2023). Just recently, researchers have explored methods for speech-to-speech interaction, with works such as Mini-Omni(Xie and Wu, 2024), Llama-Omni(Fang et al., 2024), and Moshi(Défossez et al., 2024). Speech tokenization is also an important direction for generating stable and information-rich tokens, with recent works like Speechtokenizer(Zhang et al., 2023b), Google USM(Zhang et al., 2023c), and EnCodec(Défossez et al., 2022).
Multi-modal Interaction Model With the emergence of GPT-4o, researchers have begun working on end-to-end multi-modal models for voice chat. Early works include Spectron(Nachmani et al., 2023) and SpeechGPT(Zhang et al., 2023a), which use the A-T-T-A method to achieve speech-in and speech-out in an end-to-end manner. Mini-Omni(Xie and Wu, 2024) introduced a method for parallel generation of text and audio, enabling the model to directly start reasoning in audio. Both Moshi(Défossez et al., 2024) and Llama-Omni(Fang et al., 2024) used similar approaches. LSLM(Ma et al., 2024) and Moshi explored the full duplex interaction capability by combining the speaking and listening signals as input. VITA(Fu et al., 2024) can understand all modalities but only outputs text. The AnyGPT(Zhan et al., 2024) project aims to achieve full multi-modal understanding and generation. This work is a continuation of Mini-Omni, aiming to realize multi-modal input and low-latency parallel speech-text output with duplex capability.
3 Mini-Omni2
The model architecture of Mini-Omni2 is illustrated in Figure 1. In addition to the text embedding module, Mini-Omni2 employs the visual component of CLIP and Whisper-small as encoders for the visual and auditory modalities, resulting in highly efficient data utilization during training and minimizing extensive pre-training efforts. Additionally, Mini-Omni2 features real-time duplex capability, providing greater flexibility in model interactions. Section 3.1 discusses the model architecture, Section 3.2 presents the modeling methods for the input and output streams, and Sections 3.3 and 3.4 detail the training methods and the interruption mechanism, respectively.

3.1 Architecture
Visual Encoder - We utilize the visual component of CLIP, specifically the ViT-B/32 model, as the visual encoder, which converts an incoming image into a feature sequence of length 49 for the image patches plus a global semantic feature. Mini-Omni2 concatenates these into a raw feature sequence of length 50 and employs a single-layer LlamaMLP(Touvron et al., 2023) as the vision adapter.
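The snippet below is a minimal sketch, not the release code, of how this visual branch can be assembled: CLIP ViT-B/32 yields one global (CLS) token plus 49 patch tokens, and a single Llama-style gated MLP projects them into the language model's embedding space. The class name `VisionAdapter`, the output width of 896 (assumed from the Qwen2-0.5B configuration), and the use of the Hugging Face `transformers` CLIP implementation are assumptions; the intermediate size of 4,864 follows Section 4.2.

```python
import torch
import torch.nn as nn
from transformers import CLIPVisionModel

class VisionAdapter(nn.Module):
    """Single-layer Llama-style gated MLP projecting CLIP features into the LM space."""
    def __init__(self, in_dim=768, hidden_dim=4864, out_dim=896):
        super().__init__()
        self.gate = nn.Linear(in_dim, hidden_dim, bias=False)
        self.up = nn.Linear(in_dim, hidden_dim, bias=False)
        self.down = nn.Linear(hidden_dim, out_dim, bias=False)

    def forward(self, x):
        return self.down(nn.functional.silu(self.gate(x)) * self.up(x))

vision_encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-base-patch32")
adapter = VisionAdapter()

pixels = torch.randn(1, 3, 224, 224)                  # stand-in for a preprocessed image
with torch.no_grad():
    feats = vision_encoder(pixels).last_hidden_state  # (1, 50, 768): CLS token + 49 patches
vision_tokens = adapter(feats)                        # (1, 50, 896): ready to join the text embeddings
```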
Audio Encoder - In the encoder section, we continue our previous work by using the Whisper-small model as the audio encoder. We opted not to adopt a token-in-token-out modeling approach for audio input and output for two reasons. (i) Strong semantic alignment in speech recognition. The Whisper model, proposed by OpenAI, is trained on thousands of hours of data and demonstrates exceptional robustness. Furthermore, we unexpectedly found that Mini-Omni exhibits an understanding of Chinese, despite not being trained on any Chinese data. We believe this is because the Whisper encoder automatically aligns audio of different languages, tones, and noise levels that conveys the same meaning, thereby enabling the model to focus on the user’s intention. (ii) Unstable open-source audio tokens. We observed that (a) the audio loss of Mini-Omni2 remains high during training, and (b) the tokens for a segment of audio can vary significantly depending on the content at both ends. We argue that such tokens are insufficient for reliably conveying the content of speech input, as evidenced by the poor ASR performance of token-based approaches compared to semantic features such as Whisper’s.
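As a hedged illustration of the audio branch, under the same assumptions as the vision sketch above, the Whisper-small encoder can turn a raw waveform into continuous semantic features, which an audio adapter analogous to `VisionAdapter` then projects into the LM space. The Hugging Face `transformers` API shown here is an assumption about tooling, not a statement about the authors' implementation.

```python
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

feature_extractor = WhisperFeatureExtractor.from_pretrained("openai/whisper-small")
whisper = WhisperModel.from_pretrained("openai/whisper-small")

waveform = torch.zeros(16000 * 5)                     # stand-in for 5 s of 16 kHz speech
inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    audio_feats = whisper.get_encoder()(inputs.input_features).last_hidden_state
# audio_feats: (1, 1500, 768) continuous semantic features; an audio adapter maps
# them to the language model's hidden size, mirroring the vision adapter above.
```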
Language Model - Mini-Omni2 uses the Qwen2-0.5B base version as its foundational language model. We have ported the Llama-based Qwen2 model using the LitGPT(AI, 2023) training framework, employing the configuration of the 0.5B model as the base language model. For the parallel generation of the multi-layer codebook shown in Figure 3, we expanded the vocabulary of the Qwen2 model by adding 7 × 4160 sub-LM-heads, as illustrated in Figure 4, resulting in a vocabulary size of 181,120.
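A minimal sketch of what the expanded output layer could look like, assuming the sub-LM-heads are plain linear projections over the LM hidden states: one head for the text vocabulary and seven heads of 4,160 entries each for the SNAC codebook layers. The hidden width of 896 and the text vocabulary size shown below are assumptions based on the Qwen2-0.5B configuration, and the internal split of the 4,160 entries into audio codes plus special tokens is likewise an assumption.

```python
import torch
import torch.nn as nn

class ParallelLMHeads(nn.Module):
    """Text head plus 7 audio sub-heads decoded in parallel at every step."""
    def __init__(self, hidden=896, text_vocab=151936, audio_vocab=4160, n_audio=7):
        super().__init__()
        self.text_head = nn.Linear(hidden, text_vocab, bias=False)
        self.audio_heads = nn.ModuleList(
            nn.Linear(hidden, audio_vocab, bias=False) for _ in range(n_audio)
        )

    def forward(self, h):
        # h: (batch, seq, hidden) -> text logits and a list of 7 audio logit tensors
        return self.text_head(h), [head(h) for head in self.audio_heads]

heads = ParallelLMHeads()
text_logits, audio_logits = heads(torch.randn(1, 10, 896))
```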

3.2 Multimodal Language Modeling
Multimodal Modeling - Consider $Y = \{y_i \in \mathcal{V}_{txt} \mid i = 1, \ldots, t_{txt}\}$ as a text utterance from a vocabulary $\mathcal{V}_{txt}$ with length $t_{txt}$. The probability of $Y$ can be expressed as $p(Y) = \prod_{i=1}^{t_{txt}} p(y_i \mid y_1, \ldots, y_{i-1})$. Now, when dealing with a continuous speech signal, we can convert it into discrete speech tokens (dst), represented as $D = \{d_i \in \mathcal{V}_{dst} \mid i = 1, \ldots, t_{dst}\}$, using an audio tokenizer. In this context, $\mathcal{V}_{dst}$ is the vocabulary of discrete speech tokens. These discrete speech tokens can be treated as spoken language within $\mathcal{V}_{dst}$ and modeled in a manner similar to text. We combine text and speech in a new vocabulary $\mathcal{V}_{voxt} = \mathcal{V}_{txt} \cup \mathcal{V}_{dst}$. Additionally, we introduce visual features $V$, which represent the continuous features extracted from the image. Therefore, we can model the probability of both speech and text tokens $Z = \{z_i \in \mathcal{V}_{voxt} \mid i = 1, \ldots, t_Z\}$. This probability is expressed as $p(Z \mid X) = \prod_{i=1}^{t_Z} p(z_i \mid z_1, \ldots, z_{i-1}, X)$, where the condition $X$ represents discrete speech tokens $D$, text tokens $Y$, continuous visual features $V$, or various combinations of $D$, $Y$, and $V$. For the audio and text tokens generated simultaneously, the negative log-likelihood loss can be formulated as in Equation (1).

$$\mathcal{L}(T, A \mid C) = -\sum_{j=1}^{m} \sum_{i=1}^{n_j} \log p\left(T_{i,j}, A_{i,j} \mid T_{<i,j}, A_{<i,j}, X_j\right) \qquad (1)$$

where $(T_j, A_j)$ are the text-audio output pairs in the training corpus $C$, $m$ is the number of training examples, $X_j$ is the input condition of the $j$-th example, $n_j$ is the maximum number of tokens of samples $T_j$ and $A_j$, and $T_{i,j}$ and $A_{i,j}$ represent the $i$-th text token and audio token of the $j$-th sample.
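As a hedged illustration of Equation (1), the joint loss can be computed as a sum of cross-entropy terms over the text stream and each of the seven audio codebook streams; the tensor shapes and padding convention below are assumptions rather than the released training code.

```python
import torch.nn.functional as F

def joint_nll(text_logits, text_targets, audio_logits, audio_targets, pad_id=-100):
    """Sum of cross-entropy terms over the text stream and each audio codebook stream.

    text_logits: (B, T, V_text), text_targets: (B, T)
    audio_logits/audio_targets: lists of 7 tensors shaped (B, T, V_audio) / (B, T)
    """
    loss = F.cross_entropy(text_logits.transpose(1, 2), text_targets, ignore_index=pad_id)
    for logits, targets in zip(audio_logits, audio_targets):   # 7 SNAC layers
        loss = loss + F.cross_entropy(logits.transpose(1, 2), targets, ignore_index=pad_id)
    return loss
```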
Multi-modal Token-Mixed Input - The modeling of the input and output tokens for some of the model’s main tasks is illustrated in Figure 3. Since the model incorporates multiple LM-heads, it generates multiple sequences in an auto-regressive manner and, accordingly, also takes multiple sequences as input. The input can mix anywhere from one to three modalities. In this subsection, we discuss the methods for modality mixing.
- Visual-[Audio|Text] Input - Our experiments indicate that the Transformer architecture is easier to train and generates more natural responses when auto-regressive tasks are connected with semantic information. Therefore, as shown in Figure 3 (a), we first place the visual features processed by the vision adapter, followed by the Whisper features processed by the audio adapter. Finally, at the position where a response needs to be generated auto-regressively, we place a special response token. The total length is approximately 50 (the CLIP feature length) plus the Whisper feature length. A minimal sketch of this input layout is given after this list.
- Single Modality Input - Single-modal inputs may consist of visual, speech, or text features. We place the features of the visual and audio modalities across layers 1 to 7, replicating them to enhance their prominence when averaged across all layer features. Notably, when only a single modality’s features are input without the control of a special token, the default tasks are image captioning, speech-to-text question answering, and text-to-text question answering.
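The following is a minimal sketch, under the same assumptions as the earlier snippets, of how a visual-audio input sequence from Figure 3 (a) could be assembled: vision-adapter tokens first, then audio-adapter tokens, then the embedding of a special response token at which auto-regressive generation begins. The token id and dimensions are illustrative.

```python
import torch
import torch.nn as nn

def build_multimodal_input(vision_tokens: torch.Tensor,
                           audio_tokens: torch.Tensor,
                           embedding: nn.Embedding,
                           response_token_id: int) -> torch.Tensor:
    """Concatenate adapter outputs with the special response-token embedding.

    vision_tokens: (1, 50, d)  -- CLIP features after the vision adapter
    audio_tokens:  (1, L, d)   -- Whisper features after the audio adapter
    """
    response = embedding(torch.tensor([[response_token_id]]))   # (1, 1, d)
    return torch.cat([vision_tokens, audio_tokens, response], dim=1)

# Illustrative usage with toy shapes (d = 896 assumed for Qwen2-0.5B).
emb = nn.Embedding(181120, 896)
seq = build_multimodal_input(torch.randn(1, 50, 896), torch.randn(1, 120, 896), emb, 151937)
print(seq.shape)   # torch.Size([1, 171, 896])
```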

Text-Audio Parallel Decoding - In Mini-Omni2, we essentially retain the output strategy of Mini-Omni, employing the Text-Instruct Delay Parallel Decoding algorithm to enhance audio generation. This approach uses text-audio parallel decoding to generate audio and text tokens simultaneously, leveraging text-to-speech synthesis for real-time output. We continue the parallel generation method introduced by MusicGen(Copet et al., 2024), utilizing SNAC as the audio encoder, which comprises seven complementary token layers. In a single step, we generate eight tokens, including text, while maintaining a one-step delay between layers. Furthermore, we incorporate a batch approach that involves two samples: one requiring both text and audio responses and the other requiring a text-only response. By discarding the text token of the first sample and embedding the output of the second sample into the first, we effectively transfer the model’s text-based capabilities to audio tasks, significantly enhancing reasoning abilities with minimal resource overhead. Detailed technical explanations are provided in Mini-Omni(Xie and Wu, 2024).
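Below is a hedged sketch of this delayed parallel layout: the text stream leads, and each of the seven SNAC codebook layers is shifted one additional step to the right so that all eight tokens of a step can be predicted in parallel. The padding token id and tensor orientation are assumptions, not the released implementation.

```python
import torch

def delay_pattern(text_ids: torch.Tensor, audio_codes: torch.Tensor,
                  pad_id: int = 0, n_layers: int = 7) -> torch.Tensor:
    """Arrange one text row and 7 SNAC rows so that layer k lags the text by k+1 steps."""
    T = text_ids.shape[0]
    rows = [text_ids]
    for k in range(n_layers):
        shifted = torch.full((T,), pad_id, dtype=audio_codes.dtype)
        keep = max(T - (k + 1), 0)                   # tokens that still fit after the delay
        shifted[k + 1: k + 1 + keep] = audio_codes[k, :keep]
        rows.append(shifted)
    return torch.stack(rows)                          # (8, T): row 0 is text, rows 1-7 delayed audio

text = torch.arange(1, 11)                    # 10 text tokens (toy example)
codes = torch.randint(0, 4096, (7, 10))       # 7 SNAC codebook layers
print(delay_pattern(text, codes).shape)       # torch.Size([8, 10])
```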
Overall, we have introduced our modeling approach for three-modal inputs and two-modal outputs within a single model. Through these methods, the model can accomplish eight reasonable multi-modal tasks, with some of the primary tasks illustrated in Figure 3, showcasing all the multilayer tokens generated during a single inference.
3.3 Training Strategies
In this section, we introduce the training phases of the Mini-Omni2 model. The overall training process of Mini-Omni2 is illustrated in Figure 5. The training process is divided into three stages, with multitask training employed in each stage. In the figure, except for Stage 1, a foundational text-to-text task is additionally incorporated but not explicitly depicted. We categorize the entire training process into three stages:
- Multimodal Encoder Adaptation - In the first stage, we employ rapid, small-scale training focused solely on the weights of the linear layers connecting the language model and the encoders. The objective of Stage 1 is to ensure that the multi-modal features received by the model closely resemble the representations of text tokens in the model’s embedding layer. We believe this approach offers two primary advantages: (1) it allows the model to concentrate on logical reasoning in modality-specific question answering during subsequent training, and (2) it minimizes the parameter changes in the language model’s core that would otherwise result from adapting to other modalities.
- Modality Alignment - In Stage 2, the primary task is to transfer the text-based question-answering ability to question answering based on images and audio. In this step, the adapters trained in Stage 1 are temporarily frozen, and the weights of the language model participate in training. At this stage, no tasks involve audio responses: for image-based and audio-based QA, only text responses are generated, establishing the model’s foundational logical capabilities. The speech output is simply an extension of this logical ability into other modalities.
- Post Training - In Stage 3, the task is to extend the output modality to include audio response generation. As shown in Figure 5, the model is trained on all tasks from Stage 1 and Stage 2, with audio token outputs for all question-answering tasks. Additionally, the model learns the interruption mechanism, an algorithm introduced in the next section. A sketch of the per-stage parameter schedule is given after this list.
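A hedged sketch of the three-stage schedule described above; module names such as `vision_adapter`, `audio_adapter`, and `audio_heads` are illustrative rather than taken from the released code, and treating Stage 3 as joint training of all components is an assumption consistent with, but not spelled out by, the description.

```python
import torch.nn as nn

def set_trainable(model: nn.Module, stage: int) -> None:
    """Freeze/unfreeze parameter groups according to the training stage."""
    if stage == 1:      # encoder adaptation: only the vision/audio adapters are trained
        trainable = ("vision_adapter", "audio_adapter")
    elif stage == 2:    # modality alignment: adapters frozen, language model trained
        trainable = ("language_model",)
    else:               # post training: assume all components are trained jointly
        trainable = ("language_model", "vision_adapter", "audio_adapter", "audio_heads")
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(prefix) for prefix in trainable)
```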
3.4 Duplex Interaction
A real-time conversational model needs duplex capability in order to enable more flexible interactions. However, this interruption mechanism should not be a simple VAD (voice activity detection)-based one, but rather a system that can determine whether the user intends to interrupt the model. Additionally, the model should be highly robust, capable of handling various external situations (e.g., noise, other conversations, and unrelated sounds). We explore this functionality with a command-based task, in which the model stops talking immediately when the user says "Stop Omni". Furthermore, this approach can be naturally extended to more sophisticated semantic interruption mechanisms through the construction of more contextually appropriate interruption datasets.
Background Noise Selection: (1) We randomly selected a variety of speech samples from the LibriTTS dataset as human-voice noise samples. (2) We employed samples from the MUSAN(Snyder et al., 2015) dataset, which includes music, human voices, white noise, and urban noise.
Semantic Interruption Construction: We synthesized "Stop Omni" phrases with random voice timbres, which were subsequently mixed with noise. The specific data construction methods are introduced in the next section.
Combining the aforementioned data, the model receives long sequences containing "Stop Omni" phrases amidst various noises. The model generates two types of state tokens in real time: irq and n-irq, representing the user’s intention to interrupt and not to interrupt, respectively. During inference, when the model outputs the irq token, it stops the generation process and starts listening to the new question. For this task, we use streaming audio tokens as input to enhance the model’s real-time processing capabilities.
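The control logic can be pictured as follows; this is a hedged sketch in which `model.step`, the token strings, and the streaming interface are hypothetical stand-ins for the actual inference loop.

```python
IRQ, N_IRQ = "<irq>", "<n-irq>"   # illustrative state-token names

def speak_with_interruption(model, listen_stream, state):
    """Stream spoken output while monitoring incoming audio tokens for an interrupt."""
    for heard_tokens in listen_stream:             # real-time SNAC tokens of what the model "hears"
        flag, out_tokens = model.step(heard_tokens, state)
        if flag == IRQ:                            # the user said something like "Stop Omni"
            break                                  # stop talking and wait for the new question
        yield out_tokens                           # otherwise keep streaming the spoken reply
```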
4 Data and Evaluation
In this section, we introduce the data used for training Mini-Omni2 and present some initial evaluation results. We provide a more detailed explanation of the composition and construction process of the data for each modality. In the experimental results section, we only showcase a few application cases and basic capability assessments. More comprehensive experiments on text and vision tasks will be released shortly.
4.1 Datasets
The training data for the Mini-Omni2 model is primarily sourced from five components, as shown in Table 1. (1) Textual Question-Answering Data: Throughout all training stages, whenever the language model weights were unfrozen for training, textual question-answering data was included to maintain the model’s reasoning ability. We used the first 1.5 million question-answer pairs from the Open-Orca dataset. (2) Speech Recognition Data: Speech recognition data was used to continuously maintain the model’s semantic understanding of external spoken input. We primarily utilized the LibriTTS, VCTK, and Multilingual LibriSpeech datasets. (3) Spoken Question-Answering Data: We did not use a standalone spoken dataset; instead, synthetic data was employed for training. The spoken question-answering data was derived from the Moss-002-sft dataset. (4) Image Question-Answering Data: We used 400,000 samples (caption and instruction) from the ALLaVA-4V dataset. (5) Voice Assistant Data: To make the model’s responses more aligned with the style of voice assistants, we continuously used the VoiceAssistant-400K dataset introduced in Mini-Omni.
Table 1: Datasets used for training Mini-Omni2.

Task | Stages | Dataset | Modality | Items
---|---|---|---|---
ASR | 1,2,3 | LibriTTS (Zen et al., 2019) | A1, T1 | 586 h
ASR | 1,2,3 | VCTK (datashare, 2024) | A1, T1 | 44 h
ASR | 1,2,3 | Multilingual LibriSpeech (Pratap et al., 2020) | A1, T1 | 8000 h
Text QA | 2,3 | Open-Orca (OpenOrca) | T1, T2 | 2000K
Audio QA | 2,3 | Moss-002-sft-data (Sun et al., 2024) | A1, T1, A2, T2 | 1500K
Visual QA | 2,3 | ALLaVA-4V (Sun et al., 2024) | V, A1, T1, A2, T2 | 800K
Voice QA | final | Alpaca-GPT4 (vicgalle, 2024) | A1, T1, A2, T2 | 55k
Voice QA | final | Identity finetune (sayan1101, 2024) | A1, T1, A2, T2 | 2k
Voice QA | final | QAassistant (Mihaiii, 2024a) | A1, T1, A2, T2 | 27k
Voice QA | final | Rlhf (Anthropic, 2024) | A1, T1, A2, T2 | 367k
Voice QA | final | Trivia-singlechoice (Mihaiii, 2024c) | A1, T1, A2, T2 | 17k
Voice QA | final | Trivia-Multichoice (Mihaiii, 2024b) | A1, T1, A2, T2 | 20k
Voice QA | final | OpenAssistant (OpenAssistant, 2024) | A1, T1, A2, T2 | 2k
4.2 Training Parameters
The Mini-Omni2 model completed all training steps on eight A100 GPUs. During the adapter training stage, learning rates ranged from 2e-5 to 1e-3, while training the language model used learning rates between 2e-6 and 2e-4. The final fine-tuning was conducted with learning rates ranging from 2e-6 to 2e-5. A cosine scheduler was employed, with 1,500 warm-up steps and a global batch size of 192. Each stage was trained for one epoch using the full dataset. The scale of the vision and audio encoders was described earlier, and the language model used was the Qwen2-0.5B base model. All model adapters used Llama-MLP with an intermediate size of 4,864.
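For reference, the reported hyper-parameters can be summarized as a configuration sketch; this is a convenience restatement rather than a file from the repository, and the exact per-stage learning rates within the stated ranges are not specified.

```python
# Summary of reported training hyper-parameters; learning rates are (min, max) ranges.
TRAIN_CONFIG = {
    "hardware": "8x A100",
    "learning_rate": {
        "adapter_training": (2e-5, 1e-3),
        "language_model_training": (2e-6, 2e-4),
        "final_finetuning": (2e-6, 2e-5),
    },
    "scheduler": "cosine",
    "warmup_steps": 1500,
    "global_batch_size": 192,
    "epochs_per_stage": 1,
    "base_language_model": "Qwen2-0.5B",
    "adapter": {"type": "Llama-MLP", "intermediate_size": 4864},
}
```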
4.3 Data Construction
Spoken Dialogue Data: We used our speech recognition dataset as a random voice timbre library. To ensure robust training, a random sample from this dataset was selected as a voice prompt for the input of all spoken dialogue data, and CosyVoice(Du et al., 2024) was employed for zero-shot speech synthesis. For the output of all question-answering data, a single consistent voice timbre from an internal TTS system was used.
Interruption Data: First, the noise data is stream-encoded and decoded to simulate real-time streaming input to the model. Then, a random segment of the noise data is extracted. At the end of this segment, a "Stop Omni" phrase is inserted, generated with a random voice timbre in the same manner as the dialogue data. Finally, an additional "tail" of 0-10 seconds is appended to the end of this segment. In terms of labeling, all data before the tail is labeled as "n-irq", while the tail segment is labeled as "irq", indicating that the model should be interrupted.
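A hedged sketch of this construction using NumPy is given below; `tts_stop_omni` is a hypothetical stand-in for the zero-shot TTS step, and the frame size used for the labels is illustrative. Everything before the tail is labeled n-irq, and the tail is labeled irq, matching the description above.

```python
import numpy as np

def tts_stop_omni(sr: int) -> np.ndarray:
    """Placeholder for zero-shot TTS of "Stop Omni" with a random timbre."""
    return np.zeros(sr, dtype=np.float32)      # 1 s of silence stands in for the utterance

def build_interruption_sample(noise: np.ndarray, sr: int = 24000, frame: int = 1920):
    cut = np.random.randint(sr, len(noise))                 # random prefix of the noise stream
    prefix, rest = noise[:cut], noise[cut:]
    command = tts_stop_omni(sr)                             # the inserted "Stop Omni" phrase
    tail = rest[: np.random.randint(0, 10 * sr + 1)]        # 0-10 s appended after the command
    audio = np.concatenate([prefix, command, tail])
    n_irq_frames = (len(prefix) + len(command)) // frame    # everything before the tail: n-irq
    irq_frames = len(tail) // frame                         # the tail: irq (model should stop)
    labels = ["n-irq"] * n_irq_frames + ["irq"] * irq_frames
    return audio, labels

sample_audio, sample_labels = build_interruption_sample(np.random.randn(24000 * 30).astype(np.float32))
```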
4.4 Experimental Results
Currently, we report the accuracy of Mini-Omni2 in speech recognition to evaluate its speech understanding ability, and we present some practical cases. For a hands-on experience with the model and more cases, please refer to our GitHub repository.
Table 2: Speech recognition results (word error rate, %) on the LibriSpeech test and dev sets.

Method | test-clean | test-other | dev-clean | dev-other
---|---|---|---|---
Wav2vec2-base (Baevski et al., 2020) | 6.0 | 13.4 | - | - |
VITA (Fu et al., 2024) | 8.14 | 18.41 | 7.57 | 16.57 |
Whisper-small* | 4.4 | 10.1 | 4.6 | 10.3 |
Mini-Omni | 4.5 | 9.7 | 4.6 | 9.2 |
Mini-Omni2 | 4.8 | 9.8 | 4.7 | 9.4 |
According to the speech recognition results in Table 2, the accuracy of Mini-Omni2 shows a slight decline compared with Mini-Omni after adding the visual modality. This phenomenon may be attributed to the relatively reduced proportion of speech recognition data. Moreover, compared with the Whisper-small decoder used as a baseline, Mini-Omni2 performs better on the LibriSpeech test-other and dev-other sets, demonstrating that our training process has enhanced the robustness of the model’s speech recognition.
4.5 Case Study
Here we present some use cases from Mini-Omni2.

5 Limitations
We believe the following aspects are worth exploring and improving: (1) Scaling of model and data size. Mini-Omni2 aims to train small models with limited resources, and we believe that more data and compute can greatly enhance its capabilities. (2) Improving style control and the diversity of audio output (emotion, naturalness, timbre, accent, and singing). (3) A richer mechanism for semantic interruptions.
6 Conclusion
In this paper, we present Mini-Omni2, a unified multi-modal language model with capabilities in text, speech, vision, end-to-end streaming audio output, and duplex interaction. Our goal is to reproduce an open-source GPT-4o model, and to the best of our knowledge, our work is one of the closest in terms of functionality. We use multiple pretrained encoders as the vision and speech encoders and align them with the language model to extend the modalities. Furthermore, we propose a three-phase modality alignment and expansion training process to achieve the desired capabilities of the model. We also explore a robust method for duplex interaction modeling and introduce our data construction and interruption mechanism. All models and datasets will be open-sourced, and we hope Mini-Omni2 can serve as a reference for future research.
References
- AI [2023] Lightning AI. Litgpt. https://github.com/Lightning-AI/litgpt, 2023.
- Alayrac et al. [2022] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736, 2022.
- Anthropic [2024] Anthropic. https://huggingface.co/datasets/anthropic/hh-rlhf, 2024.
- Baevski et al. [2020] Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems, 33:12449–12460, 2020.
- Bai et al. [2023] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023.
- Chen et al. [2023] Qian Chen, Yunfei Chu, Zhifu Gao, Zerui Li, Kai Hu, Xiaohuan Zhou, Jin Xu, Ziyang Ma, Wen Wang, Siqi Zheng, et al. Lauragpt: Listen, attend, understand, and regenerate audio with gpt. arXiv preprint arXiv:2310.04673, 2023.
- Chu et al. [2024] Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, et al. Qwen2-audio technical report. arXiv preprint arXiv:2407.10759, 2024.
- Copet et al. [2024] Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, and Alexandre Défossez. Simple and controllable music generation. Advances in Neural Information Processing Systems, 36, 2024.
- Dai et al. [2023] Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023. URL https://arxiv.org/abs/2305.06500.
- datashare [2024] datashare. https://datashare.ed.ac.uk/handle/10283/2651, 2024.
- Défossez et al. [2022] Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. High fidelity neural audio compression. arXiv preprint arXiv:2210.13438, 2022.
- Défossez et al. [2024] Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour. Moshi: a speech-text foundation model for real-time dialogue. arXiv preprint arXiv:2410.00037, 2024.
- Dong et al. [2024] Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Bin Wang, Linke Ouyang, Songyang Zhang, Haodong Duan, Wenwei Zhang, Yining Li, et al. Internlm-xcomposer2-4khd: A pioneering large vision-language model handling resolutions from 336 pixels to 4k hd. arXiv preprint arXiv:2404.06512, 2024.
- Du et al. [2024] Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, et al. Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens. arXiv preprint arXiv:2407.05407, 2024.
- Fang et al. [2024] Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, and Yang Feng. Llama-omni: Seamless speech interaction with large language models. arXiv preprint arXiv:2409.06666, 2024.
- Fu et al. [2024] Chaoyou Fu, Haojia Lin, Zuwei Long, Yunhang Shen, Meng Zhao, Yifan Zhang, Xiong Wang, Di Yin, Long Ma, Xiawu Zheng, et al. Vita: Towards open-source interactive omni multimodal llm. arXiv preprint arXiv:2408.05211, 2024.
- Google [2024] Google. https://deepmind.google/technologies/gemini/, 2024.
- Li et al. [2022] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International conference on machine learning, pages 12888–12900. PMLR, 2022.
- Liu et al. [2024] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26296–26306, 2024.
- Ma et al. [2024] Ziyang Ma, Yakun Song, Chenpeng Du, Jian Cong, Zhuo Chen, Yuping Wang, Yuxuan Wang, and Xie Chen. Language model can listen while speaking. arXiv preprint arXiv:2408.02622, 2024.
- meta [2024] meta. llama3.1, 2024. URL https://llama.meta.com/.
- Mihaiii [2024a] Mihaiii. https://huggingface.co/datasets/mihaiii/qa-assistant-2, 2024a.
- Mihaiii [2024b] Mihaiii. https://huggingface.co/datasets/mihaiii/triviamultichoice, 2024b.
- Mihaiii [2024c] Mihaiii. https://huggingface.co/datasets/mihaiii/triviasinglechoice, 2024c.
- Nachmani et al. [2023] Eliya Nachmani, Alon Levkovitch, Roy Hirsch, Julian Salazar, Chulayuth Asawaroengchai, Soroosh Mariooryad, Ehud Rivlin, RJ Skerry-Ryan, and Michelle Tadmor Ramanovich. Spoken question answering and speech continuation using spectrogram-powered llm. arXiv preprint arXiv:2305.15255, 2023.
- Openai [2024a] Openai. https://openai.com/index/hello-gpt-4o/, 2024a.
- Openai [2024b] Openai. https://openai.com/index/gpt-4v-system-card/, 2024b.
- OpenAssistant [2024] OpenAssistant. https://huggingface.co/datasets/openassistant/oasst1, 2024.
- OpenOrca. https://huggingface.co/datasets/open-orca/openorca/.
- Pratap et al. [2020] Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, and Ronan Collobert. Mls: A large-scale multilingual dataset for speech research. arXiv preprint arXiv:2012.03411, 2020.
- Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
- Radford et al. [2023] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International conference on machine learning, pages 28492–28518. PMLR, 2023.
- Rubenstein et al. [2023] Paul K Rubenstein, Chulayuth Asawaroengchai, Duc Dung Nguyen, Ankur Bapna, Zalán Borsos, Félix de Chaumont Quitry, Peter Chen, Dalia El Badawy, Wei Han, Eugene Kharitonov, et al. Audiopalm: A large language model that can speak and listen. arXiv preprint arXiv:2306.12925, 2023.
- sayan1101 [2024] sayan1101. https://huggingface.co/datasets/sayan1101/identity-finetune-data, 2024.
- Siuzdak [2024] Hubert Siuzdak. https://github.com/hubertsiuzdak/snac/, 2024.
- Snyder et al. [2015] David Snyder, Guoguo Chen, and Daniel Povey. Musan: A music, speech, and noise corpus. arXiv preprint arXiv:1510.08484, 2015.
- Sun et al. [2024] Tianxiang Sun, Xiaotian Zhang, Zhengfu He, Peng Li, Qinyuan Cheng, Xiangyang Liu, Hang Yan, Yunfan Shao, Qiong Tang, Shiduo Zhang, et al. Moss: An open conversational large language model. Machine Intelligence Research, pages 1–18, 2024.
- Touvron et al. [2023] H Touvron, T Lavril, G Izacard, X Martinet, MA Lachaux, T Lacroix, B Rozière, N Goyal, E Hambro, F Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- vicgalle [2024] vicgalle. https://huggingface.co/datasets/vicgalle/alpaca-gpt4, 2024.
- Wang et al. [2023a] Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, et al. Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint arXiv:2301.02111, 2023a.
- Wang et al. [2024] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024.
- Wang et al. [2023b] Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079, 2023b.
- Xie and Wu [2024] Zhifei Xie and Changqiao Wu. Mini-omni: Language models can hear, talk while thinking in streaming. arXiv preprint arXiv:2408.16725, 2024.
- Yang et al. [2024] An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024.
- Zen et al. [2019] Heiga Zen, Viet Dang, Rob Clark, Yu Zhang, Ron J Weiss, Ye Jia, Zhifeng Chen, and Yonghui Wu. Libritts: A corpus derived from librispeech for text-to-speech. arXiv preprint arXiv:1904.02882, 2019.
- Zhan et al. [2024] Jun Zhan, Junqi Dai, Jiasheng Ye, Yunhua Zhou, Dong Zhang, Zhigeng Liu, Xin Zhang, Ruibin Yuan, Ge Zhang, Linyang Li, et al. Anygpt: Unified multimodal llm with discrete sequence modeling. arXiv preprint arXiv:2402.12226, 2024.
- Zhang et al. [2023a] Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, and Xipeng Qiu. Speechgpt: Empowering large language models with intrinsic cross-modal conversational abilities. arXiv preprint arXiv:2305.11000, 2023a.
- Zhang et al. [2023b] Xin Zhang, Dong Zhang, Shimin Li, Yaqian Zhou, and Xipeng Qiu. Speechtokenizer: Unified speech tokenizer for speech large language models. arXiv preprint arXiv:2308.16692, 2023b.
- Zhang et al. [2023c] Yu Zhang, Wei Han, James Qin, Yongqiang Wang, Ankur Bapna, Zhehuai Chen, Nanxin Chen, Bo Li, Vera Axelrod, Gary Wang, et al. Google usm: Scaling automatic speech recognition beyond 100 languages. arXiv preprint arXiv:2303.01037, 2023c.
- Zhu et al. [2023] Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.