
Bora: Biomedical Generalist Video Generation Model

Weixiang Sun1∗, Xiaocao You2∗, Ruizhe Zheng3∗, Zhengqing Yuan4,
Xiang Li5, Lifang He6, Quanzheng Li5, Lichao Sun6
1Northeastern University, China, 2Shanghai University of Finance and Economics
3Fudan University, 4University of Notre Dame
5Massachusetts General Hospital and Harvard Medical School, 6Lehigh University
Abstract

Generative models hold promise for revolutionizing medical education, robot-assisted surgery, and data augmentation for medical AI development. Diffusion models can now generate realistic images from text prompts, while recent advancements have demonstrated their ability to create diverse, high-quality videos. However, these models often struggle with generating accurate representations of medical procedures and detailed anatomical structures. This paper introduces Bora, the first spatio-temporal diffusion probabilistic model designed for text-guided biomedical video generation. Bora leverages a Transformer architecture and is pre-trained on general-purpose video generation tasks. It is fine-tuned through model alignment and instruction tuning using a newly established medical video corpus, which includes paired text-video data from various biomedical fields. To the best of our knowledge, this is the first attempt to establish such a comprehensive annotated biomedical video dataset. Bora is capable of generating high-quality video data across four distinct biomedical domains, adhering to medical expert standards and demonstrating consistency and diversity. This generalist video generative model holds significant potential for enhancing medical consultation and decision-making, particularly in resource-limited settings. Additionally, Bora could pave the way for immersive medical training and procedure planning. Extensive experiments on distinct medical modalities such as endoscopy, ultrasound, MRI, and cell tracking validate the effectiveness of our model in understanding biomedical instructions and its superior performance across subjects compared to state-of-the-art generation models. Our model and code are available at https://weixiang-sun.github.io/Bora/

Refer to caption
Figure 1: The overall caption-generation process. First, the agent extracts background information from the corresponding dataset and injects it into the LLM; combined with the frame sequence, the LLM then generates high-quality captions.

1 Introduction

Refer to caption
Figure 2: Simple video examples produced by Bora and their corresponding text prompts, showcasing four biomedical modalities: endoscopy, ultrasound, real-time MRI, and cellular visualization.

Generative AI technologies have stepped into a new era, fundamentally altering how industries operate and deeply influencing everyday life OpenAI et al. (2023). Text-to-image (T2I) diffusion models are now capable of generating realistic images that adhere to complex text prompts Saharia et al. (2022); Rombach et al. (2022). Recent video generation models such as Pika Pika (2023), SVD Blattmann et al. (2023a), and Gen-2 ML (2023) have also demonstrated the ability to create diverse and high-quality videos, primarily in general contexts. Sora Liu et al. (2024), introduced by OpenAI in February 2024, is known for its advanced capabilities in generating detailed videos from textual descriptions; it stands out for its capacity to generate high-quality videos lasting up to one minute while faithfully following the textual instructions provided by users OpenAI (2024). Despite their revolutionary capabilities, these models often struggle to generate accurate representations of medical procedures, detailed anatomical structures, and other specific clinical information.

So far, a diversity of models have been proposed for text-guided visual content. Among them, diffusion models stand out as the most powerful deep generative models for various content-creation tasks, including image-to-image, text-to-image, and text-to-video generation Croitoru et al. (2023); Wang et al. (2023b); Mogavi et al. (2024); Ma et al. (2024). Video generation aims to produce realistic videos that simultaneously exhibit high-quality visual appearance and consistent motion. Diffusion models demonstrate strong performance in synthesizing temporally coherent video frames with flexible conditioning and control, greater diversity, and finer detail. Fine-tuning diffusion models is also cost-effective owing to their generalizability and adaptability to user requirements. In particular, recent advances in Transformer-based large-scale video diffusion models have enabled long video generation that adheres to specific human instructions. Sora Liu et al. (2024) demonstrates a remarkable ability to accurately interpret and execute complex human instructions.

However, there have been few attempts to explore biomedical video generation, which demands that the model comprehend complex medical instructions and intricate real-world dynamics. In addition, current video generation approaches are not capable of producing accurate and realistic anatomical structures. In this work, we investigate the potential of diffusion models to generate biomedical video content with high controllability and quality. We begin by establishing a medical video corpus, which includes paired text-video data from various biomedical fields, encompassing both non-task-oriented and task-oriented content. Due to concerns regarding intellectual property and privacy, the dataset list is not exhaustive. Nevertheless, our text-video corpus is designed to be representative of diverse applications, ranging from macroscopic to microscopic scales. For video clips lacking consistent descriptions, we leverage an LLM to generate captions, thereby enhancing the usability of their content. We then design Bora, the first spatio-temporal diffusion probabilistic model for text-guided generalist biomedical video generation. Bora is based on a Transformer architecture and has been pre-trained on general-purpose video generation tasks. As shown in Figure 1, we fine-tune the model through alignment and instruction tuning on the constructed corpus. We assess whether Bora-generated videos appear plausible with respect to medical expert standards and evaluate their consistency and diversity.

For generative video modeling, it is well established that pre-training on a large and diverse dataset followed by fine-tuning on a smaller, higher-quality dataset benefits final performance. Therefore, Bora is initialized from weights pre-trained on large-scale data and is fine-tuned through alignment and instruction tuning on the well-curated biomedical video corpus. Through extensive text-to-video generation experiments, we demonstrate that Bora, aided by LLM-based captions, is capable of generating high-quality video data across four distinct biomedical domains. More importantly, we assess whether Bora-generated videos appear plausible with respect to medical expert standards and evaluate their consistency and diversity. The results show that Bora achieves a significantly better understanding of domain-specific instructions than general-purpose state-of-the-art video diffusion models. It also shows promising subject and motion consistency across modalities such as endoscopy, RT-MRI, and ultrasound imaging.

The debut of Bora, a generalist biomedical video generative model, underscores its vast potential in enhancing medical consultation, diagnosis, and operations for clinical practitioners, thereby improving patient experience and welfare. Bora can significantly impact medicine by providing patients with visual guides on procedures and treatments and offering doctors real-time assistance. In medical education, Bora could offer resources for students. Additionally, Bora could accelerate the integration and development of AR/VR technologies for immersive medical training and procedure planning. We summarize the contributions as follows:

  • We propose Bora, a generalist biomedical video generation model. Extensive experiments highlight Bora’s superior performance over other models in terms of video quality and consistency and its capability in following expert instructions.

  • Given the limited availability of high-quality data, we construct the first comprehensive biomedical video-text corpus by extracting detailed descriptions and background knowledge from open-source video data using LLM. This is expected to provide valuable resources for future research.

  • We validate Bora’s capability in generating videos across various biomedical modalities, including endoscopy, ultrasound, real-time MRI, and cellular motility. Bora’s proficiency in producing diverse realistic medical videos opens new avenues for medical AI.

2 Related Work

Text-to-Image Diffusion Model So far, most state-of-the-art approaches for text-to-image generation are based on diffusion models Achiam et al. (2023); Saharia et al. (2022). Diffusion models constitute a class of generative models that use a diffusion stochastic process to model data generation. They can be conditioned through classifier-based or classifier-free guidance Ho and Salimans (2022), with the latter having become the predominant approach due to its flexibility. Of these, DALL-E 2 Ramesh et al. (2022) and Imagen Saharia et al. (2022) achieve photorealistic text-to-image generation using cascaded diffusion models, whereas Stable Diffusion Rombach et al. (2022) performs generation in a low-dimensional latent space.

Text-to-Video Diffusion Model Recent years have witnessed significant interest in video generative models. Text-to-video extends text-to-image generation to the generation of coherent, high-fidelity videos given text conditions. In the initial phase, Variational Autoencoders (VAEs) and Generative Adversarial Networks (GANs) Li et al. (2017); Tulyakov et al. (2017); Chu et al. (2020); Wang et al. (2020) were used to model the spatio-temporal dynamics of video data, but these approaches are vulnerable to mode collapse. Diffusion models, by contrast, can generate dynamic and accurate video content with improved stability Ho et al. (2022); Singer et al. (2022). Specifically, spatial and temporal modules are leveraged to generate temporally consistent content. For example, MagicVideo Zhou et al. (2023a) introduces latent diffusion to the text-to-video generation framework, enhancing the model's ability to capture video content dynamics. Ge et al. (2023) introduce a video noise prior that boosts performance. Several solutions to the high computational cost of training video diffusion models have been proposed, such as downsampling the videos spatio-temporally Bar-Tal et al. (2024) or fine-tuning only the temporal modules Blattmann et al. (2023b); Guo et al. (2023) by reusing pre-trained weights.

Modality | Data Source | Origin Resolution | Length | Origin Size | Processed Size
Endoscopy | Colonoscopy Mesejo et al. (2016) | 768×576 | 10s+ | 76 | 210
Endoscopy | Kvasir-Capsule Smedsrud et al. (2021) | 256×448 | / | 50 | 1000
Endoscopy | CholecTriplet Nwoye et al. (2022) | 720×576 | 10s+ | 374 | 580
Ultrasound | Echo-Dynamic Ouyang et al. (2020) | 112×112 | 5s | 10,030 | 10,030
Ultrasound | ULiver De Luca et al. (2013); Petrusca et al. (2013) | 500×480 | 10s+ | 7 | 28
RT-MRI | 2drt Lim et al. (2021) | 84×84 | 10s+ | / | 1682
Cell | Tryp Anzaku et al. (2023) | 1360×1024 | 10s+ | 114 | 188
Cell | CTMC-v1 Anjum and Gurari (2020) | 320×400 | 10s+ | 86 | 258
Cell | VISEM Haugen et al. (2019) | 640×480 | 10s+ | 85 | 339
Table 1: Sources and detailed information of the data in our text-video pair dataset.

Diffusion Models in Biomedicine In recent years, the application of generative models has expanded significantly, evolving from conventional domains to specialized industries, including biomedicine. Several works have applied diffusion models to synthesize medical visual content across a variety of modalities for data augmentation and privacy protection in the development of AI models for medical image analysis. Xu et al. (2023) uses text-conditioned synthesized low-resolution images as a foundation for 3D CT images. Dorjsembe et al. (2022) proposes the first diffusion-based 3D medical image generation model, achieving promising results in high-resolution MRI synthesis. However, biomedical video generation is yet to be explored. Li et al. (2024a) proposes Endora, a preliminary attempt to develop a video diffusion model specifically for endoscopy data. To the best of our knowledge, there has been no medicine-specific generative model for producing high-quality and accurate videos.

3 Biomedical Text-Video Pair Dataset

Previously, there had been no exploration of text-to-video generation within the biomedical domain, and consequently no biomedical text-video pair datasets are readily available. To address this gap, we leverage the capabilities of LLMs to create the first biomedical text-video pair dataset, covering four major biomedical modalities.

3.1 Included Videos

Our video data encompasses four primary biomedical modalities: endoscopic imaging, ultrasonography, real-time MRI, and cellular motility. Table 1 details the specific dataset sources along with their fundamental information. Since resolutions vary, we standardize each video to 256×256 pixels to facilitate model training. Regarding temporal length, if a dataset's video duration exceeds ten seconds, it is uniformly recorded as 10s+. For videos that are excessively long, we determine a threshold $K$ based on the degree of frame-to-frame variation. Sampling begins with a frame interval of zero, and the interval is progressively increased until the average inter-frame variation of the resultant video exceeds the predetermined threshold.
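The sampling rule above admits a simple sketch; the mean-absolute-difference measure of frame-to-frame variation, the helper names, and the 150-frame clip length are illustrative assumptions of ours, not details taken from the released code:

import numpy as np

def clip_variation(frames):
    # Mean absolute difference between consecutive frames (arrays of pixel values).
    diffs = [np.abs(a.astype(np.float32) - b.astype(np.float32)).mean()
             for a, b in zip(frames[:-1], frames[1:])]
    return float(np.mean(diffs))

def sample_clip(frames, K, clip_len=150):
    # Start with stride 1 and increase it until the sampled clip's average
    # frame-to-frame variation exceeds the threshold K (or the frames run out).
    stride = 1
    while True:
        clip = frames[::stride][:clip_len]
        if len(clip) < clip_len or clip_variation(clip) > K:
            return clip
        stride += 1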

3.2 LLMs Instruct Caption Generation

Currently, a wide array of multimodal large language models (LLMs) support various input modes: they can act as unimodal language models processing purely textual inputs, accept single image-text pairs or single images, and handle interleaved image-text pairs or multiple images, all with good performance. Furthermore, Li et al. (2024b); Peng et al. (2023) have demonstrated that data generated by LLMs can serve as high-quality training data. More importantly, LLMs that perform strongly on general tasks, even without fine-tuning for the biomedical domain, still show excellent capabilities, such as the zero-shot performance of GPT-4 in the biomedical field Yan et al. (2023). We therefore further exploit the capabilities of LLMs on biomedical videos, which involve temporal information, aiming to generate video descriptions efficiently and accurately.

In summary, we pre-process each video $X_v^i$ by evenly splitting it into $n$ frames $(f_1, \cdots, f_n)$ and sequentially transmit these frames to the LLM, obtaining a description $X_{desc}^i$. The original video $X_v^i$ is then combined with its description $X_{desc}^i$ to form a text-video pair $X_i = (X_v^i, X_{desc}^i)$. However, during this process we found that this straightforward approach tends to overlook the dataset's background, focusing primarily on describing the objects and movements within each clip. To incorporate background information, we use an agent to transmit additional information to the LLM, such as technical documents, research papers, or homepages related to the dataset. This not only enriches its biomedical background knowledge but also ensures that background information is not neglected, significantly enhancing caption quality. More details about the source data and processing can be found in Appendix A.
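A schematic sketch of this caption-generation loop; the `llm.describe` interface, the prompt wording, and the frame count n=8 are hypothetical placeholders standing in for the actual agent and multimodal LLM calls:

def build_text_video_pair(video_frames, background_doc, llm, n=8):
    # Evenly sample n frames f_1, ..., f_n from the video X_v^i.
    step = max(len(video_frames) // n, 1)
    sampled = video_frames[::step][:n]

    # Background gathered by the agent (technical docs, papers, dataset homepages).
    system_prompt = ("You are describing a biomedical video. "
                     "Dataset background:\n" + background_doc)

    # The multimodal LLM receives the frames sequentially plus the background
    # and returns a caption X_desc^i.
    description = llm.describe(sampled, system_prompt=system_prompt)

    # Pair the original video with its description: X_i = (X_v^i, X_desc^i).
    return {"video": video_frames, "caption": description}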

4 Methods

Following prior work Chen et al. (2023); Ramesh et al. (2022); Feng et al. (2023); Wei et al. (2023); Zhou et al. (2023b), our architecture is divided into three modules: a text encoder, a video encoder, and diffusion blocks. Specifically, we initialize the weights from Open-Sora Zheng et al. (2024), a framework capable of generating high-quality general-purpose videos, and subsequently conduct two-phase biomedical training on this basis.

Refer to caption
Figure 3: The overall architecture and training details of our Bora.

4.1 Model Architecture

Text Encoder. We adopt the pre-trained text encoder of T5 Raffel et al. (2020) to encode the medical prompt; specifically, we employ only the encoder portion of T5. It consists of multiple identical layers, each comprising a self-attention mechanism and a feed-forward network. This architecture transforms the text $X_{desc}$ into a representation $z^T$ of the medical prompt via $z^T = \text{FFN}(\text{SelfAttention}(X_{desc}))$.
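A minimal sketch of encoding a medical prompt with the T5 encoder through HuggingFace Transformers; the checkpoint name "t5-large" and the maximum token length of 120 are assumptions rather than Bora's confirmed configuration:

import torch
from transformers import AutoTokenizer, T5EncoderModel

tokenizer = AutoTokenizer.from_pretrained("t5-large")
text_encoder = T5EncoderModel.from_pretrained("t5-large").eval()

prompt = "An endoscopy video showing a polyp on the sigmoid colon wall."
tokens = tokenizer(prompt, padding="max_length", max_length=120,
                   truncation=True, return_tensors="pt")
with torch.no_grad():
    # z^T: one embedding per token, used to condition the diffusion blocks.
    z_T = text_encoder(input_ids=tokens.input_ids,
                       attention_mask=tokens.attention_mask).last_hidden_state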

Video Encoder. We compress the training data into a smaller representation within the latent space using a video encoder, which is then utilized for training the subsequent blocks. For each video $X_v$, we sample $T$ frames. Each batch $B_v$ has shape $(B, C, T)$, where $B$ is the batch size, $C$ is the number of channels, and $T$ is the length of the time series (i.e., the number of sampled frames). We rearrange the sequences in the input data into single frames, reshaping into $B_v' = (B \times T, C)$. Using the mean $\mu$ and variance $\log\sigma^2$ output by a pre-trained VAE, we obtain a Gaussian distribution in the latent space $q(z \mid x) = \mathcal{N}(z; \mu(x), \sigma^2(x))$. To sample $z^V$ from this distribution, we employ the reparameterization trick $z^V = (\mu(x) + \sigma(x) \odot \epsilon) \times 0.18215$, where $\epsilon \sim \mathcal{N}(0, I)$, allowing gradients to propagate through the stochastic sampling, and apply a scaling transformation. Finally, we rearrange the data back into the shape $(B, C, T)$ to match subsequent dimensions.
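A sketch of this per-frame latent encoding with a pre-trained 2D image VAE, here the Stable Diffusion VAE via diffusers; the checkpoint name and the (B, C, T, H, W) input layout are assumptions consistent with the description above:

import torch
from diffusers import AutoencoderKL
from einops import rearrange

vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-ema").eval()

def encode_video(x):                                   # x: (B, C, T, H, W) in [-1, 1]
    b = x.shape[0]
    frames = rearrange(x, "b c t h w -> (b t) c h w")  # fold time into the batch
    with torch.no_grad():
        posterior = vae.encode(frames).latent_dist     # Gaussian with mean and log-variance
        z = posterior.sample() * 0.18215               # reparameterized sample, then scaling
    return rearrange(z, "(b t) c h w -> b c t h w", b=b)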

Diffusion Transformer. Diffusion models typically include a "forward process" and a "reverse process". The forward process incrementally adds noise to a data point $x_0$, transforming it completely into Gaussian noise $x_T$. This process can be described as:

$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t; \sqrt{1-\beta_t}\, x_{t-1}, \beta_t I\right)$   (1)

Conversely, the reverse process attempts to recover the original data point from this noisy state. In conditions where text is present, the core approach involves combining text embeddings with diffusion states to modulate the reverse process as follows:

$p_\theta(x_{t-1} \mid x_t, c) = \mathcal{N}\!\left(x_{t-1}; \mu_\theta(x_t, t, c), \sigma_t^2 I\right)$   (2)
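A simplified training step corresponding to Eqs. (1)-(2), using the standard closed-form corruption of $x_0$ and an epsilon-prediction network conditioned on the text embedding; `model` is a placeholder for the diffusion transformer and the noise-schedule values are generic defaults, not Bora's exact settings:

import torch
import torch.nn.functional as F

T_STEPS = 1000
betas = torch.linspace(1e-4, 0.02, T_STEPS)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)

def diffusion_loss(model, z0, text_emb):
    # z0: clean video latents (B, C, T, H, W); text_emb: z^T from the text encoder.
    t = torch.randint(0, T_STEPS, (z0.shape[0],), device=z0.device)
    eps = torch.randn_like(z0)
    a_bar = alphas_bar.to(z0.device)[t].view(-1, 1, 1, 1, 1)
    zt = a_bar.sqrt() * z0 + (1.0 - a_bar).sqrt() * eps  # closed form implied by Eq. (1)
    eps_pred = model(zt, t, text_emb)                    # parameterizes the reverse step of Eq. (2)
    return F.mse_loss(eps_pred, eps)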

4.2 Adapting to Biomedical Domain

To efficiently and effectively obtain a high-quality model for generating biomedical videos, we divide the training into two stages: Modal Alignment and Full-parameter Training.

Biomedical Modal Alignment. Although Open-Sora has not released the text-video pairs it created, our inference based on the source data suggests that they include no medical concepts or scenarios. Furthermore, its parameters are initialized from PixArt-α Chen et al. (2023), which also lacks effective medical knowledge injection. Therefore, modal alignment is particularly important. To enhance efficiency in this step, we simplify the data and freeze a portion of the parameters. Regarding the data, for modalities with fewer total videos we extract more frames randomly from each video, whereas for modalities with more videos fewer frames are extracted from each, so that performance stays balanced across modalities. Moreover, we simplify the corresponding captions to the template "This is a [modality] video." On the model side, we freeze the temporal attention to accelerate training. These strategies efficiently provide simple guidance for a video generation model that contains no biomedical knowledge, laying a foundation for further training.
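A sketch of the alignment-stage freezing, assuming the temporal attention modules can be located by name; the "temporal" substring reflects common naming in spatio-temporal diffusion backbones and is an assumption about the implementation:

def freeze_temporal_attention(model):
    frozen = 0
    for name, param in model.named_parameters():
        if "temporal" in name:            # temporal attention / temporal blocks only
            param.requires_grad = False
            frozen += param.numel()
    return frozen                         # number of frozen parameters, for sanity checking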

Instruction Tuning. In the second stage, we first unfreeze the temporal attention module and then continue training from the weights obtained in the previous step. During this stage, we train on the biomedical text-video pairs introduced in Section 3, constructing a biomedical video generator. To ensure balance across modalities, we did not use all of the collected data: taking the modality with the smallest amount of data as a benchmark, we balance the quantities across the modalities to prevent inter-modal confusion. For videos of varying lengths, we adopt different sampling intervals to capture richer temporal information. Experiments demonstrate that our model not only possesses robust instruction-following capabilities, accurately translating medical terminology into corresponding videos, but also ensures sufficient video quality.
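A sketch of this balancing step, where the smallest modality sets the per-modality budget; the random subsampling is an illustrative choice rather than the documented procedure:

import random

def balance_modalities(pairs_by_modality, seed=0):
    # pairs_by_modality: dict mapping modality name -> list of text-video pairs.
    rng = random.Random(seed)
    budget = min(len(v) for v in pairs_by_modality.values())  # smallest modality as benchmark
    balanced = []
    for modality, pairs in pairs_by_modality.items():
        balanced.extend(rng.sample(pairs, budget))             # equal count per modality
    rng.shuffle(balanced)
    return balanced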

5 Experiments

Refer to caption
Figure 4: Comparison of videos generated under the same prompt in the endoscopy modality. From top to bottom: Bora, Pika, PixVerse, Gen-2, ModelScope, and Lavie.

5.1 Setup

Baseline Models. At present, there are numerous available open-source or commercial video generation models. We have selected several high-performance models as benchmark models, including Pika Pika (2023), PixVerse PixVerse (2024), Gen-2 ML (2023), ModelScope Wang et al. (2023a), and Lavie Wang et al. (2023b).

Implementation Details. For text-to-video generation, we employ GPT-4 to generate text prompts across the different modalities. Specifically, we provide an overview of the desired modalities and some examples to learn from, and then prompt it to generate a certain number of text prompts. For models that impose length restrictions on text prompts, we use GPT-4 to rewrite overly long prompts without altering their meaning. All generated text prompts are then fed into the text-to-video models to produce videos.

All experiments are conducted on one Tesla A100 GPU with 80 GB of VRAM. Central processing was handled by 4× Intel(R) Xeon(R) Platinum 8362 56-Core Processors. The software environment was standardized on PyTorch 2.2.2 with CUDA 12.1 for video generation and PyTorch 1.13.1 with CUDA 12.1 for video evaluation. More details about training and evaluation can be found in Appendix B.

Biomedical Instruction Following Metrics. For the biomedical field, our primary concern is to evaluate the authenticity of the generated videos and their understanding of biomedical information. Based on this, we have designed the following three metrics for evaluation.

❶ Realism Rate: We use Video-Llama Zhang et al. (2023) to determine whether a video is from the real world. If it judges the video to be from the real world, the video scores 1 point; otherwise, it scores 0 points. During evaluation, we found that the model sometimes returns ambiguous answers such as "I am not sure if this is a real-world video." To ensure that all samples contribute validly, we assign a score of 0.5 in such cases. The final realism rate is the average score across all samples.

$I(x) = \text{VideoLlama}(\text{video}), \qquad \text{Realism Rate} = \begin{cases} 1 & \text{if } I(x) = \text{Yes} \\ 0 & \text{if } I(x) = \text{No} \\ 0.5 & \text{if } I(x) = \text{Not Sure} \end{cases}$   (3)
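A sketch of the scoring rule in Eq. (3); the `ask_video_llama` argument is a placeholder for querying Video-Llama with a yes/no question about whether the clip comes from the real world, and the keyword matching is an assumption about how answers are parsed:

def realism_score(answer: str) -> float:
    a = answer.lower()
    if "not sure" in a or "unsure" in a:
        return 0.5                      # ambiguous answer
    return 1.0 if "yes" in a else 0.0   # Yes -> 1, No -> 0

def realism_rate(videos, ask_video_llama):
    question = "Is this video from the real world?"
    scores = [realism_score(ask_video_llama(v, question)) for v in videos]
    return sum(scores) / len(scores)    # average over all samples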

❷ Biomedical Understanding (BmU): Even if a video is judged to be entirely virtual, the model can still be excellent if it accurately conveys the intent of the biomedical prompt. To this end, we propose Biomedical Understanding (BmU), a metric designed to evaluate adherence to the prompt in the embedding space. Specifically, we have a large language model describe the video, then input the newly obtained description $T_{new}$ and the original prompt $T_{ori}$ into BERT and compute their similarity in the embedding space. We obtain the new video descriptions in two ways: 1) passing the sampled frame sequence as images (BmU-I), and 2) using Video-Llama Zhang et al. (2023) to describe the video (BmU-V). These two methods provide video descriptions from different perspectives, ensuring the accuracy of the evaluation.

$\text{BmU} = \cos\!\left(\text{BERT}(T_{new}^{(i)}), \text{BERT}(T_{ori}^{(i)})\right)$   (4)
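A sketch of Eq. (4) with mean-pooled BERT embeddings and cosine similarity; the mean pooling and the "bert-base-uncased" checkpoint are assumptions, since the paper does not specify how sentence vectors are extracted from BERT:

import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased").eval()

def embed(text):
    batch = tok(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = bert(**batch).last_hidden_state   # (1, seq_len, 768)
    return hidden.mean(dim=1).squeeze(0)           # mean-pooled sentence vector

def bmu(t_new, t_ori):
    # Cosine similarity between the generated-video description and the original prompt.
    return torch.cosine_similarity(embed(t_new), embed(t_ori), dim=0).item()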
Table 2: Comparison of biomedical instruction-following ability with other models. All results are obtained under biomedical prompts.
Model | Realism Rate | BmU-I | BmU-V | BmU-avg | Default Length (s)
Pika Pika (2023) | 0.14 | 0.60 | 0.67 | 0.64 | 4
PixVerse PixVerse (2024) | 0.06 | 0.71 | 0.69 | 0.70 | 4
Gen-2 ML (2023) | 0.11 | 0.55 | 0.59 | 0.57 | 4
ModelScope Wang et al. (2023a) | 0.32 | 0.39 | 0.52 | 0.46 | 2
Lavie Wang et al. (2023b) | 0.20 | 0.47 | 0.53 | 0.50 | 2
Bora | 0.66 | 0.83 | 0.89 | 0.86 | 5
Table 3: Comprehensive video quality evaluation scores for the four biomedical modalities generated by our Bora. Note: the results of Sora are included only for comparison and were not obtained under biomedical prompts.
Modality | Subject Consistency | Background Consistency | Temporal Flickering | Motion Smoothness | Dynamic Degree | Imaging Quality | Temporal Style
Endoscopy | 0.89 | 0.96 | 0.90 | 0.91 | 0.90 | 0.60 | 0.23
Ultrasound | 0.91 | 0.94 | 0.97 | 0.98 | 0.29 | 0.37 | 0.21
RT-MRI | 0.90 | 0.96 | 0.99 | 0.99 | 0.28 | 0.19 | 0.23
Cell | 0.92 | 0.99 | 0.97 | 0.98 | 0.35 | 0.48 | 0.19
Avg. Score | 0.91 | 0.96 | 0.96 | 0.97 | 0.46 | 0.41 | 0.22
Sora | 0.95 | 0.96 | - | 1.0 | 0.69 | 0.58 | 0.35

Video Quality Metrics. We evaluate the comprehensive video quality of four covered biomedical modalities, following some of the basic metrics proposed in Huang et al. (2023). As our model is specialized for the biomedical domain, we have omitted the aesthetic scoring.

❶ Subject Consistency, computed via DINO Caron et al. (2021) feature similarity across frames to assess whether the subject's appearance is consistent throughout the video; ❷ Background Consistency, calculated via CLIP Radford et al. (2021) feature similarity across frames to evaluate the temporal consistency of the background; ❸ Temporal Flickering, computed by selecting keyframes and calculating their average absolute deviation to assess temporal consistency in the details; ❹ Motion Smoothness, which utilizes the motion priors of the video frame interpolation model AMT Li et al. (2023) to evaluate the smoothness of generated motion; ❺ Dynamic Degree, computed with RAFT Teed and Deng (2020) to check whether the video contains large motions; ❻ Imaging Quality, calculated with the MUSIQ Ke et al. (2021) image quality predictor; ❼ Temporal Style, determined with ViCLIP Wang et al. (2023c) by computing the similarity between video features and temporal style descriptions, reflecting the consistency of the style.
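To illustrate the frame-feature similarity behind ❶, a sketch of subject consistency using DINO ViT-S/16 features loaded from the official torch.hub entry point; averaging cosine similarity over adjacent frames is a simplification of the full VBench protocol:

import torch
import torch.nn.functional as F

dino = torch.hub.load("facebookresearch/dino:main", "dino_vits16").eval()

def subject_consistency(frames):
    # frames: (T, 3, 224, 224), ImageNet-normalized.
    with torch.no_grad():
        feats = F.normalize(dino(frames), dim=-1)   # (T, D) unit-norm frame features
    sims = (feats[:-1] * feats[1:]).sum(dim=-1)     # cosine similarity of adjacent frames
    return sims.mean().item()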

5.2 Results

The evaluation results for biomedical instruction-following ability are shown in Table 2. Our Bora significantly outperforms the other models across all metrics. It should be noted that BERT Devlin et al. (2018) maps text into an embedding space and performs comparisons on vectors rather than on actual semantics; the BmU (Biomedical Understanding) scores of the other models are already at a considerably low level, indicating that their generated videos diverge almost entirely from the prompts. A visual comparison between Bora and the other models is shown in Figure 4.

Given the large discrepancies in the evaluation results from the previous phase, we deem it unnecessary to conduct further comprehensive quality assessments of the videos generated by the other models. On one hand, comparing against models that do not specialize in the biomedical domain is not fair; on the other hand, the biomedical videos they produce do not accurately reflect their true performance. Therefore, we shift our focus to evaluating the videos Bora generates across the four covered modalities. For a concise and intuitive comparison, we also include the evaluation results of Sora Yuan et al. (2024). The comprehensive video quality results are shown in Table 3.

The cell modality is clearly ahead in background consistency because its background is a uniformly gray or bright slide. Because the endoscopy training videos were processed Li et al. (2024a), they exhibit significant frame-to-frame variation, resulting in a high dynamic degree. The other metrics show little variation across the four modalities and generally perform well; the imaging quality of the endoscopy modality even slightly surpasses that of Sora. Despite some performance discrepancies between modalities, the average scores demonstrate performance approaching that of Sora.

6 Conclusion

In this work, we propose Bora, a model designed for generating high-quality biomedical videos. By combining detailed descriptions and extensive background knowledge from video data, we created the first high-quality biomedical text-video pair datasets, highlighting the importance of open-source data in medical AI. Bora sets a new standard with state-of-the-art accuracy and authenticity, surpassing other video generation models in understanding real-world scenarios. Its flexibility in video synthesis makes Bora valuable for various medical applications. We believe that our work will significantly advance subsequent developments in biomedical generation as well as in industries such as AR, VR, and even education.

7 Limitations

7.1 Highly Data-centric

The collection and legal use of video data are often hindered by copyright protections, and these challenges become even more pronounced for biomedical video. Beyond copyright, concerns surrounding privacy and ethics must also be considered. High-quality biomedical videos are typically sourced from educational content at universities and institutions, where external access is restricted. This limitation forces us to rely on open-source data to build generative models across several biomedical modalities. Additionally, the procedures in these videos demand high visual clarity, yet biomedical processes, such as those recorded with endoscopes, often yield videos of lower resolution. This underscores the importance of high-quality data for training biomedical generation models.

7.2 Variable Quality of Captions

Although numerous multimodal large language models (LLMs) can describe visual inputs, their performance in the biomedical domain significantly lags behind their capabilities in general domains. While some models are fine-tuned for particular areas within the biomedical field, they are typically optimized only for those domains and fail to generate effective captions for other biomedical modalities. Moreover, a homogenization issue exists among the captions: owing to weak recognition of genuine medical details, the generated captions often repeat similar content. This leads to confusion between modalities, as demonstrated in Figure 15, where descriptions intended for cell and RT-MRI scenarios result in endoscopy- and ultrasound-style outputs. The most accurate captions usually come from medical diagnoses or narratives by researchers, which not only raises the cost of generating captions but also poses potential risks to patient privacy. Striking a balance between accurate captions, manageable costs, and privacy regulations is crucial.

7.3 Insufficiencies in Quality and Duration

Despite our Bora model's ability to generate videos of up to 5 seconds across various biomedical modalities, it underperforms on complex procedures and longer durations. When we attempted to produce videos up to 16 seconds long, there was a noticeable degradation in quality. This issue stems partly from the lack of high-quality, long-duration biomedical video data available for training, and partly from the suboptimal performance of our chosen base model in handling spatiotemporal interactions. In contrast, the best general-purpose video generators, such as Sora, can produce high-quality videos exceeding one minute. Currently, we can only guarantee the quality of 5-second videos at a resolution of 256×256. This limitation motivates us to further expand the spatiotemporal capabilities of future versions of our model.

References

  • Achiam et al. (2023) Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
  • Anjum and Gurari (2020) Samreen Anjum and Danna Gurari. 2020. Ctmc: Cell tracking with mitosis detection dataset challenge. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 982–983.
  • Anzaku et al. (2023) Esla Timothy Anzaku, Mohammed Aliy Mohammed, Utku Ozbulak, Jongbum Won, Hyesoo Hong, Janarthanan Krishnamoorthy, Sofie Van Hoecke, Stefan Magez, Arnout Van Messem, and Wesley De Neve. 2023. Tryp: a dataset of microscopy images of unstained thick blood smears for trypanosome detection. Scientific Data, 10(1):716.
  • Bar-Tal et al. (2024) Omer Bar-Tal, Hila Chefer, Omer Tov, Charles Herrmann, Roni Paiss, Shiran Zada, Ariel Ephrat, Junhwa Hur, Yuanzhen Li, Tomer Michaeli, et al. 2024. Lumiere: A space-time diffusion model for video generation. arXiv preprint arXiv:2401.12945.
  • Blattmann et al. (2023a) Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, Varun Jampani, and Robin Rombach. 2023a. Stable video diffusion: Scaling latent video diffusion models to large datasets.
  • Blattmann et al. (2023b) Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. 2023b. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127.
  • Caron et al. (2021) Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. 2021. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660.
  • Chen et al. (2023) Junsong Chen, Jincheng Yu, Chongjian Ge, Lewei Yao, Enze Xie, Yue Wu, Zhongdao Wang, James Kwok, Ping Luo, Huchuan Lu, et al. 2023. Pixart-α\alpha: Fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426.
  • Chu et al. (2020) Mengyu Chu, You Xie, Jonas Mayer, Laura Leal-Taixé, and Nils Thuerey. 2020. Learning temporal coherence via self-supervision for gan-based video generation. ACM Transactions on Graphics, 39(4).
  • Croitoru et al. (2023) Florinel-Alin Croitoru, Vlad Hondru, Radu Tudor Ionescu, and Mubarak Shah. 2023. Diffusion models in vision: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • De Luca et al. (2013) Valeria De Luca, Michael Tschannen, Gábor Székely, and Christine Tanner. 2013. A learning-based approach for fast and robust vessel tracking in long ultrasound sequences. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2013: 16th International Conference, Nagoya, Japan, September 22-26, 2013, Proceedings, Part I 16, pages 518–525. Springer.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Dorjsembe et al. (2022) Zolnamar Dorjsembe, Sodtavilan Odonchimed, and Furen Xiao. 2022. Three-dimensional medical image synthesis with denoising diffusion probabilistic models. In Medical Imaging with Deep Learning.
  • Feng et al. (2023) Zhida Feng, Zhenyu Zhang, Xintong Yu, Yewei Fang, Lanxin Li, Xuyi Chen, Yuxiang Lu, Jiaxiang Liu, Weichong Yin, Shikun Feng, Yu Sun, Li Chen, Hao Tian, Hua Wu, and Haifeng Wang. 2023. Ernie-vilg 2.0: Improving text-to-image diffusion model with knowledge-enhanced mixture-of-denoising-experts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10135–10145.
  • Ge et al. (2023) Songwei Ge, Seungjun Nah, Guilin Liu, Tyler Poon, Andrew Tao, Bryan Catanzaro, David Jacobs, Jia-Bin Huang, Ming-Yu Liu, and Yogesh Balaji. 2023. Preserve your own correlation: A noise prior for video diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 22930–22941.
  • Guo et al. (2023) Yuwei Guo, Ceyuan Yang, Anyi Rao, Yaohui Wang, Yu Qiao, Dahua Lin, and Bo Dai. 2023. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725.
  • Haugen et al. (2019) Trine B Haugen, Steven A Hicks, Jorunn M Andersen, Oliwia Witczak, Hugo L Hammer, Rune Borgli, Pål Halvorsen, and Michael Riegler. 2019. Visem: A multimodal video dataset of human spermatozoa. In Proceedings of the 10th ACM Multimedia Systems Conference, pages 261–266.
  • Ho et al. (2022) Jonathan Ho, William Chan, Chitwan Saharia, Jay Whang, Ruiqi Gao, Alexey Gritsenko, Diederik P Kingma, Ben Poole, Mohammad Norouzi, David J Fleet, et al. 2022. Imagen video: High definition video generation with diffusion models. arXiv preprint arXiv:2210.02303.
  • Ho and Salimans (2022) Jonathan Ho and Tim Salimans. 2022. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598.
  • Huang et al. (2023) Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, Yaohui Wang, Xinyuan Chen, Limin Wang, Dahua Lin, Yu Qiao, and Ziwei Liu. 2023. Vbench: Comprehensive benchmark suite for video generative models. ArXiv, abs/2311.17982.
  • Ke et al. (2021) Junjie Ke, Qifei Wang, Yilin Wang, Peyman Milanfar, and Feng Yang. 2021. Musiq: Multi-scale image quality transformer. In Proceedings of the IEEE/CVF international conference on computer vision, pages 5148–5157.
  • Li et al. (2024a) Chenxin Li, Hengyu Liu, Yifan Liu, Brandon Y. Feng, Wuyang Li, Xinyu Liu, Zhen Chen, Jing Shao, and Yixuan Yuan. 2024a. Endora: Video generation models as endoscopy simulators.
  • Li et al. (2024b) Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. 2024b. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems, 36.
  • Li et al. (2017) Yitong Li, Martin Renqiang Min, Dinghan Shen, David Carlson, and Lawrence Carin. 2017. Video generation from text.
  • Li et al. (2023) Zhen Li, Zuo-Liang Zhu, Ling-Hao Han, Qibin Hou, Chun-Le Guo, and Ming-Ming Cheng. 2023. Amt: All-pairs multi-field transforms for efficient frame interpolation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9801–9810.
  • Lim et al. (2021) Yongwan Lim, Asterios Toutios, Yannick Bliesener, Ye Tian, Sajan Goud Lingala, Colin Vaz, Tanner Sorensen, Miran Oh, Sarah Harper, Weiyi Chen, et al. 2021. A multispeaker dataset of raw and reconstructed speech production real-time mri video and 3d volumetric images. Scientific data, 8(1):187.
  • Liu et al. (2024) Yixin Liu, Kai Zhang, Yuan Li, Zhiling Yan, Chujie Gao, Ruoxi Chen, Zhengqing Yuan, Yue Huang, Hanchi Sun, Jianfeng Gao, et al. 2024. Sora: A review on background, technology, limitations, and opportunities of large vision models. arXiv preprint arXiv:2402.17177.
  • Ma et al. (2024) Xin Ma, Yaohui Wang, Gengyun Jia, Xinyuan Chen, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. 2024. Latte: Latent diffusion transformer for video generation. arXiv preprint arXiv:2401.03048.
  • Mesejo et al. (2016) Pablo Mesejo, Daniel Pizarro, Armand Abergel, Olivier Rouquette, Sylvain Beorchia, Laurent Poincloux, and Adrien Bartoli. 2016. Computer-aided classification of gastrointestinal lesions in regular colonoscopy. IEEE transactions on medical imaging, 35(9):2051–2063.
  • ML (2023) Runway ML. 2023. Text to video. URL https://runwayml.com/ai-tools/gen-2-text-to-video/.
  • Mogavi et al. (2024) Reza Hadi Mogavi, Derrick Wang, Joseph Tu, Hilda Hadan, Sabrina A Sgandurra, Pan Hui, and Lennart E Nacke. 2024. Sora openai’s prelude: Social media perspectives on sora openai and the future of ai video generation. arXiv preprint arXiv:2403.14665.
  • Nwoye et al. (2022) Chinedu Innocent Nwoye, Tong Yu, Cristians Gonzalez, Barbara Seeliger, Pietro Mascagni, Didier Mutter, Jacques Marescaux, and Nicolas Padoy. 2022. Rendezvous: Attention mechanisms for the recognition of surgical action triplets in endoscopic videos. Medical Image Analysis, 78:102433.
  • OpenAI (2024) OpenAI. 2024. Sora: Creating video from text. https://openai.com/sora.
  • OpenAI et al. (2023) OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, and Diogo Almeida. 2023. Gpt-4 technical report.
  • Ouyang et al. (2020) David Ouyang, Bryan He, Amirata Ghorbani, Neal Yuan, Joseph Ebinger, Curtis P Langlotz, Paul A Heidenreich, Robert A Harrington, David H Liang, Euan A Ashley, et al. 2020. Video-based ai for beat-to-beat assessment of cardiac function. Nature, 580(7802):252–256.
  • Peng et al. (2023) Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. 2023. Instruction tuning with gpt-4. arXiv preprint arXiv:2304.03277.
  • Petrusca et al. (2013) Lorena Petrusca, Philippe Cattin, Valeria De Luca, Frank Preiswerk, Zarko Celicanin, Vincent Auboiroux, Magalie Viallon, Patrik Arnold, Francesco Santini, Sylvain Terraz, et al. 2013. Hybrid ultrasound/magnetic resonance simultaneous acquisition and image fusion for motion monitoring in the upper abdomen. Investigative radiology, 48(5):333–340.
  • Pika (2023) Pika. 2023. Pika art. URL https://pika.art/home.
  • PixVerse (2024) PixVerse. 2024. Pixverse. URL https://pixverse.ai/.
  • Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, A. Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning transferable visual models from natural language supervision. In ICML.
  • Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67.
  • Ramesh et al. (2022) Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. 2022. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 1(2):3.
  • Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695.
  • Saharia et al. (2022) Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. 2022. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494.
  • Singer et al. (2022) Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, Devi Parikh, Sonal Gupta, and Yaniv Taigman. 2022. Make-a-video: Text-to-video generation without text-video data.
  • Smedsrud et al. (2021) Pia H Smedsrud, Vajira Thambawita, Steven A Hicks, Henrik Gjestang, Oda Olsen Nedrejord, Espen Næss, Hanna Borgli, Debesh Jha, Tor Jan Derek Berstad, Sigrun L Eskeland, et al. 2021. Kvasir-capsule, a video capsule endoscopy dataset. Scientific Data, 8(1):142.
  • Teed and Deng (2020) Zachary Teed and Jia Deng. 2020. Raft: Recurrent all-pairs field transforms for optical flow. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pages 402–419. Springer.
  • Tulyakov et al. (2017) Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, and Jan Kautz. 2017. Mocogan: Decomposing motion and content for video generation.
  • Wang et al. (2023a) Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. 2023a. Modelscope text-to-video technical report. arXiv preprint arXiv:2308.06571.
  • Wang et al. (2020) Yaohui Wang, Piotr Bilinski, Francois Bremond, and Antitza Dantcheva. 2020. Imaginator: Conditional spatio-temporal gan for video generation. In 2020 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1149–1158.
  • Wang et al. (2023b) Yaohui Wang, Xinyuan Chen, Xin Ma, Shangchen Zhou, Ziqi Huang, Yi Wang, Ceyuan Yang, Yinan He, Jiashuo Yu, Peiqing Yang, et al. 2023b. Lavie: High-quality video generation with cascaded latent diffusion models. arXiv preprint arXiv:2309.15103.
  • Wang et al. (2023c) Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, et al. 2023c. Internvid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942.
  • Wei et al. (2023) Yuxiang Wei, Yabo Zhang, Zhilong Ji, Jinfeng Bai, Lei Zhang, and Wangmeng Zuo. 2023. Elite: Encoding visual concepts into textual embeddings for customized text-to-image generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 15943–15953.
  • Xu et al. (2023) Yanwu Xu, Li Sun, Wei Peng, Shyam Visweswaran, and Kayhan Batmanghelich. 2023. Medsyn: Text-guided anatomy-aware synthesis of high-fidelity 3d ct images.
  • Yan et al. (2023) Zhiling Yan, Kai Zhang, Rong Zhou, Lifang He, Xiang Li, and Lichao Sun. 2023. Multimodal chatgpt for medical applications: an experimental study of gpt-4v. arXiv preprint arXiv:2310.19061.
  • Yuan et al. (2024) Zhengqing Yuan, Ruoxi Chen, Zhaoxu Li, Haolong Jia, Lifang He, Chi Wang, and Lichao Sun. 2024. Mora: Enabling generalist video generation via a multi-agent framework. arXiv preprint arXiv:2403.13248.
  • Zhang et al. (2023) Hang Zhang, Xin Li, and Lidong Bing. 2023. Video-llama: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858.
  • Zheng et al. (2024) Zangwei Zheng, Xiangyu Peng, and Yang You. 2024. Open-sora: Democratizing efficient video production for all.
  • Zhou et al. (2023a) Daquan Zhou, Weimin Wang, Hanshu Yan, Weiwei Lv, Yizhe Zhu, and Jiashi Feng. 2023a. Magicvideo: Efficient video generation with latent diffusion models.
  • Zhou et al. (2023b) Yufan Zhou, Bingchen Liu, Yizhe Zhu, Xiao Yang, Changyou Chen, and Jinhui Xu. 2023b. Shifted diffusion for text-to-image generation.

Appendix A More Dataset Construction Details

Our preprocessing approach for video data primarily consists of two steps: resolution normalization and sampling. For resolution, we standardize all video data to 256×256 pixels. The majority of the videos have an aspect ratio of 1:1 or close to it. For videos with unusual aspect ratios, given that crucial information in medical videos is typically centered, we first scale the shorter dimension to 256 pixels and then apply center cropping to achieve the desired resolution. Regarding sampling, to train models that can generate videos with pronounced movements, we set a threshold $K$ to increase the frame interval during sampling. This method not only ensures a dynamic quality in the training data but also augments the dataset volume.
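A sketch of the resolution normalization with torchvision transforms, applied per frame: the shorter side is resized to 256 pixels (preserving aspect ratio) and the centered 256×256 region is kept; the exact resizing backend used in our pipeline may differ.

from torchvision import transforms

normalize_frame = transforms.Compose([
    transforms.Resize(256),       # scale the shorter side to 256 pixels
    transforms.CenterCrop(256),   # keep the centered 256x256 region
])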

In addition, we conducted simple statistics on the generated captions. The results in Figure 5 illustrate the relationship between the number of video frames and the length of the captions. The video duration is largely concentrated around 150 frames, which matches our model's objective of generating five-second videos; nevertheless, other durations are also represented, allowing the model to learn richer temporal information for more accurate output. Caption length averages around 95 characters, with almost no captions shorter than 60 characters, which indirectly reflects the consistency of the captions generated by our pipeline.

Refer to caption
Figure 5: The distribution of video length (on the y-axis) and caption length (on the x-axis) in our text-video pair dataset, along with its fitted curve.

Appendix B More Implementation Details

B.1 Training Details

Our training process was conducted on four A100-80G GPUs. We accelerated the process by setting the data format to bf16, enabling gradient checkpointing, and utilizing ZeRO-2 optimization. Specifically, we set the batch size to 4, the learning rate to $1 \times 10^{-5}$, and the gradient clipping threshold to 1.0. Beyond these settings, we focus below on the selection of pre-trained models within our architecture.
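A simplified single-process version of these settings (learning rate 1e-5, gradient clipping at 1.0, bf16 autocast), reusing the diffusion_loss sketch from Section 4.1; the AdamW optimizer is an assumption, and ZeRO-2 plus gradient checkpointing are left to the distributed launcher:

import torch

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # model: the diffusion transformer

def train_step(batch):
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):  # bf16 mixed precision
        loss = diffusion_loss(model, batch["latents"], batch["text_emb"])
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient clipping
    optimizer.step()
    return loss.item()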

Text Encoder Selection. To obtain representations in the embedding space of text, one can either utilize a text-image encoder or a standard language model. Language models are trained solely on textual corpora, which are substantially larger than paired image-text datasets, thereby exposing them to a rich and diverse distribution of text. These models are typically much larger than the text encoders used in current image-text models. In this paper, we opt for the T5 Raffel et al. (2020) model from the language model category. T5 retains most of the original Transformer architecture, featuring sequence-independent self-attention that uses dot products instead of recursion to explore relationships between each word and all other words in a sequence. Positional encodings are added to the word embeddings prior to dot products; unlike the original Transformer, which uses sinusoidal positional encodings, T5 employs relative positional embeddings. In T5, positional encoding relies on the extension of self-attention to compare pairwise relationships, with shared positional encodings that are reevaluated across all layers of the model. As noted in Imagen, T5 demonstrates significant advantages in alignment and fidelity over image-text models such as CLIP. Therefore, we have reason to believe that this large-scale model, even without training in medical terminology, is sufficient as an encoder for encoding medical prompts.

Video Encoder Selection. Sora employs a spatiotemporal VAE to reduce the temporal dimension. However, no high-quality spatiotemporal VAE is available as open source, and Open-Sora has indicated that the 2×4×4 VAE of VideoGPT is of low quality. Consequently, we opted to use a 2D VAE from Stability AI.

Appendix C More Bora Samples and Comparisons with Other Models

C.1 Endoscopy

Refer to caption
Figure 6: More samples generated by Bora in the endoscopy modality.

C.2 Ultrasound

Refer to caption
Figure 7: More samples generated by Bora in the ultrasound modality.

C.3 RT-MRI

Refer to caption
Figure 8: More samples generated by Bora in the RT-MRI modality.

C.4 Cell

Refer to caption
Figure 9: More samples generated by Bora in the cell modality.

Appendix D Bora vs Others

D.1 Bora vs Pika

Refer to caption
Figure 10: Comparison between Bora and Pika on simple prompts across the four modalities.

D.2 Bora vs PixVerse

Refer to caption
Figure 11: Comparison between Bora and PixVerse on simple prompts across the four modalities.

D.3 Bora vs Gen-2

Refer to caption
Figure 12: Comparison between Bora and Gen-2 on simple prompts across the four modalities.

D.4 Bora vs ModelScope

Refer to caption
Figure 13: Comparison between Bora and ModelScope on simple prompts across the four modalities.

D.5 Bora vs Lavie

Refer to caption
Figure 14: Comparison between Bora and Lavie on simple prompts across the four modalities.

Appendix E Failure Examples

Refer to caption
Figure 15: Two failure examples generated by our Bora and their corresponding medical prompts.