PolySmart @ TRECVid 2024 Medical Video Question Answering
Abstract
In this paper, we summarize our submitted runs and results for the Medical Video Question Answering task at TRECVid 2024 [1].
Video Corpus Visual Answer Localization (VCVAL): This task includes question-related video retrieval and visual answer localization in the videos. Specifically, we use text-to-text retrieval to find relevant videos for a medical question based on the similarity between the video transcript and answers generated by GPT-4. For visual answer localization, the start and end timestamps of the answer are predicted by aligning both the visual content and the subtitles with the query. We submit five runs this year, briefly summarized as follows:
• Run 1: Achieves a MAP of 0.1401 using top-10 text-to-text retrieval. This run computes the similarity between the original question and the video transcript using two sentence-transformer models: PubMedBERT and MiniLM.
• Run 2: Achieves a MAP of 0.1305 with top-10 text-to-text retrieval. This run evaluates the similarity between GPT-4-generated question-answer pairs and the video transcript, using the same sentence-transformer models as in Run 1.
• Run 3: Achieves a MAP of 0.1348 with top-100 text-to-text retrieval. This run combines the results of Run 1 and Run 2.
• Run 4: Achieves a MAP of 0.1087 with top-10 text-to-text retrieval. This run takes the mean similarity between the original question and the video transcript, using the same sentence-transformer models as in Run 1.
• Run 5: A novel approach using top-100 text-to-vision retrieval with BLIP-2 features, achieving a MAP of 0.0466.
Query-Focused Instructional Step Captioning (QFISC): For this task, the step captions are generated by GPT-4. Specifically, we provide the video captions generated by the LLaVA-Next-Video model and the video subtitles with timestamps as context, and ask GPT-4 to generate step captions for the given medical query. We submit only one run for evaluation, which obtains an F-score of 11.92 and a mean IoU of 9.6527.
1 Video Corpus Visual Answer Localization
Video Corpus Visual Answer Localization (VCVAL) presents unique challenges compared to general text-to-video retrieval [2, 3, 4, 5, 6, 7] due to the specialized nature of medical content. Unlike general videos, medical videos convey critical information through both precise visual details (e.g., procedures, anatomy) and specific medical terminology, including abbreviations that may not match the video content directly. This can lead to difficulties in retrieval if the model fails to recognize medical abbreviations in the query, reducing retrieval accuracy. To address these challenges, we introduce a two-step approach combining video retrieval and precise segment localization.
Given the multimodal nature of medical videos, accurately locating answers requires both identifying relevant videos and pinpointing specific segments. Our method performs text-to-text retrieval using video transcripts, obtained via the YouTube API, with sentence transformer embeddings for the query and transcript text. Cosine similarity ranks the results, retrieving top-10 or top-100 videos. In the second stage, a dual-predictor [8] system—comprising a Textual Predictor and a Visual Predictor—focuses on complementary aspects of video content to refine segment localization. A cross-modal knowledge transfer mechanism with a lookup table facilitates information exchange between predictors, enabling adaptive knowledge sharing. This system captures both visual and textual nuances, enhancing answer localization accuracy. An Optimized Dynamic Learning (ODL) module adjusts knowledge transfer based on each predictor’s needs, further improving robustness across varying scenarios.
1.1 Related Video Retrieval by Text
In the first step of our approach, we perform text-to-text retrieval to identify videos relevant to the medical query. This step reduces the search space by retrieving a subset of the most relevant videos before moving to the finer task of segment localization.
1.1.1 Video Transcript Extraction and Question Expansion
For each video $v$ in the corpus, we use the YouTube API to extract its transcript, denoted as $T_v$. This transcript represents the spoken content within the video.
To enhance the query's richness and handle medical terminology, we generate an additional embedding based on a GPT-4-enhanced query answer $q'$. The prompt for GPT-4 is:
"You act as a medical or a health helper. Given a list of medical or health-related how-to questions, output the instructions step by step."
1.1.2 Text Feature Extraction and Alignment
We utilize a sentence transformer model [9, 10] to encode both the query and the transcripts into vector representations in a semantic embedding space. Given a query $q$, its GPT-4-enhanced answer $q'$, and a video transcript $T_v$, we compute the embeddings as follows:
$$\mathbf{e}_q = f(q), \quad \mathbf{e}_{q'} = f(q'), \quad \mathbf{e}_{T_v} = f(T_v),$$
where $f(\cdot)$ denotes the sentence transformer encoder.
We compute the cosine similarity between the embedding of each video transcript $\mathbf{e}_{T_v}$ and both the original query embedding $\mathbf{e}_q$ and the GPT-4-enhanced query embedding $\mathbf{e}_{q'}$. The cosine similarity for each query embedding $\mathbf{e} \in \{\mathbf{e}_q, \mathbf{e}_{q'}\}$ with respect to $\mathbf{e}_{T_v}$ is defined as:
$$\mathrm{sim}(\mathbf{e}, \mathbf{e}_{T_v}) = \frac{\mathbf{e} \cdot \mathbf{e}_{T_v}}{\|\mathbf{e}\| \, \|\mathbf{e}_{T_v}\|}.$$
The final similarity score for each video is the maximum of the two similarity values:
$$s_v = \max\big(\mathrm{sim}(\mathbf{e}_q, \mathbf{e}_{T_v}),\; \mathrm{sim}(\mathbf{e}_{q'}, \mathbf{e}_{T_v})\big).$$
Based on the computed similarity score $s_v$ for each video transcript $T_v$, we rank the videos and retrieve the top-10 or top-100 videos with the highest scores. These selected videos are then passed to the subsequent localization stage.
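As an illustration of this retrieval stage, the sketch below embeds the original query, the GPT-4-expanded answer, and the video transcripts with a sentence transformer, scores each video by the maximum cosine similarity, and returns the top-ranked videos. The MiniLM checkpoint is the one cited in [10]; the function name and data layout are assumptions, and the PubMedBERT run swaps in a PubMedBERT-based sentence transformer in the same way.

```python
# Hedged sketch of the text-to-text retrieval stage.
import torch
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def rank_videos(query: str, expanded_query: str, transcripts: dict, top_k: int = 10):
    """Return the top_k (video_id, score) pairs ranked by max(sim(q, T_v), sim(q', T_v))."""
    video_ids = list(transcripts.keys())
    e_q = model.encode(query, convert_to_tensor=True)
    e_q_prime = model.encode(expanded_query, convert_to_tensor=True)
    e_t = model.encode([transcripts[v] for v in video_ids], convert_to_tensor=True)

    sim_q = util.cos_sim(e_q, e_t)[0]              # similarity to the original query
    sim_q_prime = util.cos_sim(e_q_prime, e_t)[0]  # similarity to the GPT-4 answer
    scores = torch.maximum(sim_q, sim_q_prime)     # s_v = max of the two

    ranked = scores.argsort(descending=True)[:top_k]
    return [(video_ids[i], float(scores[i])) for i in ranked]
```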
1.2 Visual Answer Localization by Multi-Modal Collaboration
In tackling the Video Corpus Visual Answer Localization (VCVAL) task, we introduce Multi-Modal Collaborative Localization (MCL, Figure 1), which employs a synergistic approach to cross-modal learning by integrating feature extraction, cross-modal fusion, and adaptive knowledge transfer.

1.2.1 Feature Extraction
We utilize the I3D [11] model, a powerful network for video processing, to encode the visual information from each video segment. I3D operates by inflating 2D convolutional filters into 3D, allowing it to effectively capture both spatial and temporal features. This model generates a feature matrix $V \in \mathbb{R}^{n \times d_v}$, where $n$ is the number of video frames and $d_v$ represents the feature dimension. By using I3D, we gain high-quality visual representations that capture both static spatial information and dynamic motion patterns crucial for video answer localization.
For the textual modality, we employ DeBERTa [12] to process the concatenated text, comprising the query $q$ and the relevant video captions. This produces a feature matrix $Q \in \mathbb{R}^{m \times d_t}$, where $m$ is the length of the concatenated text tokens, representing the textual information in alignment with the video content.
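For concreteness, a minimal sketch of the textual encoding step is given below, assuming a publicly available DeBERTaV3 checkpoint from HuggingFace; the actual checkpoint, truncation strategy, and fine-tuning details in our system may differ.

```python
# Illustrative sketch of extracting token-level textual features with DeBERTaV3.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
encoder = AutoModel.from_pretrained("microsoft/deberta-v3-base")

def encode_text(query: str, captions: str) -> torch.Tensor:
    """Encode the concatenated query and captions into token-level features Q of shape (m, d_t)."""
    inputs = tokenizer(query, captions, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = encoder(**inputs)
    return outputs.last_hidden_state.squeeze(0)  # (m, d_t)
```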
1.2.2 Cross-Modal Fusion
The MCL approach applies Context Query Attention (CQA) [13] to merge visual and textual features effectively. CQA leverages two attention mechanisms: query-to-context and context-to-query, facilitating deeper semantic interaction between the modalities. This fusion step results in enhanced feature representations, enabling better alignment between the video content and the query.
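The sketch below gives a simplified PyTorch version of context-query attention in the spirit of [13]: a trilinear similarity matrix between frame and token features drives context-to-query and query-to-context attention, and the attended features are concatenated and projected. It is an illustrative reduction, not our exact fusion module.

```python
# Simplified context-query attention (CQA) sketch.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextQueryAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.w = nn.Linear(3 * dim, 1, bias=False)  # trilinear similarity score
        self.proj = nn.Linear(4 * dim, dim)         # fuse the attended features

    def forward(self, video: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # video: (n, d) frame features; query: (m, d) token features
        n, m = video.size(0), query.size(0)
        v = video.unsqueeze(1).expand(n, m, -1)
        q = query.unsqueeze(0).expand(n, m, -1)
        sim = self.w(torch.cat([v, q, v * q], dim=-1)).squeeze(-1)       # (n, m) similarity matrix

        a = F.softmax(sim, dim=1) @ query                                # context-to-query attention, (n, d)
        b = F.softmax(sim, dim=1) @ F.softmax(sim, dim=0).t() @ video    # query-to-context attention, (n, d)
        fused = torch.cat([video, a, video * a, video * b], dim=-1)      # (n, 4d)
        return self.proj(fused)                                          # (n, d) fused representation
```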
1.2.3 Dual Predictors for Localization
To determine the start and end of the visual answer, MCL utilizes two predictors: a Visual Predictor and a Textual Predictor. The Visual Predictor employs LSTMs followed by feedforward networks to identify key time points in the video, while the Textual Predictor, based on the structure of question-answering networks, predicts relevant time spans from textual features. This dual predictor setup ensures robust answer localization by leveraging insights from both modalities.
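A hedged sketch of one such predictor is shown below: a bidirectional LSTM over the fused features followed by feedforward heads that score each frame as a start or end boundary. The hidden size and head structure are illustrative choices rather than our exact configuration.

```python
# Sketch of a visual boundary predictor (BiLSTM + feedforward heads).
import torch
import torch.nn as nn

class VisualSpanPredictor(nn.Module):
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        self.start_head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.end_head = nn.Sequential(
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, fused: torch.Tensor):
        # fused: (batch, n, d) cross-modal features for n frames
        h, _ = self.lstm(fused)                        # (batch, n, 2*hidden)
        start_logits = self.start_head(h).squeeze(-1)  # (batch, n) start-boundary scores
        end_logits = self.end_head(h).squeeze(-1)      # (batch, n) end-boundary scores
        return start_logits, end_logits
```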
1.2.4 Adaptive Knowledge Transfer
A critical component of MCL is the Adaptive Knowledge Transfer Module. To synchronize the predictions of the Visual and Textual Predictors, we introduce a Lookup Table that facilitates cross-modal knowledge alignment. By mapping predicted time spans from one modality to another, the Lookup Table ensures consistent understanding across the predictors.
MCL employs a One-Way Dynamic Loss Adjustment mechanism from previous work [8], dynamically adjusting the knowledge transfer between modalities based on prediction alignment using an Intersection over Union (IoU) criterion. This process stops gradient flow between predictors, enabling them to independently refine their learning while optimizing for overall consistency. The final loss function combines contributions from each predictor and includes additional loss terms from cross-modal transfer. The total loss is defined as:
$$\mathcal{L} = \mathcal{L}_{v} + \mathcal{L}_{t} + \mathcal{L}_{\mathrm{transfer}},$$
where $\mathcal{L}_{v}$ and $\mathcal{L}_{t}$ are the localization losses of the Visual and Textual Predictors, and $\mathcal{L}_{\mathrm{transfer}}$ collects the cross-modal knowledge transfer terms.
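The snippet below sketches the IoU-based gating idea in simplified form: a temporal IoU between the two predictors' spans decides whether the cross-modal transfer loss is applied. The threshold and hard gating rule are assumptions for illustration and do not reproduce the exact ODL mechanism of [8].

```python
# Simplified sketch of IoU-gated loss combination.
import torch

def temporal_iou(span_a: torch.Tensor, span_b: torch.Tensor) -> torch.Tensor:
    """IoU between two (start, end) time spans, used to judge predictor agreement."""
    inter = (torch.minimum(span_a[1], span_b[1]) - torch.maximum(span_a[0], span_b[0])).clamp(min=0.0)
    union = (torch.maximum(span_a[1], span_b[1]) - torch.minimum(span_a[0], span_b[0])).clamp(min=1e-6)
    return inter / union

def total_loss(loss_visual, loss_textual, loss_transfer, iou, threshold: float = 0.5):
    """Combine the predictor losses; the transfer term is applied only when the
    predicted spans overlap sufficiently (a simplified, hard gate)."""
    gate = 1.0 if float(iou) >= threshold else 0.0
    return loss_visual + loss_textual + gate * loss_transfer
```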
2 Query-Focused Instructional Step Captioning
The Query-Focused Instructional Step Captioning (QFISC) task aims to provide step-by-step textual summaries of visual instructional segments within medical videos in response to specific queries. This task extends the visual answer localization approach by requiring the identification of instructional boundaries and the generation of detailed captions for each instructional step, resulting in a comprehensive response tailored to the medical query.
Using LLaVA NEXT 32B [14] with GPT-4 [15]. Our approach begins by using LLaVA NEXT 32B to generate initial captions for each relevant instructional segment in the video. These generated captions are then combined with the original captions from the video, creating a rich dataset that encompasses both generated and existing linguistic cues. This combined data is fed into GPT-4, which processes the information to produce the final output. GPT-4 generates the time ranges for each instructional segment and formulates detailed step-by-step instructions, ensuring that the response aligns accurately with both the visual content and the query requirements.
2.1 Instructional Video Question Answering
The InstructVQA model (Figure 2) is designed to tackle the Query-Focused Instructional Step Captioning (QFISC) task by generating structured, step-by-step captions from visual instructional segments in response to a medical query. InstructVQA combines advanced vision-language models with language generation techniques to create temporally-aligned, query-specific instructional summaries.

InstructVQA begins by utilizing LLaVA NEXT 32B [16], a powerful vision-language model, to generate preliminary captions for the relevant instructional segments of a video. Given a medical query $q$ and input video $V$, LLaVA NEXT 32B outputs a set of captions that correspond to various instructional steps identified in the video:
$$C_g = \{c_1, c_2, \dots, c_k\} = \mathrm{LLaVA}(V, q),$$
where $C_g$ represents the captions generated from the visual features of $V$ aligned with the query $q$.
To enhance the depth and accuracy of the generated captions, InstructVQA combines the generated captions $C_g$ with the original video subtitles or captions $C_s$. This merged caption set, $C = C_g \cup C_s$, incorporates both the generated instructional content and the existing linguistic cues in the video, resulting in a comprehensive set of textual data for the next stage.
The combined captions $C$ are fed into GPT-4, which processes this text to identify distinct instructional steps. GPT-4 analyzes $C$ to determine the time range and description for each step, providing a temporally structured and semantically rich response to the query $q$. For each step $i$, GPT-4 outputs a tuple containing the start and end times and a descriptive caption $d_i$, formulated as follows:
$$\{(t_i^{\mathrm{start}}, t_i^{\mathrm{end}}, d_i)\}_{i=1}^{N} = \mathrm{GPT\text{-}4}(C, q),$$
where $N$ is the number of instructional steps detected. This allows InstructVQA to produce a sequence of time-aligned, step-by-step instructions.
The final output from InstructVQA is a structured sequence of instructional steps, each aligned with a specific time range and detailed caption, forming a coherent and query-focused instructional guide. Each step directly corresponds to the query , enhancing the usability and relevance of the response.
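To illustrate the final stage of InstructVQA, the sketch below merges the LLaVA-generated captions with the timestamped subtitles and asks GPT-4 to return time-aligned steps as JSON. The prompt wording, model identifier, and output schema are illustrative assumptions, not the exact configuration used in our run.

```python
# Hedged sketch of the caption-merging and GPT-4 step-generation stage.
import json
from openai import OpenAI

client = OpenAI()

def generate_steps(query: str, llava_captions: list, subtitles: list) -> list:
    """Return a list of {"start", "end", "caption"} instructional steps."""
    context = json.dumps({"generated_captions": llava_captions, "subtitles": subtitles})
    prompt = (
        f"Medical query: {query}\n"
        f"Video context (generated captions and timestamped subtitles): {context}\n"
        "Identify the instructional steps that answer the query. Return a JSON list "
        "of objects with 'start', 'end' (in seconds) and 'caption' fields."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(response.choices[0].message.content)
```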
3 Results analysis
3.1 Video Corpus Visual Answer Localization (VCVAL)
To evaluate the effectiveness of our proposed method, we conducted experiments on the video retrieval stage of the VCVAL task. The results are summarized in Table 1. The performance metrics include Mean Average Precision (MAP), Recall at top 5 (R@5) and top 10 (R@10), Precision at top 5 (P@5) and top 10 (P@10), and normalized Discounted Cumulative Gain (nDCG).
The results in Table 1 demonstrate that our method achieves competitive performance across all metrics. Among our runs, RunID 1 yielded the highest MAP of 0.1401, while RunID 3 also achieved strong results with a MAP of 0.1348. The minimum, mean, and maximum values across all runs submitted to the VCVAL task are included for comparison.
Table 1: Results of the video retrieval stage on the VCVAL task. Min, Mean, and Max are computed over all runs submitted to the task.

| RunID | Model | MAP | R@5 | R@10 | P@5 | P@10 | nDCG |
|-------|-------|-----|-----|------|-----|------|------|
| 1 | Question vs. transcript (max similarity) | 0.1401 | 0.1799 | 0.2094 | 0.1115 | 0.0635 | 0.1955 |
| 2 | GPT-4 QA pairs vs. transcript | 0.1305 | 0.1767 | 0.2094 | 0.1154 | 0.0635 | 0.1892 |
| 3 | Run 1 + Run 2 | 0.1348 | 0.1643 | 0.1998 | 0.1077 | 0.0615 | 0.2009 |
| 4 | Question vs. transcript (mean similarity) | 0.1087 | 0.1539 | 0.1810 | 0.0885 | 0.0558 | 0.1569 |
| 5 | BLIP-2 [9] text-to-vision | 0.0466 | 0.0365 | 0.0756 | 0.0231 | 0.0212 | 0.1167 |
| Min | | 0.0027 | 0.0027 | 0.0027 | 0.0038 | 0.0019 | 0.0031 |
| Mean | | 0.1756 | 0.1972 | 0.2221 | 0.1154 | 0.0649 | 0.2306 |
| Max | | 0.4339 | 0.4565 | 0.4857 | 0.2385 | 0.1308 | 0.5443 |
3.1.1 Qualitative result
In our experiments, the model demonstrates strong coverage in retrieving relevant segments within medical videos, though it sometimes lacks precision in identifying exact answer boundaries (Figure 3).

3.2 Query-Focused Instructional Step Captioning (QFISC)

To evaluate our approach for the Query-Focused Instructional Step Captioning (QFISC) task, we measured several metrics, including precision, recall, F-score, overlap IoU (Intersection over Union) at thresholds 3, 5, and 7, and the mean IoU (mIoU). The results for the submitted run, as well as the minimum, mean, and maximum values across all runs, are presented in Table 2.
Table 2: Results on the QFISC task. Min, Mean, and Max are computed over all runs submitted to the task.

| Run ID | Precision | Recall | F-Score | IoU@3 | IoU@5 | IoU@7 | mIoU |
|--------|-----------|--------|---------|-------|-------|-------|------|
| 1 | 12.5489 | 12.1781 | 11.9291 | 12.1779 | 11.7582 | 8.0083 | 9.6527 |
| Min | 12.5489 | 10.5014 | 11.9291 | 9.7003 | 9.4781 | 7.3453 | 8.0271 |
| Mean | 21.7501 | 25.2405 | 22.1168 | 24.2981 | 22.0943 | 14.1882 | 18.1065 |
| Max | 25.8113 | 35.9927 | 28.7081 | 34.7259 | 32.0150 | 20.0946 | 26.0907 |
Our method, InstructVQA, demonstrates competitive performance with a precision of 12.5489, a recall of 12.1781, and an F-score of 11.9291. The overlap IoU metrics (12.1779 at IoU@3, 11.7582 at IoU@5, and 8.0083 at IoU@7) indicate alignment between the generated captions and the video content across the different thresholds.
3.2.1 Examples of QFISC with LLaVA NEXT 32B and GPT-4
In our experiments on Query-Focused Instructional Step Captioning (QFISC), we found that using LLaVA NEXT 32B alone did not yield sufficiently structured instructional captions. However, by incorporating GPT-4 to refine and segment the response, we achieved significantly more coherent and detailed step-by-step captions.
4 Conclusion
In this notebook, we presented our approach to the TRECVid 2024 Medical Video Question Answering tasks, specifically focusing on Video Corpus Visual Answer Localization (VCVAL) and Query-Focused Instructional Step Captioning (QFISC). By combining a dual-predictor system with cross-modal knowledge transfer and adaptive learning, our method effectively addresses the complexities of medical video localization. For QFISC, the integration of LLaVA NEXT 32B with GPT-4 yielded well-structured, detailed instructional captions that align closely with query requirements. Experimental results demonstrate that our methods provide strong coverage and accuracy, showcasing the potential of multimodal approaches in medical video question answering.
Despite these strengths, our approach faces some limitations. The dual-predictor system sometimes lacks precision in identifying exact segment boundaries, leading to overly broad answer spans. Additionally, the reliance on text-to-text retrieval may struggle with medical abbreviations or terminology that vary across video content, potentially affecting retrieval accuracy. Future work could explore refined localization techniques and enhanced handling of domain-specific language to improve precision and robustness further.
5 Acknowledgments
This research project is supported by the National Natural Science Foundation of China (Grant No.: 62372314).
References
- [1] G. Awad, K. Curtis, A. A. Butt, J. Fiscus, A. Godil, Y. Lee, A. Delgado, E. Godard, L. Diduch, Y. Graham, and G. Quénot, “TRECVID 2023 - a series of evaluation tracks in video understanding,” in Proceedings of TRECVID 2023. NIST, USA, 2023.
- [2] J. Wu, Z. Ma, C.-W. Ngo, and S.-H. Zhong, “VIREO@TRECVid 2023: Ad-hoc video search,” in NIST TRECVID Workshop, 2023.
- [3] J. Wu, Z. Ma, and C.-W. Ngo, “VIREO@TRECVid 2022: Ad-hoc video search,” in NIST TRECVID Workshop, 2022.
- [4] J. Wu, Z. Hou, Z. Ma, and C.-W. Ngo, “VIREO@TRECVid 2021: Ad-hoc video search,” in NIST TRECVID Workshop, 2021.
- [5] C.-W. Ngo, S.-A. Zhu, H.-K. Tan, W.-L. Zhao, and X.-Y. Wei, “VIREO at TRECVID 2010: Semantic indexing, known-item search, and content-based copy detection,” in TRECVID, 2010.
- [6] C.-W. Ngo, Y.-G. Jiang, X.-Y. Wei, W. Zhao, F. Wang, X. Wu, and H.-K. Tan, “Beyond semantic search: What you observe may not be what you think,” in IEEE Computer Society, 2008.
- [7] C.-W. Ngo, Z. Pan, X. Wei, X. Wu, H.-K. Tan, and W. Zhao, “Motion driven approaches to shot boundary detection, low-level feature extraction and bbc rushes characterization at TRECVid 2005,” in TRECVID, 2005.
- [8] Y. Weng and B. Li, “Visual answer localization with cross-modal mutual knowledge transfer,” 2022. [Online]. Available: https://arxiv.org/abs/2210.14823
- [9] J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” 2023. [Online]. Available: https://arxiv.org/abs/2301.12597
- [10] N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings using siamese bert-networks,” 2019. [Online]. Available: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
- [11] J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4724–4733.
- [12] P. He, J. Gao, and W. Chen, “Debertav3: Improving deberta using electra-style pre-training with gradient-disentangled embedding sharing,” 2023. [Online]. Available: https://arxiv.org/abs/2111.09543
- [13] H. Zhang, A. Sun, W. Jing, and J. T. Zhou, “Span-based localizing network for natural language video localization,” 2020. [Online]. Available: https://arxiv.org/abs/2004.13931
- [14] Y. Zhang, B. Li, H. Liu, Y. J. Lee, L. Gui, D. Fu, J. Feng, Z. Liu, and C. Li, “Llava-next: A strong zero-shot video understanding model,” April 2024. [Online]. Available: https://llava-vl.github.io/blog/2024-04-30-llava-next-video/
- [15] OpenAI, “Gpt-4 technical report,” 2024. [Online]. Available: https://arxiv.org/abs/2303.08774
- [16] F. Li, R. Zhang, H. Zhang, Y. Zhang, B. Li, W. Li, Z. Ma, and C. Li, “Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models,” 2024. [Online]. Available: https://arxiv.org/abs/2407.07895