
PolySmart @ TRECVid 2024 Medical Video Question Answering

Jiaxin Wu, Yiyang Jiang, Xiao-Yong Wei⋆†, Qing Li
Department of Computing, The Hong Kong Polytechnic University
Department of Computer Science, Sichuan University
[email protected], [email protected],
[email protected], [email protected]

Abstract

In this paper, we summarize our submitted runs and results for the Medical Video Question Answering task at TRECVid 2024 [1].

Video Corpus Visual Answer Localization (VCVAL): This task includes question-related video retrieval and visual answer localization in the videos. Specifically, we use text-to-text retrieval to find relevant videos for a medical question based on the similarity between the video transcripts and the answers generated by GPT-4. For visual answer localization, the start and end timestamps of the answer are predicted by aligning the query with both the visual content and the subtitles. We submit five runs this year, briefly summarized as follows:

  • Run 1: Achieves MAP = 0.1401 using top-10 text-to-text retrieval. This run computes the similarity between the original question and the video transcript using two sentence-transformer models: PubMedBERT and MiniLM.

  • Run 2: Achieves MAP = 0.1305 with top-10 text-to-text retrieval. This run evaluates the similarity between GPT-4 generated question-answer pairs and the video transcript, using the same sentence-transformer models as in Run 1.

  • Run 3: Achieves MAP = 0.1348 for top-100 text-to-text retrieval. This run combines the results of Run 1 and Run 2.

  • Run 4: Achieves MAP = 0.1087 with top-10 text-to-text retrieval. This run takes the mean similarity between the original question and the video transcript, using the same sentence-transformer models as in Run 1.

  • Run 5: A novel approach using top-100 text-to-vision retrieval with BLIP-2 features, achieving MAP = 0.0466.

Query-Focused Instructional Step Captioning (QFISC): For this task, the step captions are generated by GPT-4. Specifically, we provide the video captions generated by the LLaVA-Next-Video model and the video subtitles with timestamps as context, and ask GPT-4 to generate step captions for the given medical query. We submit one run for evaluation; it obtains an F-score of 11.92 and a mean IoU of 9.6527.

1 Video Corpus Visual Answer Localization

Video Corpus Visual Answer Localization (VCVAL) presents unique challenges compared to general text-to-video retrieval [2, 3, 4, 5, 6, 7] due to the specialized nature of medical content. Unlike general videos, medical videos convey critical information through both precise visual details (e.g., procedures, anatomy) and specific medical terminology, including abbreviations that may not match the video content directly. Retrieval accuracy therefore drops when a model fails to recognize medical abbreviations in the query. To address these challenges, we introduce a two-step approach combining video retrieval and precise segment localization.

Given the multimodal nature of medical videos, accurately locating answers requires both identifying relevant videos and pinpointing specific segments. Our method performs text-to-text retrieval using video transcripts, obtained via the YouTube API, with sentence transformer embeddings for the query and transcript text. Cosine similarity ranks the results, retrieving top-10 or top-100 videos. In the second stage, a dual-predictor [8] system—comprising a Textual Predictor and a Visual Predictor—focuses on complementary aspects of video content to refine segment localization. A cross-modal knowledge transfer mechanism with a lookup table facilitates information exchange between predictors, enabling adaptive knowledge sharing. This system captures both visual and textual nuances, enhancing answer localization accuracy. An Optimized Dynamic Learning (ODL) module adjusts knowledge transfer based on each predictor’s needs, further improving robustness across varying scenarios.

1.1 Related Video Retrieval by Text

In the first step of our approach, we perform text-to-text retrieval to identify videos relevant to the medical query. This step reduces the search space by retrieving a subset of the most relevant videos before moving to the finer task of segment localization.

1.1.1 Video Transcript Extraction and Question Expansion

For each video $V_{i}$ in the corpus, we use the YouTube API to extract its transcript, denoted as $T_{i}$. This transcript represents the spoken content within the video.
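
As an illustration, the transcript extraction can be done with the third-party youtube-transcript-api package; this is an assumption on our part, since the text above only specifies the YouTube API, and the call below follows the pre-1.0 interface of that package.

```python
# Illustrative transcript extraction (assumed: youtube-transcript-api < 1.0,
# where get_transcript returns a list of {"text", "start", "duration"} dicts).
from youtube_transcript_api import YouTubeTranscriptApi

def get_transcript_text(video_id: str) -> str:
    """Concatenate the timed caption snippets of one video into a transcript string T_i."""
    snippets = YouTubeTranscriptApi.get_transcript(video_id)
    return " ".join(s["text"] for s in snippets)
```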

To enhance the query’s richness and handle medical terminology, we generate an additional embedding based on a GPT-4-enhanced query answer $q_{\text{GPT-4}}$. The prompt for GPT-4 is:

"You act as a medical or a health helper. Given a list of medical or health-related how-to questions, output the instructions step by step."

1.1.2 Text Feature Extraction and Alignment

We utilize a sentence transformer model [9, 10] to encode both the query and the transcripts into vector representations in a semantic embedding space. Given a query $q$ and a video transcript $T_{i}$, we compute the embeddings as follows:

\mathbf{q}_{\text{orig}} = \text{SentenceTransformer}(q)
\mathbf{t}_{i} = \text{SentenceTransformer}(T_{i})
\mathbf{q}_{\text{GPT-4}} = \text{SentenceTransformer}(q_{\text{GPT-4}})

We compute the cosine similarity between the embedding $\mathbf{t}_{i}$ of each video transcript and both the original query embedding $\mathbf{q}_{\text{orig}}$ and the GPT-4-enhanced query embedding $\mathbf{q}_{\text{GPT-4}}$. The cosine similarity for each query embedding with respect to $T_{i}$ is defined as:

\text{Sim}(\mathbf{q}_{\text{orig}}, \mathbf{t}_{i}) = \frac{\mathbf{q}_{\text{orig}} \cdot \mathbf{t}_{i}}{\|\mathbf{q}_{\text{orig}}\| \, \|\mathbf{t}_{i}\|}
\text{Sim}(\mathbf{q}_{\text{GPT-4}}, \mathbf{t}_{i}) = \frac{\mathbf{q}_{\text{GPT-4}} \cdot \mathbf{t}_{i}}{\|\mathbf{q}_{\text{GPT-4}}\| \, \|\mathbf{t}_{i}\|}

The final similarity score for each video $V_{i}$ is the maximum of the two similarity values:

\text{Sim}_{\text{final}}(q, T_{i}) = \max\left(\text{Sim}(\mathbf{q}_{\text{orig}}, \mathbf{t}_{i}),\ \text{Sim}(\mathbf{q}_{\text{GPT-4}}, \mathbf{t}_{i})\right)

Based on the computed similarity scores $\text{Sim}_{\text{final}}(q, T_{i})$ for each video transcript $T_{i}$, we rank the videos and retrieve the top-10 or top-100 videos with the highest scores. These selected videos are then passed to the subsequent localization stage.
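
The whole retrieval stage can be sketched with the sentence-transformers library as follows; the MiniLM checkpoint is the one referenced in [10], while the ensembling with a PubMedBERT-based encoder mentioned in the abstract is omitted here for brevity.

```python
# Sketch of text-to-text retrieval; the checkpoint name follows [10], and the
# single-encoder setup is a simplification of the PubMedBERT + MiniLM ensemble.
import numpy as np
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

def retrieve(question: str, gpt4_answer: str, transcripts: dict[str, str], top_k: int = 10):
    """Rank videos by max cosine similarity of their transcript to q and q_GPT-4."""
    video_ids = list(transcripts.keys())
    t_emb = model.encode([transcripts[v] for v in video_ids], convert_to_tensor=True)
    q_orig = model.encode(question, convert_to_tensor=True)
    q_gpt4 = model.encode(gpt4_answer, convert_to_tensor=True)

    sim_orig = util.cos_sim(q_orig, t_emb)[0].cpu().numpy()   # Sim(q_orig, t_i)
    sim_gpt4 = util.cos_sim(q_gpt4, t_emb)[0].cpu().numpy()   # Sim(q_GPT-4, t_i)
    sim_final = np.maximum(sim_orig, sim_gpt4)                # element-wise max

    order = np.argsort(-sim_final)[:top_k]
    return [(video_ids[i], float(sim_final[i])) for i in order]
```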

1.2 Visual Answer Localization by Multi-Modal Collaboration

In tackling the Video Corpus Visual Answer Localization (VCVAL) task, we introduce Multi-Modal Collaborative Localization (MCL, Figure 1), which employs a synergistic approach to cross-modal learning by integrating feature extraction, cross-modal fusion, and adaptive knowledge transfer.

Figure 1: Multi-Modal Collaborative Localization

1.2.1 Feature Extraction

We utilize the I3D [11] model, a powerful network for video processing, to encode the visual information from each video segment. I3D operates by inflating 2D convolutional filters into 3D, allowing it to effectively capture both spatial and temporal features. This model generates a feature matrix $V\in\mathbb{R}^{k\times d}$, where $k$ is the number of video frames and $d$ represents the feature dimension. By using I3D, we gain high-quality visual representations that capture both static spatial information and dynamic motion patterns crucial for video answer localization.

For the textual modality, we employ DeBERTa [12] to process the concatenated text, including the query $Q$ and the relevant video captions, $T=[Q,T_{1},\dots,T_{r}]$. This produces a feature matrix $T\in\mathbb{R}^{n\times d}$, where $n$ is the number of concatenated text tokens, representing the textual information in alignment with the video content.
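
A simplified sketch of the textual encoding with Hugging Face Transformers follows; the DeBERTa checkpoint name is an assumed placeholder, and the I3D visual features are treated as precomputed inputs.

```python
# Sketch of textual feature extraction with a DeBERTa encoder; the checkpoint
# name is an assumed placeholder, and I3D features are loaded as precomputed.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")  # assumed checkpoint
encoder = AutoModel.from_pretrained("microsoft/deberta-v3-base")

def encode_text(query: str, captions: list[str]) -> torch.Tensor:
    """Encode the concatenated [query, captions] sequence into an n x d feature matrix."""
    text = " ".join([query] + captions)
    inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        outputs = encoder(**inputs)
    return outputs.last_hidden_state.squeeze(0)  # (n, d)

# Visual side: I3D features are assumed to be extracted offline, e.g.
# video_feats = torch.load(f"i3d_features/{video_id}.pt")  # (k, d_v)
```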

1.2.2 Cross-Modal Fusion

The MCL approach applies Context Query Attention (CQA) [13] to merge visual and textual features effectively. CQA leverages two attention mechanisms: query-to-context and context-to-query, facilitating deeper semantic interaction between the modalities. This fusion step results in enhanced feature representations, enabling better alignment between the video content and the query.
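
For concreteness, a simplified version of the CQA block in the spirit of [13] is sketched below; the trilinear similarity and the two attention directions are standard, but this is an illustrative module rather than the exact implementation we used.

```python
# Simplified context-query attention (CQA) sketch; an illustration of the fusion
# described in [13], not the exact module used in our system.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextQueryAttention(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        # Trilinear similarity: S[b, i, j] = w^T [v_i ; q_j ; v_i * q_j]
        self.w_v = nn.Linear(dim, 1, bias=False)
        self.w_q = nn.Linear(dim, 1, bias=False)
        self.w_vq = nn.Parameter(0.01 * torch.randn(1, 1, dim))
        self.proj = nn.Linear(4 * dim, dim)

    def forward(self, video: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # video: (B, k, d) visual features, query: (B, n, d) textual features
        s = (self.w_v(video)                                            # (B, k, 1)
             + self.w_q(query).transpose(1, 2)                          # (B, 1, n)
             + torch.matmul(video * self.w_vq, query.transpose(1, 2)))  # (B, k, n)
        a = torch.matmul(F.softmax(s, dim=2), query)           # context-to-query, (B, k, d)
        b = torch.matmul(torch.matmul(F.softmax(s, dim=2),
                                      F.softmax(s, dim=1).transpose(1, 2)),
                         video)                                 # query-to-context, (B, k, d)
        fused = torch.cat([video, a, video * a, video * b], dim=-1)
        return self.proj(fused)                                 # (B, k, d)
```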

1.2.3 Dual Predictors for Localization

To determine the start and end of the visual answer, MCL utilizes two predictors: a Visual Predictor and a Textual Predictor. The Visual Predictor employs LSTMs followed by feedforward networks to identify key time points in the video, while the Textual Predictor, based on the structure of question-answering networks, predicts relevant time spans from textual features. This dual predictor setup ensures robust answer localization by leveraging insights from both modalities.
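
As an illustration, a minimal visual predictor of the kind described above could look as follows; the layer sizes and the bidirectional LSTM are assumptions of the sketch.

```python
# Minimal sketch of the visual span predictor: an LSTM over fused features
# followed by feed-forward heads for start/end logits; sizes are illustrative.
import torch
import torch.nn as nn

class VisualPredictor(nn.Module):
    def __init__(self, dim: int, hidden: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        self.start_head = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.end_head = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, fused: torch.Tensor):
        # fused: (B, k, d) query-aware video features from the CQA module
        h, _ = self.lstm(fused)                        # (B, k, 2 * hidden)
        start_logits = self.start_head(h).squeeze(-1)  # (B, k)
        end_logits = self.end_head(h).squeeze(-1)      # (B, k)
        return start_logits, end_logits

# The predicted span is the (start, end) frame pair with the highest joint score
# subject to start <= end, mapped back to timestamps via the frame rate.
```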

1.2.4 Adaptive Knowledge Transfer

A critical component of MCL is the Adaptive Knowledge Transfer Module. To synchronize the predictions of the Visual and Textual Predictors, we introduce a Lookup Table that facilitates cross-modal knowledge alignment. By mapping predicted time spans from one modality to another, the Lookup Table ensures consistent understanding across the predictors.

MCL employs a One-Way Dynamic Loss Adjustment mechanism from previous work [8], dynamically adjusting the knowledge transfer between modalities based on prediction alignment using an Intersection over Union (IoU) criterion. This process stops gradient flow between predictors, enabling them to independently refine their learning while optimizing for overall consistency. The final loss function combines contributions from each predictor and includes additional loss terms from cross-modal transfer. The total loss is defined as:

\text{Loss} = \text{Loss}_{\text{Visual}} + \text{Loss}_{\text{Textual}} + \text{Loss}_{\text{Transfer}}^{\text{Visual}} + \text{Loss}_{\text{Transfer}}^{\text{Textual}}
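
To make the IoU-based adjustment concrete, a small helper over predicted time spans is sketched below; the rule that the better-aligned predictor teaches the other is our reading of the one-way transfer in [8], and the exact criterion may differ.

```python
# Temporal IoU helper used to gate cross-modal knowledge transfer; the direction
# rule is an assumption based on our reading of [8], not its exact criterion.
def temporal_iou(span_a: tuple[float, float], span_b: tuple[float, float]) -> float:
    """IoU of two [start, end] time spans (in seconds)."""
    inter = max(0.0, min(span_a[1], span_b[1]) - max(span_a[0], span_b[0]))
    union = max(span_a[1], span_b[1]) - min(span_a[0], span_b[0])
    return inter / union if union > 0 else 0.0

def transfer_direction(visual_span, textual_span, reference_span) -> str:
    """One-way transfer: the predictor whose span aligns better with the
    reference span teaches the other (gradients are stopped on the teacher)."""
    if temporal_iou(visual_span, reference_span) >= temporal_iou(textual_span, reference_span):
        return "visual_to_textual"
    return "textual_to_visual"
```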

2 Query-Focused Instructional Step Captioning

The Query-Focused Instructional Step Captioning (QFISC) task aims to provide step-by-step textual summaries of visual instructional segments within medical videos in response to specific queries. This task extends the visual answer localization approach by requiring the identification of instructional boundaries and the generation of detailed captions for each instructional step, resulting in a comprehensive response tailored to the medical query.

Using LLaVA NEXT 32B [14] with GPT-4 [15]. Our approach begins by using LLaVA NEXT 32B to generate initial captions for each relevant instructional segment in the video. These generated captions are then combined with the original captions from the video, creating a rich dataset that encompasses both generated and existing linguistic cues. This combined data is fed into GPT-4, which processes the information to produce the final output. GPT-4 generates the time ranges for each instructional segment and formulates detailed step-by-step instructions, ensuring that the response aligns accurately with both the visual content and the query requirements.

2.1 Instructional Video Question Answering

The InstructVQA model (Figure 2) is designed to tackle the Query-Focused Instructional Step Captioning (QFISC) task by generating structured, step-by-step captions from visual instructional segments in response to a medical query. InstructVQA combines advanced vision-language models with language generation techniques to create temporally-aligned, query-specific instructional summaries.

Figure 2: InstructVQA

InstructVQA begins by utilizing LLaVA NEXT 32B [16], a powerful vision-language model, to generate preliminary captions for the relevant instructional segments of a video. Given a medical query $Q$ and an input video $V$, LLaVA NEXT 32B outputs a set of captions $C=[c_{1},c_{2},\dots,c_{m}]$ that correspond to various instructional steps identified in the video:

C = \text{LLaVA-NEXT-32B}(V, Q)

where $C$ represents the captions generated based on the visual features of $V$ aligned with the query $Q$.

To enhance the depth and accuracy of the generated captions, InstructVQA combines the generated captions $C$ with the original video subtitles or captions $S=[s_{1},s_{2},\dots,s_{n}]$. This merged caption set, $C_{\text{combined}}=C\cup S$, incorporates both the generated instructional content and the existing linguistic cues in the video, resulting in a comprehensive set of textual data for the next stage.

The combined captions $C_{\text{combined}}$ are fed into GPT-4, which processes this text to identify distinct instructional steps. GPT-4 analyzes $C_{\text{combined}}$ to determine the time range and description for each step, providing a temporally structured and semantically rich response to the query $Q$. For each step $i$, GPT-4 outputs a tuple containing the start and end times $[t_{s,i},t_{e,i}]$ and a descriptive caption $d_{i}$, formulated as follows:

\{(t_{s,i}, t_{e,i}, d_{i})\}_{i=1}^{p} = \text{GPT-4}(C_{\text{combined}})

where $p$ is the number of instructional steps detected. This allows InstructVQA to produce a sequence of time-aligned, step-by-step instructions.

The final output from InstructVQA is a structured sequence of instructional steps, each aligned with a specific time range and detailed caption, forming a coherent and query-focused instructional guide. Each step $(t_{s,i},t_{e,i},d_{i})$ directly corresponds to the query $Q$, enhancing the usability and relevance of the response.
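
The GPT-4 stage of this pipeline can be sketched as follows; the prompt wording, the JSON output contract, and the model name are illustrative assumptions rather than the exact configuration of our submitted run.

```python
# Sketch of the GPT-4 step-captioning stage of InstructVQA; the prompt, the JSON
# output format, and the model name are assumptions for illustration only.
import json
from openai import OpenAI

client = OpenAI()

def generate_steps(query: str, llava_captions: list[dict], subtitles: list[dict],
                   model: str = "gpt-4") -> list[dict]:
    """Merge timed LLaVA captions with subtitles (C_combined) and ask GPT-4 for
    step tuples (t_s, t_e, d) that answer the medical query."""
    combined = sorted(llava_captions + subtitles, key=lambda c: c["start"])
    context = "\n".join(f"[{c['start']:.1f}-{c['end']:.1f}] {c['text']}" for c in combined)
    prompt = (
        f"Medical query: {query}\n"
        f"Timed captions and subtitles:\n{context}\n"
        'Return a JSON list of instructional steps, each with fields '
        '"start", "end" (seconds) and "description".'
    )
    response = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return json.loads(response.choices[0].message.content)
```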

3 Results Analysis

3.1 Video Corpus Visual Answer Localization (VCVAL)

To evaluate the effectiveness of our proposed method, we conducted experiments on the video retrieval stage of the VCVAL task. The results are summarized in Table 1. The performance metrics include Mean Average Precision (MAP), Recall at top 5 (R@5) and top 10 (R@10), Precision at top 5 (P@5) and top 10 (P@10), and normalized Discounted Cumulative Gain (nDCG).

The results in Table 1 show that our method achieves competitive performance across all metrics. Among our runs, Run 1 yielded the highest MAP of 0.1401, while Run 3 also achieved strong results with a MAP of 0.1348. The minimum, mean, and maximum values across all evaluated runs of the task are included for comparison.

RunID | Model | MAP | R@5 | R@10 | P@5 | P@10 | nDCG
1 | $\text{Sim}(\mathbf{q}_{\text{orig}}, \mathbf{t}_{i})$ | 0.1401 | 0.1799 | 0.2094 | 0.1115 | 0.0635 | 0.1955
2 | $\text{Sim}(\mathbf{q}_{\text{GPT-4}}, \mathbf{t}_{i})$ | 0.1305 | 0.1767 | 0.2094 | 0.1154 | 0.0635 | 0.1892
3 | Run 1 + Run 2 | 0.1348 | 0.1643 | 0.1998 | 0.1077 | 0.0615 | 0.2009
4 | mean $\text{Sim}(\mathbf{q}_{\text{orig}}, \mathbf{t}_{i})$ | 0.1087 | 0.1539 | 0.1810 | 0.0885 | 0.0558 | 0.1569
5 | BLIP-2 [9] | 0.0466 | 0.0365 | 0.0756 | 0.0231 | 0.0212 | 0.1167
Min | | 0.0027 | 0.0027 | 0.0027 | 0.0038 | 0.0019 | 0.0031
Mean | | 0.1756 | 0.1972 | 0.2221 | 0.1154 | 0.0649 | 0.2306
Max | | 0.4339 | 0.4565 | 0.4857 | 0.2385 | 0.1308 | 0.5443
Table 1: Video retrieval results for the VCVAL task

3.1.1 Qualitative result

In our experiments, the model demonstrates strong coverage in retrieving relevant segments within medical videos, though it sometimes lacks precision in identifying exact answer boundaries (Figure 3).

Figure 3: Qualitative result for moment retrieval of VCVAL Task

3.2 Query-Focused Instructional Step Captioning (QFISC)

Figure 4: Qualitative result for QFISC

To evaluate our approach for the Query-Focused Instructional Step Captioning (QFISC) task, we measured several metrics, including precision, recall, F-score, overlap IoU (Intersection over Union) at thresholds 3, 5, and 7, and the mean IoU (mIoU). The results for the submitted run, as well as the minimum, mean, and maximum values across all runs, are presented in Table 2.

Run ID | Precision | Recall | F-Score | IoU@3 | IoU@5 | IoU@7 | mIoU
1 | 12.5489 | 12.1781 | 11.9291 | 12.1779 | 11.7582 | 8.0083 | 9.6527
Min | 12.5489 | 10.5014 | 11.9291 | 9.7003 | 9.4781 | 7.3453 | 8.0271
Mean | 21.7501 | 25.2405 | 22.1168 | 24.2981 | 22.0943 | 14.1882 | 18.1065
Max | 25.8113 | 35.9927 | 28.7081 | 34.7259 | 32.0150 | 20.0946 | 26.0907
Table 2: Experimental results for the Query-Focused Instructional Step Captioning (QFISC) task

Our method, InstructVQA, demonstrates competitive performance with a precision of 12.5489, a recall of 12.1781, and an F-score of 11.9291. The overlap IoU metrics indicate robust alignment between the generated captions and the video content, with IoU values of 12.18, 11.76, and 8.01 at thresholds 3, 5, and 7, respectively.

3.2.1 Examples of QFISC with LLaVA NEXT 32B and GPT-4

In our experiments on Query-Focused Instructional Step Captioning (QFISC), we found that using LLaVA NEXT 32B alone did not yield sufficiently structured instructional captions. However, by incorporating GPT-4 to refine and segment the response, we achieved significantly more coherent and detailed step-by-step captions.

4 Conclusion

In this notebook, we presented our approach to the TRECVid 2024 Medical Video Question Answering tasks, specifically focusing on Video Corpus Visual Answer Localization (VCVAL) and Query-Focused Instructional Step Captioning (QFISC). By combining a dual-predictor system with cross-modal knowledge transfer and adaptive learning, our method effectively addresses the complexities of medical video localization. For QFISC, the integration of LLaVA NEXT 32B with GPT-4 yielded well-structured, detailed instructional captions that align closely with query requirements. Experimental results demonstrate that our methods provide strong coverage and accuracy, showcasing the potential of multimodal approaches in medical video question answering.

Despite these strengths, our approach faces some limitations. The dual-predictor system sometimes lacks precision in identifying exact segment boundaries, leading to overly broad answer spans. Additionally, the reliance on text-to-text retrieval may struggle with medical abbreviations or terminology that vary across video content, potentially affecting retrieval accuracy. Future work could explore refined localization techniques and enhanced handling of domain-specific language to improve precision and robustness further.

5 Acknowledgments

This research project is supported by the National Natural Science Foundation of China (Grant No.: 62372314).

References

  • [1] G. Awad, K. Curtis, A. A. Butt, J. Fiscus, A. Godil, Y. Lee, A. Delgado, E. Godard, L. Diduch, Y. Graham, and G. Quénot, “TRECVID 2023 - a series of evaluation tracks in video understanding,” in Proceedings of TRECVID 2023. NIST, USA, 2023.
  • [2] J. Wu, Z. Ma, C.-W. Ngo, and S.-H. Zhong, “VIREO@TRECVid 2023: Ad-hoc video search,” in NIST TRECVID Workshop, 2023.
  • [3] J. Wu, Z. Ma, and C.-W. Ngo, “VIREO@TRECVid 2022: Ad-hoc video search,” in NIST TRECVID Workshop, 2022.
  • [4] J. Wu, Z. Hou, Z. Ma, and C.-W. Ngo, “VIREO@TRECVid 2021: Ad-hoc video search,” in NIST TRECVID Workshop, 2021.
  • [5] C.-W. Ngo, S.-A. Zhu, H.-K. Tan, W.-L. Zhao, and X.-Y. Wei, “VIREO at TRECvID 2010: Semantic indexing, known-item search, and content-based copy detection,” in TRECVID, 2010.
  • [6] C.-W. Ngo, Y.-G. Jiang, X.-Y. Wei, W. Zhao, F. Wang, X. Wu, and H.-K. Tan, “Beyond semantic search: What you observe may not be what you think,” in IEEE Computer Society, 2008.
  • [7] C.-W. Ngo, Z. Pan, X. Wei, X. Wu, H.-K. Tan, and W. Zhao, “Motion driven approaches to shot boundary detection, low-level feature extraction and bbc rushes characterization at TRECVid 2005,” in TRECVID, 2005.
  • [8] Y. Weng and B. Li, “Visual answer localization with cross-modal mutual knowledge transfer,” 2022. [Online]. Available: https://arxiv.org/abs/2210.14823
  • [9] J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” 2023. [Online]. Available: https://arxiv.org/abs/2301.12597
  • [10] N. Reimers and I. Gurevych, “Sentence-bert: Sentence embeddings using siamese bert-networks,” 2019. [Online]. Available: https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2
  • [11] J. Carreira and A. Zisserman, “Quo vadis, action recognition? a new model and the kinetics dataset,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4724–4733.
  • [12] P. He, J. Gao, and W. Chen, “Debertav3: Improving deberta using electra-style pre-training with gradient-disentangled embedding sharing,” 2023. [Online]. Available: https://arxiv.org/abs/2111.09543
  • [13] H. Zhang, A. Sun, W. Jing, and J. T. Zhou, “Span-based localizing network for natural language video localization,” 2020. [Online]. Available: https://arxiv.org/abs/2004.13931
  • [14] Y. Zhang, B. Li, h. Liu, Y. j. Lee, L. Gui, D. Fu, J. Feng, Z. Liu, and C. Li, “Llava-next: A strong zero-shot video understanding model,” April 2024. [Online]. Available: https://llava-vl.github.io/blog/2024-04-30-llava-next-video/
  • [15] OpenAI, “Gpt-4 technical report,” 2024. [Online]. Available: https://arxiv.org/abs/2303.08774
  • [16] F. Li, R. Zhang, H. Zhang, Y. Zhang, B. Li, W. Li, Z. Ma, and C. Li, “Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models,” 2024. [Online]. Available: https://arxiv.org/abs/2407.07895