¹¹institutetext: Department of ¹Image and ²Artificial Intelligence
Chung-Ang University, Seoul, Korea
¹¹email: {dgkang, dasolj, dl218218, swpark, hspark, kwonsk, yjkim, paikj}@ipis.cau.ac.kr

VLM-HOI: Vision Language Models for Interpretable Human-Object Interaction Analysis

Donggoo Kang¹ Dasol Jeong¹ Hyunmin Lee² Sangwoo Park¹ Hasil Park¹ Sunkyu Kwon² Yeongjoon Kim² Joonki Paik^1,2,∗

Abstract

The Large Vision Language Model (VLM) has recently addressed remarkable progress in bridging two fundamental modalities. VLM, trained by a sufficiently large dataset, exhibits a comprehensive understanding of both visual and linguistic to perform diverse tasks. To distill this knowledge accurately, in this paper, we introduce a novel approach that explicitly utilizes VLM as an objective function form for the Human-Object Interaction (HOI) detection task (VLM-HOI). Specifically, we propose a method that quantifies the similarity of the predicted HOI triplet using the Image-Text matching technique. We represent HOI triplets linguistically to fully utilize the language comprehension of VLMs, which are more suitable than CLIP models due to their localization and object-centric nature. This matching score is used as an objective for contrastive optimization. To our knowledge, this is the first utilization of VLM language abilities for HOI detection. Experiments demonstrate the effectiveness of our method, achieving state-of-the-art HOI detection accuracy on benchmarks. We believe integrating VLMs into HOI detection represents important progress towards more advanced and interpretable analysis of human-object interactions.

Keywords:

Vision Language Model Human-Object Interaction Knowledge Distillation Contrastive Learning

1 Introduction

Refer to caption — Figure 1: Comparison of Image-Text Matching scores and CLIP similarity for various HOI triplets. While CLIP struggles to capture semantic relationships due to its reliance on simple prompts, VLM models effectively distinguish between positive and negative HOI triplets despite not receiving complete sentences as input.

Human-object interaction (HOI) detection plays a pivotal role in understanding the interactions between humans and objects within visual scenes. This task involves identifying and locating both the objects in an image or video and the specific interactions or actions that humans perform with those objects. The ability to comprehend these intricate relationships is crucial for a wide range of applications, including image captioning[45], robotics[42], action recognition[44], scene understanding[41].

HOI Detection involves two subtasks: 1) localizing the subject (human) and the target (object) of the interaction, and 2) classifying the interaction label that describes the relationship between them. For example, in an image of a person riding a bike, the subject is the person, the target is the bike, and the interaction label is “ride”. By recognizing the actions and interactions, such as “riding a bicycle,” “cooking,” or “playing a musical instrument,” HOI detection enables machines to have a more nuanced understanding of human behavior in visual data. In this era of deep learning and the expanding availability of visual data, HOI detection has made remarkable progress[49, 31, 26, 16, 18, 25, 6, 37, 47, 50, 53, 51, 57]. Deep neural networks have revolutionized the field, enabling the development of more accurate and robust models for recognizing human-object interactions. Additionally, the increasing availability of large-scale datasets and improved computational resources have further accelerated advances in this area.

In recent years, the field of artificial intelligence has witnessed remarkable progress in the integration of two fundamental modalities: vision and language[21, 20, 35, 8, 23, 22, 36, 40, 46, 56]. One of the most notable advancements in this domain has been the development of Large Vision Language Models (VLM), also known as the vision foundation model. The early success of the foundation model is mainly due to the pre-training strategy that trains massive text datasets using self-supervised learning objectives[11, 4, 9]. This means that the model is trained to learn general patterns and features from the data without the need for explicit labels. This is a much more efficient approach than supervised learning, which requires large amounts of labeled data. Through the success of the text-driven foundation model, the vision foundation models also have achieved state-of-the-art results on a wide range of vision tasks, including image classification, object detection, and image captioning. These results show the vision foundation models that pre-trained on vast and diverse datasets, exhibit a deep comprehension of both visual and linguistic information, enabling them to excel in a wide range of tasks that require the fusion of these modalities.

In light of this, we propose a novel approach that leverages the capabilities of large VLM as a distillation form for the HOI detection task (VLM-HOI). Our proposed method measures similarity in predicted HOI triplets, utilizing the concept of the Image-Text matching method. To achieve this, we represent HOI triplets in a linguistic form, capitalizing on the language understanding capabilities inherent in VLMs. This linguistic representation allows us to harness the power of the VLM to comprehend and interpret the interactions between humans and objects in a more nuanced and context-aware manner. Despite not being a complete sentence, VLM can compute the similarity between image and language, as shown in Figure. 1. To explicitly distill the knowledge of VLM to our approach, we utilize the ITM score computed from the Image-Text matching as our objective function, employing a contrastive learning framework. This framework enables the network to understand HOI in arbitrary text form. In our pursuit of advancing HOI detection, we conduct extensive experiments to evaluate the effectiveness of our proposed method.

Experimental results reveal that our approach outperforms existing methods in terms of accuracy and robustness. Through this work, we aim to contribute to the growing body of research at the intersection of vision and language, demonstrating the potential of VLMs as powerful tools for enhancing the understanding of complex visual scenarios involving human-object interactions.

The main contributions of this paper are summarized as follows:

•

We propose a novel approach that leverages the capabilities of large VLMs as a distillation form for the HOI detection task.
•

We present a method that measures similarity in predicted HOI triplets, utilizing the concept of the Image-Text matching method.
•

We develop a contrastive learning framework that enables the network to understand HOI in text form.
•

We evaluate our proposed approach on two challenging HOI detection benchmarks and achieve state-of-the-art results.

2 Related Work

2.1 Human-Object Interaction Detection

Human-object interaction (HOI) detection is an active area of research in computer vision. As noted in the introduction, HOI detection is a high-level task built on top of object detection. Existing HOI detection methods can be categorized into two main approaches: one-stage models and two-stage models.

One-stage models[31, 37] aim to detect human-object interactions in a single feedforward pass through a neural network. These models take an image as input and directly output bounding boxes for humans, objects, and their interactions. While computationally efficient, one-stage models struggle to model complex interactions and scale to large numbers of interaction categories. In contrast, two-stage models[6, 25, 49, 50, 51, 12] separate the tasks of detecting humans and objects from modeling their interactions. In the first stage, off-the-shelf object detectors are used to localize candidate humans and objects in the image. The second stage then classifies their relationship based on appearance and spatial cues. By decomposing the problem, two-stage models are able to achieve higher accuracy but at the cost of reduced speed.

Another content [47, 26, 16, 18, 34, 52] recent work has focused on improving both branches of HOI detection. For one-stage models, newer architectures incorporate attention mechanisms and graphical networks to better model interactions[16, 18, 34]. For two-stage models, progress has been made in embedding space designs and developing robust spatial models[52].

From another perspective, GEN-VLKT[26] proposes a Visual-Linguistic Knowledge Transfer (VLKT) strategy that leverages CLIP to enhance interaction understanding. It uses CLIP text embeddings to initialize the HOI classifiers and mimics CLIP image features. RLIP[47, 48] proposes a pre-training strategy to train a robust backbone network by aligning the image representations of entities and relations with their corresponding text descriptions. Both works are impressive approaches in that leverage VLM and pre-training strategy.

2.2 Vision-Lauguage Model

Integrating vision and language has been a long-standing goal in artificial intelligence. Early work focused on image captioning[39], generating textual descriptions of image contents. Recent years have seen rapid progress in visual question answering[2, 1], enabled by large-scale datasets[7] and deep neural encoder-decoder models[43].

More advanced vision-language tasks require a tighter integration between the visual and linguistic modalities. This has led to a surge of interest in unified multimodal representation models that can process both images and text within a single framework[3].

One line of work explores joint embedding models to learn aligned vector representations for image regions and language fragments[19, 15]. However, these approaches do not explicitly model interactions between modalities.

Recently, pretrained language models, such as BERT[11] and GPT[4], have significantly influenced the vision-language domain. Researchers have extended these models to handle multimodal inputs, leveraging their contextual understanding of text. More recent methods like VisualBERT[23] and LXMERT[38] utilize pretrained BERT weights for joint image-text comprehension.

Unified architectures like ViLBERT[30], VL-BERT[36], UNITER[8], BLIP[21] integrate masked language modeling objectives alongside paired image-text prediction tasks within a single Transformer model. These models set new state-of-the-art results on downstream tasks like visual question answering (VQA), visual reasoning, and image retrieval.

Specifically, we utilize BLIP as our vision-language model backbone. BLIP uses a flexible multimodal encoder-decoder model that can handle both understanding and generation tasks. It also improves training data quality through generating new image captions and filtering noise. These advantages lead to strong performance.

Instruction-based VLMs like GPT-4V[33], LLaVA[28], and InstructBLIP[10] are a new type of large vision-language model that are trained to follow natural language instructions and prompts to perform various tasks. A key advantage these models have over traditional fine-tuning is they can adapt to new tasks without needing gradient updates or lots of specific training data. The instruction format allows rapid adaptation. However, still challenges exist around instruction following, ambiguity, high resource usage, and potential biases.

3 Proposed Method

In this section, we introduce our proposed VLM-HOI method, which is illustrated in Figure 3. As shown, we adopt a detection transformer (DETR) as the feature encoder and object detector, following recent works [16, 18, 57, 37, 51, 49, 53]. A query-based transformer decoder is also utilized to predict HOI triplets, as done in recent studies[16, 18]. The goal of this work is to effectively transfer the comprehensive language understanding capabilities of vision-language models (VLMs) to HOI detection.

In the following section, we first briefly define the baseline VLM model. We then introduce a brief problem definition and present HOI triplet association technique that converts predicted HOI triplets into positive and negative text forms as inputs to the VLM. These text sets are used to compute image-text matching scores with the VLM. Subsequently, we introduce a contrastive learning loss to learn representations of VLM using these matching scores.

3.1 Baseline VLM Model

While large models like CLIP[35] excel at overall image-text similarity, their limitations in pinpointing specific objects become apparent. This paper takes a step forward, focusing on BLIP[21, 20], a VLM specifically designed for object-level understanding, and its potential as a powerful teacher model for knowledge distillation in localization tasks. Prior works[26, 32] have utilized CLIP as a teacher model, achieving notable results. However, we argue that BLIP offers distinct advantages for localization due to its inherent strengths.

•

Object-centric Design: BLIP learns representations for both images and object-specific descriptions, enabling pinpoint localization beyond CLIP’s image-level approach.
•

Fine-grained Matching: Unlike CLIP’s simple cosine distance, BLIP employs sub-word level matching, connecting individual textual words to image regions, and facilitating precise object identification.
•

Specialized Training Data: BLIP is pre-trained on object-centric datasets with detailed descriptions, equipping it with nuanced understanding of object features and spatial relationships, crucial for localization.

Figure 1, suggests that the VLM model is able to better capture the semantic relationships between images and text for natural interactions. In contrast, the CLIP similarity scores are not as discriminative between positive and negative triplets. This suggests that CLIP may not be as effective at capturing the more nuanced relationships between images and text.

3.2 Problem Definition

Human-object interaction (HOI) detection aims to locate human and object instance in an image and classify their interaction relationship. Formally, given an image $I$ , the objective is to detect a set of $N$ interacted human-object pairs $\{(h_{i},o_{i},v_{i})\}$ where $h_{i}\text{ and }o_{i}$ represent detected human and object. $v_{i}$ denotes the interaction between $h_{i}\text{ and }o_{i}$ .

Each $h_{i}$ and $o_{i}$ is represented by a bounding box $b_{i}^{h}$ and $b_{i}^{o}$ respectively. The overall HOI prediction for the image can be denoted as:

H=\{(b_{i}^{h},b_{i}^{o},o_{i},v_{i})|i\in 1,2,...,N\}

(1)

where $H$ is the final output containing $N$ detected HOI triplets. The HOI task can be decomposed into two sub-problems: Human and object detection $H_{det}=\{(b_{i}^{h},b_{i}^{o},o_{i})\}$ and interaction classification given information of human and object, $H_{hoi}=\{(v_{i}|b_{i}^{h},b_{i}^{o},o_{i})\}$ .

3.3 HOI Triplet Association

The objective of the HOI triplet association is to convert these triplets into grounded natural language representations to leverage the language comprehension capabilities of vision-language models (VLMs). Our model first predicts a set of $N$ predicted HOI triplets $T=\{(\hat{h}_{1},\hat{o}_{1},\hat{v}_{1}),...,(\hat{h}_{N},\hat{o}_{N},\hat{v}_{N})\}$ from the input image $I$ , where $\hat{h},\hat{o},\hat{v}$ are the detected human, object and predicted interaction verb, respectively.

Given these predictions, we extract the detected object and interaction class names $o_{i}$ and $v_{i}$ from the classifier outputs for each triplet. We exclude any triplets predicted as “no interaction” or “no object”, denoted by $\oslash$ , as determined by the Hungarian matching. For simplicity, we set all predicted human classes $\hat{h}$ as “A person”. We then construct a positive sentence $S^{pos}_{i}$ for each triplet using the template:

\begin{split}s^{+}_{i}=[\text{``A person''}+\hat{v}_{i}+\text{`` a ''}+\hat{o_{i}}],\quad\text{if }T\in H\\ s^{-}_{i}=[\text{``A person''}+\hat{v}_{i}+\text{`` a ''}+\hat{o_{i}}],\quad\text{if }T\notin H\end{split}

(2)

The examples of these sentences are described in Figure 1. The generated sentences may not be grammatically correct, since the class names in most HOI datasets are not designed to produce fluent phrases. Simply concatenating the human, verb, and object classes can result in unnatural or awkward language. However, these grounded sentences still provide useful context and supervision for knowledge distillation from the pre-trained VLM. The key elements of human, action, and object capture the core semantics of the visual HOI detections in a textual representation. While not linguistically flawless, this allows transferring relevant knowledge about humans, objects, and their interactions from the VLM to the HOI model.

The paired positive and negative sentences provide contrasting language context for the VLM to comprehend the visual HOI concept deeply. This gives us a set of positive sentences $S^{pos}=\{s^{+}_{1},...,s^{+}_{n}\}$ and a set of negative sentences $S^{neg}=\{s^{-}_{1},...,s^{-}_{m}\}$ that are semantically grounded in the visual HOI predictions. By optimizing the VLM representations on these contrasting text sets, we enable the model to learn fine-grained HOI concepts based on its language priors. This facilitates transferring the VLM’s comprehensive language understanding to improve HOI prediction.

3.4 Image-Text Matching based Knowledge Distillation

To harness the image-text matching capabilities of VLMs, we compute similarity scores between a given image $I$ and paired sentence sets: $S^{\text{pos}}$ for positive (correct) associations and $S^{\text{neg}}$ for negative (incorrect) ones. The calculation is formalized as:

\text{sim}(I,S)=\text{VLM}_{\text{itm}}(I,S),

(3)

where $\text{sim}(I,S)$ denotes the image-text similarity score obtained from the VLM for the image $I$ and sentence set $S$ .

Ideally, the similarity scores of positive sets should be as close to zero as possible, while the similarity scores of negative sets should be as high as possible. However, Figure 1 illustrates that similarity scores cannot be zero for positive triplets. This ensures that the loss is never zero, even when all predictions are correct. We will discuss later how this problem inhibits the optimization of other losses.

To explicitly distill the rich knowledge encoded in the VLM’s parameters, we employ these scores in a contrastive learning loss:

\mathcal{L}_{\text{ITM}}=\sum^{n}_{i=0}(\max(0,\alpha-\text{sim}(I,s^{+}_{i})))+\sum^{m}_{j=0}(\text{sim}(I,s^{-}_{j}))

(4)

Here, $\alpha$ serves as positive margins that anchor the contrastive loss. This loss function is structured to optimize the pairing of images with their corresponding positive sentences while disassociating them from negative ones.

The key insight is that the pre-trained vision-language model has learned a robust joint distribution over both visual and textual modalities through extensive pre-training. By optimizing our HOI network to align with the VLM’s image-text similarities via contrastive loss, we essentially distill this strong prior knowledge into our model. This guides the HOI network to find an optimal parameter configuration that encodes visual-linguistic concepts effectively. This distillation leads to more efficient optimization and substantial gains for rare HOI categories that lack sufficient training examples. In essence, the VLM’s contextual knowledge helps regularize the HOI model, reducing overfitting and improving generalization.

3.5 Training and Inference

Following prior query-based HOI detection methods[16, 18, 49], we utilize the Hungarian matching algorithm to associate predicted triplets with ground truth triplets. As illustrated in Figure 3, our network consists of a DETR-based encoder and three parallel decoder branches with task-specific queries $Q^{p}$ , $Q^{a}$ , and $Q^{o}$ for human, object, and interaction prediction respectively. We select BLIP[21] as the VLM to compute image-text matching scores because it has higher performance relative to other methods in terms of computational efficiency. In principle, any sufficiently large VLM can be utilized and continue to improve with more compute.

In addition to the contrastive image-text matching loss $\mathcal{L}_{\text{ITM}}$ , we use a standard HOI detection loss $\mathcal{L}_{\text{HOI}}$ defined as:

\mathcal{L_{\text{HOI}}}=\lambda_{1}\mathcal{L_{\text{L1}}}+\lambda_{2}\mathcal{L_{\text{GIoU}}}+\lambda_{3}\mathcal{L_{\text{oc}}}+\lambda_{4}\mathcal{L_{\text{ic}}}.

(5)

Where $\mathcal{L_{\text{L1}}}$ and $\mathcal{L_{\text{GIoU}}}$ are regression losses for predicting human and object bounding boxes, $\mathcal{L_{\text{oc}}}$ is a classification loss for detecting object categories, $\mathcal{L_{\text{ic}}}$ is a verb classification loss for interaction predictions, and $\lambda_{i}$ is weighting hyper-parameter.

We train our model end-to-end by jointly optimizing the HOI detection loss $\mathcal{L}_{\text{{HOI}}}$ and the image-text matching knowledge distillation loss $\mathcal{L}_{\text{ITM}}$ :

\mathcal{L}_{\text{total}}=\mathcal{L}_{\text{HOI}}+\mathcal{L}_{\text{ITM}}.

(6)

The VLM is kept frozen during training to provide fixed similarity computations. At inference time, we simply feed an image into our trained model to generate the detected HOI triplets. The VLM and image-text matching loss are only used during training for knowledge transfer and are not needed at test time. Therefore, the number of learnable parameters are same as the baseline method.

4 Experimental Results

4.1 Implementation Details

Our model uses a ResNet-50 CNN backbone followed by a 6-layer transformer encoder, with $L=6$ parallel prediction branches. We set $N=64$ queries for HICO-DET and $N=100$ for V-COCO following prior work[49, 18]. The loss weights $\lambda_{i}$ are set to 2.5, 1, 1, 1 for $\mathcal{L}_{\text{L1}}$ , $\mathcal{L}_{\text{GIoU}}$ , $\mathcal{L}_{\text{oc}}$ , and $\mathcal{L}_{\text{ic}}$ respectively. We initialize from a DETR model[5] pretrained on COCO[27] and optimize using AdamW[29] with weight decay 1e-4. The CNN backbone has a learning rate of 1e-5 while all other components use 1e-4, trained for 100 epochs. For V-COCO, the CNN weights are frozen to prevent overfitting and the learning rate is reduced to 4e-5. All experiments use a batch size of 4 on 4 RTX 3090ti GPUs. These optimized hyperparameters enable efficient end-to-end training of our method.

4.2 Dataset

We conduct experiments on two standard HOI detection benchmarks: HICO-DET[14] and V-COCO[13].

HICO-DET consists of 47,776 images, with 38,118 for training and 9,658 for testing. It contains 600 HOI categories composed of 80 object classes and 117 action verbs.

V-COCO is a subset of 10,396 COCO images, split into 5,400 train and 4,964 test images. It has 29 action classes including 4 body motions without object interactions, and the same 80 objects as HICO-DET. In total, V-COCO has 263 unique HOI triplets.

4.3 Evaluation Metric

Following the evaluation protocol from[49, 18], we use mean Average Precision (mAP) as the metric for measuring HOI detection performance. A predicted triplet is considered a true positive if: 1) the detected human and object boxes have over 50% IOU with ground truth, and 2) the predicted interaction categories match the labels.

On HICO-DET, we report mAP on the full 600 classes, 138 rare classes, and 462 non-rare classes. For V-COCO, we evaluate on S1 with 29 actions including body motions, and S2 with 25 actions excluding no-object categories. By benchmarking on these diverse splits, we comprehensively analyze our method’s HOI detection capabilities.

4.4 Comparison with State-of-the-Art

Method	Default			Known Object
Method	Full	Rare	Non-Rare	Full	Rare	Non-Rare
IDN[24]	24.58	20.33	25.86	27.89	23.64	29.16
HOTR[16]	25.10	17.34	27.42	-	-	-
HOI-Trans[57]	26.61	19.15	28.84	29.13	20.98	31.57
QPIC[37]	29.07	21.85	31.23	31.68	24.14	33.93
MSTR[17]	31.17	25.31	32.92	34.02	28.83	35.57
CDN[49]	32.07	27.19	33.53	34.79	29.48	36.38
STIP[53]	32.22	28.15	33.43	35.29	31.43	36.45
UPT[51]	32.62	28.62	33.81	36.08	31.41	37.47
MUREN[18]	32.87	28.67	34.12	35.52	30.88	36.91
GEN-VLKT[26]	33.75	29.25	35.10	36.78	32.75	37.99
VLM-HOI	34.25	30.22	35.20	36.88	33.30	37.75

Table 1: Comparison evaluation of our approach against state-of-the-art methods on the HICO-DET benchmark, excluding comparisons to other Swin Transformer-based models for fair analysis with the same backbone.

We extensively evaluate our VLM-HOI method against state-of-the-art approaches on two standard HOI detection benchmarks - HICO-DET[14] and V-COCO[13].

On HICO-DET (Table 1), our proposed knowledge distillation method improves upon GEN-VLKT[26], which also utilizes external knowledge but only achieves 33.75% and 36.78% mAP on default and known object settings. In contrast, our VLM-HOI obtains 33.64% and 36.88% mAP, highlighting the benefits of our image-text matching objective.

Analyzing the long-tail distribution, the proposed VLM-HOI demonstrates consistent gains over GEN-VLKT, with improvements of 0.62% and 0.55% mAP on rare interactions. This validates our approach of transforming visual detections into grounded text for optimized knowledge transfer from VLMs.

Similarly, on V-COCO (Table 2), the proposed method obtains 69.5% and 72.1% mAP on Scenario 1 and 2, surpassing all previous methods. Here, VLM-HOI outperforms GEN-VLKT by large margins of 7.1% AP on Scenario 1 and 7.7% AP on Scenario 2. This highlights the benefits of our image-text matching objective for richer knowledge transfer.

Notably, we improve over MUREN[18], which uses the same transformer backbone as our method. This shows that the performance gains come from our proposed VLM knowledge distillation, rather than just model architecture.

Method	$\text{AP}^{\#1}_{\text{role}}$	$\text{AP}^{\#2}_{\text{role}}$
GG-Net[54]	54.7	-
HOTR[16]	55.2	64.4
QPIC-R50[37]	58.8	61.0
GEN-VLKT[26]	62.4	64.4
CDN[49]	62.4	64.4
UPT[51]	59.0	64.5
STIP[53]	66.0	70.7
DisTR[55]	66.2	68.5
MUREN[18]	66.5	68.7
VLM-HOI	67.7	70.9

Table 2: Comparison evaluation of our approach against state-of-the-art methods on the V-COCO benchmark

4.5 Ablation Study

Effects of the Margin: A critical hyperparameter in our image-text contrastive loss is the positive sample margin $\alpha$ . This margin controls the lower bound on the similarity scores between positive text prompts and the image. Intuitively, it determines how close we want to pull the positive grounded sentence representations to the image representation.

$\alpha$	$\text{AP}^{\#1}_{\text{role}}$	$\text{AP}^{\#2}_{\text{role}}$
0	66.80	70.53
1	67.73	70.91
2	67.13	70.69

Table 3: Analysis of the positive margin hyperparameter

\alpha

in the image-text contrastive loss on V-COCO.

Prompt	$\text{AP}^{\#1}_{\text{role}}$	$\text{AP}^{\#2}_{\text{role}}$
Verb	67.51	70.27
Object	67.29	70.52
Full	67.73	70.91

Table 4: Performance comparison of different grounded prompt structures for converting predicted HOI triplets to text.

If $\alpha$ is too small or zero, the loss will not effectively separate positive and negative samples, as the lower bound on positive scores is too low. On the other hand, if $\alpha$ is too large, it can overly restrict and dominate the loss landscape.

Table 4 shows the impact of $\alpha$ by evaluating models trained with different values. As shown in Figure 1, the typical similarity scores between positive triplets and the image range from 1 to 2. Setting $\alpha=0$ gives a mAP drop of 1% compared to $\alpha=1$ , indicating that a zero margin fails to sufficiently separate distributions.

Increasing $\alpha=2$ also decreases performance by 0.6% mAP, suggesting that too large of a margin overly restricts the optimization. The best balance is achieved with $\alpha=1$ , which aligns well with the expected positive sample score distribution.

The positive margin hyperparameter should be tuned to match the anticipated similarity score range, in order to enable effective optimization. This provides a principle for setting this important variable when applying our image-text contrastive loss to new domains beyond HOI detection.

Effects of Prompt Variation: In our main experiments, we utilize full grounded sentences to represent the predicted HOI triplets, e.g. “A person hold a tennis racket”. Here we analyze the impact of using only partial prompts without the human/object classes.

As shown in Table 4, using just the interaction verb as the prompt (“is holding”) decreases performance by 1.2% mAP compared to the full sentence. This demonstrates that grounding the textual prompt with the detected human/object classes provides useful context and regularization for the VLM.

We also experiment with a prompt containing just the human and object (“A person tennis racket”). This achieves a 0.8% lower mAP than the full prompt, showing that including the interaction verb is important for capturing the HOI semantics.

Overall, the full grounded sentence prompt leads to the best knowledge transfer from the VLM to our HOI detection model. The human, object, and verb provide complementary contextual cues that, when combined, enable optimized distillation of the VLM’s visual-linguistic knowledge.

4.6 Computational Resource

Method	Parameters	training time
Method	Parameters	inference time
Baseline[18]	41M(DETR)	21.2 min / epoch
Baseline[18]	28M(Decoder)	0.03 sec / frame
VLM-HOI	430M(DETR+BLIP)	50.3 min / epoch
VLM-HOI	28M(Learnable)	0.03 sec / frame

Table 5: Comparison of the model size, training time per epoch, and inference time per frame between our proposed method and relevant baseline approaches. Our proposed VLM-HOI incorporates a 361M parameter BLIP, of which only the 28M decoder is learned.

A core component of our approach is the large pre-trained VLM which increases the computational requirements. As shown in Table 5, we use BLIP with 361M parameters due to hardware constraints, compared to 69M parameters for the baseline HOI network.

The additional VLM computations result in longer training times compared to the baseline - around 2.4 $\times$ in our experiments. However, we found that our method requires fewer training epochs to converge, likely due to the beneficial regularization and optimized initialization from the VLM distillation.

At inference time, our model has the same efficiency as the baseline since the VLM is only used during training for knowledge transfer. We only need to perform a single forward pass through the learned HOI network.

5 Conclusion

In this paper, we present a novel approach that leverages the capabilities of the Large Vision Language Model (VLM) to enhance Human-Object Interaction (HOI) detection. By utilizing VLM as an objective function in the context of HOI detection, we have successfully quantified the similarity of predicted HOI triplets through Image-Text matching, harnessing VLM’s comprehensive understanding of both visual and linguistic modalities. Our experiments have yielded state-of-the-art results in HOI detection accuracy on benchmark datasets, marking a significant advancement in the field. This novel integration of VLM into HOI detection not only showcases the potential of language comprehension capabilities in bridging modalities but also takes a promising stride towards more advanced and interpretable human-object interaction analysis. We hope the findings presented in this paper offer valuable insights and open new avenues for research at the intersection of vision and language.

Acknowledgments

This work was supported partly by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by Korea government (MSIT) [2021-0-01341, Artificial Intelligent Graduate School Program (Chung-Ang University)], and partly by Field-oriented Technology Development Project for Customs Administration through National Research Foundation of Korea (NRF) funded by the Ministry of Science & ICT and Korea Customs Service (2021M3I1A1097911).

References

[1] Anderson, P., He, X., Buehler, C., Teney, D., Johnson, M., Gould, S., Zhang, L.: Bottom-up and top-down attention for image captioning and visual question answering. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 6077–6086 (2018)
[2] Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., Parikh, D.: Vqa: Visual question answering. In: Proceedings of the IEEE international conference on computer vision. pp. 2425–2433 (2015)
[3] Baltrušaitis, T., Ahuja, C., Morency, L.P.: Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence 41(2), 423–443 (2019)
[4] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020)
[5] Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., Zagoruyko, S.: End-to-end object detection with transformers. In: European conference on computer vision. pp. 213–229. Springer (2020)
[6] Chen, J., Yanai, K.: Qahoi: Query-based anchors for human-object interaction detection. In: 2023 18th International Conference on Machine Vision and Applications (MVA). pp. 1–5. IEEE (2023)
[7] Chen, X., Fang, H., Lin, T.Y., Vedantam, R., Gupta, S., Dollar, P., Zitnick, C.L.: Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325 (2015)
[8] Chen, Y.C., Li, L., Yu, L., El Kholy, A., Ahmed, F., Gan, Z., Cheng, Y., Liu, J.: Uniter: Universal image-text representation learning. In: European conference on computer vision. pp. 104–120. Springer (2020)
[9] Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., Barham, P., Chung, H.W., Sutton, C., Gehrmann, S., et al.: Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311 (2022)
[10] Dai, W., Li, J., Li, D., Tiong, A.M.H., Zhao, J., Wang, W., Li, B., Fung, P., Hoi, S.: Instructblip: Towards general-purpose vision-language models with instruction tuning (2023)
[11] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
[12] Gao, C., Zou, Y., Huang, J.B.: ican: Instance-centric attention network for human-object interaction detection. In: British Machine Vision Conference (2018)
[13] Gupta, S., Malik, J.: Visual semantic role labeling. arXiv preprint arXiv:1505.04474 (2015)
[14] Hou, Z., Peng, X., Qiao, Y., Tao, D.: Visual compositional learning for human-object interaction detection. In: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XV 16. pp. 584–600. Springer (2020)
[15] Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3128–3137 (2015)
[16] Kim, B., Lee, J., Kang, J., Kim, E.S., Kim, H.J.: Hotr: End-to-end human-object interaction detection with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 74–83 (2021)
[17] Kim, B., Mun, J., On, K.W., Shin, M., Lee, J., Kim, E.S.: Mstr: Multi-scale transformer for end-to-end human-object interaction detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19578–19587 (2022)
[18] Kim, S., Jung, D., Cho, M.: Relational context learning for human-object interaction detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 2925–2934 (2023)
[19] Kiros, R., Salakhutdinov, R., Zemel, R.S.: Unifying visual-semantic embeddings with multimodal neural language models. In: arXiv preprint arXiv:1411.2539 (2014)
[20] Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597 (2023)
[21] Li, J., Li, D., Xiong, C., Hoi, S.: Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In: International Conference on Machine Learning. pp. 12888–12900. PMLR (2022)
[22] Li, J., Selvaraju, R., Gotmare, A., Joty, S., Xiong, C., Hoi, S.C.H.: Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems 34, 9694–9705 (2021)
[23] Li, L.H., Yatskar, M., Yin, D., Hsieh, C.J., Chang, K.W.: Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557 (2019)
[24] Li, Y.L., Liu, X., Wu, X., Li, Y., Lu, C.: Hoi analysis: Integrating and decomposing human-object interaction. Advances in Neural Information Processing Systems 33, 5011–5022 (2020)
[25] Liao, Y., Liu, S., Wang, F., Chen, Y., Qian, C., Feng, J.: Ppdm: Parallel point detection and matching for real-time human-object interaction detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 482–490 (2020)
[26] Liao, Y., Zhang, A., Lu, M., Wang, Y., Li, X., Liu, S.: Gen-vlkt: Simplify association and enhance interaction understanding for hoi detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 20123–20132 (2022)
[27] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13. pp. 740–755. Springer (2014)
[28] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. arXiv preprint arXiv:2304.08485 (2023)
[29] Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
[30] Lu, J., Goswami, V., Rohrbach, M., Parikh, D., Lee, S.: 12-in-1: Multi-task vision and language representation learning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10437–10446 (2019)
[31] Ma, S., Wang, Y., Wang, S., Wei, Y.: Fgahoi: Fine-grained anchors for human-object interaction detection. arXiv preprint arXiv:2301.04019 (2023)
[32] Ning, S., Qiu, L., Liu, Y., He, X.: Hoiclip: Efficient knowledge transfer for hoi detection with vision-language models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 23507–23517 (2023)
[33] OpenAI: Gpt-4 technical report (2023)
[34] Park, J., Park, J.W., Lee, J.S.: Viplo: Vision transformer based pose-conditioned self-loop graph for human-object interaction detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 17152–17162 (2023)
[35] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)
[36] Su, W., Zhu, X., Cao, Y., Li, B., Lu, L., Wei, F., Dai, J.: Vl-bert: Pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530 (2019)
[37] Tamura, M., Ohashi, H., Yoshinaga, T.: Qpic: Query-based pairwise human-object interaction detection with image-wide contextual information. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10410–10419 (2021)
[38] Tan, H., Bansal, M.: Lxmert: Learning cross-modality encoder representations from transformers. In: arXiv preprint arXiv:1908.07490 (2019)
[39] Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: A neural image caption generator. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3156–3164 (2015)
[40] Wang, T., Yang, T., Danelljan, M., Khan, F.S., Zhang, X., Sun, J.: Ipnet. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4116–4125 (2020)
[41] Xiao, T., Liu, Y., Zhou, B., Jiang, Y., Sun, J.: Unified perceptual parsing for scene understanding. In: Proceedings of the European conference on computer vision (ECCV). pp. 418–434 (2018)
[42] Xie, X., Bhatnagar, B.L., Pons-Moll, G.: Visibility aware human-object interaction tracking from single rgb camera. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 4757–4768 (2023)
[43] Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., Bengio, Y.: Show, attend and tell: Neural image caption generation with visual attention. In: International conference on machine learning. pp. 2048–2057 (2015)
[44] Yan, S., Xiong, Y., Lin, D.: Spatial temporal graph convolutional networks for skeleton-based action recognition. In: Proceedings of the AAAI conference on artificial intelligence. vol. 32 (2018)
[45] Yao, T., Pan, Y., Li, Y., Mei, T.: Exploring visual relationship for image captioning. In: Proceedings of the European conference on computer vision (ECCV). pp. 684–699 (2018)
[46] Yu, J., Wang, Z., Vasudevan, V., Yeung, L., Seyedhosseini, M., Wu, Y.: Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917 (2022)
[47] Yuan, H., Jiang, J., Albanie, S., Feng, T., Huang, Z., Ni, D., Tang, M.: Rlip: Relational language-image pre-training for human-object interaction detection. Advances in Neural Information Processing Systems 35, 37416–37431 (2022)
[48] Yuan, H., Zhang, S., Wang, X., Albanie, S., Pan, Y., Feng, T., Jiang, J., Ni, D., Zhang, Y., Zhao, D.: Rlipv2: Fast scaling of relational language-image pre-training. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 21649–21661 (2023)
[49] Zhang, A., Liao, Y., Liu, S., Lu, M., Wang, Y., Gao, C., Li, X.: Mining the benefits of two-stage and one-stage hoi detection. Advances in Neural Information Processing Systems 34, 17209–17220 (2021)
[50] Zhang, F.Z., Campbell, D., Gould, S.: Spatially conditioned graphs for detecting human-object interactions. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 13319–13327 (2021)
[51] Zhang, F.Z., Campbell, D., Gould, S.: Efficient two-stage detection of human-object interactions with a novel unary-pairwise transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 20104–20112 (2022)
[52] Zhang, F.Z., Yuan, Y., Campbell, D., Zhong, Z., Gould, S.: Exploring predicate visual context in detecting of human-object interactions. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 10411–10421 (2023)
[53] Zhang, Y., Pan, Y., Yao, T., Huang, R., Mei, T., Chen, C.W.: Exploring structure-aware transformer over interaction proposals for human-object interaction detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19548–19557 (2022)
[54] Zhong, X., Qu, X., Ding, C., Tao, D.: Glance and gaze: Inferring action-aware points for one-stage human-object interaction detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 13234–13243 (2021)
[55] Zhou, D., Liu, Z., Wang, J., Wang, L., Hu, T., Ding, E., Wang, J.: Human-object interaction detection via disentangled transformer. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19568–19577 (2022)
[56] Zhou, L., Palangi, H., Zhang, L., Hu, H., Corso, J., Gao, J.: Unified vision-language pre-training for image captioning and vqa. In: Proceedings of the AAAI conference on artificial intelligence. vol. 34, pp. 13041–13049 (2020)
[57] Zou, C., Wang, B., Hu, Y., Liu, J., Wu, Q., Zhao, Y., Li, B., Zhang, C., Zhang, C., Wei, Y., et al.: End-to-end human object interaction detection with hoi transformer. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11825–11834 (2021)