Compositional Learning in Transformer-Based Human-Object Interaction Detection
Abstract
Human-object interaction (HOI) detection is an important part of understanding human activities and visual scenes. The long-tailed distribution of labeled instances is a primary challenge in HOI detection, motivating research on few-shot and zero-shot learning. Inspired by the combinatorial nature of HOI triplets, some existing approaches adopt the idea of compositional learning, in which object and action features are learned individually and re-composed as new training samples. However, these methods follow the CNN-based two-stage paradigm with limited feature extraction ability, and often rely on auxiliary information for better performance. Without introducing any additional information, we propose a transformer-based framework for compositional HOI learning. Human-object pair representations and interaction representations are re-composed across different HOI instances, which involves richer contextual information and promotes the generalization of knowledge. Experiments show that our simple but effective method achieves state-of-the-art performance, especially on rare HOI classes.
Index Terms:
human-object interaction detection, long-tailed distribution problem, compositional learning
I Introduction
Human-object interaction (HOI) detection aims to localize human and object instances in a given image, and recognize interactions of human-object pairs. The task is also formulated as detection of HOI triplets ⟨human, verb, object⟩. Existing HOI detection methods can be divided into two-stage and one-stage methods. Two-stage methods [1, 2, 3] sequentially perform two sub-tasks: object detection and interaction classification. Humans and objects are detected and paired as human-object proposals, and the interaction classifier predicts interaction classes of proposals based on visual features of human-object pairs. Diverse auxiliary information, including human-object spatial configuration [1], human pose [2] and language priors [3], is introduced to provide cues. One-stage methods eliminate the process of enumerating human-object proposals for higher efficiency. Early one-stage approaches [4, 5] detect points or regions of interaction, and perform object detection and interaction classification in parallel. In recent one-stage methods [6, 7, 8], various network architectures based on transformer [9] are proposed to perform end-to-end HOI detection.
A primary challenge in HOI detection lies in the long-tailed distribution of labels across categories. In HICO-Det [1], 138 of the 600 HOI classes have fewer than 10 samples, leading to insufficient training and poor detection accuracy on rare classes. To address this problem, few-shot and zero-shot learning methods [3, 10, 11, 12] have been proposed and have improved detection accuracy on rare and unseen classes. The combinatorial characteristic of HOI triplets naturally inspires the idea of compositional learning [12], in which verbs and objects are learned individually and re-composed to form samples of different HOI classes, as Fig. 1 illustrates.
However, most existing compositional learning methods [11, 12, 13, 14] follow the traditional two-stage paradigm. Limited by the feature extraction ability of CNNs, the original and re-composed samples contain inadequate semantics for inferring HOI classes. Moreover, most of these methods only involve object and action features of local regions in feature re-composition, causing further loss of global contextual information in the re-composed samples. As a result, these methods often rely on additional information, such as human-object spatial configuration and word embeddings, for better performance. The simple idea of compositional learning remains rarely explored in transformer-based methods. Exploiting the excellent feature extraction capacity of the transformer, we believe re-composition between samples with richer visual semantics can promote a more comprehensive understanding of HOIs, without introducing any auxiliary information.
We propose a novel transformer-based framework for compositional HOI learning. Given a pair of input images, our model produces human-object pair representations and interaction representations via the two cascade decoders of CDN [8]. Human-object pair representations predict human and object bounding boxes and object categories, while interaction representations are concatenated with human-object pair representations to predict action classes. We select the representations corresponding to the best predictions matched with the ground truths, and concatenate them across different HOI instances as new interaction samples. Labels of the re-composed samples are likewise re-composed from the ground truth labels of the original samples. On one hand, the visual features extracted by the transformer contain richer global context, which is involved in our sample re-composition; this enables our model to better understand human-object interactions without the help of additional information. On the other hand, sample re-composition not only explicitly generalizes knowledge to rare classes, but also implicitly encourages the model to learn more generalized knowledge insensitive to changes of object and action classes, and therefore helps alleviate the long-tailed distribution problem. Our main contributions can be summarized as follows:
• To the best of our knowledge, we are the first to apply compositional learning to a transformer-based HOI detection framework, without introducing any supplementary information.
• We re-compose human-object pair representations and interaction representations between different HOI instances as new training samples, which involves more global contextual information and promotes knowledge generalization across HOI classes.
• On two benchmark datasets, our method achieves excellent overall performance and state-of-the-art performance on rare HOI classes.
II Related Work
II-A Few-Shot and Zero-Shot HOI Detection
Previous few-shot and zero-shot HOI detection methods can be divided into two groups. One group of methods [3, 15, 10] adopts vision-language joint learning to augment visual features with linguistic knowledge. Peyre et al. [3] learn vision-language embeddings of visual phrases and generate embeddings for unseen triplets via analogies between similar relations. Liu et al. [15] construct a knowledge graph with visual-semantic embeddings to encode multi-level relations among objects, actions and HOIs. Liao et al. [10] transfer visual-linguistic knowledge from CLIP [16] to enhance HOI understanding. The other group of methods re-composes samples to generalize knowledge to rare and unseen HOI classes. Bansal et al. [11] propose a functional generalization module in which object word embeddings are replaced with those of functionally similar objects during training. Hou et al. [12, 13, 14] propose several methods to generate new interaction samples. In VCL [12], they concatenate object and verb features across HOI instances. In FCL [13], they generate fabricated object features from noise and combine them with verb features. In ATL [14], verb features re-defined as affordance features are concatenated with object features from HOI datasets and object detection datasets.
II-B Transformer-Based HOI Detection
Transformer [9] has been successfully applied to a variety of computer vision tasks, e.g. DETR [17] for object detection, which inspired early transformer-based HOI detection methods [6, 7]. In QPIC [6], a transformer encoder aggregates image-wide context, and a transformer decoder transforms a set of queries into embeddings, each directly capturing one human-object pair. In HOTR [7], two decoders respectively generate instance and interaction representations, and each interaction representation associates itself with the corresponding human and object via HO Pointers. Some works [18] exploit deformable attention [19] to process multi-scale feature maps for fine-grained HOI detection. Recently, some works [8, 20] seek to combine the advantages of two-stage and one-stage frameworks. Zhang et al. [8] propose two cascade disentangling decoders, each focusing on its respective subtask, i.e. object detection or interaction classification. Zhang et al. [20] propose a two-stage framework composed of a DETR object detector and a transformer-based interaction head that pairs humans and objects and predicts action classes.
III Proposed Method
III-A Overview
Fig. 2 illustrates the overall pipeline of our method, which is mainly based on CDN [8] and VCL [12]. Given a pair of input images, the backbone and a shared encoder first extract their global features. Then the human-object pair decoder and the interaction decoder transform a set of learnable queries into corresponding representations. A set of feed-forward networks (FFNs) processes the human-object pair representations to predict human and object bounding boxes and object categories, and the human-object pair representations are concatenated with the interaction representations to predict action classes. For compositional learning, we select the representations that generate the best predictions matched with the ground truths, and concatenate them across different HOI instances to produce new training samples. Labels of the re-composed samples are also constructed from the labels of the original samples. Verb-object compositions that do not belong to the HOI categories defined by the dataset are considered infeasible and removed from the re-composed labels.
III-B Network Architecture
We choose the Cascade Disentangling Network (CDN) [8] as our baseline model because of its effectiveness and code availability, and implement our compositional learning method on top of it. We briefly review its architecture and introduce our modification for sample re-composition.
Baseline. CDN consists of a CNN backbone, a shared encoder and two cascade disentangled decoders, named the Human-Object Pair Decoder (HO-PD) and the interaction decoder, respectively. Given an input image $I$, the backbone generates a visual feature map, which is flattened into a sequence of $HW$ feature vectors and fed into the encoder along with a positional encoding $E_{pos}$. The encoder further aggregates image-wide context and produces sequenced visual feature vectors $F_e \in \mathbb{R}^{HW \times C_e}$, denoted as the global memory.

The two decoders take $F_e$, $E_{pos}$ and a set of learnable queries as input, apply self-attention on the queries, conduct multi-head co-attention between the queries and $F_e$, and output the updated queries as representations of shape $N_q \times C_e$, where $N_q$ stands for the number of queries. HO-PD decodes features of human-object pairs, while the interaction decoder captures interaction context. The two decoders are cascaded by initializing the interaction queries $Q^{int}$ with the HO pair representations $R^{HO}$, so that the interaction representations $R^{int}$ are guided by the prior knowledge in $R^{HO}$ to learn the corresponding action class for each HO pair query. The generating process of the representations can be described as:

$R^{HO} = D_{HO}(F_e, E_{pos}, Q^{HO}),$   (1)

$R^{int} = D_{int}(F_e, E_{pos}, Q^{int}), \quad Q^{int} = R^{HO},$   (2)

where $D_{HO}$ and $D_{int}$ denote HO-PD and the interaction decoder, and $Q^{HO}, Q^{int} \in \mathbb{R}^{N_q \times C_e}$ are their query sets.
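To make the cascade concrete, below is a minimal PyTorch sketch of Eqs. (1)-(2) under assumed shapes; the module names, the zero-initialized queries and the omission of the positional encoding are simplifications, not the actual CDN implementation.

```python
import torch
import torch.nn as nn

C_E, N_Q = 256, 64  # query/feature dimension and number of queries (HICO-Det setting)

# Two stacks of standard decoder layers stand in for HO-PD and the interaction decoder.
dec_ho = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=C_E, nhead=8, batch_first=True), num_layers=3)
dec_int = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=C_E, nhead=8, batch_first=True), num_layers=3)

memory = torch.randn(2, 900, C_E)   # F_e: global memory for a pair of images (HW = 900 tokens)
q_ho = torch.zeros(2, N_Q, C_E)     # learnable HO-pair queries (zero-initialized placeholder)

r_ho = dec_ho(tgt=q_ho, memory=memory)    # Eq. (1): HO-pair representations R^{HO}
r_int = dec_int(tgt=r_ho, memory=memory)  # Eq. (2): interaction queries initialized with R^{HO}
print(r_ho.shape, r_int.shape)            # torch.Size([2, 64, 256]) for both
```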
A group of FFNs process the representations to predict HOI triplets. Three FFNs $f_h$, $f_o$ and $f_c$ process $R^{HO}$ to produce human-object pair predictions denoted as $\{(b^h_i, b^o_i, c_i)\}_{i=1}^{N_q}$, where $b^h_i$, $b^o_i$ and $c_i$ stand for the human bounding box, the object bounding box and the object class probability distribution over the $C_{obj}$ object classes, respectively. An action FFN $f_a$ predicts the action classes of human-object pairs, denoted as $\{a_i\}_{i=1}^{N_q}$, where $a_i$ is the action class probability distribution over the $C_{act}$ action classes. Composed of HO pair predictions and interaction predictions, the HOI predictions are denoted as $\{(b^h_i, b^o_i, c_i, a_i)\}_{i=1}^{N_q}$. The prediction process is formulated as:

$(b^h_i, b^o_i, c_i) = \big(f_h(R^{HO}_i), f_o(R^{HO}_i), f_c(R^{HO}_i)\big), \quad a_i = f_a(R^{int}_i).$   (3)
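As a rough illustration of Eq. (3), the following sketch wires simple prediction heads onto the representations; the layer sizes follow Sec. IV-B, but the head names and the extra "no object" slot are assumptions borrowed from DETR-style detectors rather than details taken from CDN's code.

```python
import torch
import torch.nn as nn

C_E, N_OBJ, N_ACT = 256, 80, 117   # channels, COCO object classes, HICO-Det action classes

def box_ffn(out_dim=4):
    # 3-layer box regression FFN with ReLU, as in the implementation details
    return nn.Sequential(nn.Linear(C_E, C_E), nn.ReLU(),
                         nn.Linear(C_E, C_E), nn.ReLU(),
                         nn.Linear(C_E, out_dim))

f_h, f_o = box_ffn(), box_ffn()        # human / object box heads
f_c = nn.Linear(C_E, N_OBJ + 1)        # object classifier ("+1" no-object slot is an assumption)
f_a = nn.Linear(C_E, N_ACT)            # single-layer action classifier

r_ho, r_int = torch.randn(2, 64, C_E), torch.randn(2, 64, C_E)

b_h, b_o = f_h(r_ho).sigmoid(), f_o(r_ho).sigmoid()   # normalized (cx, cy, w, h) boxes
c = f_c(r_ho).softmax(-1)                             # object class distribution c_i
a = f_a(r_int).sigmoid()                              # Eq. (3): action distribution a_i
```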
Modified Baseline. To re-compose original samples and implement our compositional learning method, we make a simple modification to the original baseline: we concatenate each HO pair representation with its corresponding interaction representation to predict action classes, which can be described as:

$a_i = f_a\big([R^{HO}_i; R^{int}_i]\big),$   (4)

where $[\cdot;\cdot]$ denotes concatenation. The shape of the concatenated representations is thus $N_q \times 2C_e$. We denote the model that only concatenates $R^{HO}_i$ and $R^{int}_i$ of the same original samples as the modified baseline.
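A sketch of the change introduced by Eq. (4): only the input dimension of the action classifier grows, since it now consumes the concatenation of the two representations (names and shapes assumed as above).

```python
import torch
import torch.nn as nn

C_E, N_ACT = 256, 117
f_a = nn.Linear(2 * C_E, N_ACT)        # action head now takes 2 * C_E channels

r_ho, r_int = torch.randn(2, 64, C_E), torch.randn(2, 64, C_E)
a = f_a(torch.cat([r_ho, r_int], dim=-1)).sigmoid()   # Eq. (4): same-instance concatenation
print(a.shape)                                        # torch.Size([2, 64, 117])
```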
III-C Compositional Learning
Inspired by Visual Compositional Learning (VCL) [12], we propose a compositional learning method consisting of feature re-composition and label re-composition. In our work, sample re-composition is conducted within one pair of randomly selected input images due to hardware limitations; this setting also makes the method easier to explain. If one image contains multiple HOI instances, we also re-compose samples across the different instances within that image following the same process.
Feature Re-Composition. Given a pair of input images $I_1$ and $I_2$, our model first executes the general procedure of HOI detection, producing a set of HOI predictions, each corresponding to an HO pair representation and an interaction representation. If all representations were involved in re-composition, the huge number of re-composed samples would greatly increase the computational overhead. Besides, each ground truth is matched with only one prediction for loss calculation during training, which indicates that the majority of predictions are inaccurate and their representations carry inadequate semantic information that barely benefits the understanding of HOIs. Therefore, we only re-compose the representations that produce the best-matching predictions, denoted as $\hat{R}^{HO}, \hat{R}^{int} \in \mathbb{R}^{N_{gt} \times C_e}$, where $N_{gt}$ is the number of ground truths, i.e., annotated HOI instances in the input image.
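The selection step can be pictured as follows; the cost values and the number of ground truths are made up, and the Hungarian matcher is the one from scipy rather than the exact matcher used in [6].

```python
import torch
from scipy.optimize import linear_sum_assignment

N_Q, C_E, N_GT = 64, 256, 3
cost = torch.rand(N_Q, N_GT)                          # hypothetical matching cost matrix
q_idx, gt_idx = linear_sum_assignment(cost.numpy())   # one-to-one minimal-cost assignment

r_ho, r_int = torch.randn(N_Q, C_E), torch.randn(N_Q, C_E)
keep = torch.as_tensor(q_idx)
r_ho_sel, r_int_sel = r_ho[keep], r_int[keep]         # matched representations: (N_GT, C_E)
```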
HO pair representations and interaction representations from different images are concatenated as new feature samples and fed into the interaction classifier to predict action classes, which can be described as:

$\tilde{a}_{ij} = f_a\big([\hat{R}^{HO}_{1,i}; \hat{R}^{int}_{2,j}]\big).$   (5)

Taking $\hat{R}^{HO}_1$ from $I_1$ and $\hat{R}^{int}_2$ from $I_2$ as an example, every representation in $\hat{R}^{HO}_1$ is concatenated with every representation in $\hat{R}^{int}_2$. The shape of the re-composed representations is therefore $N_{gt,1} \times N_{gt,2} \times 2C_e$, yielding $N_{gt,1} \times N_{gt,2}$ verb predictions. Combining the human-object predictions of the original samples, the HOI predictions of the re-composed samples are denoted as $\{(\hat{b}^h_i, \hat{b}^o_i, \hat{c}_i, \tilde{a}_{ij})\}$.
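A minimal sketch of the cross-image re-composition in Eq. (5), assuming the matched representations have already been gathered as above; every HO-pair representation from image 1 is paired with every interaction representation from image 2 before the shared action classifier.

```python
import torch
import torch.nn as nn

C_E, N_ACT = 256, 117
f_a = nn.Linear(2 * C_E, N_ACT)                 # the same classifier as the modified baseline

r_ho_1 = torch.randn(3, C_E)                    # 3 matched HO-pair reps from image 1
r_int_2 = torch.randn(2, C_E)                   # 2 matched interaction reps from image 2

# All (i, j) combinations -> shape (3, 2, 2 * C_E)
recomposed = torch.cat([r_ho_1[:, None, :].expand(-1, r_int_2.size(0), -1),
                        r_int_2[None, :, :].expand(r_ho_1.size(0), -1, -1)], dim=-1)
verb_logits = f_a(recomposed)                   # (3, 2, N_ACT) re-composed verb predictions
```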
In VCL [12], object features from object bounding box regions are concatenated with verb features from union regions of human-object pairs. We argue that this form of re-composition involves insufficient human features and global context, both of which are contained in our HO pair representations. We therefore believe our feature re-composition can better generalize higher-level semantic knowledge.
Label Re-Composition. Similar to an HOI prediction, an HOI ground truth label consists of the human bounding box, the object bounding box, the object class label and the action class label. We refer to the human and object labels as HO pair labels. The ground truth labels of an image are denoted as $\{(\bar{b}^h_k, \bar{b}^o_k, \bar{c}_k, \bar{a}_k)\}_{k=1}^{N_{gt}}$, where the object label $\bar{c}_k$ is a one-hot vector and the action label $\bar{a}_k$ is a multi-hot vector.

Taking HO pair labels from $I_1$ and action labels from $I_2$ as an example, we pair each HO pair label from $I_1$ with every action label from $I_2$, and check the feasibility of the compositions. An object-verb composition that does not belong to the HOI categories defined by the dataset is considered infeasible, even if it may be rational (e.g. "couch" and "wear", "suitcase" and "sit on" in Fig. 2). Infeasible compositions are removed by setting the corresponding entries of the action label vector to zero. If all compositions between an object label and the action labels are infeasible, an all-zero action label vector is kept for that object label instead of removing the whole HOI label, because the HO pair label is still useful for instance detection. The re-composed HOI labels are denoted as $\{(\bar{b}^h_m, \bar{b}^o_m, \bar{c}_m, \tilde{a}_m)\}_{m=1}^{N_{re}}$, where $N_{re}$ stands for the number of kept re-composed labels.
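Label re-composition can be sketched as a masking operation with an object-action feasibility matrix derived from the dataset's defined HOI categories; the tiny matrix, class indices and labels below are hypothetical placeholders.

```python
import torch

N_OBJ, N_ACT = 80, 117
feasible = torch.zeros(N_OBJ, N_ACT, dtype=torch.bool)   # object class x action class
feasible[17, [36, 76]] = True            # e.g. object class 17 admits actions 36 and 76 (made up)

obj_cls_1 = torch.tensor([17])           # object label of an HO pair from image 1
act_label_2 = torch.zeros(1, N_ACT)      # a multi-hot action label from image 2
act_label_2[0, [36, 55]] = 1.0

# Zero out actions that never co-occur with this object class in the dataset
recomposed_act = act_label_2 * feasible[obj_cls_1].float()   # keeps action 36, drops action 55
# An all-zero action vector is kept rather than dropping the HO-pair label entirely.
```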
III-D Training and Inference
Training. For loss calculation, each ground truth finds its best-matching prediction with the Hungarian algorithm used in [6]. Following [6], the target loss is the weighted sum of four parts: the box regression loss $\mathcal{L}_b$, the generalized intersection-over-union loss $\mathcal{L}_{GIoU}$ [21], the object classification loss $\mathcal{L}_c$ and the action classification loss $\mathcal{L}_a$, described as:

$\mathcal{L} = \lambda_b \mathcal{L}_b + \lambda_{GIoU} \mathcal{L}_{GIoU} + \lambda_c \mathcal{L}_c + \lambda_a \mathcal{L}_a,$   (6)
where $\lambda_b$, $\lambda_{GIoU}$, $\lambda_c$ and $\lambda_a$ are hyper-parameters adjusting the weight of each loss term. We use the above loss function for both original and re-composed samples. In a mini-batch, the average loss is the weighted sum of the loss of the original samples $\mathcal{L}_{orig}$ and the loss of the re-composed samples $\mathcal{L}_{comp}$:

$\mathcal{L}_{total} = \lambda \mathcal{L}_{orig} + (1 - \lambda) \mathcal{L}_{comp},$   (7)

where $\lambda$ is the hyper-parameter that balances original and re-composed samples. To prevent re-composed samples from dominating HOI learning, we suggest giving a larger weight to the original samples, i.e., setting $\lambda > 0.5$.
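The two loss equations can be sketched as below; the four partial losses are stand-ins and the value of $\lambda$ is a placeholder rather than the setting used in our experiments.

```python
import torch

def hoi_loss(l_box, l_giou, l_obj, l_act,
             w_box=2.5, w_giou=1.0, w_obj=1.0, w_act=1.0):
    """Weighted sum of the four loss terms in Eq. (6)."""
    return w_box * l_box + w_giou * l_giou + w_obj * l_obj + w_act * l_act

loss_orig = hoi_loss(*[torch.rand(()) for _ in range(4)])   # loss on original samples
loss_comp = hoi_loss(*[torch.rand(()) for _ in range(4)])   # loss on re-composed samples

lam = 0.6                                   # hypothetical: > 0.5 so original samples dominate
loss_total = lam * loss_orig + (1.0 - lam) * loss_comp      # Eq. (7)
```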
Inference. We perform standard HOI detection without sample re-composition during inference. Following [8], the $i$-th HOI prediction is generated as $(b^h_i, b^o_i, c_i, a_i)$, and its HOI class score is $s^{HOI}_i = s^{obj}_i \cdot s^{act}_i$, where $s^{obj}_i$ and $s^{act}_i$ are the scores of the corresponding object and action class, respectively.
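A sketch of the scoring step, with assumed tensor names; the DETR-style "no object" slot and the top-100 cutoff are illustrative choices, not details confirmed by the paper.

```python
import torch

c_obj = torch.rand(64, 81).softmax(-1)      # object class probabilities per query
a_act = torch.rand(64, 117).sigmoid()       # action class probabilities per query

s_obj, obj_cls = c_obj[:, :-1].max(dim=-1)  # best real object score and its class id per query
s_hoi = s_obj[:, None] * a_act              # (64, 117) HOI scores: object score * action score
topk = s_hoi.flatten().topk(100)            # rank all (query, action) pairs, keep the top 100
```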
IV Experiments
IV-A Datasets and Metrics
Datasets. For performance evaluation, we adopt two popular benchmark datasets, V-COCO [22] and HICO-Det [1]. Both datasets contain 80 object classes defined by MS-COCO [23]. V-COCO includes 2,533 images for training, 2,867 images for validation and 4,946 images for testing. Each human instance is annotated with binary labels for 29 action categories. HICO-Det consists of 38,118 images for training and 9,658 images for testing, providing more than 150,000 annotated HOI instances. It contains 117 action classes and explicitly defines 600 HOI categories.
Metrics. Following [1], we use mean average precision (mAP) as the evaluation metric. An HOI prediction is considered a true positive when both the human and object bounding boxes have an intersection over union (IoU) greater than 0.5 with a ground truth and the predicted HOI label is correct. On HICO-Det we report mAPs under the default setting on three category sets: all 600 classes (Full), the 138 classes with fewer than 10 training samples (Rare) and the remaining 462 classes (Non-Rare). On V-COCO we report role mAPs under Scenario 1 following its official evaluation setup.
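The true-positive test behind the mAP computation can be written compactly as below; this is a simplified check, and the official evaluation additionally handles per-class ranking and duplicate suppression.

```python
import torch
from torchvision.ops import box_iou

def is_true_positive(pred_h, pred_o, gt_h, gt_o, pred_hoi, gt_hoi, thr=0.5):
    """Boxes are (x1, y1, x2, y2) tensors of shape (1, 4); HOI labels are class ids."""
    iou_h = box_iou(pred_h, gt_h).item()     # IoU of the human boxes
    iou_o = box_iou(pred_o, gt_o).item()     # IoU of the object boxes
    return iou_h > thr and iou_o > thr and pred_hoi == gt_hoi
```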
IV-B Implementation Details
CDN [8] provides PyTorch implementations of three variant architectures: CDN-S (small), CDN-B (base) and CDN-L (large). For quick validation, we implement our method on CDN-S, which consists of a ResNet-50 backbone, a 6-layer encoder and two 3-layer decoders. The box regression FFNs have 3 linear layers with ReLU, while the object and action classifiers are single-layer FFNs. The number of queries $N_q$ is set to 64 for HICO-Det and 100 for V-COCO, and the number of channels of each query is set to 256. Following [6], the weight coefficients $\lambda_b$, $\lambda_{GIoU}$, $\lambda_c$ and $\lambda_a$ are set to 2.5, 1, 1 and 1, respectively. Same as [8], we initialize the network with pre-trained parameters of DETR [17] and use AdamW with weight decay for optimization. We first train the whole model for 90 epochs, decreasing the learning rate by a factor of 10 at the 60th epoch, and then fine-tune the cascade decoders together with the FFNs for 10 epochs at a reduced learning rate. We also use the same dynamic re-weighting strategy and pairwise non-maximal suppression post-processing as [8]. All experiments are conducted on 2 RTX 2080 Ti GPUs with a batch size of 2.
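The training schedule described above roughly corresponds to the following sketch; the learning rate and weight decay values are placeholders, not the exact numbers used in our experiments.

```python
import torch

params = [torch.nn.Parameter(torch.randn(10, 10))]          # stand-in for model parameters
optimizer = torch.optim.AdamW(params, lr=1e-4, weight_decay=1e-4)   # hypothetical values
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[60], gamma=0.1)

for epoch in range(90):
    # ... one training epoch over original + re-composed samples (optimizer.step() omitted) ...
    scheduler.step()                         # learning rate drops by 10x at epoch 60
```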
Method | Full | Rare | Non-Rare |
---|---|---|---|
Baseline | 29.45 | 23.35 | 31.27 |
Baseline∗ | 29.58 | 23.82 | 31.30 |
Compo () | 28.96 | 23.88 | 30.48 |
Compo () | 30.03 | 24.62 | 31.65 |
Compo () | 29.43 | 23.34 | 31.25 |
Method | Role mAP (Scenario 1)
---|---|
Baseline | 55.98 |
Baseline∗ | 55.84 |
Compo () | 56.74 |
Compo () | 56.48 |
Compo () | 57.24 |
IV-C Ablation Study
We first validate the effectiveness of our method compared with baseline models. Experimental results are demonstrated in Table I and Table II, where Baseline, Baseline∗ and Compo denote the original CDN-S, the modified baseline mentioned in Section III-B and our sample re-composition method, respectively.
As Table I shows, Baseline∗, which concatenates the representations of HO pairs and interactions to predict action classes, outperforms the original baseline on HICO-Det, noticeably improving the mAP on rare classes. Applying sample re-composition on top of Baseline∗, our method achieves the best performance on all three category sets of HICO-Det. Note that our best model significantly increases the mAP on rare classes from 23.35 to 24.62 compared with Baseline. This demonstrates the effectiveness of our method in alleviating the long-tailed distribution problem, which is also shown by the qualitative results in Fig. 3.
According to Table II, our method also outperforms the baselines on V-COCO, reaching its peak performance at a different value of $\lambda$ than on HICO-Det. We conjecture that this difference stems from the number of feasible labels: with its larger range of object and action categories, HICO-Det produces re-compositions that fall outside the defined HOI classes far more easily than V-COCO does, leaving fewer feasible HOI labels in the re-composed samples. The re-composed samples of HICO-Det therefore carry less knowledge than those of V-COCO, and a slightly larger weight for the re-composed samples becomes necessary.
Method | Backbone | Full | Rare | Non-Rare
---|---|---|---|---
Analogy [3] | ResNet-50+FPN | 19.40 | 14.60 | 20.90 |
Functional [11] | ResNet-101 | 21.96 | 16.43 | 23.62 |
VCL [12] | ResNet-50 | 19.43 | 16.55 | 20.29 |
VCL [12] | ResNet-101 | 23.63 | 17.21 | 25.55 |
ATL [14] | ResNet-101 | 24.50 | 18.53 | 26.28 |
FCL [13] | ResNet-101 | 24.68 | 20.03 | 26.07 |
ConsNet [15] | ResNet-50+FPN | 25.94 | 19.35 | 27.91 |
HOTR [7] | ResNet-50 | 25.10 | 17.34 | 27.42 |
QPIC [6] | ResNet-50 | 29.07 | 21.85 | 31.23 |
Compo () | ResNet-50 | 28.96 | 23.88 | 30.48 |
Compo () | ResNet-50 | 30.03 | 24.62 | 31.65 |
Compo () | ResNet-50 | 29.43 | 23.34 | 31.25 |
IV-D Comparison with State-of-the-Art
We compare our method with existing state-of-the-art few-shot learning methods and transformer-based methods for HOI detection. For a fair comparison, we give priority to implementations that adopt ResNet-50 as the backbone. Table III shows that our best model outperforms all listed previous methods on HICO-Det, exceeding the state-of-the-art mAP on rare classes by a large margin. As Table IV shows, our best model also achieves competitive performance on V-COCO. We believe our method could outperform state-of-the-art methods if implemented on a larger network architecture, i.e. CDN-B or CDN-L.
V Conclusion
We propose a novel transformer-based compositional learning framework for few-shot HOI detection. Human-object pair representations and interaction representations from different HOI instances are re-composed as new training samples. This promotes the transfer of knowledge from non-rare classes to rare classes, encourages the learning of generalized semantics, and helps alleviate the long-tailed distribution problem. Experiments on two benchmark datasets show that our method enhances the understanding of HOIs without introducing additional information, and achieves state-of-the-art performance, especially on rare categories.
Acknowledgment
This work was supported in part by the National Natural Science Foundation of China under Grant 62076183, 61936014 and 61976159, in part by the Natural Science Foundation of Shanghai under Grant 20ZR1473500, in part by the Shanghai Science and Technology Innovation Action Project under Grant 20511100700 and 22511105300, in part by the Shanghai Municipal Science and Technology Major Project under Grant 2021SHZDZX0100, and in part by the Fundamental Research Funds for the Central Universities. The authors would also like to thank the anonymous reviewers for their careful work and valuable suggestions.
References
- [1] Y. W. Chao, Y. Liu, X. Liu, H. Zeng, and J. Deng, “Learning to detect human-object interactions,” in WACV, 2018, pp. 381–389.
- [2] B. Wan, D. Zhou, Y. Liu, R. Li, and X. He, “Pose-aware multi-level feature network for human object interaction detection,” in ICCV, 2019, pp. 9469–9478.
- [3] J. Peyre, I. Laptev, C. Schmid, and J. Sivic, “Detecting unseen visual relations using analogies,” in ICCV, 2019, pp. 1981–1990.
- [4] B. Kim, T. Choi, J. Kang, and H. J. Kim, “Uniondet: Union-level detector towards real-time human-object interaction detection,” in ECCV, 2020, pp. 498–514.
- [5] Y. Liao, S. Liu, F. Wang, Y. Chen, C. Qian, and J. Feng, “Ppdm: Parallel point detection and matching for real-time human-object interaction detection,” in CVPR, 2020, pp. 482–490.
- [6] M. Tamura, H. Ohashi, and T. Yoshinaga, “Qpic: Query-based pairwise human-object interaction detection with image-wide contextual information,” in CVPR, 2021, pp. 10410–10419.
- [7] B. Kim, J. Lee, J. Kang, E. S. Kim, and H. J. Kim, “Hotr: End-to-end human-object interaction detection with transformers,” in CVPR, 2021, pp. 74–83.
- [8] A. Zhang, Y. Liao, S. Liu, M. Lu, Y. Wang, C. Gao, and X. Li, “Mining the benefits of two-stage and one-stage hoi detection,” Advances in Neural Information Processing Systems, vol. 34, pp. 17209–17220, 2021.
- [9] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.
- [10] Y. Liao, A. Zhang, M. Lu, Y. Wang, X. Li, and S. Liu, “Gen-vlkt: Simplify association and enhance interaction understanding for hoi detection,” in CVPR, 2022, pp. 20123–20132.
- [11] A. Bansal, S. S. Rambhatla, A. Shrivastava, and R. Chellappa, “Detecting human-object interactions via functional generalization,” in AAAI, 2020, vol. 34, pp. 10460–10469.
- [12] Z. Hou, X. Peng, Y. Qiao, and D. Tao, “Visual compositional learning for human-object interaction detection,” in ECCV, 2020, pp. 584–600.
- [13] Z. Hou, B. Yu, Y. Qiao, X. Peng, and D. Tao, “Detecting human-object interaction via fabricated compositional learning,” in CVPR, 2021, pp. 14646–14655.
- [14] Z. Hou, B. Yu, Y. Qiao, X. Peng, and D. Tao, “Affordance transfer learning for human-object interaction detection,” in CVPR, 2021, pp. 495–504.
- [15] Y. Liu, J. Yuan, and C. Chen, “Consnet: Learning consistency graph for zero-shot human-object interaction detection,” in ACM MM, 2020, pp. 4235–4243.
- [16] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., “Learning transferable visual models from natural language supervision,” in ICML, 2021, pp. 8748–8763.
- [17] N. Carion, F. Massa, G. Synnaeve, N. Usunier, A. Kirillov, and S. Zagoruyko, “End-to-end object detection with transformers,” in ECCV, 2020, pp. 213–229.
- [18] J. Chen and K. Yanai, “Qahoi: Query-based anchors for human-object interaction detection,” arXiv preprint arXiv:2112.08647, 2021.
- [19] X. Zhu, W. Su, L. Lu, B. Li, X. Wang, and J. Dai, “Deformable detr: Deformable transformers for end-to-end object detection,” arXiv preprint arXiv:2010.04159, 2020.
- [20] F. Z. Zhang, D. Campbell, and S. Gould, “Efficient two-stage detection of human-object interactions with a novel unary-pairwise transformer,” in CVPR, 2022, pp. 20104–20112.
- [21] H. Rezatofighi, N. Tsoi, J. Y. Gwak, A. Sadeghian, I. Reid, and S. Savarese, “Generalized intersection over union: A metric and a loss for bounding box regression,” in CVPR, 2019, pp. 658–666.
- [22] S. Gupta and J. Malik, “Visual semantic role labeling,” arXiv preprint arXiv:1505.04474, 2015.
- [23] T. Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollar, and L. Zitnick, “Microsoft coco: Common objects in context,” in ECCV, 2014, pp. 740–755.