Chairs Can be Stood on: Overcoming Object Bias in
Human-Object Interaction Detection
Supplementary Material
1 Additional Experiments
1.1 Known Object Setting
Following previous work [HORCNN_Chao2018WACV, li2020hoi_IDN, zhang2021spatially_SCG_ICCV21], we also report results on the HICO-DET dataset [HORCNN_Chao2018WACV] under the Known Object (KO) setting in Tab. LABEL:table:hicodet_ko_pre and Tab. LABEL:table:hicodet_ko. It can be observed that our method surpasses the baselines under this setting.
1.2 Hyper-parameter Analysis
The detailed analysis of the coordination of the two classifiers with respect to λ is shown in Tab. LABEL:table:lambda. It can be seen that both classifiers are essential for the performance improvement.
1.3 Efficiency and Memory Comparison
| Method | train/img | test/img | #param (train) | #param (test) |
|---|---|---|---|---|
| SCG [zhang2021spatially_SCG_ICCV21] | 428.16 ms | 248.50 ms | 16.04M | 16.04M |
| +Ours | 440.72 ms | 251.32 ms | 17.12M | 16.50M |
We compare the memory and computational cost with SCG [zhang2021spatially_SCG_ICCV21] in Tab. 1. Note that the adopted detector (i.e., Faster R-CNN [ren2015fasterrcnn]) is not counted, as the detection results can be obtained via one-pass inference over all images before training. It can be observed that the overhead introduced by our method is negligible for both training and testing.
2 Implementations
2.1 Overall Implementations
We conducted all experiments on 4 Nvidia 2080Ti GPUs. Due to resource limitations, we reduced the batch size of SCG and QPIC to 8 and linearly scaled their learning rates accordingly (this may slightly influence the performance and cause inconsistencies between the reported and reproduced results). For QPIC, we started from a trained model and finetuned it with the proposed method for a total of 15 epochs, decaying the learning rate by 0.1 at the 10th epoch. For the other two baselines, we followed the default scheduling and, for stability, started the training of the proposed method from a later epoch. λ is empirically set to 0.4 in all experiments.
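The linear learning-rate scaling mentioned above can be sketched as follows; the base values are illustrative, not the exact values used in our experiments:

```python
def scale_lr(base_lr: float, base_batch_size: int, new_batch_size: int) -> float:
    """Linearly scale the learning rate in proportion to the batch size."""
    return base_lr * new_batch_size / base_batch_size

# e.g., halving the batch size from 16 to 8 halves the learning rate
scaled = scale_lr(1e-4, base_batch_size=16, new_batch_size=8)
```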
For the proposed method, we parameterize the module as a three-layer multi-layer perceptron (MLP) with ReLU activations. For each memory cell, we set the size to 16 for each object and to 4. For the write operation, the threshold is set to the third smallest value (for objects with more than 5 associated verbs) or 0 (for other objects). Regarding other aspects of the base models (feature extractor, sampling strategy, and loss function), we adopted their default settings. More details are as follows.
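A minimal sketch of the three-layer MLP parameterization described above, in NumPy for self-containedness; the input, hidden, and output dimensions here are illustrative assumptions, not values from the paper:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

class ThreeLayerMLP:
    """Three-layer perceptron with ReLU activations (dimensions are illustrative)."""

    def __init__(self, in_dim, hidden_dim, out_dim, seed=0):
        rng = np.random.default_rng(seed)
        # small random initialization for each of the three linear layers
        self.W1 = rng.standard_normal((in_dim, hidden_dim)) * 0.01
        self.b1 = np.zeros(hidden_dim)
        self.W2 = rng.standard_normal((hidden_dim, hidden_dim)) * 0.01
        self.b2 = np.zeros(hidden_dim)
        self.W3 = rng.standard_normal((hidden_dim, out_dim)) * 0.01
        self.b3 = np.zeros(out_dim)

    def forward(self, x):
        h1 = relu(x @ self.W1 + self.b1)
        h2 = relu(h1 @ self.W2 + self.b2)
        return h2 @ self.W3 + self.b3  # no activation on the output layer

mlp = ThreeLayerMLP(in_dim=256, hidden_dim=128, out_dim=64)
out = mlp.forward(np.zeros((2, 256)))  # batch of 2 feature vectors
```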
2.2 Implementation of Baselines
The implementation details of the baselines are listed in Tab. LABEL:table:imp_detail. In this table, the batch size is represented as the number of images per GPU × the number of GPUs. BCE stands for binary cross-entropy loss. The code can be found under the corresponding model names in the zip file.
2.3 More Hyper-parameter Settings
| Index | Setting | Full↑ | Rare↑ | Non-rare↑ |
|---|---|---|---|---|
| 1 | Baseline (4 GPUs × 2 images, unscaled lr) | 19.94 | 14.70 | 21.50 |
| 2 | Baseline (4 GPUs × 1 image, scaled lr) | 20.75 | 15.96 | 22.18 |
| 3 | + Ours | 21.16 | 17.41 | 22.28 |
| 4 | Baseline (4 GPUs × 2 images, scaled lr) | 20.99 | 16.30 | 22.40 |
| 5 | + Ours | 21.50 | 17.59 | 22.67 |
Due to resource limitations, we used a smaller batch size and a scaled learning rate for both the baseline (SCG) and our method in all previous experiments. We also study the performance of the baseline and our method under other hyper-parameter settings in Tab. 2. It can be observed that: (a) a smaller batch size results in worse performance (rows 2&4, 3&5); (b) linearly scaling the learning rate with respect to the batch size mitigates the performance degradation to some degree (rows 1&4); (c) under different training settings, our method outperforms the baseline by a considerable margin (rows 2&3, 4&5).
3 Discussion on Debiasing Baselines
Re-weighting Methods. For the re-weighting methods (i.e., inverse frequency weighting and CB-Loss [cui2019class]), we followed their conventions and computed the number of HOI instances (i.e., interactive human-object pairs) in the training set to facilitate the weight calculation. However, this leads to severe performance degradation, for which we conjecture two main reasons. First, these loss functions are designed to reduce the general bias, not the object bias studied in this paper. Second, these re-weighting strategies interfere substantially with the original training process, which requires complex interaction recognition and reasoning. In contrast, our proposed method allows dynamic adjustment with respect to each HOI instance during training, thereby improving the performance.
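As an illustration, the class-balanced weighting of [cui2019class] derives per-class weights from the effective number of samples; the class counts below are made up for demonstration:

```python
import numpy as np

def class_balanced_weights(counts, beta=0.9999):
    """Class-balanced weights following Cui et al.: w_c ∝ (1 - β) / (1 - β^{n_c}).

    `counts` holds the number of training instances per class (here, the
    number of HOI instances per class, following the paper's convention).
    """
    counts = np.asarray(counts, dtype=np.float64)
    effective_num = 1.0 - np.power(beta, counts)      # effective number of samples
    weights = (1.0 - beta) / effective_num            # inverse effective number
    # normalize so the weights sum to the number of classes
    return weights / weights.sum() * len(counts)

# hypothetical per-class HOI-instance counts: one head class, two tail classes
w = class_balanced_weights([10000, 100, 5])
```

Rarer classes receive larger weights, which is exactly the interference with the original training process discussed above.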
General Debiasing Methods. For Adversarial Training (AT) [wang2020benchmarkbias], we trained the model with an additional classifier whose output dimension equals the number of object classes (i.e., 80 in HICO-DET). For each human-object feature, a cross-entropy loss with a flat label (i.e., a uniform distribution over all object classes) is added to the original training objective, so that the representation is encouraged to be object-agnostic. For Domain Independent Training (DIT) [wang2020benchmarkbias], we trained the model with an additional classifier whose output dimension equals the total number of interactions (i.e., 600 in HICO-DET). During inference, the prediction for a verb is taken as the maximum probability over all interactions involving this verb. We observe significant performance degradation with both methods. The key reason is that they discard the object factor in their representations, which is essential for interaction recognition.
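The flat-label cross-entropy used for the adversarial classifier can be sketched as below; the classifier producing the logits is left abstract, and the target is simply the uniform distribution 1/80 over object classes:

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def flat_label_cross_entropy(logits):
    """Cross-entropy against a flat (uniform) label over all classes.

    Minimizing this term pushes the adversarial classifier's output toward
    uniformity, encouraging the human-object feature to be object-agnostic.
    """
    num_classes = logits.shape[-1]
    uniform = np.full(num_classes, 1.0 / num_classes)
    return -(uniform * log_softmax(logits)).sum(axis=-1)

# for already-uniform logits the loss attains its minimum, log(num_classes)
loss = flat_label_cross_entropy(np.zeros(80))
```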
SGG Debiasing Methods. The original TDE [Tang_2020_CVPR_unbiased_SGG] aims to alleviate the contextual bias in Scene Graph Generation (SGG). Besides the original forward pass, it conducts a second forward pass through the same model with both the subjects and the objects masked (e.g., set to zero). The final prediction is the difference between the original logits and the logits from the second pass; in this way, the biasing effects caused by factors other than the subject and object are expected to be eliminated. In this work, to alleviate the object bias, we conduct the second forward pass by masking everything except the object. Similarly, the final logits are obtained by subtracting the logits of this pass from the original ones; following the intuition of [Tang_2020_CVPR_unbiased_SGG], the output is thereby expected to be less affected by the object bias. For PCPL [yan2020pcpl], we take the representation of an HOI class as the average of all features involving this interaction class. We argue that the failure of these methods may result from their ignorance of the multi-label setting, which leads to distorted logit differences in TDE (since SGG conducts single-label classification) and imprecise class-embedding estimation in PCPL (because the feature of one instance may be counted into multiple classes, confusing the representations).
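The TDE-style debiasing described above reduces to a logit subtraction between two forward passes; `model` below is a stand-in for any verb classifier, and the toy linear map is purely for demonstration:

```python
import numpy as np

def tde_debias(model, full_input, object_only_input):
    """Subtract the object-only logits from the full-pass logits.

    The second pass keeps only the object (everything else masked to zero),
    so the difference removes the part of the prediction attributable to the
    object alone, i.e., the object bias.
    """
    logits_full = model(full_input)
    logits_object = model(object_only_input)
    return logits_full - logits_object

# toy linear "classifier" over a 2-d [human, object] feature, 2 verb classes
W = np.array([[1.0, 2.0],
              [0.5, 1.5]])                 # (feature_dim=2, num_verbs=2)
model = lambda x: x @ W
full = np.array([1.0, 1.0])                # human and object features present
obj_only = np.array([0.0, 1.0])            # human part masked to zero
debiased = tde_debias(model, full, obj_only)  # → [1.0, 2.0] for this toy map
```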
4 More Visualizations
4.1 More Memory Evolutions
We show the evolution of the label distribution for another four randomly picked objects in Fig. LABEL:fig:memvis. It can be observed that the model prefers to sample instances of frequent classes at early iterations due to their dominance, while rare-class instances gain more attention at later training steps with the help of the proposed ODM. By the end of the first epoch (i.e., 4.5k iterations), the tail classes under each object are sampled more frequently.
4.2 More Qualitative Results
We provide additional qualitative results in Fig. LABEL:fig:quali_sup. It can be seen that our method effectively alleviates the object bias problem by reducing false-negative errors.