
RelTransformer: A Transformer-Based Long-Tail Visual Relationship Recognition

We thank the reviewers for their thoughtful feedback. We are encouraged that they found our paper to target an important problem (R1, R2) and to be well written (R2, R3), our proposed method to be intuitive (R3), and the experimental results to be satisfying, convincing (R2, R3), and in line with previous few-shot methods (R3).

R1, The role of the Memory Module. The main goal of the memory module is to augment long-tail relation prediction with “out-of-context” information. A typical self-attention mechanism can only attend to features of the tokens in the input sequence; hence, each relation learns a representation only from its immediate context. Our memory module leverages information beyond the scene context to improve performance on long-tail relations.
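As a rough illustration of this idea (a minimal sketch of one common way to let queries attend beyond the input sequence, not the exact RelTransformer implementation), a learned memory matrix can be concatenated to the keys and values of a standard attention layer. All names, shapes, and hyper-parameters below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MemoryAugmentedAttention(nn.Module):
    """Self-attention whose keys/values are extended with learned memory slots.

    Hypothetical sketch: `num_slots` and `dim` are illustrative values, not the
    paper's actual hyper-parameters.
    """
    def __init__(self, dim: int, num_slots: int = 64):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        # Learned "out-of-context" memory, shared across all images.
        self.memory = nn.Parameter(torch.randn(num_slots, dim) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim) -- relation/object tokens of one scene.
        mem = self.memory.unsqueeze(0).expand(x.size(0), -1, -1)
        kv_source = torch.cat([x, mem], dim=1)       # scene tokens + memory slots
        q = self.q_proj(x)
        k = self.k_proj(kv_source)
        v = self.v_proj(kv_source)
        attn = torch.softmax(q @ k.transpose(-2, -1) / x.size(-1) ** 0.5, dim=-1)
        return attn @ v                               # (batch, seq_len, dim)
```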

To investigate the effectiveness of the memory design, we averaged the attention scores per relation class using one of our pretrained models and show them in the Supplementary (Figure 1). There is a clear tendency for relations with fewer training examples to place higher attention on the memory than the more frequent head relations, which indicates that the memory module contributes more to accurately predicting long-tail relations. This also explains why the memory yields larger improvements on the “medium” and “few” classes than on the “head” classes in Table 4 of the paper.
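A minimal sketch of how such a per-class statistic could be computed, assuming attention weights over scene tokens plus memory slots are already extracted (variable names and the layout are illustrative assumptions):

```python
from collections import defaultdict
import torch

def memory_attention_per_class(attn: torch.Tensor, rel_labels: torch.Tensor,
                               num_mem_slots: int) -> dict:
    """Average attention mass placed on memory slots, grouped by relation class.

    attn:       (num_relations, seq_len + num_mem_slots) attention weights,
                one row per relation token (assumed layout).
    rel_labels: (num_relations,) ground-truth relation class ids.
    """
    mem_mass = attn[:, -num_mem_slots:].sum(dim=-1)   # mass placed on memory slots
    per_class = defaultdict(list)
    for mass, label in zip(mem_mass.tolist(), rel_labels.tolist()):
        per_class[label].append(mass)
    return {cls: sum(v) / len(v) for cls, v in per_class.items()}
```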

R1, R2, Novelty. While Transformers have been used in previous VRR works (e.g., RVL-BERT and RTN in Table 2 of our main paper), we are the first to show their effectiveness on long-tail VRR by leveraging our global-context and memory modules. We believe this is an important finding that will motivate future work to use Transformers for long-tail VRR tasks rather than defaulting to GNN- and LSTM-based models as the de-facto choice.

R1, R2, Mean Recall@K performance. We evaluated the mean recall@K (mR@K) performance on the VG200 dataset and compared RelTransformer with strong baselines such as VCTREE and Motif trained with both cross-entropy and EBM losses [suhail2021energy]. The results are summarized in Table 1. RelTransformer outperforms all baselines on mR@20, mR@50, and mR@100 while using only the cross-entropy loss, which suggests that our model generalizes well across long-tail visual relationship recognition datasets.

Model     Loss   mR@20   mR@50   mR@100
IMP       CE      8.85   10.97   11.77
IMP       EBM     9.43   11.83   12.77
Motif     CE     12.45   15.71   16.80
Motif     EBM    14.20   18.20   19.70
VCTREE    CE     13.07   16.53   17.77
VCTREE    EBM    14.17   18.02   19.53
Ours      CE     18.51   19.58   20.19
Table 1: Mean Recall@K performance (PRDCLS) on the VG200 dataset.
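For reference, a minimal sketch of how the mR@K numbers in Table 1 are typically obtained: recall is computed per predicate class and then averaged over classes (the standard protocol additionally averages per image first; the data layout and names below are assumptions for illustration, not our exact evaluation code):

```python
import numpy as np

def mean_recall_at_k(per_image_matches, num_predicates):
    """Simplified mean recall@K: per-predicate recall, averaged over classes.

    per_image_matches: list of (gt_predicates, matched_flags) pairs, one per
    image, where gt_predicates[i] is the predicate class of the i-th GT triplet
    and matched_flags[i] says whether it is matched by the top-K predictions.
    """
    matched = np.zeros(num_predicates)
    total = np.zeros(num_predicates)
    for gt_predicates, matched_flags in per_image_matches:
        for cls, hit in zip(gt_predicates, matched_flags):
            total[cls] += 1
            matched[cls] += float(hit)
    valid = total > 0
    return float(np.mean(matched[valid] / total[valid]))
```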

R2, The Effectiveness of Global Context. The global context aggregates “in-context” information from all the objects and relations in the same scene (see Fig. 1 in the main paper). This information can be seen as a global view of the whole image, and it extends the perception scope of a long-tail relation beyond its local surrounding objects. Hence, it is expected to contribute more to the accurate prediction of long-tail relations.
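One simple way to realize such aggregation, shown purely as an illustrative sketch (not the paper's exact module; the interface and names are assumptions), is to attention-pool over all object and relation features of the scene and fuse the pooled vector back into each relation representation:

```python
import torch
import torch.nn as nn

class GlobalContextPooling(nn.Module):
    """Attention-pool all scene features into one global vector and fuse it
    into each relation feature. Illustrative sketch only."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)       # scalar attention score per token
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, rel_feats: torch.Tensor, scene_feats: torch.Tensor) -> torch.Tensor:
        # rel_feats:   (batch, num_rel, dim)  relation tokens to be enriched
        # scene_feats: (batch, num_tok, dim)  all object + relation tokens
        weights = torch.softmax(self.score(scene_feats), dim=1)        # (batch, num_tok, 1)
        global_ctx = (weights * scene_feats).sum(dim=1, keepdim=True)  # (batch, 1, dim)
        global_ctx = global_ctx.expand(-1, rel_feats.size(1), -1)
        return self.fuse(torch.cat([rel_feats, global_ctx], dim=-1))
```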

R2, Generalization of Memory Module. Essentially, our memory module is an independent weight matrix M connected to the backbone network via an attention operation. It can be integrated into popular deep learning architectures such as RNN-, GCN-, or CNN-based models. We agree that exploring this broader applicability is exciting future work, although it is outside the scope of this paper.
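To make this plug-and-play idea concrete, here is a hedged sketch of a standalone memory block that could be attached to the feature output of an arbitrary backbone (RNN/GCN/CNN); the interface, names, and residual fusion are our illustrative assumptions, not a prescribed design:

```python
import torch
import torch.nn as nn

class MemoryBlock(nn.Module):
    """Backbone-agnostic memory: features query a learned matrix M via attention
    and the retrieved content is added residually. Illustrative sketch only."""
    def __init__(self, feat_dim: int, num_slots: int = 64):
        super().__init__()
        self.M = nn.Parameter(torch.randn(num_slots, feat_dim) * 0.02)
        self.q_proj = nn.Linear(feat_dim, feat_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (..., feat_dim) -- output features of any RNN / GCN / CNN backbone.
        q = self.q_proj(feats)
        attn = torch.softmax(q @ self.M.t() / feats.size(-1) ** 0.5, dim=-1)
        return feats + attn @ self.M          # residual fusion of retrieved memory

# Usage sketch: enriched = MemoryBlock(feat_dim=512)(backbone_features)
```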

Model         Loss   many (16)   med (46)   few (248)   all (310)
Full model    WCE    63.6        59.1       43.1        46.5
w/o global    WCE    60.6        56.1       37.1        41.4
w/o mem       WCE    60.8        57.3       38.3        42.3
Table 2: Ablation study of RelTransformer on the GQA-LT dataset (number of relation classes per split in parentheses). "Full model" is our default setting.

R3, The benefits of the memory module with WCE. Per R3’s request, we ran the ablations on the GQA-LT dataset, which includes 310 relation types; see Table 2. The memory module improves the performance by 4.8 accuracy points on the “few” classes (43.1 vs. 38.3), underscoring its usefulness under WCE. On the other hand, VG8k-LT is an extremely challenging dataset with 1,600 tail relation types and 2,000 relation types in total. On this dataset, our memory module also benefits RelTransformer under many other loss functions, as shown in Table 4 of our main paper. Under the WCE loss, removing the memory module drops the performance by 0.3 accuracy points (acc) on the “few” and 0.2 acc on the “medium” categories of VG8k-LT. Given the difficulty and scale of VG8k-LT, we consider this a noticeable improvement.
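For completeness, a minimal sketch of the kind of weighted cross-entropy (WCE) used in these ablations; the inverse-frequency class weighting shown here is one common choice and an assumption for illustration, not necessarily the paper's exact weighting:

```python
import torch
import torch.nn as nn

def build_wce_loss(class_counts: torch.Tensor) -> nn.CrossEntropyLoss:
    """Weighted cross-entropy with inverse-frequency class weights.

    class_counts[c] is the number of training samples of relation class c.
    """
    weights = class_counts.sum() / (len(class_counts) * class_counts.float())
    return nn.CrossEntropyLoss(weight=weights)

# Usage sketch:
# criterion = build_wce_loss(torch.tensor(train_class_counts))
# loss = criterion(relation_logits, relation_labels)
```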

R3, Triplet Creation Based on Spatial or Semantic Relationships. Thanks for pointing this out. We do not construct the training relationships or create the triplets ourselves; this triplet information is already provided by the VG8K-LT and GQA-LT datasets. During training, we build each triplet from object pairs that are already annotated with a spatial or semantic relation. At test time, we construct a triplet for every object pair that the benchmarks designate for prediction.
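A small hedged sketch of this data-preparation step (the annotation layout and field names are assumptions for illustration, not the exact VG8K-LT/GQA-LT schema):

```python
def build_train_triplets(image_annotation: dict) -> list:
    """Training: keep only the (subject, predicate, object) triples the dataset
    already annotates with a spatial/semantic relation."""
    return [(r["subject_id"], r["predicate"], r["object_id"])
            for r in image_annotation["relations"]]

def build_test_pairs(image_annotation: dict) -> list:
    """Testing: form a candidate pair for every ordered pair of objects the
    benchmark asks us to predict a relation for."""
    objs = image_annotation["object_ids"]
    return [(s, o) for s in objs for o in objs if s != o]
```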