
Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, 100190, China
University of Chinese Academy of Sciences, Beijing, 100049, China
Email: [email protected], {wangruiping, sgshan, xlchen}@ict.ac.cn

Sketching Image Gist: Human-Mimetic Hierarchical Scene Graph Generation

Wenbin Wang (ORCID 0000-0002-4394-0145), Ruiping Wang (ORCID 0000-0003-1830-2595), Shiguang Shan (ORCID 0000-0002-8348-392X), Xilin Chen (ORCID 0000-0003-3024-4404)
Abstract

Scene graph aims to faithfully reveal humans' perception of image content. When humans analyze a scene, they usually prefer to describe the image gist first, namely the major objects and key relations in a scene graph. This inherent perceptive habit implies that there exists a hierarchical structure reflecting humans' preference during the scene parsing procedure. Therefore, we argue that a desirable scene graph should also be hierarchically constructed, and introduce a new scheme for modeling scene graphs. Concretely, a scene is represented by a human-mimetic Hierarchical Entity Tree (HET) consisting of a series of image regions. To generate a scene graph based on HET, we parse HET with a Hybrid Long Short-Term Memory (Hybrid-LSTM) which specifically encodes hierarchy and siblings context to capture the structured information embedded in HET. To further prioritize key relations in the scene graph, we devise a Relation Ranking Module (RRM) to dynamically adjust their rankings by learning to capture humans' subjective perceptive habits from objective entity saliency and size. Experiments indicate that our method not only achieves state-of-the-art performance for scene graph generation, but also excels at mining image-specific relations, which play a great role in serving downstream tasks.

Keywords:
Image Gist, Key Relation, Hierarchical Entity Tree, Hybrid-LSTM, Relation Ranking Module

1 Introduction

Figure 1: Scene graphs from existing methods shown in (a) and (b) fail to sketch the image gist. The hierarchical structure of humans' perception preference is shown in (f), where the bottom-left highlighted branch stands for the hierarchy in (e). The scene graphs in (c) and (d), based on the hierarchical structure, better capture the gist. Relations in (a) and (b), and purple arrows in (c) and (d), are top-5 relations, while gray ones in (c) and (d) are secondary.

In an effort to thoroughly understand a scene, scene graph generation (SGG) [10, 44], in which objects and their pairwise relations are detected, has been bridging the gap between low-level recognition and high-level cognition, and contributes to tasks like image captioning [42, 25, 46], VQA [1, 38], and visual reasoning [33]. While previous works [44, 17, 45, 16, 52, 29, 38, 41, 51, 55] have pushed this area forward, the generated scene graphs may still be far from perfect, e.g., they seldom consider whether the detected relations are what humans want to convey from the image. As a symbolic representation of an image, the scene graph is expected to record the image content as completely as possible. More importantly, a scene graph is not just for being admired, but for supporting downstream tasks such as image captioning, where a description is supposed to depict the major event in the image, namely the image gist. This characteristic is also one of humans' inherent habits when they parse a scene. Therefore, an urgently needed capability of SGG is to assess relation importance and prioritize the relations which form the major events that humans intend to preferentially convey, i.e., key relations. This is seldom considered by existing methods. What's worse, the universally unbalanced distribution of relationship triplets in mainstream datasets exacerbates the problem that the major event cannot be found. Let's study the quality of the top relations predicted by existing state-of-the-art methods (e.g., [51]) and check whether they are "key" or not. In Figure 1(a)(b), the two scene graphs shown with top-5 relations for images A and B are mostly the same, although the major events in A and B are quite different. In other words, existing methods are deficient in mining image-specific relations, but biased towards trivial or self-evident ones (e.g., \langlewoman, has, head\rangle can be obtained from commonsense without observing the image), which fail to convey the image gist (colored parts in the ground truth captions in Figure 1) and barely contribute to downstream tasks.

Any pair of objects in a scene can be considered relevant, at least in terms of their spatial configuration. Faced with such a massive number of relations, how do humans choose which relations to describe? Given picture (ii) in Figure 1(e), a zoomed-in sub-region of picture (i), humans will describe it with \langlewoman, riding, bike\rangle, since woman and bike belong to the same perceptive level and their interaction forms the major event in (ii). When it comes to picture (iii), the answers would be \langlewoman, wearing, helmet\rangle and \langlebag, on, woman\rangle, where helmet and bag are finer details of woman and belong to an inferior perceptive level. This suggests that there naturally exists a hierarchical structure of humans' perception preference, as shown in Figure 1(f).

Inspired by the observations above, we argue that a desirable scene graph should be hierarchically constructed. Specifically, we represent the image with a human-mimetic Hierarchical Entity Tree (HET) where each node is a detected object and each one can be decomposed into a set of finer objects attached to it. To generate the scene graph based on HET, we devise the Hybrid Long Short-Term Memory (Hybrid-LSTM) to encode both hierarchy and siblings context [51, 38] and capture the structured information embedded in HET, considering that important related pairs are more likely to be seen either inside a certain perceptive level or between two adjacent perceptive levels. We further intend to evaluate the performance of different models on key relation prediction, but annotations of key relations are not directly available in existing datasets. Therefore, we extend Visual Genome (VG) [13] to the VG-KR dataset, which contains indicative annotations of key relations, by drawing support from caption annotations in MSCOCO [21]. We devise a Relation Ranking Module (RRM) to adjust the rankings of relations. It captures humans' subjective perceptive habits from objective entity saliency and size, and achieves the best performance on mining key relations. (Source code and dataset are available at http://vipl.ict.ac.cn/resources/codes or https://github.com/Kenneth-Wong/het-eccv20.git.)

2 Related Works

Scene graph generation (SGG) and visual relationship detection (VRD) are the two most common tasks aiming at extracting interactions between two objects. In the field of VRD, various studies [24, 3, 15, 52, 49, 28, 53, 48, 54] mainly focus on detecting each relation triplet independently rather than describing the structure of the scene. The concept of the scene graph is first proposed in [10] for image retrieval. Xu et al. [44] define the SGG task and creatively devise a message passing mechanism for scene graph inference. A series of succeeding works design various approaches to improve the graph representation. Li et al. [17] incorporate image captions and object information to jointly address multiple tasks. [51, 38, 41, 22] draw support from useful context construction. Yang et al. [45] propose Graph-RCNN to embed the structured information. Qi et al. [29] employ a self-attention module to embed a weighted graph representation. Zhang et al. [55] propose contrastive losses to resolve the related-pair configuration ambiguity. Zareian et al. [50] creatively treat SGG as an edge role assignment problem. Recently, some methods try to borrow advantages from knowledge [2, 5] or causal effects [37] to diversify the predicted relations. Liang et al. [19] prune the dominant and easy-to-predict relations in VG to alleviate the annihilation problem of rare but meaningful relations.

Structured Scene Parsing has received much attention in pursuit of higher-level scene understanding. [35, 32, 20, 6, 57, 47] construct various hierarchical structures for their specific tasks. Unlike existing SGG studies that indiscriminately detect relations no matter whether humans care about them or not, our work introduces the idea of a hierarchical structure into the SGG task, and tries to give priority to detecting key relations, followed by the trivial ones for completeness.

Saliency vs. Image Gist. An extremely rich set of studies [14, 39, 23, 40, 8, 56] focuses on analyzing where humans gaze and finding visually salient objects (high contrast of luminance, hue, and saturation, center position [9, 12, 43], etc.). It is notable that visually salient objects are related to, but not equal to, the objects involved in the image gist. He et al. [7] explore gaze data and find that only 48% of fixated objects are referred to in humans' descriptions of the image, while 95% of objects referred to in descriptions are fixated. This suggests that objects referred to in a description (i.e., objects that humans think important and that should form the major events / image gist) are almost always visually salient and reveal where humans gaze, but what humans fixate on (i.e., visually salient objects) is not always what they want to convey. We provide some examples in the Appendix to help understand this finding. Naturally, we need to emphasize that the levels in our HET reflect the perception priority level rather than object saliency. Besides, this finding supports us in obtaining the indicative annotations of key relations with the help of image caption annotations.

3 Proposed Approach

3.1 Overview

The scene graph $\mathcal{G}=\{\mathcal{O},\mathcal{R}\}$ of an image $\mathcal{I}$ contains a set of entities $\mathcal{O}=\{o_{i}\}_{i=1}^{N}$ and their pairwise relations $\mathcal{R}=\{r_{k}\}_{k=1}^{M}$. Each $r_{k}$ is a triplet $\langle o_{i},p_{ij},o_{j}\rangle$, where $p_{ij}\in\mathcal{P}$ and $\mathcal{P}$ is the set of all predicates. As illustrated in Figure 2, our approach can be summarized into four steps. (i) We apply Faster R-CNN [30] with a VGG16 [34] backbone to detect all the entity proposals, each of which possesses its bounding box $\bm{b}_{i}\in\mathbb{R}^{4}$, a 4,096-dimensional visual feature $\bm{v}_{i}$, and a class probability vector $\bm{q}_{i}$ from the softmax output. (ii) In Section 3.2, HET is constructed by organizing the detected entities according to their perceptive levels. (iii) In Section 3.3, we design the Hybrid-LSTM network to parse HET, which first encodes the structured context and then decodes it for graph inference. (iv) In Section 3.4, we improve the scene graph generated in (iii) with our devised RRM, which further adjusts the rankings of relations and shifts the graph focus to the relations between entities close to the top perceptive levels of HET.

Figure 2: An overview of our method. An object detector is first applied to support HET construction. Then Hybrid-LSTM is leveraged to parse HET, and specifically contains 4 processes: (a) entity context encoding, (b) relation context encoding, (c) entity context decoding, and (d) relation context decoding. Finally, RRM predicts a ranking score for each triplet, which further prioritizes the key relations in the scene graph.

3.2 HET Construction

We aim to construct a hierarchical structure whose top-down levels accord with the perceptive levels of humans' inherent scene parsing hierarchy. From a massive number of observations, we find that entities with larger sizes are relatively more likely to form the major events in a scene (this will be proved effective through experiments). Therefore, we arrange larger entities as close to the root of HET as possible. Each entity can be decomposed into finer entities that make up the inferior level.

Concretely, HET is a multi-branch tree $\mathcal{T}$ with a virtual root $o_{0}$ standing for the whole image. All the entities are sorted in descending order according to their sizes, yielding an ordered sequence $\{o_{i_{1}},o_{i_{2}},\ldots,o_{i_{N}}\}$. For each entity $o_{i_{n}}$, we consider the entities with larger size, $\{o_{i_{m}}\}, 1\leq m<n$, and calculate the ratio

P_{nm}=\frac{I\left(o_{i_{n}},o_{i_{m}}\right)}{A(o_{i_{n}})}, \qquad (1)

where $A(\cdot)$ denotes the size of an entity and $I(\cdot,\cdot)$ is the intersection area of two entities. If $P_{nm}$ is larger than a threshold $T$, $o_{i_{m}}$ will be a candidate parent node of $o_{i_{n}}$, since $o_{i_{m}}$ contains most of $o_{i_{n}}$. If there is no candidate, the parent node of $o_{i_{n}}$ is set as $o_{0}$. If there is more than one, we further determine the parent with two alternative strategies:

Area-first Strategy (AFS). Considering that an entity with a larger size has a higher probability of containing more details or components, the candidate with the largest size is selected as the parent node.

Intersection-first Strategy (IFS). We compute the ratio

Q_{nm}=\frac{I\left(o_{i_{n}},o_{i_{m}}\right)}{A(o_{i_{m}})}. \qquad (2)

A larger $Q_{nm}$ means that $o_{i_{n}}$ is relatively more important to $o_{i_{m}}$ than to the other candidates. Therefore, $o_{i_{m}}$ with $m=\arg\max_{k}Q_{nk}$ is chosen as the parent of $o_{i_{n}}$.
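To make the construction procedure concrete, the following Python snippet is a minimal sketch under simplifying assumptions: boxes are plain (x1, y1, x2, y2) tuples from the detector, and function names such as build_het are illustrative rather than taken from the released code.

```python
def area(b):
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def intersection(a, b):
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)

def build_het(boxes, T=0.9, strategy="IFS"):
    """Return parent[i] for every entity index; -1 denotes the virtual root o_0."""
    order = sorted(range(len(boxes)), key=lambda i: area(boxes[i]), reverse=True)
    parent = {i: -1 for i in order}
    for n, i_n in enumerate(order):
        # candidate parents: larger entities that cover most of o_{i_n} (Eq. 1)
        cands = [i_m for i_m in order[:n]
                 if intersection(boxes[i_n], boxes[i_m]) / (area(boxes[i_n]) + 1e-12) > T]
        if not cands:
            continue                              # parent stays the virtual root
        if strategy == "AFS":                     # area-first: largest candidate
            parent[i_n] = max(cands, key=lambda i_m: area(boxes[i_m]))
        else:                                     # IFS: candidate with the largest Q_{nm} (Eq. 2)
            parent[i_n] = max(cands, key=lambda i_m:
                              intersection(boxes[i_n], boxes[i_m]) / (area(boxes[i_m]) + 1e-12))
    return parent
```

With T = 0.9 (the value adopted in the Appendix), only clearly contained entities form deeper levels and the remaining entities attach directly to the root.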

3.3 Structured Context Encoding and Scene Graph Generation

The interpretability of HET implies that important relations are more likely to be seen between entities either inside a certain level or from two adjacent levels. Therefore, both hierarchical connection [38] and sibling association [51] are useful for context modeling. Our Hybrid-LSTM encoder consists of a bidirectional multi-branch TreeLSTM [36] (Bi-TreeLSTM) for encoding the hierarchy context, and a bidirectional chain LSTM [4] (Bi-LSTM) for encoding the siblings context. We use two identical Hybrid-LSTM encoders to encode two types of context for each entity: one is the entity context, which helps predict the information of the entity itself, and the other is the relation context, which plays a role in inferring the relation when interacting with other potentially relevant entities. For brevity, we only provide a detailed introduction of entity context encoding (Figure 2(a)). Specifically, the input feature $\bm{x}_{i}$ of each node $o_{i}$ is the concatenation of the visual feature $\bm{v}_{i}$ and the weighted sum of semantic embedding vectors, $\bm{z}_{i}=\bm{W}_{e}^{(1)}\bm{q}_{i}$, where $\bm{W}_{e}^{(1)}$ is a word embedding matrix initialized from GloVe [27]. For the root node $o_{0}$, $\bm{v}_{0}$ is obtained with the whole-image bounding box, while $\bm{z}_{0}$ is initialized randomly.

The hierarchy context (blue arrows in Figure 2(a)) is encoded as:

\bm{C}=\mathrm{BiTreeLSTM}(\{\bm{x}_{i}\}_{i=0}^{N}), \qquad (3)

where $\bm{C}=\{\bm{c}_{i}\}_{i=0}^{N}$ and each $\bm{c}_{i}=\left[\overrightarrow{\bm{h}_{i}^{\mathcal{T}}};\overleftarrow{\bm{h}_{i}^{\mathcal{T}}}\right]$ is the concatenation of the top-down and bottom-up hidden states of the Bi-TreeLSTM:

\overrightarrow{\bm{h}_{i}^{\mathcal{T}}}=\mathrm{TreeLSTM}\left(\bm{x}_{i},\overrightarrow{\bm{h}_{p}^{\mathcal{T}}}\right), \qquad (4a)
\overleftarrow{\bm{h}_{i}^{\mathcal{T}}}=\mathrm{TreeLSTM}\left(\bm{x}_{i},\left\{\overleftarrow{\bm{h}_{j}^{\mathcal{T}}}\,\middle|\,j\in C(i)\right\}\right), \qquad (4b)

where $C(\cdot)$ denotes the set of children nodes, while the subscript $p$ denotes the parent of node $i$.

The siblings context (red arrows in Figure 2(a)) is encoded within each set of children nodes which share the same parent:

\bm{S}=\mathrm{BiLSTM}(\{\bm{x}_{i}\}_{i=0}^{N}), \qquad (5)

where $\bm{S}=\{\bm{s}_{i}\}_{i=0}^{N}$ and each $\bm{s}_{i}=\left[\overrightarrow{\bm{h}_{i}^{\mathcal{L}}};\overleftarrow{\bm{h}_{i}^{\mathcal{L}}}\right]$ is the concatenation of the forward and backward hidden states of the Bi-LSTM:

\overrightarrow{\bm{h}_{i}^{\mathcal{L}}}=\mathrm{LSTM}\left(\bm{x}_{i},\overrightarrow{\bm{h}_{l}^{\mathcal{L}}}\right),\quad\overleftarrow{\bm{h}_{i}^{\mathcal{L}}}=\mathrm{LSTM}\left(\bm{x}_{i},\overleftarrow{\bm{h}_{r}^{\mathcal{L}}}\right), \qquad (6)

where $l$ and $r$ stand for the left and right siblings which share the same parent with $i$. We further concatenate the hierarchy and siblings context to obtain the entity context $\bm{f}^{\mathcal{O}}_{i}=[\bm{c}_{i};\bm{s}_{i}]$. Missing branches or siblings are padded with zero vectors.

The relation context is encoded (Figure 2(b)) in the same way as the entity context except that the input of each node is replaced by $\{\bm{f}^{\mathcal{O}}_{i}\}_{i=0}^{N}$. Another Hybrid-LSTM encoder is applied to get the relation context $\{\bm{f}^{\mathcal{R}}_{i}\}_{i=0}^{N}$.

To generate a scene graph, we decode the context to obtain entity and relation information. In HET, a child node strongly depends on its parent, i.e., information of the parent node is helpful for predicting the child node. Therefore, to predict entity information, we decode the entity context in a top-down manner following Eq. (4a), as shown in Figure 2(c). For node $o_{i}$, the input $\bm{x}_{i}$ in Eq. (4a) is replaced with $[\bm{f}_{i}^{\mathcal{O}};\bm{W}_{e}^{(2)}\bm{q}_{p}]$, where $\bm{W}_{e}^{(2)}$ is a word embedding matrix and $\bm{q}_{p}$ is the predicted class probability vector of the parent of $o_{i}$. The output hidden state is fed into a softmax classifier and a bounding box regressor to predict the entity information of $o_{i}$. To predict the predicate $p_{ij}$ between $o_{i}$ and $o_{j}$, we feed $\bm{f}_{ij}^{\mathcal{R}}=[\bm{f}_{i}^{\mathcal{R}};\bm{f}_{j}^{\mathcal{R}}]$ to an MLP classifier (Figure 2(d)). As a result, a scene graph is generated, and for each triplet containing subject $o_{i}$, object $o_{j}$, and predicate $p_{ij}$, we obtain their scalar scores $s_{i}$, $s_{j}$, and $s_{ij}$.
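For concreteness, the PyTorch sketch below illustrates the spirit of the Hybrid-LSTM encoder under simplifying assumptions: the multi-branch TreeLSTM is approximated by an LSTMCell whose incoming state is the parent state (top-down pass) or the sum of the children states (bottom-up pass), one image is processed at a time, and all class and variable names are illustrative rather than taken from the released code.

```python
import torch
import torch.nn as nn

class HybridLSTMEncoder(nn.Module):
    """Simplified sketch: hierarchy context (tree passes) + siblings context (chain Bi-LSTM)."""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.td = nn.LSTMCell(in_dim, hid_dim)    # top-down tree pass, Eq. (4a)
        self.bu = nn.LSTMCell(in_dim, hid_dim)    # bottom-up tree pass, Eq. (4b), children summed
        self.sib = nn.LSTM(in_dim, hid_dim, bidirectional=True, batch_first=True)
        self.hid_dim = hid_dim

    def forward(self, x, children):
        # x: (N, in_dim) node features, index 0 is the virtual root;
        # children[i]: list of child indices of node i (empty list for leaves).
        N = x.size(0)
        zeros = lambda: (x.new_zeros(1, self.hid_dim), x.new_zeros(1, self.hid_dim))
        h_td, c_td, h_bu, c_bu = {}, {}, {}, {}

        def down(i, state):                       # parent state flows to children
            h_td[i], c_td[i] = self.td(x[i:i+1], state)
            for j in children[i]:
                down(j, (h_td[i], c_td[i]))

        def up(i):                                # children states aggregated (sum) for the parent
            if children[i]:
                hs, cs = zip(*[up(j) for j in children[i]])
                state = (sum(hs), sum(cs))
            else:
                state = zeros()
            h_bu[i], c_bu[i] = self.bu(x[i:i+1], state)
            return h_bu[i], c_bu[i]

        down(0, zeros()); up(0)

        # siblings context: one Bi-LSTM run per group of children sharing a parent (Eqs. 5-6);
        # the root and nodes without siblings keep zero vectors, as in the paper.
        s = x.new_zeros(N, 2 * self.hid_dim)
        for i in range(N):
            if children[i]:
                out, _ = self.sib(x[children[i]].unsqueeze(0))   # (1, |group|, 2*hid)
                s[children[i]] = out.squeeze(0)

        c = torch.cat([torch.cat([h_td[i], h_bu[i]], dim=1) for i in range(N)])
        return torch.cat([c, s], dim=1)           # f_i = [c_i; s_i] for every node
```

Running two such encoders back to back yields the entity and relation contexts of this section; the decoding stage then reuses the top-down pass with the parent's predicted class embedding appended to the input.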

3.4 Relation Ranking Module

So far, we have obtained the hierarchical scene graph based on HET. With the key relation annotations we collect (Section 4.1), we intend to further maximize the performance on mining key relations with supervised information, and explore the advantages brought by HET. Consequently, we design a Relation Ranking Module (RRM) to prioritize key relations. As analyzed in Related Works, regions of humans' interest can be tracked under the guidance of visual saliency, although they do not always form the major events that humans want to convey. Besides, entity size, which guides HET construction, is not only an important reference for estimating the perceptive level of entities, but is also found helpful to rectify some misleading cases in humans' subjective assessment of the importance of relations (see the Appendix). Therefore, we propose to learn to capture humans' subjective assessment of the importance of relations under the guidance of visual saliency and entity size information.

We first employ DSS [8] to predict a pixel-wise saliency map (SM) $\mathcal{S}$ for each image. To effectively collect entity size information, we propose a pixel-wise area map (AM) $\mathcal{A}$. Given the image $\mathcal{I}$ and its $N$ detected entities $\{o_{i}\}_{i=1}^{N}$ with bounding boxes $\{\bm{b}_{i}\}_{i=1}^{N}$ (and specially $o_{0}$ and $\bm{b}_{0}$ for the whole image), the value $a_{xy}$ at each position $(x,y)$ of $\mathcal{A}$ is defined as the minimum normalized size of the entities that cover $(x,y)$:

a_{xy}=\begin{cases}\min\left\{\frac{A(o_{i})}{A(o_{0})}\,\middle|\,i\in\mathcal{X}\right\}, & \mathrm{if}\ \mathcal{X}\neq\emptyset,\\ 0, & \mathrm{otherwise},\end{cases} \qquad (7)

where $\mathcal{X}=\{i\,|\,(x,y)\in\bm{b}_{i},\,0<i\leq N\}$. The sizes of both $\mathcal{S}$ and $\mathcal{A}$ are the same as that of the input image $\mathcal{I}$. We apply adaptive average pooling ($\mathrm{AAP}(\cdot)$) to smooth and down-sample these two maps to align with the shape of the conv5 feature map $\mathcal{F}$ from Faster R-CNN, and obtain the attention-embedded feature map $\mathcal{F}_{S}$:

\mathcal{F}_{S}=\mathcal{F}\odot(\mathrm{AAP}(\mathcal{S})+\mathrm{AAP}(\mathcal{A})), \qquad (8)

where $\odot$ is the Hadamard product.
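A minimal sketch of how the area map of Eq. (7) and the attention-embedded feature map of Eq. (8) could be computed is given below. It assumes a saliency map S of shape (H, W), boxes in (x1, y1, x2, y2) pixel coordinates, and a conv5 feature map of shape (C, H5, W5); the function names are illustrative, not the released implementation.

```python
import torch
import torch.nn.functional as Fn

def area_map(boxes, image_hw):
    """Pixel-wise area map (Eq. 7): each pixel gets the minimum normalized size covering it."""
    H, W = image_hw
    A0 = float(H * W)                              # size of the whole image o_0
    am = torch.zeros(H, W)
    # write boxes from largest to smallest so the smallest covering entity wins
    for (x1, y1, x2, y2) in sorted(boxes, key=lambda b: (b[2]-b[0])*(b[3]-b[1]),
                                   reverse=True):
        am[int(y1):int(y2), int(x1):int(x2)] = (x2 - x1) * (y2 - y1) / A0
    return am                                      # uncovered positions stay 0

def attend_features(F5, S, AM):
    """Down-sample both maps to the conv5 resolution and modulate the features (Eq. 8)."""
    h5, w5 = F5.shape[-2:]
    s = Fn.adaptive_avg_pool2d(S[None, None], (h5, w5)).squeeze(0)    # (1, H5, W5)
    a = Fn.adaptive_avg_pool2d(AM[None, None], (h5, w5)).squeeze(0)
    return F5 * (s + a)                            # Hadamard product
```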

We predict a score for each triplet to adjust their rankings. The input contains the visual representation of a triplet, $\bm{v}_{ij}\in\mathbb{R}^{4096}$, obtained by RoI pooling on $\mathcal{F}_{S}$. Besides, geometric information is an auxiliary cue for estimating importance. For a triplet with subject box $\bm{b}_{i}$ and object box $\bm{b}_{j}$, the geometric feature $\bm{g}_{ij}$ is defined as a 6-dimensional vector following [11]:

\bm{g}_{ij}=\left[\frac{x_{j}-x_{i}}{\sqrt{w_{i}h_{i}}},\ \frac{y_{j}-y_{i}}{\sqrt{w_{i}h_{i}}},\ \sqrt{\frac{w_{j}h_{j}}{w_{i}h_{i}}},\ \frac{w_{i}}{h_{i}},\ \frac{w_{j}}{h_{j}},\ \frac{\bm{b}_{i}\cap\bm{b}_{j}}{\bm{b}_{i}\cup\bm{b}_{j}}\right], \qquad (9)

which is projected to a 256-dimensional vector and concatenated with $\bm{v}_{ij}$, resulting in the final representation of a relation, $\bm{r}_{ij}=[\bm{v}_{ij};\bm{W}^{(g)}\bm{g}_{ij}]$, where $\bm{W}^{(g)}\in\mathbb{R}^{256\times 6}$ is the projection matrix. We then use a bidirectional LSTM to encode global context among all the triplets so that the ranking score of each triplet can be reasonably adjusted considering the scores of the other triplets. Concretely, the ranking score $t_{ij}$ for a pair $(o_{i}, o_{j})$ is obtained as:

\{\bm{h}^{\mathcal{R}}_{ij}\}=\mathrm{BiLSTM}\left(\{\bm{r}_{ij}\}\right), \qquad (10)
t_{ij}=\bm{W}^{(r)}_{2}\,\mathrm{ReLU}(\bm{W}^{(r)}_{1}\bm{h}^{\mathcal{R}}_{ij}). \qquad (11)

$\bm{W}^{(r)}_{1}$ and $\bm{W}^{(r)}_{2}$ are the weights of two fully connected layers. The ranking score is fused with the classification scores so that both the confidences of the three components of a triplet and the ranking priority are considered, resulting in the final ranking confidence $\phi_{ij}=s_{i}\cdot s_{j}\cdot s_{ij}\cdot t_{ij}$, which is used for re-ranking the relations.
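The ranking branch can be sketched as follows. The snippet assumes boxes in (x, y, w, h) form with (x, y) the top-left corner, per-pair RoI features already pooled from the attention-embedded feature map, and illustrative layer names and dimensions; it is a sketch of Eqs. (9)-(11) and the score fusion, not the exact released implementation.

```python
import torch
import torch.nn as nn

def geometric_feature(bi, bj):                      # Eq. (9); boxes assumed as (x, y, w, h)
    xi, yi, wi, hi = bi
    xj, yj, wj, hj = bj
    inter_w = max(0.0, min(xi + wi, xj + wj) - max(xi, xj))
    inter_h = max(0.0, min(yi + hi, yj + hj) - max(yi, yj))
    inter = inter_w * inter_h
    union = wi * hi + wj * hj - inter
    return torch.tensor([(xj - xi) / (wi * hi) ** 0.5,
                         (yj - yi) / (wi * hi) ** 0.5,
                         (wj * hj / (wi * hi)) ** 0.5,
                         wi / hi, wj / hj, inter / union])

class RankingHead(nn.Module):
    def __init__(self, v_dim=4096, g_dim=6, hid=256):
        super().__init__()
        self.proj_g = nn.Linear(g_dim, 256)                     # W^(g)
        self.ctx = nn.LSTM(v_dim + 256, hid, bidirectional=True, batch_first=True)
        self.fc1, self.fc2 = nn.Linear(2 * hid, 256), nn.Linear(256, 1)   # W^(r)_1, W^(r)_2

    def forward(self, v, g):                        # v: (M, 4096) RoI features, g: (M, 6)
        r = torch.cat([v, self.proj_g(g)], dim=1)               # r_ij
        h, _ = self.ctx(r.unsqueeze(0))                         # Eq. (10), context over all pairs
        return self.fc2(torch.relu(self.fc1(h.squeeze(0))))     # t_ij, Eq. (11)

# The final ranking confidence then fuses the classification scores with t_ij:
# phi_ij = s_i * s_j * s_ij * t_ij
```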

3.5 Loss Function

We adopt the cross-entropy loss for optimizing the Hybrid-LSTM networks. Let $e^{\prime}$ and $l^{\prime}$ denote the predicted labels of entities and predicates respectively, and $e$ and $l$ the ground truth labels. The loss is defined as:

\mathcal{L}_{CE}=\mathcal{L}_{entity}+\mathcal{L}_{relation}=-\frac{1}{Z_{1}}\sum_{i}e^{\prime}_{i}\log(e_{i})-\frac{1}{Z_{2}}\sum_{i}\sum_{j\neq i}l^{\prime}_{ij}\log(l_{ij}). \qquad (12)

When the RRM is applied, the final loss function is the sum of $\mathcal{L}_{CE}$ and a ranking loss $\mathcal{L}(\mathcal{K},\mathcal{N})$, which maximizes the margin between the ranking confidences of key relations and those of secondary ones:

\mathcal{L}_{Final}=\mathcal{L}_{CE}+\mathcal{L}(\mathcal{K},\mathcal{N})=\mathcal{L}_{CE}+\frac{1}{Z_{3}}\sum_{r\in\mathcal{K},\,r^{\prime}\in\mathcal{N}}\max(0,\gamma-\phi_{r}+\phi_{r^{\prime}}), \qquad (13)

where $\gamma$ denotes the margin parameter, $\mathcal{K}$ and $\mathcal{N}$ stand for the sets of key and secondary relations, and $r$ and $r^{\prime}$ are relations sampled from $\mathcal{K}$ and $\mathcal{N}$ with ranking confidences $\phi_{r}$ and $\phi_{r^{\prime}}$. $Z_{1}$, $Z_{2}$, and $Z_{3}$ are normalization factors.
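A minimal PyTorch-style sketch of the objective in Eqs. (12)-(13), with the normalization factors absorbed into averaging and all variable names illustrative:

```python
import torch
import torch.nn.functional as F

def total_loss(entity_logits, entity_labels, pred_logits, pred_labels,
               phi_key, phi_sec, gamma=0.5):
    # L_CE = L_entity + L_relation (cross-entropy over entity and predicate classes)
    l_ce = F.cross_entropy(entity_logits, entity_labels) + \
           F.cross_entropy(pred_logits, pred_labels)
    # margin ranking loss over sampled (key, secondary) pairs: mean of max(0, gamma - phi_r + phi_r')
    l_rank = torch.clamp(gamma - phi_key + phi_sec, min=0).mean()
    return l_ce + l_rank
```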

4 Experimental Evaluation

4.1 Dataset, Evaluation and Settings

VRD [24] is the benchmark dataset for the visual relationship detection task, containing 4,000/1,000 training/test images and covering 100 object categories and 70 predicate categories.

Visual Genome (VG) is a large-scale dataset with rich annotations of objects, attributes, dense captions, and pairwise relationships, containing 75,651/32,422 training/test images. We adopt the most widely used version of VG, namely VG150 [44], which covers 150 object categories and 50 predicate categories.

Figure 3: Examples in VG-KR dataset. Each image is shown with 3 captions and ground truth relations. Purple triplets are key ones while others are secondary.

VG200 and VG-KR. We intend to collect indicative annotations of key relations based on VG. Inspired by the finding illustrated in Related Works, we associate the relation triplets referred to in the caption annotations of MSCOCO [21] with those from VG. We give several examples in Figure 3. The details of our processing and more statistics are provided in the Appendix.

Evaluation, Settings, and Implementation Details. For conventional SGG following the triplet-match rule (a predicted triplet is correct only if all three of its components match a ground truth triplet), we adopt three universal protocols [44]: PREDCLS, SGCLS, and SGGEN. All protocols use Recall@K (R@K, K=20, 50, 100) as the metric. When evaluating key relation prediction, there are some variations. First, we only evaluate with the PREDCLS and SGCLS protocols to eliminate the interference of errors from the object detector, and add a tuple-match rule (only the subject and object are required to match the ground truth) to investigate the ability to find proper pairs. Second, we introduce a new metric, Key Relation Recall (kR@K), which computes the recall rate on key relations. As the number of key relations is usually less than 5 (see the Appendix), K in kR@K is set to 1 and 5. When evaluating on VRD, we use RELDET and PHRDET [49], and report R@50 and R@100 at 1, 10, and 70 predicates per related pair. The details about the hyperparameter settings and implementation are provided in the Appendix.
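For clarity, the kR@K metric can be sketched as follows. The snippet works on symbolic triplets and omits the box-localization matching used in the actual evaluation, so it is illustrative only; the function name is ours.

```python
def kr_at_k(predicted, key_gt, k, tuple_match=False):
    """predicted: ranked list of (subj, pred, obj); key_gt: set of ground-truth key triplets."""
    top = predicted[:k]
    if tuple_match:
        # only subject and object need to match a ground-truth key relation
        gt_pairs = {(s, o) for (s, p, o) in key_gt}
        hit_pairs = {(s, o) for (s, p, o) in top if (s, o) in gt_pairs}
        return len(hit_pairs) / max(1, len(gt_pairs))
    # triplet match: subject, predicate, and object must all match
    hits = {t for t in key_gt if t in top}
    return len(hits) / max(1, len(key_gt))
```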

Table 1: Results (%) on VG150 and VG200. The full version of our method is HetH.

Dataset | Method          | SGGEN R@20/50/100  | SGCLS R@20/50/100  | PREDCLS R@20/50/100
VG150   | VRD [24]        |   -  /  0.3 /  0.5 |   -  / 11.8 / 14.1 |   -  / 27.9 / 35.0
VG150   | IMP [44]        |   -  /  3.4 /  4.2 |   -  / 21.7 / 24.4 |   -  / 44.8 / 53.0
VG150   | IMP† [44, 51]   | 14.6 / 20.7 / 24.5 | 31.7 / 34.6 / 35.4 | 52.7 / 59.3 / 61.3
VG150   | Graph-RCNN [45] |   -  / 11.4 / 13.7 |   -  / 29.6 / 31.6 |   -  / 54.2 / 59.1
VG150   | MemNet [41]     |  7.7 / 11.4 / 13.9 | 23.3 / 27.8 / 29.5 | 42.1 / 53.2 / 57.9
VG150   | MOTIFS [51]     | 21.4 / 27.2 / 30.3 | 32.9 / 35.8 / 36.5 | 58.5 / 65.2 / 67.1
VG150   | KERN [2]        |   -  / 27.1 / 29.8 |   -  / 36.7 / 37.4 |   -  / 65.8 / 67.6
VG150   | VCTree-SL [38]  | 21.7 / 27.7 / 31.1 | 35.0 / 37.9 / 38.6 | 59.8 / 66.2 / 67.9
VG150   | HetH-AFS        | 21.2 / 27.1 / 30.5 | 33.7 / 36.6 / 37.3 | 58.1 / 64.7 / 66.6
VG150   | HetH w/o chain  | 21.5 / 27.4 / 30.7 | 32.9 / 35.9 / 36.7 | 57.5 / 64.5 / 66.5
VG150   | HetH            | 21.6 / 27.5 / 30.9 | 33.8 / 36.6 / 37.3 | 59.8 / 66.3 / 68.1
VG200   | MOTIFS [51]     | 15.2 / 19.9 / 22.8 | 24.5 / 26.7 / 27.4 | 52.5 / 59.0 / 61.0
VG200   | VCTree-SL [38]  | 14.7 / 19.5 / 22.5 | 24.2 / 26.5 / 27.1 | 51.9 / 58.4 / 60.3
VG200   | HetH            | 15.7 / 20.4 / 23.4 | 25.0 / 27.2 / 27.8 | 53.6 / 60.1 / 61.8
Table 2: Results (%) of key relation prediction on VG-KR.

Method      | Triplet Match SGCLS kR@1/5 | Triplet Match PREDCLS kR@1/5 | Tuple Match SGCLS kR@1/5 | Tuple Match PREDCLS kR@1/5
VCTree-SL   | 5.7 / 14.2 | 11.4 / 30.2 |  8.4 / 22.2 | 16.1 / 46.4
MOTIFS      | 5.9 / 14.5 | 11.3 / 30.0 |  8.5 / 21.8 | 16.0 / 46.2
HetH        | 6.1 / 15.1 | 11.6 / 30.4 |  8.6 / 22.7 | 16.4 / 47.1
MOTIFS-RRM  | 8.6 / 16.4 | 16.7 / 33.8 | 13.8 / 26.3 | 27.9 / 57.1
HetH-RRM    | 9.2 / 17.1 | 17.5 / 35.0 | 14.6 / 27.3 | 28.9 / 59.1
RRM-Base    | 8.4 / 16.8 | 16.2 / 33.7 | 13.4 / 26.8 | 26.6 / 57.2
RRM-SM      | 9.0 / 16.9 | 17.2 / 34.5 | 14.3 / 27.1 | 28.6 / 58.7
RRM-AM      | 8.9 / 16.9 | 16.9 / 34.4 | 14.1 / 27.0 | 28.1 / 58.2
Table 3: Results (%) on VRD.

Method           | RELDET k=1 R@50/100 | RELDET k=10 R@50/100 | RELDET k=70 R@50/100 | PHRDET k=1 R@50/100 | PHRDET k=10 R@50/100 | PHRDET k=70 R@50/100
ViP [15]         | 17.32 / 20.01 |       -       |       -       | 22.78 / 27.91 |       -       |       -
VRL [18]         | 18.19 / 20.79 |       -       |       -       | 21.37 / 22.60 |       -       |       -
KL-Dist [49]     | 19.17 / 21.34 | 22.56 / 29.89 | 22.68 / 31.89 | 23.14 / 24.03 | 26.47 / 29.76 | 26.32 / 29.43
Zoom-Net [48]    | 18.92 / 21.41 |       -       | 21.37 / 27.30 | 24.82 / 28.09 |       -       | 29.05 / 37.34
RelDN-$L_0$ [55] | 24.30 / 27.91 | 26.67 / 32.55 | 26.67 / 32.55 | 31.09 / 36.42 | 33.29 / 41.25 | 33.29 / 41.25
RelDN [55]       | 25.29 / 28.62 | 28.15 / 33.91 | 28.15 / 33.91 | 31.34 / 36.42 | 34.45 / 42.12 | 34.45 / 42.12
HetH             | 22.42 / 24.88 | 26.88 / 31.69 | 26.88 / 31.81 | 30.69 / 35.59 | 35.47 / 42.94 | 35.47 / 43.05

4.2 Ablation Studies

Ablation studies are separated into two sections. The first part is to explore some variants of HET construction. We conduct these experiments on VG150. The complete version of our model is HetH, which is configured with IFS and Hybrid-LSTM. The second part is an investigation into the usage of SM and AM in RRM. Experiments are carried out on VG-KR. The complete version is HetH-RRM, whose implementation follows Eq. (8).

Ablation study on HET construction. We first compare AFS and IFS for determining the parent node. Then we investigate the effectiveness of the chain LSTM encoder in Hybrid-LSTM. The ablative models mentioned above are shown in Table 1 as HetH-AFS (i.e., replacing IFS with AFS) and HetH w/o chain. We observe that using IFS together with the Hybrid-LSTM encoder gives the best performance, which indicates that HET is more reasonable when using IFS. It is noteworthy that if the Bi-TreeLSTM encoder is abandoned, the Hybrid-LSTM encoder would almost degenerate to MOTIFS. Therefore, the comparisons between HetH and MOTIFS, and between HetH and HetH w/o chain, imply that both hierarchy and siblings context should be encoded in HET.

Ablation study on RRM. In order to explore the effectiveness of saliency and size, we ablate HetH-RRM with the following baselines: (1) RRM-Base: $\bm{v}_{ij}$ is extracted from $\mathcal{F}$ rather than $\mathcal{F}_{S}$, (2) RRM-SM: only $\mathcal{S}$ is used, and (3) RRM-AM: only $\mathcal{A}$ is used. Results in Table 2 suggest that both saliency and size information indeed contribute to discovering key relations, and the effect of saliency is slightly better than that of size. The hybrid version achieves the highest performance. From the following qualitative analysis, we can see that with the guidance of saliency and the rectification effect of size, RRM further shifts the model's attention to key relations significantly.

4.3 Comparisons with State-of-the-Arts

For scene graph generation, we compare our HetH with the following state-of-the-art methods: VRD [24] and KERN [2] use knowledge from language or statistical correlations. IMP [44], Graph-RCNN [45], MemNet [41], MOTIFS [51] and VCTree-SL [38] mainly devise various message passing methods for improving graph representations. For key relation prediction, we mainly evaluate two latest works, MOTIFS and VCTree-SL on VG-KR. Besides, we further incorporate RRM to MOTIFS, namely MOTIFS-RRM, to explore the transferability of RRM. Results are shown in Table 1 and 2. We give statistical significance of the results in the Appendix.

Figure 4: Qualitative results of HetH and HetH-RRM. In (e), the pink entities are involved in the top-5 relations, and the purple arrows are key relations matched with the ground truth. The purple numeric tags next to the relations are the rankings, and "1" means that the relation gets the highest score.

Quantitative Analysis. In Table 1, when evaluated on VG150, HetH surpasses most methods by a clear margin. Compared to MOTIFS and VCTree-SL, HetH, which uses a multi-branch tree structure, outperforms MOTIFS and yields a recall rate comparable to VCTree-SL, which uses a binary tree structure. This indicates that a hierarchical structure is superior to a plain one in terms of modeling context. We observe that HetH achieves better performance than VCTree-SL under the PREDCLS protocol, while there exists a slight gap under the SGCLS and SGGEN protocols. This is mainly because our tree structure is generated with artificial rules and some incorrect subtrees inevitably emerge due to occlusion in 2D images, while VCTree-SL dynamically adjusts its structure in pursuit of higher performance. Under the SGCLS and SGGEN protocols, in which object information is fragmentary, it is difficult for HetH to rectify the context encoded from wrong structures. However, we argue that our interpretable and natural multi-branch tree structure is also adaptive to the situation where there is an increment of object and relation categories but less data. It can be seen from the evaluation results on VG200 that HetH outperforms MOTIFS by 0.67 mean points and VCTree-SL by 1.1 mean points. On the contrary, in this case, the data are insufficient for dynamic structure optimization.

As the SGG task is highly related to the VRD task, we also apply HetH to VRD and show the comparison results in Table 3. Both HetH and RelDN [55] use weights pre-trained on MSCOCO, while only [48] states that it uses ImageNet pre-trained weights and the others remain unknown. Our method yields competitive results and even surpasses the state of the art under some metrics.

When it comes to key relation prediction, we directly evaluate HetH, MOTIFS, and VCTree-SL on VG-KR. As shown in Table 2, HetH substantially performs better than the other two competitors, suggesting that the structure of HET provides hints for judging the importance of relations, and that parsing the structured information in HET indeed captures humans' perceptive habits.

In pursuit of ultimate performances on mining key relations, we jointly optimize the HetH with RRM under the supervision of key relation annotations in VG-KR. From Table 2, both HetH-RRM and MOTIFS-RRM achieve significant gains, and HetH-RRM is better than MOTIFS-RRM, which proves the superiority of HetH again, and shows excellent transferability of RRM.

Qualitative Analysis. We visualize intermediate results in Figure 4(a-d). HET is well constructed and close to humans' analyzing process. In the area map, the regions of arm, hand, and foot get small weights because of their small sizes. Actually, relations like \langlelady, has, arm\rangle are indeed trivial. As a result, RRM suppresses these relations. More cases are provided in the Appendix.

Figure 5: (a) Depth distribution of top-5 predicted relations. (b) The ranking confidence of relations from different depths obtained from RRM-base. Sampling is repeated five times.
Strategy (HetH-RRM) | kR@1 | kR@5 | speed (s/image)
EP                  | 17.5 | 35.0 | 0.22
SP                  | 15.8 | 31.2 | 0.18
Figure 6: Comparison between EP and SP for HetH-RRM. The inference speed (seconds/image) is evaluated with a single TITAN Xp GPU.
Figure 7: As the quota of top relations (NR) increases, scene graphs dynamically enlarge. The newly involved entities and relations are shown in a new color. Results in the first and second rows are from HetH-RRM and MOTIFS, respectively.

4.4 Analyses about HET

We conduct additional experiments to validate whether HET has the potential to reveal humans' perceptive habits. As shown in Figure 5(a), we compare the depth distributions of the top-5 predicted relations (each represented by a tuple $(d_{o_{i}},d_{o_{j}})$ consisting of the depths of the two entities, where the depth of the root is defined as 1) of HetH, RRM-base, and HetH-RRM. After applying RRM, there is a significant increase in the ratio of depth tuples (2, 2) and (2, 3), and a drop in (3, 3). This phenomenon is also observed in Figure 4(e). Previous experiments have proved that RRM obviously upgrades the rankings of key relations; in other words, relations which are closer to the root of HET are regarded as key ones by RRM. We also analyze the ranking confidence ($\phi$) of relations from different depths with the RRM-base model (to eliminate the confounding effect caused by AAP information). We sample 10,000 predicted relation triplets from each depth five times. In Figure 5(b), the ranking confidence decreases as the depth increases. Therefore, different levels of HET indeed indicate different perceptive importance of relations. This characteristic makes it possible to reasonably adjust the scale of a scene graph. As shown in the first row in Figure 7, the hierarchical scene graph from our HetH-RRM enlarges in a top-down manner as the quota of top relations increases, while the ordinary scene graph in the second row enlarges itself aimlessly. If we want to limit the scale of a scene graph while keeping its ability to sketch the image gist as far as possible, this is feasible for our hierarchical scene graph, since we just need to cut off some secondary branches of HET, but is difficult to realize in an ordinary scene graph.

Besides, different from the traditional Exhausted Prediction (EP, predicting a relation for every pair of entities) at the inference stage, we adopt a novel Structured Prediction (SP) strategy, in which we only predict relations between parent and children nodes, and between any two sibling nodes that share the same parent. In Figure 6, we compare the performance and inference speed of EP and SP for HetH-RRM. Despite the slight gap in performance, the interpretability of connections in HET makes SP a feasible step towards efficient inference, getting rid of the $O(N^{2})$ complexity [16] of EP. Further research is needed to balance performance and efficiency.
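The SP strategy amounts to enumerating only tree-local candidate pairs. A minimal sketch, assuming children is the HET child list indexed by parent node and that pairs involving the virtual root are skipped (an assumption on our part), follows:

```python
from itertools import combinations

def sp_candidate_pairs(children):
    """children: dict mapping each node id to the list of its child ids (root has id 0)."""
    pairs = set()
    for parent, kids in children.items():
        for k in kids:
            if parent != 0:                       # skip pairs with the virtual whole-image root
                pairs.add((parent, k))
                pairs.add((k, parent))
        for a, b in combinations(kids, 2):        # siblings sharing the same parent
            pairs.add((a, b))
            pairs.add((b, a))
    return pairs
```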

Table 4: Results of image captioning on VG-KR.

Num. | Model     | B@1  | B@2  | B@3  | B@4  | ROUGE-L | CIDEr | SPICE | Avg. Growth
all  | GCN-LSTM  | 72.0 | 54.7 | 40.5 | 30.0 | 52.9    | 91.1  | 18.1  |
20   | HetH-Freq | 73.1 | 55.7 | 41.0 | 30.1 | 53.5    | 94.0  | 18.8  |
20   | HetH      | 74.9 | 58.4 | 43.9 | 32.8 | 54.9    | 101.7 | 19.8  | 0.06
20   | HetH-RRM  | 75.0 | 58.2 | 43.7 | 32.7 | 55.1    | 102.2 | 19.9  |
5    | HetH-Freq | 70.7 | 53.2 | 38.6 | 28.0 | 51.7    | 84.4  | 17.2  |
5    | HetH      | 72.5 | 55.4 | 41.2 | 30.5 | 53.1    | 92.6  | 18.5  | 1.57
5    | HetH-RRM  | 73.7 | 56.7 | 42.3 | 31.5 | 54.0    | 97.5  | 19.1  |
2    | HetH-Freq | 68.1 | 50.8 | 36.8 | 26.5 | 50.2    | 76.5  | 15.5  |
2    | HetH      | 70.8 | 53.4 | 39.2 | 28.7 | 51.8    | 86.4  | 17.6  | 2.10
2    | HetH-RRM  | 72.3 | 55.2 | 41.0 | 30.4 | 53.1    | 92.2  | 18.4  |

5 Experiments on Image Captioning

Do key relations really make sense? We conduct experiments on one of the downstream tasks of SGG, i.e., image captioning, to verify this. (We give a brief introduction here; details are provided in the Appendix.)

Experiments are conducted on VG-KR since it has caption annotations from MSCOCO. To generate captions, we select different numbers of predicted top relations and feed them into the LSTM backend following [46]. We re-implement the complete GCN-LSTM [46] model and evaluate it on VG-KR since it is one of the state-of-the-art methods and is the most related to ours. As shown in Table 4, our simple frequency baseline, HetH-Freq (the rankings of relations accord with their frequency in the training data), with 20 top relations as input, outperforms GCN-LSTM, because GCN-LSTM conducts graph convolution using relations as edges, which is not as effective as our method in making full use of relation information. After applying RRM, there is a consistent performance improvement on the overall metrics. This improvement becomes more and more significant as the number of input top relations is reduced. This is reasonable since the impact of RRM centers on top relations. It suggests that our model provides more essential content with as few relations as possible, which contributes to efficiency improvement. The captions presented in Figure 4(e) show that key relations are more helpful for generating a description that closely fits the major events in an image.

6 Conclusion

We propose a new scene graph modeling formulation and make an attempt to push the study of SGG towards practicability and rationalization. Inspired by humans' scene parsing procedure, a human-mimetic hierarchical scene graph is generated based on HET with the assistance of our Hybrid-LSTM, and the key relations are further prioritized as far as possible by the devised RRM, which recalls more key relations. Experiments show the outstanding performance of our method on traditional scene graph generation and key relation prediction tasks. Besides, experiments on image captioning prove that key relations are not just for appreciation, but indeed play a crucial role in higher-level downstream tasks.

Appendix 0.A Detailed Explanation about Motivation

As illustrated in the main paper, it is notable that visually salient objects are related to, but not completely the same as, the objects involved in the image gist. According to the findings in [7], objects referred to in a description (i.e., objects that humans think important and that should form the major events/image gist) are almost always visually salient and reveal where humans gaze, but what humans fixate on (i.e., visually salient objects) is not always what they want to convey at first. In Figure 8, we provide some examples showing that this is a common phenomenon. E.g., the red clothes, the Spring Festival couplets, and the black doors of the washing machines (mentioned from left to right) are visually salient due to their high contrast to the context or their center position. However, some of them do not form the major events. For example, for the 2nd image, the first-glance description would be "There stands a house on the side of the road." Only then may humans become interested in the eye-catching Spring Festival couplets.

Besides, we are inspired by these observations: there naturally exists a hierarchical structure of humans' perception preference, and objects with relatively large size which fill the scene generally form the major events. This supports our method of constructing HET introduced in the main paper. We aim to construct a HET whose levels reflect the perception priority level rather than object saliency. The experiments show that our method for constructing HET achieves this goal.

Figure 8: Visually salient objects do not always form the major events in the images and are not always what humans want to convey at first from the images. The yellow points in each image denote some visually salient objects. The saliency maps in the second row are obtained from [8].
Figure 9: Our implementation of GCN-LSTM, and the implementation scheme of our sentence decoder.

Appendix 0.B Implementation Details for Image Captioning

As the source code of GCN-LSTM [46] had not been released by the submission deadline, we re-implement it. In its original version, a simple two-layer MLP classifier is applied to predict the pairwise relationships, acting as the front-end scene graph detector. For a fair comparison, we replace this detector with our HetH. To adapt our HetH/HetH-RRM to the image captioning task, we add a sentence decoder which is modified from the LSTM backend of GCN-LSTM. The GCN-LSTM model conducts graph convolution on the scene graph and injects all relation-aware region-level features into a two-layer LSTM with an attention mechanism. Different from GCN-LSTM, we inject the relation features rather than the region-level features, considering that the relationships which convey the events in the image are more helpful for description generation. In Figure 9, we show a brief diagram illustrating our implementation of GCN-LSTM and the implementation scheme of our sentence decoder for image captioning.

Specifically, we obtain a set of visual relationship representations $\{\bm{f}_{m}^{\mathcal{R}}\}_{m=1}^{M}$ ($\bm{f}_{m}^{\mathcal{R}}\in\mathbb{R}^{D_{f}}$, $D_{f}=4{,}096$) after relation context decoding (see Figure 2(d) in the main paper). We concatenate them with the word embeddings of their subjects, objects, and predicates, denoted by $\bm{w}_{m}^{s}\in\mathbb{R}^{D_{w}}$, $\bm{w}_{m}^{o}\in\mathbb{R}^{D_{w}}$, and $\bm{w}_{m}^{p}\in\mathbb{R}^{D_{w}}$ ($D_{w}=300$), and obtain $\{\bm{r}_{m}\}_{m=1}^{M}$ ($\bm{r}_{m}\in\mathbb{R}^{D_{r}}$, $D_{r}=4{,}996$):

\bm{r}_{m}=\left[\bm{f}_{m}^{\mathcal{R}};\bm{w}_{m}^{s};\bm{w}_{m}^{o};\bm{w}_{m}^{p}\right]. \qquad (14)

The sentence decoder is a two-layer LSTM. Note that the two layers in this decoder share one hidden state $\bm{h}\in\mathbb{R}^{D_{h}}$ and cell state $\bm{c}\in\mathbb{R}^{D_{h}}$. At each time step $t$, the first layer collects the maximum contextual information by concatenating the input word embedding $\bm{w}_{t}\in\mathbb{R}^{D_{w}}$ and the mean-pooled visual relationship feature $\bar{\bm{r}}=\frac{1}{M}\sum_{m=1}^{M}\bm{r}_{m}$. The updating procedure is

\bm{h}_{t}^{1},\bm{c}_{t}^{1}={f_{1}\left(\left[\bm{w}_{t};\bar{\bm{r}}\right]\right)}_{|\bm{h}_{t-1}^{2},\bm{c}_{t-1}^{2}}, \qquad (15)

where $f_{1}$ is the updating function within the first-layer unit, and the subscript $|\bm{h}_{t-1}^{2},\bm{c}_{t-1}^{2}$ denotes that the internal hidden state and cell state are the ones updated by the second-layer unit at the previous timestep. Then we compute a normalized attention distribution over all the relationship features:

a_{t,m}=\bm{W}_{a}\left[\tanh\left(\bm{W}_{f}\bm{r}_{m}+\bm{W}_{h}\bm{h}_{t}^{1}\right)\right],\quad\lambda_{t}=\mathrm{softmax}(\bm{a}_{t}), \qquad (16)

where $a_{t,m}$ is the $m$-th element of $\bm{a}_{t}$, and $\bm{W}_{a}\in\mathbb{R}^{1\times D_{a}}$, $\bm{W}_{f}\in\mathbb{R}^{D_{a}\times D_{r}}$, $\bm{W}_{h}\in\mathbb{R}^{D_{a}\times D_{h}}$ are transformation matrices. Both the dimension $D_{a}$ of the hidden layer for measuring the attention distribution and the dimension $D_{h}$ of the hidden layer in the LSTM are set to 512. $\lambda_{t}\in\mathbb{R}^{M}$ denotes the normalized attention distribution whose $m$-th element $\lambda_{t,m}$ is the attention weight of $\bm{r}_{m}$. The attended relationship feature is computed as $\bm{r}_{t}=\sum_{m=1}^{M}\lambda_{t,m}\bm{r}_{m}$. Then the updating procedure of the second-layer unit is

\bm{h}_{t}^{2},\bm{c}_{t}^{2}={f_{2}\left(\bm{r}_{t}\right)}_{|\bm{h}_{t}^{1},\bm{c}_{t}^{1}}, \qquad (17)

where $f_{2}$ is the updating function within the second-layer unit. $\bm{h}_{t}^{2}$ is used to predict the next word through a softmax layer.
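One decoding step of this two-layer decoder can be sketched as follows. The dimensions, module names, and exact gating are illustrative assumptions, not the released code; the two LSTM layers pass a single hidden/cell state back and forth and the second layer consumes the attended relationship feature.

```python
import torch
import torch.nn as nn

class RelationAttnDecoderStep(nn.Module):
    def __init__(self, d_w=300, d_r=4996, d_h=512, d_a=512):
        super().__init__()
        self.cell1 = nn.LSTMCell(d_w + d_r, d_h)    # f_1 in Eq. (15)
        self.cell2 = nn.LSTMCell(d_r, d_h)          # f_2 in Eq. (17)
        self.W_f = nn.Linear(d_r, d_a, bias=False)
        self.W_h = nn.Linear(d_h, d_a, bias=False)
        self.W_a = nn.Linear(d_a, 1, bias=False)

    def forward(self, w_t, R, state):
        # w_t: (1, d_w) word embedding; R: (M, d_r) relationship features r_m;
        # state: shared (h, c) updated by the second layer at the previous step.
        r_bar = R.mean(dim=0, keepdim=True)                          # mean-pooled feature
        h1, c1 = self.cell1(torch.cat([w_t, r_bar], dim=1), state)   # Eq. (15)
        a = self.W_a(torch.tanh(self.W_f(R) + self.W_h(h1)))         # (M, 1), Eq. (16)
        lam = torch.softmax(a, dim=0)
        r_t = (lam * R).sum(dim=0, keepdim=True)                     # attended relationship feature
        h2, c2 = self.cell2(r_t, (h1, c1))                           # Eq. (17)
        return h2, (h2, c2)            # h2 feeds the softmax word predictor; (h2, c2) is reused next step
```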

Table 5: Statistics of VG200, VG-KR, and VG150.

Dataset | Images  | Images with Relations | Object Categories | Predicate Categories | Object Instances | Relation Instances | Key Relation Instances | Images with Key Relations
VG200   | 51,498  | 46,562 | 200 | 80 | 619,119   | 442,425 | 101,312 | 26,992
VG-KR   | 26,992  | 26,992 | 200 | 80 | 360,306   | 250,755 |         |
VG150   | 108,073 | 89,169 | 150 | 50 | 1,145,398 | 622,705 | -       | -
Figure 10: The procedure of VG-KR dataset construction. The color block shown in the top right of each dataset in (a) demonstrates its components. (b) gives a global perspective of these color blocks. Images in MSCOCO consist of four parts, A, C, D, and E. Images in VG consist of B, C, D, and E. Images in VGC consist of C, D, and E. E denotes images filtered in step (2) which do not contain any relation. D denotes images filtered in step (3) which do not contain any key relation.
Figure 11: Distribution of images that contain different numbers of key relations.
Figure 12: Distribution of the roles of a given predicate. The red bars stand for the probability of being key relations while the blue bars denote the probability of being the secondary ones.

Appendix 0.C VG-KR Dataset Construction, Statistics, and Experimental Implementation Details

0.C.1 VG-KR Dataset

We demonstrate the procedure of constructing the VG-KR dataset and the different image sets involved in this procedure in Figure 10. Concretely, 51,498 images in VG come from MSCOCO, and they form the image set VGC. We conduct three-stage processing on VGC. (1) The Stanford Scene Graph Parser [31] is used to extract relation triplets from captions. They make up the set of key relations, denoted by $\mathcal{R}^{\mathcal{K}}$. (2) We next cleanse the raw annotations of VGC similarly to [44], keep the most frequent 150 object categories and 50 predicates, and add another 50 most frequent object categories and 30 predicates from $\mathcal{R}^{\mathcal{K}}$, in order to keep as many key relations as possible for the following third step. After dropping images without relations in VGC, we get a new subset VG200 (i.e., 200 object categories) which contains 46,562 images. (3) Finally, we associate $\mathcal{R}^{\mathcal{K}}$ with relation triplets in VG200 by matching their subject and object WordNet synsets [26], respectively. After filtering out the images without key relations in VG200, we obtain VG-KR, which contains 26,992 images. For both VG200 and VG-KR, we split the training and test sets with a 7:3 ratio, leading to 32,510/14,052 training/test images in VG200, and 18,720/8,272 training/test images in VG-KR.
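The association in step (3) can be sketched as a simple synset-level matching. Synset strings such as 'woman.n.01' and the function name are illustrative, and the actual pipeline may apply additional filtering.

```python
def mark_key_relations(vg_triplets, caption_triplets):
    """Both inputs are lists of (subject_synset, predicate, object_synset) tuples.
    A VG triplet is marked as key if some caption triplet shares its subject and object synsets."""
    caption_pairs = {(s, o) for (s, _, o) in caption_triplets}
    return [t for t in vg_triplets if (t[0], t[2]) in caption_pairs]
```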

We show more detailed statistics and compare with VG150 in Table 5. VG200 and VG-KR have more categories, as well as more object and relation instances per image, compared to VG150. Moreover, VG-KR contains indicative annotations of key relations.

In Figure 11, we show the distribution of images that contain different numbers of key relations. More than 90% of the images contain fewer than 5 key relations. This is reasonable because the key relations are obtained by matching the annotated relations with those extracted from captions, and the number of relation triplets in captions is generally not large. After all, a good caption only needs to describe the major contents instead of the less important details.

Given each predicate, we explore the distribution of its roles, i.e., whether it belongs to a key relation or not. The result is shown in Figure 12. The predicates with a large probability of being key, such as throwing, brushing, and sniffing, are usually verbs containing rich semantics. They are image-specific, and when we see these predicates, a scene can be roughly imagined. In contrast, predicates like belonging to, of, and behind, which carry little information, are less likely to make up key relations.

0.C.2 Settings and Implementation Details

The dimension of hidden states and cells in both Hybrid-LSTM and RRM is 512. The sizes of $\bm{W}^{(r)}_{1}$ and $\bm{W}^{(r)}_{2}$ in Eq. (11) in the main paper are $256\times 512$ and $1\times 256$, respectively. The GloVe embedding vectors we use are of 200 dimensions.

When training on the VG dataset, we follow previous works [51, 38] to extract the first 5,000 images of the training split and treat them as the validation split. The results reported on VG150, VG200, and VG-KR are obtained by first selecting the best model on the validation split and then evaluating it on the test split. As for the experiments on VRD, we report the results of the last epoch evaluated on the test split without model selection (the hyperparameter settings are the same as those of the experiments on VG).

We pre-train object detectors on VRD, VG150, and VG200 respectively and freeze the learned parameters. To train the whole model end-to-end, we use an SGD optimizer with a learning rate of 0.001 and a batch size of 10. When computing the ranking loss for RRM, we randomly sample 512 pairs of key triplets and secondary triplets. The margin $\gamma$ is empirically set to 0.5. All the existing methods evaluated on our VG200 or VG-KR datasets are retrained.

Figure 13: Effect of the threshold $T$ when constructing HET. AD and AW denote the average depth and average width, respectively. The red curve stands for the kR@1 performance of HetH-RRM, evaluated under the PREDCLS protocol with the triplet-match rule.

The threshold $T$ for determining a parent node has a direct influence on the shape of HET. We investigate the performance curve together with the tree depth and width variation trend. As shown in Figure 13, as $T$ varies from 0.1 to 0.9, the "tall thin" tree becomes a "short fat" tree, and the performance improves. Thus we set $T$ to 0.9.

As $T$ becomes larger, the condition for a node to be a parent node, i.e., $P_{nm}>T$ (Eq. (1) in the main paper), becomes more and more difficult to satisfy. Thus, our algorithm for constructing HET tends to set the root as the parent of a node, which results in a "shorter" and "fatter" tree.

A small $T$ would lead to considerable wrong hierarchical connections. Note that the hierarchical connections in our HET have much stronger semantics than the associations of siblings. Therefore, a large $T$ eliminates wrong hierarchical connections as far as possible. Although this means that more entities are set as children of the root and inappropriate sibling associations increase, proper hierarchical connections still play a positive role in context encoding.

Table 6: The results of multiple runs of HetH and the statistical significance. These results are obtained under the PREDCLS protocol.

#RUN | R@20       | R@50       | R@100
1    | 33.46      | 36.59      | 37.00
2    | 33.53      | 36.64      | 37.04
3    | 33.93      | 36.65      | 37.07
μ±σ  | 33.64±0.21 | 36.63±0.03 | 37.04±0.03

Appendix 0.D Robustness Analyses

We run HetH multiple times under the PREDCLS protocol. The results and statistical significance are shown in Table 6.

Figure 14: Curve charts and pie charts for the indicator defined as the sum of the subject and object visual saliency values. In the curve charts, different curves are drawn under different thresholds $T_{s}$. Top left: CS - IDC chart. Top right: IDC - CS chart. Bottom left: Component analysis for relations with small IDC values. Bottom right: Component analysis for relations with large IDC values.
Figure 15: Curve charts for the indicator defined as the sum of the subject and object visual saliency values and normalized areas. Different curves are drawn under different thresholds $T_{s}$. Left: CS - IDC chart. Right: IDC - CS chart.

Appendix 0.E Exploration on VG-KR

We develop the Relation Ranking Module (RRM) to prioritize key relations. We intend to capture humans' subjective assessment of the importance of relations with some objective indicators. As analyzed in Section 0.A, visually salient objects engage humans' gaze and have the potential to form major events. Therefore, visual saliency can be one useful indicator. However, it is easy to be misled when only visual saliency is considered.

To better describe the importance of a relation, we borrow the traditional "saliency" concept and put forward a brand-new concept, cognitive saliency, which tries to estimate the importance of a relation from humans' perspective, as the sensation of relation importance is highly subjective. To measure the cognitive saliency of a relation triplet, we use the number of times it is referred to within the five captions of each image, which can be directly obtained during the construction of our VG-KR dataset. However, this measurement of cognitive saliency is not computable (i.e., it is a grading from humans, but cannot be directly used in computational models). If we want to make use of the cognitive saliency, we need to find a computable indicator for it. The indicator should be proportional to cognitive saliency, which means that as the cognitive saliency goes up, the same trend should be observed on the indicator, and vice versa.

Intuitively, the first possible indicator is the visual saliency of the subject and object in a relation triplet. Specifically, we set the indicator $\Phi$ as the sum of the saliency values of the subject $o^{sub}$ and the object $o^{obj}$:

\mathcal{S}^{sub}=\frac{|\{p\,|\,p\in\bm{b}^{sub}\wedge\mathcal{S}^{p}>T_{s}\}|}{|\{p\,|\,p\in\bm{b}^{sub}\}|}, \qquad (18)
\mathcal{S}^{obj}=\frac{|\{p\,|\,p\in\bm{b}^{obj}\wedge\mathcal{S}^{p}>T_{s}\}|}{|\{p\,|\,p\in\bm{b}^{obj}\}|}, \qquad (19)
\Phi=\mathcal{S}^{sub}+\mathcal{S}^{obj}, \qquad (20)

where $p$ denotes pixels, $\bm{b}^{sub}$ and $\bm{b}^{obj}$ are the bounding boxes of the subject and object, $\mathcal{S}^{p}$ is the saliency value of pixel $p$, $\mathcal{S}^{sub}$ and $\mathcal{S}^{obj}$ are the saliency values of the subject and object, and $T_{s}$ is a given threshold. $|\cdot|$ computes the number of elements in a set. The pixel-wise saliency is computed by one of the state-of-the-art saliency detectors [8].

To draw the Cognitive Saliency (CS, Y-axis) vs. Indicator (IDC, X-axis) curve chart, we randomly sample 50,000 key relations from VG-KR with gradings from 1 to 5 as their CS values. As the IDC values (i.e., $\Phi$ in Eq. (20)) are continuous, we sort all the sampled IDC values in ascending order and divide them into 50 intervals $[\delta_{k},\delta_{k+1}]$, $0\leq k\leq 50$, where $\delta_{0}=\mathrm{IDC}_{\min}$ and $\delta_{50}=\mathrm{IDC}_{\max}$. In each interval, we draw a point with the mean of the sampled IDC values as the X-axis coordinate and the mean of the sampled CS values as the Y-axis coordinate. For the IDC (Y-axis) vs. CS (X-axis) curve chart, the sampled relations are grouped by CS values, and we compute the mean of the IDC values for each group as the Y-axis coordinates. These two charts are shown in Figure 14. In each chart, we draw curves under different settings of $T_{s}$, denoted by $\mathrm{sal}@T_{s}$. From the IDC - CS chart at the top right of Figure 14, IDC is proportional to CS. However, the CS - IDC chart at the top left of Figure 14 shows that CS is not strictly proportional to IDC, which means that even when the computed visual saliency of an object is large, the relations involving this object are not necessarily important. What causes this phenomenon? We further extract the relations with relatively small and relatively large IDC values and analyze the ratio of each type of triplet. Concretely, we find the quartering points $\lambda_{1}<\lambda_{2}<\lambda_{3}$ of the IDC values, and all the triplets whose IDC values are smaller than $\lambda_{1}$ or larger than $\lambda_{3}$ are picked out, forming the sets $\Psi$ and $\Omega$ respectively. The component analysis results of $\Psi$ and $\Omega$ are shown at the bottom of Figure 14, where the 18 most frequent types of triplets are demonstrated. From the bottom left pie chart, many triplets with low IDC and low CS values are relations between relatively small objects and large background entities. There are some exceptions, e.g., \langleman, on, surfboard\rangle and \langletrain, on, tracks\rangle; to further analyze the association between their IDC and CS values we would need to inspect the detailed image contents. What we should pay attention to is the bottom right pie chart, where we observe that most triplets in $\Omega$ are relations between an independent object and its components, such as \langlehand, of, man\rangle and \langleedge, of, bus\rangle. Actually, these relations are indeed not image-specific and carry little information; humans generally overlook them. However, if the saliency of an object is large, the saliency of its components will be large, too. This explains the phenomenon that when IDC keeps increasing, CS decreases instead.

To further rectify the indicator above, we take the sizes of the subject and object into account, based on the observation that the components or details of an entity are usually relatively small, which can offset their large saliency values. We therefore add the normalized sizes of the subject and object to the indicator:

\Phi^{\prime} = \mathcal{S}^{sub} + \mathcal{S}^{obj} + \frac{A(o^{sub})}{A(o_{\mathcal{I}})} + \frac{A(o^{obj})}{A(o_{\mathcal{I}})}, \qquad (21)

where A(\cdot) denotes the area function and o_{\mathcal{I}} denotes the whole image. Similarly, we draw the IDC - CS and CS - IDC charts for this indicator in Figure 15. The improved indicator proves feasible, as CS now increases monotonically with IDC, and vice versa.
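Reusing box_saliency from the earlier sketch, the improved indicator of Eq. (21) simply adds the two box areas normalized by the image area. Again, this is an illustrative sketch under the stated assumptions, not the exact released implementation.

def relation_indicator_v2(sal_map, sub_box, obj_box, thresh=0.5):
    """Eq. (21): Phi' = S_sub + S_obj + A(sub)/A(img) + A(obj)/A(img)."""
    h, w = sal_map.shape
    img_area = float(h * w)

    def norm_area(box):
        x1, y1, x2, y2 = box
        return max(0, x2 - x1) * max(0, y2 - y1) / img_area

    return (box_saliency(sal_map, sub_box, thresh)
            + box_saliency(sal_map, obj_box, thresh)
            + norm_area(sub_box) + norm_area(obj_box))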

The above exploration suggests that an indicator combining both the visual saliency and the size of an object is useful for finding key relations. Our RRM is therefore designed to learn humans' subjective assessment of relation importance under the guidance of visual saliency and entity size information.

Appendix F 0.F Additional Qualitative Results

We demonstrate more qualitative results in Figure 16. Across these examples, our RRM tends to favor relations between entities that are close to the root of the HET. These relations describe the global content of the scene and are usually what humans pay the most attention to. As a result, the captions generated from the top relations better cover the essential content. For example, in Figure 16(a), since the top-2 relations from the HetH model contain \langlewoman, wearing, boot_1/boot_2\rangle, the generated caption fails to capture the essential content that the woman is holding an umbrella. In contrast, the top-2 relations from HetH-RRM successfully capture this information. In some cases, we observe that even though the top-2 relations do not contain the essential content, the generated caption still captures it, e.g., the caption from HetH in Figure 16(b). This is mainly because the region of man_1 contains part of the region of motorcycle_1, which provides visual cues for inferring that a man is riding a motorcycle.

[Figure 16, panels (a), (b), (e), (f); see caption below.]
Figure 16: From top left to bottom right: bounding boxes of all objects, saliency maps, area maps, mixed maps, bounding boxes of objects involved in the top-5 relations from HetH, the HET structure, bounding boxes of objects involved in the top-5 relations from HetH-RRM, hierarchical scene graphs from HetH and HetH-RRM, and captions generated using the top-2 relations from HetH and HetH-RRM, respectively. Purple arrows in the scene graphs are key relations matched with the ground truth. The purple numeric tags next to relations are their rankings, where “1” means the relation receives the highest score.

References

  • [1] Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., Parikh, D.: Vqa: Visual question answering. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). pp. 2425–2433 (2015)
  • [2] Chen, T., Yu, W., Chen, R., Lin, L.: Knowledge-embedded routing network for scene graph generation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6163–6171 (2019)
  • [3] Dai, B., Zhang, Y., Lin, D.: Detecting visual relationships with deep relational networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3298–3308 (2017)
  • [4] Graves, A., Schmidhuber, J.: Framewise phoneme classification with bidirectional lstm and other neural network architectures. Neural Networks 18(5-6), 602–610 (2005)
  • [5] Gu, J., Zhao, H., Lin, Z., Li, S., Cai, J., Ling, M.: Scene graph generation with external knowledge and image reconstruction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1969–1978 (2019)
  • [6] Han, F., Zhu, S.C.: Bottom-up/top-down image parsing with attribute grammar. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 31(1), 59–73 (2008)
  • [7] He, S., Tavakoli, H.R., Borji, A., Pugeault, N.: Human attention in image captioning: Dataset and analysis. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). pp. 8529–8538 (2019)
  • [8] Hou, Q., Cheng, M.M., Hu, X., Borji, A., Tu, Z., Torr, P.: Deeply supervised salient object detection with short connections. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3203–3212 (2017)
  • [9] Itti, L., Koch, C., Niebur, E.: A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 20(11), 1254–1259 (1998)
  • [10] Johnson, J., Krishna, R., Stark, M., Li, L.J., Shamma, D., Bernstein, M., Fei-Fei, L.: Image retrieval using scene graphs. In: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR). pp. 3668–3678 (2015)
  • [11] Kim, D.J., Choi, J., Oh, T.H., Kweon, I.S.: Dense relational captioning: Triple-stream networks for relationship-based captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6271–6280 (2019)
  • [12] Klein, D.A., Frintrop, S.: Center-surround divergence of feature statistics for salient object detection. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). pp. 2214–2219 (2011)
  • [13] Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.J., Shamma, D.A., Bernstein, M.S., Fei-Fei, L.: Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision (IJCV) 123(1), 32–73 (2017)
  • [14] Li, G., Yu, Y.: Visual saliency based on multiscale deep features. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5455–5463 (2015)
  • [15] Li, Y., Ouyang, W., Wang, X., Tang, X.: Vip-cnn: Visual phrase guided convolutional neural network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 7244–7253 (2017)
  • [16] Li, Y., Ouyang, W., Zhou, B., Shi, J., Zhang, C., Wang, X.: Factorizable net: an efficient subgraph-based framework for scene graph generation. In: Proceedings of European Conference on Computer Vision (ECCV). vol. 11205, pp. 346–363. Springer (2018)
  • [17] Li, Y., Ouyang, W., Zhou, B., Wang, K., Wang, X.: Scene graph generation from objects, phrases and region captions. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). pp. 1261–1270 (2017)
  • [18] Liang, X., Lee, L., Xing, E.P.: Deep variation-structured reinforcement learning for visual relationship and attribute detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4408–4417 (2017)
  • [19] Liang, Y., Bai, Y., Zhang, W., Qian, X., Zhu, L., Mei, T.: Vrr-vg: Refocusing visually-relevant relationships. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). pp. 10403–10412 (2019)
  • [20] Lin, L., Wang, G., Zhang, R., Zhang, R., Liang, X., Zuo, W.: Deep structured scene parsing by learning with image descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2276–2284 (2016)
  • [21] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: Proceedings of European Conference on Computer Vision (ECCV). vol. 8693, pp. 740–755. Springer (2014)
  • [22] Lin, X., Ding, C., Zeng, J., Tao, D.: Gps-net: Graph property sensing network for scene graph generation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3746–3755 (2020)
  • [23] Liu, N., Han, J., Yang, M.H.: Picanet: Learning pixel-wise contextual attention for saliency detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3089–3098 (2018)
  • [24] Lu, C., Krishna, R., Bernstein, M., Fei-Fei, L.: Visual relationship detection with language priors. In: Proceedings of European Conference on Computer Vision (ECCV). vol. 9905, pp. 852–869. Springer (2016)
  • [25] Lu, J., Yang, J., Batra, D., Parikh, D.: Neural baby talk. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 7219–7228 (2018)
  • [26] Miller, G.A.: Wordnet: A lexical database for English. Communications of the ACM 38(11), 39–41 (1995)
  • [27] Pennington, J., Socher, R., Manning, C.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 1532–1543 (2014)
  • [28] Peyre, J., Laptev, I., Schmid, C., Sivic, J.: Weakly-supervised learning of visual relations. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). pp. 5179–5188 (2017)
  • [29] Qi, M., Li, W., Yang, Z., Wang, Y., Luo, J.: Attentive relational networks for mapping images to scene graphs. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3957–3966 (2019)
  • [30] Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems (NIPS). pp. 91–99 (2015)
  • [31] Schuster, S., Krishna, R., Chang, A., Fei-Fei, L., Manning, C.D.: Generating semantically precise scene graphs from textual descriptions for improved image retrieval. In: Proceedings of the Fourth Workshop on Vision and Language. pp. 70–80 (2015)
  • [32] Sharma, A., Tuzel, O., Jacobs, D.W.: Deep hierarchical parsing for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 530–538 (2015)
  • [33] Shi, J., Zhang, H., Li, J.: Explainable and explicit visual reasoning over scene graphs. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 8376–8384 (2019)
  • [34] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
  • [35] Socher, R., Lin, C.C., Manning, C., Ng, A.Y.: Parsing natural scenes and natural language with recursive neural networks. In: Proceedings of the 28th International Conference on Machine Learning (ICML-11). pp. 129–136 (2011)
  • [36] Tai, K.S., Socher, R., Manning, C.D.: Improved semantic representations from tree-structured long short-term memory networks. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). pp. 1556–1566 (2015)
  • [37] Tang, K., Niu, Y., Huang, J., Shi, J., Zhang, H.: Unbiased scene graph generation from biased training. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3716–3725 (2020)
  • [38] Tang, K., Zhang, H., Wu, B., Luo, W., Liu, W.: Learning to compose dynamic tree structures for visual contexts. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6619–6628 (2019)
  • [39] Wang, L., Lu, H., Ruan, X., Yang, M.H.: Deep networks for saliency detection via local estimation and global search. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3183–3192 (2015)
  • [40] Wang, T., Borji, A., Zhang, L., Zhang, P., Lu, H.: A stagewise refinement model for detecting salient objects in images. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). pp. 4019–4028 (2017)
  • [41] Wang, W., Wang, R., Shan, S., Chen, X.: Exploring context and visual pattern of relationship for scene graph generation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 8188–8197 (2019)
  • [42] Wu, Q., Shen, C., Wang, P., Dick, A., van den Hengel, A.: Image captioning and visual question answering based on attributes and external knowledge. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 40(6), 1367–1381 (2018)
  • [43] Xie, Y., Lu, H., Yang, M.H.: Bayesian saliency via low and mid level cues. IEEE Transactions on Image Processing (TIP) 22(5), 1689–1698 (2012)
  • [44] Xu, D., Zhu, Y., Choy, C.B., Fei-Fei, L.: Scene graph generation by iterative message passing. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5410–5419 (2017)
  • [45] Yang, J., Lu, J., Lee, S., Batra, D., Parikh, D.: Graph r-cnn for scene graph generation. In: Proceedings of European Conference on Computer Vision (ECCV). vol. 11205, pp. 690–706. Springer (2018)
  • [46] Yao, T., Pan, Y., Li, Y., Mei, T.: Exploring visual relationship for image captioning. In: Proceedings of European Conference on Computer Vision (ECCV). vol. 11218, pp. 711–727. Springer (2018)
  • [47] Yao, T., Pan, Y., Li, Y., Mei, T.: Hierarchy parsing for image captioning. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). pp. 2621–2629 (2019)
  • [48] Yin, G., Sheng, L., Liu, B., Yu, N., Wang, X., Shao, J., Loy, C.C.: Zoom-net: Mining deep feature interactions for visual relationship recognition. In: Proceedings of European Conference on Computer Vision (ECCV). vol. 11207, pp. 330–347. Springer (2018)
  • [49] Yu, R., Li, A., Morariu, V.I., Davis, L.S.: Visual relationship detection with internal and external linguistic knowledge distillation. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). pp. 1974–1982 (2017)
  • [50] Zareian, A., Karaman, S., Chang, S.F.: Weakly supervised visual semantic parsing. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3736–3745 (2020)
  • [51] Zellers, R., Yatskar, M., Thomson, S., Choi, Y.: Neural motifs: Scene graph parsing with global context. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5831–5840 (2018)
  • [52] Zhang, H., Kyaw, Z., Chang, S.F., Chua, T.S.: Visual translation embedding network for visual relation detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5532–5540 (2017)
  • [53] Zhang, H., Kyaw, Z., Yu, J., Chang, S.F.: Ppr-fcn: Weakly supervised visual relation detection via parallel pairwise r-fcn. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). pp. 4233–4241 (2017)
  • [54] Zhang, J., Kalantidis, Y., Rohrbach, M., Paluri, M., Elgammal, A., Elhoseiny, M.: Large-scale visual relationship understanding. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI). vol. 33, pp. 9185–9194 (2019)
  • [55] Zhang, J., Shih, K.J., Elgammal, A., Tao, A., Catanzaro, B.: Graphical contrastive losses for scene graph parsing. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 11535–11543 (2019)
  • [56] Zhang, L., Zhang, J., Lin, Z., Lu, H., He, Y.: Capsal: Leveraging captioning to boost semantics for salient object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6024–6033 (2019)
  • [57] Zhu, L., Chen, Y., Lin, Y., Lin, C., Yuille, A.: Recursive segmentation and recognition templates for image parsing. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 34(2), 359–371 (2011)