
Key Laboratory of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing, 100190, China
University of Chinese Academy of Sciences, Beijing, 100049, China
Email: [email protected], {wangruiping, sgshan, xlchen}@ict.ac.cn

Sketching Image Gist: Human-Mimetic Hierarchical Scene Graph Generation

Wenbin Wang (ORCID 0000-0002-4394-0145), Ruiping Wang (ORCID 0000-0003-1830-2595), Shiguang Shan (ORCID 0000-0002-8348-392X), Xilin Chen (ORCID 0000-0003-3024-4404)
Abstract

Scene graph aims to faithfully reveal humans' perception of image content. When humans analyze a scene, they usually prefer to describe the image gist first, namely the major objects and key relations in a scene graph. This inherent perceptive habit implies that there exists a hierarchical structure reflecting humans' preference during the scene parsing procedure. Therefore, we argue that a desirable scene graph should also be hierarchically constructed, and introduce a new scheme for modeling scene graphs. Concretely, a scene is represented by a human-mimetic Hierarchical Entity Tree (HET) consisting of a series of image regions. To generate a scene graph based on HET, we parse HET with a Hybrid Long Short-Term Memory (Hybrid-LSTM) which specifically encodes hierarchy and siblings context to capture the structured information embedded in HET. To further prioritize key relations in the scene graph, we devise a Relation Ranking Module (RRM) to dynamically adjust their rankings by learning to capture humans' subjective perceptive habits from objective entity saliency and size. Experiments indicate that our method not only achieves state-of-the-art performance for scene graph generation, but also excels at mining image-specific relations, which play a great role in serving downstream tasks.

Keywords:
Image Gist, Key Relation, Hierarchical Entity Tree, Hybrid-LSTM, Relation Ranking Module

1 Introduction

Figure 1: Scene graphs from existing methods shown in (a) and (b) fail to sketch the image gist. The hierarchical structure of humans' perception preference is shown in (f), where the bottom-left highlighted branch stands for the hierarchy in (e). The scene graphs in (c) and (d), based on the hierarchical structure, better capture the gist. Relations in (a) and (b), and purple arrows in (c) and (d), are top-5 relations, while gray ones in (c) and (d) are secondary.

In an effort to thoroughly understand a scene, scene graph generation (SGG) [10, 44], in which objects and their pairwise relations are detected, has been bridging the gap between low-level recognition and high-level cognition, and contributes to tasks like image captioning [42, 25, 46], VQA [1, 38], and visual reasoning [33]. While previous works [44, 17, 45, 16, 52, 29, 38, 41, 51, 55] have pushed this area forward, the generated scene graphs may still be far from perfect, e.g., they seldom consider whether the detected relations are what humans want to convey from the image. As a symbolic representation of an image, the scene graph is expected to record the image content as completely as possible. More importantly, a scene graph is not just for being admired, but for supporting downstream tasks such as image captioning, where a description is supposed to depict the major event in the image, namely the image gist. This characteristic is also one of humans' inherent habits when they parse a scene. Therefore, an urgently needed capability of SGG is to assess relation importance and prioritize the relations which form the major events that humans intend to preferentially convey, i.e., key relations. This is seldom considered by existing methods. What's worse, the universally unbalanced distribution of relationship triplets in mainstream datasets exacerbates the problem that the major event cannot be found. Let's study the quality of the top relations predicted by existing state-of-the-art methods (e.g., [51]) and check whether they are "key" or not. In Figure 1(a)(b), the two scene graphs shown with top-5 relations for images A and B are mostly the same, although the major events in A and B are quite different. In other words, existing methods are deficient in mining image-specific relations, but biased towards trivial or self-evident ones (e.g., \langlewoman, has, head\rangle can be obtained from commonsense without observing the image), which fail to convey the image gist (colored parts in the ground truth captions in Figure 1) and barely contribute to downstream tasks.

Any pair of objects in a scene can be considered relevant, at least in terms of their spatial configuration. Faced with such a massive number of relations, how do humans choose which relations to describe? Given picture (ii) in Figure 1(e), a zoomed-in sub-region of picture (i), humans will describe it with \langlewoman, riding, bike\rangle, since woman and bike belong to the same perceptive level and their interaction forms the major event in (ii). When it comes to picture (iii), the answers would be \langlewoman, wearing, helmet\rangle and \langlebag, on, woman\rangle, where helmet and bag are finer details of woman and belong to an inferior perceptive level. This suggests that there naturally exists a hierarchical structure of humans' perception preference, as shown in Figure 1(f).

Inspired by the observations above, we argue that a desirable scene graph should be hierarchically constructed. Specifically, we represent the image with a human-mimetic Hierarchical Entity Tree (HET) where each node is a detected object and each one can be decomposed into a set of finer objects attached to it. To generate the scene graph based on HET, we devise the Hybrid Long Short-Term Memory (Hybrid-LSTM) to encode both hierarchy and siblings context [51, 38] and capture the structured information embedded in HET, considering that important related pairs are more likely to be seen either inside a certain perceptive level or between two adjacent perceptive levels. We further intend to evaluate the performance of different models on key relation prediction, but annotations of key relations are not directly available in existing datasets. Therefore, we extend Visual Genome (VG) [13] to the VG-KR dataset, which contains indicative annotations of key relations, by drawing support from caption annotations in MSCOCO [21]. We devise a Relation Ranking Module (RRM) to adjust the rankings of relations. It captures humans' subjective perceptive habits from objective entity saliency and size, and achieves the best performance on mining key relations. (Source code and dataset are available at http://vipl.ict.ac.cn/resources/codes or https://github.com/Kenneth-Wong/het-eccv20.git.)

2 Related Works

Scene graph generation (SGG) and visual relationship detection (VRD) are the two most common tasks aiming at extracting interactions between two objects. In the field of VRD, various studies [24, 3, 15, 52, 49, 28, 53, 48, 54] mainly focus on detecting each relation triplet independently rather than describing the structure of the scene. The concept of the scene graph is first proposed in [10] for image retrieval. Xu et al. [44] define the SGG task and creatively devise a message passing mechanism for scene graph inference. A series of succeeding works design various approaches to improve the graph representation. Li et al. [17] incorporate image captions and object information to jointly address multiple tasks. [51, 38, 41, 22] draw support from useful context construction. Yang et al. [45] propose Graph-RCNN to embed the structured information. Qi et al. [29] employ a self-attention module to embed a weighted graph representation. Zhang et al. [55] propose contrastive losses to resolve the related-pair configuration ambiguity. Zareian et al. [50] creatively treat SGG as an edge role assignment problem. Recently, some methods try to borrow advantages from knowledge [2, 5] or causal effects [37] to diversify the predicted relations. Liang et al. [19] prune the dominant and easy-to-predict relations in VG to alleviate the annihilation problem of rare but meaningful relations.

Structured Scene Parsing has received much attention in pursuit of higher-level scene understanding. [35, 32, 20, 6, 57, 47] construct various hierarchical structures for their specific tasks. Unlike existing SGG studies that indiscriminately detect relations no matter whether humans care about them or not, our work introduces the idea of a hierarchical structure into the SGG task, and tries to give priority to detecting key relations, followed by the trivial ones for completeness.

Saliency vs. Image Gist. An extremely rich set of studies [14, 39, 23, 40, 8, 56] focuses on analyzing where humans gaze and finding visually salient objects (high contrast of luminance, hue, and saturation, center position [9, 12, 43], etc.). It is notable that visually salient objects are related to, but not equal to, the objects involved in the image gist. He et al. [7] explore gaze data and find that only 48% of fixated objects are referred to in humans' descriptions of the image, while 95% of objects referred to in descriptions are fixated. This suggests that objects referred to in a description (i.e., objects that humans think important and that should form the major events / image gist) are almost always visually salient and reveal where humans gaze, but what humans fixate on (i.e., visually salient objects) is not always what they want to convey. We provide some examples in the Appendix to help understand this finding. Naturally, we need to emphasize that the levels in our HET reflect the perception priority level rather than object saliency. Besides, this finding supports us in obtaining the indicative annotations of key relations with the help of image caption annotations.

3 Proposed Approach

3.1 Overview

The scene graph $\mathcal{G}=\{\mathcal{O},\mathcal{R}\}$ of an image $\mathcal{I}$ contains a set of entities $\mathcal{O}=\{o_{i}\}_{i=1}^{N}$ and their pairwise relations $\mathcal{R}=\{r_{k}\}_{k=1}^{M}$. Each $r_{k}$ is a triplet $\langle o_{i},p_{ij},o_{j}\rangle$, where $p_{ij}\in\mathcal{P}$ and $\mathcal{P}$ is the set of all predicates. As illustrated in Figure 2, our approach can be summarized into four steps. (i) We apply Faster R-CNN [30] with a VGG16 [34] backbone to detect all the entity proposals, each of which possesses its bounding box $\bm{b}_{i}\in\mathbb{R}^{4}$, a 4,096-dimensional visual feature $\bm{v}_{i}$, and a class probability vector $\bm{q}_{i}$ from the softmax output. (ii) In Section 3.2, HET is constructed by organizing the detected entities according to their perceptive levels. (iii) In Section 3.3, we design the Hybrid-LSTM network to parse HET, which first encodes the structured context and then decodes it for graph inference. (iv) In Section 3.4, we improve the scene graph generated in (iii) with our devised RRM, which further adjusts the rankings of relations and shifts the graph focus to the relations between entities close to the top perceptive levels of HET.

Figure 2: An overview of our method. An object detector is first applied to support HET construction. Then Hybrid-LSTM is leveraged to parse HET, and specifically contains 4 processes: (a) entity context encoding, (b) relation context encoding, (c) entity context decoding, and (d) relation context decoding. Finally, RRM predicts a ranking score for each triplet, which further prioritizes the key relations in the scene graph.

3.2 HET Construction

We aim to construct a hierarchical structure whose top-down levels accord with the perceptive levels of humans' inherent scene parsing hierarchy. From a massive number of observations, we find that entities with larger sizes are relatively more likely to form the major events in a scene (this will be proved effective through experiments). Therefore, we arrange larger entities as close to the root of HET as possible. Each entity can be decomposed into finer entities that make up the inferior level.

Concretely, HET is a multi-branch tree $\mathcal{T}$ with a virtual root $o_{0}$ standing for the whole image. All the entities are sorted in descending order according to their sizes, yielding an ordered sequence $\{o_{i_{1}},o_{i_{2}},\ldots,o_{i_{N}}\}$. For each entity $o_{i_{n}}$, we consider the entities with larger size, $\{o_{i_{m}}\}, 1\leq m<n$, and calculate the ratio

P_{nm}=\frac{I\left(o_{i_{n}},o_{i_{m}}\right)}{A(o_{i_{n}})}, \qquad (1)

where $A(\cdot)$ denotes the size of an entity and $I(\cdot,\cdot)$ is the intersection area of two entities. If $P_{nm}$ is larger than a threshold $T$, $o_{i_{m}}$ will be a candidate parent node of $o_{i_{n}}$, since $o_{i_{m}}$ contains most of $o_{i_{n}}$. If there is no candidate, the parent node of $o_{i_{n}}$ is set as $o_{0}$. If there is more than one, we further determine the parent with two alternative strategies:

Area-first Strategy (AFS). Considering that an entity with a larger size has a higher probability of containing more details or components, the candidate with the largest size is selected as the parent node.

Intersection-first Strategy (IFS). We compute the ratio

Q_{nm}=\frac{I\left(o_{i_{n}},o_{i_{m}}\right)}{A(o_{i_{m}})}. \qquad (2)

A larger $Q_{nm}$ means that $o_{i_{n}}$ is relatively more important to $o_{i_{m}}$ than to the other candidates. Therefore, $o_{i_{m}}$ with $m=\arg\max_{k}Q_{nk}$ is chosen as the parent of $o_{i_{n}}$.
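To make the construction procedure concrete, the following Python snippet is a minimal sketch under simplifying assumptions: boxes are plain (x1, y1, x2, y2) tuples from the detector, and function names such as build_het are illustrative rather than taken from the released code.

```python
def area(b):
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def intersection(a, b):
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)

def build_het(boxes, T=0.9, strategy="IFS"):
    """Return parent[i] for every entity index; -1 denotes the virtual root o_0."""
    order = sorted(range(len(boxes)), key=lambda i: area(boxes[i]), reverse=True)
    parent = {i: -1 for i in order}
    for n, i_n in enumerate(order):
        # candidate parents: larger entities that cover most of o_{i_n} (Eq. 1)
        cands = [i_m for i_m in order[:n]
                 if intersection(boxes[i_n], boxes[i_m]) / (area(boxes[i_n]) + 1e-12) > T]
        if not cands:
            continue                              # parent stays the virtual root
        if strategy == "AFS":                     # area-first: largest candidate
            parent[i_n] = max(cands, key=lambda i_m: area(boxes[i_m]))
        else:                                     # IFS: candidate with the largest Q_{nm} (Eq. 2)
            parent[i_n] = max(cands, key=lambda i_m:
                              intersection(boxes[i_n], boxes[i_m]) / (area(boxes[i_m]) + 1e-12))
    return parent
```

With T = 0.9 (the value adopted in the Appendix), only clearly contained entities form deeper levels and the remaining entities attach directly to the root.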

3.3 Structured Context Encoding and Scene Graph Generation

The interpretability of HET implies that important relations are more likely to be seen between entities either inside a certain level or from two adjacent levels. Therefore, both hierarchical connection [38] and sibling association [51] are useful for context modeling. Our Hybrid-LSTM encoder consists of a bidirectional multi-branch TreeLSTM [36] (Bi-TreeLSTM) for encoding the hierarchy context, and a bidirectional chain LSTM [4] (Bi-LSTM) for encoding the siblings context. We use two identical Hybrid-LSTM encoders to encode two types of context for each entity: one is the entity context, which helps predict the information of the entity itself, and the other is the relation context, which plays a role in inferring the relation when interacting with other potentially relevant entities. For brevity, we only provide a detailed introduction of entity context encoding (Figure 2(a)). Specifically, the input feature $\bm{x}_{i}$ of each node $o_{i}$ is the concatenation of the visual feature $\bm{v}_{i}$ and the weighted sum of semantic embedding vectors, $\bm{z}_{i}=\bm{W}_{e}^{(1)}\bm{q}_{i}$, where $\bm{W}_{e}^{(1)}$ is a word embedding matrix initialized from GloVe [27]. For the root node $o_{0}$, $\bm{v}_{0}$ is obtained with the whole-image bounding box, while $\bm{z}_{0}$ is initialized randomly.

The hierarchy context (blue arrows in Figure 2(a)) is encoded as:

\bm{C}=\mathrm{BiTreeLSTM}(\{\bm{x}_{i}\}_{i=0}^{N}), \qquad (3)

where $\bm{C}=\{\bm{c}_{i}\}_{i=0}^{N}$ and each $\bm{c}_{i}=\left[\overrightarrow{\bm{h}_{i}^{\mathcal{T}}};\overleftarrow{\bm{h}_{i}^{\mathcal{T}}}\right]$ is the concatenation of the top-down and bottom-up hidden states of the Bi-TreeLSTM:

\overrightarrow{\bm{h}_{i}^{\mathcal{T}}}=\mathrm{TreeLSTM}\left(\bm{x}_{i},\overrightarrow{\bm{h}_{p}^{\mathcal{T}}}\right), \qquad (4a)
\overleftarrow{\bm{h}_{i}^{\mathcal{T}}}=\mathrm{TreeLSTM}\left(\bm{x}_{i},\left\{\overleftarrow{\bm{h}_{j}^{\mathcal{T}}}\,\middle|\,j\in C(i)\right\}\right), \qquad (4b)

where $C(\cdot)$ denotes the set of children nodes, while the subscript $p$ denotes the parent of node $i$.

The siblings context (red arrows in Figure 2(a)) is encoded within each set of children nodes which share the same parent:

\bm{S}=\mathrm{BiLSTM}(\{\bm{x}_{i}\}_{i=0}^{N}), \qquad (5)

where $\bm{S}=\{\bm{s}_{i}\}_{i=0}^{N}$ and each $\bm{s}_{i}=\left[\overrightarrow{\bm{h}_{i}^{\mathcal{L}}};\overleftarrow{\bm{h}_{i}^{\mathcal{L}}}\right]$ is the concatenation of the forward and backward hidden states of the Bi-LSTM:

\overrightarrow{\bm{h}_{i}^{\mathcal{L}}}=\mathrm{LSTM}\left(\bm{x}_{i},\overrightarrow{\bm{h}_{l}^{\mathcal{L}}}\right),\quad\overleftarrow{\bm{h}_{i}^{\mathcal{L}}}=\mathrm{LSTM}\left(\bm{x}_{i},\overleftarrow{\bm{h}_{r}^{\mathcal{L}}}\right), \qquad (6)

where $l$ and $r$ stand for the left and right siblings which share the same parent with $i$. We further concatenate the hierarchy and siblings context to obtain the entity context $\bm{f}^{\mathcal{O}}_{i}=[\bm{c}_{i};\bm{s}_{i}]$. Missing branches or siblings are padded with zero vectors.

The relation context is encoded (Figure 2(b)) in the same way as the entity context except that the input of each node is replaced by $\{\bm{f}^{\mathcal{O}}_{i}\}_{i=0}^{N}$. Another Hybrid-LSTM encoder is applied to get the relation context $\{\bm{f}^{\mathcal{R}}_{i}\}_{i=0}^{N}$.

To generate a scene graph, we decode the context to obtain entity and relation information. In HET, a child node strongly depends on its parent, i.e., information of the parent node is helpful for predicting the child node. Therefore, to predict entity information, we decode the entity context in a top-down manner following Eq. (4a), as shown in Figure 2(c). For node $o_{i}$, the input $\bm{x}_{i}$ in Eq. (4a) is replaced with $[\bm{f}_{i}^{\mathcal{O}};\bm{W}_{e}^{(2)}\bm{q}_{p}]$, where $\bm{W}_{e}^{(2)}$ is a word embedding matrix and $\bm{q}_{p}$ is the predicted class probability vector of the parent of $o_{i}$. The output hidden state is fed into a softmax classifier and a bounding box regressor to predict the entity information of $o_{i}$. To predict the predicate $p_{ij}$ between $o_{i}$ and $o_{j}$, we feed $\bm{f}_{ij}^{\mathcal{R}}=[\bm{f}_{i}^{\mathcal{R}};\bm{f}_{j}^{\mathcal{R}}]$ to an MLP classifier (Figure 2(d)). As a result, a scene graph is generated, and for each triplet containing subject $o_{i}$, object $o_{j}$, and predicate $p_{ij}$, we obtain their scalar scores $s_{i}$, $s_{j}$, and $s_{ij}$.
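For concreteness, the PyTorch sketch below illustrates the spirit of the Hybrid-LSTM encoder under simplifying assumptions: the multi-branch TreeLSTM is approximated by an LSTMCell whose incoming state is the parent state (top-down pass) or the sum of the children states (bottom-up pass), one image is processed at a time, and all class and variable names are illustrative rather than taken from the released code.

```python
import torch
import torch.nn as nn

class HybridLSTMEncoder(nn.Module):
    """Simplified sketch: hierarchy context (tree passes) + siblings context (chain Bi-LSTM)."""
    def __init__(self, in_dim, hid_dim):
        super().__init__()
        self.td = nn.LSTMCell(in_dim, hid_dim)    # top-down tree pass, Eq. (4a)
        self.bu = nn.LSTMCell(in_dim, hid_dim)    # bottom-up tree pass, Eq. (4b), children summed
        self.sib = nn.LSTM(in_dim, hid_dim, bidirectional=True, batch_first=True)
        self.hid_dim = hid_dim

    def forward(self, x, children):
        # x: (N, in_dim) node features, index 0 is the virtual root;
        # children[i]: list of child indices of node i (empty list for leaves).
        N = x.size(0)
        zeros = lambda: (x.new_zeros(1, self.hid_dim), x.new_zeros(1, self.hid_dim))
        h_td, c_td, h_bu, c_bu = {}, {}, {}, {}

        def down(i, state):                       # parent state flows to children
            h_td[i], c_td[i] = self.td(x[i:i+1], state)
            for j in children[i]:
                down(j, (h_td[i], c_td[i]))

        def up(i):                                # children states aggregated (sum) for the parent
            if children[i]:
                hs, cs = zip(*[up(j) for j in children[i]])
                state = (sum(hs), sum(cs))
            else:
                state = zeros()
            h_bu[i], c_bu[i] = self.bu(x[i:i+1], state)
            return h_bu[i], c_bu[i]

        down(0, zeros()); up(0)

        # siblings context: one Bi-LSTM run per group of children sharing a parent (Eqs. 5-6);
        # the root and nodes without siblings keep zero vectors, as in the paper.
        s = x.new_zeros(N, 2 * self.hid_dim)
        for i in range(N):
            if children[i]:
                out, _ = self.sib(x[children[i]].unsqueeze(0))   # (1, |group|, 2*hid)
                s[children[i]] = out.squeeze(0)

        c = torch.cat([torch.cat([h_td[i], h_bu[i]], dim=1) for i in range(N)])
        return torch.cat([c, s], dim=1)           # f_i = [c_i; s_i] for every node
```

Running two such encoders back to back yields the entity and relation contexts of this section; the decoding stage then reuses the top-down pass with the parent's predicted class embedding appended to the input.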

3.4 Relation Ranking Module

So far, we have obtained the hierarchical scene graph based on HET. With the key relation annotations we collect (Section 4.1), we intend to further maximize the performance on mining key relations with supervised information, and explore the advantages brought by HET. Consequently, we design a Relation Ranking Module (RRM) to prioritize key relations. As analyzed in Related Works, regions of humans' interest can be tracked under the guidance of visual saliency, although they do not always form the major events that humans want to convey. Besides, entity size, which guides HET construction, is not only an important reference for estimating the perceptive level of entities, but is also found helpful to rectify some misleading cases in humans' subjective assessment of the importance of relations (see the Appendix). Therefore, we propose to learn to capture humans' subjective assessment of the importance of relations under the guidance of visual saliency and entity size information.

We first employ DSS [8] to predict a pixel-wise saliency map (SM) $\mathcal{S}$ for each image. To effectively collect entity size information, we propose a pixel-wise area map (AM) $\mathcal{A}$. Given the image $\mathcal{I}$ and its $N$ detected entities $\{o_{i}\}_{i=1}^{N}$ with bounding boxes $\{\bm{b}_{i}\}_{i=1}^{N}$ (and specially $o_{0}$ and $\bm{b}_{0}$ for the whole image), the value $a_{xy}$ at each position $(x,y)$ of $\mathcal{A}$ is defined as the minimum normalized size of the entities that cover $(x,y)$:

a_{xy}=\begin{cases}\min\left\{\frac{A(o_{i})}{A(o_{0})}\,\middle|\,i\in\mathcal{X}\right\}, & \mathrm{if}\ \mathcal{X}\neq\emptyset,\\ 0, & \mathrm{otherwise},\end{cases} \qquad (7)

where $\mathcal{X}=\{i\,|\,(x,y)\in\bm{b}_{i},\,0<i\leq N\}$. The sizes of both $\mathcal{S}$ and $\mathcal{A}$ are the same as that of the input image $\mathcal{I}$. We apply adaptive average pooling ($\mathrm{AAP}(\cdot)$) to smooth and down-sample these two maps to align with the shape of the conv5 feature map $\mathcal{F}$ from Faster R-CNN, and obtain the attention-embedded feature map $\mathcal{F}_{S}$:

\mathcal{F}_{S}=\mathcal{F}\odot(\mathrm{AAP}(\mathcal{S})+\mathrm{AAP}(\mathcal{A})), \qquad (8)

where $\odot$ is the Hadamard product.
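A minimal sketch of how the area map of Eq. (7) and the attention-embedded feature map of Eq. (8) could be computed is given below. It assumes a saliency map S of shape (H, W), boxes in (x1, y1, x2, y2) pixel coordinates, and a conv5 feature map of shape (C, H5, W5); the function names are illustrative, not the released implementation.

```python
import torch
import torch.nn.functional as Fn

def area_map(boxes, image_hw):
    """Pixel-wise area map (Eq. 7): each pixel gets the minimum normalized size covering it."""
    H, W = image_hw
    A0 = float(H * W)                              # size of the whole image o_0
    am = torch.zeros(H, W)
    # write boxes from largest to smallest so the smallest covering entity wins
    for (x1, y1, x2, y2) in sorted(boxes, key=lambda b: (b[2]-b[0])*(b[3]-b[1]),
                                   reverse=True):
        am[int(y1):int(y2), int(x1):int(x2)] = (x2 - x1) * (y2 - y1) / A0
    return am                                      # uncovered positions stay 0

def attend_features(F5, S, AM):
    """Down-sample both maps to the conv5 resolution and modulate the features (Eq. 8)."""
    h5, w5 = F5.shape[-2:]
    s = Fn.adaptive_avg_pool2d(S[None, None], (h5, w5)).squeeze(0)    # (1, H5, W5)
    a = Fn.adaptive_avg_pool2d(AM[None, None], (h5, w5)).squeeze(0)
    return F5 * (s + a)                            # Hadamard product
```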

We predict a score for each triplet to adjust their rankings. The input contains the visual representation of a triplet, $\bm{v}_{ij}\in\mathbb{R}^{4096}$, obtained by RoI pooling on $\mathcal{F}_{S}$. Besides, geometric information is an auxiliary cue for estimating importance. For a triplet with subject box $\bm{b}_{i}$ and object box $\bm{b}_{j}$, the geometric feature $\bm{g}_{ij}$ is defined as a 6-dimensional vector following [11]:

\bm{g}_{ij}=\left[\frac{x_{j}-x_{i}}{\sqrt{w_{i}h_{i}}},\ \frac{y_{j}-y_{i}}{\sqrt{w_{i}h_{i}}},\ \sqrt{\frac{w_{j}h_{j}}{w_{i}h_{i}}},\ \frac{w_{i}}{h_{i}},\ \frac{w_{j}}{h_{j}},\ \frac{\bm{b}_{i}\cap\bm{b}_{j}}{\bm{b}_{i}\cup\bm{b}_{j}}\right], \qquad (9)

which is projected to a 256-dimensional vector and concatenated with $\bm{v}_{ij}$, resulting in the final representation of a relation, $\bm{r}_{ij}=[\bm{v}_{ij};\bm{W}^{(g)}\bm{g}_{ij}]$, where $\bm{W}^{(g)}\in\mathbb{R}^{256\times 6}$ is the projection matrix. We then use a bidirectional LSTM to encode global context among all the triplets so that the ranking score of each triplet can be reasonably adjusted considering the scores of the other triplets. Concretely, the ranking score $t_{ij}$ for a pair $(o_{i}, o_{j})$ is obtained as:

\{\bm{h}^{\mathcal{R}}_{ij}\}=\mathrm{BiLSTM}\left(\{\bm{r}_{ij}\}\right), \qquad (10)
t_{ij}=\bm{W}^{(r)}_{2}\,\mathrm{ReLU}(\bm{W}^{(r)}_{1}\bm{h}^{\mathcal{R}}_{ij}). \qquad (11)

$\bm{W}^{(r)}_{1}$ and $\bm{W}^{(r)}_{2}$ are the weights of two fully connected layers. The ranking score is fused with the classification scores so that both the confidences of the three components of a triplet and the ranking priority are considered, resulting in the final ranking confidence $\phi_{ij}=s_{i}\cdot s_{j}\cdot s_{ij}\cdot t_{ij}$, which is used for re-ranking the relations.
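The ranking branch can be sketched as follows. The snippet assumes boxes in (x, y, w, h) form with (x, y) the top-left corner, per-pair RoI features already pooled from the attention-embedded feature map, and illustrative layer names and dimensions; it is a sketch of Eqs. (9)-(11) and the score fusion, not the exact released implementation.

```python
import torch
import torch.nn as nn

def geometric_feature(bi, bj):                      # Eq. (9); boxes assumed as (x, y, w, h)
    xi, yi, wi, hi = bi
    xj, yj, wj, hj = bj
    inter_w = max(0.0, min(xi + wi, xj + wj) - max(xi, xj))
    inter_h = max(0.0, min(yi + hi, yj + hj) - max(yi, yj))
    inter = inter_w * inter_h
    union = wi * hi + wj * hj - inter
    return torch.tensor([(xj - xi) / (wi * hi) ** 0.5,
                         (yj - yi) / (wi * hi) ** 0.5,
                         (wj * hj / (wi * hi)) ** 0.5,
                         wi / hi, wj / hj, inter / union])

class RankingHead(nn.Module):
    def __init__(self, v_dim=4096, g_dim=6, hid=256):
        super().__init__()
        self.proj_g = nn.Linear(g_dim, 256)                     # W^(g)
        self.ctx = nn.LSTM(v_dim + 256, hid, bidirectional=True, batch_first=True)
        self.fc1, self.fc2 = nn.Linear(2 * hid, 256), nn.Linear(256, 1)   # W^(r)_1, W^(r)_2

    def forward(self, v, g):                        # v: (M, 4096) RoI features, g: (M, 6)
        r = torch.cat([v, self.proj_g(g)], dim=1)               # r_ij
        h, _ = self.ctx(r.unsqueeze(0))                         # Eq. (10), context over all pairs
        return self.fc2(torch.relu(self.fc1(h.squeeze(0))))     # t_ij, Eq. (11)

# The final ranking confidence then fuses the classification scores with t_ij:
# phi_ij = s_i * s_j * s_ij * t_ij
```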

3.5 Loss Function

We adopt the cross-entropy loss for optimizing the Hybrid-LSTM networks. Let $e^{\prime}$ and $l^{\prime}$ denote the predicted labels of entities and predicates respectively, and $e$ and $l$ the ground truth labels. The loss is defined as:

\mathcal{L}_{CE}=\mathcal{L}_{entity}+\mathcal{L}_{relation}=-\frac{1}{Z_{1}}\sum_{i}e^{\prime}_{i}\log(e_{i})-\frac{1}{Z_{2}}\sum_{i}\sum_{j\neq i}l^{\prime}_{ij}\log(l_{ij}). \qquad (12)

When the RRM is applied, the final loss function is the sum of $\mathcal{L}_{CE}$ and a ranking loss $\mathcal{L}(\mathcal{K},\mathcal{N})$, which maximizes the margin between the ranking confidences of key relations and those of secondary ones:

\mathcal{L}_{Final}=\mathcal{L}_{CE}+\mathcal{L}(\mathcal{K},\mathcal{N})=\mathcal{L}_{CE}+\frac{1}{Z_{3}}\sum_{r\in\mathcal{K},\,r^{\prime}\in\mathcal{N}}\max(0,\gamma-\phi_{r}+\phi_{r^{\prime}}), \qquad (13)

where $\gamma$ denotes the margin parameter, $\mathcal{K}$ and $\mathcal{N}$ stand for the sets of key and secondary relations, and $r$ and $r^{\prime}$ are relations sampled from $\mathcal{K}$ and $\mathcal{N}$ with ranking confidences $\phi_{r}$ and $\phi_{r^{\prime}}$. $Z_{1}$, $Z_{2}$, and $Z_{3}$ are normalization factors.
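A minimal PyTorch-style sketch of the objective in Eqs. (12)-(13), with the normalization factors absorbed into averaging and all variable names illustrative:

```python
import torch
import torch.nn.functional as F

def total_loss(entity_logits, entity_labels, pred_logits, pred_labels,
               phi_key, phi_sec, gamma=0.5):
    # L_CE = L_entity + L_relation (cross-entropy over entity and predicate classes)
    l_ce = F.cross_entropy(entity_logits, entity_labels) + \
           F.cross_entropy(pred_logits, pred_labels)
    # margin ranking loss over sampled (key, secondary) pairs: mean of max(0, gamma - phi_r + phi_r')
    l_rank = torch.clamp(gamma - phi_key + phi_sec, min=0).mean()
    return l_ce + l_rank
```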

4 Experimental Evaluation

4.1 Dataset, Evaluation and Settings

VRD [24] is the benchmark dataset for the visual relationship detection task, containing 4,000/1,000 training/test images and covering 100 object categories and 70 predicate categories.

Visual Genome (VG) is a large-scale dataset with rich annotations of objects, attributes, dense captions, and pairwise relationships, containing 75,651/32,422 training/test images. We adopt the most widely used version of VG, namely VG150 [44], which covers 150 object categories and 50 predicate categories.

Figure 3: Examples in VG-KR dataset. Each image is shown with 3 captions and ground truth relations. Purple triplets are key ones while others are secondary.

VG200 and VG-KR. We intend to collect indicative annotations of key relations based on VG. Inspired by the finding illustrated in Related Works, we associate the relation triplets referred to in the caption annotations of MSCOCO [21] with those from VG. We give several examples in Figure 3. The details of our processing and more statistics are provided in the Appendix.

Evaluation, Settings, and Implementation Details. For conventional SGG following the triplet-match rule (a predicted triplet is correct only if all three of its components match a ground truth triplet), we adopt three universal protocols [44]: PREDCLS, SGCLS, and SGGEN. All protocols use Recall@K (R@K, K=20, 50, 100) as the metric. When evaluating key relation prediction, there are some variations. First, we only evaluate with the PREDCLS and SGCLS protocols to eliminate the interference of errors from the object detector, and add a tuple-match rule (only the subject and object are required to match the ground truth) to investigate the ability to find proper pairs. Second, we introduce a new metric, Key Relation Recall (kR@K), which computes the recall rate on key relations. As the number of key relations is usually less than 5 (see the Appendix), K in kR@K is set to 1 and 5. When evaluating on VRD, we use RELDET and PHRDET [49], and report R@50 and R@100 at 1, 10, and 70 predicates per related pair. The details about the hyperparameter settings and implementation are provided in the Appendix.
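For clarity, the kR@K metric can be sketched as follows. The snippet works on symbolic triplets and omits the box-localization matching used in the actual evaluation, so it is illustrative only; the function name is ours.

```python
def kr_at_k(predicted, key_gt, k, tuple_match=False):
    """predicted: ranked list of (subj, pred, obj); key_gt: set of ground-truth key triplets."""
    top = predicted[:k]
    if tuple_match:
        # only subject and object need to match a ground-truth key relation
        gt_pairs = {(s, o) for (s, p, o) in key_gt}
        hit_pairs = {(s, o) for (s, p, o) in top if (s, o) in gt_pairs}
        return len(hit_pairs) / max(1, len(gt_pairs))
    # triplet match: subject, predicate, and object must all match
    hits = {t for t in key_gt if t in top}
    return len(hits) / max(1, len(key_gt))
```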

Table 1: Results (%) on VG150 and VG200. The full version of our method is HetH.

Dataset | Method          | SGGEN R@20/50/100  | SGCLS R@20/50/100  | PREDCLS R@20/50/100
VG150   | VRD [24]        |   -  /  0.3 /  0.5 |   -  / 11.8 / 14.1 |   -  / 27.9 / 35.0
VG150   | IMP [44]        |   -  /  3.4 /  4.2 |   -  / 21.7 / 24.4 |   -  / 44.8 / 53.0
VG150   | IMP† [44, 51]   | 14.6 / 20.7 / 24.5 | 31.7 / 34.6 / 35.4 | 52.7 / 59.3 / 61.3
VG150   | Graph-RCNN [45] |   -  / 11.4 / 13.7 |   -  / 29.6 / 31.6 |   -  / 54.2 / 59.1
VG150   | MemNet [41]     |  7.7 / 11.4 / 13.9 | 23.3 / 27.8 / 29.5 | 42.1 / 53.2 / 57.9
VG150   | MOTIFS [51]     | 21.4 / 27.2 / 30.3 | 32.9 / 35.8 / 36.5 | 58.5 / 65.2 / 67.1
VG150   | KERN [2]        |   -  / 27.1 / 29.8 |   -  / 36.7 / 37.4 |   -  / 65.8 / 67.6
VG150   | VCTree-SL [38]  | 21.7 / 27.7 / 31.1 | 35.0 / 37.9 / 38.6 | 59.8 / 66.2 / 67.9
VG150   | HetH-AFS        | 21.2 / 27.1 / 30.5 | 33.7 / 36.6 / 37.3 | 58.1 / 64.7 / 66.6
VG150   | HetH w/o chain  | 21.5 / 27.4 / 30.7 | 32.9 / 35.9 / 36.7 | 57.5 / 64.5 / 66.5
VG150   | HetH            | 21.6 / 27.5 / 30.9 | 33.8 / 36.6 / 37.3 | 59.8 / 66.3 / 68.1
VG200   | MOTIFS [51]     | 15.2 / 19.9 / 22.8 | 24.5 / 26.7 / 27.4 | 52.5 / 59.0 / 61.0
VG200   | VCTree-SL [38]  | 14.7 / 19.5 / 22.5 | 24.2 / 26.5 / 27.1 | 51.9 / 58.4 / 60.3
VG200   | HetH            | 15.7 / 20.4 / 23.4 | 25.0 / 27.2 / 27.8 | 53.6 / 60.1 / 61.8
Table 2: Results (%) of key relation prediction on VG-KR.

Method      | Triplet Match SGCLS kR@1/5 | Triplet Match PREDCLS kR@1/5 | Tuple Match SGCLS kR@1/5 | Tuple Match PREDCLS kR@1/5
VCTree-SL   | 5.7 / 14.2 | 11.4 / 30.2 |  8.4 / 22.2 | 16.1 / 46.4
MOTIFS      | 5.9 / 14.5 | 11.3 / 30.0 |  8.5 / 21.8 | 16.0 / 46.2
HetH        | 6.1 / 15.1 | 11.6 / 30.4 |  8.6 / 22.7 | 16.4 / 47.1
MOTIFS-RRM  | 8.6 / 16.4 | 16.7 / 33.8 | 13.8 / 26.3 | 27.9 / 57.1
HetH-RRM    | 9.2 / 17.1 | 17.5 / 35.0 | 14.6 / 27.3 | 28.9 / 59.1
RRM-Base    | 8.4 / 16.8 | 16.2 / 33.7 | 13.4 / 26.8 | 26.6 / 57.2
RRM-SM      | 9.0 / 16.9 | 17.2 / 34.5 | 14.3 / 27.1 | 28.6 / 58.7
RRM-AM      | 8.9 / 16.9 | 16.9 / 34.4 | 14.1 / 27.0 | 28.1 / 58.2
Table 3: Results (%) on VRD.

Method           | RELDET k=1 R@50/100 | RELDET k=10 R@50/100 | RELDET k=70 R@50/100 | PHRDET k=1 R@50/100 | PHRDET k=10 R@50/100 | PHRDET k=70 R@50/100
ViP [15]         | 17.32 / 20.01 |       -       |       -       | 22.78 / 27.91 |       -       |       -
VRL [18]         | 18.19 / 20.79 |       -       |       -       | 21.37 / 22.60 |       -       |       -
KL-Dist [49]     | 19.17 / 21.34 | 22.56 / 29.89 | 22.68 / 31.89 | 23.14 / 24.03 | 26.47 / 29.76 | 26.32 / 29.43
Zoom-Net [48]    | 18.92 / 21.41 |       -       | 21.37 / 27.30 | 24.82 / 28.09 |       -       | 29.05 / 37.34
RelDN-$L_0$ [55] | 24.30 / 27.91 | 26.67 / 32.55 | 26.67 / 32.55 | 31.09 / 36.42 | 33.29 / 41.25 | 33.29 / 41.25
RelDN [55]       | 25.29 / 28.62 | 28.15 / 33.91 | 28.15 / 33.91 | 31.34 / 36.42 | 34.45 / 42.12 | 34.45 / 42.12
HetH             | 22.42 / 24.88 | 26.88 / 31.69 | 26.88 / 31.81 | 30.69 / 35.59 | 35.47 / 42.94 | 35.47 / 43.05

4.2 Ablation Studies

Ablation studies are separated into two sections. The first part is to explore some variants of HET construction. We conduct these experiments on VG150. The complete version of our model is HetH, which is configured with IFS and Hybrid-LSTM. The second part is an investigation into the usage of SM and AM in RRM. Experiments are carried out on VG-KR. The complete version is HetH-RRM, whose implementation follows Eq. (8).

Ablation study on HET construction. We first compare AFS and IFS for determining the parent node. Then we investigate the effectiveness of the chain LSTM encoder in Hybrid-LSTM. The ablative models mentioned above are shown in Table 1 as HetH-AFS (i.e., replacing IFS with AFS) and HetH w/o chain. We observe that using IFS together with the Hybrid-LSTM encoder gives the best performance, which indicates that HET is more reasonable when using IFS. It is noteworthy that if the Bi-TreeLSTM encoder is abandoned, the Hybrid-LSTM encoder would almost degenerate to MOTIFS. Therefore, the comparisons between HetH and MOTIFS, and between HetH and HetH w/o chain, imply that both hierarchy and siblings context should be encoded in HET.

Ablation study on RRM. In order to explore the effectiveness of saliency and size, we ablate HetH-RRM with the following baselines: (1) RRM-Base: $\bm{v}_{ij}$ is extracted from $\mathcal{F}$ rather than $\mathcal{F}_{S}$, (2) RRM-SM: only $\mathcal{S}$ is used, and (3) RRM-AM: only $\mathcal{A}$ is used. Results in Table 2 suggest that both saliency and size information indeed contribute to discovering key relations, and the effect of saliency is slightly better than that of size. The hybrid version achieves the highest performance. From the following qualitative analysis, we can see that with the guidance of saliency and the rectification effect of size, RRM further shifts the model's attention to key relations significantly.

4.3 Comparisons with State-of-the-Arts

For scene graph generation, we compare our HetH with the following state-of-the-art methods: VRD [24] and KERN [2] use knowledge from language or statistical correlations. IMP [44], Graph-RCNN [45], MemNet [41], MOTIFS [51] and VCTree-SL [38] mainly devise various message passing methods for improving graph representations. For key relation prediction, we mainly evaluate two latest works, MOTIFS and VCTree-SL on VG-KR. Besides, we further incorporate RRM to MOTIFS, namely MOTIFS-RRM, to explore the transferability of RRM. Results are shown in Table 1 and 2. We give statistical significance of the results in the Appendix.

Figure 4: Qualitative results of HetH and HetH-RRM. In (e), the pink entities are involved in the top-5 relations, and the purple arrows are key relations matched with the ground truth. The purple numeric tags next to the relations are the rankings, and "1" means that the relation gets the highest score.

Quantitative Analysis. In Table 1, when evaluated on VG150, HetH surpasses most methods by a clear margin. Compared to MOTIFS and VCTree-SL, HetH, which uses a multi-branch tree structure, outperforms MOTIFS and yields a recall rate comparable to VCTree-SL, which uses a binary tree structure. This indicates that a hierarchical structure is superior to a plain one in terms of modeling context. We observe that HetH achieves better performance than VCTree-SL under the PREDCLS protocol, while there exists a slight gap under the SGCLS and SGGEN protocols. This is mainly because our tree structure is generated with artificial rules and some incorrect subtrees inevitably emerge due to occlusion in 2D images, while VCTree-SL dynamically adjusts its structure in pursuit of higher performance. Under the SGCLS and SGGEN protocols, in which object information is fragmentary, it is difficult for HetH to rectify the context encoded from wrong structures. However, we argue that our interpretable and natural multi-branch tree structure is also adaptive to the situation where there is an increment of object and relation categories but less data. It can be seen from the evaluation results on VG200 that HetH outperforms MOTIFS by 0.67 mean points and VCTree-SL by 1.1 mean points. On the contrary, in this case, the data are insufficient for dynamic structure optimization.

As the SGG task is highly related to the VRD task, we also apply HetH to VRD and show the comparison results in Table 3. Both HetH and RelDN [55] use weights pre-trained on MSCOCO, while only [48] states that it uses ImageNet pre-trained weights and the others remain unknown. Our method yields competitive results and even surpasses the state of the art under some metrics.

When it comes to key relation prediction, we directly evaluate HetH, MOTIFS, and VCTree-SL on VG-KR. As shown in Table 2, HetH substantially performs better than the other two competitors, suggesting that the structure of HET provides hints for judging the importance of relations, and that parsing the structured information in HET indeed captures humans' perceptive habits.

In pursuit of ultimate performances on mining key relations, we jointly optimize the HetH with RRM under the supervision of key relation annotations in VG-KR. From Table 2, both HetH-RRM and MOTIFS-RRM achieve significant gains, and HetH-RRM is better than MOTIFS-RRM, which proves the superiority of HetH again, and shows excellent transferability of RRM.

Qualitative Analysis. We visualize intermediate results in Figure 4(a-d). HET is well constructed and close to humans' analyzing process. In the area map, the regions of arm, hand, and foot get small weights because of their small sizes. Actually, relations like \langlelady, has, arm\rangle are indeed trivial. As a result, RRM suppresses these relations. More cases are provided in the Appendix.

Figure 5: (a) Depth distribution of top-5 predicted relations. (b) The ranking confidence of relations from different depths obtained from RRM-base. Sampling is repeated five times.
Strategy (HetH-RRM) | kR@1 | kR@5 | speed (s/image)
EP                  | 17.5 | 35.0 | 0.22
SP                  | 15.8 | 31.2 | 0.18
Figure 6: Comparison between EP and SP for HetH-RRM. The inference speed (seconds/image) is evaluated with a single TITAN Xp GPU.
Figure 7: As the quota of top relations (NR) increases, scene graphs dynamically enlarge. The newly involved entities and relations are shown in a new color. Results in the first and second rows are from HetH-RRM and MOTIFS, respectively.

4.4 Analyses about HET

We conduct additional experiments to validate whether HET has the potential to reveal humans' perceptive habits. As shown in Figure 5(a), we compare the depth distributions of the top-5 predicted relations (each represented by a tuple $(d_{o_{i}},d_{o_{j}})$ consisting of the depths of the two entities, where the depth of the root is defined as 1) of HetH, RRM-base, and HetH-RRM. After applying RRM, there is a significant increase in the ratio of depth tuples (2, 2) and (2, 3), and a drop in (3, 3). This phenomenon is also observed in Figure 4(e). Previous experiments have proved that RRM obviously upgrades the rankings of key relations; in other words, relations which are closer to the root of HET are regarded as key ones by RRM. We also analyze the ranking confidence ($\phi$) of relations from different depths with the RRM-base model (to eliminate the confounding effect caused by AAP information). We sample 10,000 predicted relation triplets from each depth five times. In Figure 5(b), the ranking confidence decreases as the depth increases. Therefore, different levels of HET indeed indicate different perceptive importance of relations. This characteristic makes it possible to reasonably adjust the scale of a scene graph. As shown in the first row in Figure 7, the hierarchical scene graph from our HetH-RRM enlarges in a top-down manner as the quota of top relations increases, while the ordinary scene graph in the second row enlarges itself aimlessly. If we want to limit the scale of a scene graph while keeping its ability to sketch the image gist as far as possible, this is feasible for our hierarchical scene graph, since we just need to cut off some secondary branches of HET, but is difficult to realize in an ordinary scene graph.

Besides, different from the traditional Exhausted Prediction (EP, predicting a relation for every pair of entities) at the inference stage, we adopt a novel Structured Prediction (SP) strategy, in which we only predict relations between parent and children nodes, and between any two sibling nodes that share the same parent. In Figure 6, we compare the performance and inference speed of EP and SP for HetH-RRM. Despite the slight gap in performance, the interpretability of connections in HET makes SP a feasible step towards efficient inference, getting rid of the $O(N^{2})$ complexity [16] of EP. Further research is needed to balance performance and efficiency.
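The SP strategy amounts to enumerating only tree-local candidate pairs. A minimal sketch, assuming children is the HET child list indexed by parent node and that pairs involving the virtual root are skipped (an assumption on our part), follows:

```python
from itertools import combinations

def sp_candidate_pairs(children):
    """children: dict mapping each node id to the list of its child ids (root has id 0)."""
    pairs = set()
    for parent, kids in children.items():
        for k in kids:
            if parent != 0:                       # skip pairs with the virtual whole-image root
                pairs.add((parent, k))
                pairs.add((k, parent))
        for a, b in combinations(kids, 2):        # siblings sharing the same parent
            pairs.add((a, b))
            pairs.add((b, a))
    return pairs
```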

Table 4: Results of image captioning on VG-KR.

Num. | Model     | B@1  | B@2  | B@3  | B@4  | ROUGE-L | CIDEr | SPICE | Avg. Growth
all  | GCN-LSTM  | 72.0 | 54.7 | 40.5 | 30.0 | 52.9    | 91.1  | 18.1  |
20   | HetH-Freq | 73.1 | 55.7 | 41.0 | 30.1 | 53.5    | 94.0  | 18.8  |
20   | HetH      | 74.9 | 58.4 | 43.9 | 32.8 | 54.9    | 101.7 | 19.8  | 0.06
20   | HetH-RRM  | 75.0 | 58.2 | 43.7 | 32.7 | 55.1    | 102.2 | 19.9  |
5    | HetH-Freq | 70.7 | 53.2 | 38.6 | 28.0 | 51.7    | 84.4  | 17.2  |
5    | HetH      | 72.5 | 55.4 | 41.2 | 30.5 | 53.1    | 92.6  | 18.5  | 1.57
5    | HetH-RRM  | 73.7 | 56.7 | 42.3 | 31.5 | 54.0    | 97.5  | 19.1  |
2    | HetH-Freq | 68.1 | 50.8 | 36.8 | 26.5 | 50.2    | 76.5  | 15.5  |
2    | HetH      | 70.8 | 53.4 | 39.2 | 28.7 | 51.8    | 86.4  | 17.6  | 2.10
2    | HetH-RRM  | 72.3 | 55.2 | 41.0 | 30.4 | 53.1    | 92.2  | 18.4  |

5 Experiments on Image Captioning

Do key relations really make sense? We conduct experiments on one of the downstream tasks of SGG, i.e., image captioning, to verify this. (We give a brief introduction here; details are provided in the Appendix.)

Experiments are conducted on VG-KR since it has caption annotations from MSCOCO. To generate captions, we select different numbers of predicted top relations and feed them into the LSTM backend following [46]. We re-implement the complete GCN-LSTM [46] model and evaluate it on VG-KR since it is one of the state-of-the-art methods and is the most related to ours. As shown in Table 4, our simple frequency baseline, HetH-Freq (the rankings of relations accord with their frequency in the training data), with 20 top relations as input, outperforms GCN-LSTM, because GCN-LSTM conducts graph convolution using relations as edges, which is not as effective as our method in making full use of relation information. After applying RRM, there is a consistent performance improvement on the overall metrics. This improvement becomes more and more significant as the number of input top relations is reduced. This is reasonable since the impact of RRM centers on top relations. It suggests that our model provides more essential content with as few relations as possible, which contributes to efficiency improvement. The captions presented in Figure 4(e) show that key relations are more helpful for generating a description that closely fits the major events in an image.

6 Conclusion

We propose a new scene graph modeling formulation and make an attempt to push the study of SGG towards practicability and rationalization. Inspired by humans' scene parsing procedure, a human-mimetic hierarchical scene graph is generated based on HET with the assistance of our Hybrid-LSTM, and the key relations are further prioritized as far as possible by the devised RRM, which recalls more key relations. Experiments show the outstanding performance of our method on traditional scene graph generation and key relation prediction tasks. Besides, experiments on image captioning prove that key relations are not just for appreciation, but indeed play a crucial role in higher-level downstream tasks.

Appendix 0.A Detailed Explanation about Motivation

As illustrated in the main paper, it is notable that visually salient objects are related to, but not completely the same as, the objects involved in the image gist. According to the findings in [7], objects referred to in a description (i.e., objects that humans think important and that should form the major events/image gist) are almost always visually salient and reveal where humans gaze, but what humans fixate on (i.e., visually salient objects) is not always what they want to convey at first. In Figure 8, we provide some examples showing that this is a common phenomenon. E.g., the red clothes, the Spring Festival couplets, and the black doors of the washing machines (mentioned from left to right) are visually salient due to their high contrast to the context or their center position. However, some of them do not form the major events. For example, for the 2nd image, the first-glance description would be "There stands a house on the side of the road." Only then may humans become interested in the eye-catching Spring Festival couplets.

Besides, we are inspired by these observations: there naturally exists a hierarchical structure of humans' perception preference, and objects with relatively large size which fill the scene generally form the major events. This supports our method of constructing HET introduced in the main paper. We aim to construct a HET whose levels reflect the perception priority level rather than object saliency. The experiments show that our method for constructing HET achieves this goal.

Figure 8: Visually salient objects do not always form the major events in the images and are not always what humans want to convey at first from the images. The yellow points in each image denote some visually salient objects. The saliency maps in the second row are obtained from [8].
Figure 9: Our implementation of GCN-LSTM, and the implementation scheme of our sentence decoder.

Appendix 0.B Implementation Details for Image Captioning

As the source code of GCN-LSTM [46] had not been released by the submission deadline, we re-implement it. In its original version, a simple two-layer MLP classifier is applied to predict the pairwise relationships, acting as the front-end scene graph detector. For a fair comparison, we replace this detector with our HetH. To adapt our HetH/HetH-RRM to the image captioning task, we add a sentence decoder which is modified from the LSTM backend of GCN-LSTM. The GCN-LSTM model conducts graph convolution on the scene graph and injects all relation-aware region-level features into a two-layer LSTM with an attention mechanism. Different from GCN-LSTM, we inject the relation features rather than the region-level features, considering that the relationships which convey the events in the image are more helpful for description generation. In Figure 9, we show a brief diagram illustrating our implementation of GCN-LSTM and the implementation scheme of our sentence decoder for image captioning.

Specifically, we obtain a set of visual relationship representations $\{\bm{f}_{m}^{\mathcal{R}}\}_{m=1}^{M}$ ($\bm{f}_{m}^{\mathcal{R}}\in\mathbb{R}^{D_{f}}$, $D_{f}=4{,}096$) after relation context decoding (see Figure 2(d) in the main paper). We concatenate them with the word embeddings of their subjects, objects, and predicates, denoted by $\bm{w}_{m}^{s}\in\mathbb{R}^{D_{w}}$, $\bm{w}_{m}^{o}\in\mathbb{R}^{D_{w}}$, and $\bm{w}_{m}^{p}\in\mathbb{R}^{D_{w}}$ ($D_{w}=300$), and obtain $\{\bm{r}_{m}\}_{m=1}^{M}$ ($\bm{r}_{m}\in\mathbb{R}^{D_{r}}$, $D_{r}=4{,}996$):

\bm{r}_{m}=\left[\bm{f}_{m}^{\mathcal{R}};\bm{w}_{m}^{s};\bm{w}_{m}^{o};\bm{w}_{m}^{p}\right]. \qquad (14)

The sentence decoder is a two-layer LSTM. Note that the two layers in this decoder share one hidden state $\bm{h}\in\mathbb{R}^{D_{h}}$ and cell state $\bm{c}\in\mathbb{R}^{D_{h}}$. At each time step $t$, the first layer collects the maximum contextual information by concatenating the input word embedding $\bm{w}_{t}\in\mathbb{R}^{D_{w}}$ and the mean-pooled visual relationship feature $\bar{\bm{r}}=\frac{1}{M}\sum_{m=1}^{M}\bm{r}_{m}$. The updating procedure is

\bm{h}_{t}^{1},\bm{c}_{t}^{1}={f_{1}\left(\left[\bm{w}_{t};\bar{\bm{r}}\right]\right)}_{|\bm{h}_{t-1}^{2},\bm{c}_{t-1}^{2}}, \qquad (15)

where $f_{1}$ is the updating function within the first-layer unit, and the subscript $|\bm{h}_{t-1}^{2},\bm{c}_{t-1}^{2}$ denotes that the internal hidden state and cell state are the ones updated by the second-layer unit at the previous timestep. Then we compute a normalized attention distribution over all the relationship features:

a_{t,m}=\bm{W}_{a}\left[\tanh\left(\bm{W}_{f}\bm{r}_{m}+\bm{W}_{h}\bm{h}_{t}^{1}\right)\right],\quad\lambda_{t}=\mathrm{softmax}(\bm{a}_{t}), \qquad (16)

where $a_{t,m}$ is the $m$-th element of $\bm{a}_{t}$, and $\bm{W}_{a}\in\mathbb{R}^{1\times D_{a}}$, $\bm{W}_{f}\in\mathbb{R}^{D_{a}\times D_{r}}$, $\bm{W}_{h}\in\mathbb{R}^{D_{a}\times D_{h}}$ are transformation matrices. Both the dimension $D_{a}$ of the hidden layer for measuring the attention distribution and the dimension $D_{h}$ of the hidden layer in the LSTM are set to 512. $\lambda_{t}\in\mathbb{R}^{M}$ denotes the normalized attention distribution whose $m$-th element $\lambda_{t,m}$ is the attention weight of $\bm{r}_{m}$. The attended relationship feature is computed as $\bm{r}_{t}=\sum_{m=1}^{M}\lambda_{t,m}\bm{r}_{m}$. Then the updating procedure of the second-layer unit is

\bm{h}_{t}^{2},\bm{c}_{t}^{2}={f_{2}\left(\bm{r}_{t}\right)}_{|\bm{h}_{t}^{1},\bm{c}_{t}^{1}}, \qquad (17)

where $f_{2}$ is the updating function within the second-layer unit. $\bm{h}_{t}^{2}$ is used to predict the next word through a softmax layer.
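One decoding step of this two-layer decoder can be sketched as follows. The dimensions, module names, and exact gating are illustrative assumptions, not the released code; the two LSTM layers pass a single hidden/cell state back and forth and the second layer consumes the attended relationship feature.

```python
import torch
import torch.nn as nn

class RelationAttnDecoderStep(nn.Module):
    def __init__(self, d_w=300, d_r=4996, d_h=512, d_a=512):
        super().__init__()
        self.cell1 = nn.LSTMCell(d_w + d_r, d_h)    # f_1 in Eq. (15)
        self.cell2 = nn.LSTMCell(d_r, d_h)          # f_2 in Eq. (17)
        self.W_f = nn.Linear(d_r, d_a, bias=False)
        self.W_h = nn.Linear(d_h, d_a, bias=False)
        self.W_a = nn.Linear(d_a, 1, bias=False)

    def forward(self, w_t, R, state):
        # w_t: (1, d_w) word embedding; R: (M, d_r) relationship features r_m;
        # state: shared (h, c) updated by the second layer at the previous step.
        r_bar = R.mean(dim=0, keepdim=True)                          # mean-pooled feature
        h1, c1 = self.cell1(torch.cat([w_t, r_bar], dim=1), state)   # Eq. (15)
        a = self.W_a(torch.tanh(self.W_f(R) + self.W_h(h1)))         # (M, 1), Eq. (16)
        lam = torch.softmax(a, dim=0)
        r_t = (lam * R).sum(dim=0, keepdim=True)                     # attended relationship feature
        h2, c2 = self.cell2(r_t, (h1, c1))                           # Eq. (17)
        return h2, (h2, c2)            # h2 feeds the softmax word predictor; (h2, c2) is reused next step
```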

Table 5: Statistics of VG200, VG-KR, and VG150.

Dataset | Images  | Images with Relations | Object Categories | Predicate Categories | Object Instances | Relation Instances | Key Relation Instances | Images with Key Relations
VG200   | 51,498  | 46,562 | 200 | 80 | 619,119   | 442,425 | 101,312 | 26,992
VG-KR   | 26,992  | 26,992 | 200 | 80 | 360,306   | 250,755 |         |
VG150   | 108,073 | 89,169 | 150 | 50 | 1,145,398 | 622,705 | -       | -
Figure 10: The procedure of VG-KR dataset construction. The color block shown in the top right of each dataset in (a) demonstrates its components. (b) gives a global perspective of these color blocks. Images in MSCOCO consist of four parts, A, C, D, and E. Images in VG consist of B, C, D, and E. Images in VGC consist of C, D, and E. E denotes images filtered in step (2) which do not contain any relation. D denotes images filtered in step (3) which do not contain any key relation.
Figure 11: Distribution of images that contain different numbers of key relations.
Figure 12: Distribution of the roles of a given predicate. The red bars stand for the probability of being key relations while the blue bars denote the probability of being the secondary ones.

Appendix 0.C VG-KR Dataset Construction, Statistics, and Experimental Implementation Details

0.C.1 VG-KR Dataset

We demonstrate the procedure of constructing the VG-KR dataset and the different image sets involved in this procedure in Figure 10. Concretely, 51,498 images in VG come from MSCOCO, and they form the image set VGC. We conduct three-stage processing on VGC. (1) The Stanford Scene Graph Parser [31] is used to extract relation triplets from captions. They make up the set of key relations, denoted by $\mathcal{R}^{\mathcal{K}}$. (2) We next cleanse the raw annotations of VGC similarly to [44], keep the most frequent 150 object categories and 50 predicates, and add another 50 most frequent object categories and 30 predicates from $\mathcal{R}^{\mathcal{K}}$, in order to keep as many key relations as possible for the following third step. After dropping images without relations in VGC, we get a new subset VG200 (i.e., 200 object categories) which contains 46,562 images. (3) Finally, we associate $\mathcal{R}^{\mathcal{K}}$ with relation triplets in VG200 by matching their subject and object WordNet synsets [26], respectively. After filtering out the images without key relations in VG200, we obtain VG-KR, which contains 26,992 images. For both VG200 and VG-KR, we split the training and test sets with a 7:3 ratio, leading to 32,510/14,052 training/test images in VG200, and 18,720/8,272 training/test images in VG-KR.
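The association in step (3) can be sketched as a simple synset-level matching. Synset strings such as 'woman.n.01' and the function name are illustrative, and the actual pipeline may apply additional filtering.

```python
def mark_key_relations(vg_triplets, caption_triplets):
    """Both inputs are lists of (subject_synset, predicate, object_synset) tuples.
    A VG triplet is marked as key if some caption triplet shares its subject and object synsets."""
    caption_pairs = {(s, o) for (s, _, o) in caption_triplets}
    return [t for t in vg_triplets if (t[0], t[2]) in caption_pairs]
```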

We show more detailed statistics and compare with VG150 in Table 5. VG200 and VG-KR have more categories, as well as more object and relation instances per image, compared to VG150. Moreover, VG-KR contains indicative annotations of key relations.

In Figure 11, we show the distribution of images that contain different numbers of key relations. More than 90% of the images contain fewer than 5 key relations. This is reasonable because the key relations are obtained by matching the annotated relations with those extracted from captions, and the number of relation triplets in captions is generally not large. After all, a good caption only needs to describe the major contents instead of the less important details.

Given each predicate, we explore the distribution of its roles, i.e., whether it belongs to a key relation or not. The result is shown in Figure 12. The predicates with a large probability of being key, such as throwing, brushing, and sniffing, are usually verbs containing rich semantics. They are image-specific, and when we see these predicates, a scene can be roughly imagined. In contrast, predicates like belonging to, of, and behind, which carry little information, are less likely to make up key relations.

0.C.2 Settings and Implementation Details

The dimension of hidden states and cells in both Hybrid-LSTM and RRM is 512. The sizes of $\bm{W}^{(r)}_{1}$ and $\bm{W}^{(r)}_{2}$ in Eq. (11) in the main paper are $256\times 512$ and $1\times 256$, respectively. The GloVe embedding vectors we use are of 200 dimensions.

When training on the VG dataset, we follow previous works [51, 38] to extract the first 5,000 images of the training split and treat them as the validation split. The results reported on VG150, VG200, and VG-KR are obtained by first selecting the best model on the validation split and then evaluating it on the test split. As for the experiments on VRD, we report the results of the last epoch evaluated on the test split without model selection (the hyperparameter settings are the same as those of the experiments on VG).

We pre-train object detectors on VRD, VG150, and VG200 respectively and freeze the learned parameters. To train the whole model end-to-end, we use an SGD optimizer with a learning rate of 0.001 and a batch size of 10. When computing the ranking loss for RRM, we randomly sample 512 pairs of key triplets and secondary triplets. The margin $\gamma$ is empirically set to 0.5. All the existing methods evaluated on our VG200 or VG-KR datasets are retrained.

Figure 13: Effect of the threshold $T$ when constructing HET. AD and AW denote the average depth and average width, respectively. The red curve stands for the kR@1 performance of HetH-RRM, evaluated under the PREDCLS protocol with the triplet-match rule.

The threshold $T$ for determining a parent node has a direct influence on the shape of HET. We investigate the performance curve together with the tree depth and width variation trend. As shown in Figure 13, as $T$ varies from 0.1 to 0.9, the "tall thin" tree becomes a "short fat" tree, and the performance improves. Thus we set $T$ to 0.9.

As $T$ becomes larger, the condition for a node to be a parent node, i.e., $P_{nm}>T$ (Eq. (1) in the main paper), becomes more and more difficult to satisfy. Thus, our algorithm for constructing HET tends to set the root as the parent of a node, which results in a "shorter" and "fatter" tree.

A small $T$ would lead to considerable wrong hierarchical connections. Note that the hierarchical connections in our HET have much stronger semantics than the associations of siblings. Therefore, a large $T$ eliminates wrong hierarchical connections as far as possible. Although this means that more entities are set as children of the root and inappropriate sibling associations increase, proper hierarchical connections still play a positive role in context encoding.

Table 6: The results of multiple runs of HetH and the statistical significance. These results are obtained under the PREDCLS protocol.

#RUN | R@20       | R@50       | R@100
1    | 33.46      | 36.59      | 37.00
2    | 33.53      | 36.64      | 37.04
3    | 33.93      | 36.65      | 37.07
μ±σ  | 33.64±0.21 | 36.63±0.03 | 37.04±0.03

Appendix 0.D Robustness Analyses

We run HetH multiple times under the PREDCLS protocol. The results and statistical significance are shown in Table 6.

Figure 14: Curve charts and pie charts for the indicator defined as the sum of the subject and object visual saliency values. In the curve charts, different curves are drawn under different thresholds $T_{s}$. Top left: CS - IDC chart. Top right: IDC - CS chart. Bottom left: Component analysis for relations with small IDC values. Bottom right: Component analysis for relations with large IDC values.
Figure 15: Curve charts for the indicator defined as the sum of the subject and object visual saliency values and normalized areas. Different curves are drawn under different thresholds $T_{s}$. Left: CS - IDC chart. Right: IDC - CS chart.

Appendix 0.E Exploration on VG-KR

We develop the Relation Ranking Module (RRM) to prioritize key relations. We intend to capture humans' subjective assessment of the importance of relations with some objective indicators. As analyzed in Section 0.A, visually salient objects engage humans' gaze and have the potential to form major events. Therefore, visual saliency can be one useful indicator. However, it is easy to be misled when only visual saliency is considered.

To better describe the importance of a relation, we borrow the traditional "saliency" concept and put forward a brand-new concept, cognitive saliency, which tries to estimate the importance of a relation from humans' perspective, as the sensation of relation importance is highly subjective. To measure the cognitive saliency of a relation triplet, we use the number of times it is referred to within the five captions of each image, which can be directly obtained during the construction of our VG-KR dataset. However, this measurement of cognitive saliency is not computable (i.e., it is a grading from humans, but cannot be directly used in computational models). If we want to make use of the cognitive saliency, we need to find a computable indicator for it. The indicator should be proportional to cognitive saliency, which means that as the cognitive saliency goes up, the same trend should be observed on the indicator, and vice versa.

Intuitively, the first possible indicator is the visual saliency of the subject and object in a relation triplet. Specifically, we set the indicator $\Phi$ as the sum of the saliency values of the subject $o^{sub}$ and the object $o^{obj}$:

\mathcal{S}^{sub}=\frac{|\{p\,|\,p\in\bm{b}^{sub}\wedge\mathcal{S}^{p}>T_{s}\}|}{|\{p\,|\,p\in\bm{b}^{sub}\}|}, \qquad (18)
\mathcal{S}^{obj}=\frac{|\{p\,|\,p\in\bm{b}^{obj}\wedge\mathcal{S}^{p}>T_{s}\}|}{|\{p\,|\,p\in\bm{b}^{obj}\}|}, \qquad (19)
\Phi=\mathcal{S}^{sub}+\mathcal{S}^{obj}, \qquad (20)

where $p$ denotes pixels, $\bm{b}^{sub}$ and $\bm{b}^{obj}$ are the bounding boxes of the subject and object, $\mathcal{S}^{p}$ is the saliency value of pixel $p$, $\mathcal{S}^{sub}$ and $\mathcal{S}^{obj}$ are the saliency values of the subject and object, and $T_{s}$ is a given threshold. $|\cdot|$ computes the number of elements in a set. The pixel-wise saliency is computed by one of the state-of-the-art saliency detectors [8].

To draw the Cognitive Saliency (CS, Y-axis) vs. Indicator (IDC, X-axis) curve chart, we randomly sample 50,000 key relations from VG-KR with gradings from 1 to 5 as their CS values. As the IDC values (i.e., $\Phi$ in Eq. (20)) are continuous, we sort all the sampled IDC values in ascending order and divide them into 50 intervals $[\delta_{k},\delta_{k+1}]$, $0\leq k\leq 50$, where $\delta_{0}=\mathrm{IDC}_{\min}$ and $\delta_{50}=\mathrm{IDC}_{\max}$. In each interval, we draw a point with the mean of the sampled IDC values as the X-axis coordinate and the mean of the sampled CS values as the Y-axis coordinate. For the IDC (Y-axis) vs. CS (X-axis) curve chart, the sampled relations are grouped by CS values, and we compute the mean of the IDC values for each group as the Y-axis coordinates. These two charts are shown in Figure 14. In each chart, we draw curves under different settings of $T_{s}$, denoted by $\mathrm{sal}@T_{s}$. From the IDC - CS chart at the top right of Figure 14, IDC is proportional to CS. However, the CS - IDC chart at the top left of Figure 14 shows that CS is not strictly proportional to IDC, which means that even when the computed visual saliency of an object is large, the relations involving this object are not necessarily important. What causes this phenomenon? We further extract the relations with relatively small and relatively large IDC values and analyze the ratio of each type of triplet. Concretely, we find the quartering points $\lambda_{1}<\lambda_{2}<\lambda_{3}$ of the IDC values, and all the triplets whose IDC values are smaller than $\lambda_{1}$ or larger than $\lambda_{3}$ are picked out, forming the sets $\Psi$ and $\Omega$ respectively. The component analysis results of $\Psi$ and $\Omega$ are shown at the bottom of Figure 14, where the 18 most frequent types of triplets are demonstrated. From the bottom left pie chart, many triplets with low IDC and low CS values are relations between relatively small objects and large background entities. There are some exceptions, e.g., \langleman, on, surfboard\rangle and \langletrain, on, tracks\rangle; to further analyze the association between their IDC and CS values we would need to inspect the detailed image contents. What we should pay attention to is the bottom right pie chart, where we observe that most triplets in $\Omega$ are relations between an independent object and its components, such as \langlehand, of, man\rangle and \langleedge, of, bus\rangle. Actually, these relations are indeed not image-specific and carry little information; humans generally overlook them. However, if the saliency of an object is large, the saliency of its components will be large, too. This explains the phenomenon that when IDC keeps increasing, CS decreases instead.

To further rectify the indicator above, we take the sizes of the subject and object into account, based on the observation that the components or details of an entity are usually relatively small, which can offset their large saliency values. We therefore add the normalized sizes of the subject and object to the indicator:

\Phi^{\prime} = \mathcal{S}^{sub} + \mathcal{S}^{obj} + \frac{A(o^{sub})}{A(o_{\mathcal{I}})} + \frac{A(o^{obj})}{A(o_{\mathcal{I}})}, \qquad (21)

where A(\cdot) denotes the area function and o_{\mathcal{I}} denotes the whole image. Similarly, we draw the IDC - CS and CS - IDC charts for this indicator in Figure 15. The improved indicator proves feasible, as CS now increases monotonically with IDC, and vice versa.
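Reusing box_saliency from the earlier sketch, the improved indicator of Eq. (21) simply adds the two box areas normalized by the image area. Again, this is an illustrative sketch under the stated assumptions, not the exact released implementation.

def relation_indicator_v2(sal_map, sub_box, obj_box, thresh=0.5):
    """Eq. (21): Phi' = S_sub + S_obj + A(sub)/A(img) + A(obj)/A(img)."""
    h, w = sal_map.shape
    img_area = float(h * w)

    def norm_area(box):
        x1, y1, x2, y2 = box
        return max(0, x2 - x1) * max(0, y2 - y1) / img_area

    return (box_saliency(sal_map, sub_box, thresh)
            + box_saliency(sal_map, obj_box, thresh)
            + norm_area(sub_box) + norm_area(obj_box))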

The above exploration suggests that an indicator combining both the visual saliency and the size of an object is useful for finding key relations. Our RRM is therefore designed to learn humans' subjective assessment of relation importance under the guidance of visual saliency and entity size information.

Appendix F 0.F Additional Qualitative Results

We demonstrate more qualitative results in Figure 16. Across these examples, our RRM tends to favor relations between entities that are close to the root of the HET. These relations describe the global content of the scene and are usually what humans pay the most attention to. As a result, the captions generated from the top relations better cover the essential content. For example, in Figure 16(a), since the top-2 relations from the HetH model contain \langlewoman, wearing, boot_1/boot_2\rangle, the generated caption fails to capture the essential content that the woman is holding an umbrella. In contrast, the top-2 relations from HetH-RRM successfully capture this information. In some cases, we observe that even though the top-2 relations do not contain the essential content, the generated caption still captures it, e.g., the caption from HetH in Figure 16(b). This is mainly because the region of man_1 contains part of the region of motorcycle_1, which provides visual cues for inferring that a man is riding a motorcycle.

[Figure 16, panels (a), (b), (e), (f); see caption below.]
Figure 16: From top left to bottom right: bounding boxes of all objects, saliency maps, area maps, mixed maps, bounding boxes of objects involved in the top-5 relations from HetH, the HET structure, bounding boxes of objects involved in the top-5 relations from HetH-RRM, hierarchical scene graphs from HetH and HetH-RRM, and captions generated using the top-2 relations from HetH and HetH-RRM, respectively. Purple arrows in the scene graphs are key relations matched with the ground truth. The purple numeric tags next to relations are their rankings, where “1” means the relation receives the highest score.

References

  • [1] Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., Parikh, D.: Vqa: Visual question answering. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). pp. 2425–2433 (2015)
  • [2] Chen, T., Yu, W., Chen, R., Lin, L.: Knowledge-embedded routing network for scene graph generation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6163–6171 (2019)
  • [3] Dai, B., Zhang, Y., Lin, D.: Detecting visual relationships with deep relational networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3298–3308 (2017)
  • [4] Graves, A., Schmidhuber, J.: Framewise phoneme classification with bidirectional lstm and other neural network architectures. Neural Networks 18(5-6), 602–610 (2005)
  • [5] Gu, J., Zhao, H., Lin, Z., Li, S., Cai, J., Ling, M.: Scene graph generation with external knowledge and image reconstruction. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1969–1978 (2019)
  • [6] Han, F., Zhu, S.C.: Bottom-up/top-down image parsing with attribute grammar. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 31(1), 59–73 (2008)
  • [7] He, S., Tavakoli, H.R., Borji, A., Pugeault, N.: Human attention in image captioning: Dataset and analysis. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). pp. 8529–8538 (2019)
  • [8] Hou, Q., Cheng, M.M., Hu, X., Borji, A., Tu, Z., Torr, P.: Deeply supervised salient object detection with short connections. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3203–3212 (2017)
  • [9] Itti, L., Koch, C., Niebur, E.: A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 20(11), 1254–1259 (1998)
  • [10] Johnson, J., Krishna, R., Stark, M., Li, L.J., Shamma, D., Bernstein, M., Fei-Fei, L.: Image retrieval using scene graphs. In: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition (CVPR). pp. 3668–3678 (2015)
  • [11] Kim, D.J., Choi, J., Oh, T.H., Kweon, I.S.: Dense relational captioning: Triple-stream networks for relationship-based captioning. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6271–6280 (2019)
  • [12] Klein, D.A., Frintrop, S.: Center-surround divergence of feature statistics for salient object detection. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). pp. 2214–2219 (2011)
  • [13] Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.J., Shamma, D.A., Bernstein, M.S., Fei-Fei, L.: Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision (IJCV) 123(1), 32–73 (2017)
  • [14] Li, G., Yu, Y.: Visual saliency based on multiscale deep features. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5455–5463 (2015)
  • [15] Li, Y., Ouyang, W., Wang, X., Tang, X.: Vip-cnn: Visual phrase guided convolutional neural network. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 7244–7253 (2017)
  • [16] Li, Y., Ouyang, W., Zhou, B., Shi, J., Zhang, C., Wang, X.: Factorizable net: an efficient subgraph-based framework for scene graph generation. In: Proceedings of European Conference on Computer Vision (ECCV). vol. 11205, pp. 346–363. Springer (2018)
  • [17] Li, Y., Ouyang, W., Zhou, B., Wang, K., Wang, X.: Scene graph generation from objects, phrases and region captions. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). pp. 1261–1270 (2017)
  • [18] Liang, X., Lee, L., Xing, E.P.: Deep variation-structured reinforcement learning for visual relationship and attribute detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 4408–4417 (2017)
  • [19] Liang, Y., Bai, Y., Zhang, W., Qian, X., Zhu, L., Mei, T.: Vrr-vg: Refocusing visually-relevant relationships. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). pp. 10403–10412 (2019)
  • [20] Lin, L., Wang, G., Zhang, R., Zhang, R., Liang, X., Zuo, W.: Deep structured scene parsing by learning with image descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2276–2284 (2016)
  • [21] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: Proceedings of European Conference on Computer Vision (ECCV). vol. 8693, pp. 740–755. Springer (2014)
  • [22] Lin, X., Ding, C., Zeng, J., Tao, D.: Gps-net: Graph property sensing network for scene graph generation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3746–3755 (2020)
  • [23] Liu, N., Han, J., Yang, M.H.: Picanet: Learning pixel-wise contextual attention for saliency detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3089–3098 (2018)
  • [24] Lu, C., Krishna, R., Bernstein, M., Fei-Fei, L.: Visual relationship detection with language priors. In: Proceedings of European Conference on Computer Vision (ECCV). vol. 9905, pp. 852–869. Springer (2016)
  • [25] Lu, J., Yang, J., Batra, D., Parikh, D.: Neural baby talk. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 7219–7228 (2018)
  • [26] Miller, G.A.: Wordnet: A lexical database for English. Communications of the ACM 38(11), 39–41 (1995)
  • [27] Pennington, J., Socher, R., Manning, C.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 1532–1543 (2014)
  • [28] Peyre, J., Laptev, I., Schmid, C., Sivic, J.: Weakly-supervised learning of visual relations. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). pp. 5179–5188 (2017)
  • [29] Qi, M., Li, W., Yang, Z., Wang, Y., Luo, J.: Attentive relational networks for mapping images to scene graphs. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3957–3966 (2019)
  • [30] Ren, S., He, K., Girshick, R., Sun, J.: Faster r-cnn: Towards real-time object detection with region proposal networks. In: Advances in Neural Information Processing Systems (NIPS). pp. 91–99 (2015)
  • [31] Schuster, S., Krishna, R., Chang, A., Fei-Fei, L., Manning, C.D.: Generating semantically precise scene graphs from textual descriptions for improved image retrieval. In: Proceedings of the Fourth Workshop on Vision and Language. pp. 70–80 (2015)
  • [32] Sharma, A., Tuzel, O., Jacobs, D.W.: Deep hierarchical parsing for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 530–538 (2015)
  • [33] Shi, J., Zhang, H., Li, J.: Explainable and explicit visual reasoning over scene graphs. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 8376–8384 (2019)
  • [34] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
  • [35] Socher, R., Lin, C.C., Manning, C., Ng, A.Y.: Parsing natural scenes and natural language with recursive neural networks. In: Proceedings of the 28th International Conference on Machine Learning (ICML-11). pp. 129–136 (2011)
  • [36] Tai, K.S., Socher, R., Manning, C.D.: Improved semantic representations from tree-structured long short-term memory networks. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers). pp. 1556–1566 (2015)
  • [37] Tang, K., Niu, Y., Huang, J., Shi, J., Zhang, H.: Unbiased scene graph generation from biased training. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3716–3725 (2020)
  • [38] Tang, K., Zhang, H., Wu, B., Luo, W., Liu, W.: Learning to compose dynamic tree structures for visual contexts. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6619–6628 (2019)
  • [39] Wang, L., Lu, H., Ruan, X., Yang, M.H.: Deep networks for saliency detection via local estimation and global search. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3183–3192 (2015)
  • [40] Wang, T., Borji, A., Zhang, L., Zhang, P., Lu, H.: A stagewise refinement model for detecting salient objects in images. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). pp. 4019–4028 (2017)
  • [41] Wang, W., Wang, R., Shan, S., Chen, X.: Exploring context and visual pattern of relationship for scene graph generation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 8188–8197 (2019)
  • [42] Wu, Q., Shen, C., Wang, P., Dick, A., van den Hengel, A.: Image captioning and visual question answering based on attributes and external knowledge. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 40(6), 1367–1381 (2018)
  • [43] Xie, Y., Lu, H., Yang, M.H.: Bayesian saliency via low and mid level cues. IEEE Transactions on Image Processing (TIP) 22(5), 1689–1698 (2012)
  • [44] Xu, D., Zhu, Y., Choy, C.B., Fei-Fei, L.: Scene graph generation by iterative message passing. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5410–5419 (2017)
  • [45] Yang, J., Lu, J., Lee, S., Batra, D., Parikh, D.: Graph r-cnn for scene graph generation. In: Proceedings of European Conference on Computer Vision (ECCV). vol. 11205, pp. 690–706. Springer (2018)
  • [46] Yao, T., Pan, Y., Li, Y., Mei, T.: Exploring visual relationship for image captioning. In: Proceedings of European Conference on Computer Vision (ECCV). vol. 11218, pp. 711–727. Springer (2018)
  • [47] Yao, T., Pan, Y., Li, Y., Mei, T.: Hierarchy parsing for image captioning. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). pp. 2621–2629 (2019)
  • [48] Yin, G., Sheng, L., Liu, B., Yu, N., Wang, X., Shao, J., Loy, C.C.: Zoom-net: Mining deep feature interactions for visual relationship recognition. In: Proceedings of European Conference on Computer Vision (ECCV). vol. 11207, pp. 330–347. Springer (2018)
  • [49] Yu, R., Li, A., Morariu, V.I., Davis, L.S.: Visual relationship detection with internal and external linguistic knowledge distillation. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). pp. 1974–1982 (2017)
  • [50] Zareian, A., Karaman, S., Chang, S.F.: Weakly supervised visual semantic parsing. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 3736–3745 (2020)
  • [51] Zellers, R., Yatskar, M., Thomson, S., Choi, Y.: Neural motifs: Scene graph parsing with global context. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5831–5840 (2018)
  • [52] Zhang, H., Kyaw, Z., Chang, S.F., Chua, T.S.: Visual translation embedding network for visual relation detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 5532–5540 (2017)
  • [53] Zhang, H., Kyaw, Z., Yu, J., Chang, S.F.: Ppr-fcn: Weakly supervised visual relation detection via parallel pairwise r-fcn. In: Proceedings of the IEEE International Conference on Computer Vision (ICCV). pp. 4233–4241 (2017)
  • [54] Zhang, J., Kalantidis, Y., Rohrbach, M., Paluri, M., Elgammal, A., Elhoseiny, M.: Large-scale visual relationship understanding. In: Proceedings of the AAAI Conference on Artificial Intelligence (AAAI). vol. 33, pp. 9185–9194 (2019)
  • [55] Zhang, J., Shih, K.J., Elgammal, A., Tao, A., Catanzaro, B.: Graphical contrastive losses for scene graph parsing. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 11535–11543 (2019)
  • [56] Zhang, L., Zhang, J., Lin, Z., Lu, H., He, Y.: Capsal: Leveraging captioning to boost semantics for salient object detection. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6024–6033 (2019)
  • [57] Zhu, L., Chen, Y., Lin, Y., Lin, C., Yuille, A.: Recursive segmentation and recognition templates for image parsing. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 34(2), 359–371 (2011)