R3Net: Relation-embedded Representation Reconstruction Network for Change Captioning
Abstract
Change captioning aims to describe the fine-grained disagreement between two similar images with a natural language sentence. Viewpoint change is the most typical distractor in this task, because it alters the scale and location of objects and overwhelms the representation of the real change. In this paper, we propose a Relation-embedded Representation Reconstruction Network (R3Net) to explicitly distinguish the real change from a large amount of clutter and irrelevant changes. Specifically, a relation-embedded module is first devised to explore potential changed objects within the large amount of clutter. Then, based on the semantic similarities of corresponding locations in the two images, a representation reconstruction module (RRM) is designed to learn the reconstruction representation and further model the difference representation. Besides, we introduce a syntactic skeleton predictor (SSP) to enhance the semantic interaction between change localization and caption generation. Extensive experiments show that the proposed method achieves state-of-the-art results on two public datasets. The code of this paper is publicly available at https://github.com/tuyunbin/R3Net.
1 Introduction
Change captioning aims to generate a natural language sentence to detail what has changed in a pair of similar images. It has many practical applications, such as assisted surveillance, medical imaging, and computer-assisted tracking of changes in media assets Jhamtani and Berg-Kirkpatrick (2018); Tu et al. (2021).
Different from single-image captioning Kim et al. (2019); Jiang et al. (2019); Fisch et al. (2020), change captioning addresses two-image captioning, which requires not only understanding the content of both images, but also describing their disagreement. As the pioneering work, Jhamtani and Berg-Kirkpatrick (2018) described semantic changes between mostly well-aligned image pairs with underlying illumination changes from surveillance cameras. However, they did not consider viewpoint changes, which often happen in a dynamic world and leave image pairs far from well aligned. Hence, the feature shift between two unaligned images adversely affects the learning of the difference representation. To make this task more practical, recent works Park et al. (2019); Shi et al. (2020) proposed to address change captioning in the presence of viewpoint changes.

Despite the progress, the above state-of-the-art methods have some limitations when modeling the difference representation. First, the object information of each image is only learned at the feature level, which makes it difficult to discriminate the fine-grained difference when the changed object is tiny and surrounded by a large amount of clutter, as shown in Figure 1. In fact, when an object moves, its semantic relations with surrounding objects change as well, and this can help reveal the fine-grained change. Thus, it is important to model the difference representation at both the feature and relation levels. Second, directly applying subtraction to a pair of unaligned images Park et al. (2019) may yield a difference representation with much noise, because a viewpoint change alters the scale and location of the objects. However, we can observe that the unchanged objects still remain in approximately the same locations. Hence, it is beneficial to reveal the unchanged representation and further model the difference representation based on the semantic similarities of the corresponding locations in the two images.
In this paper, we propose a Relation-embedded Representation Reconstruction Network (R3Net) to handle viewpoint changes and model the fine-grained difference representation between two images through representation reconstruction. Concretely, for the “before” and “after” images, the relation-embedded module performs semantic relation reasoning among the object features of each image via the self-attention mechanism. This enhances the fine-grained representation ability of the original object features. To model the difference representation, a representation reconstruction module (RRM) is designed, where a “shadow” representation (“after” or “before”) is used to reconstruct a “source” representation (“before” or “after”). The RRM first leverages every location in the “source” to stimulate the corresponding location in the “shadow” and judge their semantic similarity, i.e., the “response signal”. Further, under the guidance of these signals, the RRM picks out the unchanged features from the “shadow” as the “reconstruction” representation. The “difference” representation is then computed from the changed features between the “source” and the “reconstruction”. Next, a dual change localizer is devised that uses the difference representation as the query to localize the changed object feature on the “before” and “after”, respectively. Finally, the localized features are fed into an attention-based caption decoder for caption generation.
Besides, we introduce a Syntactic Skeleton Predictor (SSP) to enhance the semantic interaction between change localization and caption generation. As observed in Figure 1, a caption mainly consists of a set of nouns, adjectives, and verbs. These words convey the main information about the changed object and its surrounding references, and we call them syntactic skeletons. The skeletons, predicted from a global semantic representation derived from the R3Net, can supervise the modeling of the difference representation and provide the decoder with high-level semantic cues about the change type. This makes the learned difference representation more relevant to the target words and enhances the quality of the generated sentences.
The main contributions of this paper are as follows: (1) We propose the R3Net to learn the fine-grained change amid a large amount of clutter and to overcome viewpoint changes by embedding semantic relations into object features and performing representation reconstruction with respect to the two images. (2) The SSP is introduced to enhance the semantic interaction between change localization and caption generation by predicting a set of syntactic skeletons based on a global semantic representation derived from the R3Net. (3) Extensive experiments show that the proposed method outperforms the state-of-the-art approaches by a large margin on two public datasets.

2 Related Work
Change Captioning. Captioning the change in the presence of viewpoint changes is a novel task in the vision-language community Zhang et al. (2017); Tu et al. (2017); Deng et al. (2021); Li et al. (2020); Liu et al. (2020). As the first work, DUDA Park et al. (2019) directly applied subtraction between two images to capture their semantic difference. However, due to viewpoint changes, direct subtraction between an unaligned image pair cannot reliably model the correct change Shi et al. (2020). Later, M-VAM Shi et al. (2020) proposed to measure the feature similarity across different regions in an image pair and find the most matched regions as unchanged parts. However, since there are many similar objects, cross-region searching runs the risk of matching the query region with a similar but incorrect region, impacting subsequent change localization. In contrast, in our representation reconstruction network, the prediction of unchanged and changed features is based on the semantic similarities of the corresponding locations in the two images. This avoids the risk of reconstructing the “source” with incorrect parts from the “shadow”.
Skeleton Prediction in Captioning. Syntactic skeletons can provide high-level semantic cues (e.g., attribute, class) about objects, so they are widely used in image/video captioning works. These methods either used skeletons as the main information to generate captions Fang et al. (2015); Gan et al. (2017); Dai et al. (2018) or leveraged them to bridge the semantic gap between vision and language Gao et al. (2020); Tu et al. (2020). Although the skeletons play different roles in the above methods, the common point is that they only represent the basic information of objects in images or videos. Different from these methods, besides basic information, we use skeletons to capture the change information among objects.
3 Methodology
As shown in Figure 2, the architecture of our method consists of four main parts: (1) a relation-embedded representation reconstruction network (R3Net) to learn the fine-grained change in the presence of viewpoint changes; (2) a dual change localizer to focus on the specific change in a pair of images; (3) a syntactic skeleton predictor (SSP) to learn syntactic skeletons based on a global semantic representation derived from the R3Net; (4) an attention-based caption decoder to describe the change under the guidance of the learned skeletons.
3.1 Relation-embedded Representation Reconstruction Network
3.1.1 Relation-embedded Module
We first exploit a pre-trained CNN model to extract the object-level features $X_{bef}$ and $X_{aft}$ for a pair of “before” and “after” images, where $X_{bef}, X_{aft} \in \mathbb{R}^{C \times H \times W}$, and $C$, $H$, $W$ indicate the number of channels, height, and width, respectively. However, only utilizing these independent features makes it difficult to distinguish the fine-grained change from a large amount of clutter (similar objects). Related works Wu et al. (2019); Huang et al. (2020); Yin et al. (2020) have shown that capturing semantic relations among objects is useful for a thorough understanding of an image.
Motivated by this, we devise a relation-embedded (Remb) module based on the self-attention mechanism Vaswani et al. (2017) to implicitly learn semantic relations among objects in each image. Specifically, we first reshape $X_{bef}$ and $X_{aft}$ to $\mathbb{R}^{N \times C}$, where $N = H \times W$. Then, the semantic relations are embedded into the independent object features of each image based on the scaled dot-product attention:
$\mathrm{Att}(Q, K, V) = \mathrm{softmax}\big(QK^{\top}/\sqrt{d_k}\big)V$  (1)
where the queries, keys, and values are linear projections of the object features in $X_{bef}$ and $X_{aft}$:
$Q = XW^{Q}, \quad K = XW^{K}, \quad V = XW^{V}, \quad X \in \{X_{bef}, X_{aft}\}$  (2)
Thus, $X_{bef}$ and $X_{aft}$ are updated to $X'_{bef}$ and $X'_{aft}$, respectively:
$X'_{bef} = \mathrm{Att}(Q_{bef}, K_{bef}, V_{bef}), \quad X'_{aft} = \mathrm{Att}(Q_{aft}, K_{aft}, V_{aft})$  (3)
When the model fully understands the content of each image, it can better capture the fine-grained difference between the image pair in the subsequent representation reconstruction.
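For concreteness, a minimal PyTorch sketch of this relation-embedding step is given below; it assumes the reshaped (B, N, C) features and uses standard multi-head self-attention, and all names (e.g., `RelationEmbed`) are illustrative rather than the authors' released code.

```python
import torch
import torch.nn as nn

class RelationEmbed(nn.Module):
    """Illustrative sketch: self-attention over the N = H*W object features of one image."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                      # x: (B, C, H, W) object features from the CNN
        b, c, h, w = x.shape
        x = x.flatten(2).transpose(1, 2)       # reshape to (B, N, C) with N = H*W
        out, _ = self.attn(x, x, x)            # queries, keys, values from the same image
        return out                             # relation-embedded features X'

# x_bef, x_aft = ...  # (B, 256, 14, 14) projected CNN features (see Sec. 4.2)
# remb = RelationEmbed()
# x_bef_r, x_aft_r = remb(x_bef), remb(x_aft)
```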
3.1.2 Representation Reconstruction Module
The state-of-the-art method Park et al. (2019) applied direct subtraction between a pair of unaligned images, which is prone to capturing a noisy difference in the presence of viewpoint changes.
To distinguish the semantic change from viewpoint changes, a representation reconstruction module (RRM) is proposed, whose inputs are a “source” representation $X_{src}$ and a “shadow” representation $X_{sha}$ (i.e., one of $X'_{bef}$ and $X'_{aft}$). Concretely, we first exploit each location of $X_{src}$ to stimulate the corresponding location of $X_{sha}$. The response degrees of all locations in $X_{sha}$ are regarded as the response signals $a$ that measure the semantic similarities between corresponding locations in the two images:
$a = \sigma\big(W_{s}[X_{src}; X_{sha}] + b_{s}\big)$  (4)
where $[\,;\,]$ denotes concatenation, $\sigma$ is the sigmoid function, and $W_{s}$ and $b_{s}$ are learnable parameters, so that each entry of $a$ lies in $[0, 1]$. Second, we use $X_{sha}$ to reconstruct $X_{src}$ under the guidance of the response signals $a$:
$X_{rec} = a \odot X_{sha}$  (5)
where $X_{rec}$ is the “reconstruction” representation, which represents the unchanged features with respect to the “source”, and $\odot$ denotes element-wise multiplication. Finally, the “difference” representation $X_{dif}$ is captured by subtracting the “reconstruction” from the “source”:
$X_{dif} = X_{src} - X_{rec}$  (6)
Since the unchanged and changed features predicted in uni-directional reconstruction are only with respect to one kind of “source” representation (e.g., “before”), the model cannot predict a changed feature that does not appear in the “source”. An effective model should capture all underlying changes with respect to both images. To this end, we extend the RRM from uni-directional to bi-directional reconstruction. Specifically, we first use the “before” as the “source” to predict the unchanged and changed features, and then use the “after” as the “source” to do so. Thus, the “reconstruction” and “difference” w.r.t. the “before” and “after” are formulated as:
$X^{bef}_{dif} = X'_{bef} - X^{bef}_{rec}, \quad X^{aft}_{dif} = X'_{aft} - X^{aft}_{rec}$  (7)
where $X^{bef}_{rec}$ and $X^{aft}_{rec}$ are obtained by Eqs. (4)–(5) with the “before” and “after” features taken as the “source”, respectively.
Finally, we obtain a bi-directional difference representation $X^{bi}_{dif}$ through a fully-connected layer:
$X^{bi}_{dif} = W_{d}\,[X^{bef}_{dif}; X^{aft}_{dif}] + b_{d}$  (8)
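The following PyTorch sketch illustrates one direction of the RRM and its bi-directional use, under the assumed concatenation-plus-sigmoid form of the response signals in Eq. (4); names such as `RRM` and the fusion layer are illustrative.

```python
import torch
import torch.nn as nn

class RRM(nn.Module):
    """Illustrative sketch of one reconstruction direction of the RRM."""
    def __init__(self, dim=256):
        super().__init__()
        # response signals from the concatenated per-location features (assumed form)
        self.sim = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, src, sha):               # src, sha: (B, N, C) relation-embedded features
        a = self.sim(torch.cat([src, sha], dim=-1))   # (B, N, C) response signals in [0, 1]
        rec = a * sha                                  # "reconstruction": unchanged part of src
        dif = src - rec                                # "difference" w.r.t. this source
        return rec, dif

# Bi-directional use: take "before" and "after" as the source in turn,
# then fuse the two difference maps with a fully-connected layer.
# rrm = RRM()
# _, dif_bef = rrm(x_bef_r, x_aft_r)
# _, dif_aft = rrm(x_aft_r, x_bef_r)
# fuse = nn.Linear(2 * 256, 256)
# dif_bi = fuse(torch.cat([dif_bef, dif_aft], dim=-1))
```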
3.2 Dual Change Localizer
When the bi-directional difference representation $X^{bi}_{dif}$ is computed, we exploit it as the query to localize the changed feature in $X'_{bef}$ and $X'_{aft}$, respectively. Specifically, the dual change localizer first predicts two separate attention maps $A_{bef}$ and $A_{aft}$:
$A_{bef} = \sigma\big(\mathrm{conv}([X'_{bef}; X^{bi}_{dif}])\big), \quad A_{aft} = \sigma\big(\mathrm{conv}([X'_{aft}; X^{bi}_{dif}])\big)$  (9)
where $[\,;\,]$, $\mathrm{conv}$, and $\sigma$ denote concatenation, a convolutional layer, and the sigmoid activation function, respectively. Then, the changed features $l_{bef}$ and $l_{aft}$ are localized by applying $A_{bef}$ and $A_{aft}$ to $X'_{bef}$ and $X'_{aft}$:
$l_{bef} = \sum_{i=1}^{N} A_{bef,i}\,X'_{bef,i}, \quad l_{aft} = \sum_{i=1}^{N} A_{aft,i}\,X'_{aft,i}$  (10)
Finally, we compute the local difference feature $l_{dif}$ w.r.t. both $A_{bef}$ and $A_{aft}$ from the two directions:
$l_{dif} = \sum_{i=1}^{N} A_{bef,i}\,X^{bi}_{dif,i} + \sum_{i=1}^{N} A_{aft,i}\,X^{bi}_{dif,i}$  (11)
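A sketch of the dual change localizer is shown below; the 1×1 convolution for the attention maps and the way the two directions are fused into $l_{dif}$ are assumptions consistent with Eqs. (9)–(11), not a verbatim reimplementation.

```python
import torch
import torch.nn as nn

class DualChangeLocalizer(nn.Module):
    """Illustrative sketch: predict an attention map per image and pool the changed feature."""
    def __init__(self, dim=256):
        super().__init__()
        self.conv = nn.Conv2d(2 * dim, 1, kernel_size=1)      # assumed 1x1 conv for the map

    def localize(self, feat, dif_bi, h, w):
        # feat, dif_bi: (B, N, C) -> (B, 2C, H, W) for the convolutional layer
        b, n, c = feat.shape
        x = torch.cat([feat, dif_bi], dim=-1).transpose(1, 2).reshape(b, 2 * c, h, w)
        attn = torch.sigmoid(self.conv(x)).reshape(b, n, 1)   # attention map A in [0, 1]
        return (attn * feat).sum(dim=1), attn                 # pooled changed feature l

    def forward(self, feat_bef, feat_aft, dif_bi, h=14, w=14):
        l_bef, a_bef = self.localize(feat_bef, dif_bi, h, w)
        l_aft, a_aft = self.localize(feat_aft, dif_bi, h, w)
        # local difference pooled from both directions (assumed fusion)
        l_dif = (a_bef * dif_bi).sum(dim=1) + (a_aft * dif_bi).sum(dim=1)
        return l_bef, l_aft, l_dif
```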
3.3 Syntactic Skeleton Predictor
A syntactic skeleton predictor (SSP) is introduced to learn a set of syntactic skeletons based on the outputs derived from the R3Net. The predicted skeletons can provide the caption decoder with high-level semantic cues about the changed objects and supervise the modeling of the difference representation, which aims to enhance the semantic interaction between change localization and caption generation. Inspired by Gao et al. (2020); Gan et al. (2017), we treat this problem as a multi-label classification task. Suppose there are $M$ training image pairs, and $y_i \in \{0, 1\}^{K}$ is the label vector of the $i$-th image pair over $K$ candidate skeletons, where $y_{ij} = 1$ if the image pair is annotated with the skeleton $j$, and $y_{ij} = 0$ otherwise.
Specifically, we first apply a mean-pooling layer over the concatenated semantic representations of $X'_{bef}$, $X'_{aft}$, and $X^{bi}_{dif}$ to obtain a global semantic representation $g$:
$g = \mathrm{MeanPool}\big([X'_{bef}; X'_{aft}; X^{bi}_{dif}]\big)$  (12)
Then, the probability scores of all syntactic skeletons for the $i$-th image pair are computed by:
$p_{i} = \sigma\big(W_{p}\,g_{i} + b_{p}\big)$  (13)
where $p_{i} \in [0, 1]^{K}$ denotes the probability scores of the skeletons for the $i$-th image pair, and $W_{p}$ and $b_{p}$ are learnable parameters. To maximize the probability scores of the annotated syntactic skeletons, we use a multi-label loss to optimize the SSP. It can be formulated as:
$L_{ssp} = -\frac{1}{M}\sum_{i=1}^{M}\sum_{j=1}^{K}\big(y_{ij}\log p_{ij} + (1 - y_{ij})\log(1 - p_{ij})\big)$  (14)
where $M$ and $K$ indicate the number of training samples and the number of candidate skeletons of an image pair, respectively. This loss can be considered a supervision signal that regularizes the learning of the difference representation in the R3Net.
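The SSP thus reduces to a pooled representation followed by a multi-label classifier; a compact sketch under these assumptions (the single linear classifier and all names are illustrative) is:

```python
import torch
import torch.nn as nn

class SyntacticSkeletonPredictor(nn.Module):
    """Illustrative sketch: multi-label skeleton prediction from a global representation."""
    def __init__(self, dim=256, num_skeletons=50):
        super().__init__()
        self.classifier = nn.Linear(3 * dim, num_skeletons)

    def forward(self, x_bef, x_aft, dif_bi):
        # mean-pool the concatenated (B, N, C) representations into a global vector g
        g = torch.cat([x_bef, x_aft, dif_bi], dim=-1).mean(dim=1)   # (B, 3C)
        return torch.sigmoid(self.classifier(g))                    # skeleton probabilities p

# multi-label loss (binary cross-entropy over the K candidate skeletons)
# ssp = SyntacticSkeletonPredictor()
# p = ssp(x_bef_r, x_aft_r, dif_bi)
# loss_ssp = nn.functional.binary_cross_entropy(p, y)   # y: (B, K) 0/1 labels
```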
3.4 Skeleton-guided Caption Generation
Since the predicted skeletons are explicit semantic concepts of the changed object and its surrounding references, the captions are generated under their guidance. Specifically, the predicted probability scores $p$ are first embedded as a skeleton feature $s$:
$s = W_{e}\,(E\,p) + b_{e}$  (15)
where $E \in \mathbb{R}^{d_{s} \times K}$ is a skeleton embedding matrix and $d_{s}$ is the dimension of the skeleton feature; $W_{e}$ and $b_{e}$ are the parameters to be learned. Then, we exploit a semantic attention module to focus on the key semantic feature from $l_{bef}$, $l_{aft}$, and $l_{dif}$ that is relevant to the target word:
$\alpha_{t} = \mathrm{softmax}\big(w_{\alpha}^{\top}\tanh(W_{L}L + W_{h}h^{a}_{t})\big), \quad v_{t} = \alpha_{t}\,L$  (16)
where $L = [l_{bef}; l_{aft}; l_{dif}]$ and $v_{t}$ is the attended feature at time step $t$. The hidden state $h^{a}_{t}$ is computed by an attention LSTM$_a$ under the guidance of the predicted skeleton feature $s$:
$h^{a}_{t} = \mathrm{LSTM}_{a}\big([s;\, h^{c}_{t-1}],\, h^{a}_{t-1}\big)$  (17)
where $w_{\alpha}$, $W_{L}$, and $W_{h}$ are learnable parameters; $h^{a}_{t}$ and $h^{c}_{t}$ are the hidden states of the attention module LSTM$_a$ and the caption decoder LSTM$_c$, respectively.
Finally, the caption generation process is also guided by the predicted skeleton feature. We feed it, the attended visual feature, and the previous word embedding into the caption decoder LSTM$_c$ to predict a series of distributions over the next word:
$h^{c}_{t} = \mathrm{LSTM}_{c}\big([s;\, v_{t};\, E_{w}w_{t-1}],\, h^{c}_{t-1}\big), \quad p(w_{t} \mid w_{<t}) = \mathrm{softmax}\big(W_{o}h^{c}_{t} + b_{o}\big)$  (18)
where $E_{w}$ is a word embedding matrix; $W_{o}$ and $b_{o}$ are learnable parameters.
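One decoding step of the skeleton-guided decoder can be sketched as follows, assuming the two-LSTM attention decoder described above with a uniform hidden size; all module and variable names are illustrative.

```python
import torch
import torch.nn as nn

class SkeletonGuidedDecoder(nn.Module):
    """Illustrative sketch of one step of the skeleton-guided attention decoder."""
    def __init__(self, dim=512, vocab=5000, word_dim=300):
        super().__init__()
        self.lstm_a = nn.LSTMCell(2 * dim, dim)               # attention LSTM
        self.lstm_c = nn.LSTMCell(2 * dim + word_dim, dim)    # caption LSTM
        self.att = nn.Linear(2 * dim, 1)
        self.out = nn.Linear(dim, vocab)

    def step(self, s, L, w_prev, state_a, state_c):
        # s: (B, D) skeleton feature; L: (B, 3, D) stacked l_bef, l_aft, l_dif; w_prev: (B, 300)
        h_a, c_a = self.lstm_a(torch.cat([s, state_c[0]], dim=-1), state_a)
        scores = self.att(torch.cat([L, h_a.unsqueeze(1).expand_as(L)], dim=-1))  # (B, 3, 1)
        v = (torch.softmax(scores, dim=1) * L).sum(dim=1)                         # attended feature
        h_c, c_c = self.lstm_c(torch.cat([s, v, w_prev], dim=-1), state_c)
        return torch.log_softmax(self.out(h_c), dim=-1), (h_a, c_a), (h_c, c_c)
```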
3.5 Joint Training
We jointly train the caption decoder and the SSP in an end-to-end manner. For the SSP, the multi-label loss is minimized by Eq. (14). For the decoder, given the target ground-truth words $w_{1:T}$, we minimize the negative log-likelihood loss:
$L_{cap}(\theta) = -\sum_{t=1}^{T}\log p\big(w_{t} \mid w_{<t}; \theta\big)$  (19)
where $\theta$ denotes the parameters of the decoder and $T$ is the length of the caption. The final loss function is optimized as follows:
$L = L_{cap} + \lambda\,L_{ssp}$  (20)
where the hyper-parameter $\lambda$ seeks a trade-off between the decoder and the SSP.
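In training code, the joint objective of Eq. (20) reduces to a weighted sum of the caption and skeleton losses; a sketch is given below, with $\lambda$ = 0.1 following the setting in Section 4.2.

```python
import torch.nn.functional as F

lambda_ssp = 0.1  # trade-off hyper-parameter (Sec. 4.2)

def joint_loss(log_probs, targets, skeleton_probs, skeleton_labels):
    # log_probs: (B, T, V) decoder outputs; targets: (B, T) ground-truth word ids
    loss_cap = F.nll_loss(log_probs.transpose(1, 2), targets)           # negative log-likelihood
    loss_ssp = F.binary_cross_entropy(skeleton_probs, skeleton_labels)  # multi-label loss
    return loss_cap + lambda_ssp * loss_ssp
```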
Table 1: Ablation results on CLEVR-Change (total performance).

| Method | BLEU-4 | METEOR | ROUGE-L | CIDEr | SPICE |
|---|---|---|---|---|---|
| Baseline | 53.1 | 37.6 | 70.8 | 115.6 | 31.3 |
| RRM | 53.5 | 39.2 | 72.3 | 119.2 | 32.5 |
| R3Net | 54.2 | 39.4 | 72.7 | 122.3 | 32.6 |
| R3Net+SSP | 54.7 | 39.8 | 73.1 | 123.0 | 32.6 |
Table 2: Ablation results on CLEVR-Change under the Scene Change (left five columns) and None-scene Change (right five columns) settings.

| Method | B-4 | M | R | C | S | B-4 | M | R | C | S |
|---|---|---|---|---|---|---|---|---|---|---|
| Baseline | 51.0 | 33.3 | 65.7 | 102.4 | 28.0 | 61.0 | 49.9 | 75.8 | 114.3 | 34.5 |
| RRM | 51.8 | 35.7 | 69.0 | 110.1 | 30.4 | 60.0 | 49.6 | 75.6 | 115.0 | 34.5 |
| R3Net | 52.5 | 36.0 | 69.5 | 114.8 | 30.5 | 62.0 | 50.0 | 75.9 | 116.3 | 34.8 |
| R3Net+SSP | 52.7 | 36.2 | 69.8 | 116.6 | 30.3 | 61.9 | 50.5 | 76.4 | 116.4 | 34.8 |
4 Experiments
4.1 Datasets and Evaluation Metrics
CLEVR-Change dataset Park et al. (2019) is a large-scale dataset built with a set of basic geometric objects, which consists of 79,606 image pairs and 493,735 captions. The change types can be categorized into six cases, i.e., “Color”, “Texture”, “Add”, “Drop”, “Move”, and “Distractors” (e.g., viewpoint change). We use the official split with 67,660 image pairs for training, 3,976 for validation, and 7,970 for testing.
Spot-the-Diff dataset Jhamtani and Berg-Kirkpatrick (2018) contains 13,192 well-aligned image pairs from surveillance cameras. Following the official split, the dataset is divided into training, validation, and testing sets with a ratio of 8:1:1.
Following the state-of-the-art methods Park et al. (2019); Shi et al. (2020); Tu et al. (2021), we use five standard metrics to evaluate the quality of the generated sentences, i.e., BLEU-4 Papineni et al. (2002), METEOR Banerjee and Lavie (2005), ROUGE-L Lin (2004), CIDEr Vedantam et al. (2015), and SPICE Anderson et al. (2016). All results are computed with the Microsoft COCO evaluation server Chen et al. (2015).
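For reference, these metrics can be computed with the widely used pycocoevalcap toolkit as sketched below; the exact pipeline of the COCO evaluation server may differ slightly, and the toy `gts`/`res` dictionaries are only illustrative.

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.rouge.rouge import Rouge
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.spice.spice import Spice

# gts: {image_id: [reference captions]}, res: {image_id: [generated caption]}
gts = {"0": ["the red cube was moved to the left"]}
res = {"0": ["the red cube has changed its location"]}

scorers = [(Bleu(4), "BLEU"), (Meteor(), "METEOR"), (Rouge(), "ROUGE-L"),
           (Cider(), "CIDEr"), (Spice(), "SPICE")]
for scorer, name in scorers:
    score, _ = scorer.compute_score(gts, res)
    print(name, score)
```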
4.2 Implementation Details
We use ResNet-101 He et al. (2016) pre-trained on the ImageNet dataset Russakovsky et al. (2015) to extract object features with a dimension of 1024 × 14 × 14, and project these features into a lower dimension of 256. The hidden size of the overall model is set to 512, and the number of attention heads in the relation-embedded module is set to 4. The number of skeletons in an image pair is set to 50. The dimension of word embeddings is set to 300. For the hyper-parameter $\lambda$, we empirically set it to 0.1. In the training phase, we use the Adam optimizer Kingma and Ba (2014) with a learning rate of 1e-4, and set the mini-batch size to 128 and 64 on CLEVR-Change and Spot-the-Diff, respectively. At inference, for a fair comparison, we follow the pioneering works Park et al. (2019); Jhamtani and Berg-Kirkpatrick (2018) on the two datasets and use a greedy decoding strategy for caption generation. Both training and inference are implemented with PyTorch Paszke et al. (2019) on a Tesla P100 GPU.
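A sketch of the feature extraction step is given below, assuming torchvision's ResNet-101 truncated after layer3 (conv4_x), whose output is 1024 × 14 × 14 for 224 × 224 inputs, followed by a 1 × 1 projection to 256 channels; the exact truncation point and projection are assumptions consistent with the dimensions reported above.

```python
import torch
import torch.nn as nn
import torchvision.models as models

resnet = models.resnet101(pretrained=True).eval()
# keep layers up to layer3 (conv4_x), whose output is 1024 x 14 x 14 for 224 x 224 inputs
backbone = nn.Sequential(resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
                         resnet.layer1, resnet.layer2, resnet.layer3)
proj = nn.Conv2d(1024, 256, kernel_size=1)   # project to the lower dimension of 256

with torch.no_grad():
    img = torch.randn(1, 3, 224, 224)        # a placeholder "before" image tensor
    feat = proj(backbone(img))               # (1, 256, 14, 14) object-level features
```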
Table 3: Comparison with state-of-the-art methods on CLEVR-Change (total performance). RL denotes training with reinforcement learning.

| Method | RL | B-4 | M | R | C | S |
|---|---|---|---|---|---|---|
| Capt-Dual Park et al. (2019) | | 43.5 | 32.7 | - | 108.5 | 23.4 |
| DUDA Park et al. (2019) | | 47.3 | 33.9 | - | 112.3 | 24.5 |
| M-VAM Shi et al. (2020) | | 50.3 | 37.0 | 69.7 | 114.9 | 30.5 |
| M-VAM+RAF Shi et al. (2020) | ✓ | 51.3 | 37.8 | 70.4 | 115.8 | 30.7 |
| R3Net+SSP | | 54.7 | 39.8 | 73.1 | 123.0 | 32.6 |
Table 4: Comparison with state-of-the-art methods on CLEVR-Change under the Scene Change (left four columns) and None-scene Change (right four columns) settings. RL denotes training with reinforcement learning.

| Method | RL | B-4 | M | C | S | B-4 | M | C | S |
|---|---|---|---|---|---|---|---|---|---|
| Capt-Dual Park et al. (2019) | | 38.5 | 28.5 | 89.8 | 18.2 | 56.3 | 44.0 | 108.9 | 28.7 |
| DUDA Park et al. (2019) | | 42.9 | 29.7 | 94.6 | 19.9 | 59.8 | 45.2 | 110.8 | 29.1 |
| M-VAM+RAF Shi et al. (2020) | ✓ | - | - | - | - | - | 66.4 | 122.6 | 33.4 |
| R3Net+SSP | | 52.7 | 36.2 | 116.6 | 30.3 | 61.9 | 50.5 | 116.4 | 34.8 |
4.3 Ablation Studies
To analyze the contribution of each module of the proposed network, we conduct the following ablation studies on CLEVR-Change: (1) Baseline, which is based on DUDA Park et al. (2019); (2) RRM, which is the representation reconstruction module; (3) R3Net, which augments the RRM with the relation-embedded module; (4) R3Net+SSP, which augments the R3Net with the syntactic skeleton predictor.
The evaluation of total performance. Total performance evaluates the model simultaneously under both scene change and none-scene change. The experimental results are shown in Table 1. We can observe that each module and the full method improve the total performance over the Baseline. This indicates that our method not only can correctly judge whether there is a semantic change between a pair of images, but also can describe the change in an accurate natural language sentence.
The evaluation under the Scene Change and None-scene Change settings. In the scene change setting, both object and viewpoint changes happen. In the none-scene change setting, there are only distractors, such as viewpoint and illumination changes. The experimental results are shown in Table 2. Under the scene change setting, we can observe that 1) the RRM, R3Net, and R3Net+SSP all significantly improve over the Baseline; 2) the R3Net is much better than the RRM; 3) the best performance is achieved when augmenting the R3Net with the SSP. These observations indicate that 1) compared to direct subtraction between a pair of unaligned images, capturing the difference representation via the R3Net is effective, because it can overcome the distraction of viewpoint change; 2) learning semantic relations among object features is important, because these relations enrich the raw object features and help explore fine-grained changes; 3) the SSP can enhance the semantic interaction between change localization and caption generation, and thus further improve the quality of the generated sentences.
Besides, under the none-scene change setting, we can observe that the RRM is worse than the Baseline on some metrics. Our conjecture is that, on the one hand, due to the large amount of clutter and the fact that the image pair is only represented at the feature level, the RRM cannot learn the exact semantic similarities of the corresponding locations in the two images, so it performs worse on some metrics. On the other hand, the Baseline learns a coarse difference representation between two unaligned images by direct subtraction, so it is prone to learning a wrong change type or simply judging that nothing has changed. This leads to the result that it performs worse than the RRM in total performance and under scene change, but achieves higher scores than the RRM on some metrics under none-scene change. In fact, when semantic relations are embedded among object features, the R3Net outperforms the Baseline in both settings. This further indicates that it is beneficial to thoroughly understand the image content by modeling semantic relations among object features.
Table 5: Comparison on CLEVR-Change for specific change types: Color (C), Texture (T), Add (A), Drop (D), and Move (M). RL denotes training with reinforcement learning.

| Method | RL | Metric | C | T | A | D | M |
|---|---|---|---|---|---|---|---|
| Capt-Dual Park et al. (2019) | | CIDEr | 115.8 | 82.7 | 85.7 | 103.0 | 52.6 |
| DUDA Park et al. (2019) | | CIDEr | 120.4 | 86.7 | 108.3 | 103.4 | 56.4 |
| M-VAM+RAF Shi et al. (2020) | ✓ | CIDEr | 122.1 | 98.7 | 126.3 | 115.8 | 82.0 |
| R3Net+SSP | | CIDEr | 139.2 | 123.5 | 122.7 | 121.9 | 88.1 |
| Capt-Dual Park et al. (2019) | | METEOR | 32.1 | 26.7 | 29.5 | 31.7 | 22.4 |
| DUDA Park et al. (2019) | | METEOR | 32.8 | 27.3 | 33.4 | 31.4 | 23.5 |
| M-VAM+RAF Shi et al. (2020) | ✓ | METEOR | 35.8 | 32.3 | 37.8 | 36.2 | 27.9 |
| R3Net+SSP | | METEOR | 38.9 | 35.5 | 38.0 | 37.5 | 30.9 |
| Capt-Dual Park et al. (2019) | | SPICE | 19.8 | 17.6 | 16.9 | 21.9 | 14.7 |
| DUDA Park et al. (2019) | | SPICE | 21.2 | 18.3 | 22.4 | 22.2 | 15.4 |
| M-VAM+RAF Shi et al. (2020) | ✓ | SPICE | 28.0 | 26.7 | 30.8 | 32.3 | 22.5 |
| R3Net+SSP | | SPICE | 31.6 | 30.8 | 32.3 | 31.7 | 25.4 |
Table 6: Comparison with state-of-the-art methods on Spot-the-Diff. RL denotes training with reinforcement learning.

| Method | RL | M | R | C | S |
|---|---|---|---|---|---|
| DDLA | | 12.0 | 28.6 | 32.8 | - |
| DUDA | | 11.8 | 29.1 | 32.5 | - |
| SDCM | | 12.7 | 29.7 | 36.3 | - |
| FCC | | 12.9 | 29.9 | 36.8 | - |
| static rel-att | | 13.0 | 28.3 | 34.0 | - |
| dynamic rel-att | | 12.2 | 31.4 | 35.3 | - |
| M-VAM | | 12.4 | 31.3 | 38.1 | 14.0 |
| M-VAM+RAF | ✓ | 12.9 | 33.2 | 42.5 | 17.1 |
| R3Net+SSP | | 13.1 | 32.6 | 36.6 | 18.8 |
4.4 Performance Comparison
4.4.1 Results on CLEVR-Change Dataset
On this dataset, we compare with four state-of-the-art methods, Capt-Dual Park et al. (2019), DUDA Park et al. (2019), M-VAM Shi et al. (2020), and M-VAM+RAF Shi et al. (2020), in four settings: 1) scene change; 2) none-scene change; 3) total (scene and none-scene changes); 4) specific types of scene change.
From Table 3 and Table 4, under the two settings and in total performance, we can observe that our method surpasses Capt-Dual and DUDA by a large margin. Compared to M-VAM+RAF, our method achieves much better total performance, which indicates that our method is more robust. As shown in Table 4, under the none-scene change setting, M-VAM+RAF outperforms our method on METEOR and CIDEr. This could be a benefit of reinforcement learning, which, however, sharply increases the training time and computational complexity.
Table 5 reports the results for specific change types. Among the five change types, the most challenging ones are “Texture” and “Move”, because they are easily confused with irrelevant illumination or viewpoint changes. Compared to the SOTA methods, our method achieves excellent performance on both change types. This shows that our method can better distinguish the attribute change or movement of objects from illumination or viewpoint changes.
Hence, compared with the current SOTA methods along different dimensions, our method generalizes much better. This benefits from two merits: 1) the R3Net can learn the fine-grained change and overcome viewpoint changes in the process of representation reconstruction; 2) the SSP can enhance the semantic interaction between change localization and caption generation.


4.4.2 Results on Spot-the-Diff Dataset
The image pairs in this dataset are mostly well aligned. We compare with eight SOTA methods, most of which do not consider handling viewpoint changes: DDLA Jhamtani and Berg-Kirkpatrick (2018), DUDA Park et al. (2019), SDCM Oluwasanmi et al. (2019a), FCC Oluwasanmi et al. (2019b), static rel-att / dynamic rel-att Tan et al. (2019), and M-VAM / M-VAM+RAF Shi et al. (2020).
The results are shown in Table 6. We can observe that, when training without reinforcement learning, our method achieves the best performance on METEOR, ROUGE-L, and SPICE. Compared to M-VAM+RAF trained with the reinforcement learning strategy, our method still outperforms it on METEOR and SPICE. Since there is no viewpoint change in this dataset, the superiority mainly results from the fact that the relation-embedded module enhances the fine-grained representation ability of object features, and the syntactic skeleton predictor enhances the semantic interaction between change localization and caption generation.
4.5 Qualitative Analysis
Figure 3 shows an example of the “Move” case from the test set of CLEVR-Change. We can observe that DUDA localizes a wrong region on the “before” image and thus misidentifies “Move” as “Add”. By contrast, the R3Net+SSP accurately locates the moved object on the “before” and “after” images, which benefits from two merits. First, the R3Net is able to localize the fine-grained change in the presence of viewpoint changes. Second, the SSP can predict the key skeletons based on the representations of the image pair and their difference learned from the R3Net. For instance, the skeletons “changed” and “location” have higher probability scores than “newly” and “placed”. This provides the decoder with high-level semantic cues to generate the correct sentence.
Figure 4 illustrates two cases of “Move”. In the left example, the R3Net+SSP successfully distinguishes the changed object (i.e., the small grey cylinder) and predicts accurate skeletons with high probability scores. The right example is a failure case. In general, we can observe that the grey cylinder is localized and the main skeletons are predicted, which indicates that the R3Net learns a reliable difference representation. However, the decoder still generates a wrong sentence. The reason behind the failure may be that the movement of this cylinder is very slight, so the decoder receives only weak change information (including the skeletons). In our opinion, a possible solution for this challenge is to model position information for object features, which would enhance their position representation ability and help localize slight movements.
5 Conclusion
In this paper, we propose a relation-embedded representation reconstruction network (R3Net) and a syntactic skeleton predictor (SSP) to address change captioning in the presence of viewpoint changes, where the R3Net explicitly distinguishes semantic changes from viewpoint changes and the SSP enhances the semantic interaction between change localization and caption generation. Extensive experiments show that our method achieves state-of-the-art results on two public datasets, CLEVR-Change and Spot-the-Diff.
Acknowledgements
The work was supported by National Natural Science Foundation of China (No. 61761026, 61972186, 61732005, 61762056, 61771457, and 61732007), in part by Youth Innovation Promotion Association of Chinese Academy of Sciences (No. 2020108), and CCF-Baidu Open Fund (No. 2021PP15002000).
References
- Anderson et al. (2016) Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2016. SPICE: semantic propositional image caption evaluation. In ECCV, pages 382–398.
- Banerjee and Lavie (2005) Satanjeev Banerjee and Alon Lavie. 2005. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In ACL, pages 65–72.
- Chen et al. (2015) Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. 2015. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.
- Dai et al. (2018) Bo Dai, Sanja Fidler, and Dahua Lin. 2018. A neural compositional paradigm for image captioning. NeurIPS, 31:658–668.
- Deng et al. (2021) Jincan Deng, Liang Li, Beichen Zhang, Shuhui Wang, Zhengjun Zha, and Qingming Huang. 2021. Syntax-guided hierarchical attention network for video captioning. IEEE Transactions on Circuits and Systems for Video Technology.
- Fang et al. (2015) Hao Fang, Saurabh Gupta, Forrest Iandola, Rupesh K Srivastava, Li Deng, Piotr Dollár, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John C Platt, et al. 2015. From captions to visual concepts and back. In CVPR, pages 1473–1482.
- Fisch et al. (2020) Adam Fisch, Kenton Lee, Ming-Wei Chang, Jonathan H Clark, and Regina Barzilay. 2020. Capwap: Captioning with a purpose. In EMNLP, pages 8755–8768.
- Gan et al. (2017) Zhe Gan, Chuang Gan, Xiaodong He, Yunchen Pu, Kenneth Tran, Jianfeng Gao, Lawrence Carin, and Li Deng. 2017. Semantic compositional networks for visual captioning. In CVPR, pages 5630–5639.
- Gao et al. (2020) Lianli Gao, Xuanhan Wang, Jingkuan Song, and Yang Liu. 2020. Fused gru with semantic-temporal attention for video captioning. Neurocomputing, 395:222–228.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR, pages 770–778.
- Huang et al. (2020) Qingbao Huang, Jielong Wei, Yi Cai, Changmeng Zheng, Junying Chen, Ho-fung Leung, and Qing Li. 2020. Aligned dual channel graph convolutional network for visual question answering. In ACL, pages 7166–7176.
- Jhamtani and Berg-Kirkpatrick (2018) Harsh Jhamtani and Taylor Berg-Kirkpatrick. 2018. Learning to describe differences between pairs of similar images. In EMNLP, pages 4024–4034.
- Jiang et al. (2019) Ming Jiang, Junjie Hu, Qiuyuan Huang, Lei Zhang, Jana Diesner, and Jianfeng Gao. 2019. Reo-relevance, extraness, omission: A fine-grained evaluation for image captioning. In EMNLP, pages 1475–1480.
- Kim et al. (2019) Dong-Jin Kim, Jinsoo Choi, Tae-Hyun Oh, and In So Kweon. 2019. Image captioning with very scarce supervised data: Adversarial semi-supervised learning approach. In EMNLP, pages 2012–2023.
- Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Li et al. (2020) Liang Li, Shijie Yang, Li Su, Shuhui Wang, Chenggang Yan, Zheng-jun Zha, and Qingming Huang. 2020. Diverter-guider recurrent network for diverse poems generation from image. In ACM MM, pages 3875–3883.
- Lin (2004) Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81.
- Liu et al. (2020) Zhenhuan Liu, Jincan Deng, Liang Li, Shaofei Cai, Qianqian Xu, Shuhui Wang, and Qingming Huang. 2020. Ir-gan: Image manipulation with linguistic instruction by increment reasoning. In ACM MM, pages 322–330.
- Oluwasanmi et al. (2019a) Ariyo Oluwasanmi, Muhammad Umar Aftab, Eatedal Alabdulkreem, Bulbula Kumeda, Edward Y Baagyere, and Zhiquang Qin. 2019a. Captionnet: Automatic end-to-end siamese difference captioning model with attention. IEEE Access, 7:106773–106783.
- Oluwasanmi et al. (2019b) Ariyo Oluwasanmi, Enoch Frimpong, Muhammad Umar Aftab, Edward Y Baagyere, Zhiguang Qin, and Kifayat Ullah. 2019b. Fully convolutional captionnet: Siamese difference captioning attention model. IEEE Access, 7:175929–175939.
- Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In ACL, pages 311–318.
- Park et al. (2019) Dong Huk Park, Trevor Darrell, and Anna Rohrbach. 2019. Robust change captioning. In ICCV, pages 4623–4632.
- Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. Pytorch: An imperative style, high-performance deep learning library. In NeurIPS, pages 8024–8035.
- Russakovsky et al. (2015) Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. 2015. Imagenet large scale visual recognition challenge. IJCV, pages 211–252.
- Shi et al. (2020) Xiangxi Shi, Xu Yang, Jiuxiang Gu, Shafiq R. Joty, and Jianfei Cai. 2020. Finding it at another side: A viewpoint-adapted matching encode for change captioning. In ECCV, pages 574–590.
- Tan et al. (2019) Hao Tan, Franck Dernoncourt, Zhe Lin, Trung Bui, and Mohit Bansal. 2019. Expressing visual relationships via language. In ACL, pages 1873–1883.
- Tu et al. (2021) Yunbin Tu, Tingting Yao, Liang Li, Jiedong Lou, Shengxiang Gao, Zhengtao Yu, and Chenggang Yan. 2021. Semantic relation-aware difference representation learning for change captioning. In Findings of ACL, pages 63–73.
- Tu et al. (2017) Yunbin Tu, Xishan Zhang, Bingtao Liu, and Chenggang Yan. 2017. Video description with spatial-temporal attention. In ACM MM, pages 1014–1022.
- Tu et al. (2020) Yunbin Tu, Chang Zhou, Junjun Guo, Shengxiang Gao, and Zhengtao Yu. 2020. Enhancing the alignment between target words and corresponding frames for video captioning. Pattern Recognition, page 107702.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NeurIPS, pages 5998–6008.
- Vedantam et al. (2015) Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. 2015. Cider: Consensus-based image description evaluation. In CVPR, pages 4566–4575.
- Wu et al. (2019) Aming Wu, Linchao Zhu, Yahong Han, and Yi Yang. 2019. Connective cognition network for directional visual commonsense reasoning. In NeurIPS, pages 5669–5679.
- Yin et al. (2020) Yongjing Yin, Fandong Meng, Jinsong Su, Chulun Zhou, Zhengyuan Yang, Jie Zhou, and Jiebo Luo. 2020. A novel graph-based multi-modal fusion encoder for neural machine translation. In ACL, pages 3025–3035.
- Zhang et al. (2017) Xishan Zhang, Ke Gao, Yongdong Zhang, Dongming Zhang, Jintao Li, and Qi Tian. 2017. Task-driven dynamic fusion: Reducing ambiguity in video description. In CVPR, pages 3713–3721.