Learning Action-Effect Dynamics from Pairs of Scene-graphs
Abstract
‘Actions’ play a vital role in how humans interact with the world. Thus, autonomous agents that would assist us in everyday tasks also require the capability to perform ‘Reasoning about Actions & Change’ (RAC). Recently, there has been growing interest in the study of RAC with visual and linguistic inputs. Graphs are often used to represent the semantic structure of visual content (i.e. objects, their attributes, and relationships among objects), commonly referred to as scene-graphs. In this work, we propose a novel method that leverages the scene-graph representation of images to reason about the effects of actions described in natural language. We experiment with the existing CLEVR_HYP dataset (Sampat et al. 2021) and show that our proposed approach is effective in terms of performance, data efficiency, and generalization capability compared to existing models.
Introduction
Reasoning about ‘Actions’ is important for humans as it helps us to predict whether a sequence of actions will lead us to a desired goal; to explain observations, i.e. what actions may have taken place; and to diagnose faults, i.e. identify actions that may have resulted in an undesirable situation (Baral 2010). As we develop autonomous agents that can assist us in performing everyday tasks, they too will need to interact with complex environments. As pointed out by Davis and Marcus (2015), imagine a guest asks a robot for a glass of wine; if the robot sees that the glass is broken or has a dead cockroach inside, it should not simply pour the wine. Similarly, if a cat runs in front of a house-cleaning robot, the robot should neither run it over nor put it away on a shelf. Hence, the ability of artificial agents to perform reasoning about actions is highly desirable.

As a result, Reasoning about Action and Change (RAC) has been a long-established research problem since the rise of Artificial Intelligence (AI). McCarthy et al. (1960) were the first to emphasize reasoning about the effects of actions. They developed an advice taker system that can do deductive reasoning about scenarios such as “going to the airport from home” requires “walking to the car” and “driving the car to the airport”. Since then, many real-life use cases have been identified which require AI models to understand interactions among the states of the world, the actions being performed, and the most likely resulting states (Banerjee et al. 2020).
While RAC has been more popular in the knowledge representation and logic community, it has recently piqued the interest of NLP and vision researchers. A recent survey by Sampat et al. (2022) compiled a comprehensive list of works that explore neural networks’ ability to reason about actions and changes, given linguistic and/or visual inputs. Specifically, the works of Park et al. (2020); Sampat et al. (2021); Shridhar et al. (2020); Yang et al. (2021); Gao et al. (2018); Patel et al. (2022) are quite relevant.
In Figure 1, we describe two possible action-effect learning strategies (LS1 and LS2) through a toy example to convey our intuition behind this work. LS1 uses visual features (i.e. features from the image of an apple) and an action representation (of the text “rotten”) learned through sentence embedding to imagine effects (i.e. what a rotten apple would look like). LS1 has been the intuitive choice for modeling action-effect learning in a supervised setting in previous literature. We hypothesize that LS1 does not improve the model’s understanding of what effects the actions will produce. Thus, we propose an alternative strategy, LS2. Specifically, we let the model observe the difference between pairs of states before and after the action is performed (i.e. the decayed portion of the apple that distinguishes a good apple from the rotten one), then associate those visual differences with the corresponding linguistic action descriptions (i.e. the text “rotten”). LS2 is likely to better capture action-effect dynamics, as action representations are learned explicitly.
CLEVR_HYP Dataset (Sampat et al. 2021)
In this section, we summarize important aspects of CLEVR_HYP and terminology used in subsequent sections.
Problem Formulation
The task aims at understanding the changes caused over an image by performing an action described in natural language, and then answering a reasoning question over the resulting scene. Figure 2 shows an example from the dataset:

- Inputs:
  1. Image (I): a visual scene with rendered objects
  2. Action Text (TA): text describing an action to be performed over I
  3. Question (QH): a question to assess the system’s ability to understand the changes caused by TA on I
- Output: Answer (A) for the given QH
- Answer Vocabulary: [0-9, yes, no, cylinder, sphere, cube, small, big, metal, rubber, red, green, gray, blue, brown, yellow, purple, cyan]
- Evaluation: 27-way Answer Classification / Accuracy (%)
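To make this input-output interface concrete, the following is a minimal Python sketch of how a single CLEVR_HYP example could be represented; the class and field names are our own illustration, not the official dataset schema.

```python
from dataclasses import dataclass
from typing import List

# The 27-way answer vocabulary listed above:
# 10 digits + yes/no + 3 shapes + 2 sizes + 2 materials + 8 colors = 27 classes.
ANSWER_VOCAB: List[str] = (
    [str(d) for d in range(10)]
    + ["yes", "no"]
    + ["cylinder", "sphere", "cube"]
    + ["small", "big", "metal", "rubber"]
    + ["red", "green", "gray", "blue", "brown", "yellow", "purple", "cyan"]
)
assert len(ANSWER_VOCAB) == 27

@dataclass
class ClevrHypExample:
    """One (I, TA, QH, A) tuple; field names are illustrative."""
    image_path: str    # I : rendered visual scene
    action_text: str   # TA: action to hypothetically perform on I
    question: str      # QH: question about the scene after TA is applied
    answer: str        # A : one entry of ANSWER_VOCAB
```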

Dataset Details and Partitions
The CLEVR_HYP dataset assumes a closed set of object attributes, action types, and question reasoning types:
- Object attributes: 8 colors, 3 shapes, 2 sizes, 2 materials
- Action types: Add object, Remove object(s), Change attribute, Move object (in-plane and out-of-plane)
- Reasoning types: Count objects, Compare objects, Existence of objects, Query attribute, and Compare attribute
The dataset is divided into the following partitions:

- Train (67.5k) / Val (13.5k) sets have (I, TA, QH, A) tuples along with scene-graphs as a visual oracle and functional programs as a textual oracle. Functional programs were originally introduced in CLEVR (Johnson et al. 2017); e.g., the question ‘How many red metal things are there?’ corresponds to the program ‘count(filter_color(filter_material(scene(),metal),red))’ (a minimal execution sketch follows this list).
- Test sets consist of only (I, TA, QH, A) tuples, with no oracle annotations available. There are three test sets:
  1. Ordinary test (13.5k): examples with the same difficulty as train/val
  2. 2HopTA test (1.5k): examples where two actions are performed, e.g. ‘Move a purple object on a red cube then paint it cyan.’
  3. 2HopQH test (1.5k): examples where two reasoning types are combined, e.g. ‘How many objects are either red or cylinder?’
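As a concrete illustration of the textual oracle, below is a minimal Python sketch that executes the functional program above over a toy scene-graph (a list of per-object attribute dictionaries of our own design); the actual CLEVR/CLEVR_HYP executor supports a much larger set of operators.

```python
from typing import Dict, List

SceneGraph = List[Dict[str, str]]  # toy scene-graph: one attribute dict per object

def scene(sg: SceneGraph) -> SceneGraph:
    return sg

def filter_material(objs: SceneGraph, material: str) -> SceneGraph:
    return [o for o in objs if o["material"] == material]

def filter_color(objs: SceneGraph, color: str) -> SceneGraph:
    return [o for o in objs if o["color"] == color]

def count(objs: SceneGraph) -> int:
    return len(objs)

sg = [
    {"color": "red",  "material": "metal",  "shape": "cube",     "size": "small"},
    {"color": "red",  "material": "rubber", "shape": "sphere",   "size": "big"},
    {"color": "cyan", "material": "metal",  "shape": "cylinder", "size": "small"},
]

# 'How many red metal things are there?'
print(count(filter_color(filter_material(scene(sg), "metal"), "red")))  # -> 1
```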
Baseline Models
The following are the two top-performing baselines reported in Sampat et al. (2021), against which we compare the results of our proposed approach in this paper.
- (TIE) Text-conditioned Image Editing: A text-adaptive encoder-decoder with residual gating (Vo et al. 2019) is used to generate a new image conditioned on the action text. The new image, along with the question, is then fed into LXMERT (Tan and Bansal 2019), a pre-trained vision-language transformer, to generate an answer.
- (SGU) Scene-graph Update: In this model, understanding the changes caused by an action text is treated as a graph-editing problem. First, the image is converted into a scene-graph and the action text is converted into a functional program (FP). Sampat et al. (2021) developed a module, inspired by Chen et al. (2020), that generates an updated scene-graph given the initial scene-graph and the functional program of the action text. It is followed by a neural-symbolic VQA model (Yi et al. 2018) that generates an answer to the question given the updated scene-graph.
Proposed Model: Action Representation Learner (ARL)
In this section, we describe the architecture of our proposed model, Action Representation Learner (ARL). Our hypothesis is that a model can learn better action representations by observing the difference between a pair of states (before and after the action is performed) and then associating those visual differences with the linguistic description of the action. We create the 3-stage model shown in Figure 3, which we believe better captures the causal structure of this task.
Stage-1
This stage comprises an ‘Action Encoder’ and an ‘Effect Decoder’. As described in the previous section, the training set of CLEVR_HYP provides oracle annotations for the initial scene-graph S (from which the image is rendered) and the scene-graph S′ obtained after executing the action text. We take a random subset of 20k scene-graph pairs from the training set, balanced by action types (add, remove, change, move), to train this encoder-decoder.
The encoder captures the difference between states S and S′, denoted A. At test time, the updated scene-graph S′ is not available. To address this, the encoder is followed by a decoder that can reconstruct S′ given S and the learned scene difference A. We jointly train the encoder-decoder with the following objective:
$\max_{\theta_{enc},\,\theta_{dec}} \; \log P\big(S' \mid S, A\big), \quad \text{where } A = \mathrm{Enc}_{\theta_{enc}}(S, S')$   (1)

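A minimal PyTorch sketch of Stage-1 under our reading of Eq. (1) is shown below: the Action Encoder compresses a (S, S′) pair into a scene-difference vector A, and the Effect Decoder reconstructs S′ from (S, A). We assume fixed-length vector encodings of scene-graphs and a simple reconstruction loss for illustration; the exact scene-graph featurization and training loss may differ.

```python
import torch
import torch.nn as nn

SCENE_DIM = 256    # assumed size of a flattened scene-graph encoding
ACTION_DIM = 125   # action vector length (125 is the best value in our ablation below)

class ActionEncoder(nn.Module):
    """Maps a (S, S') pair of scene encodings to a scene-difference vector A."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * SCENE_DIM, 512), nn.ReLU(), nn.Linear(512, ACTION_DIM))

    def forward(self, s, s_prime):
        return self.net(torch.cat([s, s_prime], dim=-1))

class EffectDecoder(nn.Module):
    """Reconstructs S' given the initial scene encoding S and the action vector A."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(SCENE_DIM + ACTION_DIM, 512), nn.ReLU(), nn.Linear(512, SCENE_DIM))

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

encoder, decoder = ActionEncoder(), EffectDecoder()
optimizer = torch.optim.Adam(
    list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

def stage1_step(s, s_prime):
    """One joint encoder-decoder update on a batch of (S, S') scene-encoding pairs."""
    a = encoder(s, s_prime)                        # scene difference A
    recon = decoder(s, a)                          # reconstruction of S'
    loss = nn.functional.mse_loss(recon, s_prime)  # reconstruction objective (proxy for Eq. 1)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```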
Stage-2
We model changes in the scene as a function of the action. At test time the scene difference A cannot be computed, since S′ is unknown; however, we can obtain Arep, a vector representation of the natural-language action, and train it to approximate A (from Stage-1). Specifically, we freeze the encoder-decoder trained in Stage-1 and learn a ‘Natural Language to Action Representation’ module that maximizes the following log probability:
$\max_{\theta_{rep}} \; \log P\big(S' \mid S, A_{rep}\big), \quad \text{where } A_{rep} = f_{\theta_{rep}}(T_A)$   (2)
This module consists of an LSTM encoder, preceded by an embedding layer and followed by dense layers. During training, in addition to learning the weights of the LSTM and dense layers, word embeddings are learned for each word in the training vocabulary. In this way, a fixed-length vector is generated for each word in the vocabulary, depending on the position of the word in context, and updated via back-propagation. The LSTM has a hidden layer of size 200.
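Continuing the Stage-1 sketch above, the following is one way to realize the ‘Natural Language to Action Representation’ module as described (embedding layer, LSTM with hidden size 200, dense layers); the vocabulary size and the use of the frozen Stage-1 decoder’s reconstruction error as the training signal are illustrative assumptions.

```python
class ActionTextEncoder(nn.Module):
    """Embedding -> LSTM (hidden size 200) -> dense layers -> Arep."""
    def __init__(self, vocab_size: int, emb_dim: int = 100,
                 hidden_dim: int = 200, action_dim: int = ACTION_DIM):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)  # word vectors learned via back-prop
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.dense = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, action_dim))

    def forward(self, token_ids):
        emb = self.embedding(token_ids)   # (batch, seq_len, emb_dim)
        _, (h_n, _) = self.lstm(emb)      # final hidden state summarizes the action text TA
        return self.dense(h_n[-1])        # Arep, trained to approximate A from Stage-1

# Freeze the Stage-1 encoder-decoder; only the text module is updated in Stage-2.
for p in list(encoder.parameters()) + list(decoder.parameters()):
    p.requires_grad_(False)

text_encoder = ActionTextEncoder(vocab_size=100)  # illustrative vocabulary size
optimizer2 = torch.optim.Adam(text_encoder.parameters(), lr=1e-3)

def stage2_step(s, s_prime, token_ids):
    """One update of the text module so the frozen decoder reconstructs S' from (S, Arep)."""
    a_rep = text_encoder(token_ids)
    loss = nn.functional.mse_loss(decoder(s, a_rep), s_prime)  # proxy for Eq. (2)
    optimizer2.zero_grad()
    loss.backward()
    optimizer2.step()
    return loss.item()
```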
Stage-3
Stage-3 combines the modules trained in Stage-1 and Stage-2 with off-the-shelf ‘Image to Scene-graph’ and ‘Scene-graph Question Answering’ networks. Specifically, ‘Image to Scene-graph’ is a Mask R-CNN (He et al. 2017) followed by a ResNet-34 (He et al. 2016) that classifies the visual attributes (color, material, size, and shape) and obtains the 3D coordinates of each object in the scene. The ‘Scene-graph Question Answering’ network is based on Yi et al. (2018), which has near-perfect accuracy on the scene-graph question answering task over CLEVR (Johnson et al. 2017).
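Putting the three stages together, end-to-end inference can be sketched as follows; `image_to_scene_graph` (the Mask R-CNN + ResNet-34 front end), `tokenize`, and `answer_question` (the neural-symbolic VQA model of Yi et al. 2018) are placeholders for the off-the-shelf components and are not implemented here, and scene-graphs are treated as fixed-length encodings as in the Stage-1 sketch.

```python
def arl_inference(image, action_text, question):
    """ARL inference sketch; the helper functions are placeholders for off-the-shelf models."""
    s = image_to_scene_graph(image)              # Stage-3 front end: Mask R-CNN + ResNet-34
    a_rep = text_encoder(tokenize(action_text))  # Stage-2: action text -> Arep
    s_prime = decoder(s, a_rep)                  # Stage-1 decoder: imagined post-action scene
    return answer_question(s_prime, question)    # neural-symbolic VQA (Yi et al. 2018)
```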
Results and Analysis
Quantitative Results
In this section, we discuss the performance of our model quantitatively and qualitatively. We also discuss three ablations conducted on our model.
Evaluation Metric: The classification task of CLEVR_HYP has exactly one correct answer. Therefore, the exact match accuracy (%) metric is used for evaluation.
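Concretely, a minimal implementation of this metric could be:

```python
def exact_match_accuracy(predictions, gold_answers):
    """Percentage of examples whose predicted answer exactly equals the gold answer."""
    assert len(predictions) == len(gold_answers)
    correct = sum(p == g for p, g in zip(predictions, gold_answers))
    return 100.0 * correct / len(gold_answers)
```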
Table 1: Test performance on CLEVR_HYP (%)

| Test set | TIE | SGU | ARL |
|---|---|---|---|
| Ordinary | 64.7 | 70.5 | 76.4 |
| 2HopTA | 55.6 | 64.4 | 69.2 |
| 2HopQH | 58.7 | 66.5 | 70.7 |
Our experimental results are summarized in Table 1. Our proposed approach (ARL) outperforms the existing baselines by 5.9%, 4.8%, and 4.2% on the Ordinary, 2HopTA, and 2HopQH test sets, respectively. This demonstrates that our model not only achieves better overall accuracy but also generalizes better when multiple actions have to be performed on the image or when logical combinations of attributes must be understood while reasoning.
Qualitative Results

In Figure 4, we visually demonstrate the scene-graphs predicted by our ARL model for a variety of action texts. From examples 1-2, we can observe that the model correctly identifies objects that match the object attributes (color, size, shape, material) provided in the action text. Examples 3-4 demonstrate that our system is consistent in its predictions when synonyms of various words in the dataset are used (e.g. ‘ball’ for ‘sphere’, ‘metallic’ for ‘shiny’). Finally, examples 5-6 show that our model does reasonably well on other actions (add and change).
We further generate a t-SNE plot of the action vectors learned by our best proposed model, shown in Figure 5. At first glance, the learned action representations form well-defined, separable clusters corresponding to each action type. Clusters for the add, remove, and change actions are closer together and somewhat overlapping.
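A plot like Figure 5 can be produced along the following lines with scikit-learn and matplotlib; `action_vectors` (the learned Arep vectors) and `action_type_labels` (their add/remove/change/move labels) are assumed to have been collected from the trained model beforehand.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# action_vectors: (N, 125) array of learned Arep vectors collected from the model
# action_type_labels: length-N sequence with values in {"add", "remove", "change", "move"}
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(action_vectors)
labels = np.asarray(action_type_labels)
for action_type in ["add", "remove", "change", "move"]:
    mask = labels == action_type
    plt.scatter(coords[mask, 0], coords[mask, 1], s=5, label=action_type)
plt.legend()
plt.title("t-SNE of learned action representations")
plt.show()
```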

Ablations
Importance of Stage-1 training
Cause-effect learning with respect to actions is a key focus of CLEVR_HYP. In existing models, it is formulated as an updated scene-graph prediction task (i.e. given an initial scene and an action, determine what the resulting scene would look like after executing the action). In our view, Stage-1 plays a critical role in learning the causal structure of the world. To demonstrate this, we set up two experiments: in the first, training proceeds sequentially (Stage-1 followed by Stage-2), and the encoder-decoder trained in Stage-1 is frozen and utilized in Stage-2; in the second, there is no separate Stage-1 training, and the encoder-decoder in Stage-2 is randomly initialized.
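In PyTorch terms, the two setups differ only in whether the Stage-1 weights are loaded and frozen before Stage-2 training starts; the sketch below shows this for the decoder, the Stage-1 component the Stage-2 objective depends on, reusing the classes from the Stage-1 sketch above (the checkpoint name is illustrative).

```python
def build_stage2_decoder(use_stage1_pretraining: bool) -> EffectDecoder:
    """Stage(1+2): load and freeze the Stage-1 decoder; Stage(2 only): random init, trainable."""
    dec = EffectDecoder()
    if use_stage1_pretraining:
        dec.load_state_dict(torch.load("stage1_decoder.pt"))  # illustrative checkpoint name
        for p in dec.parameters():
            p.requires_grad_(False)
    # otherwise dec stays randomly initialized and is trained jointly with the text module
    return dec
```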


| Task | Experiment | Accuracy (%) |
|---|---|---|
| Scene-graph Update | Stage(2 only) | 56.3 |
| Scene-graph Update | Stage(1+2) | 87.2 |
| Question Answering | Stage(2+3 only) | 45.7 |
| Question Answering | Stage(1+2+3) | 76.4 |
The results are summarized in Table 2. We observe that including Stage-1 training improves the accuracy of scene-graph prediction by over 30% compared to the Stage(2 only) model. To evaluate the question answering task of CLEVR_HYP, both setups are followed by Stage-3, where the image-to-scene-graph generator and the scene-graph question answering modules are combined to predict the answer. Since the model of Yi et al. (2018) has near-perfect performance on the scene-graph question answering task over CLEVR (Johnson et al. 2017), the gains achieved on the scene-graph task carry over to question answering with little loss; in only 0.2% of instances is the scene prediction correct but the final answer incorrect.
Performance with different data sizes used for training Stage-1
In this ablation, the goal is to find out how many scene-graph pairs are required to effectively learn the effects of actions. We experiment with different data sizes, from 2k to 25k samples. Figure 6 (top) shows the effect of training data size on the scene-graph update and downstream question answering tasks. Performance initially improves with more data samples; however, it saturates after 20k samples.
Performance with different lengths of learned action vector in Stage-1
In this ablation, the goal is to find the optimal length of action vector that can reasonably simulate the effects of actions. We experiment with different lengths of the learned action vector, from 25 to 200 in increments of 25. Figure 6 (bottom) shows the effect of action vector length on the scene-graph update and downstream question answering tasks. Performance initially improves as the vector length is increased; however, it peaks at an action vector length of 125.
Conclusion
In the vision and language domain, several tasks have been proposed that require an understanding of the causal structure of the world. In this work, we propose an effective way of learning action representations and implement a 3-stage model for the what-if vision-language reasoning task CLEVR_HYP. We provide insights on the learned action representations and validate the effectiveness of our proposed method through ablations. Finally, we demonstrate that our proposed method outperforms existing baselines while being data-efficient and showing some degree of generalization capability. By extending our approach to a larger set of actions, we aim to develop AI agents that are equipped with action-effect reasoning capability and can better collaborate with humans in the physical world.
References
- Banerjee et al. (2020) Banerjee, P.; Baral, C.; Luo, M.; Mitra, A.; Pal, K.; Son, T. C.; and Varshney, N. 2020. Can Transformers Reason About Effects of Actions? arXiv preprint arXiv:2012.09938.
- Baral (2010) Baral, C. 2010. Reasoning about actions and change: from single agent actions to multi-agent actions. In KR.
- Chen et al. (2020) Chen, L.; Lin, G.; Wang, S.; and Wu, Q. 2020. Graph edit distance reward: Learning to edit scene graph. In European Conference on Computer Vision, 539–554. Springer.
- Davis and Marcus (2015) Davis, E.; and Marcus, G. 2015. Commonsense reasoning and commonsense knowledge in artificial intelligence. Communications of the ACM.
- Gao et al. (2018) Gao, Q.; Yang, S.; Chai, J.; and Vanderwende, L. 2018. What action causes this? Towards naive physical action-effect prediction. In ACL.
- He et al. (2017) He, K.; Gkioxari, G.; Dollár, P.; and Girshick, R. B. 2017. Mask R-CNN. In IEEE International Conference on Computer Vision, ICCV 2017, Venice, Italy, October 22-29, 2017, 2980–2988. IEEE Computer Society.
- He et al. (2016) He, K.; Zhang, X.; Ren, S.; and Sun, J. 2016. Deep Residual Learning for Image Recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 770–778. IEEE Computer Society.
- Johnson et al. (2017) Johnson, J.; Hariharan, B.; van der Maaten, L.; Fei-Fei, L.; Zitnick, C. L.; and Girshick, R. B. 2017. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, 1988–1997. IEEE Computer Society.
- McCarthy et al. (1960) McCarthy, J.; et al. 1960. Programs with common sense. RLE and MIT computation center Cambridge, MA, USA.
- Park et al. (2020) Park, J. S.; Bhagavatula, C.; Mottaghi, R.; Farhadi, A.; and Choi, Y. 2020. VisualCOMET: Reasoning about the Dynamic Context of a Still Image. In ECCV.
- Patel et al. (2022) Patel, M.; Gokhale, T.; Baral, C.; and Yang, Y. 2022. Benchmarking Counterfactual Reasoning Abilities about Implicit Physical Properties. In NeurIPS 2022 Workshop on Neuro Causal and Symbolic AI (nCSI).
- Sampat et al. (2021) Sampat, S. K.; Kumar, A.; Yang, Y.; and Baral, C. 2021. CLEVR_HYP: A Challenge Dataset and Baselines for Visual Question Answering with Hypothetical Actions over Images. In NAACL-HLT.
- Sampat et al. (2022) Sampat, S. K.; Patel, M.; Das, S.; Yang, Y.; and Baral, C. 2022. Reasoning about Actions over Visual and Linguistic Modalities: A Survey. arXiv preprint arXiv:2207.07568.
- Shridhar et al. (2020) Shridhar, M.; Thomason, J.; Gordon, D.; Bisk, Y.; Han, W.; Mottaghi, R.; Zettlemoyer, L.; and Fox, D. 2020. ALFRED: A Benchmark for Interpreting Grounded Instructions for Everyday Tasks. In CVPR.
- Tan and Bansal (2019) Tan, H.; and Bansal, M. 2019. LXMERT: Learning Cross-Modality Encoder Representations from Transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 5100–5111. Association for Computational Linguistics.
- Vo et al. (2019) Vo, N.; Jiang, L.; Sun, C.; Murphy, K.; Li, L.; Fei-Fei, L.; and Hays, J. 2019. Composing Text and Image for Image Retrieval - an Empirical Odyssey. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, 6439–6448. Computer Vision Foundation / IEEE.
- Yang et al. (2021) Yang, Y.; Panagopoulou, A.; Lyu, Q.; Zhang, L.; Yatskar, M.; and Callison-Burch, C. 2021. Visual Goal-Step Inference using wikiHow. In EMNLP.
- Yi et al. (2018) Yi, K.; Wu, J.; Gan, C.; Torralba, A.; Kohli, P.; and Tenenbaum, J. 2018. Neural-Symbolic VQA: Disentangling Reasoning from Vision and Language Understanding. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, 1039–1050.