This paper was converted on www.awesomepapers.org from LaTeX by an anonymous user.
Want to know more? Visit the Converter page.

Abstract Reasoning via Logic-guided Generation

Sihyun Yu    Sangwoo Mo    Sungsoo Ahn    Jinwoo Shin
Abstract

Abstract reasoning, i.e., inferring complicated patterns from given observations, is a central building block of artificial general intelligence. While humans find the answer by either eliminating wrong candidates or first constructing the answer, prior deep neural network (DNN)-based methods focus on the former discriminative approach. This paper aims to design a framework for the latter approach and bridge the gap between artificial and human intelligence. To this end, we propose logic-guided generation (LoGe), a novel generative DNN framework that reduces abstract reasoning as an optimization problem in propositional logic. LoGe is composed of three steps: extract propositional variables from images, reason the answer variables with a logic layer, and reconstruct the answer image from the variables. We demonstrate that LoGe outperforms the black box DNN frameworks for generative abstract reasoning under the RAVEN benchmark, i.e., reconstructing answers based on capturing correct rules of various attributes from observations.


1 Introduction

Imitating the human ability to infer complicated patterns from observations has been a long-standing goal of artificial intelligence. In order to build such models capable of this reasoning ability, recent works (Zheng et al., 2019; Zhang et al., 2019b; Hahne et al., 2019; Wang et al., 2020; Hu et al., 2021; Wu et al., 2020) have focused on training a deep neural network (DNN) which solves abstract reasoning problems that resemble an IQ test (Figure 1). In these problems, one should infer common rules of contexts without any additional information other than context images and select a correct answer from a candidate set. Accordingly, those DNN-based approaches attempt to derive the framework in situations where both the supervision on rules of the problem and the explicit symbol labels of each image are not provided. A couple of studies (Santoro et al., 2018; Zhang et al., 2019a) demonstrated how the widely used neural network architectures such as ResNet (He et al., 2016) or LSTM (Hochreiter & Schmidhuber, 1997) are unfit for learning reasoning capability, as any priors to resemble the human reasoning procedure are not employed in these architectures. Remarkably, recent works have shown that the performance can be significantly improved with the careful neural architecture design motivated by the human reasoning process and even outperforms humans (Zhang et al., 2019b; Zheng et al., 2019; Wu et al., 2020).

Refer to caption
(a) Response elimination.
Refer to caption
(b) Constructive matching.
Refer to caption
(c) Connection to propositional logic.
Figure 1: Illustration of strategies to solve Raven’s Progressive Matrices (RPM) problem. (a) In response elimination, one finds the answer by eliminating the irrelevant candidates which are not aligned from context images. (b) In constructive matching, one first imagines the answer candidate from inferred rules then selects the answer. (c) Interpretation of the problem combined with the propositional logic.

Most DNN-based methods on abstract reasoning mostly have resembled how humans perform reasoning via response elimination strategy (Figure 1(a)), i.e., to exclude candidate answers based on matching with the given context images. Intriguingly, cognitive science (Bethell-Fox et al., 1984; Carpenter et al., 1990) reveals that humans use two types of abstract reasoning strategies, not only limited to response elimination strategy. To be specific, humans also have the ability to perform constructive matching (Figure 1(b)); they can first imagine the answer from context images without any candidates and then match the candidate answers to select the most similar one. Especially, several works (Mitchum & Kelley, 2010; Becker et al., 2016) have emphasized that the latter strategy better reflects the general intelligence of humans. However, investigation on the ability of neural networks to achieve constructive matching is yet under-explored; even such a direction is promising.

Contribution. We introduce a new end-to-end generative framework, coined logic-guided generation (LoGe), to learn a constructive matching strategy on abstract reasoning like humans. Our main idea is to reduce these reasoning problems into optimization problems in propositional logic. Leveraging such prior knowledge, LoGe learns to embed each image to discrete variables and generate the answer image via incorporating a differentiable propositional logical reasoning layer. We note that both objectives are achieved without any supervision on the exact propositional variables of each image and underlying rules in the problem. Specifically, we propose a three-step framework to achieve these objectives: abstraction, reasoning, and reconstruction.

We verify how our LoGe effectively solves the proposed task on the RAVEN (Zhang et al., 2019a) dataset. To be specific, we show that our framework generates high-quality images, which is correct, based on capturing the underlying abstract rules and attributes. This result is remarkable, as the widely used neural network architectures perform poorly for this conditional generation task. We also verify how LoGe performs comparably to neural networks that rely on response elimination strategy to perform abstract reasoning, even though our task is arguably harder and has not accessed to the wrong candidates while training.

2 Logic-guided Generation

In this section, we demonstrate logic-guided generation (LoGe), a framework to imitate a human’s constructive matching strategy on abstract reasoning.

2.1 Overview of logic-guided generation

Our problem setup is largely inspired by Bethell-Fox et al. (1984), who evaluated the constructive matching ability of humans to measure their generative reasoning ability. In this perspective, we express reasoning as a task of inferring the rule rr from a given problem PP, where the problem PP is a pair of a context 𝒙\bm{x} and answer yy satisfying the rule y=r(𝒙)y=r(\bm{x}). We especially focus on the generative strategy for solving this task; given a query context 𝒙\bm{x}, we evaluate the ability of machines to infer a rule as r^\hat{r} from contexts 𝒙\bm{x} to generate an answer image y^=r^(𝒙)\hat{y}=\hat{r}({\bm{x}}) that matches the ground-truth image yy.

For teaching models the ability of generative strategy in abstract reasoning, we train them on a dataset 𝒟\mathcal{D} consisting of dd problems, i.e., 𝒟={(𝒙i,yi)}i=1d\mathcal{D}=\{({\bm{x}}_{i},y_{i})\}_{i=1}^{d}. To be specific, we consider a dataset where each context 𝒙\bm{x} in the dataset 𝒟\mathcal{D} is a tuple of MM images (x1,,xM)(x_{1},\ldots,x_{M}) and yy is an answer image. Images are specified by a collection of abstract features such as shapes, colors and size. For instance, we visualize the case of the generative strategy in Raven’s Progressive Matrices (RPM) structure in Figure 1(b): contexts are given as eight images i.e., M=8M=8, and the goal is to generate an answer image for the remaining location, denoted by a question mark.

Our main idea is to connect abstract reasoning to the optimization problem in propositional logic to achieve the generative reasoning strategy into the framework. For instance, in RPM problems, one can define propositional variables p(i,j,o)p(i,j,o) as “an image placed at iith row and jjth column contains an attribute oo” to represent contexts in the problem, as shown in Figure 1(c). Here, attributes are sets of features in each context image, e.g., set of shapes 𝒜\mathcal{A}, color 𝒞\mathcal{C}, and size 𝒮\mathcal{S}. With those variables, underlying rules can be written as propositional logical formulas as in Figure 1(c). In this respect, one may interpret the answer generation procedure as the MAXSAT optimization problem in propositional logic: finding propositional variables representing the answer which satisfy the underlying logical formula in the given RPM problem as much as possible. We provide a more description of the MAXSAT problem in Appendix A.

Refer to caption
Figure 2: Overall illustration of our framework consists of three steps: abstraction, reasoning, and reconstruction. We compute the propositional logical variables of each image and predict variables of the answer combined with the logical reasoning layer.

As those propositional variables are not provided in dataset 𝒟\mathcal{D}, LoGe learns a propositional embedding and rules in problems in self-supervised manner. In particular, we derive a three-step framework with an encoder network fθf_{\theta}, a decoder network gϕg_{\phi}, a logical reasoning layer SψS_{\psi} parametrized by θ\theta, ϕ\phi, ψ\psi, respectively, and a latent codebook ={e1,,eK}\mathcal{E}=\{e_{1},\cdots,e_{K}\}, where each element eke_{k}\in\mathcal{E} is a trainable DD-dimensional real-valued vector:

  • (Abstraction.) The encoder network fθf_{\theta} and the codebook \mathcal{E} embeds contexts into propositional variables.

  • (Reasoning.) The reasoning layer SψS_{\psi} predicts propositional variables of the answer image.

  • (Reconstruction.) The decoder network gϕg_{\phi} and the codebook \mathcal{E} generates the answer image from inferred propositional variables.

We provide an illustration of our framework in Figure 2.

2.2 Detailed description of LoGe

In the rest of this section, we describe each step of our framework in detail.

Abstraction. We first compute propositional logical variables of each context image xi𝒙x_{i}\in\bm{x} from the encoder network fθf_{\theta} and the latent codebook \mathcal{E}. To achieve this, we first pass through each image xix_{i} into the encoder network fθf_{\theta} to have a corresponding output zifθ(xi)H×Dz_{i}\coloneqq f_{\theta}(x_{i})\in\mathbb{R}^{H\times D}, in which we denote zi=(z(i,1),,z(i,H))𝖳z_{i}=(z_{(i,1)},\ldots,z_{(i,H)})^{{\scriptscriptstyle\mathsf{T}}}. We then quantize the output ziz_{i} with the codebook \mathcal{E}, denoted by qi(es(i,1),,es(i,H))𝖳q_{i}\coloneqq(e_{s(i,1)},\ldots,e_{s(i,H)})^{{\scriptscriptstyle\mathsf{T}}}, where s(i,h)[K]s(i,h)\in[K] with [K]{1,,K}[K]\coloneqq\{1,\ldots,K\} is defined as follows:

s(i,h)\displaystyle s(i,h) argmink[K]z(i,h)ek2,h[H],\displaystyle\coloneqq\underset{k\in[K]}{\arg\min}\,\,\lVert z_{(i,h)}-e_{k}\rVert_{2},\quad\forall h\in[H],

where [H]{1,,H}[H]\coloneqq\{1,\ldots,H\}. We finally consider an one-hot encoding of indices of qiq_{i}, namely (s(i,1),,s(i,H))(s(i,1),\ldots,s(i,H)). Specifically, we map each index s(i,h)[K]s(i,h)\in[K] into KK-categorical one-hot vector t(i,H)t_{(i,H)} in which the s(i,h)s(i,h)-th value is 1 for all h[H]h\in[H]. Consequently, we have an one-hot embedding ti(t(i,1),,t(i,H))𝖳{0,1}H×Kt_{i}\coloneqq(t_{(i,1)},\ldots,t_{(i,H)})^{{\scriptscriptstyle\mathsf{T}}}\in\{0,1\}^{H\times K} of each image xix_{i}. This one-hot embedding tit_{i} is regarded and utilized as propositional variables of xix_{i} in further steps.

Reasoning. With propositional variables (t1,,tM)(t_{1},\ldots,t_{M}) of the context 𝒙\bm{x}, we compute propositional variables t^{0,1}H×K\hat{t}\in\{0,1\}^{H\times K} which corresponds to the predicted answer image. To be specific, we evaluate t^\hat{t} from the reasoning layer SϕS_{\phi} and propositional variables of contexts 𝒙\bm{x}, i.e., t^=Sψ(t1,,tM)\hat{t}=S_{\psi}(t_{1},\ldots,t_{M}). For the reasoning layer SψS_{\psi}, we choose the SATNet layer (Wang et al., 2019), which is a differentiable version of the MAXSAT problem solver and learns propositional logical formulas from data as layer weights. We provide details of this reasoning layer in Appendix B.

Reconstruction. Finally, we infer the answer image y^\hat{y} from predicted propositional variables t^{0,1}H×K\hat{t}\in\{0,1\}^{H\times K}. To achieve this, we first compute a latent vector q^H×D\hat{q}\in\mathbb{R}^{H\times D} of y^\hat{y} from predicted t^\hat{t} and the codebook \mathcal{E}:

q^t^E,whereE=(e1,,eK)𝖳K×D.\displaystyle\hat{q}\coloneqq\hat{t}E,\quad\text{where}\,\,E=(e_{1},\ldots,e_{K})^{{\scriptscriptstyle\mathsf{T}}}\in\mathbb{R}^{K\times D}.

We then return the output from the decoder gϕ(q^)g_{\phi}(\hat{q}) as the final answer image y^\hat{y}.

Training objective. To train LoGe, we propose three loss functions: 𝚊𝚋𝚜𝚝\mathcal{L}_{\mathtt{abst}}, 𝚛𝚎𝚊𝚜𝚘𝚗\mathcal{L}_{\mathtt{reason}}, and 𝚛𝚎𝚌𝚘𝚗\mathcal{L}_{\mathtt{recon}} for each (𝒙,y)𝒟(\bm{x},y)\in\mathcal{D}, which is for abstraction, reasoning, and reconstruction step, respectively. To formulate those objectives, we additionally consider qq^{\ast} and tt^{\ast}, which indicates a quantized vector and propositional variables of the answer yy from the abstraction step.

We first formulate 𝚊𝚋𝚜𝚝\mathcal{L}_{\mathtt{abst}} and 𝚛𝚎𝚌𝚘𝚗\mathcal{L}_{\mathtt{recon}}, which resemble the objective in vector-quantized variational autoencoder (Van Den Oord et al., 2017; Razavi et al., 2019):

Refer to caption
Figure 3: Comparison of generated and ground truth image on RAVEN dataset among different architecture variants. (a)-(c): autoencoder with an attention layer and two convolutional networks with different kernel sizes as the reasoning network. We set the kernel size of 3 and 1 for (b) and (c), respectively. (d)-(e): VQ-VAE with an attention layer and a convolutional network with a kernel size as 3. Ours: our LoGe framework, which is composed of VQ-VAE and SATNet. GT: ground-truth answer image.
𝚊𝚋𝚜𝚝\displaystyle\mathcal{L}_{\mathtt{abst}}\coloneqq xi𝒙[fθ(xi)¯qi22+βfθ(xi)qi¯22]\displaystyle\sum_{x_{i}\in\bm{x}}\Big{[}\lVert\overline{f_{\theta}(x_{i})}-q_{i}\rVert_{2}^{2}+\beta\lVert f_{\theta}(x_{i})-\overline{q_{i}}\rVert_{2}^{2}\Big{]}
+fθ(y)¯q22+βfθ(y)q¯22,\displaystyle+\lVert\overline{f_{\theta}(y)}-q^{\ast}\rVert_{2}^{2}+\beta\lVert f_{\theta}(y)-\overline{q^{\ast}}\rVert_{2}^{2},
𝚛𝚎𝚌𝚘𝚗\displaystyle\mathcal{L}_{\mathtt{recon}}\coloneqq xi𝒙gϕ(qi)xi||22+gϕ(q)y22,\displaystyle\sum_{x_{i}\in\bm{x}}\lVert g_{\phi}(q_{i})-x_{i}||_{2}^{2}+\rVert g_{\phi}(q^{\ast})-y\rVert_{2}^{2},

where the term with the bar indicates the term with a stop-gradient operator. Moreover, we define 𝚛𝚎𝚊𝚜𝚘𝚗\mathcal{L}_{\mathtt{reason}} as follows:

𝚛𝚎𝚊𝚜𝚘𝚗𝙱𝙲𝙴(Sψ(t1,,tM),t).\displaystyle\mathcal{L}_{\mathtt{reason}}\coloneqq\mathcal{L}_{\mathtt{BCE}}(S_{\psi}(t_{1},\ldots,t_{M}),t^{\ast}).

Here, 𝙱𝙲𝙴\mathcal{L}_{\mathtt{BCE}} denotes the binary cross-entropy loss.

To sum up, we optimize the loss 𝚝𝚘𝚝𝚊𝚕(θ,ϕ,ψ,)\mathcal{L}_{\mathtt{total}}(\theta,\phi,\psi,\mathcal{E}), which is a sum of above three loss functions:

𝚝𝚘𝚝𝚊𝚕(θ,ϕ,ψ,)𝚊𝚋𝚜𝚝+𝚛𝚎𝚌𝚘𝚗+𝚛𝚎𝚊𝚜𝚘𝚗.\displaystyle\mathcal{L}_{\mathtt{total}}(\theta,\phi,\psi,\mathcal{E})\coloneqq\mathcal{L}_{\mathtt{abst}}+\mathcal{L}_{\mathtt{recon}}+\mathcal{L}_{\mathtt{reason}}.

Here, the total loss 𝚝𝚘𝚝𝚊𝚕\mathcal{L}_{\mathtt{total}} contains several discrete outputs, e.g., outputs of the reasoning layer in 𝚛𝚎𝚊𝚜𝚘𝚗\mathcal{L}_{\mathtt{reason}}. We provide a detailed description of how to deal with such non-differentiability on the optimization in Appendix C.

3 Experiments

We verify the effectiveness of our framework on the RAVEN/i-RAVEN dataset (Zhang et al., 2019a; Hu et al., 2021). Our result demonstrates that the proposed logic-guided generation (LoGe) framework how well generates the answer image at a given abstract reasoning problem, while other neural architectures fail to achieve this. Moreover, we also show our framework can be employed for discriminative tasks, i.e., choose the answer among candidates, and it shows improved results compared to existing discriminative methods (Zhang et al., 2019a; Zheng et al., 2019; Zhang et al., 2019b; Hu et al., 2021; Wu et al., 2020).

Method Center U-D L-R O-IC
LSTM (Zhang et al., 2019a) 12.3 10.3 12.7 12.9
WReN (Santoro et al., 2018) 23.3 15.2 16.5 16.8
LEN (Zheng et al., 2019) 42.5 28.1 27.6 32.9
CoPINet (Zhang et al., 2019b) 50.4 40.8 40.0 42.7
SRAN (Hu et al., 2021) 53.4 43.1 41.4 44.0
LoGe (Ours) 87.5 64.0 51.7 48.5
Table 1: Accuracy of discriminate tasks on i-RAVEN dataset from existing approaches and LoGe. For LoGe, we select the image with the smallest mean-squared error from the generated image.

3.1 Experimental setup

Datasets. To verify the effectiveness of LoGe, we choose RAVEN (Zhang et al., 2019a) and i-RAVEN (Hu et al., 2021). For more details of these datasets, see Appendix D.

Baselines. We note that LoGe utilizes VQ-VAE (Van Den Oord et al., 2017) and SATNet (Wang et al., 2019) to leverage a propositional logic. To verify the effectiveness of this logical prior, we qualitatively compare the generated answers with ones from other black-box neural network frameworks, i.e., different encoders and reasoning networks other than VQ-VAE and SATNet, respectively. For quantitative results, we compare the performance with prior methods on discriminative abstract reasoning. We provide detailed descriptions of baselines for comparison in Appendix E.

3.2 Main results

Qualitative result. Figure 3 summarizes results of the generated answers from different configurations of RAVEN dataset. LoGe successfully generates the answer image at the various configuration of abstract reasoning problems, while other black-box architecture design choices fail. It indicates how propositional logic prior is beneficial to achieve the objective. We provide more illustrations in Appendix 7.

Quantitative result. Table 1 shows the comparison of our framework and prior approaches on discriminative tasks.111For Hu et al. (2021), we resized the image to 80×\times80 for a fair comparison with other baselines. We select the answer by choosing the candidate which has the smallest mean-squared error from the generated answer image to employ our framework into discriminative tasks. Intriguingly, LoGe shows better performance to existing works; even our method has not been accessed to other candidates other than the answer while training.

4 Conclusion

We introduce a new deep neural network generative framework that resembles a human’s constructive matching strategy on abstract reasoning. Specifically, we derive a three-step procedure based on connecting the optimization problem in propositional logic and abstract reasoning. Experimental results demonstrate the effectiveness of our framework among various problem types in abstract reasoning to generate the correct answer based on capturing common patterns with propositional logic prior.

5 Acknowledgements

This work was partially supported by Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.2019-0-00075, Artificial Intelligence Graduate School Program (KAIST)). This work was mainly supported by Samsung Research Funding & Incubation Center of Samsung Electronics under Project Number SRFCIT1902-06.

References

  • Barvinok (1995) Barvinok, A. I. Problems of distance geometry and convex properties of quadratic maps. Discrete & Computational Geometry, 13(2), 1995.
  • Becker et al. (2016) Becker, N., Schmitz, F., Falk, A. M., Feldbrügge, J., Recktenwald, D. R., Wilhelm, O., Preckel, F., and Spinath, F. M. Preventing response elimination strategies improves the convergent validity of figural matrices. Journal of Intelligence, 4(1):2, 2016.
  • Bengio et al. (2013) Bengio, Y., Léonard, N., and Courville, A. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
  • Bethell-Fox et al. (1984) Bethell-Fox, C. E., Lohman, D. F., and Snow, R. E. Adaptive reasoning: Componential and eye movement analysis of geometric analogy performance. Intelligence, 8(3):205–238, 1984.
  • Carpenter et al. (1990) Carpenter, P. A., Just, M. A., and Shell, P. What one intelligence test measures: a theoretical account of the processing in the raven progressive matrices test. Psychological review, 97(3):404, 1990.
  • Goemans & Williamson (1995) Goemans, M. X. and Williamson, D. P. Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. Journal of the ACM (JACM), 42(6), 1995.
  • Hahne et al. (2019) Hahne, L., Lüddecke, T., Wörgötter, F., and Kappel, D. Attention on abstract visual reasoning. arXiv preprint arXiv:1911.05990, 2019.
  • He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  770–778, 2016.
  • Hochreiter & Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • Hu et al. (2021) Hu, S., Ma, Y., Liu, X., Wei, Y., and Bai, S. Stratified rule-aware network for abstract visual reasoning. In AAAI conference on Artificial Intelligence, 2021.
  • Mitchum & Kelley (2010) Mitchum, A. L. and Kelley, C. M. Solve the problem first: Constructive solution strategies can influence the accuracy of retrospective confidence judgments. Journal of Experimental Psychology: Learning, Memory, and Cognition, 36(3):699, 2010.
  • Pataki (1998) Pataki, G. On the rank of extreme matrices in semidefinite programs and the multiplicity of optimal eigenvalues. Mathematics of operations research, 23(2), 1998.
  • Razavi et al. (2019) Razavi, A., van den Oord, A., and Vinyals, O. Generating diverse high-fidelity images with vq-vae-2. In Advances in Neural Information Processing Systems, 2019.
  • Santoro et al. (2017) Santoro, A., Raposo, D., Barrett, D. G., Malinowski, M., Pascanu, R., Battaglia, P., and Lillicrap, T. A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems, 2017.
  • Santoro et al. (2018) Santoro, A., Hill, F., Barrett, D., Morcos, A., and Lillicrap, T. Measuring abstract reasoning in neural networks. In International Conference on Machine Learning, 2018.
  • Van Den Oord et al. (2017) Van Den Oord, A., Vinyals, O., et al. Neural discrete representation learning. In Advances in Neural Information Processing Systems, 2017.
  • Wang et al. (2020) Wang, D., Jamnik, M., and Lio, P. Abstract diagrammatic reasoning with multiplex graph networks. In International Conference on Learning Representations, 2020.
  • Wang & Kolter (2019) Wang, P.-W. and Kolter, J. Z. Low-rank semidefinite programming for the max2sat problem. In AAAI Conference on Artificial Intelligence, 2019.
  • Wang et al. (2017) Wang, P.-W., Chang, W.-C., and Kolter, J. Z. The mixing method: low-rank coordinate descent for semidefinite programming with diagonal constraints. arXiv preprint arXiv:1706.00476, 2017.
  • Wang et al. (2019) Wang, P.-W., Donti, P. L., Wilder, B., and Kolter, Z. Satnet: Bridging deep learning and logical reasoning using a differentiable satisfiability solver. In International Conference on Machine Learning, 2019.
  • Wu et al. (2020) Wu, Y., Dong, H., Grosse, R., and Ba, J. The scattering compositional learner: Discovering objects, attributes, relationships in analogical reasoning. arXiv preprint arXiv:2007.04212, 2020.
  • Zhang et al. (2019a) Zhang, C., Gao, F., Jia, B., Zhu, Y., and Zhu, S.-C. Raven: A dataset for relational and analogical visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019a.
  • Zhang et al. (2019b) Zhang, C., Jia, B., Gao, F., Zhu, Y., Lu, H., and Zhu, S.-C. Learning perceptual inference by contrasting. In Neural Information Processing Systems, 2019b.
  • Zheng et al. (2019) Zheng, K., Zha, Z.-J., and Wei, W. Abstract reasoning with distracting features. In Advances in Neural Information Processing Systems, 2019.

Appendix A Detailed description of the MAXSAT problem

CNF formula. Conjunctive normal form (CNF) formula is a conjunction of clauses, where each clause is composed of OR operation of propositional variables. To be specific, the following is an example of the CNF formula with 2 propositional variables p1p_{1} and p2p_{2}:

(p1p2)(¬p1p2)(p1¬p2)(¬p1¬p2).\displaystyle(p_{1}\vee p_{2})\wedge(\neg p_{1}\vee p_{2})\wedge(p_{1}\vee\neg p_{2})\wedge(\neg p_{1}\vee\neg p_{2}). (1)

MAXSAT problem. MAXSAT problem aims to find values of propositional variables to maximize the number of clauses of given propositional logical formula in CNF. One may interpret the MAXSAT problem as a combinatorial optimization problem. Specifically, by letting P:=[p1,p2,,pn]{1,1}nP:=[p_{1},p_{2},\cdots,p_{n}]\in\{-1,1\}^{n} as nn propositional variables and S:=[s1,s2,sn]S:=[s_{1},s_{2}\cdots,s_{n}] as the CNF formula with mm clauses where si{1,0,1}ms_{i}\in\{-1,0,1\}^{m} is an ii-th propositional variable in mm-th clauses:

maximizeP{1,1}ni=1mj=1n𝟏(sijpj>0).\displaystyle\text{maximize}_{P\in\{-1,1\}^{n}}\sum_{i=1}^{m}\bigvee_{j=1}^{n}\bm{1}(s_{ij}p_{j}>0). (2)

We notice that there may not exist any values of propositional variables that satisfy all clauses in the CNF formula to be satisfied. For instance, consider the CNF formula (1) with propositional variables p1p_{1} and p2p_{2}: there are no assignments of two propositional variables p1,p2p_{1},p_{2} makes CNF formula satisfiable, while there exist assignments simultaneously satisfying three clauses out of four clauses exist.

Appendix B Detailed description of the reasoning layer

As MAXSAT problem in Appendix A is a combinatorial optimization problem which is not differentiable, several approaches (Wang & Kolter, 2019; Goemans & Williamson, 1995) attempt to relax such MAXSAT problems into the continuous optimization problem. Specifically, they propose the conversion of MAXSAT problem into semidefinite program (SDP). In depth, the MAXSAT problem in (2) can be formulated as the following optimization problem:

minVk×(n+1)<VTV,STS>subject to||vi||=1,i=,1,,n,\min_{V\in\mathbb{R}^{k\times(n+1)}}<V^{T}V,S^{\prime T}S^{\prime}>\quad\text{subject to}\;\;||v_{i}||=1,\quad i=\top,1,\cdots,n, (3)

where V:=[vv1vn]k×(n+1)V:=[v_{\top}\;v_{1}\;\cdots\;v_{n}]\in\mathbb{R}^{k\times(n+1)}, S:=[ss1sm]diag(1/(4|si|)m×(n+1)S^{\prime}:=[s_{\top}\;s_{1}\;\cdots\;s_{m}]\text{diag}(1/\sqrt{(4|s_{i}|})\in\mathbb{R}^{m\times(n+1)} with s={1}ms_{\top}=\{-1\}^{m} are relaxations of propositional variables and clauses, respectively. Here, the solution of original MAXSAT problem can be recovered from the solution of SDP in probabilistic manner via randomized rounding, i.e., (pi=1)=cos1(viTv)/π\mathbb{P}(p_{i}=1)=\cos^{-1}(-v_{i}^{T}v_{\top})/\pi (Goemans & Williamson, 1995; Wang et al., 2019). Moreover, Barvinok (1995); Pataki (1998) show the optimal solution of the original problem can be recovered from relaxed SDP if k>2nk>\sqrt{2n}, and Wang et al. (2017) proves coordinate descent update in this SDP on VV converges to the optimal fixed point.

With this continuous relaxation of the MAXSAT problem, Wang et al. (2019) proposes SATNet to bridge such continuously relaxed problems and the deep neural network. Specifically, they regard SS^{\prime} in (3) as the layer weight of the neural network, i.e., they define differentiable forward operation which solves (3) from current weights SS^{\prime} and backward operation to learn a relaxed logical formula from a given dataset via optimizing SS^{\prime}.

Appendix C Detailed description of the optimization scheme

We first notice that the input (t1,,tM)(t_{1},\ldots,t_{M}) of the reasoning layer SψS_{\psi} is non-differentiable, as it includes the 𝚊𝚛𝚐𝚖𝚊𝚡\mathtt{argmax} operator. Consequently, optimization of the reasoning loss 𝚛𝚎𝚊𝚜𝚘𝚗\mathcal{L}_{\mathtt{reason}} affects the layer parameter ψ\psi but not other parameters, e.g., codebook \mathcal{E}. To solve this problem, we propose to use the relaxed version of propositional embeddings (t1,,tM)(t_{1},\ldots,t_{M}), denoted by (t1,,tM)𝖳(t_{1}^{\prime},\ldots,t_{M}^{\prime})^{\scriptscriptstyle\mathsf{T}}, where each (ti)𝖳=(t(i,1),,t(i,H))𝖳=[0,1]H×K(t_{i}^{\prime})^{\scriptscriptstyle\mathsf{T}}=(t_{(i,1)}^{\prime},\ldots,t_{(i,H)}^{\prime})^{\scriptscriptstyle\mathsf{T}}=\in[0,1]^{H\times K} for i[M]\forall i\in[M] is defined as follows:

(t(i,h))𝚜𝚘𝚏𝚝𝚖𝚊𝚡(z(i,h)e12,,z(i,h)eK2),h[H].\displaystyle(t_{(i,h)}^{\prime})\coloneqq\mathtt{softmax}\big{(}-\lVert z_{(i,h)}-e_{1}\rVert_{2},\ldots,-\lVert z_{(i,h)}-e_{K}\rVert_{2}\big{)},\quad\forall h\in[H].

Moreover, we also note that there is no gradient in 𝚊𝚋𝚜𝚝\mathcal{L}_{\mathtt{abst}} due to the 𝚊𝚛𝚐𝚖𝚒𝚗\mathtt{argmin} operator to have qiq_{i} from each image xix_{i}. To compensate this issue, we simply use straight-through operator (Bengio et al., 2013), i.e., we copy gradients of the decoder input qiq_{i} to the encoder output ziz_{i}.

Hyperparameters. We note that LoGe contains following hyperparameters: the size of codebook \mathcal{E}, the size of spatial features (H,D)(H,D), the size of reasoning layer (n,m)(n,m) (see Appendix B), and the coefficient β\beta in the abstraction loss 𝚊𝚋𝚜𝚝\mathcal{L}_{\mathtt{abst}}. In all experiments, we use universal hyperparameter setups: ||=8|\mathcal{E}|=8, (H,D)=(25,192)(H,D)=(25,192), (n,m)=(1800,500)(n,m)=(1800,500), and β=0.25\beta=0.25.

Appendix D Details of the RAVEN/i-RAVEN dataset

RAVEN. RAVEN dataset (Zhang et al., 2019a) is a synthetic dataset to evaluate the abstract reasoning ability of machines, where each problem is a Raven’s Progressive Matrices (RPM) format. Specifically, the dataset consists of total 7 problem types: Center-Single (Center), Left-Right (L-R), Up-Down (U-D), Out-InCenter (O-IC), Out-InGrid (O-ID), 2x2Grid, and 3x3Grid, where each configuration contains 10000 problems. Here, we consider 4 out of total 7 configurations: Center, L-R, U-D, and O-IC. Each image contains five attributes: number of objects, position, shape, size, and color. For rules, the dataset contains 4 rules in total: the attribute is either constant, progressive, arithmetic, and distributed across each row in the problem. Figure 1 illustrates the Center configuration in RAVEN dataset, and more illustrations are provided in Figure 3.

i-RAVEN. i-RAVEN dataset (Hu et al., 2021) is a modified version of the RAVEN dataset with a different rule to generate a list of candidates in the problem. To be specific, Hu et al. (2021) finds there exists a shortcut bias in candidates in RAVEN dataset, i.e., one can find the answer only from candidates without accessing the context images of the problem. Accordingly, they propose a new RAVEN dataset in which such a bias is removed from the candidate set to better measure the reasoning ability of the discriminative abstract reasoning framework. Moreover, they find that the accuracy of existing methods significantly drops if this shortcut bias is removed.

Appendix E Detailed description of baselines

Baselines for qualitative results. We notice that the neural architecture in LoGe is composed of vector-quantized variational autoencoder (VQ-VAE) (Van Den Oord et al., 2017) and SATNet (Wang et al., 2019), based on employing propositional logical prior to solve abstract reasoning problems. To verify the effectiveness of this prior on the abstract reasoning, we qualitatively compare generated answer images from other widely used neural architectures without this assumption. As generative neural architectures for solving reasoning problems are under-explored, we compare the result with other architecture variants where VQ-VAE and SATNet is substituted to different neural architectures. Specifically, we compare results from combinations of different encoders (autoencoder and VQ-VAE) and reasoning networks (attention layer and 2-layer convolutional neural networks (CNNs) with different kernel sizes, where kernel sizes of CNNs set to 3 and 1).

Baselines for qualitative results. In the rest of this section, we briefly explain previous approaches to solve the abstract reasoning problem via deep neural networks.

  • LSTM (Zhang et al., 2019a) attempts to utilize LSTM to validate the inefficiency of conventional deep neural network architectures to solve reasoning problems.

  • WReN (Santoro et al., 2018) proposes to solve the abstract reasoning problem via relation network. (Santoro et al., 2017).

  • LEN (Zheng et al., 2019) proposes a variant of relation network, where the input of the network is a triplet of images rather than a pair of images. Furthermore, they empirically verify the performance can be further boosted with the curriculum learning based on a reinforcement learning framework.

  • CoPINet (Zhang et al., 2019b) suggests a contrastive learning algorithm to learn underlying rules from given images.

  • SRAN (Hu et al., 2021) designs a hierarchical neural network framework that simultaneously considers images in the problem individually and also at the row and column level.

Appendix F More illustration of generated results

In this section, we provide additional generated comparisons of different configurations in RAVEN deadset.

Refer to caption
Figure 4: Comparison of generated and ground truth image on Center configuration in RAVEN dataset among different architecture variants. (a)-(c): autoencoder with an attention layer and two convolutional networks with different kernel sizes as the reasoning network. We set the kernel size of 3 and 1 for (b) and (c), respectively. (d)-(e): VQ-VAE with an attention layer and a convolutional network with a kernel size of 3. Ours: our LoGe framework, which is composed of VQ-VAE and SATNet. GT: ground-truth answer image.
Refer to caption
Figure 5: Comparison of generated and ground truth image on U-D configuration in RAVEN dataset among different architecture variants. (a)-(c): autoencoder with an attention layer and two convolutional networks with different kernel sizes as the reasoning network. We set the kernel size of 3 and 1 for (b) and (c), respectively. (d)-(e): VQ-VAE with an attention layer and a convolutional network with a kernel size of 3. Ours: our LoGe framework, which is composed of VQ-VAE and SATNet. GT: ground-truth answer image.
Refer to caption
Figure 6: Comparison of generated and ground truth image on L-R configuration in RAVEN dataset among different architecture variants. (a)-(c): autoencoder with an attention layer and two convolutional networks with different kernel sizes as the reasoning network. We set the kernel size of 3 and 1 for (b) and (c), respectively. (d)-(e): VQ-VAE with an attention layer and a convolutional network with a kernel size of 3. Ours: our LoGe framework, which is composed of VQ-VAE and SATNet. GT: ground-truth answer image.
Refer to caption
Figure 7: Comparison of generated and ground truth image on O-IC configuration in RAVEN dataset among different architecture variants. (a)-(c): autoencoder with an attention layer and two convolutional networks with different kernel sizes as the reasoning network. We set the kernel size of 3 and 1 for (b) and (c), respectively. (d)-(e): VQ-VAE with an attention layer and a convolutional network with a kernel size of 3. Ours: our LoGe framework, which is composed of VQ-VAE and SATNet. GT: ground-truth answer image.