
A Short Survey of Systematic Generalization

Yuanpeng Li
Abstract

This survey covers systematic generalization and the history of how machine learning has addressed it. We aim to summarize and organize related work, both conventional approaches and recent improvements. We first look at the definition of systematic generalization, then introduce the Classicist and Connectionist approaches. We then discuss different types of Connectionist and how they approach the generalization. Two crucial problems, variable binding and causality, are discussed. We look into systematic generalization in the language, vision, and VQA fields, and we discuss recent improvements from different aspects. Systematic generalization has a long history in artificial intelligence, and we could cover only a small portion of the many contributions. We hope this paper provides useful background and is beneficial for discoveries in future work.

1 Introduction

Artificial intelligence with deep learning has improved rapidly in recent years. As it addresses many problems, an old question, systematic generalization (Fodor and Pylyshyn, 1988; Lake and Baroni, 2018), has returned to the spotlight. Systematic generalization requires correctly addressing unseen samples by recombining seen ones. For example, a model trained with blue rectangles and green triangles should predict blue triangles. Though this is straightforward for humans, it is still challenging for deep learning models. The problem has a long history and many recent related works. This survey summarizes and organizes the information into a series of subtopics. The initial ones mainly focus on historical perspectives, and the latter ones discuss recent works. We go through them in this introduction and look at their details in the following sections.

The fast growth of deep learning has addressed many i.i.d. problems in artificial intelligence. Systematic generalization, on the other hand, is an out-of-distribution (o.o.d.) generalization with disjoint training and test domains. It provides the ability for fast learning and creation, which is essential for more human-like intelligence and which current machine learning does not achieve (Section 2).

Artificial intelligence has Classicist and Connectionist approaches, and they have complementary advantages. The Connectionist approach (Feldman and Ballard, 1982) is good at i.i.d. generalization but not at systematic generalization (Fodor and Pylyshyn, 1988; Marcus, 1998). The Classicist approach, with symbol processing, is the opposite. Neural networks originate from the Connectionist model and are still weak at such generalization (Section 3).

Three types of Connectionist approaches have been explored to combine the advantages of Classicist and Connectionist. Eliminative Connectionist does not use symbolic processing. Hybrid Connectionist combines Connectionist and symbolic processing. Implementational Connectionist uses Connectionism to implement symbolic processing (Section 4).

Variable binding is a Connectionist topic closely related to systematic generalization. A variable is a placeholder, and it can be replaced with values. It decouples the manipulation of a variable and its value, so it generalizes to their combinations (Section 5).

Causality is also a related topic. It uses do-calculus with intervention and enables studying probabilities of counterfactual events. Systematic generalization is such a counterfactual event (Section 6).

Systematic generalization problems are widely encountered in different fields. We mainly focus on language, vision, and visual question answering (VQA). Many datasets are designed in these fields. Language has historically been more studied since sentences are more straightforward to process than images. The recent refocus on systematic generalization also started from language tasks (Lake and Baroni, 2018) (Section 7).

Many deep learning approaches have been recently proposed for systematic generalization, such as disentangled representation learning, meta-learning, attention mechanism, modular architecture, specialized architectures, and data augmentation (Section 8).

We cover the history of systematic generalization in artificial intelligence and summarize the recent development after the wide use of deep learning. We hope this survey is helpful as background information for potential future research. The following sections have more detailed discussions for each subtopic mentioned above.

2 Systematic Generalization

A fundamental property of artificial intelligence is generalization, where a trained model appropriately addresses unseen test samples. Many problems adopt the i.i.d. assumption, where training and test samples are independently drawn from the identical distribution (i.i.d. generalization). On the other hand, the test distribution can differ from the training distribution, and the two may have disjoint input domains or support. This means that the test samples have zero probability under the training distribution (o.o.d. generalization).

Systematic generalization is an o.o.d. generalization. Systematicity is a property where the ability to produce or understand some sentences (or objects in general) is intrinsically connected to that ability for others (Fodor and Pylyshyn, 1988). It usually uses factors of variation (Bengio et al., 2013) for recombination. For example, a model trained with blue rectangles and green triangles predicts blue triangles. Systematic generalization is often called compositional generalization, mainly in language domains.
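
As a toy illustration of such a recombination split (the factors and the held-out combination below are illustrative and not taken from any cited benchmark), the following sketch holds out one attribute combination from training so that predicting it requires recombining seen factors:

    from itertools import product

    # Toy factors of variation (hypothetical; not from any cited benchmark).
    colors = ["blue", "green", "red"]
    shapes = ["rectangle", "triangle", "circle"]

    # Hold out one combination: it never appears in training,
    # so predicting it correctly requires recombining seen factors.
    held_out = {("blue", "triangle")}

    all_pairs = set(product(colors, shapes))
    train_pairs = all_pairs - held_out   # i.i.d. training domain
    test_pairs = held_out                # o.o.d. (systematic) test domain

    print(sorted(train_pairs))
    print(sorted(test_pairs))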

Systematic generalization is considered the “Great Move” of evolution, developed to process an increasing amount and diversity of information from the environment; for example, humans can recognize new spatial relationships of seen objects (Newell, 1990). It is also related to the evolution of the prefrontal cortex (Robin and Holyoak, 1995). Cognitive scientists see such generalizations as central for an organism to view the world (Gallistel and King, 2009). A book (Calvo and Symons, 2014) contains works mainly from cognitive science perspectives.

It has been discussed that commonsense is critical for systematic generalization (McCarthy, 1959; Lenat et al., 1986). Researchers also seek general prior knowledge for systematic generalization, e.g., the Consciousness Prior (Bengio, 2017). Recently, different types of inductive bias were summarized (Goyal and Bengio, 2020). There are different ways to categorize systematicity. Three levels of systematicity, from weak to strong, are defined (Hadley, 1992). More precisely, six levels of systematicity are defined (Niklasson and van Gelder, 1994). Recently, five types of tests were summarized (Hupkes et al., 2020).

The development of deep learning reliably addresses many i.i.d. problems, which also encourages using it to address systematic generalization. Systematic generalization relates to many areas, including reasoning (Talmor et al., 2020), continual learning (Li et al., 2020), zero-shot learning (Sylvain et al., 2020), and language inference (Geiger et al., 2019). GFlowNets (Bengio et al., 2021) generate informative samples to address systematic generalization for active learning and exploration in reinforcement learning.

3 Classicist and Connectionist

Artificial intelligence has been developed with two approaches: Classicist and Connectionist. The terms were originally discussed for human cognition and also refer to approaches in artificial intelligence.

Classicist (Fodor, 1975; Pylyshyn, 1980) refers to computing operations on symbols (e.g., tokens) derived from Turing and Von Neumann machines. It typically means fixed serial rules with variables, e.g., computer programs. The Physical Symbol System (PSS) hypothesis (Newell and Simon, 1976) was developed to study the systematic mental representations of humans, and it says that human cognition is physically the product of a symbol system (“A physical symbol system has the necessary and sufficient means for intelligent action” (Newell and Simon, 1976)). Connectionist (Feldman and Ballard, 1982; Rumelhart et al., 1986b) uses many simple neuron-like units which are richly interconnected and processed in parallel (Hinton, 1991). It implies learning from data without explicit symbol information, e.g., neural networks.

While Connectionist is good at i.i.d. generalization, it does not address o.o.d. generalization in general (Fodor and Pylyshyn, 1988). On the other hand, Classicist is good at o.o.d. generalization but not at i.i.d. generalization (sometimes referred to as the “graceful degradation” issue). For example, computer programs can be reliable for new data that strictly fit input requirements, but they are brittle when processing noisy images or speech. There is a spectrum between Classicist and Connectionist, which is a trade-off between the advantages of both approaches (Section 4).

Distributed representation

Connectionist has distributed and localist representations (Feldman, 1986; Sejnowski and Rosenberg, 1987). We mainly discuss distributed representation, which is widely used and is more efficient when many items are present (Touretzky and Hinton, 1988). Distributed representation corresponds to Parallel Distributed Processing (PDP) models (Rumelhart et al., 1986b).

Distributed representations of symbols (Rumelhart et al., 1986a) were introduced to capture relationships between family members. A distributed representation can describe an object in terms of primitive descriptors; it has a significant advantage because it can describe a novel object using the same primitive descriptors to create novel combinations as representations (Hinton, 1990). During training, primitive descriptors or hidden nodes in distributed representation continually change their meanings; though they might be stable in the short term, they shift around in the longer term (Hinton et al., 1986a).

Localist representation often refers to a one-hot representation when used in deep learning. It is efficient in the following cases (Hinton, 1990): when a significant portion of samples activates a node, e.g., the end-of-sentence symbol in sentences, and when entries are mutually exclusive, e.g., classification outputs and input words.
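
A minimal illustration of the two encodings (the feature names below are hypothetical and not taken from the cited works):

    import numpy as np

    vocab = ["rectangle", "triangle", "circle"]

    # Localist (one-hot): one unit per item, mutually exclusive.
    def one_hot(word):
        vec = np.zeros(len(vocab))
        vec[vocab.index(word)] = 1.0
        return vec

    # Distributed: each item is a pattern over shared primitive descriptors,
    # so a novel item can reuse descriptors seen during training.
    #                          has_corners, num_sides/10, is_round
    distributed = {
        "rectangle": np.array([1.0, 0.4, 0.0]),
        "triangle":  np.array([1.0, 0.3, 0.0]),
        "circle":    np.array([0.0, 0.0, 1.0]),
    }

    print(one_hot("triangle"))       # [0. 1. 0.]
    print(distributed["triangle"])   # [1.  0.3 0. ]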

Disentangled representation

Distributed representation can be disentangled (Bengio et al., 2013). With fewer requirements, e.g., linearity, it is also referred to as factorized representation (Ke et al., 2021) or microfeatures (Hinton et al., 1986b). Many early works on systematic generalization study disentangled representations. Also, disentangled representation learning is mainly studied in unsupervised manners (Higgins et al., 2017), and it can be used as a feature extractor for systematic generalization models. These previous works usually do not discuss disentangled representation learning and systematic generalization together. However, considering them together is necessary in some cases because humans systematically generalize from entangled data, such as images and sentences.

Disentangled representation is also related to Invariant Risk Minimization (IRM) (Arjovsky et al., 2019; Peters et al., 2016), a learning paradigm that estimates invariant predictors from multiple training environments. IRM focuses on learning individual features. However, disentangled representations require decomposing an input into features, and the features together should keep the original input information (Bengio et al., 2013). IRM is used in domain generalization, and there is a recent related survey (Wang et al., 2022).

Connectionist and systematic generalization

It is argued that Connectionism lacks systematicity, which also indicates that the mind is not a Connectionist network (Fodor and Pylyshyn, 1988). Current machine learning methods also seem weak at generalization beyond the training distribution (Goyal et al., 2021c; Hendrycks and Dietterich, 2019), though such generalization is often necessary. Also, state-of-the-art models often learn spurious statistical patterns while humans avoid them (Nie et al., 2020).

There has been exploration of compositionality in neural networks for systematic behavior (Wong and Wang, 2007; Brakel and Frank, 2009), counting ability (Rodriguez and Wiles, 1998; Weiss et al., 2018), and sensitivity to hierarchical structure (Linzen et al., 2016). Systematicity has been partially achieved in previous work (Niklasson and van Gelder, 1994; Hadley and Hayward, 1997; Bodén and Niklasson, 2000). More recent related efforts are summarized (Jansen and Watter, 2012). There are also recent improvements in systematic generalization with structure design (Andreas et al., 2016; Gaunt et al., 2017) and structure prediction (Johnson et al., 2017; Hu et al., 2017). More recently, neural production systems (Goyal et al., 2021b; Goyal et al., 2021a), shared global workspace (Goyal et al., 2022), and reinforcement learning (Ke et al., 2021) have also been investigated to address this problem. We discuss more in Section 8.

System 1 and System 2 in human thinking

Human thinking has two systems. System 1 refers to fast and less reliable thinking: people constantly make unconscious decisions, and the process is mundane reasoning. System 2 refers to slow and more reliable thinking. It is a conscious process including logical thinking, e.g., solving a math problem. System 2 is a more precious resource, and people switch between the two systems to make the best use of resources. We mainly use System 1 in daily life and System 2 for complicated problems that System 1 cannot address. An introductory book (Kahneman, 2011) contains many examples, experiments, and discussions of how humans think on different occasions.

Systematic generalization is more related to System 2, where a decision in a new environment is inferred by logical reasoning over familiar knowledge. Human thinking is also discussed from other perspectives. Dynamic memory is considered a fundamental ingredient of intelligence (Schank, 1982). The consciousness prior (Bengio, 2017) is also proposed in deep learning; it leads to a sparseness prior and attention-based modular networks.

4 Types of Connectionist

Classicists and Connectionists have complementary advantages. Implementational Connectionist merely implements a Classicist symbol-manipulation system. On the contrary, eliminative or radical Connectionist is not designed with knowledge of the symbol system, so it eliminates the PSS hypothesis. There are also hybrid methods that combine both of them. Eliminative, hybrid, and implementational Connectionists make increasingly strong use of symbol systems.

4.1 Eliminative Connectionist

Eliminative Connectionist eliminates the symbol system by purely using distributed representation (Pinker and Prince, 1988; Marcus, 1998). It is attractive because it avoids knowledge of symbol systems. For example, a model without symbolic rules or lexicon can learn past tense grammar rules and exceptions in English verbs (Rumelhart and Mcclelland, 1986).

Eliminative Connectionist also uses prior knowledge, excluding symbol systems, e.g., layered architecture. Also, some architecture designs, such as convolutional layers and attention mechanisms, are rarely regarded as symbol systems. Regularization algorithms like noise insertion and optimization algorithms like Adam are not much related to symbol systems. Modular architectures design a module for each factor, but the prior knowledge is not specific to each symbol.

Pure eliminative Connectionist has many challenges. One long-standing problem is that a model has a fixed-length vector of units to represent recursive symbol structures (Pollack, 1990), which vary in size and complexity (Holyoak and Hummel, 2000), e.g., context-free languages. Tree-structured composition (Bowman et al., 2015) was developed to address such cases.

4.2 Implementational Connectionist

Implementational Connectionist (Ballard, 1986; Pinker and Prince, 1988; Hinton and Anderson, 1981; Hinton et al., 1986b; Touretzky, 1986) directly designs symbol-processing architectures in PDP models (Chalmers, 1993). It was argued that the only Connectionist approach to achieve systematic generalization is implementational Connectionist (Fodor and Pylyshyn, 1988). Such approaches share the explanatory capability and the empirical results of Classicist models.

For example, PDP networks can implement parts of LISP and production systems (Touretzky and Hinton, 1985). BoltzCONS (Touretzky, 1986) is another example. Also, μKLONE (Derthick, 1990) uses microfeatures to implement functionality similar to the knowledge representation system KL-ONE.

4.3 Hybrid Neural Systems

It is proposed to integrate symbolic processing into neural networks (Hinton, 1990; Sun, 1996; Wermter and Sun, 1998) to take advantage of both. The majority of early hybrid systems have a neural network and rule-based modules (McGarry et al., 1999).

Like Classicist systems, neural networks have been designed with separate modules for storing values and for operations (Miikkulainen, 1993). It has been shown that some logic can be translated into neural networks (Shavlik, 1994). Methods are proposed for variable binding (Section 5) with tensor products (Smolensky, 1990) and semantic pointers (Eliasmith, 2013). There are also semi-local tensor product representations, such as the semantic net model (Hinton and Anderson, 1981). Neural-symbolic methods are proposed for logical operations (Niklasson and van Gelder, 1994) and reasoning (D’Avila Garcez et al., 2009; d’Avila Garcez et al., 2019).

Recently, the Neuro-Symbolic Concept Learner (Mao et al., 2019) has been proposed for VQA. The Tensor-Product Transformer combines BERT and tensor products to represent symbolic variables and their bindings (Schlag et al., 2020). Neural-symbolic stack machines (Chen et al., 2020) are used for instruction learning problems. Recent work also includes Edge Transformers (Bergen et al., 2021), which combine Transformers and rule-based symbolic systems; inspired by logic programming, they propose triangular attention to manipulate pairs of input nodes. It is proposed to use symbolic modules to examine the logical reasoning of neural sequence modules (Nye et al., 2021). A book (Hinton, 1991) contains works on connectionist symbol processing.

5 Variable binding

Variable binding assigns a value to a variable. It is a difficult and important problem in Connectionist models (Barnden, 1984), and it is required for complex reasoning tasks (Browne and Sun, 2000) and more efficient computation (Sun, 1992). Systematic generalization requires learning true rules containing variables, so it must do something equivalent to variable binding (Touretzky and Hinton, 1988). Manipulation of variables is essential for animal cognition (Gallistel and King, 2009). For example, honeybees extend the solar azimuth function to unseen lighting conditions (Dyer and Dickinson, 1994). Variable binding may be one of the reasons that people are forced to be sequential processors (Newell, 1980a).

We look at an example. Variable binding enables one general rule, dog(X) → barks(X), which means “X barks if X is a dog.” Without variable binding, we need a specific rule for each possible value, such as dog(rei) and dog(bue) (Browne and Sun, 2000). Similarly, the binding problem occurs when we encode feature conjunctions in a representation, e.g., a red triangle and a blue square (Treisman, 1998). Production systems (Touretzky and Hinton, 1988; Goyal et al., 2021a) have rule and placeholder variables, and variable binding is required for both of them.
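
A small sketch of the difference (the predicate and constant names are illustrative):

    # A rule with a variable applies to any binding of X (sketch; names are illustrative).
    def barks_rule(facts, x):
        """dog(X) -> barks(X): bind the variable X to a concrete value."""
        return ("dog", x) in facts

    facts = {("dog", "rei"), ("dog", "bue")}

    # The same rule works for values never mentioned when the rule was written,
    # as long as the corresponding fact is present.
    facts.add(("dog", "new_dog"))
    print(barks_rule(facts, "rei"))      # True
    print(barks_rule(facts, "new_dog"))  # True

    # Without variable binding, each value needs its own rule:
    def barks_rei(facts):  return ("dog", "rei") in facts
    def barks_bue(facts):  return ("dog", "bue") in facts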

It is argued that eliminative Connectionists cannot explicitly address the variable binding problem (Holyoak and Hummel, 2000). Many Connectionist researchers have considered embedding symbol systems in a neural network for variable binding (Feldman and Ballard, 1982; Pollack, 1990). Holographic Reduced Representations (Plate, 1991) use convolution. Temporal Synchrony (Shastri and Ajjanagadde, 1993) is proposed for reasoning. Analogical Access and Mapping (Hummel and Holyoak, 1997) mainly regards tree-structure grammar in language. Please also refer to recent surveys (Gosmann and Eliasmith, 2019; Frady et al., 2021).

6 Causality

Causal learning has a long history, rooted in the eighteenth century (Hume, 2003) and the classical field of AI (Pearl, 2003). The primary exploration has been from statistical perspectives (Pearl, 2009; Peters et al., 2016; Greenland et al., 1999; Pearl, 2018). Much causal reasoning literature is built upon do-calculus (Pearl, 1995, 2009) and interventions (Peters et al., 2016), though some early work does not consider interventions (Heckerman et al., 1995). The question of how to separate correlation and causation is raised (Welling, 2015).

The causation forms Independent Causal Mechanisms (Peters et al., 2017; Schölkopf et al., 2021), or ICMs, which avoid spurious connections (“ICM principle: The causal generative process of a system’s variables is composed of autonomous modules that do not inform or influence each other” (Schölkopf et al., 2021)). ICMs are robust across different domains (Schölkopf et al., 2016) and support systematic generalization (Parascandolo et al., 2018; Goyal et al., 2021c).

Causality and variable binding have been discussed in different fields, though they are closely related. For example, an ICM states that an output variable depends only on a corresponding input variable. It binds the output variable and the values of a factor in a disentangled input variable. Causality also indicates that the binding is robust in different domains.

Systematic generalization corresponds to a counterfactual where the joint input distribution is intervened on to take new values that have zero probability in training (covariate shift). With ICMs, models can be trained with data distributions induced by causal models to achieve systematic generalization (Tsirtsis et al., 2020). For example, causal mechanisms are augmented into generative models for constructing images and planning (Kocaoglu et al., 2018; Kurutach et al., 2018).

As mentioned in the book (Peters et al., 2017), causality research is still in an early stage, and the assumptions are not general. The theory has more results on linear models. Some main approaches include the following.

  • Independence-based methods

  • Restricted structural models, such as additive noise

  • Invariant causal prediction

The current approaches mainly study disentangled input representations. Research for entangled representations has been recently conducted (Bengio et al., 2020), and it uses adaptation speed to learn causality with meta-learning. However, extending such a representation-learning algorithm to multiple variables is still challenging. Some work includes multiple variables but disentangled data (Ke et al., 2019).

One possibility is to use the causality module as an intermediate part of a neural network model (Schölkopf et al., 2021). A model can be divided into an encoder, an intermediate network, and a decoder. The encoder converts an entangled input to a disentangled input representation. The intermediate network models the causality between input and output disentangled representations. The decoder converts a disentangled output representation to an entangled output. Graph networks (Battaglia et al., 2018) can be used as intermediate networks (Ke et al., 2021).
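
A minimal sketch of this encoder-intermediate-decoder layout is given below (the layer sizes, the per-factor linear mechanisms, and the class name are illustrative assumptions rather than an implementation from the cited works; a graph network could replace the per-factor modules):

    import torch
    import torch.nn as nn

    class CausalIntermediateModel(nn.Module):
        """Encoder -> intermediate module over disentangled factors -> decoder."""
        def __init__(self, input_dim=64, num_factors=4, factor_dim=8, output_dim=64):
            super().__init__()
            # Encoder: entangled input -> disentangled factor representation.
            self.encoder = nn.Linear(input_dim, num_factors * factor_dim)
            # Intermediate network: one small module per factor, standing in for
            # the mechanisms between input and output disentangled variables.
            self.mechanisms = nn.ModuleList(
                [nn.Linear(factor_dim, factor_dim) for _ in range(num_factors)]
            )
            # Decoder: disentangled output factors -> entangled output.
            self.decoder = nn.Linear(num_factors * factor_dim, output_dim)
            self.num_factors, self.factor_dim = num_factors, factor_dim

        def forward(self, x):
            factors = self.encoder(x).view(-1, self.num_factors, self.factor_dim)
            out_factors = torch.stack(
                [m(factors[:, i]) for i, m in enumerate(self.mechanisms)], dim=1
            )
            return self.decoder(out_factors.flatten(start_dim=1))

    model = CausalIntermediateModel()
    print(model(torch.randn(2, 64)).shape)  # torch.Size([2, 64])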

7 Language, Vision, and VQA

Systematic generalization has been studied in different fields, and we look into language, vision, and VQA.

7.1 Language

Systematicity is often referred to as compositionality in the language domain. They may be different aspects of the same phenomenon (Fodor and Pylyshyn, 1988). Compositionality is the algebraic capacity to understand and produce novel combinations from known components (Chomsky, 1957; Montague, 1970). For example, a person who knows how to “step,” “step twice,” and “jump” naturally knows how to “jump twice” (Lake and Baroni, 2018). This generalization ability is critical in human cognition (Minsky, 1986; Lake et al., 2017). It helps humans to learn languages flexibly and efficiently from limited data and extend to unseen sentences.

Human-level compositional learning has been an open challenge (Yang et al., 2019). With the breakthroughs in sequence-to-sequence neural networks for NLP, such as RNN (Sutskever et al., 2014), Attention (Xu et al., 2015), Pointer Network (Vinyals et al., 2015), and Transformer (Vaswani et al., 2017), there are more contemporary attempts to encode compositionality in sequence-to-sequence neural networks. Words are natural symbols in language and are extended to word embeddings (Deerwester et al., 1990). Further, neural language models (Bengio et al., 2003) introduce interpretable word embeddings.

The SCAN dataset (Lake and Baroni, 2018) is an early compositional generalization dataset of recent years. It is a sequence-to-sequence task that translates natural language into a sequence of robot actions. It considers several aspects of compositional generalization. One of them is primitive substitution, where a word is replaced with another, and the combination of the word and the context is new; see the “jump” example above. Many related tasks (Loula et al., 2018; Liška et al., 2018; Bastings et al., 2018; Lake et al., 2019) are also proposed.
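
An illustrative primitive-substitution split in the style of SCAN is sketched below (the command and action tokens are simplified; the released dataset uses its own action vocabulary):

    # Illustrative SCAN-style command/action pairs (token names are simplified).
    train_pairs = [
        ("walk", "WALK"),
        ("walk twice", "WALK WALK"),
        ("jump", "JUMP"),
    ]
    # Primitive-substitution test: "jump" was seen only in isolation during
    # training, so "jump twice" requires recombining the primitive with a
    # modifier seen with other primitives.
    test_pairs = [
        ("jump twice", "JUMP JUMP"),
    ]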

Multiple methods (Bastings et al., 2018; Loula et al., 2018; Kliegl and Xu, 2018; Chang et al., 2019) have been proposed using various RNN models and attention mechanisms. These methods successfully generalize when the difference between training and test data is slight. Requirements for systematic generalization are discussed (Bahdanau et al., 2019b), concluding that additional regularization or priors are necessary for modular designs. The SCAN dataset inspired multiple approaches (Russin et al., 2019; Lake, 2019; Li et al., 2019; Andreas, 2020; Gordon et al., 2020; Liu et al., 2020; Chen et al., 2020) discussed in the next section.

The CFQ dataset considers syntactic compositionality in real data (Keysers et al., 2020). It generally requires recombining syntactic structures beyond primitive substitution. The methods designed for the SCAN dataset do not work well on CFQ, while pretraining provides improvements (Furrer et al., 2020). Semantic parsing approaches also address part of the problem (Shaw et al., 2021). There are analyses of training data size (Tsarkov et al., 2021) and model size (Qiu et al., 2022b) for compositional generalization.

There are other recent semantic parsing datasets. COGS (Kim and Linzen, 2020) is a synthetic dataset with pairs of sentences and logical forms, and the generalization test set evaluates novel linguistic structures. PCFG (Hupkes et al., 2020) manipulates executable operations. GeoQuery (Shaw et al., 2021) is a non-synthetic dataset with pairs of questions and meaning representations annotated by humans. It has three systematic generalization splits. The template split has disjoint abstract output templates for training and test data. The TMCD split makes training and test compound distributions as divergent as possible. The length split has different lengths for training and test data. SMCalFlow-CS (Yin et al., 2021) is a split of SMCalFlow for compositional skills. A machine translation dataset was also recently proposed (Dankers et al., 2022). Math expressions can be treated as language, and a mathematical reasoning dataset has been proposed (Saxton et al., 2019).

7.2 Vision

Systematic generalization is often referred to as zero-shot learning in vision (Rohrbach et al., 2011; Larochelle et al., 2008; Yu and Aloimonos, 2010; Xu et al., 2017; Ding et al., 2017). The difference is that vision tasks are additionally given attributes (factors) for classes or samples. Common datasets include AWA (Lampert et al., 2014; Xian et al., 2019), CUB (Wah et al., 2011), SUN (Patterson and Hays, 2012), and aPY (Farhadi et al., 2009). There are also recent vision benchmarks (Hendrycks and Dietterich, 2019; Hendrycks et al., 2020; Tang et al., 2021) for systematic generalization.

Many approaches have been proposed with linear (Frome et al., 2013; Romera-Paredes and Torr, 2015; Akata et al., 2013, 2015) and nonlinear (Socher et al., 2013; Norouzi et al., 2014) compatibility models. Other algorithms learn independent attributes (Lampert et al., 2014). There are also hybrid models between them (Changpinyo et al., 2016; Zhang and Saligrama, 2015; Xian et al., 2016). There are related surveys (Wang and Deng, 2018; Zhou et al., 2022).

Using attributes or other side information makes the problem easier than systematic generalization. Much work has been done to avoid attribute annotation, e.g., one-shot learning of novel image classes (Mensink et al., 2012), external lexical information for class embeddings (Rohrbach et al., 2011; Akata et al., 2015), and visual descriptions (Reed et al., 2016). Other work has been done to understand the systematicity of images (Goyal et al., 2022). It has also been argued that zero-shot learning is related to the attention mechanism (Sylvain et al., 2020).

A related topic is domain generalization with multiple vision datasets, such as PACS (Li et al., 2017), VLCS (Torralba and Efros, 2011), MNIST-M (Ganin and Lempitsky, 2015), and NICO (He et al., 2021; Zhang et al., 2022). NICO labels both concept and context, and the context can be attributes or backgrounds.

7.3 VQA

Both language and vision are essential for human recognition, and VQA (Antol et al., 2015) combines them. VQA naturally includes grounding, which finds the mapping between words and objects or their properties. Systematicity is also applicable and critical in other multimodal problems, including Image Captioning (Karpathy and Li, 2015), Image Generation (Klinger et al., 2020), and Embodied Question Answering (Das et al., 2018).

In early VQA, it was found that trained models are likely to learn superficial and spurious relations between input and output. For example, when a question asks what is on the ground, the answer is likely to be snow, because the question tends to be asked when there is something notable like snow on the ground. These are systematic generalization problems.

VQA datasets are designed for systematic generalization. CLEVR (Johnson et al., 2017) contains the Compositional Generalization Test (CoGenT) for novel attribute combinations at test time. CLOSURE (Bahdanau et al., 2019a) measures systematic generalization on the CLEVR dataset. Another VQA dataset is SQOOP (Bahdanau et al., 2019b). GQA is a more realistic dataset (Hudson and Manning, 2019).

Algorithms for visual question answering include the architecture designs of Neural Module Networks (Andreas et al., 2016), FiLM (Perez et al., 2018), Relation Networks (Santoro et al., 2017), and MAC networks (Hudson and Manning, 2018). Latent compositional representations (Bogin et al., 2021) also help.

Also, following the SCAN dataset for one-shot learning in language, the gSCAN dataset was proposed for one-shot learning problems in grounding and visual question answering (Ruis et al., 2020). The input is a human language instruction and an environment, and the output is a sequence of robot actions. A study on gSCAN shows that it is crucial to think before acting (Heinze-Deml and Bouchacourt, 2020). Object relations are modeled in the contexts (Gao et al., 2020). It is important to fit the network structure to the compositional structure of the problem (Kuo et al., 2021). A general transformer with cross-modal attention achieves nearly perfect results for the majority of splits, and the remaining problems correspond to the fundamental challenges of compositional generalization for language (Qiu et al., 2021).

There are also various simulated settings for grounded language acquisition with reinforcement learning, such as X World (Yu et al., 2018), BabyAI (Chevalier-Boisvert et al., 2019), and others (Hermann et al., 2017; Wu et al., 2018).

8 Recent Improvements

Different systematic generalization approaches have been investigated. However, such generalization is still difficult for deep learning in general (Hendrycks and Dietterich, 2019; Goyal et al., 2021c). The main directions include disentangled representation learning, meta-learning, attention mechanisms, modular architectures, specialized architectures, and data augmentation.

8.1 Disentangled Representation Learning

It was argued that good representations should help express the regularities (Hinton, 1990). Disentangled representation learning (Bengio et al., 2013) is developing quickly. Early work learns the representation from statistical marginal independence (Higgins et al., 2017; Burgess et al., 2018; Locatello et al., 2019).

A definition of disentangled representation has recently been proposed based on symmetry transformations in physics (Higgins et al., 2018). It leads to Symmetry-based Disentangled Representation Learning (Caselles-Dupré et al., 2019; Quessard et al., 2020; Painter et al., 2020; Pfau et al., 2020). Such approaches explain disentangled representation using group theory and physics.

It is mentioned that disentangled representation is an example of ICM learning (Schölkopf et al., 2021). There are also methods to measure compositionality in representations (Andreas, 2019). Disentangled representation learning tends to be discussed without simultaneously considering systematic generalization. It can serve as a feature extractor to obtain disentangled representations, which are then used as inputs for downstream modules in systematic generalization tasks.

8.2 Meta-learning

Meta-learning is an approach for systematic generalization (Lake, 2019). It usually designs a series of training tasks for learning a meta-learner, which is used to address the problem in the target task. There are training and test data in each training task, where the test data requires systematic generalization from the training data. The training tasks need to have structures similar to the target task so that the meta-learner can learn how to generalize from the training data in the target task.
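
A sketch of how such episodes could be constructed is given below (a purely hypothetical command grammar; this is not the sampling scheme of any specific cited method):

    import random

    # Toy command grammar (illustrative).
    primitives = ["walk", "run", "look", "jump"]
    modifiers = ["twice", "thrice"]

    def make_episode():
        """One meta-training task: the query requires recombining a held-out primitive."""
        held_out = random.choice(primitives)
        # Support: primitive+modifier combinations except those with the held-out
        # primitive, plus the held-out primitive on its own.
        support = [f"{p} {m}" for p in primitives if p != held_out for m in modifiers]
        support.append(held_out)
        # Query: novel combinations of the held-out primitive with seen modifiers.
        query = [f"{held_out} {m}" for m in modifiers]
        return support, query

    support, query = make_episode()
    print(query)  # e.g. ['jump twice', 'jump thrice']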

When ICMs are available, they can be used to generate meta-learning tasks (Schölkopf et al., 2021). Employing meta-reinforcement learning for causal reasoning has been discussed (Dasgupta et al., 2019). Meta-learning can also capture the adaptation speed to discover causal relations (Bengio et al., 2020; Ke et al., 2019). However, it is hard to disentangle the factors when multiple variables exist.

There are other works with meta-learning. Pairs of meta-learning tasks are constructed from sub-sampling training data (Conklin et al., 2021). Representation and task-specific layers of models are trained differently to generalize mismatched splits on pre-finetuning tasks, so transfer learning between compositional generalization tasks is enabled (Zhu et al., 2021).

8.3 Attention Mechanism

Attention mechanisms, especially key-value attention mechanisms, are widely used in neural networks (Bahdanau et al., 2015). The key-value mechanisms are composed of a query, keys, and values. The query and the keys generate an attention map, which extracts a value from the values. An attention map is similar to a pointer, often used in symbol processing. It is also a type of distal access (Newell, 1980b ), which uses an abbreviated tag for referring to a structure. A symbol is informally regarded as a small representation of an object, which provides “remote access” for the fuller representation of an object (Hinton, 1990).
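
A minimal sketch of key-value (scaled dot-product) attention as described above, written in NumPy:

    import numpy as np

    def attention(query, keys, values):
        """Scaled dot-product attention: the query and keys form an attention map
        that mixes the values (a soft, differentiable version of a pointer)."""
        d = query.shape[-1]
        scores = keys @ query / np.sqrt(d)          # one score per key
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                    # softmax attention map
        return weights @ values                     # weighted sum of values

    keys = np.random.randn(5, 8)     # 5 slots, dimension 8
    values = np.random.randn(5, 8)
    query = np.random.randn(8)
    print(attention(query, keys, values).shape)  # (8,)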

Transformers (Vaswani et al., 2017) are modern neural network architectures with self-attention. Recurrent Independent Mechanisms (Goyal et al., 2021c) use attention mechanisms and the names of the incoming nodes for variable binding. The shared global workspace (Goyal et al., 2022) improves them by using limited-capacity global communication to enable the exchangeability of knowledge for systematic generalization. A discrete-valued communication bottleneck (Liu et al., 2021) further enhances the generalization.

Different extensions to attention modules are discussed (Oren et al., 2020). Auxiliary objectives to bias attention in encoder-decoder models are proposed (Yin et al., 2021; Jiang and Bansal, 2021). There are also sparse variants (Shazeer et al., 2017) of attention. Compositional Attention (Mittal et al., 2022b) disentangles search and retrieval in the Transformer architecture. It addresses redundancies in multi-head attention with different numbers of searches and retrievals and dynamic selection.

We would like to discuss the relationship between the attention mechanism and ICMs. The sparse-connection prior knowledge (Bengio, 2017) has two types. The first is sparseness on a dynamic graph, or the routes for each sample; it corresponds to attention mechanisms. The second is sparseness on a static graph, or the connections between variables; it corresponds to ICMs. ICMs enable systematic generalization. The dynamic sparseness may not imply the static one, so attention may not establish ICMs to enable the generalization. However, attention reduces the size of a module input, so test inputs are more likely to remain in the training domain, which helps systematic generalization. For example, a word is a part of an input sentence, so an attended word can be correctly processed even if the sentence is unseen. Also, the attention mechanism is usually an operator and does not contain parameters, so the mechanism suffers less from the change of distribution.

8.4 Modular architectures

Modular architectures have a long history, such as the mixture of experts (Jacobs et al., 1991b; Jordan and Jacobs, 1994). Early related ideas apply micro-inference, which uses some of the features of some of the role-fillers to infer some of the features of the other role-fillers (Hinton, 1990). There are also recent results (Graves et al., 2014; Andreas et al., 2016; Hu et al., 2017; Vaswani et al., 2017; Goyal et al., 2021c; Goyal et al., 2021a; Mittal et al., 2020; Ke et al., 2021).
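
A minimal mixture-of-experts sketch in this spirit is shown below (the layer sizes, gating, and naming are illustrative assumptions):

    import torch
    import torch.nn as nn

    class MixtureOfExperts(nn.Module):
        """Softmax gating network mixes the outputs of independent expert modules."""
        def __init__(self, in_dim=16, out_dim=16, num_experts=4):
            super().__init__()
            self.experts = nn.ModuleList(
                [nn.Linear(in_dim, out_dim) for _ in range(num_experts)]
            )
            self.gate = nn.Linear(in_dim, num_experts)

        def forward(self, x):
            weights = torch.softmax(self.gate(x), dim=-1)               # (batch, experts)
            outputs = torch.stack([e(x) for e in self.experts], dim=1)  # (batch, experts, out)
            return (weights.unsqueeze(-1) * outputs).sum(dim=1)

    moe = MixtureOfExperts()
    print(moe(torch.randn(3, 16)).shape)  # torch.Size([3, 16])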

Modular architectures are natural for combinatorial generalization (Battaglia et al., 2018). There are task-specific modular networks (Jacobs et al., 1991a). Though modules can be designed for different factors, the input to each module may still have spurious influence from other factors when the model input is entangled. It can be helpful to apply entropy regularization to bottleneck the modules in such cases (Li et al., 2019).

Attention mechanisms can be used with modular architectures (Riemer et al., 2016; Peters et al., 2017; Mittal et al., 2022a). Object-centric slot attention (Locatello et al., 2020) finds objects for downstream networks. Neural Interpreters (Rahaman et al., 2021) factorize inference into modules in a self-attention network and can be trained end-to-end by routing through modules.

Modular and compositional computation in routing networks has been analyzed (Rosenbaum et al., 2019). A differentiable weight mask has been used to examine the modularity of neural networks, finding that neural networks are not trained to be modular. Common modular architectures are assessed for collapse and specialization problems (Mittal et al., 2022a), finding that end-to-end learned modular systems are not optimal.

8.5 Specialized architectures

Another common approach is specialized architecture design (Russin et al., 2019; Gordon et al., 2020; Liu et al., 2020; Chen et al., 2020). The importance of design decisions is reported (Ontanon et al., 2022).

Transformers significantly improve semantic parsing when model configurations are carefully adjusted, and Universal Transformer variants also work well (Csordás et al., 2021). Reordering and aligning the structure (Wang et al., 2021a) models segment-to-segment alignments with a neural reordering module for separable permutations. A span-based parser (Herzig and Berant, 2021) treats a tree as a hidden variable.

Large pre-trained language models convert inputs to intermediate language representations for semantic parsing (Shin et al., 2021). Intermediate representations help compositional generalization for pre-trained seq2seq models (Herzig et al., 2021). Program synthesis (Nye et al., 2020) learns explicit programs from training data. Semantic tagging (Zheng and Lapata, 2021) trains an alignment tagger by entity linking with λ-calculus and SQL expressions; it uses tags to supervise hidden variables. Iterative decoding (Ruiz et al., 2021) breaks training examples down into a sequence of intermediate steps.

8.6 Data augmentation

Data augmentation is primarily used for language tasks, as words and phrases in a sentence are more straightforward to modify than the pixels in images. Multiple approaches are proposed for adding data (Guo et al., 2021; Wang et al., 2021b; Guo et al., 2020) and training from labeled data (Yu et al., 2021; Zhong et al., 2020).

GECA (good-enough compositional augmentation) (Andreas, 2020) is a rule-based protocol for sequence modeling. It provides inductive bias for compositionality. It can replace discontinuous sentence fragments, e.g., “Tom picks apples up.” R&R (recombine and resample) (Akyürek et al., 2021) learns schemes for data augmentation. It replaces the symbolic generative process with neural models and obtains the inductive bias as explicit rules.

Data recombination (Jia and Liang, 2016) injects task-specific prior knowledge for modeling logical regularities in semantic parsing. It induces synchronous context-free grammar from training data. It has also been shown that a diverse sampling of synthetic example structures helps systematic generalization (Oren et al., 2021). CSL (Compositional Structure Learning) (Qiu et al., 2022a) is a generative model with context-free grammar induced from training data. The examples from CSL are recombined and used to fine-tune a pre-trained model. Subtree substitution (Yang et al., 2022) has also been studied for data augmentation.

There is also data augmentation for images, e.g., interpolating both image input and label output (Yao et al., 2022). Stable learning (Zhang et al., 2021) learns weights for training samples to remove dependencies between features.

9 Conclusion

Systematic generalization is a critical capability for artificial intelligence. While it is straightforward for classic symbol processing approaches, it is difficult for Connectionist approaches. It has been discussed together with the crucial problems of variable binding and causal learning. Our discussion covers different AI fields, such as language, vision, and VQA, and there are recent improvements from different aspects. Though some specific problems are addressed, there are still many things unknown about systematic generalization in deep learning. We hope this survey helps in understanding the background and inspiring future work.

Acknowledgments

We thank Liang Zhao for beneficial discussions, suggestions, and adding information. We also thank Yi Yang for the helpful advice.

References

  • Akata et al., (2013) Akata, Z., Perronnin, F., Harchaoui, Z., and Schmid, C. (2013). Label-embedding for attribute-based classification. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, pages 819–826.
  • Akata et al., (2015) Akata, Z., Reed, S., Walter, D., Lee, H., and Schiele, B. (2015). Evaluation of output embeddings for fine-grained image classification. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2927–2936.
  • Akyürek et al., (2021) Akyürek, E., Akyürek, A. F., and Andreas, J. (2021). Learning to recombine and resample data for compositional generalization. In International Conference on Learning Representations.
  • Andreas, (2019) Andreas, J. (2019). Measuring compositionality in representation learning. In International Conference on Learning Representations.
  • Andreas, (2020) Andreas, J. (2020). Good-enough compositional data augmentation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7556–7566, Online. Association for Computational Linguistics.
  • Andreas et al., (2016) Andreas, J., Rohrbach, M., Darrell, T., and Klein, D. (2016). Neural module networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 39–48.
  • Antol et al., (2015) Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., and Parikh, D. (2015). Vqa: Visual question answering. In Proceedings of the IEEE international conference on computer vision, pages 2425–2433.
  • Arjovsky et al., (2019) Arjovsky, M., Bottou, L., Gulrajani, I., and Lopez-Paz, D. (2019). Invariant risk minimization. arXiv.
  • Bahdanau et al., (2015) Bahdanau, D., Cho, K., and Bengio, Y. (2015). Neural machine translation by jointly learning to align and translate. In ICLR 2015 : International Conference on Learning Representations 2015.
  • Bahdanau et al., (2019a) Bahdanau, D., de Vries, H., O’Donnell, T. J., Murty, S., Beaudoin, P., Bengio, Y., and Courville, A. (2019a). Closure: Assessing systematic generalization of clevr models. arXiv preprint arXiv:1912.05783.
  • Bahdanau et al., (2019b) Bahdanau, D., Murty, S., Noukhovitch, M., Nguyen, T. H., de Vries, H., and Courville, A. (2019b). Systematic generalization: What is required and can it be learned? In International Conference on Learning Representations.
  • Ballard, (1986) Ballard, D. H. (1986). Cortical connections and parallel processing: structure and function. Behavioral and Brain Sciences, 9(1):67–90.
  • Barnden, (1984) Barnden, J. A. (1984). On short-term information processing in connectionist theories. cognition and brain theory.
  • Bastings et al., (2018) Bastings, J., Baroni, M., Weston, J., Cho, K., and Kiela, D. (2018). Jump to better conclusions: SCAN both left and right. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 47–55, Brussels, Belgium. Association for Computational Linguistics.
  • Battaglia et al., (2018) Battaglia, P. W., Hamrick, J. B., Bapst, V., Sanchez-Gonzalez, A., Zambaldi, V., Malinowski, M., Tacchetti, A., Raposo, D., Santoro, A., Faulkner, R., et al. (2018). Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261.
  • Bengio et al., (2021) Bengio, E., Jain, M., Korablyov, M., Precup, D., and Bengio, Y. (2021). Flow network based generative models for non-iterative diverse candidate generation. Advances in Neural Information Processing Systems, 34:27381–27394.
  • Bengio, (2017) Bengio, Y. (2017). The consciousness prior. arXiv preprint arXiv:1709.08568.
  • Bengio et al., (2013) Bengio, Y., Courville, A., and Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828.
  • Bengio et al., (2020) Bengio, Y., Deleu, T., Rahaman, N., Ke, N. R., Lachapelle, S., Bilaniuk, O., Goyal, A., and Pal, C. (2020). A meta-transfer objective for learning to disentangle causal mechanisms. In International Conference on Learning Representations.
  • Bengio et al., (2003) Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research.
  • Bergen et al., (2021) Bergen, L., O’Donnell, T. J., and Bahdanau, D. (2021). Systematic generalization with edge transformers. In Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W., editors, Advances in Neural Information Processing Systems.
  • Bodén and Niklasson, (2000) Bodén, M. and Niklasson, L. (2000). Semantic systematicity and context in connectionist networks. Connection Science, 12(2):111–142.
  • Bogin et al., (2021) Bogin, B., Subramanian, S., Gardner, M., and Berant, J. (2021). Latent compositional representations improve systematic generalization in grounded question answering. Transactions of the Association for Computational Linguistics, 9:195–210.
  • Bowman et al., (2015) Bowman, S. R., Manning, C. D., and Potts, C. (2015). Tree-structured composition in neural networks without tree-structured architectures.
  • Brakel and Frank, (2009) Brakel, P. and Frank, S. (2009). Strong systematicity in sentence processing by simple recurrent networks. In 31th Annual Conference of the Cognitive Science Society (COGSCI-2009), pages 1599–1604. Cognitive Science Society.
  • Browne and Sun, (2000) Browne, A. and Sun, R. (2000). Connectionist variable binding. Expert Systems, 16:189–207.
  • Burgess et al., (2018) Burgess, C. P., Higgins, I., Pal, A., Matthey, L., Watters, N., Desjardins, G., and Lerchner, A. (2018). Understanding disentangling in β-VAE. CoRR, abs/1804.03599.
  • Calvo and Symons, (2014) Calvo, P. and Symons, J. (2014). The architecture of cognition: Rethinking Fodor and Pylyshyn’s systematicity challenge. MIT Press.
  • Caselles-Dupré et al., (2019) Caselles-Dupré, H., Garcia Ortiz, M., and Filliat, D. (2019). Symmetry-based disentangled representation learning requires interaction with environments. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
  • Chalmers, (1993) Chalmers, D. J. (1993). Connectionism and compositionality: Why fodor and pylyshyn were wrong. Philosophical Psychology.
  • Chang et al., (2019) Chang, M., Gupta, A., Levine, S., and Griffiths, T. L. (2019). Automatically composing representation transformations as a means for generalization. In International Conference on Learning Representations.
  • Changpinyo et al., (2016) Changpinyo, S., Chao, W.-L., Gong, B., and Sha, F. (2016). Synthesized classifiers for zero-shot learning. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5327–5336.
  • Chen et al., (2020) Chen, X., Liang, C., Yu, A. W., Song, D., and Zhou, D. (2020). Compositional generalization via neural-symbolic stack machines. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H., editors, Advances in Neural Information Processing Systems, volume 33, pages 1690–1701. Curran Associates, Inc.
  • Chevalier-Boisvert et al., (2019) Chevalier-Boisvert, M., Bahdanau, D., Lahlou, S., Willems, L., Saharia, C., Nguyen, T. H., and Bengio, Y. (2019). BabyAI: First steps towards grounded language learning with a human in the loop. In International Conference on Learning Representations.
  • Chomsky, (1957) Chomsky, N. (1957). Syntactic structures. Walter de Gruyter.
  • Conklin et al., (2021) Conklin, H., Wang, B., Smith, K., and Titov, I. (2021). Meta-learning to compositionally generalize. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3322–3335, Online. Association for Computational Linguistics.
  • Csordás et al., (2021) Csordás, R., Irie, K., and Schmidhuber, J. (2021). The devil is in the detail: Simple tricks improve systematic generalization of transformers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 619–634, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Dankers et al., (2022) Dankers, V., Bruni, E., and Hupkes, D. (2022). The paradox of the compositionality of natural language: A neural machine translation case study. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4154–4175, Dublin, Ireland. Association for Computational Linguistics.
  • Das et al., (2018) Das, A., Datta, S., Gkioxari, G., Lee, S., Parikh, D., and Batra, D. (2018). Embodied question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 2054–2063.
  • Dasgupta et al., (2019) Dasgupta, I., Wang, J., Chiappa, S., Mitrovic, J., Ortega, P., Raposo, D., Hughes, E., Battaglia, P., Botvinick, M., and Kurth-Nelson, Z. (2019). Causal reasoning from meta-reinforcement learning. arXiv preprint arXiv:1901.08162.
  • d’Avila Garcez et al., (2019) d’Avila Garcez, A., Gori, M., Lamb, L. C., Serafini, L., Spranger, M., and Tran, S. N. (2019). Neural-symbolic computing: An effective methodology for principled integration of machine learning and reasoning.
  • D’Avila Garcez et al., (2009) D’Avila Garcez, A. S., Lamb, L. C., and Gabbay, D. M. (2009). Neural-Symbolic Cognitive Reasoning. Springer Science and Business Media.
  • Deerwester et al., (1990) Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., and Harshman, R. (1990). Indexing by latent semantic analysis. JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE, 41(6):391–407.
  • Derthick, (1990) Derthick, M. (1990). Mundane reasoning by settling on a plausible model. Artif. Intell., 46(1–2):107–157.
  • Ding et al., (2017) Ding, Z., Shao, M., and Fu, Y. (2017). Low-rank embedded ensemble semantic dictionary for zero-shot learning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6005–6013.
  • Dyer and Dickinson, (1994) Dyer, F. C. and Dickinson, J. A. (1994). Development of sun compensation by honeybees: how partially experienced bees estimate the sun’s course. Proceedings of the National Academy of Sciences, 91(10):4471–4474.
  • Eliasmith, (2013) Eliasmith, C. (2013). How to build a brain: A neural architecture for biological cognition. Oxford University Press.
  • Farhadi et al., (2009) Farhadi, A., Endres, I., Hoiem, D., and Forsyth, D. (2009). Describing objects by their attributes. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 1778–1785.
  • Feldman, (1986) Feldman, J. A. (1986). Neural representation of conceptual knowledge.
  • Feldman and Ballard, (1982) Feldman, J. A. and Ballard, D. H. (1982). Connectionist models and their properties. Cognitive Science, 6(3):205–254.
  • Fodor, (1975) Fodor, J. A. (1975). The Language of Thought. Harvard University Press paperback.
  • Fodor and Pylyshyn, (1988) Fodor, J. A. and Pylyshyn, Z. W. (1988). Connectionism and cognitive architecture: A critical analysis. Cognition, 28(1-2):3–71.
  • Frady et al., (2021) Frady, E. P., Kleyko, D., and Sommer, F. T. (2021). Variable binding for sparse distributed representations: Theory and applications. IEEE Transactions on Neural Networks, pages 1–14.
  • Frome et al., (2013) Frome, A., Corrado, G. S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., and Mikolov, T. (2013). Devise: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems 26, volume 26, pages 2121–2129.
  • Furrer et al., (2020) Furrer, D. P., van Zee, M., Scales, N., and Schärli, N. (2020). Compositional generalization in semantic parsing: Pre-training vs. specialized architectures. arXiv preprint arXiv:2007.08970.
  • Gallistel and King, (2009) Gallistel, C. R. and King, A. P. (2009). Memory and The Computational Brain: Why Cognitive Science Will Transform Neuroscience. Memory and The Computational Brain: Why Cognitive Science Will Transform Neuroscience.
  • Ganin and Lempitsky, (2015) Ganin, Y. and Lempitsky, V. (2015). Unsupervised domain adaptation by backpropagation. In International conference on machine learning, pages 1180–1189. PMLR.
  • Gao et al., (2020) Gao, T., Huang, Q., and Mooney, R. (2020). Systematic generalization on gSCAN with language conditioned embedding. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pages 491–503, Suzhou, China. Association for Computational Linguistics.
  • Gaunt et al., (2017) Gaunt, A. L., Brockschmidt, M., Kushman, N., and Tarlow, D. (2017). Differentiable programs with neural libraries. In International Conference on Machine Learning, pages 1213–1222. PMLR.
  • Geiger et al., (2019) Geiger, A., Cases, I., Karttunen, L., and Potts, C. (2019). Posing fair generalization tasks for natural language inference. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4485–4495, Hong Kong, China. Association for Computational Linguistics.
  • Gordon et al., (2020) Gordon, J., Lopez-Paz, D., Baroni, M., and Bouchacourt, D. (2020). Permutation equivariant models for compositional generalization in language. In International Conference on Learning Representations.
  • Gosmann and Eliasmith, (2019) Gosmann, J. and Eliasmith, C. (2019). Vector-Derived Transformation Binding: An Improved Binding Operation for Deep Symbol-Like Processing in Neural Networks. Neural Computation, 31(5):849–869.
  • Goyal and Bengio, (2020) Goyal, A. and Bengio, Y. (2020). Inductive biases for deep learning of higher-level cognition. arXiv preprint arXiv:2011.15091.
  • Goyal et al., (2021a) Goyal, A., Didolkar, A., Ke, N. R., Blundell, C., Beaudoin, P., Heess, N., Mozer, M., and Bengio, Y. (2021a). Neural production systems.
  • Goyal et al., (2022) Goyal, A., Didolkar, A. R., Lamb, A., Badola, K., Ke, N. R., Rahaman, N., Binas, J., Blundell, C., Mozer, M. C., and Bengio, Y. (2022). Coordination among neural modules through a shared global workspace. In International Conference on Learning Representations.
  • Goyal et al., (2021b) Goyal, A., Lamb, A., Gampa, P., Beaudoin, P., Blundell, C., Levine, S., Bengio, Y., and Mozer, M. C. (2021b). Factorizing declarative and procedural knowledge in structured, dynamical environments. In International Conference on Learning Representations.
  • Goyal et al., (2021c) Goyal, A., Lamb, A., Hoffmann, J., Sodhani, S., Levine, S., Bengio, Y., and Schölkopf, B. (2021c). Recurrent independent mechanisms. In International Conference on Learning Representations.
  • Graves et al., (2014) Graves, A., Wayne, G., and Danihelka, I. (2014). Neural Turing machines. arXiv preprint arXiv:1410.5401.
  • Greenland et al., (1999) Greenland, S., Pearl, J., and Robins, J. M. (1999). Causal diagrams for epidemiologic research. Epidemiology, pages 37–48.
  • Guo et al., (2020) Guo, D., Kim, Y., and Rush, A. (2020). Sequence-level mixed sample data augmentation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5547–5552, Online. Association for Computational Linguistics.
  • Guo et al., (2021) Guo, Y., Zhu, H., Lin, Z., Chen, B., Lou, J.-G., and Zhang, D. (2021). Revisiting iterative back-translation from the perspective of compositional generalization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 7601–7609.
  • Hadley, (1992) Hadley, R. (1992). Compositionality and systematicity in connectionist language learning. In Proceedings of the 14th Annual Conference of the Cognitive Science Society, pages 659–664. Lawrence Erlbaum.
  • Hadley and Hayward, (1997) Hadley, R. F. and Hayward, M. B. (1997). Strong semantic systematicity from hebbian connectionist learning. Minds and Machines, 7(1):1–37.
  • He et al., (2021) He, Y., Shen, Z., and Cui, P. (2021). Towards non-iid image classification: A dataset and baselines. Pattern Recognition, 110:107383.
  • Heckerman et al., (1995) Heckerman, D., Geiger, D., and Chickering, D. M. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20(3):197–243.
  • Heinze-Deml and Bouchacourt, (2020) Heinze-Deml, C. and Bouchacourt, D. (2020). Think before you act: A simple baseline for compositional generalization. arXiv preprint arXiv:2009.13962.
  • Hendrycks and Dietterich, (2019) Hendrycks, D. and Dietterich, T. (2019). Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations.
  • Hendrycks et al., (2020) Hendrycks, D., Mu, N., Cubuk, E. D., Zoph, B., Gilmer, J., and Lakshminarayanan, B. (2020). AugMix: A simple method to improve robustness and uncertainty under data shift. In International Conference on Learning Representations.
  • Hermann et al., (2017) Hermann, K. M., Hill, F., Green, S., Wang, F., Faulkner, R., Soyer, H., Szepesvari, D., Czarnecki, W. M., Jaderberg, M., Teplyashin, D., Wainwright, M., Apps, C., Hassabis, D., and Blunsom, P. (2017). Grounded language learning in a simulated 3d world. arXiv preprint arXiv:1706.06551.
  • Herzig and Berant, (2021) Herzig, J. and Berant, J. (2021). Span-based semantic parsing for compositional generalization. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 908–921, Online. Association for Computational Linguistics.
  • Herzig et al., (2021) Herzig, J., Shaw, P., Chang, M.-W., Guu, K., Pasupat, P., and Zhang, Y. (2021). Unlocking compositional generalization in pre-trained models using intermediate representations. arXiv preprint arXiv:2104.07478.
  • Higgins et al., (2018) Higgins, I., Amos, D., Pfau, D., Racanière, S., Matthey, L., Rezende, D. J., and Lerchner, A. (2018). Towards a definition of disentangled representations. CoRR, abs/1812.02230.
  • Higgins et al., (2017) Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. (2017). beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, volume 3.
  • Hinton, (1990) Hinton, G. E. (1990). Mapping part-whole hierarchies into connectionist networks. Artificial Intelligence, 46(1):47–75.
  • Hinton, (1990) Hinton, G. E. (1990). Preface to the special issue on connectionist symbol processing. Artificial Intelligence, 46(1-2):1–4.
  • Hinton, (1991) Hinton, G. E. (1991). Connectionist Symbol Processing. MIT Press.
  • Hinton and Anderson, (1981) Hinton, G. E. and Anderson, J. A. (1981). Implementing semantic networks in parallel hardware. Parallel Models of Association Memory.
  • Hinton et al., (1986a) Hinton, G. E. et al. (1986a). Learning distributed representations of concepts. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society, volume 1, page 12. Amherst, MA.
  • Hinton et al., (1986b) Hinton, G. E., McClelland, J. L., and Rumelhart, D. E. (1986b). Distributed Representations, pages 77–109. MIT Press, Cambridge, MA, USA.
  • Holyoak and Hummel, (2000) Holyoak, K. and Hummel, J. (2000). The proper treatment of symbols in a connectionist architecture. In Cognitive dynamics: Conceptual change in humans and machines.
  • Hu et al., (2017) Hu, R., Andreas, J., Rohrbach, M., Darrell, T., and Saenko, K. (2017). Learning to reason: End-to-end module networks for visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 804–813.
  • Hudson and Manning, (2018) Hudson, D. A. and Manning, C. D. (2018). Compositional attention networks for machine reasoning. In International Conference on Learning Representations (ICLR).
  • Hudson and Manning, (2019) Hudson, D. A. and Manning, C. D. (2019). GQA: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6700–6709.
  • Hume, (2003) Hume, D. (2003). A treatise of human nature. Courier Corporation.
  • Hummel and Holyoak, (1997) Hummel, J. and Holyoak, K. (1997). Distributed representations of structure: A theory of analogical access and mapping. Psychological Review, 104(3):427–466.
  • Hupkes et al., (2020) Hupkes, D., Dankers, V., Mul, M., and Bruni, E. (2020). Compositionality decomposed: How do neural networks generalise? (extended abstract). In Bessiere, C., editor, Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, pages 5065–5069. International Joint Conferences on Artificial Intelligence Organization. Journal track.
  • Jacobs et al., (1991a) Jacobs, R. A., Jordan, M. I., and Barto, A. G. (1991a). Task decomposition through competition in a modular connectionist architecture: The what and where vision tasks. Cognitive Science, 15(2):219–250.
  • Jacobs et al., (1991b) Jacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. (1991b). Adaptive mixtures of local experts. Neural Computation, 3(1):79–87.
  • Jansen and Watter, (2012) Jansen, P. A. and Watter, S. (2012). Strong systematicity through sensorimotor conceptual grounding: an unsupervised, developmental approach to connectionist sentence processing. Connection Science, 24(1):25–55.
  • Jia and Liang, (2016) Jia, R. and Liang, P. (2016). Data recombination for neural semantic parsing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12–22, Berlin, Germany. Association for Computational Linguistics.
  • Jiang and Bansal, (2021) Jiang, Y. and Bansal, M. (2021). Inducing transformer’s compositional generalization ability via auxiliary sequence prediction tasks. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 6253–6265, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Johnson et al., (2017) Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Zitnick, C. L., and Girshick, R. (2017). CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1988–1997.
  • Johnson et al., (2017) Johnson, J., Hariharan, B., Van Der Maaten, L., Hoffman, J., Fei-Fei, L., Lawrence Zitnick, C., and Girshick, R. (2017). Inferring and executing programs for visual reasoning. In Proceedings of the IEEE International Conference on Computer Vision, pages 2989–2998.
  • Jordan and Jacobs, (1994) Jordan, M. I. and Jacobs, R. A. (1994). Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6(2):181–214.
  • Kahneman, (2011) Kahneman, D. (2011). Thinking, fast and slow. Macmillan.
  • Karpathy and Li, (2015) Karpathy, A. and Li, F.-F. (2015). Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3128–3137.
  • Ke et al., (2019) Ke, N. R., Bilaniuk, O., Goyal, A., Bauer, S., Larochelle, H., Schölkopf, B., Mozer, M. C., Pal, C., and Bengio, Y. (2019). Learning neural causal models from unknown interventions. arXiv preprint arXiv:1910.01075.
  • Ke et al., (2021) Ke, N. R., Didolkar, A., Mittal, S., Goyal, A., Lajoie, G., Bauer, S., Rezende, D., Bengio, Y., Mozer, M., and Pal, C. (2021). Systematic evaluation of causal discovery in visual model based reinforcement learning.
  • Keysers et al., (2020) Keysers, D., Schärli, N., Scales, N., Buisman, H., Furrer, D., Kashubin, S., Momchev, N., Sinopalnikov, D., Stafiniak, L., Tihon, T., Tsarkov, D., Wang, X., van Zee, M., and Bousquet, O. (2020). Measuring compositional generalization: A comprehensive method on realistic data. In International Conference on Learning Representations.
  • Kim and Linzen, (2020) Kim, N. and Linzen, T. (2020). COGS: A compositional generalization challenge based on semantic interpretation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9087–9105, Online. Association for Computational Linguistics.
  • Kliegl and Xu, (2018) Kliegl, M. and Xu, W. (2018). More systematic than claimed: Insights on the SCAN tasks. OpenReview.
  • Klinger et al., (2020) Klinger, T., Adjodah, D., Marois, V., Joseph, J., Riemer, M., Pentland, A. S., and Campbell, M. (2020). A study of compositional generalization in neural models.
  • Kocaoglu et al., (2018) Kocaoglu, M., Snyder, C., Dimakis, A. G., and Vishwanath, S. (2018). CausalGAN: Learning causal implicit generative models with adversarial training. In International Conference on Learning Representations.
  • Kuo et al., (2021) Kuo, Y.-L., Katz, B., and Barbu, A. (2021). Compositional networks enable systematic generalization for grounded language understanding. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 216–226, Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Kurutach et al., (2018) Kurutach, T., Tamar, A., Yang, G., Russell, S. J., and Abbeel, P. (2018). Learning plannable representations with causal infogan. In Advances in Neural Information Processing Systems, pages 8733–8744.
  • Lake and Baroni, (2018) Lake, B. and Baroni, M. (2018). Generalization without systematicity: On the compositional skills of sequence-to-sequence recurrent networks. In International Conference on Machine Learning, pages 2873–2882.
  • Lake, (2019) Lake, B. M. (2019). Compositional generalization through meta sequence-to-sequence learning. In Advances in Neural Information Processing Systems, pages 9788–9798.
  • Lake et al., (2019) Lake, B. M., Linzen, T., and Baroni, M. (2019). Human few-shot learning of compositional instructions. arXiv preprint arXiv:1901.04587.
  • Lake et al., (2017) Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman, S. J. (2017). Building machines that learn and think like people. Behavioral and Brain Sciences, 40.
  • Lampert et al., (2014) Lampert, C. H., Nickisch, H., and Harmeling, S. (2014). Attribute-based classification for zero-shot visual object categorization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(3):453–465.
  • Larochelle et al., (2008) Larochelle, H., Erhan, D., and Bengio, Y. (2008). Zero-data learning of new tasks. In Proceedings of the 23rd National Conference on Artificial Intelligence (AAAI’08), Volume 2, pages 646–651.
  • Lenat et al., (1986) Lenat, D., Prakash, M., and Shepherd, M. (1986). Cyc: Using common sense knowledge to overcome brittleness and knowledge acquisition bottlenecks. AI Magazine, 6(4):65–85.
  • Li et al., (2017) Li, D., Yang, Y., Song, Y.-Z., and Hospedales, T. M. (2017). Deeper, broader and artier domain generalization. In Proceedings of the IEEE international conference on computer vision, pages 5542–5550.
  • Li et al., (2020) Li, Y., Zhao, L., Church, K., and Elhoseiny, M. (2020). Compositional language continual learning. In International Conference on Learning Representations.
  • Li et al., (2019) Li, Y., Zhao, L., Wang, J., and Hestness, J. (2019). Compositional generalization for primitive substitutions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4284–4293.
  • Linzen et al., (2016) Linzen, T., Dupoux, E., and Goldberg, Y. (2016). Assessing the ability of lstms to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535.
  • Liška et al., (2018) Liška, A., Kruszewski, G., and Baroni, M. (2018). Memorize or generalize? Searching for a compositional RNN in a haystack. AEGAP.
  • Liu et al., (2021) Liu, D., Lamb, A. M., Kawaguchi, K., Goyal, A., Sun, C., Mozer, M. C., and Bengio, Y. (2021). Discrete-valued neural communication. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W., editors, Advances in Neural Information Processing Systems, volume 34, pages 2109–2121. Curran Associates, Inc.
  • Liu et al., (2020) Liu, Q., An, S., Lou, J.-G., Chen, B., Lin, Z., Gao, Y., Zhou, B., Zheng, N., and Zhang, D. (2020). Compositional generalization by learning analytical expressions. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H., editors, Advances in Neural Information Processing Systems, volume 33, pages 11416–11427. Curran Associates, Inc.
  • Locatello et al., (2019) Locatello, F., Bauer, S., Lucic, M., Raetsch, G., Gelly, S., Schölkopf, B., and Bachem, O. (2019). Challenging common assumptions in the unsupervised learning of disentangled representations. In international conference on machine learning, pages 4114–4124. PMLR.
  • Locatello et al., (2020) Locatello, F., Weissenborn, D., Unterthiner, T., Mahendran, A., Heigold, G., Uszkoreit, J., Dosovitskiy, A., and Kipf, T. (2020). Object-centric learning with slot attention. Advances in Neural Information Processing Systems, 33:11525–11538.
  • Loula et al., (2018) Loula, J., Baroni, M., and Lake, B. (2018). Rearranging the familiar: Testing compositional generalization in recurrent networks. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 108–114, Brussels, Belgium. Association for Computational Linguistics.
  • Mao et al., (2019) Mao, J., Gan, C., Kohli, P., Tenenbaum, J. B., and Wu, J. (2019). The neuro-symbolic concept learner: Interpreting scenes, words, and sentences from natural supervision. In International Conference on Learning Representations.
  • Marcus, (1998) Marcus, G. F. (1998). Rethinking eliminative connectionism. Cognitive psychology, 37(3):243–282.
  • McCarthy, (1959) McCarthy, J. (1959). Programs with common sense. In Teddington Conference on the Mechanization of Thought Processes.
  • McGarry et al., (1999) McGarry, K., Wermter, S., and MacIntyre, J. (1999). Hybrid neural systems: from simple coupling to fully integrated neural networks. Neural Computing Surveys, 2(1):62–93.
  • Mensink et al., (2012) Mensink, T., Verbeek, J., Perronnin, F., and Csurka, G. (2012). Metric learning for large scale image classification: generalizing to new classes at near-zero cost. In Proceedings of the 12th European Conference on Computer Vision (ECCV 2012), Part II, pages 488–501.
  • Miikkulainen, (1993) Miikkulainen, R. (1993). Subsymbolic Natural Language Processing: An Integrated Model of Scripts, Lexicon, and Memory. MIT Press.
  • Minsky, (1986) Minsky, M. (1986). Society of mind. Simon and Schuster.
  • Mittal et al., (2022a) Mittal, S., Bengio, Y., and Lajoie, G. (2022a). Is a modular architecture enough? In DyNN Workshop at the 39th International Conference on Machine Learning.
  • Mittal et al., (2020) Mittal, S., Lamb, A., Goyal, A., Voleti, V., Shanahan, M., Lajoie, G., Mozer, M., and Bengio, Y. (2020). Learning to combine top-down and bottom-up signals in recurrent neural networks with attention over modules. In International Conference on Machine Learning, pages 6972–6986. PMLR.
  • Mittal et al., (2022b) Mittal, S., Raparthy, S. C., Rish, I., Bengio, Y., and Lajoie, G. (2022b). Compositional attention: Disentangling search and retrieval. In International Conference on Learning Representations.
  • Montague, (1970) Montague, R. (1970). Universal grammar. Theoria, 36(3):373–398.
  • Newell and Simon, (1976) Newell, A. and Simon, H. A. (1976). Computer science as empirical inquiry: Symbols and search. Communications of the ACM, 19(3):113–126.
  • Newell, (1980a) Newell, A. (1980a). Harpy, production systems, and human cognition. In Perception and Production of Fluent Speech, pages 289–380.
  • Newell, (1980b) Newell, A. (1980b). Physical symbol systems. Cognitive Science, 4(2):135–183.
  • Newell, (1990) Newell, A. (1990). Unified Theories of Cognition. Harvard University Press, USA.
  • Nie et al., (2020) Nie, Y., Williams, A., Dinan, E., Bansal, M., Weston, J., and Kiela, D. (2020). Adversarial NLI: A new benchmark for natural language understanding. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4885–4901, Online. Association for Computational Linguistics.
  • Niklasson and van Gelder, (1994) Niklasson, L. and van Gelder, T. (1994). Can connectionist models exhibit non-classical structure sensitivity? In Proceedings of the Sixteenth Annual Conference of the Cognitive Science Society, pages 664–669. Erlbaum.
  • Norouzi et al., (2014) Norouzi, M., Mikolov, T., Bengio, S., Singer, Y., Shlens, J., Frome, A., Corrado, G. S., and Dean, J. (2014). Zero-shot learning by convex combination of semantic embeddings. In International Conference on Learning Representations (ICLR).
  • Nye et al., (2020) Nye, M., Solar-Lezama, A., Tenenbaum, J., and Lake, B. M. (2020). Learning compositional rules via neural program synthesis. Advances in Neural Information Processing Systems, 33:10832–10842.
  • Nye et al., (2021) Nye, M., Tessler, M., Tenenbaum, J., and Lake, B. M. (2021). Improving coherence and consistency in neural sequence models with dual-system, neuro-symbolic reasoning. Advances in Neural Information Processing Systems, 34:25192–25204.
  • Ontanon et al., (2022) Ontanon, S., Ainslie, J., Fisher, Z., and Cvicek, V. (2022). Making transformers solve compositional tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3591–3607, Dublin, Ireland. Association for Computational Linguistics.
  • Oren et al., (2021) Oren, I., Herzig, J., and Berant, J. (2021). Finding needles in a haystack: Sampling structurally-diverse training sets from synthetic data for compositional generalization. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10793–10809, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Oren et al., (2020) Oren, I., Herzig, J., Gupta, N., Gardner, M., and Berant, J. (2020). Improving compositional generalization in semantic parsing. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2482–2495, Online. Association for Computational Linguistics.
  • Painter et al., (2020) Painter, M., Prugel-Bennett, A., and Hare, J. (2020). Linear disentangled representations and unsupervised action estimation. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H., editors, Advances in Neural Information Processing Systems, volume 33, pages 13297–13307. Curran Associates, Inc.
  • Parascandolo et al., (2018) Parascandolo, G., Kilbertus, N., Rojas-Carulla, M., and Schölkopf, B. (2018). Learning independent causal mechanisms. In International Conference on Machine Learning, pages 4036–4044. PMLR.
  • Patterson and Hays, (2012) Patterson, G. and Hays, J. (2012). Sun attribute database: Discovering, annotating, and recognizing scene attributes. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2751–2758.
  • Pearl, (1995) Pearl, J. (1995). Causal diagrams for empirical research. Biometrika, 82(4):669–688.
  • Pearl, (2003) Pearl, J. (2003). Causality: Models, Reasoning, and Inference. Cambridge University Press.
  • Pearl, (2009) Pearl, J. (2009). Causality. Cambridge University Press.
  • Pearl, (2018) Pearl, J. (2018). Does obesity shorten life? or is it the soda? on non-manipulable causes. Journal of Causal Inference, 6(2).
  • Perez et al., (2018) Perez, E., Strub, F., De Vries, H., Dumoulin, V., and Courville, A. (2018). Film: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32.
  • Peters et al., (2016) Peters, J., Bühlmann, P., and Meinshausen, N. (2016). Causal inference using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society. Series B (Statistical Methodology), pages 947–1012.
  • Peters et al., (2017) Peters, J., Janzing, D., and Schölkopf, B. (2017). Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press.
  • Pfau et al., (2020) Pfau, D., Higgins, I., Botev, A., and Racanière, S. (2020). Disentangling by subspace diffusion. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H., editors, Advances in Neural Information Processing Systems, volume 33, pages 17403–17415. Curran Associates, Inc.
  • Pinker and Prince, (1988) Pinker, S. and Prince, A. (1988). On language and connectionism: Analysis of a parallel distributed processing model of language acquisition. Cognition, 28(1-2):73–193.
  • Plate, (1991) Plate, T. (1991). Holographic reduced representations: Convolution algebra for compositional distributed representations. In International Joint Conference on Artificial Intelligence, pages 30–35. Morgan Kaufmann.
  • Pollack, (1990) Pollack, J. B. (1990). Recursive distributed representations. Artificial Intelligence, 46(1-2):77–105.
  • Pylyshyn, (1980) Pylyshyn, Z. W. (1980). Computation and cognition: Issues in the foundations of cognitive science. Behavioral and Brain Sciences, 3(1):111–132.
  • Qiu et al., (2021) Qiu, L., Hu, H., Zhang, B., Shaw, P., and Sha, F. (2021). Systematic generalization on gSCAN: What is nearly solved and what is next? In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 2180–2188, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Qiu et al., (2022a) Qiu, L., Shaw, P., Pasupat, P., Nowak, P., Linzen, T., Sha, F., and Toutanova, K. (2022a). Improving compositional generalization with latent structure and data augmentation. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4341–4362, Seattle, United States. Association for Computational Linguistics.
  • Qiu et al., (2022b) Qiu, L., Shaw, P., Pasupat, P., Shi, T., Herzig, J., Pitler, E., Sha, F., and Toutanova, K. (2022b). Evaluating the impact of model scale for compositional generalization in semantic parsing. arXiv preprint arXiv:2205.12253.
  • Quessard et al., (2020) Quessard, R., Barrett, T., and Clements, W. (2020). Learning disentangled representations and group structure of dynamical environments. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M. F., and Lin, H., editors, Advances in Neural Information Processing Systems, volume 33, pages 19727–19737. Curran Associates, Inc.
  • Rahaman et al., (2021) Rahaman, N., Gondal, M. W., Joshi, S., Gehler, P., Bengio, Y., Locatello, F., and Schölkopf, B. (2021). Dynamic inference with neural interpreters. Advances in Neural Information Processing Systems, 34:10985–10998.
  • Reed et al., (2016) Reed, S., Akata, Z., Lee, H., and Schiele, B. (2016). Learning deep representations of fine-grained visual descriptions. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 49–58.
  • Riemer et al., (2016) Riemer, M., Vempaty, A., Calmon, F., Heath, F., Hull, R., and Khabiri, E. (2016). Correcting forecasts with multifactor neural attention. In International Conference on Machine Learning, pages 3010–3019. PMLR.
  • Robin and Holyoak, (1995) Robin, N. and Holyoak, K. J. (1995). Relational complexity and the functions of the prefrontal cortex. Cognitive Neurosciences.
  • Rodriguez and Wiles, (1998) Rodriguez, P. and Wiles, J. (1998). Recurrent neural networks can learn to implement symbol-sensitive counting. In Advances in Neural Information Processing Systems, pages 87–93.
  • Rohrbach et al., (2011) Rohrbach, M., Stark, M., and Schiele, B. (2011). Evaluating knowledge transfer and zero-shot learning in a large-scale setting. In CVPR 2011, pages 1641–1648.
  • Romera-Paredes and Torr, (2015) Romera-Paredes, B. and Torr, P. (2015). An embarrassingly simple approach to zero-shot learning. In Proceedings of The 32nd International Conference on Machine Learning, pages 2152–2161.
  • Rosenbaum et al., (2019) Rosenbaum, C., Cases, I., Riemer, M., and Klinger, T. (2019). Routing networks and the challenges of modular and compositional computation. arXiv preprint arXiv:1904.12774.
  • Ruis et al., (2020) Ruis, L., Andreas, J., Baroni, M., Bouchacourt, D., and Lake, B. M. (2020). A benchmark for systematic generalization in grounded language understanding. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H., editors, Advances in Neural Information Processing Systems, volume 33, pages 19861–19872. Curran Associates, Inc.
  • Ruiz et al., (2021) Ruiz, L., Ainslie, J., and Ontañón, S. (2021). Iterative decoding for compositional generalization in transformers. arXiv preprint arXiv:2110.04169.
  • Rumelhart et al., (1986a) Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1986a). Learning representations by back-propagating errors. Nature, 323(6088):533–536.
  • Rumelhart and McClelland, (1986) Rumelhart, D. E. and McClelland, J. L. (1986). On learning the past tenses of English verbs. MIT Press.
  • Rumelhart et al., (1986b) Rumelhart, D. E., McClelland, J. L., and the PDP Research Group (1986b). Parallel distributed processing: Explorations in the microstructure of cognition: Foundations. MIT Press.
  • Russin et al., (2019) Russin, J., Jo, J., O’Reilly, R. C., and Bengio, Y. (2019). Compositional generalization in a deep seq2seq model by separating syntax and semantics. arXiv preprint arXiv:1904.09708.
  • Santoro et al., (2017) Santoro, A., Raposo, D., Barrett, D. G., Malinowski, M., Pascanu, R., Battaglia, P., and Lillicrap, T. (2017). A simple neural network module for relational reasoning. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc.
  • Saxton et al., (2019) Saxton, D., Grefenstette, E., Hill, F., and Kohli, P. (2019). Analysing mathematical reasoning abilities of neural models. In International Conference on Learning Representations.
  • Schank, (1982) Schank, R. C. (1982). Dynamic memory: A theory of reminding and learning in computers and people. Cambridge University Press.
  • Schlag et al., (2020) Schlag, I., Smolensky, P., Fernandez, R., Jojic, N., Schmidhuber, J., and Gao, J. (2020). Enhancing the transformer with explicit relational encoding for math problem solving.
  • Schölkopf et al., (2016) Schölkopf, B., Janzing, D., and Lopez-Paz, D. (2016). Causal and statistical learning. In Oberwolfach Reports, volume 13, pages 1896–1899.
  • Schölkopf et al., (2021) Schölkopf, B., Locatello, F., Bauer, S., Ke, N. R., Kalchbrenner, N., Goyal, A., and Bengio, Y. (2021). Toward causal representation learning. Proceedings of the IEEE, 109(5):612–634.
  • Sejnowski and Rosenberg, (1987) Sejnowski, T. J. and Rosenberg, C. R. (1987). Parallel networks that learn to pronounce English text. Complex Systems, 1:145–168.
  • Shastri and Ajjanagadde, (1993) Shastri, L. and Ajjanagadde, V. (1993). From simple associations to systematic reasoning: a connectionist representation of rules, variables and dynamic bindings using temporal synchrony. Behavioral and Brain Sciences.
  • Shavlik, (1994) Shavlik, J. W. (1994). Combining symbolic and neural learning. Machine Learning, 14(3):321–331.
  • Shaw et al., (2021) Shaw, P., Chang, M.-W., Pasupat, P., and Toutanova, K. (2021). Compositional generalization and natural language variation: Can a semantic parsing approach handle both? In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics, pages 922–938.
  • Shazeer et al., (2017) Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. (2017). Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. In International Conference on Learning Representations.
  • Shin et al., (2021) Shin, R., Lin, C., Thomson, S., Chen, C., Roy, S., Platanios, E. A., Pauls, A., Klein, D., Eisner, J., and Van Durme, B. (2021). Constrained language models yield few-shot semantic parsers. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7699–7715, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Smolensky, (1990) Smolensky, P. (1990). Tensor product variable binding and the representation of symbolic structures in connectionist systems. Artificial Intelligence, 46(1):159–216.
  • Socher et al., (2013) Socher, R., Ganjoo, M., Sridhar, H., Bastani, O., Manning, C. D., and Ng, A. Y. (2013). Zero-shot learning through cross-modal transfer. In ICLR (Workshop).
  • Sun, (1992) Sun, R. (1992). On variable binding in connectionist networks. Connection Science, 4(2):93–124.
  • Sun, (1996) Sun, R. (1996). Hybrid connectionist-symbolic modules: A report from the IJCAI-95 workshop on connectionist-symbolic integration. AI Magazine, 17:99–103.
  • Sutskever et al., (2014) Sutskever, I., Vinyals, O., and Le, Q. V. (2014). Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pages 3104–3112.
  • Sylvain et al., (2020) Sylvain, T., Petrini, L., and Hjelm, D. (2020). Locality and compositionality in zero-shot learning. In International Conference on Learning Representations.
  • Talmor et al., (2020) Talmor, A., Tafjord, O., Clark, P., Goldberg, Y., and Berant, J. (2020). Leap-of-thought: Teaching pre-trained models to systematically reason over implicit knowledge. Advances in Neural Information Processing Systems, 33:20227–20237.
  • Tang et al., (2021) Tang, Z., Gao, Y., Zhu, Y., Zhang, Z., Li, M., and Metaxas, D. (2021). Crossnorm and selfnorm for generalization under distribution shifts.
  • Torralba and Efros, (2011) Torralba, A. and Efros, A. A. (2011). Unbiased look at dataset bias. In CVPR 2011, pages 1521–1528. IEEE.
  • Touretzky and Hinton, (1988) Touretzky, D. and Hinton, G. (1988). A distributed connectionist production system. Cognitive Science, 12:423–466.
  • Touretzky, (1986) Touretzky, D. S. (1986). BoltzCONS: Reconciling connectionism with the recursive nature of stacks and trees. In Proceedings of the Eighth Annual Conference of the Cognitive Science Society.
  • Touretzky and Hinton, (1985) Touretzky, D. S. and Hinton, G. E. (1985). Symbols among the neurons: Details of a connectionist inference architecture. In IJCAI, volume 85, pages 238–243.
  • Treisman, (1998) Treisman, A. (1998). Feature binding, attention and object perception. Philosophical Transactions of the Royal Society B, 353(1373):1295–1306.
  • Tsarkov et al., (2021) Tsarkov, D., Tihon, T., Scales, N., Momchev, N., Sinopalnikov, D., and Schärli, N. (2021). *-CFQ: Analyzing the scalability of machine learning on a compositional task. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 9949–9957.
  • Tsirtsis et al., (2020) Tsirtsis, S., Tabibian, B., Khajehnejad, M., Singla, A., Schölkopf, B., and Gomez-Rodriguez, M. (2020). Optimal decision making under strategic behavior.
  • Vaswani et al., (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.
  • Vinyals et al., (2015) Vinyals, O., Fortunato, M., and Jaitly, N. (2015). Pointer networks. In Advances in Neural Information Processing Systems, pages 2692–2700.
  • Wah et al., (2011) Wah, C., Branson, S., Welinder, P., Perona, P., and Belongie, S. (2011). The Caltech-UCSD Birds-200-2011 dataset. Technical report, California Institute of Technology.
  • Wang et al., (2021a) Wang, B., Lapata, M., and Titov, I. (2021a). Structured reordering for modeling latent alignments in sequence transduction. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W., editors, Advances in Neural Information Processing Systems, volume 34, pages 13378–13391. Curran Associates, Inc.
  • Wang et al., (2021b) Wang, B., Yin, W., Lin, X. V., and Xiong, C. (2021b). Learning to synthesize data for semantic parsing. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2760–2766, Online. Association for Computational Linguistics.
  • Wang et al., (2022) Wang, J., Lan, C., Liu, C., Ouyang, Y., Qin, T., Lu, W., Chen, Y., Zeng, W., and Yu, P. (2022). Generalizing to unseen domains: A survey on domain generalization. IEEE Transactions on Knowledge and Data Engineering.
  • Wang and Deng, (2018) Wang, M. and Deng, W. (2018). Deep visual domain adaptation: A survey. Neurocomputing, 312:135–153.
  • Weiss et al., (2018) Weiss, G., Goldberg, Y., and Yahav, E. (2018). On the practical computational power of finite precision RNNs for language recognition. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 740–745, Melbourne, Australia. Association for Computational Linguistics.
  • Welling, (2015) Welling, M. (2015). Are ML and statistics complementary? In IMS-ISBA Meeting on ‘Data Science in the Next 50 Years’.
  • Wermter and Sun, (1998) Wermter, S. and Sun, R. (1998). An overview of hybrid neural systems. In Hybrid Neural Systems, revised papers from a workshop, pages 1–13.
  • Wong and Wang, (2007) Wong, F. C. and Wang, W. S. (2007). Generalisation towards combinatorial productivity in language acquisition by simple recurrent networks. In 2007 International Conference on Integration of Knowledge Intensive Multi-Agent Systems, pages 139–144. IEEE.
  • Wu et al., (2018) Wu, Y., Wu, Y., Gkioxari, G., and Tian, Y. (2018). Building generalizable agents with a realistic and rich 3d environment. arXiv preprint.
  • Xian et al., (2016) Xian, Y., Akata, Z., Sharma, G., Nguyen, Q., Hein, M., and Schiele, B. (2016). Latent embeddings for zero-shot classification. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 69–77.
  • Xian et al., (2019) Xian, Y., Lampert, C. H., Schiele, B., and Akata, Z. (2019). Zero-shot learning—a comprehensive evaluation of the good, the bad and the ugly. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(9):2251–2265.
  • Xu et al., (2015) Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., and Bengio, Y. (2015). Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pages 2048–2057.
  • Xu et al., (2017) Xu, X., Shen, F., Yang, Y., Zhang, D., Shen, H. T., and Song, J. (2017). Matrix tri-factorization with manifold regularizations for zero-shot learning. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2007–2016.
  • Yang et al., (2019) Yang, G. R., Joglekar, M. R., Song, H. F., Newsome, W. T., and Wang, X.-J. (2019). Task representations in neural networks trained to perform many cognitive tasks. Nature Neuroscience.
  • Yang et al., (2022) Yang, J., Zhang, L., and Yang, D. (2022). SUBS: Subtree substitution for compositional semantic parsing. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 169–174, Seattle, United States. Association for Computational Linguistics.
  • Yao et al., (2022) Yao, H., Wang, Y., Li, S., Zhang, L., Liang, W., Zou, J., and Finn, C. (2022). Improving out-of-distribution robustness via selective augmentation. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., and Sabato, S., editors, Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pages 25407–25437. PMLR.
  • Yin et al., (2021) Yin, P., Fang, H., Neubig, G., Pauls, A., Platanios, E. A., Su, Y., Thomson, S., and Andreas, J. (2021). Compositional generalization for neural semantic parsing via span-level supervised attention. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2810–2823, Online. Association for Computational Linguistics.
  • Yu et al., (2018) Yu, H., Zhang, H., and Xu, W. (2018). Interactive grounded language acquisition and generalization in a 2d world. In International Conference on Learning Representations.
  • Yu et al., (2021) Yu, T., Wu, C.-S., Lin, X. V., Wang, B., Tan, Y. C., Yang, X., Radev, D., Socher, R., and Xiong, C. (2021). GraPPa: Grammar-augmented pre-training for table semantic parsing. In International Conference on Learning Representations.
  • Yu and Aloimonos, (2010) Yu, X. and Aloimonos, Y. (2010). Attribute-based transfer learning for object categorization with zero/one training example. In Proceedings of the 11th European Conference on Computer Vision (ECCV’10), Part V, pages 127–140.
  • Zhang et al., (2021) Zhang, X., Cui, P., Xu, R., Zhou, L., He, Y., and Shen, Z. (2021). Deep stable learning for out-of-distribution generalization. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5368–5378.
  • Zhang et al., (2022) Zhang, X., Zhou, L., Xu, R., Cui, P., Shen, Z., and Liu, H. (2022). Nico++: Towards better benchmarking for domain generalization. arXiv preprint arXiv:2204.08040.
  • Zhang and Saligrama, (2015) Zhang, Z. and Saligrama, V. (2015). Zero-shot learning via semantic similarity embedding. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 4166–4174.
  • Zheng and Lapata, (2021) Zheng, H. and Lapata, M. (2021). Compositional generalization via semantic tagging. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 1022–1032, Punta Cana, Dominican Republic. Association for Computational Linguistics.
  • Zhong et al., (2020) Zhong, V., Lewis, M., Wang, S. I., and Zettlemoyer, L. (2020). Grounded adaptation for zero-shot executable semantic parsing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6869–6882, Online. Association for Computational Linguistics.
  • Zhou et al., (2022) Zhou, K., Liu, Z., Qiao, Y., Xiang, T., and Loy, C. C. (2022). Domain generalization: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  • Zhu et al., (2021) Zhu, W., Shaw, P., Linzen, T., and Sha, F. (2021). Learning to generalize compositionally by transferring across semantic parsing tasks. arXiv preprint arXiv:2111.05013.