
A Survey of Neural Networks and Formal Languages

This work was partially supported by DARPA SafeDocs Program award HR001119C0075, for which SRI is the prime contractor and Dartmouth is a subcontractor.

Joshua Ackerman, Department of Computer Science, Dartmouth College, Hanover, NH 03755. [email protected]

George Cybenko, Thayer School of Engineering, Dartmouth College, Hanover, NH 03755. [email protected]
Abstract

This report is a survey of the relationships between various state-of-the-art neural network architectures and formal languages as, for example, structured by the Chomsky Language Hierarchy. Of particular interest are the abilities of a neural architecture to represent, recognize and generate words from a specific language by learning from samples of the language.

1 Introduction

Understanding how well different neural network architectures decide membership in classes of formal languages is a fundamental problem spanning machine learning and language processing. In principle, it is known that even the simplest variants of Recurrent Neural Networks are capable of emulating a Turing Machine in real time, and consequently are Turing Complete [18]. However, this result and similar ones rely on unrealistic assumptions, such as unbounded computation time and infinite-precision representations of real-valued states.

For these reasons, our understanding of the relationship between different network architectures and the Chomsky hierarchy [12] is not complete, and as a consequence there is growing interest in understanding how different networks operate under more realistic constraints. In particular, we consider computation time linear in the input size and bounded-precision numbers as constraints. Adopting the terminology of [23], we will describe such networks as input-bounded neural networks with finite-precision states (IBFP-NNs).

Research in this general area dates back to the early 1990s, with works such as [19] [21] [11] [17] empirically studying the ability of different networks to learn variations of context-free counter languages [4]. Notably, it was around this time that researchers started exploring the idea of augmenting networks with memory constructs, as in [2], which introduced the Neural Network Pushdown Automaton (NNPDA) – an RNN augmented with an external stack. Very recently, the body of research on such Memory Augmented Neural Networks has grown considerably, with the introduction of fully differentiable memory models such as Neural Stacks [8], Neural Queues [8], and even Neural Turing Machines [6]. Many of these papers present empirical results, so, naturally, their findings are not guaranteed to hold in general.

However, in the last year, a number of papers have appeared that make very direct comparisons of modern network architectures with respect to formal language processing. Papers such as [15] and [23] explore the theoretical power of IBFP variants of many such state-of-the-art networks, while papers like [20] explore how networks can be augmented to better perform formal language focused tasks.

For the SafeDocs project, we are primarily focused on two key tasks: (i) generating novel samples from unknown grammars given a small sample set of positive examples; and (ii) empirically deciding whether new samples are in a language thus learned. Although the aforementioned papers primarily discuss results in the spirit of task (ii), it is worth noting that generative modeling (i) is a comparatively harder task than discriminative modeling (ii), so many of these results are still applicable (albeit less precisely) to quantifying the limits of networks on generative tasks.

We believe our empirical exploration as part of task (ii) will motivate richer theoretical results surrounding the generative capacity of certain models in the setting of Generative Adversarial Neural Networks [5]. For the moment, however, the most relevant results to our work are those of papers like [23] [15] [20], which study the power or design of state-of-the-art networks in the context of language recognition. In this project-oriented survey, we walk through some of the most pertinent results from these papers, surveying theoretical and empirical work that quantifies the ability of Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformer Networks, along with variations on Memory Augmented Neural Networks (MANNs), to learn languages in the Chomsky Hierarchy under these practical assumptions.

This report is structured as follows: Section 2 reviews several concepts related to network performance and language theory. Section 3 reviews known results regarding several network architecture types with respect to language representation and recognition problems. Section 4 is a succinct summary of the results we have reviewed in this report.

2 Definitions

In this section we recall the core terminology introduced in [15] which will be necessary to properly contextualize the concept of asymptotic analysis of neural networks. To start, given an alphabet $\Sigma$ of size $\ell$, we encode an $n$-length sequence as an $n\times\ell$ matrix $\textbf{X}$ where the $i$th row of $\textbf{X}$, denoted $\textbf{x}_{i}$, is the one-hot encoding of the $i$th sequence character. With this basic notion in mind, we can define the fundamental concept of a neural sequence acceptor [15].

Definition 1 (Neural Sequence Acceptor).

A neural sequence acceptor $\hat{\mathds{1}}$ is a family of functions, parameterized by $\Theta$, of the form

\left(\hat{\mathds{1}}^{\Theta}:\textbf{X}\mapsto p\right)_{\textbf{X}\in\{0,1\}^{n\times\ell}}

where $p\in[0,1]$.

Intuitively, neural sequence acceptors are neural networks with parameters $\Theta$ which take sequences as input and return the probability that the input is part of some language.
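As a concrete (toy) illustration of both the input encoding and the acceptor interface, the sketch below one-hot encodes a string over $\Sigma=\{a,b\}$ and scores it with a randomly initialized single-layer model; the helper names and the architecture are our own illustration, not a model from [15].

```python
import numpy as np

def one_hot_encode(sequence, alphabet):
    """Encode an n-length sequence over an alphabet of size l as an n x l matrix
    whose i-th row is the one-hot vector of the i-th character."""
    index = {ch: i for i, ch in enumerate(alphabet)}
    X = np.zeros((len(sequence), len(alphabet)))
    for i, ch in enumerate(sequence):
        X[i, index[ch]] = 1.0
    return X

def toy_acceptor(X, w, b):
    """A toy neural sequence acceptor: sum the one-hot rows, apply an affine
    map and a sigmoid, and return a membership probability in [0, 1]."""
    z = w @ X.sum(axis=0) + b
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = one_hot_encode("abba", alphabet="ab")      # 4 x 2 matrix of one-hot rows
p = toy_acceptor(X, w=rng.normal(size=2), b=0.0)
print(X, p, sep="\n")                           # p is a membership probability
```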

All of the results presented from [15] are quantified by the notion of asymptotic acceptance. Very simply, in this setting we allow the magnitude of the parameters $\Theta$ of a network to grow arbitrarily large. This is not, on its own, a practical assumption; unintuitively, however, it actually leads to a more pessimistic view of the computational power of IBFP-NNs. In practice, neural networks have been observed to learn non-asymptotic strategies on certain problems [15]. However, small noisy perturbations of the activation functions of those networks during training lead to failure on the tested problems [15], which suggests that on “noisier” datasets (e.g., PDFs), asymptotic results are more likely to be realized, and are therefore not an unreasonably pessimistic or irrational assumption in practice.

Definition 2 (Asymptotic Acceptance).

Let $L$ be a language with indicator function $\mathds{1}_{L}$. A neural sequence acceptor $\hat{\mathds{1}}^{\Theta}$ is said to asymptotically accept $L$ if

\lim_{N\rightarrow\infty}\hat{\mathds{1}}^{N\Theta}=\mathds{1}_{L}.
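To make the effect of this limit concrete, the following sketch (our own illustration, not taken from [15]) scales the parameters of a single sigmoid unit by $N$ and shows its output saturating toward the indicator values 0 and 1 as $N$ grows.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A single neuron deciding "is the input greater than 0.5?" with weight w = 1
# and bias b = -0.5.  Scaling (w, b) by N drives the output toward a hard 0/1
# indicator, which is exactly the asymptotic-acceptance regime.
x = np.array([0.2, 0.8])              # 0.2 should be rejected, 0.8 accepted
for N in [1, 10, 100, 1000]:
    p = sigmoid(N * (1.0 * x - 0.5))
    print(N, np.round(p, 4))
# As N grows the outputs approach [0, 1], the indicator of "x > 0.5".
```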

Beyond the notion of acceptance, another complexity metric for an IBFP-NN arising in this asymptotic context is that of general state complexity. General state complexity captures the full amount of memory a network can employ at each stage of a computation; understanding it therefore yields insight into the network's expressive capacity. A network's general state complexity is measured across its hidden states, the intermediate representations a network employs to arrive at a final answer.

Definition 3 (Hidden State).

Let $\textbf{v}\in\mathds{R}^{k}$. The $k$-length hidden state, with respect to parameters $\theta$, is a family of functions,

\left(\textbf{h}^{\theta}_{t}:\textbf{X}\mapsto\textbf{v}_{t}\right)_{|\textbf{X}|=n}

defined for every $t\in[n]$.

For a given hidden state with fixed parameters, the configuration set is the set of values the hidden state takes as the input sequence varies; more simply, it is the set of configurations the hidden state can be in.

Definition 4 (Configuration Set).

For all $n$, the configuration set of a hidden state $\textbf{h}_{n}$ with parameters $\theta$ is defined as

M(\textbf{h}_{n}^{\theta})=\left\{\lim_{N\rightarrow\infty}\textbf{h}^{N\theta}_{n}(\textbf{X})\mid\textbf{x}_{i}\in\texttt{onehot}(\Sigma)\right\}.

Finally, the quantity of real interest for a given network is the general state complexity, which is the largest number of values a network's hidden state can take across all possible model parameters.

Definition 5 (General State Complexity).

The general state complexity of a hidden state $\textbf{h}_{n}$ is defined as the maximum fixed-state complexity over all parameters $\theta$, or,

\textarc{m}(\textbf{h}_{n})=\max_{\theta}\left|M(\textbf{h}_{n}^{\theta})\right|.

2.1 Language Theoretic

We use $\mathcal{L}(M)$ to denote the language accepted by some machine $M$, and language classes, such as the set of all regular languages, REGULAR, are set in capitalized bold text. We assume knowledge of basic formal language theory, but we define some possibly less common classes of languages. The first is the strictly $k$-local languages; fittingly, these are languages with very structured, local behavior of size $k$.

Definition 6 (Strictly $k$-local language).

Let $\Sigma$ be an alphabet, which without loss of generality does not contain the character #. A strictly $k$-local language is a set of constraints $S$ of the form,

S=\left\{s\mid s\in\big(\Sigma\cup\{\#\}\big)^{k}\right\}.

Next, we introduce Dyck Languages, which informally consist of correctly nested sequences of parentheses.

Definition 7 (Dyck Languages).

Given a bipartite set of characters $(P,\overline{P})$, the Dyck language $\mathcal{D}_{P}$ is defined by the set,

\mathcal{D}_{P}=\left\{x\in(P\cup\overline{P})^{*}\mid x\text{ is a well-balanced string of parentheses}\right\}.
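For intuition, membership in a Dyck language can be checked with a standard stack scan; the sketch below (the particular parenthesis pairing is chosen by us for illustration) tests membership in $\mathcal{D}_{2}$.

```python
def in_dyck(word, pairs):
    """Return True if `word` is a well-balanced string over the given
    open/close pairs, i.e. a member of the corresponding Dyck language."""
    opening = {o for o, _ in pairs}
    close_to_open = {c: o for o, c in pairs}
    stack = []
    for ch in word:
        if ch in opening:              # opening symbol: push it
            stack.append(ch)
        elif ch in close_to_open:      # closing symbol: must match the top
            if not stack or stack.pop() != close_to_open[ch]:
                return False
        else:
            return False               # symbol outside P and P-bar
    return not stack                   # balanced iff nothing is left open

# D_2 with the two parenthesis types "()" and "[]".
pairs = [("(", ")"), ("[", "]")]
print(in_dyck("([()])", pairs))   # True
print(in_dyck("([)]", pairs))     # False
```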

In contexts where we are only concerned with the number of distinct parenthesis types, we will write $\mathcal{D}_{n}$ as shorthand for $\mathcal{D}_{[2n]}$. Although seemingly very simple, the Chomsky–Schützenberger representation theorem shows that these languages are closely related to the context-free languages. This relationship relies on the idea of a homomorphism, which for two alphabets $\Sigma,\Delta$ is a map $h:\Sigma^{*}\mapsto\Delta^{*}$ such that $h(\epsilon)=\epsilon$ and, for any $x,y\in\Sigma^{*}$, $h(xy)=h(x)h(y)$.

Theorem 1 (Chomsky–Schützenberger Representation Theorem).

A language $L$ over $\Sigma$ is context-free if and only if there is a bipartite set of characters $(P,\overline{P})$, a regular language $R$ over $P\cup\overline{P}$, and a homomorphism $h:(P\cup\overline{P})^{*}\mapsto\Sigma^{*}$ such that $L=h(\mathcal{D}_{P}\cap R)$.

Finally, we mention the informal definitions of simplified $k$-counter machines found in [23], and the more general notion of a counter machine from [15]. First, Weiss et al. [23] loosely define a counter device as something holding a value which can be incremented by a fixed amount, decremented by a fixed amount, or compared to zero (COMP0). They then define a simplified $k$-counter machine (SKCM) as “a finite-state automaton extended with $k$-counters, where at each step of the computation each counter can be incremented, decremented or ignored in an input-dependent way, and state-transitions and accept/reject decisions can inspect the counters’ states using COMP0.” Accordingly, counter machines are simply finite automata which have access to a finite number of counting devices [15].
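As a concrete example of this machinery, the sketch below (our own illustration, not the formal construction from [23]) implements a simplified 1-counter machine recognizing $a^nb^n$: increment on each leading $a$, decrement on each $b$, inspect the counter only via comparison to zero, and accept exactly when the counter returns to zero.

```python
def accepts_anbn(word):
    """A simplified 1-counter machine for {a^n b^n : n >= 0}.
    States: 0 = reading a's, 1 = reading b's.  The single counter is inspected
    only through comparisons with zero (COMP0)."""
    state, counter = 0, 0
    for ch in word:
        if ch == "a" and state == 0:
            counter += 1              # increment on each leading 'a'
        elif ch == "b" and counter != 0:
            state = 1                 # switch to the b-reading phase
            counter -= 1              # decrement on each 'b'
        else:
            return False              # 'a' after a 'b', extra 'b', or bad symbol
    return counter == 0               # accept iff the counter is back at zero

print(accepts_anbn("aaabbb"))  # True
print(accepts_anbn("aabbb"))   # False
```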

3 Results

3.1 Convolutional Neural Networks

Networks such as Recurrent Neural Networks tend to be more colloquially associated with sequential data; however, Convolutional Neural Networks (CNNs) do have compelling uses in sequential tasks [25], owing to strengths such as their ability to learn positionally invariant features. We are not aware of any results concerning the capacity of deep convolutional networks in the context of formal language tasks; however, [15] studied a simple convolutional network with max pooling in the IBFP setting. The model of a CNN which they study is given by the following equations

\textbf{h}_{t}=\tanh\left(\textbf{W}^{h}\left(\textbf{x}_{t-k}||\cdots||\textbf{x}_{t+k}\right)+\textbf{b}^{h}\right)   (1)
\textbf{h}_{\text{pool}}=\text{maxpool}(\textbf{H})   (2)
a=\sigma(\textbf{W}^{a}\textbf{h}_{\text{pool}}+\textbf{b}^{a}).   (3)
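A minimal numpy sketch of this model (Eqs. 1–3), with randomly initialized weights and window size $k=1$, is given below; it is purely illustrative and not one of the trained networks from [15].

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cnn_acceptor(X, Wh, bh, Wa, ba, k=1):
    """1-layer convolutional acceptor: a window of 2k+1 one-hot rows is
    concatenated, passed through tanh, max-pooled over positions, and
    squashed to an acceptance probability."""
    n, l = X.shape
    padded = np.vstack([np.zeros((k, l)), X, np.zeros((k, l))])  # zero-pad ends
    H = []
    for t in range(n):
        window = padded[t:t + 2 * k + 1].reshape(-1)   # x_{t-k} || ... || x_{t+k}
        H.append(np.tanh(Wh @ window + bh))
    h_pool = np.max(np.stack(H), axis=0)                # max over positions
    return sigmoid(Wa @ h_pool + ba)

# Random parameters for an alphabet of size 2, hidden width 4, window k = 1.
rng = np.random.default_rng(0)
Wh, bh = rng.normal(size=(4, 6)), rng.normal(size=4)
Wa, ba = rng.normal(size=(1, 4)), rng.normal(size=1)
X = np.eye(2)[[0, 1, 0, 0]]                             # one-hot "abaa"
print(cnn_acceptor(X, Wh, bh, Wa, ba))
```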

Based on this model, [15] proved the following result.

Theorem 2.

Let REGULAR be the class of all regular languages, and STRICTLY-LOCAL be the class of all languages acceptable by a strictly local grammar. The following inclusions hold asymptotically,

\textbf{STRICTLY-LOCAL}\subseteq\mathcal{L}(\textsf{IBFP-1-CNN})\subset\textbf{REGULAR}.
Proof.

In the interest of space, we show only $\mathcal{L}(\textsf{IBFP-1-CNN})\subset\textbf{REGULAR}$; for the proof of the other containment see [15]. Consider the language $L$ given by the regular expression $a^{*}ba^{*}$, and towards a contradiction assume that there is an IBFP-1-CNN that accepts $L$. Consider a string $\textbf{x}\in L$ with $\textbf{x}_{i}=b$. Pick a $j$ such that $|i-j|>2k+1$, and change $\textbf{x}_{j}$ to $b$. No column in the network receives both $\textbf{x}_{i}$ and $\textbf{x}_{j}$ as input, so by the max-pooling step the network still accepts, even though the modified string contains two $b$'s and is therefore not in $L$. ∎

Even though this result has not been established for deep convolutional networks, it seems unlikely that nesting layers would yield much additional formal capacity in the IBFP setting. So it is possible that, even more generally, CNNs have some intrinsic limitations.

3.2 Recurrent Neural Networks

In contrast to Convolutional Neural Networks, Recurrent Neural Networks (RNNs) were specifically designed with sequential data in mind, and as such have been one of the most ubiquitous paradigms in fields centered around highly sequential data, such as natural language processing. Roughly speaking, for each token $\textbf{x}_{t}$ in the input, an RNN computes a state $\textbf{h}_{t}$ from the preceding state and the new token, i.e., $\textbf{h}_{t}=f(\textbf{x}_{t},\textbf{h}_{t-1})$ for some nonlinear function $f$. One of the simplest realizations of an RNN, the Simple Recurrent Neural Network (SRNN) [3], essentially just applies an affine transformation to $\textbf{x}_{t}$ and $\textbf{h}_{t-1}$ followed by a nonlinearity. The full SRNN network is defined by these update rules,

\textbf{h}_{t}=f_{h}(\textbf{W}\textbf{x}_{t}+\textbf{U}\textbf{h}_{t-1}+b_{h})   (4)
\textbf{y}_{t}=f_{y}(\textbf{W}^{y}\textbf{h}_{t}+b_{y}).   (5)
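A direct numpy transcription of these update rules (Eqs. 4–5), with tanh and sigmoid as the nonlinearities and randomly initialized weights, is shown below as a sketch rather than a trained model.

```python
import numpy as np

def srnn(X, W, U, bh, Wy, by):
    """Simple RNN: h_t = tanh(W x_t + U h_{t-1} + b_h), y_t = sigmoid(W^y h_t + b_y)."""
    h = np.zeros(U.shape[0])
    outputs = []
    for x in X:                       # one one-hot row per time step
        h = np.tanh(W @ x + U @ h + bh)
        y = 1.0 / (1.0 + np.exp(-(Wy @ h + by)))
        outputs.append(y)
    return np.array(outputs), h

rng = np.random.default_rng(0)
l, d = 2, 8                           # alphabet size, hidden width
W, U, bh = rng.normal(size=(d, l)), rng.normal(size=(d, d)), np.zeros(d)
Wy, by = rng.normal(size=(1, d)), np.zeros(1)
X = np.eye(l)[[0, 1, 1, 0]]           # one-hot "abba"
ys, h_final = srnn(X, W, U, bh, Wy, by)
print(ys.shape, h_final.shape)        # (4, 1) (8,)
```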

Although powerful, these networks have issues such as being highly susceptible to vanishing gradients [10], which motivated the development of more complex networks such as the Long Short-Term Memory (LSTM) [10] and the Gated Recurrent Unit (GRU) [1]. LSTMs improve on SRNNs by using more complicated rules to decide how information passes between states. Intuitively, LSTMs add “input”, “output”, and “forget” gates, $\textbf{i}_{t},\textbf{o}_{t},\textbf{f}_{t}$, which control how values move between memory cells – the fundamental units of the LSTM. A memory cell has state $\textbf{c}_{t}$ and input activation $\tilde{\textbf{c}}_{t}$. The input and output gates control how values enter a cell and how a cell's value affects its activation, while the forget gate controls the extent to which a value remains in a cell. Compactly, an LSTM is given by the following update rules,

\textbf{f}_{t}=\sigma(\textbf{W}^{f}\textbf{x}_{t}+\textbf{U}^{f}\textbf{h}_{t-1}+b^{f})   (6)
\textbf{i}_{t}=\sigma(\textbf{W}^{i}\textbf{x}_{t}+\textbf{U}^{i}\textbf{h}_{t-1}+b^{i})   (7)
\textbf{o}_{t}=\sigma(\textbf{W}^{o}\textbf{x}_{t}+\textbf{U}^{o}\textbf{h}_{t-1}+b^{o})   (8)
\tilde{\textbf{c}}_{t}=\tanh(\textbf{W}^{\tilde{c}}\textbf{x}_{t}+\textbf{U}^{\tilde{c}}\textbf{h}_{t-1}+b^{\tilde{c}})   (9)
\textbf{c}_{t}=\textbf{f}_{t}\odot\textbf{c}_{t-1}+\textbf{i}_{t}\odot\tilde{\textbf{c}}_{t}   (10)
\textbf{h}_{t}=\textbf{o}_{t}\odot\tanh(\textbf{c}_{t}),   (11)

where $\odot$ is element-wise multiplication. A more modern variation of the LSTM is the GRU, which employs only two gating mechanisms, making it faster and simpler to implement. GRUs have a similar motivation to LSTMs, but rather than using “input”, “output”, and “forget” gates, the GRU has a “reset” gate and an “update” gate. The “update” gate $\textbf{z}_{t}$ determines the balance of old memory and new memory used in updating the new state, while the “reset” gate $\textbf{r}_{t}$ controls how much influence the input versus the previous state has on the new state. Mathematically, the GRU is defined as follows,

\textbf{z}_{t}=\sigma(\textbf{W}^{z}\textbf{x}_{t}+\textbf{U}^{z}\textbf{h}_{t-1}+b^{z})   (12)
\textbf{r}_{t}=\sigma(\textbf{W}^{r}\textbf{x}_{t}+\textbf{U}^{r}\textbf{h}_{t-1}+b^{r})   (13)
\textbf{h}_{t}=\textbf{z}_{t}\odot\textbf{h}_{t-1}+(1-\textbf{z}_{t})\odot\tanh\left(\textbf{W}^{h}\textbf{x}_{t}+\textbf{U}^{h}(\textbf{r}_{t}\odot\textbf{h}_{t-1})+b^{h}\right).   (14)
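For reference, single-step numpy versions of the LSTM and GRU updates above (Eqs. 6–14) are sketched below with randomly initialized weights; this is an illustration of the equations, not a trained model.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, p):
    """One LSTM step (Eqs. 6-11); p is a dict of weight matrices and biases."""
    f = sigmoid(p["Wf"] @ x + p["Uf"] @ h + p["bf"])
    i = sigmoid(p["Wi"] @ x + p["Ui"] @ h + p["bi"])
    o = sigmoid(p["Wo"] @ x + p["Uo"] @ h + p["bo"])
    c_tilde = np.tanh(p["Wc"] @ x + p["Uc"] @ h + p["bc"])
    c = f * c + i * c_tilde
    return o * np.tanh(c), c

def gru_step(x, h, p):
    """One GRU step (Eqs. 12-14)."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h + p["bz"])
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h + p["br"])
    h_tilde = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h) + p["bh"])
    return z * h + (1 - z) * h_tilde

rng = np.random.default_rng(0)
l, d = 2, 4                            # alphabet size, hidden width
def init(names):
    return {k: rng.normal(size=(d, l)) if k.startswith("W") else
                rng.normal(size=(d, d)) if k.startswith("U") else np.zeros(d)
            for k in names}
lstm_p = init(["Wf", "Uf", "bf", "Wi", "Ui", "bi", "Wo", "Uo", "bo", "Wc", "Uc", "bc"])
gru_p = init(["Wz", "Uz", "bz", "Wr", "Ur", "br", "Wh", "Uh", "bh"])
x, h, c = np.eye(l)[0], np.zeros(d), np.zeros(d)
h_lstm, c_new = lstm_step(x, h, c, lstm_p)
h_gru = gru_step(x, h, gru_p)
print(h_lstm.shape, c_new.shape, h_gru.shape)   # (4,) (4,) (4,)
```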

Despite being newer and quite similar in design, GRUs, as we will see, are not only less powerful than LSTMs but in fact only as powerful as SRNNs.

In [23], Weiss et al. show how an LSTM can emulate an SKCM. To summarize their construction: the choice to increment ($+1$) or decrement ($-1$) is made in $\tilde{\textbf{c}}_{t}$ via the $\tanh$ function, which naturally makes this decision as a function of the input token and the previous state. A cell $\textbf{c}_{t}$ can maintain the counter when $\textbf{i}_{t}=1$ and $\textbf{f}_{t}=1$. For comparison operations, the counter is completely visible in $\textbf{h}_{t}$. Extending this, Merrill [15] shows that, asymptotically, $\mathcal{L}(\textsf{IBFP-LSTM})\subseteq\textbf{CL}$. Together, these results give us the theorem below.
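The counting idea can be made concrete with hand-set (rather than learned) weights: below we saturate the input and forget gates at 1 and drive $\tilde{\textbf{c}}_{t}$ to $+1$ on $a$ and $-1$ on $b$, so a single cell tracks #$a$ − #$b$. This is our own illustration of the idea in [23], not their exact construction.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def counting_lstm(word, scale=50.0):
    """A 1-cell LSTM with hand-chosen weights whose cell state counts
    (#a - #b).  A large `scale` saturates the gates and the tanh,
    mimicking the asymptotic / IBFP regime."""
    c = 0.0
    for ch in word:
        x = np.array([1.0, 0.0]) if ch == "a" else np.array([0.0, 1.0])
        f = sigmoid(scale)                        # forget gate pinned to ~1
        i = sigmoid(scale)                        # input gate pinned to ~1
        c_tilde = np.tanh(scale * (x[0] - x[1]))  # ~+1 on 'a', ~-1 on 'b'
        c = f * c + i * c_tilde
    return c

print(round(counting_lstm("aaabb"), 3))   # ~1.0  (three a's minus two b's)
print(round(counting_lstm("aabb"), 3))    # ~0.0  (balanced)
```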

Theorem 3.

Let SKCL be the class of all simplified $k$-counter languages, and CL be the class of all counter languages; then, asymptotically,

\textbf{SKCL}\subseteq\mathcal{L}(\textsf{IBFP-LSTM})\subseteq\textbf{CL}.

On the other hand, Weiss et al. [23] additionally discuss why a GRU is not capable of emulating a $k$-counter machine. Their argument, succinctly, is that the update rule for $\textbf{h}_{t}$ forces the state values to be bounded between $-1$ and $1$; in the IBFP setting, it is not possible to perform unbounded counting in this representation. For the same reason, an SRNN is also not capable of emulating a $k$-counter machine. Merrill further shows that both GRUs and SRNNs are limited to asymptotically accepting regular languages [15].

Theorem 4.

The following relationship holds asymptotically,

\mathcal{L}(\textsf{IBFP-GRU})=\mathcal{L}(\textsf{IBFP-SRNN})=\textbf{REGULAR}.

Correspondingly, Merrill shows that the state complexity of a memory cell captures these separations.

Theorem 5.

Let $\textbf{h}_{n}^{g}$ and $\textbf{h}_{n}^{s}$ be GRU and SRNN cells, respectively. Then, asymptotically,

\textarc{m}(\textbf{h}_{n}^{g}),\ \textarc{m}(\textbf{h}_{n}^{s})\in O(1).
Theorem 6.

Let $\textbf{c}_{n}\in\mathds{R}^{k}$ be an LSTM cell state. Then, asymptotically,

\textarc{m}(\textbf{c}_{n})\in O(n^{k}).

Overall, we have seen evidence that LSTMs are intrinsically related to counter machines in terms of function and expressibility. Meanwhile, GRUs and SRNNs are closely related to regular languages, giving them less formal power. However, both GRUs and SRNNs are far simpler to implement, which in some constrained settings could be beneficial.

3.3 Transformer Networks

Transformer Networks [22] are a new and increasingly popular architecture designed with the goal of improving machine translation. One practical advantage of Transformers is that they are far more parallelizable than RNNs and learn long-term dependencies more efficiently. Introduced by Vaswani et al. in their paper “Attention Is All You Need” [22], Transformers completely forgo classical recurrent connections in favor of the notion of attention. Attention is a mechanism which enables a network to selectively recall a specific encoder state based on observed information. This process is modeled as a database-esque retrieval involving a query $\textbf{q}\in\mathds{R}^{\ell}$, an $n\times\ell$ matrix of key vectors $\textbf{K}$, and an $n\times d$ matrix of value vectors $\textbf{V}$. However, unlike traditional database retrieval, this process is “soft”, meaning that for a given query we do not get the exact value back, but instead a sum of the values weighted by how similar each key is to the query. The compact notation expressing this idea is simply,

\textsf{attention}(\textbf{q},\textbf{K},\textbf{V})=\textsf{softmax}(\textbf{q}\textbf{K}^{\top})\textbf{V}.
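In code, this soft retrieval is only a few lines; the sketch below follows the (unscaled) formula above, with random keys and values standing in for learned representations.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention(q, K, V):
    """Soft retrieval: weight each value row by the similarity of its key to q."""
    weights = softmax(q @ K.T)     # (n,) attention weights, summing to 1
    return weights @ V             # (d,) weighted sum of the value rows

rng = np.random.default_rng(0)
n, l, d = 5, 3, 4                  # sequence length, key width, value width
q, K, V = rng.normal(size=l), rng.normal(size=(n, l)), rng.normal(size=(n, d))
print(attention(q, K, V).shape)    # (4,)
```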

Transformers use multiple attention heads in parallel, while simultaneously linearly projecting the queries, keys, and values with different learned transformations. This enables them to leverage different trends across different embeddings. At the end, the output from each head is concatenated. This defines the notion of multihead attention [22],

\textbf{a}_{i}=\textsf{attention}(\textbf{W}^{q_{i}}\textbf{q},\textbf{W}^{K_{i}}\textbf{K},\textbf{W}^{V_{i}}\textbf{V})   (15)
\textsf{multihead}(\textbf{q},\textbf{K},\textbf{V})=\textbf{a}_{1}||\cdots||\textbf{a}_{n}.   (16)

Finally, the Transformer network studied in [15] is a variant adapted for language processing tasks [16], defined by,

\textbf{q}_{t},\textbf{k}_{t},\textbf{v}_{t}=\textbf{W}^{q_{t}}\textbf{x}_{t},\textbf{W}^{k_{t}}\textbf{x}_{t},\textbf{W}^{v_{t}}\textbf{x}_{t}
\textbf{h}_{t}=\sigma(\textbf{W}^{h}\,\textsf{multihead}(\textbf{q},\textbf{K},\textbf{V})).
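Putting the pieces together, a single position of such an attention-only layer can be sketched as follows, with random projection matrices and two heads; this is an illustrative rendering of the equations above rather than the exact network from [15, 16].

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention(q, K, V):
    return softmax(q @ K.T) @ V

def multihead(q, K, V, heads):
    """Project q, K, V per head, attend, and concatenate the results (Eqs. 15-16)."""
    outs = [attention(Wq @ q, K @ Wk.T, V @ Wv.T) for (Wq, Wk, Wv) in heads]
    return np.concatenate(outs)

def transformer_acceptor_step(X, t, Wq, Wk, Wv, heads, Wh):
    """h_t for one position: queries/keys/values are linear maps of the inputs,
    combined by multihead attention and a sigmoid output layer."""
    Q, K, V = X @ Wq.T, X @ Wk.T, X @ Wv.T
    m = multihead(Q[t], K, V, heads)
    return 1.0 / (1.0 + np.exp(-(Wh @ m)))

rng = np.random.default_rng(0)
n, l, d = 4, 2, 4                      # length, alphabet size, model width
Wq, Wk, Wv = (rng.normal(size=(d, l)) for _ in range(3))
heads = [tuple(rng.normal(size=(d, d)) for _ in range(3)) for _ in range(2)]
Wh = rng.normal(size=(1, 2 * d))
X = np.eye(l)[[0, 1, 0, 0]]            # one-hot "abaa"
print(transformer_acceptor_step(X, 1, Wq, Wk, Wv, heads, Wh))
```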

Surprisingly, despite Transformers taking the spotlight in many top-performing natural language processing architectures, they are very weak asymptotically. A glaring flaw of the Transformer architecture is its positional invariance. Vaswani et al. fix this by augmenting the network with positional encodings; however, they use periodic functions, which repeat asymptotically [22]. As such, in this setting Transformer Networks have quite limited power.

Theorem 7.

Asymptotically the following containment holds,

\textbf{REGULAR}\not\subset\mathcal{L}(\textsf{IBFP-TN}).

Interestingly, Transformer Networks do have high state complexity. In particular, it is known that $\textarc{m}(\textbf{V}_{n})\in 2^{\Theta(n)}$ [15]. This may explain their practical success, since in a completely finite setting one can of course add positional encodings via the trick in [22]. Consequently, this result may drastically underestimate their potential for empirical language modeling.

3.4 Memory Augmented Neural Networks

All of the previously described networks maintain bizarre implicit representations of the data, which always seem to impede their formal capacity under the IBFP assumptions. One might wonder: what if we just used constructs from formal language theory explicitly as part of the model? This question motivates Memory Augmented Neural Networks (MANNs), which integrate explicit memory constructs such as stacks or tapes into neural network architectures. A drawback is that implementing these constructs in a fully differentiable fashion adds a considerable number of new hyperparameters to optimize and to analyse. For this reason, the literature on the power of MANNs, both formal and empirical, is relatively sparse compared to the other methods. However, as we will see, these models have been used to perform interesting formal language tasks and have very high representational power.

Recently, there have been a few proposals for ways to augment networks with differentiable stacks, the most relevant of which are [8], [13], and [20]. Grefenstette et al. empirically demonstrated that memory-augmented LSTMs outperform vanilla LSTMs on certain transduction tasks [8]. Meanwhile, Joulin et al. [13] studied the ability of MANNs to learn variations of languages such as $a^{n}b^{n}$; however (unsurprisingly, given Theorem 3), they did not observe any superior performance of the augmented network over LSTMs. Lastly, [20] studied the ability of MANNs to learn $\mathcal{D}_{>1}$ languages. Due to its relevance to our work, and its relative similarity to the stack model described in [13], we introduce Suzgun et al.'s Stack-RNN below and remark on the expressiveness of a more abstract variant.

The aptly named Stack-RNN [20] integrates a differentiable stack into a recurrent neural network by the following update rules.

\tilde{\textbf{h}}_{t-1}=\textbf{h}_{t-1}+\textbf{W}_{sh}\textbf{s}^{(0)}_{t-1}   (17)
\textbf{h}_{t}=\tanh(\textbf{W}^{x}\textbf{x}_{t}+\textbf{U}^{h}\tilde{\textbf{h}}_{t-1}+b^{h})   (18)
\textbf{y}_{t}=\sigma(\textbf{W}^{y}\textbf{h}_{t}+b^{y})   (19)
\textbf{a}_{t}=\textsf{softmax}\left(\textbf{W}^{a}\textbf{h}_{t}\right)   (20)
\textbf{n}_{t}=\sigma(\textbf{W}^{n}\textbf{h}_{t})   (21)
\textbf{s}^{(0)}_{t}=\textbf{a}^{(0)}_{t}\textbf{n}_{t}+\textbf{a}^{(1)}_{t}\textbf{s}^{(1)}_{t-1}   (22)
\textbf{s}^{(i)}_{t}=\textbf{a}^{(0)}_{t}\textbf{s}^{(i-1)}_{t-1}+\textbf{a}^{(1)}_{t}\textbf{s}^{(i+1)}_{t-1}   (23)

The stack is maintained as $\textbf{s}_{t}=\textbf{s}^{(0)}_{t}\textbf{s}^{(1)}_{t}\cdots\textbf{s}^{(k)}_{t}$. The vector $\textbf{a}_{t}=[\textbf{a}_{t}^{(0)}\ \textbf{a}_{t}^{(1)}]$ encodes the PUSH and POP operations. For example, if $\textbf{a}_{t}^{(0)}=1$ (PUSH), then all the stack elements $\textbf{s}_{t}^{(i)}$ are pushed down (Eq. 23) and the topmost element is set to the new value $\textbf{n}_{t}$. Likewise, if $\textbf{a}_{t}^{(1)}=1$ (POP), then each stack element is shifted up by one position, toward the top. Moreover, the hidden state equations (Eqs. 17, 18) have been adapted to take the topmost stack element into consideration.
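The stack update itself (Eqs. 22–23) amounts to a small amount of code; the sketch below applies a soft PUSH/POP to a fixed-depth stack of scalar elements. It is our own minimal rendering, which omits the recurrent controller and treats the element below the bottom of the stack as zero.

```python
import numpy as np

def stack_update(s, a, n_t):
    """Soft stack update in the spirit of Eqs. 22-23.
    s   : current stack, s[0] is the top element
    a   : [a_push, a_pop], nonnegative weights from a softmax
    n_t : new value to place on top when pushing"""
    new_s = np.zeros_like(s)
    # Top element: the pushed value vs. the element revealed by a pop.
    new_s[0] = a[0] * n_t + a[1] * s[1]
    # Interior elements: shifted down by a push, shifted up by a pop.
    for i in range(1, len(s) - 1):
        new_s[i] = a[0] * s[i - 1] + a[1] * s[i + 1]
    new_s[-1] = a[0] * s[-2]          # bottom: only a push can reach it
    return new_s

s = np.array([3.0, 2.0, 1.0, 0.0])    # stack written top-to-bottom
print(stack_update(s, a=[1.0, 0.0], n_t=5.0))  # hard PUSH -> [5. 3. 2. 1.]
print(stack_update(s, a=[0.0, 1.0], n_t=5.0))  # hard POP  -> [2. 1. 0. 0.]
```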

Unsurprisingly, Neural Stacks have been quite successful at certain formal language tasks, and similar abstract models which implement stack interfaces, as studied in [15], are known to inherit a great deal of expressive power from the addition of a stack. Suzgun et al. show that Neural Stacks can learn $\mathcal{D}_{1}$, $\mathcal{D}_{2}$, $\mathcal{D}_{3}$, and even $\mathcal{D}_{6}$ languages with 99%–100% accuracy (often closer to 100%) on both training and test sets [20]. Merrill [15] shows that, for his abstract definition of a neural stack, the following holds.

Theorem 8.

Let $\textbf{S}_{n}\in\mathds{R}^{nk}$ be a neural stack with a feed-forward controller. Then,

\textarc{m}\left(\textbf{S}_{n}\right)\in 2^{\Theta(n)}.

Beyond stacks, one can also augment neural networks with some notion of a memory tape to obtain Neural Turing Machines (NTMs). Originally proposed in [6], Neural Turing Machines are notoriously complex to implement and difficult to train. In fact, most follow-up work to [6] has focused on improving the original NTM design to make it more usable [26] [14] [24] [7] [9]. The greater model flexibility also makes NTMs harder to analyse asymptotically. However, it seems believable to us that in the IBFP setting, a result like

\mathcal{L}(\textsf{IBFP-NTM})\subseteq\textbf{CONTEXT-SENSITIVE}

could hold, and that NTMs would have exponential state complexity. Due to the complexity of a full NTM, we refrain from defining it formally here. However, Suzgun et al. [20] define what they call a Baby-NTM, which implements five operations on a “tape” represented by a real vector, in a Neural Stack-like fashion. For a tape $[a,b,c,d,e]$, these operations are,

\texttt{ROTATE-RIGHT}:[e,a,b,c,d]
\texttt{ROTATE-LEFT}:[b,c,d,e,a]
\texttt{NO-OP}:[a,b,c,d,e]
\texttt{POP-RIGHT}:[0,a,b,c,d]
\texttt{POP-LEFT}:[b,c,d,e,0].
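These five tape operations are simple shifts and rotations of a vector; the minimal sketch below is our own rendering of just the operations, not Suzgun et al.'s implementation of the Baby-NTM itself.

```python
import numpy as np

def tape_ops(tape):
    """The five Baby-NTM tape operations applied to a 1-D tape vector."""
    return {
        "ROTATE-RIGHT": np.roll(tape, 1),                     # [e, a, b, c, d]
        "ROTATE-LEFT":  np.roll(tape, -1),                    # [b, c, d, e, a]
        "NO-OP":        tape.copy(),                          # [a, b, c, d, e]
        "POP-RIGHT":    np.concatenate(([0.0], tape[:-1])),   # [0, a, b, c, d]
        "POP-LEFT":     np.concatenate((tape[1:], [0.0])),    # [b, c, d, e, 0]
    }

tape = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # stands for [a, b, c, d, e]
for name, result in tape_ops(tape).items():
    print(f"{name:12s} {result}")
```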

We also omit the specific implementation details of the Baby-NTM, as they are quite similar to those of the Neural Stack; the core point we want to convey is that this architecture is more complicated and more flexible than the Neural Stack. Despite this, [20] observed that the Neural Stack had slightly better performance when learning $\mathcal{D}_{1}$, $\mathcal{D}_{2}$, $\mathcal{D}_{3}$, and $\mathcal{D}_{6}$ languages. This performance gap is unlikely to be representative of the model's asymptotic power, but it does highlight some of the empirical pitfalls of augmenting a model with more powerful constructs.

Overall, MANNs certainly show a great deal of promise for learning complex formal languages. However, due to their relative novelty, they are far less well understood and less mature than the other models we discussed in this paper. As such, there is certainly a great deal of room for future work leveraging these architectures.

4 Summary

Understanding how well different neural network architectures decide membership for different classes of formal languages is a fundamental problem that is closely connected to our aspirations in the SafeDocs project. We surveyed a range of results, both theoretical and empirical, quantifying the expressiveness of many state-of-the-art network architectures, including Convolutional Neural Networks, Recurrent Neural Networks, Transformer Networks, and Memory Augmented Neural Networks. We saw that networks such as CNNs, SRNNs, and GRUs are easier to implement and train, but formally may lack the capacity to learn complex languages. We then saw that the well-known Transformer Network is asymptotically weak, but possesses high state complexity, which may explain its practical success. Finally, we explored augmenting networks with various notions of differentiable memory, and observed that this drastically increases their power at the cost of usability. Looking forward, we expect our current work to uncover similar results concerning the generative capability of different networks with respect to the Chomsky Hierarchy.

References

  • [1] Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR, abs/1406.1078, 2014.
  • [2] Sreerupa Das, C Lee Giles, and Guo-Zheng Sun. Learning context-free grammars: Capabilities and limitations of a recurrent neural network with an external stack memory. In Proceedings of The Fourteenth Annual Conference of Cognitive Science Society, 1992.
  • [3] Jeffrey L. Elman. Finding structure in time. Cognitive Science, 14:179–211, 1990.
  • [4] Patrick C Fischer, Albert R Meyer, and Arnold L Rosenberg. Counter machines and counter languages. Mathematical systems theory, 2(3):265–283, 1968.
  • [5] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron C. Courville, and Yoshua Bengio. Generative adversarial networks. ArXiv, abs/1406.2661, 2014.
  • [6] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines. ArXiv, abs/1410.5401, 2014.
  • [7] Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwinska, Sergio Gómez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, Adrià Puigdomènech Badia, Karl Moritz Hermann, Yori Zwols, Georg Ostrovski, Adam Cain, Helen King, C. Summerfield, Phil Blunsom, Koray Kavukcuoglu, and Demis Hassabis. Hybrid computing using a neural network with dynamic external memory. Nature, 538:471–476, 2016.
  • [8] Edward Grefenstette, Karl Moritz Hermann, Mustafa Suleyman, and Phil Blunsom. Learning to transduce with unbounded memory. CoRR, abs/1506.02516, 2015.
  • [9] Caglar Gulcehre, Sarath Chandar, Kyunghyun Cho, and Yoshua Bengio. Dynamic neural Turing machine with continuous and discrete addressing schemes. Neural Comput., 30(4):857–884, April 2018.
  • [10] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Comput., 9(8):1735–1780, Nov 1997.
  • [11] Steffen Hölldobler, Yvonne Kalinke, and Helko Lehmann. Designing a counter: Another case study of dynamics and activation landscapes in recurrent networks. In KI, 1997.
  • [12] Gerhard Jäger and James Rogers. Formal language theory: refining the Chomsky hierarchy. Philosophical Transactions of the Royal Society B: Biological Sciences, 367(1598):1956–1970, 2012.
  • [13] Armand Joulin and Tomas Mikolov. Inferring algorithmic patterns with stack-augmented recurrent nets. In NIPS, 2015.
  • [14] Karol Kurach, Marcin Andrychowicz, and Ilya Sutskever. Neural random access machines. ERCIM News, 2016, 2015.
  • [15] William Merrill. Sequential neural networks as automata. ArXiv, abs/1906.01615, 2019.
  • [16] Alec Radford. Improving language understanding by generative pre-training. 2018.
  • [17] Paul Rodriguez and Janet Wiles. Recurrent neural networks can learn to implement symbol-sensitive counting. In M. I. Jordan, M. J. Kearns, and S. A. Solla, editors, Advances in Neural Information Processing Systems 10, pages 87–93. MIT Press, 1998.
  • [18] Hava T Siegelmann and Eduardo D Sontag. Analog computation via neural networks. Theoretical Computer Science, 131(2):331 – 360, 1994.
  • [19] Mark Steijvers and Peter Grünwald. A recurrent network that performs a context-sensitive prediction task. In Proceedings of the 18th Annual Conference of the Cognitive Science Society, pages 335–339, 1996.
  • [20] Mirac Suzgun, Sebastian Gehrmann, Yonatan Belinkov, and Stuart M. Shieber. Memory-augmented recurrent neural networks can learn generalized Dyck languages. ArXiv, abs/1911.03329, 2019.
  • [21] Bradley Tonkes and Janet Wiles. Learning a context-free task with a recurrent neural network: An analysis of stability. In In Proceedings of the Fourth Biennial Conference of the Australasian Cognitive Science Society. Citeseer, 1997.
  • [22] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, 2017.
  • [23] Gail Weiss, Yoav Goldberg, and Eran Yahav. On the practical computational power of finite precision rnns for language recognition. ArXiv, abs/1805.04908, 2018.
  • [24] Greg Yang. Lie access neural Turing machine. ArXiv, abs/1602.08671, 2016.
  • [25] Wenpeng Yin, Katharina Kann, Mo Yu, and Hinrich Schütze. Comparative study of CNN and RNN for natural language processing. CoRR, abs/1702.01923, 2017.
  • [26] Wojciech Zaremba and Ilya Sutskever. Reinforcement learning neural Turing machines. CoRR, abs/1505.00521, 2015.