
The Limitations of Limited Context for Constituency Parsing

Yuchen Li
Carnegie Mellon University
[email protected]
   Andrej Risteski
Carnegie Mellon University
[email protected]
Abstract

Incorporating syntax into neural approaches in NLP has a multitude of practical and scientific benefits. For instance, a language model that is syntax-aware is likely to be able to produce better samples; even a discriminative model like BERT with a syntax module could be used for core NLP tasks like unsupervised syntactic parsing. Rapid progress in recent years was arguably spurred on by the empirical success of the Parsing-Reading-Predict architecture of (Shen et al., 2018b), later simplified by the Ordered-Neuron LSTM of (Shen et al., 2019). Most notably, this is the first time neural approaches were able to successfully perform unsupervised syntactic parsing (evaluated by various metrics like F1 score).

However, even heuristic (much less fully mathematical) understanding of why and when these architectures work is lagging severely behind. In this work, we answer representational questions raised by the architectures in (Shen et al., 2018b; 2019), as well as some transition-based syntax-aware language models (Dyer et al., 2016): what kind of syntactic structure can current neural approaches to syntax represent? Concretely, we ground this question in the sandbox of probabilistic context-free grammars (PCFGs), and identify a key aspect of the representational power of these approaches: the amount and directionality of context that the predictor has access to when forced to make parsing decisions. We show that with limited context (either bounded or unidirectional), there are PCFGs for which these approaches cannot represent the max-likelihood parse; conversely, if the context is unlimited, they can represent the max-likelihood parse of any PCFG.

1 Introduction

Neural approaches have been steadily making their way into NLP in recent years. By and large, however, the neural techniques that have been scaled up the most and receive widespread usage do not explicitly try to encode discrete structure that is natural to language, e.g. syntax. The reason for this is perhaps not surprising: neural models have largely achieved substantial improvements in unsupervised settings, with BERT (Devlin et al., 2019) being the de-facto method for unsupervised pre-training in most NLP settings. On the other hand, unsupervised syntactic tasks, e.g. unsupervised syntactic parsing, have long been known to be very difficult (Htut et al., 2018). However, since incorporating syntax has been shown to improve language modeling (Kim et al., 2019b) as well as natural language inference (Chen et al., 2017; Pang et al., 2019; He et al., 2020), syntactic parsing remains important even in the current era when large pre-trained models, like BERT (Devlin et al., 2019), are available.

Arguably, the breakthrough works in unsupervised constituency parsing in a neural manner were (Shen et al., 2018b; 2019), achieving F1 scores of 42.8 and 49.4 on the WSJ Penn Treebank dataset (Htut et al., 2018; Shen et al., 2019). Both of these architectures, however (especially Shen et al., 2018b), are quite intricate, and it is difficult to evaluate their representational power (i.e. what kinds of structure they can recover). Moreover, as subsequent, more thorough evaluations show (Kim et al., 2019b; a), these methods still have a rather large performance gap with the oracle binary tree (which is the best binary parse tree according to F1 score) — raising the question of what is missing in these methods.

We theoretically answer both questions raised in the prior paragraph. We quantify the representational power of two major frameworks in neural approaches to syntax: learning a syntactic distance (Shen et al., 2018b; a; 2019) and learning to parse through sequential transitions (Dyer et al., 2016; Chelba, 1997). To formalize our results, we consider the well-established sandbox of probabilistic context-free grammars (PCFGs). Namely, we ask:

When is a neural model based on a syntactic distance or transitions able to represent the max-likelihood parse of a sentence generated from a PCFG?

We focus on a crucial “hyperparameter” common to practical implementations of both families of methods that turns out to govern the representational power: the amount and type of context the model is allowed to use when making its predictions. Briefly, for every position $t$ in the sentence, syntactic distance models learn a distance $d_t$ to the previous token — the tree is then inferred from these distances; transition-based models iteratively construct the parse tree by deciding, at each position $t$, what operations to perform on a partial parse up to token $t$. A salient feature of both is the context, that is, which tokens is $d_t$ a function of (correspondingly, which tokens can the choice of operations at token $t$ depend on)?

We show that when the context is either bounded (that is, $d_t$ only depends on a bounded window around the $t$-th token) or unidirectional (that is, $d_t$ only considers the tokens to the left of the $t$-th token), there are PCFGs for which no distance metric (correspondingly, no algorithm to choose the sequence of transitions) works. On the other hand, if the context is unbounded in both directions then both methods work: that is, for any parse, we can design a distance metric (correspondingly, a sequence of transitions) that recovers it.

This is of considerable importance: in practical implementations the context is either bounded (e.g. in Shen et al., 2018b, the distance metric is parametrized by a convolutional kernel with a constant width) or unidirectional (e.g. in Shen et al., 2019, the distance metric is computed by an LSTM, which performs a left-to-right computation).

This formally confirms a conjecture of Htut et al. (2018), who suggested that because these models commit to parsing decisions in a left-to-right fashion and are trained as part of a language model, it may be difficult for them to capture sufficiently complex syntactic dependencies. Our techniques are fairly generic and seem amenable to analyzing other approaches to syntax. Finally, while the existence of a particular PCFG that is problematic for these methods doesn't necessarily imply that the difficulties will carry over to real-life data, the PCFGs used in our proofs closely track a linguistic intuition about syntactic structures that are difficult to infer: the parse depends on words that come much later in the sentence.

2 Overview of Results

We consider several neural architectures that have shown success in various syntactic tasks, most notably unsupervised constituency parsing and syntax-aware language modeling. The general framework these architectures fall under is as follows: to parse a sentence $W = w_1 w_2 \dots w_n$ with a trained neural model, the sentence $W$ is input into the model, which outputs $o_t$ at each step $t$, and finally all the outputs $\{o_t\}_{t=1}^{n}$ are utilized to produce the parse.

Given unbounded time and space resources, by a seminal result of Siegelmann and Sontag (1992), an RNN implementation of this framework is Turing complete. In practice it is common to restrict the form of the output $o_t$ in some way. In this paper, we consider the two most common approaches, in which $o_t$ is a real number representing a syntactic distance (Section 2.1) (Shen et al., 2018b; a; 2019) or a sequence of parsing operations (Section 2.2) (Chelba, 1997; Chelba and Jelinek, 2000; Dyer et al., 2016). We proceed to describe our results for each architecture in turn.

2.1 Syntactic distance

Syntactic distance-based neural parsers train a neural network to learn a distance for each pair of adjacent words, depending on the context surrounding the pair of words under consideration. The distances are then used to induce a tree structure (Shen et al., 2018b; a).

For a sentence $W = w_1 w_2 \dots w_n$, the syntactic distance between $w_{t-1}$ and $w_t$ ($2 \leq t \leq n$) is defined as $d_t = d(w_{t-1}, w_t \mid c_t)$, where $c_t$ is the context that $d_t$ takes into consideration (note that this is not a conditional distribution; we use this notation for convenience). We will show that restricting the surrounding context, either in directionality or in size, results in poor representational power, while full context confers essentially perfect representational power with respect to PCFGs.

Concretely, if the context is full, we show:

Theorem (Informal, full context).

For a sentence $W$ generated by any PCFG, if the computation of $d_t$ has as context the full sentence and the position index under consideration, i.e. $c_t = (W, t)$ and $d_t = d(w_{t-1}, w_t \mid c_t)$, then $d_t$ can induce the maximum likelihood parse of $W$.

On the flipside, if the context is unidirectional (i.e. unbounded left-context from the start of the sentence, and even possibly with a bounded look-ahead), the representational power becomes severely impoverished:

Theorem (Informal, limitation of left-to-right parsing via syntactic distance).

There exists a PCFG $G$ such that for any distance measure $d_t$ whose computation incorporates only bounded context in at least one direction (left or right), e.g.

\[ c_t = (w_0, w_1, \dots, w_{t+L'}), \qquad d_t = d(w_{t-1}, w_t \mid c_t), \]

the probability that $d_t$ induces the max likelihood parse is arbitrarily low.

In practice, for computational efficiency, parametrizations of syntactic distances fall under the above assumptions of restricted context (Shen et al., 2018b). This puts the ability of these models to learn a complex PCFG syntax into considerable doubt. For formal definitions, see Section 4.2. For formal theorem statements and proofs, see Section 5.

Subsequently we consider ON-LSTM, an architecture proposed by Shen et al. (2019) improving upon their previous work (Shen et al., 2018b), which is also based on learning a syntactic distance; in (Shen et al., 2019), however, the distances are obtained by reducing the values of a carefully structured master forget gate (see Section 6). While we show that ON-LSTM can in principle losslessly represent any parse tree (Theorem 3), calculating the gate values in a left-to-right fashion (as is done in practice) is subject to the same limitations as the syntactic distance approach:

Theorem (Informal, limitation of syntactic distance estimation based on ON-LSTM).

There exists a PCFG $G$ for which the probability that the syntactic distance converted from an ON-LSTM induces the max likelihood parse is arbitrarily low.

For a formal statement, see Section 6 and in particular Theorem 4.

2.2 Transition-based parsing

In principle, the output $o_t$ at each position $t$ of a left-to-right neural model for syntactic parsing need not be restricted to a real-valued distance or a carefully structured vector. It can also be a combinatorial structure — e.g. a sequence of transitions (Chelba, 1997; Chelba and Jelinek, 2000; Dyer et al., 2016). We adopt a simplification of the neural parameterization in (Dyer et al., 2016) (see Definition 4.7).

With full context, Dyer et al. (2016) describes an algorithm to find a sequence of transitions to represent any parse tree, via a “depth-first, left-to-right traversal” of the tree. On the other hand, without full context, we prove that transition-based parsing suffers from the same limitations:

Theorem (Informal, limitation of transition-based parsing without full context).

There exists a PCFG $G$, such that for any learned transition-based parser with bounded context in at least one direction (left or right), the probability that it returns the max likelihood parse is arbitrarily low.

For a formal statement, see Section 7, and in particular Theorem 5.

Remark.

There is no immediate connection between the syntactic distance-based approaches (including ON-LSTM) and the transition-based parsing framework, so the limitations of transition-based parsing do not directly imply the stated negative results for syntactic distance or ON-LSTM, and vice versa.

2.3 The counterexample family

Figure 1: The parse trees of the two sentences “I drink coffee with milk.” and “I drink coffee with friends.”. Their only difference occurs at their very last words, but their parses differ at some earlier words in each sentence.

Most of our theorems proving limitations of bounded and unidirectional context are based on a PCFG family (Definition 2.1) which draws inspiration from a property of natural language already suggested in (Htut et al., 2018): later words in a sentence can force different syntactic structures earlier in the sentence. For example, consider the two sentences “I drink coffee with milk.” and “I drink coffee with friends.” Their only difference occurs at their very last words, but their parses differ at some earlier words in each sentence, too, as shown in Figure 1.

To formalize this intuition, we define the following PCFG.

Definition 2.1 (Right-influenced PCFG).

Let $m \geq 2$, $L' \geq 1$ be positive integers. The grammar $G_{m,L'}$ has starting symbol $S$, other non-terminals

\[ A_k, B_k, A_k^l, A_k^r, B_k' \text{ for all } k \in \{1, 2, \dots, m\}, \]

and terminals

\[ a_i \text{ for all } i \in \{1, 2, \dots, m+1+L'\}, \qquad c_j \text{ for all } j \in \{1, 2, \dots, m\}. \]

The rules of the grammar are

\begin{align*}
S &\rightarrow A_k B_k, \ \forall k \in \{1, 2, \dots, m\} &&\text{w. prob. } 1/m\\
A_k &\rightarrow A_k^l A_k^r &&\text{w. prob. } 1\\
A_k^l &\rightarrow^{*} a_1 a_2 \dots a_k &&\text{w. prob. } 1\\
A_k^r &\rightarrow^{*} a_{k+1} a_{k+2} \dots a_{m+1} &&\text{w. prob. } 1\\
B_k &\rightarrow^{*} B_k' c_k &&\text{w. prob. } 1\\
B_k' &\rightarrow^{*} a_{m+2} a_{m+3} \dots a_{m+1+L'} &&\text{w. prob. } 1
\end{align*}

in which $\rightarrow^{*}$ means that the left-hand side expands into the right-hand side through a sequence of rules that conform to the requirements of the Chomsky normal form (CNF, Definition 4.4). Hence the grammar $G_{m,L'}$ is in CNF.

The language of this grammar is

\[ L(G_{m,L'}) = \{ l_k = a_1 a_2 \dots a_{m+1+L'} c_k : 1 \leq k \leq m \}. \]
Figure 2: The structure of the parse tree of the string $l_k = a_1 a_2 \dots a_{m+1+L'} c_k \in L(G_{m,L'})$. Note that any $l_{k_1}$ and $l_{k_2}$ are almost the same except for the last token: the prefix $a_1 a_2 \dots a_{m+1+L'}$ is shared among all strings in $L(G_{m,L'})$. However, their parses differ with respect to where $A_k$ is split. The last token $c_k$ is unique to $l_k$ and hence determines the correct parse according to $G_{m,L'}$.

The parse of an arbitrary $l_k$ is shown in Figure 2. Each $l_k$ corresponds to a unique parse determined by the choice of $k$. The structure of this PCFG is such that the parsing algorithms we consider, which proceed in a “left-to-right” fashion on $l_k$, cannot infer the syntactic structure of $a_1 a_2 \dots a_{m+1}$ any better than by randomly guessing one of the $m$ possibilities before processing the last token $c_k$. This is the main intuition behind Theorems 2 and 5.
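To make the construction concrete, the following minimal Python sketch (the function and token names are ours, purely for illustration) enumerates the strings of $G_{m,L'}$ for small $m$ and $L'$, together with the split position inside $a_1 \dots a_{m+1}$ that the max-likelihood parse of each $l_k$ requires: all strings share the same prefix, yet each requires a different split.

```python
# Illustrative sketch of the right-influenced PCFG G_{m,L'} (Definition 2.1).
# It enumerates the strings l_k and the split position (between a_k and a_{k+1})
# that the max-likelihood parse of l_k requires.

def right_influenced_language(m: int, l_prime: int):
    """Yield (k, tokens, split) for every string l_k in L(G_{m,L'})."""
    prefix = [f"a{i}" for i in range(1, m + 1 + l_prime + 1)]   # a_1 ... a_{m+1+L'}
    for k in range(1, m + 1):
        tokens = prefix + [f"c{k}"]     # l_k = a_1 ... a_{m+1+L'} c_k
        # A_k -> A_k^l A_k^r splits a_1 ... a_{m+1} between a_k and a_{k+1}
        yield k, tokens, k

if __name__ == "__main__":
    m, l_prime = 3, 1
    for k, tokens, split in right_influenced_language(m, l_prime):
        print(f"l_{k}: {' '.join(tokens)}   (A_{k} splits after token a_{split})")
    # Every l_k shares the prefix a_1 ... a_{m+1+L'}; only the final token c_k
    # reveals which of the m splits is correct.
```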

Remark.

While our theorems focus on the limitation of “left-to-right” parsing, a symmetric argument implies the same limitation of “right-to-left” parsing. Thus, our claim is that unidirectional context (in either direction) limits the expressive power of parsing models.

3 Related Works

Neural models for parsing were first successfully implemented in supervised settings, e.g. (Vinyals et al., 2015; Dyer et al., 2016; Shen et al., 2018a). Unsupervised tasks remained seemingly out of reach, until the proposal of the Parsing-Reading-Predict Network (PRPN) by Shen et al. (2018b), whose performance was thoroughly verified by extensive experiments in (Htut et al., 2018). The follow-up paper (Shen et al., 2019), introducing the ON-LSTM architecture, radically simplified the architecture in (Shen et al., 2018b), while still ultimately attempting to fit a distance metric with the help of carefully designed master forget gates. Subsequent work by Kim et al. (2019a) departed from the usual way neural techniques are integrated in NLP, with great success: they proposed a neural parameterization for the EM algorithm for learning a PCFG, but in a manner that leverages semantic information as well — achieving a large improvement on unsupervised parsing tasks. (By virtue of not relying on bounded or unidirectional context, the Compound PCFG of Kim et al. (2019a) eschews the techniques in our paper: by employing a bidirectional LSTM inference network in the process of constructing a tree given a sentence, the parsing is no longer “left-to-right”.)

In addition to constituency parsing, dependency parsing is another common syntactic parsing task, but for our analyses of the ability of various approaches to represent the max-likelihood parse of sentences generated from PCFGs, we focus on constituency parsing. Moreover, it is important to note that there is another line of work aiming to probe the ability of models trained without explicit syntactic consideration (e.g. BERT) to nevertheless discover some (rudimentary) syntactic elements (Bisk and Hockenmaier, 2015; Linzen et al., 2016; Choe and Charniak, 2016; Kuncoro et al., 2018; Williams et al., 2018; Goldberg, 2019; Htut et al., 2019; Hewitt and Manning, 2019; Reif et al., 2019). However, to date, such probes have not been able to extract parse trees achieving scores close to the oracle binarized trees on standard benchmarks (Kim et al., 2019b; a).

Methodologically, our work is closely related to a long line of work aiming to characterize the representational power of neural models (e.g. RNNs, LSTMs) through the lens of formal languages and formal models of computation. Some works of this flavor are empirical in nature: e.g. LSTMs have been shown to possess stronger abilities to recognize some context-free and even some context-sensitive languages, compared with simple RNNs (Gers and Schmidhuber, 2001; Suzgun et al., 2019) or GRUs (Weiss et al., 2018; Suzgun et al., 2019). Other results are theoretical in nature: e.g. Siegelmann and Sontag (1992)'s proof that, with unbounded precision and unbounded time complexity, RNNs are Turing-complete; related results investigate RNNs with bounded precision and computation time (Weiss et al., 2018), as well as bounded memory (Merrill, 2019; Hewitt et al., 2020). Our work contributes to this line of work, but focuses on the task of syntactic parsing instead.

4 Preliminaries

In this section, we define some basic concepts and introduce the architectures we will consider.

4.1 Probabilistic context-free grammar

First, we recall several definitions concerning formal languages, in particular probabilistic context-free grammars:

Definition 4.1 (Probabilistic context-free grammar (PCFG)).

Formally, a PCFG (Chomsky, 1956) is a 5-tuple $G = (\Sigma, N, S, R, \Pi)$ in which $\Sigma$ is the set of terminals, $N$ is the set of non-terminals, $S \in N$ is the start symbol, $R$ is the set of production rules of the form $r = (r_L \rightarrow r_R)$, where $r_L \in N$, $r_R$ is of the form $B_1 B_2 \dots B_m$, $m \in \mathbb{Z}_+$, and $\forall i \in \{1, 2, \dots, m\}$, $B_i \in (\Sigma \cup N)$. Finally, $\Pi: R \mapsto [0,1]$ is the rule probability function, in which for any

\[ r = (A \rightarrow B_1 B_2 \dots B_m) \in R, \]

$\Pi(r)$ is the conditional probability

\[ P(r_R = B_1 B_2 \dots B_m \mid r_L = A). \]
Definition 4.2 (Parse tree).

Let $T_G$ denote the set of parse trees that $G$ can derive. Each $t \in T_G$ is associated with $\texttt{yield}(t) \in \Sigma^{*}$, the sequence of terminals composed of the leaves of $t$, and $P_T(t) \in [0,1]$, the probability of the parse tree, defined by the product of the probabilities of the rules in the derivation of $t$.

Definition 4.3 (Language and sentence).

The language of $G$ is

\[ L(G) = \{ s \in \Sigma^{*} : \exists t \in T_G, \texttt{yield}(t) = s \}. \]

Each $s \in L(G)$ is called a sentence in $L(G)$, and is associated with the set of parses $T_G(s) = \{ t \in T_G \mid \texttt{yield}(t) = s \}$, the set of max likelihood parses $\operatorname*{arg\,max}_{t \in T_G(s)} P_T(t)$, and its probability $P_S(s) = \sum_{t \in T_G(s)} P_T(t)$.

Definition 4.4 (Chomsky normal form (CNF)).

A PCFG $G = (\Sigma, N, S, R, \Pi)$ is in CNF (Chomsky, 1959) if we require, in addition to Definition 4.1, that each rule $r \in R$ is of the form $A \rightarrow B_1 B_2$ where $B_1, B_2 \in N \setminus \{S\}$; $A \rightarrow a$ where $a \in \Sigma$, $a \neq \epsilon$; or $S \rightarrow \epsilon$, which is only allowed if the empty string $\epsilon \in L(G)$.

Every PCFG $G$ can be converted into a PCFG $G'$ in CNF such that $L(G) = L(G')$ (Hopcroft et al., 2006).
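As a concrete companion to Definitions 4.1–4.3, the following minimal Python sketch (the toy grammar and function names are ours, not taken from any cited work) represents a PCFG in CNF as a dictionary of weighted rules and samples a parse tree together with its probability $P_T(t)$ and yield.

```python
import random

# Minimal PCFG-in-CNF sketch (Definitions 4.1-4.3): rules map a non-terminal to a
# list of (right-hand side, probability) pairs; an RHS is either a pair of
# non-terminals or a single terminal string.
RULES = {
    "S":  [(("NP", "VP"), 1.0)],
    "NP": [("I", 0.5), ("coffee", 0.5)],
    "VP": [(("V", "NP"), 1.0)],
    "V":  [("drink", 1.0)],
}

def sample_tree(symbol="S"):
    """Sample a parse tree and its probability P_T(t) by expanding rules top-down."""
    rhs_list = RULES[symbol]
    rhs, prob = random.choices(rhs_list, weights=[p for _, p in rhs_list])[0]
    if isinstance(rhs, tuple):                         # A -> B1 B2
        left, p_left = sample_tree(rhs[0])
        right, p_right = sample_tree(rhs[1])
        return (symbol, left, right), prob * p_left * p_right
    return (symbol, rhs), prob                         # A -> a

def yield_of(tree):
    """The terminal sequence yield(t) of a parse tree."""
    if len(tree) == 2 and isinstance(tree[1], str):
        return [tree[1]]
    return yield_of(tree[1]) + yield_of(tree[2])

if __name__ == "__main__":
    tree, p = sample_tree()
    print(tree, p, " ".join(yield_of(tree)))
```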

4.2 Syntactic distance

The Parsing-Reading-Predict Network (PRPN) (Shen et al., 2018b) is one of the leading approaches to unsupervised constituency parsing. The parsing network (which computes the parse tree, hence the only part we focus on in our paper) is a convolutional network that computes the syntactic distances $d_t = d(w_{t-1}, w_t)$ (defined in Section 2.1) based on the past $L$ words. A deterministic greedy tree induction algorithm is then used to produce a parse tree as follows. First, we split the sentence $w_1 \dots w_n$ into two constituents, $w_1 \dots w_{t-1}$ and $w_t \dots w_n$, where $t \in \operatorname*{arg\,max}\{d_t\}_{t=2}^{n}$; these two constituents form the left and right subtrees. We recursively repeat this procedure for the newly created constituents. An algorithmic form of this procedure is included as Algorithm 1 in Appendix A.

Note that, due to the deterministic nature of the tree-induction process, the ability of PRPN to learn a PCFG is completely contingent upon learning a good syntactic distance.
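For concreteness, here is a minimal Python sketch of the greedy tree-induction procedure just described (see also Algorithm 1 in Appendix A); the function name and the toy distances are ours, and the code only illustrates the recursion rather than reproducing the PRPN implementation.

```python
# Greedy tree induction from syntactic distances (cf. Algorithm 1, Appendix A).
# tokens: w_1 ... w_n;  dist[i] plays the role of d_{i+1} = d(w_i, w_{i+1}),
# i.e. dist has length n - 1 and dist[i] separates tokens[i] and tokens[i+1] (0-indexed).

def induce_tree(tokens, dist):
    """Recursively split at the position with the largest syntactic distance."""
    if len(tokens) == 1:
        return tokens[0]
    # argmax over the distances between adjacent tokens of this span
    split = max(range(1, len(tokens)), key=lambda i: dist[i - 1])
    left = induce_tree(tokens[:split], dist[:split - 1])
    right = induce_tree(tokens[split:], dist[split:])
    return (left, right)

if __name__ == "__main__":
    tokens = ["I", "drink", "coffee", "with", "milk"]
    # toy distances: the largest value separates "I" from the rest, so that split comes first
    dist = [4, 3, 2, 1]
    print(induce_tree(tokens, dist))
    # -> ('I', ('drink', ('coffee', ('with', 'milk')))), a binarization of Figure 1's parse
```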

4.3 The ordered neuron architecture

Building upon the idea of representing the syntactic information with a real-valued distance measure at each position, a simple extension is to associate each position with a learned vector, and then use the vector for syntactic parsing. The ordered-neuron LSTM (ON-LSTM, Shen et al., 2019) proposes that the nodes that are closer to the root in the parse tree generate a longer span of terminals, and therefore should be less frequently “forgotten” than nodes that are farther away from the root. The difference in the frequency of forgetting is captured by a carefully designed master forget gate vector $\tilde{f}$, as shown in Figure 3 (in Appendix B). Formally:

Definition 4.5 (Master forget gates, Shen et al., 2019).

Given the input sentence $W = w_1 w_2 \dots w_n$ and a trained ON-LSTM, running the ON-LSTM on $W$ gives the master forget gates, which are a sequence of $D$-dimensional vectors $\{\tilde{f}_t\}_{t=1}^{n}$, in which at each position $t$, $\tilde{f}_t = \tilde{f}_t(w_1, \dots, w_t) \in [0,1]^D$. Moreover, let $\tilde{f}_{t,j}$ denote the $j$-th dimension of $\tilde{f}_t$. The ON-LSTM architecture requires that $\tilde{f}_{t,1} = 0$, $\tilde{f}_{t,D} = 1$, and

\[ \forall i < j, \quad \tilde{f}_{t,i} \leq \tilde{f}_{t,j}. \]

When parsing a sentence, the real-valued master forget gate vector $\tilde{f}_t$ at each position $t$ is reduced to a single real number representing the syntactic distance $d_t$ at position $t$ (see (1)) (Shen et al., 2018b). The syntactic distances are then used to obtain a parse.

4.4 Transition-based parsing

In addition to outputting a single real-valued distance or a vector at each position $t$, a left-to-right model can also parse a sentence by outputting a sequence of “transitions” at each position $t$, an idea proposed in some traditional parsing approaches (Sagae and Lavie, 2005; Chelba, 1997; Chelba and Jelinek, 2000), and also in some more recent neural parameterizations (Dyer et al., 2016).

We introduce several items of notation:

  • $z_i^t$: the $i$-th transition performed when reading in $w_t$, the $t$-th token of the sentence $W = w_1 w_2 \dots w_n$.

  • $N_t$: the number of transitions performed between reading in the token $w_t$ and reading in the next token $w_{t+1}$.

  • $Z_t$: the sequence of transitions after reading in the prefix $w_1 w_2 \dots w_t$ of the sentence, $Z_t = \{ (z_1^j, z_2^j, \dots, z_{N_j}^j) \mid j = 1..t \}$.

  • $Z$: the parse of the sentence $W$, i.e. $Z = Z_n$.

We base our analysis on the approach introduced in the parsing version of (Dyer et al., 2016); that work additionally proposes a generator version with generator transitions. For simplicity, we analyze the simplest form: we only allow the model to return one parse, composed of the parser transitions, for a given input sentence. Note that this simplified variant still confers full representational power in the “full context” setting (see Section 7).

Definition 4.6 (Transition-based parser).

A transition-based parser uses a stack (initialized to empty) and an input buffer (initialized with the sentence $w_1 \dots w_n$). At each position $t$, based on a context $c_t$, the parser outputs a sequence of parsing transitions $\{z_i^t\}_{i=1}^{N_t}$, where each $z_i^t$ can be one of the transitions in Definition 4.7. The parsing stops when the stack contains one single constituent and the buffer is empty.

Definition 4.7 (Parser transitions, Dyer et al., 2016).

A parsing transition can be one of the following three types:

  • NT(X): pushes a non-terminal X onto the stack.

  • SHIFT: removes the first terminal from the input buffer and pushes it onto the stack.

  • REDUCE: pops from the stack until an open non-terminal is encountered, then pops this non-terminal, assembles everything popped to form a new constituent, labels this new constituent with this non-terminal, and finally pushes this new constituent onto the stack.

In Appendix Section C, we provide an example of parsing the sentence “I drink coffee with milk” using the set of transitions given by Definition 4.7.
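To make the stack/buffer mechanics of Definitions 4.6 and 4.7 concrete, here is a minimal Python sketch (our own illustrative code, not the neural parameterization of Dyer et al., 2016), executed on one possible transition sequence for “I drink coffee with milk”, obtained by a depth-first, left-to-right traversal of its parse tree.

```python
# Minimal executor for the parser transitions of Definition 4.7.
# The stack holds open non-terminals (e.g. "(NP") and completed items.

def is_open(item):
    """Open non-terminals start with "(" and are not yet closed with ")"."""
    return item.startswith("(") and not item.endswith(")")

def execute(tokens, transitions):
    stack, buffer = [], list(tokens)
    for action in transitions:
        if action == "SHIFT":
            stack.append(buffer.pop(0))              # move the first terminal to the stack
        elif action == "REDUCE":
            children = []
            while not is_open(stack[-1]):            # pop completed items
                children.append(stack.pop())
            label = stack.pop()[1:]                   # pop the open non-terminal itself
            stack.append(f"({label} {' '.join(reversed(children))})")
        else:                                         # an NT(X) action
            stack.append("(" + action[3:-1])          # push an open non-terminal, e.g. "(NP"
    assert not buffer and len(stack) == 1
    return stack[0]

if __name__ == "__main__":
    tokens = "I drink coffee with milk".split()
    transitions = ["NT(S)", "NT(NP)", "NT(N)", "SHIFT", "REDUCE", "REDUCE",
                   "NT(VP)", "NT(V)", "SHIFT", "REDUCE",
                   "NT(NP)", "NT(NP)", "NT(N)", "SHIFT", "REDUCE", "REDUCE",
                   "NT(PP)", "NT(P)", "SHIFT", "REDUCE",
                   "NT(N)", "SHIFT", "REDUCE", "REDUCE", "REDUCE", "REDUCE", "REDUCE"]
    print(execute(tokens, transitions))
    # -> (S (NP (N I)) (VP (V drink) (NP (NP (N coffee)) (PP (P with) (N milk)))))
```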

The different context specifications and the corresponding representational powers of the transition-based parser are discussed in Section 7.

5 Representational Power of Neural Syntactic Distance Methods

In this section we formalize the results on syntactic distance-based methods. Since the tree induction algorithm always generates a binary tree, we consider only PCFGs in Chomsky normal form (CNF) (Definition 4.4) so that the max likelihood parse of a sentence is also a binary tree structure.

To formalize the notion of “representing” a PCFG, we introduce the following definition:

Definition 5.1 (Representing PCFG with syntactic distance).

Let $G$ be any PCFG in Chomsky normal form. A syntactic distance function $d$ is said to be able to $p$-represent $G$ if for a set of sentences in $L(G)$ whose total probability is at least $p$, $d$ can correctly induce the tree structure of the max likelihood parse of these sentences without ambiguity.

Remark.

Ambiguities could occur when, for example, there exists $t$ such that $d_t = d_{t+1}$. In this case, the tree induction algorithm would have to break ties when determining the local structure for $w_{t-1} w_t w_{t+1}$. We preclude this possibility in Definition 5.1.

In the least restrictive setting, the whole sentence $W$, as well as the position index $t$, can be taken into consideration when determining each $d_t$. We prove that under this setting, there is a syntactic distance measure that can represent any PCFG.

Theorem 1 (Full context).

Let $c_t = (W, t)$. For each PCFG $G$ in Chomsky normal form, there exists a syntactic distance measure $d_t = d(w_{t-1}, w_t \mid c_t)$ that can 1-represent $G$.

Proof.

For any sentence $s = w_1 w_2 \dots w_n \in L(G)$, let $T$ be its max likelihood parse tree. Since $G$ is in Chomsky normal form, $T$ is a binary tree. We will describe an assignment of $\{d_t : 2 \leq t \leq n\}$ such that their order matches the level at which the branches split in $T$. Specifically, $\forall t \in [2, n]$, let $a_t$ denote the lowest common ancestor of $w_{t-1}$ and $w_t$ in $T$, and let $d'_t$ denote the shortest distance between $a_t$ and the root of $T$. Finally, let $d_t = n - d'_t$. With this assignment, within every constituent the largest distance occurs exactly at the split made at its root, so the greedy tree induction algorithm (Algorithm 1) recovers the splits of $T$ top-down; hence $\{d_t : 2 \leq t \leq n\}$ induces $T$. ∎
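The construction in the proof is straightforward to instantiate. The following Python sketch (illustrative code with our own naming; trees are nested 2-tuples) computes $d_t = n - \mathrm{depth}(\mathrm{LCA}(w_{t-1}, w_t))$ from a binary parse tree; feeding these distances to the greedy induction procedure of Section 4.2 recovers the tree.

```python
# Distances from a binary parse tree, following the proof of Theorem 1:
# d_t = n - depth(LCA(w_{t-1}, w_t)).  Trees are nested 2-tuples with string leaves.

def leaves(tree):
    return [tree] if isinstance(tree, str) else leaves(tree[0]) + leaves(tree[1])

def lca_depths(tree, depth=0, acc=None):
    """Depths of the nodes separating each adjacent leaf pair, in left-to-right order."""
    if acc is None:
        acc = []
    if isinstance(tree, str):
        return acc
    lca_depths(tree[0], depth + 1, acc)
    acc.append(depth)   # this node is the LCA of the last leaf of tree[0] and the first of tree[1]
    lca_depths(tree[1], depth + 1, acc)
    return acc

def distances(tree):
    n = len(leaves(tree))
    return [n - depth for depth in lca_depths(tree)]    # d_2, ..., d_n

if __name__ == "__main__":
    tree = ("I", ("drink", ("coffee", ("with", "milk"))))   # a binarization of Figure 1's parse
    print(distances(tree))   # [5, 4, 3, 2]: the split closest to the root gets the largest distance
```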

Remark.

Since any PCFG can be converted to Chomsky normal form (Hopcroft et al., 2006), Theorem 1 implies that given the whole sentence and the position index as the context, the syntactic distance has sufficient representational power to capture any PCFG. It does not state, however, that the whole sentence and the position are the minimal contextual information needed for representability, nor does it address training (i.e. optimization) issues.

On the flipside, we show that restricting the context even mildly can considerably decrease the representational power. Namely, we show that if the context is bounded even in a single direction (to the left or to the right), there are PCFGs on which any syntactic distance will perform poorly. (In Theorem 2 we prove the more typical case, i.e. unbounded left context and bounded right context; the other case, i.e. bounded left context and unbounded right context, can be proved symmetrically.) Note that in the implementation of (Shen et al., 2018b) the context only considers a bounded window to the left.

Theorem 2 (Limitation of left-to-right parsing via syntactic distance).

Let $w_0 = \langle S \rangle$ be the sentence start symbol. Let the context

\[ c_t = (w_0, w_1, \dots, w_{t+L'}). \]

$\forall \epsilon > 0$, there exists a PCFG $G$ in Chomsky normal form, such that any syntactic distance measure $d_t = d(w_{t-1}, w_t \mid c_t)$ cannot $\epsilon$-represent $G$.

Proof.

Let $m > 1/\epsilon$ be a positive integer. Consider the PCFG $G_{m,L'}$ in Definition 2.1.

For any $k \in [m]$, consider the string $l_k \in L(G_{m,L'})$. Note that in the parse tree of $l_k$, the rule $S \rightarrow A_k B_k$ is applied. Hence, $a_k$ and $a_{k+1}$ are the unique pair of adjacent terminals in $a_1 a_2 \dots a_{m+1}$ whose lowest common ancestor is the closest to the root in the parse tree of $l_k$. Then, in order for the syntactic distance metric $d$ to induce the correct parse tree for $l_k$, $d_{k+1}$ (the distance between $a_k$ and $a_{k+1}$) must be the unique maximum in $\{d_t : 2 \leq t \leq m+1\}$.

However, $d$ is restricted to be of the form

\[ d_t = d(w_{t-1}, w_t \mid w_0, w_1, \dots, w_{t+L'}). \]

Note that $\forall\, 1 \leq k_1 < k_2 \leq m$, the first $m+1+L'$ tokens of $l_{k_1}$ and $l_{k_2}$ are the same, which implies that the inferred syntactic distances

\[ \{d_t : 2 \leq t \leq m+1\} \]

are the same for $l_{k_1}$ and $l_{k_2}$ at each position $t$. Thus, it is impossible for $d$ to induce the correct parse tree for both $l_{k_1}$ and $l_{k_2}$. Hence, $d$ is correct on at most one $l_k \in L(G_{m,L'})$, which corresponds to probability at most $1/m < \epsilon$. Therefore, $d$ cannot $\epsilon$-represent $G_{m,L'}$. ∎
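The counting argument above can also be checked numerically. The following self-contained Python sketch (the token names and the stand-in distance function are purely illustrative) instantiates $G_{m,L'}$ for small $m$ and $L'$: any distance computed only from the prefix predicts the same split for every $l_k$, and is therefore correct on at most one of the $m$ strings.

```python
# Numerical check of the argument in the proof of Theorem 2 for G_{m,L'} (Definition 2.1).
# Because every l_k shares the prefix a_1 ... a_{m+1+L'}, any distance that sees only
# w_1 ... w_{t+L'} assigns the same d_2, ..., d_{m+1} to every l_k.

m, L_prime = 4, 1

def sentence(k):
    """l_k = a_1 ... a_{m+1+L'} c_k (0-indexed list; w_t is tokens[t-1])."""
    return [f"a{i}" for i in range(1, m + 1 + L_prime + 1)] + [f"c{k}"]

def restricted_distance(context):
    """Stand-in for any d(w_{t-1}, w_t | w_0, ..., w_{t+L'}): an arbitrary deterministic function."""
    return (hash(tuple(context)) % 1009) / 1009.0

correct = 0
for k in range(1, m + 1):
    tokens = sentence(k)
    # d_t, 2 <= t <= m+1, computed only from the prefix w_1 ... w_{t+L'}
    d = {t: restricted_distance(tokens[: t + L_prime]) for t in range(2, m + 2)}
    predicted = max(d, key=d.get)
    # the max-likelihood parse of l_k splits A_k between a_k and a_{k+1}, i.e. at t = k + 1
    correct += (predicted == k + 1)

print(f"correct on {correct} of {m} strings")   # at most 1, i.e. probability <= 1/m
```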

Remark.

In the counterexample, there are only $m$ possible parse structures for the prefix $a_1 a_2 \dots a_{m+1}$. Hence, the proved fact that the probability of being correct is at most $1/m$ means that under the restrictions of unbounded look-back and bounded look-ahead, the distance cannot do better than random guessing for this grammar.

Remark.

The above Theorem 2 formalizes the intuition discussed in (Htut et al., 2018), outlining an intrinsic limitation of only considering bounded context in one direction. Indeed, for the PCFG constructed in the proof, the failure is a consequence of the restricted context, not of the fact that we are using a distance-based parser.

Note that as a corollary of the above theorem, if there is no context ($c_t = \texttt{null}$) or the context is both bounded and unidirectional, i.e.

\[ c_t = w_{t-L} w_{t-L+1} \dots w_{t-1} w_t, \]

then there is a PCFG that cannot be $\epsilon$-represented by any such $d$.

6 Representational Power of the Ordered Neuron Architecture

In this section, we formalize the results characterizing the representational power of the ON-LSTM architecture. The master forget gates of the ON-LSTM, $\{\tilde{f}_t\}_{t=2}^{n}$ with each $\tilde{f}_t \in [0,1]^D$, encode the hierarchical structure of a parse tree, and Shen et al. (2019) proposes to carry out unsupervised constituency parsing via a reduction from the gate vectors to syntactic distances by setting:

\[ \hat{d}_t^f = D - \sum_{j=1}^{D} \tilde{f}_{t,j} \quad \text{for } t = 2..n. \tag{1} \]

First we show that the gates in ON-LSTM in principle form a lossless representation of any parse tree.

Theorem 3 (Lossless representation of a parse tree).

For any sentence $W = w_1 w_2 \dots w_n$ with parse tree $T$ in any PCFG in Chomsky normal form, there exist a dimensionality $D \in \mathbb{Z}_+$ and a sequence of vectors $\{\tilde{f}_t\}_{t=2}^{n}$ with each $\tilde{f}_t \in [0,1]^D$, such that the estimated syntactic distances via (1) induce the structure of $T$.

Proof.

By Theorem 1, there is a syntactic distance measure $\{d_t\}_{t=2}^{n}$ that induces the structure of $T$ (such that $\forall t, d_t \neq d_{t+1}$).

For each $t = 2..n$, set $\hat{d}_t = k$ if $d_t$ is the $k$-th smallest entry in $\{d_t\}_{t=2}^{n}$, breaking ties arbitrarily. Then each $\hat{d}_t \in [1, n-1]$, and $\{\hat{d}_t\}_{t=2}^{n}$ also induces the structure of $T$.

Let $D = n-1$. For each $t = 2..n$, let $\tilde{f}_t = (0, \dots, 0, 1, \dots, 1)$, whose lower $\hat{d}_t$ dimensions are 0 and higher $D - \hat{d}_t$ dimensions are 1. Then,

\[ \hat{d}_t^f = D - \sum_{j=1}^{D} \tilde{f}_{t,j} = D - (D - \hat{d}_t) = \hat{d}_t. \]

Therefore, the calculated $\{\hat{d}_t^f\}_{t=2}^{n}$ induces the structure of $T$. ∎
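The construction in this proof, combined with the reduction (1), can be spelled out in a few lines. The Python sketch below (our own illustration, not the ON-LSTM implementation of Shen et al., 2019) builds the cumulative gate vectors from integer distances and verifies that (1) recovers them.

```python
# Theorem 3 construction: encode integer distances d_hat_t in [1, n-1] as monotone
# master-forget-gate vectors, then recover them via equation (1).

def gates_from_distances(d_hat, D):
    """f~_t has its lowest d_hat_t dimensions equal to 0 and the remaining D - d_hat_t equal to 1."""
    return [[0.0] * d + [1.0] * (D - d) for d in d_hat]

def distances_from_gates(gates):
    """Equation (1): d_hat^f_t = D - sum_j f~_{t,j}."""
    D = len(gates[0])
    return [D - sum(f) for f in gates]

if __name__ == "__main__":
    d_hat = [4, 3, 2, 1]     # ranks of the distances for "I drink coffee with milk" (n = 5)
    D = 4                     # D = n - 1
    gates = gates_from_distances(d_hat, D)
    print(gates)                          # each row is non-decreasing, matching Definition 4.5's ordering
    print(distances_from_gates(gates))    # -> [4.0, 3.0, 2.0, 1.0], recovering d_hat
```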

Although Theorem 3 shows that the master forget gates can perfectly represent any parse tree, a left-to-right computation of the gates can be proved to be unable to return the correct parse with high probability. In the actual implementation in (Shen et al., 2019), the (real-valued) master forget gate vectors $\{\tilde{f}_t\}_{t=1}^{n}$ are produced by feeding the input sentence $W = w_1 w_2 \dots w_n$ to a model trained with a language modeling objective. In other words, $\tilde{f}_{t,j}$ is calculated as a function of $w_1, \dots, w_t$, rather than of the entire sentence.

As such, this left-to-right parser is subject to similar limitations as in Theorem 2:

Theorem 4 (Limitation of syntactic distance estimation based on ON-LSTM).

For any $\epsilon > 0$, there exists a PCFG $G$ in Chomsky normal form, such that the syntactic distance measure calculated with (1), $\hat{d}_t^f$, cannot $\epsilon$-represent $G$.

Proof.

Since by Definition 4.5, $\tilde{f}_{t,j}$ is a function of $w_1, \dots, w_t$, the estimated syntactic distance $\hat{d}_t^f$ is also a function of $w_1, \dots, w_t$. By Theorem 2, even with unbounded look-back context $w_1, \dots, w_t$, there exists a PCFG for which the probability that $\hat{d}_t^f$ induces the correct parse is arbitrarily low. ∎

7 Representational Power of Transition-Based Parsing

In this section, we analyze a transition-based parsing framework inspired by (Dyer et al., 2016; Chelba and Jelinek, 2000; Chelba, 1997).

Again, we first note that “full context” confers full representational power. Namely, using the terminology of Definition 4.6, we let the context $c_t$ at each position $t$ be the whole sentence $W$ and the position index $t$. Note that any parse tree can be generated by a sequence of transitions defined in Definition 4.7. Indeed, Dyer et al. (2016) describes an algorithm to find such a sequence of transitions via a “depth-first, left-to-right traversal” of the tree.

Proceeding to limited context: in the setting of typical left-to-right parsing, the context $c_t$ consists of all current and past tokens $\{w_j\}_{j=1}^{t}$ and all previous parses $\{(z_1^j, \dots, z_{N_j}^j)\}_{j=1}^{t}$. We again prove an even stronger negative result, where we allow an optional look-ahead of $L'$ input tokens to the right.

Theorem 5 (Limitation of transition-based parsing without full context).

For any $\epsilon > 0$, there exists a PCFG $G$ in Chomsky normal form, such that for any learned transition-based parser (Definition 4.6) based on context

\[ c_t = \left( \{w_j\}_{j=1}^{t+L'}, \; \{(z_1^j, \dots, z_{N_j}^j)\}_{j=1}^{t} \right), \]

the sum of the probabilities of the sentences in $L(G)$ for which the parser returns the maximum likelihood parse is less than $\epsilon$.

Proof.

Let $m > 1/\epsilon$ be a positive integer. Consider the PCFG $G_{m,L'}$ in Definition 2.1.

Note that $\forall k$, $S \rightarrow A_k B_k$ is applied to yield the string $l_k$. Then in the parse tree of $l_k$, $a_k$ and $a_{k+1}$ are the unique pair of adjacent terminals in $a_1 a_2 \dots a_{m+1}$ whose lowest common ancestor is the closest to the root. Thus, each $l_k$ requires a different sequence of transitions within the first $m+1$ input tokens, i.e. $\{z_i^t\}_{i \geq 1, \, 1 \leq t \leq m+1}$.

For each $w \in L(G_{m,L'})$, before the last token $w_{m+2+L'}$ is processed, based on the common prefix $w_1 w_2 \dots w_{m+1+L'} = a_1 a_2 \dots a_{m+1+L'}$, it is equally likely that $w = l_k$ for each $k$, with probability $1/m$ each.

Moreover, when processing $w_{m+1}$, the bounded look-ahead window of size $L'$ does not allow access to the final input token $w_{m+2+L'} = c_k$.

Thus, $\forall\, 1 \leq k_1 < k_2 \leq m$, it is impossible for the parser to return the correct parse tree for both $l_{k_1}$ and $l_{k_2}$ without ambiguity. Hence, the parse is correct on at most one $l_k \in L(G)$, which corresponds to probability at most $1/m < \epsilon$. ∎

8 Conclusion

In this work, we considered the representational power of two frameworks for constituency parsing prominent in the literature, based on learning a syntactic distance and on learning a sequence of iterative transitions to build the parse tree — in the sandbox of PCFGs. In particular, we show that if the context for calculating distances or deciding on transitions is limited in at least one direction (which is typically the case in practice for existing architectures), there are PCFGs for which no good distance metric or sequence of transitions can be chosen to construct the maximum likelihood parse.

This limitation was already suspected in (Htut et al., 2018) as a potential failure mode of leading neural approaches like (Shen et al., 2018b; 2019), and we show formally that this is the case. The PCFGs with this property track the intuition that bounded-context methods will have issues when the parse at a certain position depends heavily on later parts of the sentence.

The conclusions thus suggest re-focusing our attention on methods like (Kim et al., 2019a), which have enjoyed greater success on tasks like unsupervised constituency parsing and do not fall within the paradigm analyzed in our paper. A question of definite further interest is how to augment models that have been successfully scaled up (e.g. BERT) in a principled manner with syntactic information, such that they can capture syntactic structure (like PCFGs). Another question of immediate importance is to understand the interaction between the syntactic and semantic modules in neural architectures — information is shared between such modules in various successful architectures, e.g. (Dyer et al., 2016; Shen et al., 2018b; 2019; Kim et al., 2019a), and the relative pros and cons of doing this are not well understood. Finally, our paper focuses purely on representational power, and does not consider algorithmic and statistical aspects of training. Since every model architecture comes with its own optimization and generalization considerations, and natural language data necessitates modeling the interaction between syntax and semantics, these aspects are beyond the scope of our analysis in the controlled sandbox of PCFGs, and are interesting directions for future work.

References

  • Bisk and Hockenmaier [2015] Yonatan Bisk and Julia Hockenmaier. Probing the linguistic strengths and limitations of unsupervised grammar induction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1395–1404, Beijing, China, July 2015. Association for Computational Linguistics. doi: 10.3115/v1/P15-1135. URL https://www.aclweb.org/anthology/P15-1135.
  • Chelba [1997] Ciprian Chelba. A structured language model. In 35th Annual Meeting of the Association for Computational Linguistics and 8th Conference of the European Chapter of the Association for Computational Linguistics, pages 498–500, Madrid, Spain, July 1997. Association for Computational Linguistics. doi: 10.3115/976909.979681. URL https://www.aclweb.org/anthology/P97-1064.
  • Chelba and Jelinek [2000] Ciprian Chelba and Frederick Jelinek. Structured language modeling. Computer Speech & Language, 14(4):283 – 332, 2000. ISSN 0885-2308. doi: https://doi.org/10.1006/csla.2000.0147. URL http://www.sciencedirect.com/science/article/pii/S0885230800901475.
  • Chen et al. [2017] Qian Chen, Xiaodan Zhu, Zhen-Hua Ling, Si Wei, Hui Jiang, and Diana Inkpen. Enhanced LSTM for natural language inference. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1657–1668, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1152. URL https://www.aclweb.org/anthology/P17-1152.
  • Choe and Charniak [2016] Do Kook Choe and Eugene Charniak. Parsing as language modeling. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2331–2336, Austin, Texas, November 2016. Association for Computational Linguistics. doi: 10.18653/v1/D16-1257. URL https://www.aclweb.org/anthology/D16-1257.
  • Chomsky [1956] N. Chomsky. Three models for the description of language. IRE Transactions on Information Theory, 2(3):113–124, 1956. doi: 10.1109/TIT.1956.1056813.
  • Chomsky [1959] Noam Chomsky. On certain formal properties of grammars. Information and Control, 2(2):137 – 167, 1959. ISSN 0019-9958. doi: https://doi.org/10.1016/S0019-9958(59)90362-6. URL http://www.sciencedirect.com/science/article/pii/S0019995859903626.
  • Devlin et al. [2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1423. URL https://www.aclweb.org/anthology/N19-1423.
  • Dyer et al. [2016] Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. Recurrent neural network grammars. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 199–209, San Diego, California, June 2016. Association for Computational Linguistics. doi: 10.18653/v1/N16-1024. URL https://www.aclweb.org/anthology/N16-1024.
  • Gers and Schmidhuber [2001] F. Gers and J. Schmidhuber. LSTM recurrent networks learn simple context-free and context-sensitive languages. IEEE Transactions on Neural Networks, 12(6):1333–1340, 2001.
  • Goldberg [2019] Yoav Goldberg. Assessing BERT's syntactic abilities, 2019.
  • He et al. [2020] Qi He, Han Wang, and Yue Zhang. Enhancing generalization in natural language inference by syntax. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4973–4978, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.findings-emnlp.447. URL https://www.aclweb.org/anthology/2020.findings-emnlp.447.
  • Hewitt and Manning [2019] John Hewitt and Christopher D. Manning. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1419. URL https://www.aclweb.org/anthology/N19-1419.
  • Hewitt et al. [2020] John Hewitt, Michael Hahn, Surya Ganguli, Percy Liang, and Christopher D. Manning. RNNs can generate bounded hierarchical languages with optimal memory. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1978–2010, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.156. URL https://www.aclweb.org/anthology/2020.emnlp-main.156.
  • Hopcroft et al. [2006] John E. Hopcroft, Rajeev Motwani, and Jeffrey D. Ullman. Introduction to Automata Theory, Languages, and Computation (3rd Edition). Addison-Wesley Longman Publishing Co., Inc., USA, 2006. ISBN 0321462254.
  • Htut et al. [2018] Phu Mon Htut, Kyunghyun Cho, and Samuel Bowman. Grammar induction with neural language models: An unusual replication. In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP, pages 371–373, Brussels, Belgium, November 2018. Association for Computational Linguistics. doi: 10.18653/v1/W18-5452. URL https://www.aclweb.org/anthology/W18-5452.
  • Htut et al. [2019] Phu Mon Htut, Jason Phang, Shikha Bordia, and Samuel R. Bowman. Do attention heads in BERT track syntactic dependencies? ArXiv, abs/1911.12246, 2019.
  • Kim et al. [2019a] Yoon Kim, Chris Dyer, and Alexander Rush. Compound probabilistic context-free grammars for grammar induction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2369–2385, Florence, Italy, July 2019a. Association for Computational Linguistics. doi: 10.18653/v1/P19-1228. URL https://www.aclweb.org/anthology/P19-1228.
  • Kim et al. [2019b] Yoon Kim, Alexander Rush, Lei Yu, Adhiguna Kuncoro, Chris Dyer, and Gábor Melis. Unsupervised recurrent neural network grammars. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1105–1117, Minneapolis, Minnesota, June 2019b. Association for Computational Linguistics. doi: 10.18653/v1/N19-1114. URL https://www.aclweb.org/anthology/N19-1114.
  • Kuncoro et al. [2018] Adhiguna Kuncoro, Chris Dyer, John Hale, Dani Yogatama, Stephen Clark, and Phil Blunsom. LSTMs can learn syntax-sensitive dependencies well, but modeling structure makes them better. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1426–1436, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-1132. URL https://www.aclweb.org/anthology/P18-1132.
  • Linzen et al. [2016] Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. Assessing the ability of LSTMs to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics, 4:521–535, 2016. doi: 10.1162/tacl_a_00115. URL https://www.aclweb.org/anthology/Q16-1037.
  • Merrill [2019] William Merrill. Sequential neural networks as automata. In Proceedings of the Workshop on Deep Learning and Formal Languages: Building Bridges, pages 1–13, Florence, August 2019. Association for Computational Linguistics. doi: 10.18653/v1/W19-3901. URL https://www.aclweb.org/anthology/W19-3901.
  • Pang et al. [2019] Deric Pang, Lucy H. Lin, and Noah A. Smith. Improving natural language inference with a pretrained parser, 2019.
  • Reif et al. [2019] Emily Reif, Ann Yuan, Martin Wattenberg, Fernanda B Viegas, Andy Coenen, Adam Pearce, and Been Kim. Visualizing and measuring the geometry of BERT. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32, pages 8594–8603. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper/2019/file/159c1ffe5b61b41b3c4d8f4c2150f6c4-Paper.pdf.
  • Sagae and Lavie [2005] Kenji Sagae and Alon Lavie. A classifier-based parser with linear run-time complexity. In Proceedings of the Ninth International Workshop on Parsing Technology, pages 125–132, Vancouver, British Columbia, October 2005. Association for Computational Linguistics. URL https://www.aclweb.org/anthology/W05-1513.
  • Shen et al. [2018a] Yikang Shen, Zhouhan Lin, Athul Paul Jacob, Alessandro Sordoni, Aaron Courville, and Yoshua Bengio. Straight to the tree: Constituency parsing with neural syntactic distance. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1171–1180, Melbourne, Australia, July 2018a. Association for Computational Linguistics. doi: 10.18653/v1/P18-1108. URL https://www.aclweb.org/anthology/P18-1108.
  • Shen et al. [2018b] Yikang Shen, Zhouhan Lin, Chin wei Huang, and Aaron Courville. Neural language modeling by jointly learning syntax and lexicon. In International Conference on Learning Representations, 2018b. URL https://openreview.net/forum?id=rkgOLb-0W.
  • Shen et al. [2019] Yikang Shen, Shawn Tan, Alessandro Sordoni, and Aaron Courville. Ordered neurons: Integrating tree structures into recurrent neural networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=B1l6qiR5F7.
  • Siegelmann and Sontag [1992] Hava T. Siegelmann and Eduardo D. Sontag. On the computational power of neural nets. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, COLT ’92, page 440–449, New York, NY, USA, 1992. Association for Computing Machinery. ISBN 089791497X. doi: 10.1145/130385.130432. URL https://doi.org/10.1145/130385.130432.
  • Suzgun et al. [2019] Mirac Suzgun, Yonatan Belinkov, Stuart Shieber, and Sebastian Gehrmann. LSTM networks can perform dynamic counting. In Proceedings of the Workshop on Deep Learning and Formal Languages: Building Bridges, pages 44–54, Florence, August 2019. Association for Computational Linguistics. doi: 10.18653/v1/W19-3905. URL https://www.aclweb.org/anthology/W19-3905.
  • Vinyals et al. [2015] Oriol Vinyals, Łukasz Kaiser, Terry Koo, Slav Petrov, Ilya Sutskever, and Geoffrey Hinton. Grammar as a foreign language. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28, pages 2773–2781. Curran Associates, Inc., 2015. URL https://proceedings.neurips.cc/paper/2015/file/277281aada22045c03945dcb2ca6f2ec-Paper.pdf.
  • Weiss et al. [2018] Gail Weiss, Yoav Goldberg, and Eran Yahav. On the practical computational power of finite precision RNNs for language recognition. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 740–745, Melbourne, Australia, July 2018. Association for Computational Linguistics. doi: 10.18653/v1/P18-2117. URL https://www.aclweb.org/anthology/P18-2117.
  • Williams et al. [2018] Adina Williams, Andrew Drozdov, and Samuel R. Bowman. Do latent tree learning models identify meaningful structure in sentences? Transactions of the Association for Computational Linguistics, 6:253–267, 2018. doi: 10.1162/tacl_a_00019. URL https://www.aclweb.org/anthology/Q18-1019.

Appendix A Tree Induction Algorithm Based on Syntactic Distance

The following algorithm is proposed in [Shen et al., 2018b] to create a parse tree based on a given syntactic distance.

Data: Sentence $W = w_1 w_2 \dots w_n$, syntactic distances $d_t = d(w_{t-1}, w_t \mid c_t)$, $2 \leq t \leq n$
Result: A parse tree for $W$
Initialize the parse tree with a single node $n_0 = w_1 w_2 \dots w_n$;
while $\exists$ leaf node $n = w_i w_{i+1} \dots w_j$ where $i < j$ do
       Find $k \in \operatorname*{arg\,max}_{i+1 \leq k \leq j} d_k$;
       Create the left child $n_l$ and the right child $n_r$ of $n$;
       $n_l \leftarrow w_i w_{i+1} \dots w_{k-1}$;
       $n_r \leftarrow w_k w_{k+1} \dots w_j$;
end while
return The parse tree rooted at $n_0$.
Algorithm 1 Tree induction based on syntactic distance

Appendix B ON-LSTM Intuition

See Figure 3 below, which is excerpted from [Shen et al., 2019] with minor adaptation to the notation.

Figure 3: Relationship between the parse tree, the block view, and the ON-LSTM. Excerpted from [Shen et al., 2019] with minor adaptation to the notation.

Appendix C Examples of parsing transitions

Table 1 below shows an example of parsing the sentence “I drink coffee with milk” using the set of transitions given by Definition 4.7, which employs the parsing framework of [Dyer et al., 2016]. The parse tree of the sentence is given by

(S (NP (N I)) (VP (V drink) (NP (NP (N coffee)) (PP (P with) (N milk)))))
Stack: (empty); Buffer: I drink coffee with milk; Action: NT(S)
Stack: (S; Buffer: I drink coffee with milk; Action: NT(NP)
Stack: (S || (NP; Buffer: I drink coffee with milk; Action: NT(N)
Stack: (S || (NP || (N; Buffer: I drink coffee with milk; Action: SHIFT
Stack: (S || (NP || (N || I; Buffer: drink coffee with milk; Action: REDUCE
Stack: (S || (NP (N I)); Buffer: drink coffee with milk; Action: NT(VP)
Stack: (S || (NP (N I)) || (VP; Buffer: drink coffee with milk; Action: NT(V)
Stack: (S || (NP (N I)) || (VP || (V; Buffer: drink coffee with milk; Action: SHIFT
Stack: (S || (NP (N I)) || (VP || (V || drink; Buffer: coffee with milk; Action: REDUCE
Stack: (S || (NP (N I)) || (VP || (V drink); Buffer: coffee with milk; Action: NT(NP)
Stack: (S || (NP (N I)) || (VP || (V drink) || (NP; Buffer: coffee with milk; Action: NT(NP)
Stack: (S || (NP (N I)) || (VP || (V drink) || (NP || (NP; Buffer: coffee with milk; Action: NT(N)
Stack: (S || (NP (N I)) || (VP || (V drink) || (NP || (NP || (N; Buffer: coffee with milk; Action: SHIFT
Stack: (S || (NP (N I)) || (VP || (V drink) || (NP || (NP || (N || coffee; Buffer: with milk; Action: REDUCE
Stack: (S || (NP (N I)) || (VP || (V drink) || (NP || (NP (N coffee)); Buffer: with milk; Action: NT(PP)
Stack: (S || (NP (N I)) || (VP || (V drink) || (NP || (NP (N coffee)) || (PP; Buffer: with milk; Action: NT(P)
Stack: (S || (NP (N I)) || (VP || (V drink) || (NP || (NP (N coffee)) || (PP || (P; Buffer: with milk; Action: SHIFT
Stack: (S || (NP (N I)) || (VP || (V drink) || (NP || (NP (N coffee)) || (PP || (P || with; Buffer: milk; Action: REDUCE
Stack: (S || (NP (N I)) || (VP || (V drink) || (NP || (NP (N coffee)) || (PP || (P with); Buffer: milk; Action: NT(N)
Stack: (S || (NP (N I)) || (VP || (V drink) || (NP || (NP (N coffee)) || (PP || (P with) || (N; Buffer: milk; Action: SHIFT
Stack: (S || (NP (N I)) || (VP || (V drink) || (NP || (NP (N coffee)) || (PP || (P with) || (N || milk; Buffer: (empty); Action: REDUCE
Stack: (S (NP (N I)) (VP (V drink) (NP (NP (N coffee)) (PP (P with) (N milk))))); Buffer: (empty); Action: (done)
Table 1: Transition-based parsing of the sentence “I drink coffee with milk”.