JoMA: Demystifying Multilayer Transformers via JOint Dynamics of MLP and Attention
Abstract
We propose Joint MLP/Attention (JoMA) dynamics, a novel mathematical framework for understanding the training procedure of multilayer Transformer architectures. This is achieved by integrating out the self-attention layer in Transformers, producing a modified dynamics of the MLP layers only. JoMA removes unrealistic assumptions made in previous analyses (e.g., lack of residual connections) and predicts that, in the presence of nonlinear activations, the attention first becomes sparse (to learn salient tokens) and then dense (to learn less salient tokens), while in the linear case it is consistent with existing works showing that attention becomes sparse over time. We leverage JoMA to qualitatively explain how tokens are combined to form hierarchies in multilayer Transformers when the input tokens are generated by a latent hierarchical generative model. Experiments on models trained from scratch on real-world datasets (Wikitext2/Wikitext103) and on various pre-trained models (OPT, Pythia) verify our theoretical findings. The code is available at https://github.com/facebookresearch/luckmatters/tree/yuandong3.
1 Introduction
Since their debut, Transformers (Vaswani et al., 2017) have been extensively used in many applications and demonstrate impressive performance (Dosovitskiy et al., 2020; OpenAI, 2023) compared to domain-specific models (e.g., CNNs in computer vision, GNNs in graph modeling, RNNs/LSTMs in language modeling). In all these scenarios, the basic Transformer block, which consists of one self-attention layer plus a two-layer nonlinear MLP, plays a critical role. A natural question arises:
How does the basic Transformer block lead to effective learning?
Due to the complexity and nonlinearity of Transformer architectures, finding a unified mathematical framework that characterizes the learning mechanism of multilayer Transformers remains a highly nontrivial open problem. Existing works mostly focus on 1-layer Transformers (Li et al., 2023a; Tarzanagh et al., 2023b) with a fixed MLP layer (Tarzanagh et al., 2023a), linear activation functions (Tian et al., 2023), or local gradient steps at initialization (Bietti et al., 2023; Oymak et al., 2023).
In this paper, we propose a novel joint dynamics of self-attention plus MLP, based on the Joint MLP/Attention Integral (JoMA), a first integral that combines the lower layer of the MLP and the self-attention layer. Leveraging this joint dynamics, the self-attention is shown to have more fine-grained and delicate behavior in the case of nonlinear activations: it first becomes sparse as in the linear case (Tian et al., 2023), attending only to tokens that frequently co-occur with the query, and then becomes denser, gradually including tokens with less frequent co-occurrence. This reveals a changing inductive bias during Transformer training: the model first focuses on the most salient features, then extends to less salient ones.
Another natural question arises: why is such a learning pattern preferred? While for a 1-layer Transformer this does not give any benefit, we show qualitatively that such dynamics plays an important role in the multilayer setting. To demonstrate this, we assume a hierarchical tree generative model for the input tokens. In this model, starting from the upper-level latent variables (the top-most of which is the class label of the input sequence), each layer generates the latents in the layer below, until the token level is reached. With this model, we show that the tokens generated by the lowest latents co-occur a lot and thus are picked up first by the attention dynamics as “salient features”. This leads to learning of such token combinations in hidden MLP nodes, which triggers self-attention grouping at the next level, and so on. In this way, the non-salient co-occurrences are naturally explained by the upper levels of the hierarchy, rather than being incorrectly learned by the lower layer as spurious correlations, a mistake that is fortunately delayed by the attention mechanism. Our theoretical findings are consistent with both pre-trained models such as OPT/Pythia and models trained from scratch on real-world datasets (Wikitext2 and Wikitext103).
We show that JoMA overcomes several main limitations of Scan&Snap (Tian et al., 2023). JoMA incorporates residual connections and MLP nonlinearity as key ingredients, analyzes the joint training of the MLP and self-attention layers, and qualitatively explains the dynamics of multilayer Transformers. For linear activations, JoMA coincides with Scan&Snap, i.e., the attention becomes sparse during training.
1.1 Related Work
Expressiveness of Attention-based Models. The universal approximation abilities of attention-based models have been studied extensively (Yun et al., 2019; Bhattamishra et al., 2020a; b; Dehghani et al., 2018; Pérez et al., 2021). More recent studies offer detailed insights into their expressiveness for specific functions across various scenarios, sometimes incorporating statistical evaluations (Edelman et al., 2022; Elhage et al., 2021; Likhosherstov et al., 2021; Akyürek et al., 2022; Zhao et al., 2023; Yao et al., 2021; Anil et al., 2022; Barak et al., 2022). A fruitful line of work has studied the in-context learning capabilities of the Transformer (Dong et al., 2022), linking gradient descent in classification/regression learning to the feedforward actions of Transformer layers (Garg et al., 2022; Von Oswald et al., 2022; Bai et al., 2023; Olsson et al., 2022; Akyürek et al., 2022; Li et al., 2023b). However, unlike our study, these works do not characterize the training dynamics.
Training Dynamics of Neural Networks. Earlier research has delved into training dynamics within multi-layer linear neural networks (Arora et al., 2018; Bartlett et al., 2018), the teacher-student setting (Brutzkus & Globerson, 2017; Tian, 2017; Soltanolkotabi, 2017; Goel et al., 2018; Du et al., 2017; 2018a; Zhou et al., 2019; Liu et al., 2019; Xu & Du, 2023), and infinite-width limits (Jacot et al., 2018; Chizat et al., 2019; Du et al., 2018b; 2019; Allen-Zhu et al., 2019; Arora et al., 2019; Oymak & Soltanolkotabi, 2020; Zou et al., 2020; Li & Liang, 2018; Chizat & Bach, 2018; Mei et al., 2018; Nguyen & Pham, 2020; Fang et al., 2021; Lu et al., 2020). This includes extensions to attention-based-models (Hron et al., 2020; Yang et al., 2022). For self-supervised learning, there are analyses of linear networks (Tian, 2022) and explorations into the impact of nonlinearity (Tian, 2023).
Dynamics of Attention-based models. Regarding attention-based models, Zhang et al. (2020) delves into adaptive optimization techniques. Jelassi et al. (2022) introduces an idealized context, demonstrating that the vision transformer (Dosovitskiy et al., 2020) trained via gradient descent can discern spatial structures. Li et al. (2023c) illustrates that a single-layer Transformer can learn a constrained topic model, where each word is tied to a single topic, using a specific loss, a BERT-like framework (Devlin et al., 2018), and certain assumptions on attention patterns. Snell et al. (2021) investigate the training dynamics of single-head attention in mimicking Seq2Seq learning. Tian et al. (2023) characterizes the SGD training dynamics of a 1-layer Transformer and shows that with cross-entropy loss, the model pays more attention to the key tokens that frequently co-occur with the query token. Oymak et al. (2023) constructs an attention-based contextual mixture model and demonstrates how the prompt can attend to the sparse context-relevant tokens via gradient descent. Tarzanagh et al. (2023b) also finds that gradient descent converges in direction to the max-margin solution that separates the locally optimal tokens from others, and Tarzanagh et al. (2023a) further discloses the connection between the optimization geometry of self-attention and the hard-margin SVM problem. For the in-context learning scenario, several recent works analyze linear transformers trained on random instances of linear regression tasks from the perspective of the loss landscape (Boix-Adsera et al., 2023; Zhang et al., 2023). While these studies also study the optimization dynamics of attention-based models, they do not reveal the phenomena we discuss.
2 Problem Setting
Let the total vocabulary size be , in which is the number of contextual tokens and is the number of query tokens. Consider one layer in a multilayer Transformer (Fig. 1(b)):
(1)
Input/outputs. is the input frequency vector for contextual token , is the query token index, is the number of nodes in the hidden MLP layer, whose outputs are . All the quantities above vary across different sample index (i.e., , ). In addition, is the nonlinearity (e.g., ReLU).
Model weights. is the (unnormalized) attention logits given query , and are the weights for the lower MLP layer. These will be analyzed in the paper.

The Attention Mechanism. In this paper, we mainly study three kinds of attention:
• Linear Attention (Von Oswald et al., 2022): and ;
• Exp Attention: and ;
• Softmax Attention (Vaswani et al., 2017): and .
Here is the Hadamard (element-wise) product. are the attention scores for contextual tokens, given by a point-wise attention function . is the normalization constant.
Embedding vectors. is the embedding vector for token . We assume that the embedding dimension is sufficiently large and thus , i.e., are orthonormal bases. Let be the matrix that encodes all embedding vectors of contextual tokens. Then . Appendix B.1 verifies the orthogonality assumption in multiple pre-trained models (Pythia, LLaMA, etc).
Residual connections are introduced as an additional term in Eqn. 1, which captures a critical component of the Transformer architecture. Note that we do not model the value matrix, since it can be merged into the embedding vectors, while the query and key matrices are already implicitly modeled by the self-attention logits .
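To make the setting concrete, below is a minimal NumPy sketch of one attention-plus-MLP layer in the spirit of Eqn. 1 and the three attention variants above. The variable names, the exact softmax normalization, and the use of the query embedding as the residual term are our own illustrative choices; the authoritative definitions are those in Eqn. 1.

```python
import numpy as np

def layer_forward(x, q_idx, z, W, U_C, u_q, attn_kind="softmax",
                  act=lambda t: np.maximum(t, 0.0)):
    """One attention + lower-MLP layer, a sketch in the spirit of Eqn. 1.

    x      : (M_C,) frequency vector of contextual tokens in the sample
    q_idx  : index of the query token
    z      : (M_Q, M_C) unnormalized attention logits, one row per query
    W      : (K, d) lower MLP weights, one row per hidden node
    U_C    : (d, M_C) embeddings of contextual tokens (assumed ~orthonormal columns)
    u_q    : (d,) embedding of the query token
    """
    z_q = z[q_idx]
    if attn_kind == "linear":
        b = z_q * x                                  # linear attention scores
    elif attn_kind == "exp":
        b = np.exp(z_q) * x                          # exp attention scores
    elif attn_kind == "softmax":
        b = np.exp(z_q) * x / (np.exp(z_q) @ x)      # normalized over the sample
    else:
        raise ValueError(attn_kind)
    f = U_C @ b + u_q                                # attended token mixture + residual
    return act(W @ f)                                # hidden MLP outputs

# toy usage
rng = np.random.default_rng(0)
d, M_C, M_Q, K = 64, 10, 4, 8
U_C = np.linalg.qr(rng.standard_normal((d, M_C)))[0]       # near-orthonormal embeddings
u_q = rng.standard_normal(d)
u_q /= np.linalg.norm(u_q)
x = rng.dirichlet(np.ones(M_C))                            # token frequencies in one sample
h = layer_forward(x, 0, np.zeros((M_Q, M_C)), 0.1 * rng.standard_normal((K, d)), U_C, u_q)
print(h.shape)                                             # (K,)
```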
Gradient backpropagation in multilayers. In the multilayer setting, the gradient is backpropagated from the top layers. Specifically, let be the backpropagated gradient sent to node at sample . For a 1-layer Transformer with a softmax loss applied directly to the hidden nodes of the MLP, we have , where is the label to be predicted for sample . For brevity, we often omit the sample index when there is no ambiguity.
Assumption 1 (Stationary backpropagated gradient ).
Expectation terms involving (e.g., ) remain constant during training.
Note that this holds exactly for layer-wise training: when optimizing the weights of a specific Transformer layer while fixing the weights of the others, the statistics of the backpropagated gradient are stationary. For joint training, this condition also holds approximately, since the weights change gradually during the training process. Under Assumption 1, Appendix A.1 gives an equivalent formulation in terms of a per-hidden-node loss.
Training Dynamics. Define the conditional expectation . Now let us consider the dynamics of and when we train the model with a batch of inputs that always end with the query :
(2)
Here is the derivative of current activation and .


3 JoMA: Existence of JOint dynamics of Attention and MLP
While the learning dynamics of and can be complicated, surprisingly, the training dynamics suggests that the attention logits have a closed-form relationship with the MLP weights , which lays the foundation of our JoMA framework:
Theorem 1 (JoMA).
Let , then the dynamics of Eqn. 2 satisfies the invariants:
• Linear attention. The dynamics satisfies .
• Exp attention. The dynamics satisfies .
• Softmax attention. If is a constant over time and , then the dynamics satisfies .
Under zero initialization, the time-independent constant .
Therefore, we do not need to explicitly update the self-attention, since it is already implicitly incorporated into the lower layer of the MLP weights! For softmax attention, we verify that even when the assumption is only approximately satisfied, the invariant proposed by Theorem 1 still predicts fairly well.
3.1 Linear activations: winner-take-all
Now we can solve the dynamics of (Eqn. 2) by plugging in the closed-form solution for the self-attention. For simplicity, we consider exp attention with (i.e., a single hidden MLP node). Let ; then the dynamics of is ( written as ):
(3)
In the case of linear activations , . According to Assumption 1, does not depend on and we arrive at the following theorem:

Theorem 2 (Linear Dynamics with Self-attention).
With linear MLP activation and zero initialization, for exp attention any two tokens satisfy the following invariants:
(4)
where and is Gauss error function.
Remarks. The dynamics suggests that the weights become one-hot over training. Specifically, let , then and the others converge to finite values, because of the constraint imposed by Eqn. 4 (see Fig. 3). For softmax attention, there is an additional sample-dependent normalization constant ; if it remains constant across samples and all elements of are the same, then Theorem 2 also applies.
Beyond distinct/common tokens. Since is the empirical frequency of token in sample , we have . is a product of token discriminancy (i.e., means token is positively correlated with the backpropagated gradient , or with the label in the 1-layer case) and token frequency (i.e., , how frequently appears given ). This covers a broader spectrum of tokens than Tian et al. (2023), which only discusses distinct (i.e., large ) and common tokens (i.e., when ).
4 Training Dynamics under Nonlinear Activations
In the nonlinear case, the dynamics turns out to be very different: is no longer a constant but changes over time, and as a result the dynamics also changes substantially.
Theorem 3 (Dynamics of nonlinear activation with uniform attention).
If is sampled from a mixture of isotropic distributions centered at , where each and gradient are constant within each mixture, then:
(5)
Here , is the affinity to and the “bias” term ; and depend on the derivative of the nonlinearity and the data distribution but not on . If is monotonic in and , so is .
Appendix A.3.2 presents a critical point analysis. Here we focus on a simplified case in which is constrained to be a unit vector, which leads to the following modified dynamics ():
(6)
where . We consider the case when is aligned with one cluster but far away from the others; then for , and , since is monotonically increasing. Hence dominates, and we write for brevity. Similar to Eqn. 3, we use the closed-form simplification of JoMA to incorporate self-attention, which leads to (for exp attention):
(7)
Here we omit the scalar terms and study the regime in which is close to , where . It is clear that the critical point does not change after adding the term . However, the convergence speed changes drastically. As shown in the following theorem, convergence towards the salient components of (i.e., components with large magnitude) is much faster than towards the non-salient ones:
Theorem 4 (Convergence speed of salient vs. non-salient components).
Let be the convergence metric for component ( means that the component has converged). For the nonlinear dynamics with attention (Eqn. 7), we have
(8)
Here where and only depends on and . So when , we have .
Remarks. For linear attention, the ratio is different but the derivation is similar and simpler. Note that the convergence speed heavily depends on the magnitude of . If , then and converges much faster than . Therefore, the salient (i.e., large) components are learned first, and the non-salient (i.e., small) components are learned later, due to the modulation of the extra term introduced by the self-attention, as demonstrated in Fig. 4.
A follow-up question arises: what is the intuition behind salient and non-salient components in ? Note that is an -normalized version of the conditional token frequency given the query . In this case, similar to Theorem 2 (and Tian et al. (2023)), we again see that if a contextual token co-occurs a lot with the query , then the corresponding component becomes larger and grows towards much faster.
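To visualize this salient-first behavior, the sketch below integrates a toy version of the modulated dynamics: each component grows at a rate proportional to its target value, multiplied by an exponential term standing in for the attention modulation. The specific form exp(v_j^2/2), the clipping at the fixed point, and all constants are illustrative assumptions rather than the exact Eqn. 7.

```python
import numpy as np

def steps_to_converge(delta, with_attention=True, lr=1e-4, thresh=0.95, max_steps=10**6):
    """Integrate dv_j/dt = delta_j * (exp(v_j^2/2) if with_attention else 1), clipped at
    the fixed point v_j = delta_j, and report the first step at which v_j >= thresh*delta_j."""
    v = np.zeros_like(delta)
    hit = np.full(delta.shape, -1, dtype=int)
    for t in range(max_steps):
        growth = np.exp(0.5 * v ** 2) if with_attention else 1.0
        v = np.minimum(v + lr * delta * growth, delta)
        hit[(hit < 0) & (v >= thresh * delta)] = t
        if (hit >= 0).all():
            break
    return hit

delta = np.array([2.0, 1.0, 0.5])     # salient, medium, and non-salient target components
print("with modulation   :", steps_to_converge(delta, True))
print("without modulation:", steps_to_converge(delta, False))
```

In this toy, all components reach the same fraction of their targets at the same step without the modulation, whereas with it the salient component converges noticeably earlier, matching the qualitative picture above.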

Relationship with the rank of the MLP lower layer. Since the MLP and attention layers have joint dynamics (Theorem 1), this also suggests that in the MLP layer, the rank of the lower-layer matrix (which projects the input into the hidden nodes) will first drop, since the weight components that correspond to high target values grow first, and then bounce back to a higher rank when the components that correspond to low target values catch up later.
5 How does self-attention learn hierarchical data distributions?
A critical difference between the training dynamics of linear and nonlinear MLPs is that in the nonlinear case the non-salient components still grow, although slowly, and the entropy of the attention bounces back later. While for a 1-layer Transformer this may only slow down training with no clear benefit, the importance of such behavior becomes manifest once we consider the dynamics of multiple Transformer layers trained on a data distribution generated in a hierarchical manner.
Consider a simple generative hierarchical binary latent tree model (HBLT) (Tian et al., 2020) (Fig. 7(a)), in which latent (unobservable) binary variables at one layer generate the latents at the layer below, until the observable tokens are generated at the lowest level (). The topmost layer is the class label , which can take discrete values. In HBLT, the generation process of a latent at one layer given its parent at the layer above can be characterized by their conditional probability . The uncertainty hyperparameter determines how strongly the top-level latents determine the values of the low-level ones. Please see Appendix A.5 for its formal definition.
With HBLT, we can compute the co-occurrence frequency of two tokens and , as a function of the depth of their common latent ancestor (CLA):
Theorem 5 (Token Co-occurrence in ).
If token and have common latent ancestor (CLA) of depth (Fig. 5(c)), then , where is the total depth of the hierarchy and , in which and , where are the immediate children of the root node .
Remarks. If takes multiple values (many classes) and each class only triggers one specific latent binary variable, then most of the top-layer latents are very sparsely triggered and thus is very close to . If is also close to , then for a deep hierarchy and a shallow common ancestor, . To see this, assume ; then we have:
(9)
This means that two tokens and co-occur a lot if they have a shallow CLA ( small) that is close to both tokens. If their CLA is high in the hierarchy (e.g., and ), then the tokens and have much weaker co-occurrence, and (and thus and ) is small.
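As a quick sanity check of this picture, the sketch below samples from a small hand-rolled binary latent tree (a simplified stand-in for HBLT with a binary root and an assumed copy-with-probability-(1+rho)/2 parameterization, cf. Appendix A.5) and measures how often two leaf tokens are simultaneously active, depending on whether their common ancestor is shallow or deep.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_tree(depth, rho, n_samples):
    """Sample binary variables on a complete binary tree. Each child copies its parent
    with probability (1 + rho) / 2 (assumed HBLT-style parameterization, binary root)."""
    layer = rng.integers(0, 2, size=(n_samples, 1))          # root (class label, binary here)
    for _ in range(depth):
        parent = np.repeat(layer, 2, axis=1)                 # each latent has 2 children
        flip = rng.random(parent.shape) > (1 + rho) / 2
        layer = np.where(flip, 1 - parent, parent)
    return layer                                             # leaves = observable tokens

leaves = sample_tree(depth=6, rho=0.9, n_samples=50000)      # 64 leaf tokens

def cooccur(i, j):
    return np.mean((leaves[:, i] == 1) & (leaves[:, j] == 1))

print("shallow CLA (siblings)    :", cooccur(0, 1))
print("deep CLA (ancestor = root):", cooccur(0, 63))
```

Sibling leaves, whose common ancestor is one level up, co-occur noticeably more often than leaves whose only common ancestor is the root, in line with the remark above.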

With this generative model, we can qualitatively analyze the learning dynamics of JoMA: the attention first focuses on associating the tokens in the same lowest-level hierarchy as the query (these tokens co-occur a lot with the query), then gradually reaches out to other tokens that co-occur less with it, if they have not been picked up by other tokens (Fig. 5(b)); if such a token co-occurs a lot with some other query, then the two pairs form their own lower hierarchies, respectively. This leads to the learning of high-level features, which are then associated at the higher level if they are highly correlated. Therefore, the latent hierarchy is implicitly learned.
6 Experiments
Dynamics of Attention Sparsity. Fig. 6 shows how attention sparsity changes over time when training from scratch. We train with a fixed learning rate and test our hypothesis on Wikitext2/Wikitext103 (Merity et al., 2016) (top/bottom row). Fig. 8 further shows that different learning rates lead to different attention sparsity patterns. With a large learning rate, the attention becomes extremely sparse, as in Tian et al. (2023). Interestingly, the attention patterns that coincide with our theoretical analysis yield the best validation score.
We also tested our hypothesis on OPT (Zhang et al., 2022) (OPT-2.7B) and Pythia (Biderman et al., 2023) (Pythia-70M/1.4B/6.9B) pre-trained models, both of which have public intermediate checkpoints. While the attention patterns show less salient drop-and-bounce behavior, the dynamics of the stable ranks of the MLP lower layer (the projection into hidden neurons) show much more salient structure of this kind for the top layers, and dropping curves for the bottom layers, since they are suppressed by top-level learning (Sec. 5). Note that stable ranks depend only on the model parameters and thus may be more reliable than attention sparsity.
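The two quantities tracked in this section can be computed as follows: the attention entropy is averaged over the query rows of each attention map, and the stable rank of an MLP lower-layer (in-projection) weight matrix is the ratio of its squared Frobenius norm to its squared spectral norm. The sketch below shows both computations on random arrays; how the attention maps and checkpoint weights are obtained is model-specific and omitted here.

```python
import numpy as np

def attention_entropy(attn):
    """attn: (..., T, T) attention probabilities (each row sums to 1).
    Returns the mean row-wise entropy; lower entropy = sparser attention."""
    eps = 1e-12
    ent = -(attn * np.log(attn + eps)).sum(axis=-1)   # entropy of each query's distribution
    return float(ent.mean())

def stable_rank(W):
    """Stable rank ||W||_F^2 / ||W||_2^2 of a weight matrix W."""
    fro2 = np.sum(W ** 2)
    spec = np.linalg.norm(W, ord=2)                   # largest singular value
    return float(fro2 / spec ** 2)

# toy usage on random arrays; in practice `attn` comes from a forward pass that returns
# attention maps, and `W` is the lower (in-projection) MLP weight of each checkpoint
rng = np.random.default_rng(0)
logits = rng.standard_normal((8, 128, 128))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
W = rng.standard_normal((3072, 768))
print(attention_entropy(attn), stable_rank(W))
```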
Validation of Alignment between latents and hidden nodes in MLP. Sec. 5 is based on the assumption that the hidden nodes in the MLP layer learn the latent variables. We verify this assumption on synthetic data sampled from HBLT, which generates latent variables in a top-down manner until the final tokens are generated. The latent hierarchy has two hyperparameters: the number of latents per layer () and the number of children per latent (). is the number of classes. The Adam optimizer is used with learning rate . The vocabulary size is , the sequence length is , and the embedding dimension is .
We use a 3-layer generative model as well as 3-layer Transformer models. We indeed observe high correlations between the latents and the hidden neurons at corresponding layers. Note that the latents are known during the input generation procedure but are not known to the Transformer being trained. We take the maximal activation of each neuron across the sequence length and compute the normalized correlation between the maximal activation of each neuron and the latents, after centering across the sample dimension. Tbl. 1 shows that, in the learned models, for each latent there indeed exists at least one hidden node in the MLP that has high normalized correlation with it, in particular in the lowest layer. When the generative model becomes more complicated (i.e., both hyperparameters become larger), the correlation drops slightly.
, | , | , | ||||
---|---|---|---|---|---|---|
(10, 20) | (20, 30) | (10, 20) | (20, 30) | (10, 20) | (20, 30) | |
NCorr () | ||||||
NCorr () | ||||||
, | , | |||||
(10, 20) | (20, 30) | (10, 20) | (20, 30) | (10, 20) | (20, 30) | |
NCorr () | ||||||
NCorr () |
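Below is a sketch of how the NCorr numbers in Tbl. 1 can be computed: take the maximum activation of each hidden neuron over the sequence, center both activations and latents across the sample dimension, and for each latent report the best absolute normalized correlation over neurons. Variable names are our own, and the exact matching procedure used for the table may differ in detail.

```python
import numpy as np

def ncorr(latents, hidden):
    """latents: (N, L) latent values per sample (known from the generator).
    hidden : (N, T, K) hidden MLP activations over a length-T sequence.
    Returns, for each latent, the max |normalized correlation| over all K neurons."""
    acts = hidden.max(axis=1)                                # (N, K) max over sequence length
    acts = acts - acts.mean(axis=0, keepdims=True)           # center across samples
    lats = latents - latents.mean(axis=0, keepdims=True)
    acts /= np.linalg.norm(acts, axis=0, keepdims=True) + 1e-12
    lats /= np.linalg.norm(lats, axis=0, keepdims=True) + 1e-12
    return np.abs(lats.T @ acts).max(axis=1)                 # (L,) best neuron per latent

# toy usage with random data; real latents/activations come from the HBLT generator
# and the trained Transformer, respectively
rng = np.random.default_rng(0)
latents = rng.integers(0, 2, size=(512, 10)).astype(float)
hidden = rng.standard_normal((512, 16, 64))
print(ncorr(latents, hidden))
```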
7 Discussion
Dealing with almost-orthogonal embeddings. In this paper, we focus on fixed orthonormal embedding vectors. However, in real-world Transformer training, this assumption may not be valid, since the embedding dimension is often smaller than the vocabulary size, so the embedding vectors cannot all be orthogonal to each other. In this setting, one reasonable assumption is that the embedding vectors are almost orthogonal. Thanks to the Johnson–Lindenstrauss lemma, one interesting property of high-dimensional spaces is that achieving almost-orthogonality only requires an embedding dimension logarithmic in the number of vectors. As a result, our JoMA framework (Theorem 1) will have additional correction terms, and we leave the detailed analysis to future work.
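A quick numerical illustration of this point: random unit vectors in d dimensions are pairwise nearly orthogonal once d grows logarithmically in the number of vectors. The sketch below measures the maximum absolute cosine similarity among M random embeddings for several choices of d (a minimal demonstration of the phenomenon, not the quantitative Johnson–Lindenstrauss statement).

```python
import numpy as np

rng = np.random.default_rng(0)

def max_abs_cosine(M, d):
    """Max |cosine similarity| over all pairs of M random unit vectors in R^d."""
    E = rng.standard_normal((M, d))
    E /= np.linalg.norm(E, axis=1, keepdims=True)
    G = np.abs(E @ E.T)
    np.fill_diagonal(G, 0.0)
    return G.max()

M = 2000                                   # "vocabulary" size
for d in (64, 256, 1024, 4096):
    print(f"d = {d:4d}   max |cos| = {max_abs_cosine(M, d):.3f}")
```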
Training embedding vectors. Another factor not considered in JoMA is that the embedding vectors are also trained simultaneously. This could further boost the efficiency of the Transformer architecture, since concepts with similar semantics will learn similar embeddings. This essentially reduces the effective vocabulary size at each layer, making learning more effective and leading to better generalization. For example, although many hidden neurons are computed at each hidden layer, this does not mean there are as many independent intermediate “tokens”, because many of their embeddings are highly correlated.
Self-attention computed from embeddings. JoMA arrives at the joint dynamics of MLP and attention by assuming that the pairwise attention score is an independent parameter optimized under SGD dynamics. In practice, it is also parameterized by the embedding matrix, which allows generalization to tokens with similar embeddings and may accelerate the training dynamics. We leave this to future work.
8 Conclusion
We propose JoMA, a framework that characterizes the joint training dynamics of the nonlinear MLP and the attention layer by integrating out the self-attention logits. The resulting dynamics connects the nonlinear MLP lower-layer weights (the projection into hidden neurons) with the self-attention, and shows that the attention first becomes sparse (or the weights become low-rank) and then dense (or the weights become high-rank). Furthermore, we qualitatively describe a learning mechanism of multilayer Transformers that reveals how self-attention at different layers interacts to learn the latent feature hierarchy.
Acknowledgments
Simon S. Du is supported by NSF IIS 2110170, NSF DMS 2134106, NSF CCF 2212261, NSF IIS 2143493, NSF CCF 2019844, and NSF IIS 2229881.
References
- Akyürek et al. (2022) Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. What learning algorithm is in-context learning? investigations with linear models. arXiv preprint arXiv:2211.15661, 2022.
- Allen-Zhu et al. (2019) Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pp. 242–252. PMLR, 2019.
- Anil et al. (2022) Cem Anil, Yuhuai Wu, Anders Andreassen, Aitor Lewkowycz, Vedant Misra, Vinay Ramasesh, Ambrose Slone, Guy Gur-Ari, Ethan Dyer, and Behnam Neyshabur. Exploring length generalization in large language models. arXiv preprint arXiv:2207.04901, 2022.
- Arora et al. (2018) Sanjeev Arora, Nadav Cohen, Noah Golowich, and Wei Hu. A convergence analysis of gradient descent for deep linear neural networks. arXiv preprint arXiv:1810.02281, 2018.
- Arora et al. (2019) Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Conference on Machine Learning, pp. 322–332. PMLR, 2019.
- Bai et al. (2023) Yu Bai, Fan Chen, Huan Wang, Caiming Xiong, and Song Mei. Transformers as statisticians: Provable in-context learning with in-context algorithm selection. arXiv preprint arXiv:2306.04637, 2023.
- Barak et al. (2022) Boaz Barak, Benjamin Edelman, Surbhi Goel, Sham Kakade, Eran Malach, and Cyril Zhang. Hidden progress in deep learning: Sgd learns parities near the computational limit. Advances in Neural Information Processing Systems, 35:21750–21764, 2022.
- Bartlett et al. (2018) Peter Bartlett, Dave Helmbold, and Philip Long. Gradient descent with identity initialization efficiently learns positive definite linear transformations by deep residual networks. In International conference on machine learning, pp. 521–530. PMLR, 2018.
- Bhattamishra et al. (2020a) Satwik Bhattamishra, Kabir Ahuja, and Navin Goyal. On the ability and limitations of transformers to recognize formal languages. arXiv preprint arXiv:2009.11264, 2020a.
- Bhattamishra et al. (2020b) Satwik Bhattamishra, Arkil Patel, and Navin Goyal. On the computational power of transformers and its implications in sequence modeling. arXiv preprint arXiv:2006.09286, 2020b.
- Biderman et al. (2023) Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pp. 2397–2430. PMLR, 2023.
- Bietti et al. (2023) Alberto Bietti, Vivien Cabannes, Diane Bouchacourt, Herve Jegou, and Leon Bottou. Birth of a transformer: A memory viewpoint. arXiv preprint arXiv:2306.00802, 2023.
- Boix-Adsera et al. (2023) Enric Boix-Adsera, Etai Littwin, Emmanuel Abbe, Samy Bengio, and Joshua Susskind. Transformers learn through gradual rank increase. arXiv preprint arXiv:2306.07042, 2023.
- Brutzkus & Globerson (2017) Alon Brutzkus and Amir Globerson. Globally optimal gradient descent for a convnet with gaussian inputs. In International conference on machine learning, pp. 605–614. PMLR, 2017.
- Chizat & Bach (2018) Lenaic Chizat and Francis Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. Advances in neural information processing systems, 31, 2018.
- Chizat et al. (2019) Lenaic Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming. Advances in neural information processing systems, 32, 2019.
- Dehghani et al. (2018) Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. arXiv preprint arXiv:1807.03819, 2018.
- Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
- Dong et al. (2022) Qingxiu Dong, Lei Li, Damai Dai, Ce Zheng, Zhiyong Wu, Baobao Chang, Xu Sun, Jingjing Xu, and Zhifang Sui. A survey for in-context learning. arXiv preprint arXiv:2301.00234, 2022.
- Dosovitskiy et al. (2020) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
- Du et al. (2018a) Simon Du, Jason Lee, Yuandong Tian, Aarti Singh, and Barnabas Poczos. Gradient descent learns one-hidden-layer cnn: Don’t be afraid of spurious local minima. In International Conference on Machine Learning, pp. 1339–1348. PMLR, 2018a.
- Du et al. (2019) Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. In International conference on machine learning, pp. 1675–1685. PMLR, 2019.
- Du et al. (2017) Simon S Du, Jason D Lee, and Yuandong Tian. When is a convolutional filter easy to learn? arXiv preprint arXiv:1709.06129, 2017.
- Du et al. (2018b) Simon S. Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks, 2018b. URL https://arxiv.org/abs/1810.02054.
- Edelman et al. (2022) Benjamin L Edelman, Surbhi Goel, Sham Kakade, and Cyril Zhang. Inductive biases and variable creation in self-attention mechanisms. In International Conference on Machine Learning, pp. 5793–5831. PMLR, 2022.
- Elhage et al. (2021) N Elhage, N Nanda, C Olsson, T Henighan, N Joseph, B Mann, A Askell, Y Bai, A Chen, T Conerly, et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021.
- Fang et al. (2021) Cong Fang, Jason Lee, Pengkun Yang, and Tong Zhang. Modeling from features: a mean-field framework for over-parameterized deep neural networks. In Conference on learning theory, pp. 1887–1936. PMLR, 2021.
- Garg et al. (2022) Shivam Garg, Dimitris Tsipras, Percy S Liang, and Gregory Valiant. What can transformers learn in-context? a case study of simple function classes. Advances in Neural Information Processing Systems, 35:30583–30598, 2022.
- Goel et al. (2018) Surbhi Goel, Adam Klivans, and Raghu Meka. Learning one convolutional layer with overlapping patches. In International Conference on Machine Learning, pp. 1783–1791. PMLR, 2018.
- Hron et al. (2020) Jiri Hron, Yasaman Bahri, Jascha Sohl-Dickstein, and Roman Novak. Infinite attention: Nngp and ntk for deep attention networks. In International Conference on Machine Learning, pp. 4376–4386. PMLR, 2020.
- Jacot et al. (2018) Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems, 31, 2018.
- Jelassi et al. (2022) Samy Jelassi, Michael Sander, and Yuanzhi Li. Vision transformers provably learn spatial structure. Advances in Neural Information Processing Systems, 35:37822–37836, 2022.
- Li et al. (2023a) Hongkang Li, Meng Wang, Sijia Liu, and Pin-Yu Chen. A theoretical understanding of shallow vision transformers: Learning, generalization, and sample complexity. In The Eleventh International Conference on Learning Representations, 2023a. URL https://openreview.net/forum?id=jClGv3Qjhb.
- Li et al. (2023b) Shuai Li, Zhao Song, Yu Xia, Tong Yu, and Tianyi Zhou. The closeness of in-context learning and weight shifting for softmax regression. arXiv preprint arXiv:2304.13276, 2023b.
- Li & Liang (2018) Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. Advances in neural information processing systems, 31, 2018.
- Li et al. (2023c) Yuchen Li, Yuanzhi Li, and Andrej Risteski. How do transformers learn topic structure: Towards a mechanistic understanding. arXiv preprint arXiv:2303.04245, 2023c.
- Likhosherstov et al. (2021) Valerii Likhosherstov, Krzysztof Choromanski, and Adrian Weller. On the expressive power of self-attention matrices. arXiv preprint arXiv:2106.03764, 2021.
- Liu et al. (2019) Tianyi Liu, Minshuo Chen, Mo Zhou, Simon S Du, Enlu Zhou, and Tuo Zhao. Towards understanding the importance of shortcut connections in residual networks. Advances in neural information processing systems, 32, 2019.
- Lu et al. (2020) Yiping Lu, Chao Ma, Yulong Lu, Jianfeng Lu, and Lexing Ying. A mean field analysis of deep resnet and beyond: Towards provably optimization via overparameterization from depth. In International Conference on Machine Learning, pp. 6426–6436. PMLR, 2020.
- Mei et al. (2018) Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018.
- Merity et al. (2016) Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843, 2016.
- Nguyen & Pham (2020) Phan-Minh Nguyen and Huy Tuan Pham. A rigorous framework for the mean field limit of multilayer neural networks. arXiv preprint arXiv:2001.11443, 2020.
- Olsson et al. (2022) Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022.
- OpenAI (2023) OpenAI. Gpt-4 technical report, 2023.
- Oymak & Soltanolkotabi (2020) Samet Oymak and Mahdi Soltanolkotabi. Toward moderate overparameterization: Global convergence guarantees for training shallow neural networks. IEEE Journal on Selected Areas in Information Theory, 1(1):84–105, 2020.
- Oymak et al. (2023) Samet Oymak, Ankit Singh Rawat, Mahdi Soltanolkotabi, and Christos Thrampoulidis. On the role of attention in prompt-tuning. ICML, 2023.
- Pérez et al. (2021) Jorge Pérez, Pablo Barceló, and Javier Marinkovic. Attention is turing complete. The Journal of Machine Learning Research, 22(1):3463–3497, 2021.
- Snell et al. (2021) Charlie Snell, Ruiqi Zhong, Dan Klein, and Jacob Steinhardt. Approximating how single head attention learns. arXiv preprint arXiv:2103.07601, 2021.
- Soltanolkotabi (2017) Mahdi Soltanolkotabi. Learning relus via gradient descent. Advances in neural information processing systems, 30, 2017.
- Tarzanagh et al. (2023a) Davoud Ataee Tarzanagh, Yingcong Li, Christos Thrampoulidis, and Samet Oymak. Transformers as support vector machines. arXiv preprint arXiv:2308.16898, 2023a.
- Tarzanagh et al. (2023b) Davoud Ataee Tarzanagh, Yingcong Li, Xuechen Zhang, and Samet Oymak. Max-margin token selection in attention mechanism. arXiv preprint arXiv:2306.13596, 3(7):47, 2023b.
- Tian (2017) Yuandong Tian. An analytical formula of population gradient for two-layered relu network and its applications in convergence and critical point analysis. In International conference on machine learning, pp. 3404–3413. PMLR, 2017.
- Tian (2022) Yuandong Tian. Understanding the role of nonlinearity in training dynamics of contrastive learning. arXiv preprint arXiv:2206.01342, 2022.
- Tian (2023) Yuandong Tian. Understanding the role of nonlinearity in training dynamics of contrastive learning. ICLR, 2023.
- Tian et al. (2020) Yuandong Tian, Lantao Yu, Xinlei Chen, and Surya Ganguli. Understanding self-supervised learning with dual deep networks. arXiv preprint arXiv:2010.00578, 2020.
- Tian et al. (2023) Yuandong Tian, Yiping Wang, Beidi Chen, and Simon Du. Scan and snap: Understanding training dynamics and token composition in 1-layer transformer, 2023.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. 2017. URL https://arxiv.org/pdf/1706.03762.pdf.
- Von Oswald et al. (2022) Johannes Von Oswald, Eyvind Niklasson, Ettore Randazzo, João Sacramento, Alexander Mordvintsev, Andrey Zhmoginov, and Max Vladymyrov. Transformers learn in-context by gradient descent. arXiv preprint arXiv:2212.07677, 2022.
- Xu & Du (2023) Weihang Xu and Simon S Du. Over-parameterization exponentially slows down gradient descent for learning a single neuron. arXiv preprint arXiv:2302.10034, 2023.
- Yang et al. (2022) Greg Yang, Edward J Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tensor programs v: Tuning large neural networks via zero-shot hyperparameter transfer. arXiv preprint arXiv:2203.03466, 2022.
- Yao et al. (2021) Shunyu Yao, Binghui Peng, Christos Papadimitriou, and Karthik Narasimhan. Self-attention networks can process bounded hierarchical languages. arXiv preprint arXiv:2105.11115, 2021.
- Yun et al. (2019) Chulhee Yun, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J Reddi, and Sanjiv Kumar. Are transformers universal approximators of sequence-to-sequence functions? arXiv preprint arXiv:1912.10077, 2019.
- Zhang et al. (2020) Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank Reddi, Sanjiv Kumar, and Suvrit Sra. Why are adaptive methods good for attention models? Advances in Neural Information Processing Systems, 33:15383–15393, 2020.
- Zhang et al. (2023) Ruiqi Zhang, Spencer Frei, and Peter L Bartlett. Trained transformers learn linear models in-context. arXiv preprint arXiv:2306.09927, 2023.
- Zhang et al. (2022) Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
- Zhao et al. (2023) Haoyu Zhao, Abhishek Panigrahi, Rong Ge, and Sanjeev Arora. Do transformers parse while predicting the masked word? arXiv preprint arXiv:2303.08117, 2023.
- Zhou et al. (2019) Mo Zhou, Tianyi Liu, Yan Li, Dachao Lin, Enlu Zhou, and Tuo Zhao. Toward understanding the importance of noise in training neural networks. In International Conference on Machine Learning, pp. 7594–7602. PMLR, 2019.
- Zou et al. (2020) Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Gradient descent optimizes over-parameterized deep relu networks. Machine learning, 109:467–492, 2020.
Appendix A Proofs
A.1 Per-hidden loss formulation
Our Assumption 1 has an equivalent per-hidden node loss:
(10) |
where is the backpropagated gradient sent to node at sample .
A.2 JoMA framework (Section 3)
See 1
Proof.
Let . Plugging the dynamics of into the dynamics of self-attention logits , we have:
(11) |
Before we start, we first define . Therefore, . Intuitively, is the bias of node , regardless of whether there exists an actual bias parameter to optimize.
Notice that , with orthonormal condition between contextual and query tokens: , and thus , which leads to
(12) |
Unnormalized attention (). In this case, we have and and thus
(13)
(14)
which leads to
(15) |
Therefore, for linear attention, , by integrating both sides, we have . For exp attention, , then by integrating both sides, we have .
Softmax attention. In this case, we have . Therefore,
(16) |
where is the Hadamard (element-wise) product. Therefore, we have:
(17) |
Given the assumption that is uncorrelated with (e.g., due to top-down gradient information), and let , we have:
(18) |
If we further assume that is constant over time, then we can integrate both sides to obtain a closed-form relation between and :
(19) |
∎
See 2
Proof.
Due to the assumption, we have:
(20) |
where . If , then . Note that for linear model, is a constant over time.
Plugging in the closed-form solution for exp attention, the dynamics becomes
(21) |
Assuming , then for any two tokens , we get
(22) |
which can be integrated using the Gauss error function (i.e., the Gaussian CDF: ):
(23) |
if , then . ∎
A.3 Dynamics of Nonlinear activations (Sec. 4)
A.3.1 Without self-attention (or equivalently, with uniform attention)
Lemma 1 (Expectation of Hyperplane function under Isotropic distribution).
For any isotropic distribution with mean in a subspace spanned by orthonormal bases , if , we have:
(24) |
where is the (signed) distance between the distribution mean and the affine hyperplane . and only depends on and the underlying distribution but not . Additionally,
• If is monotonically increasing, then is also monotonically increasing;
• If , then ;
• If , , then and ;
• If , then .
Proof.
Note that is isotropic in span() and thus only depends on ; we let satisfy . Our goal is to calculate
(25)
(26)
where is isotropic. Since is the projection of onto span(), we denote and , since lies in span(). Then let be any hyperplane through , which divides span() into two symmetric parts and (the boundary is a zero-measure set and can be ignored); we have
(27)
(28)
(29)
(30)
Eqn. 29 holds since for every , we can always find unique defined as
(31) |
where and satisfy , , and have equal and opposite components perpendicular to . Thus, for the in Eqn. 28, only the component parallel to remains. Furthermore, let be an orthonormal basis of span() and denote ; then we have
(32)
(33)
Here is the probability density function of obtained from . For the trivial case where , clearly . If , it can be further calculated as:
(34)
(35)
(36)
(37)
where represents the surface area of an -dimensional hyper-sphere of radius . denotes the gamma function and we use the property that and for any .
Similarly, for another term we have
(38)
(39)
Finally, let
(41)
(42)
Then we arrive at the conclusion. ∎
See 3
Proof.
Since backpropagated gradient is constant within each of its mixed components, we have:
(43)
(44)
(45)
Let . Note that and with uniform attention , we have:
(46) |
Using Lemma 1 leads to the conclusion. ∎
Remarks. Note that if is linear, then , and . In this case, is a constant, which marks a key difference between linear and nonlinear dynamics.
A.3.2 (Tentative) Critical Point Analysis of Dynamics in Theorem 3
Lemma 2 (Property of with homogeneous activation).
If is a homogeneous activation function and , then we have:
(47) |
Integrating both sides and we get:
(48) |
Let and it is clear that . Thus
(49) |
If , then is a monotonically increasing function with . Furthermore, if and , then and , and thus .
Proof.
Simply verify Eqn. 47 is true. ∎
Overall, the dynamics can be quite complicated. We consider a special case with one positive (, and ) and one negative (, and ) distribution.
Lemma 3 (Existence of critical point of dynamics with ReLU activation).
For any homogeneous activation , any stationary point of Eqn. 5 must satisfy , where is a monotonically increasing function.
Proof.
We rewrite the dynamics equations for the nonlinear activation without attention case:
(50) |
Notice that , this gives that:
(51)
(52)
(53)
in which the last equality holds because of the dynamics of , and due to Lemma 2. Now, leveraging the stationary-point conditions ( and ), we arrive at the necessary conditions at the stationary points:
(54) |
Note that in general, the scalar condition above is only necessary but not sufficient: Eqn. 50 has equations, but we only have two scalar equations (Eqn. 50 and ). However, we can obtain a better characterization of the stationary points if there are only two components and :
A special case: one positive and one negative sample. In this case, we have (here and ):
(55) |
So the sufficient and necessary condition for to be the critical point is that
(56) |
Without loss of generality, we consider the case where is ReLU and . Since is a monotonically increasing function, there exists such that for any . We denote , which satisfies:
(57) |
and , . Then, if we can find some line for some such that has at least two points of intersection with the curve and or , then we can always find some and such that Eqn. 56 holds.
On the other hand, it’s easy to find that (Fig. 9):

Note that since , we have , and thus and lie on the same straight line.
To find the sufficient condition, we focus on the range and . Then, in order for the line (for some ) to have at least two points of intersection with the curve , we just need
(58) |
For convenience, let and be the images of the relevant functions. Denote for any , . Therefore, if Eqn. 58 holds, the following set is not empty:
(59) |
Eqn. 5 then has critical points if . It is easy to see that , . Similar results also hold for other homogeneous activations.
Remarks. It is often the case that and , since, when , is convex, and there are at most two intersections between a convex function and a straight line. This means that and .
∎
A.4 Several remarks
The intuition behind : Note that while node in MLP layer does not have an explicit bias term, our analysis above demonstrates that there exists an “implicit bias” term embedded in the weight vector :
(60) |
This bias term allows encoding of the query embedding into the weight, and the negative bias ensures that given the query , there needs to be a positive inner product between (i.e., the “pattern template”) and the input contextual tokens, in order to activate the node .
Pattern superposition. Note that due to this mechanism, a single weight may contain multiple query vectors (e.g., and ) and their associated pattern templates (e.g., and ), as long as they are orthogonal to each other. Specifically, if , then it can match both pattern 1 and pattern 2. We call this “pattern superposition”, as demonstrated in Fig. 10.

Lemma 4.
If is homogeneous, i.e., , then there exist constants depending on such that , and thus
(61) |
Proof.
For any , we have
(62)
(63)
(64)
So for any , must be constant, and similar results hold for . Then by direct calculation, we can get the results. ∎
A.4.1 With self-attention
Lemma 5.
Let . Then .
Proof.
Any of its stationary points must satisfy , which gives:
(66) |
Therefore, at any stationary points, we have:
(67) |
since , the conclusion follows. ∎
Lemma 6 (Bound of Gaussian integral).
Let , then for .
Proof.
See 4
Proof.
We first consider when . We can write down the dynamics in a component wise manner, since all components share the same scalar constant:
(68) |
which gives the following separable form:
(69) |
Let
(70) |
Integrating both sides of Eqn. 69 from to , the dynamics must satisfy the following equation at time :
(71) |
where . According to the dynamics, and the question is how fast the convergence is. Depending on the initialization, or .
Eqn. 71 implicitly gives the relationship between and (and thus and ). Now the question is how to bound , which does not have a closed-form solution.
Note that we have:
(72)
(73)
(74)
(75)
Let
(76) |
Applying Lemma 5 and notice that , we have
(77) |
which means that is uniformly bounded, regardless of and (note that is bounded and converges to under the dynamics). Integrating both sides, we have:
(78)
(79)
(80)
Note that has a closed form:
(81) |
has a closed-form solution that works for both and (the situation where 1 lies between and does not happen). Using the mean value theorem, we have:
(82) |
Applying Lemma 6, we have the following bound for :
(83) |
When is close to (near convergence), the term (with fixed and fixed ) is huge compared to the constant , which is, e.g., for , and thus .
To be more concrete, note that , we let
(84) |
And using Eqn. 71, we have:
(85) |
Then
(86)
(87)
and . Then we arrive at the conclusion. ∎
A.5 Hierarchical Latent Tree Models (Section 5)
We formally introduce the definition of HBLT here. Let be a binary variable at layer (the upper layer) and a binary variable at layer (the lower layer). We use a 2×2 matrix to represent their conditional probability:
(88) |
Definition 1.
Define matrix and -dimensional vector for .
Lemma 7 (Property of ).
has the following properties:
• is a symmetric matrix.
• .
• . So matrix multiplication in is commutative and isomorphic to scalar multiplication.
• .
Proof.
The first two are trivial properties. For the third one, notice that , in which . Therefore, and and thus:
(89) |
For the last one, note that and the conclusion follows. ∎
Definition 2 (Definition of HBLT).
In , , where is the uncertainty parameter. In particular, if , then we just write the entire HBLT model as .
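Since the matrix entries are elided above, the sketch below uses one concrete parameterization that is consistent with the properties in Lemma 7 (symmetric, doubly stochastic, with products isomorphic to scalar multiplication of the uncertainty parameters). This specific form, where a child copies its parent's value with probability (1+λ)/2, is our assumption, written down only to make Lemma 7 and Lemma 8 concrete.

```python
import numpy as np

def M(lam):
    """Assumed HBLT conditional-probability matrix with uncertainty lam in [0, 1]:
    a child copies its parent's value with probability (1 + lam) / 2."""
    return 0.5 * np.array([[1 + lam, 1 - lam],
                           [1 - lam, 1 + lam]])

lam1, lam2 = 0.9, 0.7
# Lemma 7: products in this family are commutative and isomorphic to scalar multiplication
assert np.allclose(M(lam1) @ M(lam2), M(lam1 * lam2))
assert np.allclose(M(lam1) @ M(lam2), M(lam2) @ M(lam1))

# Lemma 8 flavor: conditioning a depth-L descendant on its ancestor multiplies uncertainties
L = 5
assert np.allclose(np.linalg.multi_dot([M(lam1)] * L), M(lam1 ** L))
print("HBLT matrix identities hold under the assumed parameterization")
```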
Lemma 8.
For a latent and its descendant , we have:
(90)
where and is the descendant chain from to .
Proof.
Due to the tree structure of HBLT, we have:
(91) |
which is precisely how the entries of get computed. By leveraging the property of , we arrive at the conclusion. ∎
See 5
Proof.
Let the common latent ancestor (CLA) of and be , then we have:
(92) |
Let , then we have:
(93) |
where is a diagonal matrix, and . Note that
(94) |
And , therefore we have:
(95)
(96)
(97)
Now we compute . Note that
(98) |
Let be a 2-dimensional vector. Then we have , where is the probability distribution of class label , which can be categorical of size :
(99)
(102)
(105)
(106)
in which is the last binary variable right below the root node class label .
Therefore, , where is the uncertainty parameter of the root node .
If all for immediate parent and child , is for token and is for token , then , and and thus we have:
(107)
(108)
and the conclusion follows. ∎
Appendix B More Experiment Results
B.1 Orthogonality of embedding vectors
We verify the orthogonality assumption mentioned in our problem setting (Sec. 2). The orthogonality is measured by absolute cosine similarity of two vectors and :
(109) |
Here the two vectors and are column vectors of the out-projection (or upper) matrix of the MLPs at different layers, each corresponding to one hidden neuron. For an MLP layer with model dimension and hidden dimension , there are such column vectors. We measure the average cosine similarity across all pairs and report it in the figures.
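A sketch of this measurement is shown below; W_out stands for the out-projection (upper) MLP weight of one layer, and the loading of a specific checkpoint is omitted.

```python
import numpy as np

def avg_abs_cosine(W_out):
    """W_out: (d_model, d_hidden) out-projection matrix; each column is one hidden neuron's
    direction. Returns the average absolute pairwise cosine similarity over all column pairs."""
    V = W_out / (np.linalg.norm(W_out, axis=0, keepdims=True) + 1e-12)
    G = np.abs(V.T @ V)                                      # (d_hidden, d_hidden) |cosine| matrix
    n = G.shape[0]
    return float((G.sum() - np.trace(G)) / (n * (n - 1)))    # exclude self-pairs

# toy usage; in practice W_out is loaded from a checkpoint's MLP out-projection weight
rng = np.random.default_rng(0)
print(avg_abs_cosine(rng.standard_normal((768, 3072))))
```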
While -dimensional vectors must be linearly dependent, they are indeed almost orthogonal (i.e., ) throughout the training process, as shown below. In Fig. 11, we show the cosine similarity over the entire training process of Pythia models of different sizes. Fig. 12 further examines the early training stages, where Pythia checkpoints are more densely sampled, i.e., “steps 0 (initialization), 1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1000, and then every 1,000 subsequent steps” (Biderman et al., 2023). Finally, for models whose intermediate checkpoints are not available, we show the cosine similarity of the publicly released pre-trained models (Fig. 13).



B.2 Attention Entropy for Encoder-decoder models
We also measure how attention entropy, as well as stable rank of the in-projection (or lower) matrix in MLP, changes over time for encoder-decoder models like BERT, as shown in Fig. 14. The behavior is very similar to the decoder-only case (Fig. 7), further verifying our theoretical findings.
