
Hierarchical Associative Memory, Parallelized MLP-Mixer, and Symmetry Breaking

Ryo Karakida
AIST
Toshihiro Ota
CyberAgent
Masato Taki
Rikkyo University
Abstract

Transformers have established themselves as the leading neural network model in natural language processing and are increasingly foundational in various domains. In vision, the MLP-Mixer model has demonstrated competitive performance, suggesting that attention mechanisms might not be indispensable. Inspired by this, recent research has explored replacing attention modules with other mechanisms, including those described by MetaFormers. However, the theoretical framework for these models remains underdeveloped. This paper proposes a novel perspective by integrating Krotov’s hierarchical associative memory with MetaFormers, enabling a comprehensive representation of the entire Transformer block, encompassing token-/channel-mixing modules, layer normalization, and skip connections, as a single Hopfield network. This approach yields a parallelized MLP-Mixer derived from a three-layer Hopfield network, which naturally incorporates symmetric token-/channel-mixing modules and layer normalization. Empirical studies reveal that symmetric interaction matrices in the model hinder performance in image recognition tasks. Introducing symmetry-breaking effects transitions the performance of the symmetric parallelized MLP-Mixer to that of the vanilla MLP-Mixer. This indicates that during standard training, weight matrices of the vanilla MLP-Mixer spontaneously acquire a symmetry-breaking configuration, enhancing their effectiveness. These findings offer insights into the intrinsic properties of Transformers and MLP-Mixers and their theoretical underpinnings, providing a robust framework for future model design and optimization. (Author names are in alphabetical order.)

1 Introduction

Transformers [1], widely recognized as the preeminent neural network model in natural language processing, are now solidifying their position as a foundational technology across various domains. In the vision domain, the Vision Transformer (ViT) [2, 3] is increasingly replacing traditional convolutional neural networks, achieving significant success. Shortly after the advent of ViT, the MLP-Mixer model, which utilizes only a two-layer MLP instead of an attention module, was proposed [4, 5], suggesting that attention may not be indispensable in the vision domain. This discovery catalyzed research into replacing attention modules with other mechanisms, leading to the recent proposal of a family of models abstracted as MetaFormers [6, 7]. Observations from MetaFormers indicate that the token mixer in the architecture can be highly flexible, yet there is no concrete guideline for designing an appropriate token mixer for specific tasks within the MetaFormer context. Furthermore, while MetaFormers have generically exhibited good empirical performance, their theoretical treatment remains challenging and underdeveloped.

The Hopfield network is a classical associative memory model of a single-layer recurrent neural network [8, 9]. In this network, stored memories and their retrieval are well described by the attractors of an energy function and the dynamics converging to them. For many years, Hopfield networks were assumed to have neuron states and interaction matrices taking discrete values, making it challenging to integrate them into the modern framework of deep neural networks trained by backpropagation with gradient-based methods. Recently, Ramsauer et al. proposed a modern Hopfield network with continuous states and interactions [10, 11]. They claimed that the update rule for neuron states in a modern Hopfield network with a specific energy function is essentially equivalent to the attention module, although their discussion was heuristic and somewhat ad hoc. Around the same time, Krotov and Hopfield introduced a two-layer Hopfield network equipped with Lagrangian functions that define the system [12]. They demonstrated that the modern Hopfield network of Ramsauer et al. can be realized as a special case within the framework of the two-layer Hopfield network by choosing particular Lagrangian functions. Subsequently, Krotov extended this two-layer Hopfield network to a hierarchical model, successfully constructing a general $L$-layer hierarchical associative memory model [13]. In the framework of these generalized Hopfield networks by Krotov and Hopfield, which we refer to as hierarchical Hopfield networks, each model of Hopfield network is instantiated by determining a specific combination of Lagrangians. Models derived in this manner thus inherently possess energy functions, with clear mathematical properties. It has been shown that by adopting appropriate Lagrangians, modules such as attention, two-layer MLP, convolution, and pooling, all of which are used as token mixers in MetaFormers, can be derived [12, 13, 14] from the hierarchical Hopfield network.

Inspired by the fact that hierarchical Hopfield networks encompass a certain class of MetaFormers, recent research has begun to explore the fundamental properties of Transformers and MLP-Mixers (more generally, MetaFormers) through the lens of Hopfield networks [15, 16]. In previous studies, the correspondence between hierarchical Hopfield networks and MetaFormer models indicated that only token-mixing modules such as attention and spatial MLP correspond to specific types of Hopfield networks. In this paper, we propose a new model by introducing a novel perspective into Krotov’s hierarchical associative memory, allowing the entire Transformer (MetaFormer) block, not just the token-mixing module but also the channel-mixing module, layer normalization, and skip connection, to correspond exactly to a single Hopfield network. This model, derived from a three-layer Hopfield network, naturally yields MetaFormers with parallelized token-/channel-mixing modules. Based on this theoretical derivation, we focus on a specific combination of Lagrangians, presenting a parallelized MLP-Mixer, and examine its fundamental properties. Viewed as a stack of associative memory models, the proposed model has interaction matrices that exhibit a certain symmetry (symmetric parallelized MLP-Mixer, SymMixer), while empirical observations show that this symmetry actually hinders its performance as an image recognition model. By introducing symmetry-breaking effects into the interaction weight matrices, we confirm that the performance of SymMixer essentially transitions to that of the vanilla MLP-Mixer. This suggests that during normal training of the vanilla MLP-Mixer, the weight matrices spontaneously acquire a symmetry-breaking configuration through learning.

The rest of this paper is organized as follows. After reviewing related work, we provide a general background for our analysis, giving a brief overview of the two-layer Hopfield network and showing the correspondence between a certain type of Hopfield network and the MLP-Mixer. We then move on to the hierarchical associative memory model, which is the main subject of this paper. In Sec. 4, we introduce a three-layer Hopfield network and study its basic properties as an associative memory model. A prototype of the parallelized MetaFormer emerges from a specific configuration of neurons and a combination of Lagrangians. In the subsequent section, we derive a class of MLP-Mixers with parallelized mixing layers as a stack of associative memory models. Empirical studies reveal that the symmetric weights of mixing layers are indeed a constraint for image recognition tasks. In Sec. 6, we demonstrate that, on the Hopfield model side, symmetry breaking resolves degeneracies of local minima of an energy function, and that it plays a crucial role in ensuring performance on the Mixer model side. The code for our experiments is available at https://github.com/Toshihiro-Ota/paramixer.

2 Related Work

While it is widely believed that an attention mechanism is critical to the success of ViT, there have also been attention-free alternatives. MLP-Mixer [4, 5, 17, 18] has shown that by simply replacing the attention mechanism of ViT with MLP, it is possible to achieve performance approaching that of ViT. This discovery has stimulated a series of studies showing that a wide range of token-mixing mechanisms, including pooling [6], global filtering [19], recurrent layers [20], and graph neural networks [21], can be used as substitutes for the attention mechanism. Wang et al. [22] demonstrate that the token mixer can even be removed.

The classical Hopfield network [8, 9] is known to suffer from limited memory capacity. To address this issue, models with substantially higher memory capacities have been proposed [23, 24, 25], but they require many-body interactions among neurons, which is biologically implausible. Recently, Krotov and Hopfield developed more comprehensive associative memory models that consist of visible and hidden neurons with only two-body interactions between them [12, 13]. Based on their findings, there has been a growing number of studies aiming to understand modern neural networks through the perspective of Hopfield networks [26, 27, 28, 29, 30, 31, 32]; see also [33].

3 Background

To clarify the notation, we provide here the derivation of the mixing layers in MLP-Mixer [4] from a continuous Hopfield network.

3.1 Overview of the two-layer Hopfield network

Let us first briefly review the two-layer Hopfield network proposed in [12]. In this system, the dynamical variables are composed of $N_{v}$ visible neurons and $N_{h}$ hidden neurons, both continuous,

v(t)\in\mathbb{R}^{N_{v}},\quad h(t)\in\mathbb{R}^{N_{h}}, (1)

where the argument $t$ can be thought of as “time”. The interaction matrices between them,

\xi\in\mathbb{R}^{N_{h}\times N_{v}},\quad\xi^{\prime}\in\mathbb{R}^{N_{v}\times N_{h}}, (2)

are basically supposed to be symmetric: $\xi^{\prime}=\xi^{\top}$; see Fig. 1. With the relaxation time constants of the two groups of neurons, $\tau_{v}$ and $\tau_{h}$, the dynamics of the system is described by the following differential equations,

\tau_{v}\frac{dv_{i}(t)}{dt}=\sum_{\mu=1}^{N_{h}}\xi_{i\mu}f_{\mu}(h(t))-v_{i}(t), (3)
\tau_{h}\frac{dh_{\mu}(t)}{dt}=\sum_{i=1}^{N_{v}}\xi_{\mu i}g_{i}(v(t))-h_{\mu}(t), (4)

where the activation functions $f$ and $g$ are determined through Lagrangians $L_{h}:\mathbb{R}^{N_{h}}\to\mathbb{R}$ and $L_{v}:\mathbb{R}^{N_{v}}\to\mathbb{R}$, such that

f_{\mu}(h)=\frac{\partial L_{h}(h)}{\partial h_{\mu}},\quad g_{i}(v)=\frac{\partial L_{v}(v)}{\partial v_{i}}. (5)

The canonical energy function for this system is given by

E(v,h)=\sum_{i=1}^{N_{v}}v_{i}g_{i}(v)-L_{v}(v)+\sum_{\mu=1}^{N_{h}}h_{\mu}f_{\mu}(h)-L_{h}(h)-\sum_{\mu,i}f_{\mu}\xi_{\mu i}g_{i}. (6)

One can easily find that this energy function monotonically decreases along the trajectories of the dynamical equations defining this associative memory model,

\frac{dE(v(t),h(t))}{dt}\leq 0, (7)

provided that the Hessians of the Lagrangians are positive semi-definite. In addition, if the overall energy function is bounded from below, the trajectory is guaranteed to converge to a fixed-point attractor state, which corresponds to one of the local minima of the energy function. Such fixed points and the process of convergence toward them are interpreted as the stored memories and the memory retrieval of an associative memory model. The formulation of neural networks in terms of Lagrangians and the associated energy functions enables us to easily experiment with different choices of activation functions and different architectural arrangements of neurons.
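As a concrete illustration, the following is a minimal numerical sketch (not the authors' code) of this two-layer system in PyTorch. It assumes simple illustrative Lagrangians, a quadratic $L_{v}$ (so $g$ is the identity) and a rectified quadratic $L_{h}$ (so $f=\operatorname{ReLU}$), and checks that the energy of Eq. (6) decreases under Euler integration of Eqs. (3) and (4).

import torch

torch.manual_seed(0)
Nv, Nh = 16, 32
xi = 0.1 * torch.randn(Nh, Nv)     # interaction matrix xi in R^{Nh x Nv}, with xi' = xi^T
tau_v, tau_h, dt = 1.0, 0.5, 0.01

g = lambda v: v                    # from L_v(v) = ||v||^2 / 2
f = lambda h: torch.relu(h)        # from L_h(h) = sum_mu max(h_mu, 0)^2 / 2

def energy(v, h):
    # Eq. (6) specialized to these illustrative Lagrangians
    Lv = 0.5 * (v ** 2).sum()
    Lh = 0.5 * (torch.relu(h) ** 2).sum()
    return (v * g(v)).sum() - Lv + (h * f(h)).sum() - Lh - f(h) @ xi @ g(v)

v, h = torch.randn(Nv), torch.randn(Nh)
for step in range(2001):
    dv = (xi.T @ f(h) - v) / tau_v     # Eq. (3)
    dh = (xi @ g(v) - h) / tau_h       # Eq. (4)
    v, h = v + dt * dv, h + dt * dh
    if step % 500 == 0:
        print(f"step {step:4d}  E = {energy(v, h).item():.4f}")

With a sufficiently small step size, the printed energy is (approximately) non-increasing, in accordance with Eq. (7).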

Figure 1: The two-layer Hopfield network. (a) Visible and hidden neurons interact with each other, and the two layers form a bipartite graph. (b) Each of the token- (spatial) and channel-mixing blocks in the mixing layer of MLP-Mixer can be realized as an instance with the corresponding Lagrangians.

3.2 Mixing layer as an associative memory model

Suppose we have a fixed interaction matrix $\xi_{\mu i}$; then the system is defined by the choice of Lagrangians $L_{h}$ and $L_{v}$. Tang and Kopp [14] demonstrated that the specific choice of Lagrangians called “model C” in [12] essentially reproduces the mixing layers in the MLP-Mixer; this choice is given by the following Lagrangians:

L_{h}(h)=\sum_{\mu}F(h_{\mu}),\quad L_{v}(v)=\sqrt{\sum_{i}(v_{i}-\bar{v})^{2}}, (8)

where $F$ will be specified below, and $\bar{v}=\sum_{i}v_{i}/N_{v}$. For these Lagrangians, the activation functions are (note that $g_{i}$ can include learnable parameters such as $\bm{\alpha}$ and $\bm{\beta}$ as in [34] by slightly modifying the Lagrangian $L_{v}$; see Appendix A)

f_{\mu}(h)=\frac{\partial L_{h}}{\partial h_{\mu}}=F^{\prime}(h_{\mu}), (9)
g_{i}(v)=\frac{\partial L_{v}}{\partial v_{i}}=\frac{v_{i}-\bar{v}}{\sqrt{\sum_{j}(v_{j}-\bar{v})^{2}}}=\operatorname{LayerNorm}(v)_{i}. (10)
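The identification in Eq. (10) can also be checked directly with automatic differentiation; the short sketch below (an illustration, not part of the original derivation) confirms that the gradient of $L_{v}$ coincides with the parameter-free layer normalization.

import torch

torch.manual_seed(0)
v = torch.randn(8, requires_grad=True)

L_v = torch.sqrt(((v - v.mean()) ** 2).sum())                    # Eq. (8)
grad, = torch.autograd.grad(L_v, v)

ln = (v - v.mean()) / torch.sqrt(((v - v.mean()) ** 2).sum())    # Eq. (10)
print(torch.allclose(grad, ln.detach()))                         # True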

We now consider the adiabatic limit, $\tau_{v}\gg\tau_{h}$, which means that the dynamics of the hidden neurons is much faster than that of the visible neurons, i.e., we can take $\tau_{h}\to 0$:

\text{Eq.~(4)}\quad\rightsquigarrow\quad h_{\mu}(t)=\sum_{i=1}^{N_{v}}\xi_{\mu i}\operatorname{LayerNorm}(v(t))_{i}. (11)

By substituting this expression into the other dynamical equation, we find

\tau_{v}\frac{dv_{i}(t)}{dt}=\sum_{\mu}\xi_{i\mu}F^{\prime}\bigg(\sum_{j}\xi_{\mu j}\operatorname{LayerNorm}(v(t))_{j}\bigg)-\alpha v_{i}(t). (12)

Notice that we can put an arbitrary coefficient $\alpha$, which can even be zero, in front of the decay term, since for this choice of the Lagrangian $L_{v}$ its Hessian has a zero mode:

\sum_{j}M_{ij}(v_{j}-\bar{v})=0,\quad M_{ij}:=\frac{\partial^{2}L_{v}}{\partial v_{i}\partial v_{j}}. (13)

If we take $\alpha=0$ and discretize the differential equation by taking $\Delta t=\tau_{v}$, then we obtain the update rule for the visible neurons as

v_{i}(t+1)=v_{i}(t)+\sum_{\mu}\xi_{i\mu}\,\sigma\bigg(\sum_{j}\xi_{\mu j}\operatorname{LayerNorm}(v(t))_{j}\bigg), (14)

where we defined $\sigma:=F^{\prime}$. If it is chosen as $\sigma=\operatorname{GELU}$, this update rule is identified with the token- and channel-mixing blocks in the mixing layers discussed in [4], as in Fig. 1.
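In code, one step of Eq. (14) is only a few lines; the sketch below is a minimal PyTorch illustration (the shapes and the scale of $\xi$ are arbitrary choices, not taken from the paper).

import torch
import torch.nn.functional as F

def mixing_block_update(v, xi):
    # Eq. (14): v <- v + xi^T sigma(xi LayerNorm(v)), with sigma = GELU
    vn = (v - v.mean()) / torch.sqrt(((v - v.mean()) ** 2).sum())   # Eq. (10)
    return v + xi.T @ F.gelu(xi @ vn)

v = torch.randn(196)                       # e.g. a single channel over 196 tokens
xi = torch.randn(256, 196) / 196 ** 0.5    # shared weight of the block
v_next = mixing_block_update(v, xi)

Note that the same matrix $\xi$ appears in both the inner and outer products, which is the weight symmetry inherited from the Hopfield construction.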

4 Hierarchical Associative Memory Model

Along the lines of [12], Krotov further extended the two-layer Hopfield network to a hierarchical associative memory model with multiple hidden layers [13], which we refer to as the hierarchical Hopfield network. Based on the observations in Sec. 3.2 and Krotov’s extension, in the subsequent sections we study a type of MetaFormer through the lens of associative memory models and their variants.

First, in this section we consider the hierarchical extension of the discussion of the previous section. We derive a prototype of the parallelized MetaFormers via a different viewpoint from the original formulation of the hierarchical Hopfield network, and demonstrate its basic properties as an associative memory model.

4.1 Energy MetaFormer

As a simple extension in the context of the hierarchical Hopfield network, we consider a structure with three layers: one visible layer with $N_{v}$ neurons and two hidden layers with $N_{s}$ and $N_{c}$ neurons, respectively. The crucial point is that the set of visible neurons lies in between the two hidden layers, as depicted in Fig. 2, unlike the original configuration discussed in [13]. The dynamics of this system is then described by the following differential equations,

\tau_{s}\frac{dx^{s}_{\alpha}(t)}{dt}=\sum_{i=1}^{N_{v}}\xi^{(s,v)}_{\alpha i}g^{v}_{i}(x^{v}(t))-x^{s}_{\alpha}(t), (15)
\tau_{v}\frac{dx^{v}_{i}(t)}{dt}=\sum_{\alpha=1}^{N_{s}}\xi^{(v,s)}_{i\alpha}g^{s}_{\alpha}(x^{s}(t))+\sum_{\beta=1}^{N_{c}}\xi^{(v,c)}_{i\beta}g^{c}_{\beta}(x^{c}(t))-x^{v}_{i}(t), (16)
\tau_{c}\frac{dx^{c}_{\beta}(t)}{dt}=\sum_{i=1}^{N_{v}}\xi^{(c,v)}_{\beta i}g^{v}_{i}(x^{v}(t))-x^{c}_{\beta}(t), (17)

where $\xi^{(A,B)}$ denotes the interaction from $B$ neurons to $A$ neurons: $\xi^{(A,B)}\in\mathbb{R}^{N_{A}\times N_{B}}$. The activation functions are again determined through the corresponding Lagrangians,

g^{s}_{\alpha}(x^{s})=\frac{\partial L^{s}(x^{s})}{\partial x^{s}_{\alpha}},\quad g^{v}_{i}(x^{v})=\frac{\partial L^{v}(x^{v})}{\partial x^{v}_{i}},\quad g^{c}_{\beta}(x^{c})=\frac{\partial L^{c}(x^{c})}{\partial x^{c}_{\beta}}. (18)

In this system, the canonical energy function is given by [13]

E(x^{s},x^{v},x^{c})=\sum_{A=s,v,c}\left(\left(x^{A}\right)^{\top}g^{A}-L^{A}\right)-\sum_{A=s,v}\left(g^{A+1}\right)^{\top}\xi^{(A+1,A)}g^{A}, (19)

where $s+1:=v$ and $v+1:=c$, and the arguments of $g^{A}$ and $L^{A}$ are omitted for simplicity. We have switched to matrix notation for brevity, and all products hereafter imply matrix multiplication. Under the symmetric condition on the interaction matrices,

\xi^{(v,s)}=\left(\xi^{(s,v)}\right)^{\top},\quad\xi^{(v,c)}=\left(\xi^{(c,v)}\right)^{\top}, (20)

and provided that the Hessians of the Lagrangians are positive semi-definite, the above dynamical equations ensure that

\frac{dE(x^{s}(t),x^{v}(t),x^{c}(t))}{dt}\leq 0. (21)

This fact demonstrates that we have a well-defined Lyapunov function for this system, which monotonically decreases along the trajectory of the dynamical equations.

We now take the specific Lagrangian for the visible neurons as

L^{v}(x^{v})=\sqrt{\sum_{i=1}^{N_{v}}(x^{v}_{i}-\bar{v})^{2}},\quad\bar{v}=\frac{1}{N_{v}}\sum_{i}x^{v}_{i}. (22)

Then, the dynamical equations become

\tau_{s}\frac{dx^{s}(t)}{dt}=\xi^{(s,v)}\operatorname{LayerNorm}(x^{v}(t))-x^{s}(t),
\tau_{v}\frac{dx^{v}(t)}{dt}=\xi^{(v,s)}g^{s}(x^{s}(t))+\xi^{(v,c)}g^{c}(x^{c}(t))-x^{v}(t), (23)
\tau_{c}\frac{dx^{c}(t)}{dt}=\xi^{(c,v)}\operatorname{LayerNorm}(x^{v}(t))-x^{c}(t).

It is not necessary to specify the activation functions $g^{s}$ and $g^{c}$ at this stage. Following the discussion in Sec. 3.2, the discrete update rule for the $x^{v}$ neurons will lead to a new type of MLP-Mixer model, which we discuss in the next section. The flexibility of the activation functions $g^{s}$ and $g^{c}$ suggests that the system in turn forms a class of MetaFormers. Thus, these dynamical equations represent a hierarchical associative memory model, serving as a continuous MetaFormer endowed with an energy function. This fact motivates us to refer to this system as the Energy MetaFormer.
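For intuition, a minimal sketch of these dynamics is given below: simple Euler integration of Eq. (23) with illustrative sizes and $\operatorname{ReLU}$ hidden activations, neither of which is fixed by the equations themselves.

import torch

torch.manual_seed(0)
Nv, Ns, Nc = 64, 100, 100
xi_sv = torch.randn(Ns, Nv) / Nv ** 0.5    # xi^{(s,v)}
xi_cv = torch.randn(Nc, Nv) / Nv ** 0.5    # xi^{(c,v)}
tau_s, tau_v, tau_c, dt = 0.1, 1.0, 0.1, 0.01

def layernorm(x):
    return (x - x.mean()) / torch.sqrt(((x - x.mean()) ** 2).sum())

gs = gc = torch.relu                       # left unspecified in Eq. (23); ReLU chosen here

xs, xv, xc = torch.zeros(Ns), torch.randn(Nv), torch.zeros(Nc)
for _ in range(1000):
    dxs = (xi_sv @ layernorm(xv) - xs) / tau_s                   # first line of Eq. (23)
    dxv = (xi_sv.T @ gs(xs) + xi_cv.T @ gc(xc) - xv) / tau_v     # second line, symmetric weights
    dxc = (xi_cv @ layernorm(xv) - xc) / tau_c                   # third line of Eq. (23)
    xs, xv, xc = xs + dt * dxs, xv + dt * dxv, xc + dt * dxc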

Figure 2: Energy MetaFormer. (a) Hierarchical Hopfield network. (b) Receptive fields. (c) Energy descent.

4.2 Numerical demonstration

Before getting into the discussion of deep neural networks, it is informative to demonstrate the functionality of the Energy MetaFormer as an associative memory. To gain intuition, in this subsection we instantiate a model by taking a specific combination of Lagrangians and train the model on a denoising task using Eqs. (23). For the numerical demonstration, we utilize the HAMUX framework [35] (https://github.com/bhoov/barebones-hamux) to implement the Energy MetaFormer.

Here, we set the dimensions of each layer to $N_{v}=784$ and $N_{s}=N_{c}=900$, and take the hidden Lagrangians as $\frac{1}{2}\max(x,0)^{2}$, which results in the $\operatorname{ReLU}$ activation function: $g^{s}=g^{c}=\operatorname{ReLU}$. The energy function for this system is determined by the Lagrangians,

E(x^{s},x^{v},x^{c})={x^{v}}^{\top}\operatorname{LayerNorm}(x^{v})-\sqrt{\sum_{i}(x^{v}_{i}-\bar{v})^{2}}
+\sum_{A=s,c}\left(\left(x^{A}\right)^{\top}\operatorname{ReLU}(x^{A})-\frac{1}{2}\max(x^{A},0)^{2}\right)
-\operatorname{LayerNorm}(x^{v})^{\top}\xi^{(v,s)}\operatorname{ReLU}(x^{s})-\operatorname{ReLU}(x^{c})^{\top}\xi^{(c,v)}\operatorname{LayerNorm}(x^{v}). (24)

The model is trained on 60,000 training samples from the MNIST dataset [36]. The mini-batch size is set to 512. We add Gaussian noise to the training batches before inputting them into the model. The input (the initial state of the visible neurons) evolves according to the dynamical equations (23), and the model eventually outputs the state at one of the minima of the energy function Eq. (24). We compute the mean-squared error between the outputs and the true images as the objective function. The model is trained for 100 epochs with the Adam [37] optimizer (using Optax [38] defaults: $\beta_{1}=0.9$ and $\beta_{2}=0.999$) at a constant learning rate of $10^{-4}$.

Figure 2 depicts the receptive fields (a part of the weight vectors $\xi^{(s,v)}_{\alpha}$ and $\xi^{(c,v)}_{\beta}$) of the Energy MetaFormer learned during training, which roughly correspond to the minima of the energy function adapted to the MNIST training samples. We can also verify the energy descent of the model. Randomly initialized neuron states of the trained model evolve according to the dynamical equations (23) and converge to one of the minima of the energy function, as illustrated in Fig. 2. These results demonstrate that the Energy MetaFormer has successfully learned to embed the training dataset into the network (weight matrices) and functions effectively as an associative memory model.

5 Parallelized MLP-Mixer

In this section, we advance the discussions from the previous section to derive a parallelized MLP-Mixer from the hierarchical Hopfield network. We also perform experiments to explore the visual recognition capabilities of the proposed models, keeping in mind that they are conceived as a stack of associative memory models.

5.1 Model

Figure 3: Parallelized MLP-Mixer. (a) Three-layer hierarchical Hopfield network with a 2d structure for the visible neurons. The hidden layers interact with the visible neurons along mutually perpendicular directions. (b) The whole mixing layer, composed of parallelized token- and channel-mixing modules, is identified with an update rule of a hierarchical Hopfield network.

To derive the model, we essentially follow the three-layer hierarchical Hopfield network setup discussed in the previous section. A key point here is to introduce a two-dimensional structure into the layer of visible neurons, as shown in Fig. 3. The set of visible neurons can now be considered as a matrix rather than a vector, $x^{v}\in\mathbb{R}^{N_{v_{s}}\times N_{v_{c}}}$, $N_{v}=N_{v_{s}}\times N_{v_{c}}$, and it interacts with the hidden neurons $x^{s}$ and $x^{c}$ along mutually perpendicular directions.

We continue to study the system of Eq. (23), noting that $x^{v}$ now carries two indices, e.g.,

\left(\xi^{(s,v)}\operatorname{LayerNorm}(x^{v})\right)_{\alpha}=\sum_{i,I}\xi^{(s,v)}_{\alpha iI}\operatorname{LayerNorm}(x^{v})_{iI}
=\sum_{i,I}\xi^{(s,v)}_{\alpha iI}\frac{x^{v}_{iI}-\bar{v}}{\sqrt{\sum_{j,J}(x^{v}_{jJ}-\bar{v})^{2}}}, (25)

etc., where $i=1,\dots,N_{v_{s}}$ and $I=1,\dots,N_{v_{c}}$.

We now consider the adiabatic limit, $\tau_{v}\gg\tau_{s},\,\tau_{c}$; then the dynamical equations for the $x^{s}$ and $x^{c}$ neurons reduce to

x^{s}(t)=\xi^{(s,v)}\operatorname{LayerNorm}(x^{v}(t)),\quad x^{c}(t)=\xi^{(c,v)}\operatorname{LayerNorm}(x^{v}(t)). (26)

By substituting these expressions into the remaining equation and following the same discussion as in Sec. 3.2, the differential equation for $x^{v}$ eventually becomes the update rule for the visible neurons:

x^{v}(t+1)=x^{v}(t)+\xi^{(v,s)}g^{s}\left(\xi^{(s,v)}\operatorname{LayerNorm}(x^{v}(t))\right)+\xi^{(v,c)}g^{c}\left(\xi^{(c,v)}\operatorname{LayerNorm}(x^{v}(t))\right). (27)

Furthermore, for comparison with the vanilla MLP-Mixer (or, more generally, MetaFormers), writing the symmetric matrices

W_{2}=W_{1}^{\top};\quad W_{1}\in\mathbb{R}^{N_{s}\times N_{v_{s}}},\quad W_{2}\in\mathbb{R}^{N_{v_{s}}\times N_{s}}, (28)
W_{4}=W_{3}^{\top};\quad W_{3}\in\mathbb{R}^{N_{v_{c}}\times N_{c}},\quad W_{4}\in\mathbb{R}^{N_{c}\times N_{v_{c}}}, (29)

we truncate the interaction matrices between the visible and hidden neurons as

\xi^{(s,v)}=(\xi^{(s,v)}_{\alpha iI})=(W_{1})_{\alpha i}, (30)
\xi^{(c,v)}=(\xi^{(c,v)}_{\beta iI})=(W_{4})_{\beta I}. (31)

With these expressions, we obtain the update rule for the visible neurons as shown in Fig. 3,

x^{v}(t+1)=x^{v}(t)+W_{2}\,g^{s}\left(W_{1}\operatorname{LayerNorm}(x^{v}(t))\right)+\left(g^{c}\left(\operatorname{LayerNorm}(x^{v}(t))W_{3}\right)\right)W_{4}. (32)

The terms on the right-hand side correspond to the skip connection, the token-mixing module, and the channel-mixing module, respectively. The activation functions $g^{s}$ and $g^{c}$ are not specified at this stage; in the experiments we set $g^{s}=g^{c}=\operatorname{GELU}$ to compare the proposed models with the vanilla MLP-Mixer.

Our single update rule (32) contains all the components of the mixing layer: the token- and channel-mixing modules, layer normalization, and skip connection, unlike Eq. (14) in Sec. 3.2. The two main differences from an ordinary mixing layer arise from our construction via the correspondence with the hierarchical Hopfield network: the layer normalization is applied simultaneously along both the token and channel axes, as in Eq. (25), and the layer processes features for the token- and channel-mixing modules in a parallel manner. (Recently, a parallel Transformer block has occasionally been adopted for some notable models (e.g., [39, 40]), and it shows comparable performance empirically.) A model consisting of stacked associative memory models with Eq. (32) as its mixing layers becomes an MLP-Mixer model composed of parallelized token- and channel-mixing modules.
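A compact sketch of one step of Eq. (32) is shown below; it is an illustration with arbitrary shapes, and the actual timm-based pMixerBlock used in the experiments is given in Appendix B.1. Omitting W2 and W4 enforces the symmetric tying of Eqs. (28) and (29), i.e. the SymMixer case, while passing them independently gives the unconstrained ParaMixer update.

import torch
import torch.nn.functional as F

def sym_layernorm(x):
    # LayerNorm over both token and channel axes, as in Eq. (25)
    return (x - x.mean()) / torch.sqrt(((x - x.mean()) ** 2).sum())

def parallel_mixing_update(x, W1, W3, W2=None, W4=None):
    # One step of Eq. (32) on x in R^{N_vs x N_vc} (tokens x channels).
    W2 = W1.T if W2 is None else W2      # SymMixer tying W2 = W1^T when omitted
    W4 = W3.T if W4 is None else W4      # SymMixer tying W4 = W3^T when omitted
    xn = sym_layernorm(x)
    token_mix = W2 @ F.gelu(W1 @ xn)     # mixes along the token axis
    channel_mix = F.gelu(xn @ W3) @ W4   # mixes along the channel axis
    return x + token_mix + channel_mix

# Example shapes: 196 tokens, 256 channels, hidden widths Ns = 98, Nc = 1024
x = torch.randn(196, 256)
W1 = torch.randn(98, 196) / 196 ** 0.5
W3 = torch.randn(256, 1024) / 256 ** 0.5
x_sym = parallel_mixing_update(x, W1, W3)                      # SymMixer-style step
x_para = parallel_mixing_update(x, W1, W3,
                                W2=torch.randn(196, 98) / 98 ** 0.5,
                                W4=torch.randn(1024, 256) / 1024 ** 0.5)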

5.2 Experiments

We incorporate the update rule Eq. (32) into the mixing layers, which results in a parallelized MLP-Mixer as a stack of associative memory models. To investigate the trainability of the proposed model and the implications from the correspondence with the Hopfield network side, we conduct several empirical studies to compare our model with the vanilla MLP-Mixer (VanillaMixer). The statistics of the results are obtained from ten trials with random initialization across all the experiments in this subsection. The experimental details are provided in Appendix B.

Network architecture. We consider two cases of the model: the parallelized MLP-Mixer with symmetric weights as in Eqs. (28) and (29), which we refer to as SymMixer, and the parallelized MLP-Mixer without the weight constraint for comparison, which we simply refer to as ParaMixer. We incorporate the proposed module Eq. (32) into the mixing layers and keep the rest of the network exactly identical to VanillaMixer. By this construction, Para/Sym-Mixer have no additional hyperparameters compared to VanillaMixer. The main difference between the proposed models and the ordinary MLP-Mixer is the normalization part within the mixing layers. As already mentioned, the layer normalization is applied over both the token and channel axes. This stems from the symmetry of the two directions inherent in the visible neurons, as in Fig. 3, and is thus natural from the correspondence with the Hopfield network side. For control experiments, we implement this symmetric layer normalization for all the models, and examine some ablation studies below. We utilize PyTorch Image Models timm [41] to implement the models in all the experiments in this section.

Trainability. To study the trainability aspect of Para/Sym-Mixer, we perform scratch training of the models on a classification task with CIFAR-10/100 [42]. The CIFAR-10/100 datasets each consist of 60,000 natural images of size $32\times 32$. The ground-truth object category labels are attached to each image, and the number of categories is 10/100, with 6,000/600 images per class. There are 50,000 training images and 10,000 test images. For the training setting, we basically follow the previous works [3, 43]. The images are resized to $224\times 224$, and the AdamW [44] optimizer is used. We set the base learning rate to $\frac{\text{batch size}}{512}\times 5\times 10^{-4}$, and the mini-batch size is 384. As regularization methods, we employ label smoothing [45] and stochastic depth [46]. For data augmentation, we apply cutout [47], cutmix [48], mixup [49], random erasing [50], and randaugment [51]. For more details, see Appendix B.2. We train all the Mixer models under the exact same training configuration for fair comparison. All the trainings are conducted with four V100 GPUs.

Table 1: Top-1 accuracy (%) of the Mixer models.
Model CIFAR-10 CIFAR-100
VanillaMixer 89.61 ± 0.22 69.13 ± 0.52
ParaMixer 89.80 ± 0.29 69.05 ± 0.53
SymMixer 77.86 ± 0.53 53.19 ± 0.21

Table 1 shows the results. The performance of ParaMixer is competitive with VanillaMixer, whereas SymMixer exhibits significant performance drops. This is quite counterintuitive since the difference between ParaMixer and SymMixer is only the symmetry constraint on the weight matrices. From previous studies [52, 53], it may be expected that many modern neural network architectures exhibit robust performance under certain symmetric constraints on weights. The enormous performance drop from ParaMixer to SymMixer suggests that symmetric weights and their symmetry breaking have an important effect on the capability of mixing layers. We will investigate this aspect in the next section.

Iterative mixing layers. Since the proposed mixing layer Eq. (32) is viewed as a single associative memory model, it is natural to consider iterative updates of the neurons. We use the trained Mixer models for CIFAR-10 and iteratively update the neuron states of the last mixing layer, which is regarded as the linear classifier of the network. The results are shown in Fig. 4. The inputs are 1,000 train and test samples of CIFAR-10 chosen at random. From these results, we observe that the classification performance of ParaMixer significantly degrades as the number of iterations of the mixing layer increases, while SymMixer maintains its performance over long iterations. This suggests that SymMixer stores associative memories of features during training and that the symmetric weights can properly retrieve them in the inference phase. We have also plotted the results for VanillaMixer for comparison, although they are difficult to interpret, since the mixing layers of VanillaMixer have nothing to do with the Hopfield network side.

Figure 4: Accuracy vs. number of iterations of the last mixing layer. (a) Train samples. (b) Test samples.
Table 2: Top-1 accuracy (%) of the Mixer models trained with CIFAR-10 from scratch. All the mixing layers of the models are iteratively applied.
# iteration VanillaMixer ParaMixer SymMixer
1 (baseline) 89.61 ± 0.22 89.80 ± 0.29 77.86 ± 0.53
2 90.38 ± 0.36 91.23 ± 0.24 78.67 ± 0.24
3 90.31 ± 0.46 91.34 ± 0.22 79.02 ± 0.18
4 90.30 ± 0.44 91.40 ± 0.22 79.16 ± 0.21

As another aspect of the iterative mixing layers, we also perform scratch training of the Mixer models with multiple iterations for all the mixing layers. Here, we vary the number of iterations for all the mixing layers from one to four, and keep the rest of the training configuration intact. Table 2 shows the results. The results for a single iteration indicate the baseline from Table 1. We observe that the performance of Para/Sym-Mixer monotonically improves with a larger number of iterations, while the performance improvement of VanillaMixer stops immediately. In particular, there is a clear margin between ParaMixer and VanillaMixer for a large number of iterations.

Ablation study. We here consider some ablation studies to examine the functions of the proposed models. Table 3 shows the overall results. We did not include bias terms in the two-layer MLPs of the token- and channel-mixing blocks in Para/Sym-Mixer due to the correspondence with the Hopfield network side, while VanillaMixer does include them. Table 3 tells us that adding the bias terms does not affect performance much, as ParaMixer with bias terms and VanillaMixer differ only in whether the mixing blocks are parallel or serial, as depicted in Figs. 3 and 1. As in Table 1, we have observed enormous performance drops in SymMixer. One might expect that the small number of parameters causes such a performance drop, since SymMixer has only roughly half as many parameters as ParaMixer. (The number of parameters for each model is provided in Appendix B.2.) We train ParaMixer with half widths (Para (hw)) and SymMixer with double widths (Sym (dw)) to match the parameter-count condition, and keep the remaining configurations unchanged.

Table 3: Ablation studies for the Mixer models trained with CIFAR-10 from scratch. Each entry corresponds to the top-1 accuracy (%) of the model.
(a) add bias
Model Top-1 acc.
Vanilla 89.61 ± 0.22
Para + bias 89.77 ± 0.19
Para 89.80 ± 0.29
Sym 77.86 ± 0.53
(b) double/half the widths
Model Top-1 acc.
Para 89.80 ± 0.29
Para (hw) 87.14 ± 0.33
Sym (dw) 79.11 ± 0.36
Sym 77.86 ± 0.53
(c) symmetric vanilla
Model Top-1 acc.
Vanilla 89.61 ± 0.22
VanillaSym 81.80 ± 0.20
Para 89.80 ± 0.29
Sym 77.86 ± 0.53
(d) channel-only layernorm
Model Top-1 acc.
Vanilla 91.03 ± 0.35
Para 90.73 ± 0.21
Sym 80.91 ± 0.21

Table 3 shows that there is still a large gap between Para (hw) and Sym (dw). This result indicates that the symmetry condition imposed on the weight matrices of mixing layers contributes more crucially to the classification performance than the number of parameters. In addition, this trend is observed not only between Para/Sym-Mixer but also in VanillaMixer (Table 3). This may imply that the symmetry breaking of weight matrices in the mixing modules of MetaFormers could generically be crucial to ensuring performance. Finally, to examine the effect of the unusual layer normalization within the Mixer models, we replace the normalization layer with the ordinary channel-only layer normalization. From Table 3, we see that all the Mixer models achieve slightly better performance. This is probably because our layer normalization normalizes the features over both token and channel axes, causing the sequence of token vectors to be overly normalized, thus suppressing individuality.

6 Symmetry Breaking

From the discussion in the previous section, we observed that the symmetry condition imposed on the weight matrices of mixing layers is actually a constraint for the image recognition task. In this section, we explore the effect of symmetry breaking of weight matrices in the context of Hopfield networks, and investigate its implications for Mixer models.

6.1 A formulation

Hopfield networks, as associative memory models, are generally supposed to have symmetric weight matrices between neuron layers, as discussed in the previous sections. In order to examine the effect of symmetry breaking of weight matrices, we propose a minimal extension of the setup of the three-layer hierarchical Hopfield network. In doing so, it is natural to extend the original energy function Eq. (19) to the following interaction-symmetric form,

E(x^{s},x^{v},x^{c})=\sum_{A=s,v,c}\left(\left(x^{A}\right)^{\top}g^{A}-L^{A}\right)
-\frac{1}{2}\left(\left(g^{v}\right)^{\top}\xi^{(v,s)}g^{s}+\left(g^{s}\right)^{\top}\xi^{(s,v)}g^{v}\right)-\frac{1}{2}\left(\left(g^{c}\right)^{\top}\xi^{(c,v)}g^{v}+\left(g^{v}\right)^{\top}\xi^{(v,c)}g^{c}\right), (33)

where the arguments of $g^{A}$ and $L^{A}$ are again omitted. If $\xi^{(v,s)}=\left(\xi^{(s,v)}\right)^{\top}$ and $\xi^{(v,c)}=\left(\xi^{(c,v)}\right)^{\top}$, this energy function is nothing but the original one.

We then extend the hidden-to-visible interactions to include a symmetry breaking term,

\xi^{(v,s)}=\left(\xi^{(s,v)}\right)^{\top}+\tilde{\xi}^{(v,s)},\quad\xi^{(v,c)}=\left(\xi^{(c,v)}\right)^{\top}+\tilde{\xi}^{(v,c)}. (34)

Taking the corresponding Lagrangians for Para/Sym-Mixer, and considering the adiabatic limit, $\tau_{v}\gg\tau_{s},\tau_{c}$, the total energy with symmetry breaking terms reads:

E(x^{v})={x^{v}}^{\top}\operatorname{LayerNorm}(x^{v})-\sqrt{\sum_{i,I}(x^{v}_{iI}-\bar{v})^{2}}
-\sum_{\alpha=1}^{N_{s}}\operatorname{\Phi_{\text{GELU}}}\left(x^{s}_{\alpha}\right)+\frac{1}{2}\sum_{i,I,\alpha}\operatorname{LayerNorm}(x^{v})_{iI}\,\tilde{\xi}^{(v,s)}_{i\alpha}\operatorname{GELU}(x^{s}_{\alpha})
-\sum_{\beta=1}^{N_{c}}\operatorname{\Phi_{\text{GELU}}}\left(x^{c}_{\beta}\right)+\frac{1}{2}\sum_{i,I,\beta}\operatorname{LayerNorm}(x^{v})_{iI}\,\tilde{\xi}^{(v,c)}_{I\beta}\operatorname{GELU}(x^{c}_{\beta}). (35)

It is understood that $x^{s}$ and $x^{c}$ above are shorthands for Eq. (26). We refer to this extended energy function as the pseudo energy function. We defined the Lagrangians for the hidden neurons as $L^{s}=L^{c}=\sum_{a}\operatorname{\Phi_{\text{GELU}}}(x^{A}_{a})$, where $\operatorname{\Phi_{\text{GELU}}}$ is the primitive function of $\operatorname{GELU}$:

\operatorname{\Phi_{\text{GELU}}}(z)=\frac{1}{4}\left(z^{2}+(z^{2}-1)\operatorname{erf}\left(\frac{z}{\sqrt{2}}\right)+z\sqrt{\frac{2}{\pi}}e^{-\frac{z^{2}}{2}}\right),\quad\operatorname{\Phi_{\text{GELU}}}^{\prime}=\operatorname{GELU}. (36)
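The relation $\operatorname{\Phi_{\text{GELU}}}^{\prime}=\operatorname{GELU}$ can be verified numerically; the following short check (an illustration, not from the paper) differentiates Eq. (36) with autograd and compares it with the exact erf-based GELU.

import math
import torch

def phi_gelu(z):
    # Eq. (36)
    return 0.25 * (z ** 2
                   + (z ** 2 - 1) * torch.erf(z / math.sqrt(2))
                   + z * math.sqrt(2 / math.pi) * torch.exp(-z ** 2 / 2))

z = torch.linspace(-4.0, 4.0, 101, requires_grad=True)
grad, = torch.autograd.grad(phi_gelu(z).sum(), z)
gelu = 0.5 * z * (1 + torch.erf(z / math.sqrt(2)))      # exact GELU, Eq. (46)
print(torch.allclose(grad, gelu.detach(), atol=1e-5))   # True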

6.2 Degeneracy resolution

With these Lagrangians and the pseudo energy function, the time evolution of the visible neurons $x^{v}$ is governed by the differential equation,

\frac{dx^{v}(t)}{dt}=\xi^{(v,s)}\operatorname{GELU}\left(\xi^{(s,v)}\operatorname{LayerNorm}(x^{v}(t))\right)+\xi^{(v,c)}\operatorname{GELU}\left(\xi^{(c,v)}\operatorname{LayerNorm}(x^{v}(t))\right)-x^{v}(t), (37)

where we set $\tau_{v}=1$. In this equation, the weight matrices from the hidden neurons to the visible neurons generically involve the symmetry-breaking terms as in Eq. (34). We conduct an empirical study of the effects of symmetry breaking on associative memories by solving this dynamical equation using the torchdiffeq framework [54, 55].
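A minimal sketch of this computation with torchdiffeq is given below. It is written in the matrix form of Eq. (32) together with the decomposition of Eq. (34); the dimensions follow the example used below (Fig. 7), while the initialization scales are illustrative choices rather than the paper's settings.

import torch
import torch.nn.functional as F
from torchdiffeq import odeint

torch.manual_seed(0)
Nvs, Nvc, Ns, Nc = 4, 8, 20, 160
W1 = torch.randn(Ns, Nvs) / Nvs ** 0.5     # xi^{(s,v)} in matrix form
W3 = torch.randn(Nvc, Nc) / Nvc ** 0.5     # (xi^{(c,v)})^T in matrix form
W2t = 0.3 * torch.randn(Nvs, Ns)           # symmetry-breaking terms of Eq. (34);
W4t = 0.3 * torch.randn(Nc, Nvc)           # set both to zero for the symmetric case

def sym_layernorm(x):
    return (x - x.mean()) / torch.sqrt(((x - x.mean()) ** 2).sum())

def rhs(t, xv_flat):
    xn = sym_layernorm(xv_flat.reshape(Nvs, Nvc))
    token = (W1.T + W2t) @ F.gelu(W1 @ xn)       # xi^{(v,s)} GELU(xi^{(s,v)} LN(x^v))
    channel = F.gelu(xn @ W3) @ (W3.T + W4t)     # xi^{(v,c)} GELU(xi^{(c,v)} LN(x^v))
    return (token + channel - xv_flat.reshape(Nvs, Nvc)).reshape(-1)   # Eq. (37), tau_v = 1

xv0 = torch.randn(Nvs * Nvc)
t = torch.linspace(0.0, 20.0, 200)
trajectory = odeint(rhs, xv0, t)               # shape (200, Nvs * Nvc)

The pseudo energy of Eq. (35) can then be evaluated along the returned trajectory to produce time-evolution curves of the kind shown in Fig. 7.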

Figure 5: The pseudo energy with symmetric weights. (a), (b) Energy landscape for $N_{v}=2$.
Figure 6: (a), (b) Pseudo energy landscape for $N_{v}=2$.

Let us first take a simple example of $N_{v_{s}}=2$ and $N_{v_{c}}=1$, for which the pseudo energy function Eq. (35) can be plotted in 2D/3D. A random initialization of the weight matrices with the symmetric condition, $\tilde{\xi}^{(v,\cdot)}=0$, yields an energy landscape as shown in Figs. 5(a) and 5(b). The energy function becomes symmetric with respect to $x^{v}_{1}+x^{v}_{2}=0$, as in this simple case the Lagrangian for $x^{v}$ reduces to $L^{v}=\left\lvert x^{v}_{1}-x^{v}_{2}\right\rvert/\sqrt{2}$. The minima of this energy function form two zero modes along the $x^{v}_{1}=x^{v}_{2}$ line. Any solution trajectory in this system approaches the degenerate minima of the energy function (Fig. 5). By turning on the symmetry-breaking terms $\tilde{\xi}^{(v,\cdot)}$, the degeneracy of the attractors is resolved and solution trajectories can in turn retrieve different patterns (Fig. 6). It is noted that for the dynamical equation (37) with the symmetry-breaking terms, the pseudo energy function does not necessarily decrease along the trajectories. These observations imply that the associative memories stored by the system are degenerate for symmetric weights, leading to an extremely limited effective memorization capability.

We can gain more insights into this system by increasing the number of neurons. For $N_{v_{s}}=4$, $N_{v_{c}}=8$, $N_{s}=20$, and $N_{c}=160$, we obtain Fig. 7, although the full landscape of the pseudo energy function cannot be visualized. We observe that $N_{v_{s}}$ roughly corresponds to the number of attractors, $N_{v_{c}}$ lifts the total energy, and a large total number of neurons makes the convergence slower. These observations are natural, as $N_{v_{s}}$ corresponds to the number of tokens and $N_{v_{c}}$ to the number of channels in terms of Mixer models of deep neural networks.

Figure 7: Time evolution of the pseudo energy along solution trajectories for the cases of (a) symmetric weights, $\tilde{\xi}^{(v,\cdot)}=0$, and (b) weights involving the symmetry-breaking term, $\tilde{\xi}^{(v,\cdot)}\neq 0$. A slight increase of energy in (a) is due to $\operatorname{\Phi_{\text{GELU}}}$.

6.3 AsymMixer

With this knowledge from the Hopfield network side, we consider symmetry breaking in SymMixer by stacking the discrete version of the dynamics of the previous subsection as mixing layers. We refer to the parallelized MLP-Mixer with asymmetric weights as AsymMixer, whose mixing layers consist of a symmetry-breaking extension of Eq. (32),

W_{2}=W_{1}^{\top}+\tilde{W}_{2},\quad W_{4}=W_{3}^{\top}+\tilde{W}_{4}. (38)

To perturbatively add the symmetry breaking and study its effect on the trainability of the model, we train the AsymMixer with the following custom loss:

\mathcal{L}=\mathcal{L}_{\text{CE}}+\lambda\sum_{l=1}^{L}\sum_{a=2,4}\left\lVert\tilde{W}_{a}^{l}\right\rVert_{F}^{2}, (39)

where $\mathcal{L}_{\text{CE}}$ is the cross-entropy loss function, $L$ is the number of layers, and $\left\lVert\cdot\right\rVert_{F}$ denotes the Frobenius norm.
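A sketch of this loss in PyTorch is given below. The attribute names for the symmetry-breaking residuals (W2_tilde, W4_tilde) and the mixing_layers container are hypothetical; they stand in for however the AsymMixer blocks expose the parameters of Eq. (38).

import torch.nn.functional as F

def asym_penalty(model, lam):
    # Second term of Eq. (39): squared Frobenius norms of the symmetry-breaking parts.
    # The attribute names (mixing_layers, W2_tilde, W4_tilde) are hypothetical.
    penalty = 0.0
    for block in model.mixing_layers:
        penalty = penalty + block.W2_tilde.pow(2).sum() + block.W4_tilde.pow(2).sum()
    return lam * penalty

# Inside the training loop (sketch):
#   loss = F.cross_entropy(logits, targets) + asym_penalty(model, lam=1e-3)
#   loss.backward()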

Figure 8: Top-1 accuracy of AsymMixers compared with ParaMixer and SymMixer, trained with CIFAR-10 from scratch. (a) Performance transition of AsymMixer. (b) Performance transition with respect to the norm of weights; wt_norm and wc_norm are the norms of $\tilde{W}_{2}^{L}$ and $\tilde{W}_{4}^{L}$, respectively.

Figure 8 shows the results. We use CIFAR-10 to train the AsymMixers from scratch with $\lambda=\left\{0,10^{-r}\right\}$ for $r=0,\dots,6$. For $\lambda=1$ ($r=0$), the symmetry-breaking terms $\tilde{W}$ are essentially prevented from acquiring non-trivial entries by the penalty term in the loss function, so the trained AsymMixer reduces to SymMixer. For $\lambda=0$, in contrast, $\tilde{W}$ is unrestricted, and the model weights are trained in a manner similar to ParaMixer. Figure 8 shows reasonable trainability of AsymMixer, interpolating between the performance of SymMixer and ParaMixer. The discrepancy between AsymMixer with $\lambda=0$ and ParaMixer is due to the architectural design: for simplicity, we assume that the symmetry-breaking terms in AsymMixer are included only in the hidden-to-visible interactions. These observations are consistent with the previous subsection, indicating that symmetric weights in Mixer models support only a small number of effective associative memories, and that symmetry breaking plays a crucial role in ensuring performance.

7 Discussion

We proposed a novel correspondence between Hopfield networks and MLP-Mixer models, and identified the Para/Sym-Mixer as a stack of associative memory models. The proposed models consist of the parallelized mixing layer, which naturally includes the symmetric layer normalization module from the Hopfield network side. We demonstrated the basic properties of the models as associative memories and examined their visual recognition capabilities in the context of deep neural networks. From the numerical experiments, we observed that the symmetry conditions on interaction matrices in Hopfield networks are indeed a constraint on the performance of Mixer models. The empirical studies imply that Mixer models with symmetric weights have a highly restricted effective memorization capacity and that symmetry breaking essentially plays a crucial role in ensuring performance.

One limitation of this paper is its narrow focus on the Lagrangians in the formulation of the proposed models. The Lagrangians for hidden neurons (in other words, activation functions $g^{s}$ and $g^{c}$) are actually not restricted in the derivation of the Para/Sym-Mixer. Reconsideration of the Lagrangians for hidden neurons might provide more insights into understanding the role of mixing modules in MetaFormers. Another limitation is the lack of examination of applications to practical problems. While the performance of SymMixer significantly drops, as observed in Sec. 5.2, there could be a task domain suitable for SymMixer such as denoising or image retrieval, where Hopfield networks would inherently be effective. In addition, it is worth pursuing the quantitative or exact analysis of the memorization capacity of Energy MetaFormer (Sec. 4) and its symmetry breaking phase (AsymMixer, Sec. 6). To do so, in a similar fashion to [32, 30], it would be helpful to utilize tools developed in statistical physics. We leave these aspects of the study for future exploration.

References

  • [1] Vaswani, A., N. Shazeer, N. Parmar, et al. Attention is all you need. In Advances in Neural Information Processing Systems, vol. 30. 2017.
  • [2] Dosovitskiy, A., L. Beyer, A. Kolesnikov, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representation. 2021.
  • [3] Touvron, H., M. Cord, M. Douze, et al. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning. 2021.
  • [4] Tolstikhin, I. O., N. Houlsby, A. Kolesnikov, et al. Mlp-mixer: An all-mlp architecture for vision. Advances in Neural Information Processing Systems, 34, 2021.
  • [5] Melas-Kyriazi, L. Do you even need attention? a stack of feed-forward layers does surprisingly well on imagenet. arXiv preprint arXiv:2105.02723, 2021.
  • [6] Yu, W., M. Luo, P. Zhou, et al. Metaformer is actually what you need for vision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10819–10829. 2022.
  • [7] Yu, W., C. Si, P. Zhou, et al. Metaformer baselines for vision. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
  • [8] Hopfield, J. J. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the National Academy of Sciences, 79(8):2554–2558, 1982.
  • [9] —. Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences, 81(10):3088–3092, 1984.
  • [10] Ramsauer, H., B. Schäfl, J. Lehner, et al. Hopfield networks is all you need. In International Conference on Learning Representations. 2021.
  • [11] Widrich, M., B. Schäfl, M. Pavlović, et al. Modern hopfield networks and attention for immune repertoire classification. In Advances in Neural Information Processing Systems, vol. 33, pages 18832–18845. 2020.
  • [12] Krotov, D., J. J. Hopfield. Large associative memory problem in neurobiology and machine learning. In International Conference on Learning Representations. 2021.
  • [13] Krotov, D. Hierarchical associative memory. arXiv preprint arXiv:2107.06446, 2021.
  • [14] Tang, F., M. Kopp. A remark on a paper of krotov and hopfield [arxiv: 2008.06996]. arXiv preprint arXiv:2105.15034, 2021.
  • [15] Hoover, B., Y. Liang, B. Pham, et al. Energy transformer. Advances in Neural Information Processing Systems, 36, 2023.
  • [16] Ota, T., M. Taki. imixer: hierarchical hopfield network implies an invertible, implicit and iterative mlp-mixer. arXiv preprint arXiv:2304.13061, 2023.
  • [17] Touvron, H., P. Bojanowski, M. Caron, et al. Resmlp: Feedforward networks for image classification with data-efficient training. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
  • [18] Liu, R., Y. Li, L. Tao, et al. Are we ready for a new paradigm shift? a survey on visual deep mlp. Patterns, 3(7):100520, 2022.
  • [19] Rao, Y., W. Zhao, Z. Zhu, et al. Global filter networks for image classification. Advances in neural information processing systems, 34:980–993, 2021.
  • [20] Tatsunami, Y., M. Taki. Sequencer: Deep lstm for image classification. In Advances in Neural Information Processing Systems. 2022.
  • [21] Han, K., Y. Wang, J. Guo, et al. Vision gnn: An image is worth graph of nodes. In Advances in Neural Information Processing Systems. 2022.
  • [22] Wang, J., S. Zhang, Y. Liu, et al. Riformer: Keep your vision backbone effective but removing token mixer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
  • [23] Krotov, D., J. J. Hopfield. Dense associative memory for pattern recognition. In Advances in Neural Information Processing Systems, vol. 29. 2016.
  • [24] Demircigil, M., J. Heusel, M. Löwe, et al. On a model of associative memory with huge storage capacity. Journal of Statistical Physics, 168(2):288–299, 2017.
  • [25] Krotov, D., J. J. Hopfield. Dense associative memory is robust to adversarial inputs. Neural computation, 30(12):3151–3167, 2018.
  • [26] Millidge, B., T. Salvatori, Y. Song, et al. Universal hopfield networks: A general framework for single-shot associative memory models. In International Conference on Machine Learning, pages 15561–15583. PMLR, 2022.
  • [27] Ota, T., R. Karakida. Attention in a family of boltzmann machines emerging from modern hopfield networks. Neural Computation, 35(8):1463–1480, 2023.
  • [28] Ota, T., I. Sato, R. Kawakami, et al. Learning with partial forgetting in modern hopfield networks. In International Conference on Artificial Intelligence and Statistics, pages 6661–6673. PMLR, 2023.
  • [29] Hoover, B., H. Strobelt, D. Krotov, et al. Memory in plain sight: A survey of the uncanny resemblances between diffusion models and associative memories. In Associative Memory & Hopfield Networks in 2023. 2023.
  • [30] Yampolskaya, M., P. Mehta. Controlling the bifurcations of attractors in modern hopfield networks. In Associative Memory & Hopfield Networks in 2023. 2023.
  • [31] Ambrogioni, L. In search of dispersed memories: Generative diffusion models are associative memory networks. In Associative Memory & Hopfield Networks in 2023. 2023.
  • [32] Lucibello, C., M. Mézard. Exponential capacity of dense associative memories. Physical Review Letters, 132(7):077301, 2024.
  • [33] Krotov, D. A new frontier for hopfield networks. Nature Reviews Physics, 5(7):366–367, 2023.
  • [34] Ba, J. L., J. R. Kiros, G. E. Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
  • [35] Hoover, B., D. H. Chau, H. Strobelt, et al. A universal abstraction for hierarchical hopfield networks. In The Symbiosis of Deep Learning and Differential Equations II. 2022.
  • [36] Deng, L. The mnist database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine, 29(6):141–142, 2012.
  • [37] Kingma, D., J. Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations. 2015.
  • [38] DeepMind, I. Babuschkin, K. Baumli, et al. The DeepMind JAX Ecosystem. http://github.com/google-deepmind, 2020.
  • [39] Chowdhery, A., S. Narang, J. Devlin, et al. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113, 2023.
  • [40] Zhong, Y. D., T. Zhang, A. Chakraborty, et al. A neural ODE interpretation of transformer layers. In The Symbiosis of Deep Learning and Differential Equations II. 2022.
  • [41] Wightman, R. Pytorch image models. https://github.com/rwightman/pytorch-image-models, 2019.
  • [42] Krizhevsky, A. Learning multiple layers of features from tiny images, 2009.
  • [43] Hou, Q., Z. Jiang, L. Yuan, et al. Vision permutator: A permutable mlp-like architecture for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022.
  • [44] Loshchilov, I., F. Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations. 2019.
  • [45] Szegedy, C., V. Vanhoucke, S. Ioffe, et al. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
  • [46] Huang, G., Y. Sun, Z. Liu, et al. Deep networks with stochastic depth. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part IV 14, pages 646–661. Springer, 2016.
  • [47] DeVries, T., G. W. Taylor. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017.
  • [48] Yun, S., D. Han, S. J. Oh, et al. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.
  • [49] Zhang, H., M. Cisse, Y. N. Dauphin, et al. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations. 2018.
  • [50] Zhong, Z., L. Zheng, G. Kang, et al. Random erasing data augmentation. In Proceedings of the AAAI conference on artificial intelligence, vol. 34, pages 13001–13008. 2020.
  • [51] Cubuk, E. D., B. Zoph, J. Shlens, et al. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pages 702–703. 2020.
  • [52] Hu, S. X., S. Zagoruyko, N. Komodakis. Exploring weight symmetry in deep neural networks. Computer Vision and Image Understanding, 187:102786, 2019.
  • [53] Yang, Y., D. P. Wipf, et al. Transformers from an optimization perspective. Advances in Neural Information Processing Systems, 35:36958–36971, 2022.
  • [54] Chen, R. T. Q., Y. Rubanova, J. Bettencourt, et al. Neural ordinary differential equations. Advances in Neural Information Processing Systems, 2018.
  • [55] Chen, R. T. Q. torchdiffeq. https://github.com/rtqichen/torchdiffeq, 2018.

Appendix A Some Useful Formulae

The choice of Lagrangian $L^{A}:\mathbb{R}^{N_{A}}\to\mathbb{R}$ defines the activation function $g^{A}:\mathbb{R}^{N_{A}}\to\mathbb{R}^{N_{A}}$ such that $g^{A}=\nabla L^{A}$. These functions introduce non-linearities into the dynamics of neuron layers. We here show three examples used in the main text. For other common examples, see e.g., [35].

LayerNorm. For $x\in\mathbb{R}^{N}$, we obtain the layer normalization module from the following Lagrangian,

L_{\text{LN}}(x)=D\gamma\sqrt{\frac{1}{D}\sum_{i=1}^{N}\left(x_{i}-\bar{x}\right)^{2}+\epsilon}+\sum_{i}\delta_{i}x_{i}, (40)

where $\bar{x}=\sum_{i}x_{i}/N$, $\gamma$ and $\delta_{i}$ are learnable parameters, $\epsilon$ is a regularization constant, and $D$ is an arbitrary constant. The derivative of this Lagrangian is

g_{i}(x)=\frac{\partial L_{\text{LN}}(x)}{\partial x_{i}}=\gamma\frac{x_{i}-\bar{x}}{\sqrt{\frac{1}{D}\sum_{j}\left(x_{j}-\bar{x}\right)^{2}+\epsilon}}+\delta_{i}=\operatorname{LayerNorm}(x)_{i}. (41)

One can see that by taking $\gamma=D=1$ and $\delta_{i}=\epsilon=0$, this Lagrangian and the activation function give those discussed in the main text.
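The general formula Eq. (41), including the learnable parameters, can be verified with autograd in the same way as the parameter-free case in Sec. 3.2; a short check with illustrative values (not from the paper) follows.

import torch

torch.manual_seed(0)
N = 8
D, eps = float(N), 1e-5
gamma, delta = torch.tensor(1.3), torch.randn(N)
x = torch.randn(N, requires_grad=True)

# Eq. (40)
L_LN = D * gamma * torch.sqrt(((x - x.mean()) ** 2).sum() / D + eps) + (delta * x).sum()
grad, = torch.autograd.grad(L_LN, x)

# Eq. (41)
g = gamma * (x - x.mean()) / torch.sqrt(((x - x.mean()) ** 2).sum() / D + eps) + delta
print(torch.allclose(grad, g.detach()))   # True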

ReLU. For $x\in\mathbb{R}^{N}$, the Lagrangian

L_{\text{ReLU}}(x)=\frac{1}{2}\sum_{i=1}^{N}\max(x_{i},0)^{2} (42)

gives the ReLU activation function, which acts on $x$ element-wise:

g_{i}(x)=\frac{\partial L_{\text{ReLU}}(x)}{\partial x_{i}}=\max(x_{i},0). (43)

GELU. For $x\in\mathbb{R}^{N}$, we obtain the GELU activation function from the Lagrangian,

L_{\text{GELU}}(x)=\sum_{i=1}^{N}\operatorname{\Phi_{\text{GELU}}}(x_{i}), (44)
\operatorname{\Phi_{\text{GELU}}}(z)=\frac{1}{4}\left(z^{2}+(z^{2}-1)\operatorname{erf}\left(\frac{z}{\sqrt{2}}\right)+z\sqrt{\frac{2}{\pi}}e^{-\frac{z^{2}}{2}}\right). (45)

One finds that $\operatorname{\Phi_{\text{GELU}}}$ is a primitive function of the GELU activation function, acting on $x$ element-wise,

g_{i}(x)=\frac{\partial L_{\text{GELU}}(x)}{\partial x_{i}}=\frac{x_{i}}{2}\left(1+\operatorname{erf}\left(\frac{x_{i}}{\sqrt{2}}\right)\right). (46)

Appendix B Experimental Details

B.1 Code of parallelized mixing layer

The pseudo-code of the parallelized mixing layer is shown in Algorithm 1. As mentioned in the main text, we use PyTorch Image Models timm [41] (https://github.com/huggingface/pytorch-image-models) for the implementation of the models. The Para/Sym-Mixer models have mostly the same structure as the ordinary MLP-Mixer and no additional hyperparameters. The two main differences from the ordinary Mixer are the parallelized token- and channel-mixing modules and the symmetric layer normalization applied along both the token and channel axes.

Algorithm 1 Pseudo-code of the parallelized mixing layer, PyTorch-like code.
import torch.nn as nn
from timm.layers import DropPath, Mlp, to_2tuple  # timm >= 0.9; older versions use timm.models.layers

class pMixerBlock(nn.Module):
    def __init__(self, dim, seq_len, h_r=1, n_iter=1):
        super().__init__()
        d_t, d_c = [int(x * dim) for x in to_2tuple((0.5, 4.0))]
        # parallelized MLPs from the Hopfield/Mixer correspondence
        self.mlp_t = Mlp(seq_len, int(d_t * h_r), act_layer=nn.GELU)
        self.mlp_c = Mlp(dim, int(d_c * h_r), act_layer=nn.GELU)
        self.drop_path = DropPath(0.1)
        self.n_iter = n_iter
        # symmetric layer normalization over both token and channel axes
        self.norm = nn.LayerNorm((seq_len, dim))

    def forward(self, x):
        for _ in range(self.n_iter):
            x = (x
                 + self.drop_path(self.mlp_t(
                       self.norm(x).transpose(1, 2)).transpose(1, 2))
                 + self.drop_path(self.mlp_c(self.norm(x))))
        return x

B.2 Training details

We here report the detailed training setups commonly used for the Mixer models in Secs. 5.2 and 6.3. We basically follow the previous study [3], and also consult [43] (https://github.com/houqb/VisionPermutator) for some considerations. The number of trainable parameters of each model is shown in Table 5.

Table 4: Hyperparameters commonly used for the Mixer models for fair comparison.
Training configuration Value
# layers 8
optimizer AdamW
training epochs 300
batch size 384
base learning rate $5\times 10^{-4}$
weight decay 0.05
optimizer $\epsilon$ $10^{-8}$
optimizer momentum $\beta_{1}=0.9$, $\beta_{2}=0.999$
learning rate schedule cosine decay
lower learning rate bound $10^{-6}$
warmup epochs 20
warmup schedule linear
warmup learning rate $10^{-6}$
cooldown epochs 10
crop ratio 0.875
RandAugment (9, 0.5)
mixup $\alpha$ 0.8
cutmix $\alpha$ 1.0
random erasing 0.25
label smoothing 0.1
stochastic depth 0.1
Table 5: Number of parameters of the Mixer models.
VanillaMixer ParaMixer SymMixer AsymMixer
# params 21.2 M 19.6 M 10.8 M 19.6 M