ViR: the Vision Reservoir

Xian Wei ^∗, Bin Wang ¹, Mingsong Chen ², Ji Yuan ³, Hai Lan ⁴, Jiehuang Shi ⁵,
Xuan Tang ⁶, Bo Jin ⁷, Guozhang Chen ⁸, Dongping Yang ⁹
[email protected] , [email protected] ¹, [email protected] ², [email protected] ³,
[email protected] ⁴, [email protected] ⁵, [email protected] ⁶,
[email protected] ⁷, [email protected] ⁸, [email protected] ⁹ Corresponding Author

Abstract

The most recent year has witnessed the success of applying the Vision Transformer (ViT) for image classification. However, there are still evidences indicating that ViT often suffers following two aspects, i) the high computation and the memory burden from applying the multiple Transformer layers for pre-training on a large-scale dataset, ii) the over-fitting when training on small datasets from scratch. To address these problems, a novel method, namely, Vision Reservoir computing (ViR), is proposed here for image classification, as a parallel to ViT. By splitting each image into a sequence of tokens with a fixed length, the ViR constructs a pure reservoir with a nearly fully connected topology to replace the Transformer module in ViT. Two kinds of deep ViR models are subsequently proposed to enhance the network performance. Comparative experiments between the ViR and the ViT are carried out on several image classification benchmarks. Without any pre-training process, the ViR outperforms the ViT in terms of both model and computational complexity. Specifically, the number of parameters of the ViR is about 15% even 5% of the ViT, and the memory footprint is about $20\%\sim 40\%$ of the ViT. The superiority of the ViR performance is explained by Small-World characteristics, Lyapunov exponents, and memory capacity.

1 Introduction

Recently, Vision Transformer (ViT) [11] has been demonstrated remarkable performance across various vision tasks, such as image classification [50], instance segmentation [57], object detection and tracking [7], with the potential to replace the convolutional neural networks (CNNs) [24].

While the convolutions in a CNN excel on extracting local features of an image, the ViT is based on the self-attention mechanism that can directly compute global relationships among pixels in various small sections of the image (e.g., $15\times 16$ patches) [29, 6, 10, 24]. Furthermore, compared to CNNs, the ViT assumes minimal prior knowledge about the task at hand, while its performance surpasses CNNs on large-scale datasets with pre-training and fine-tuning, and only requires substantially fewer computational resources to train [11]. Its superiority attracts researches to further propose many variants, e.g., DeiT, T2T ViT, PVT [50, 60, 56], to handle various downstream tasks.

Refer to caption — Figure 1: Comparisons of the time consumption by performing the ViR and the ViT on the CIFAR100 dataset. The initial accuracy and the final accuracy of the ViR are higher than the ViT without pre-training. The deep ViR is the parallel structure. With the same depth, the time cost of the ViR is much less than the ViT.

However, pre-training on a large-scale dataset and fine-tuning to downstream tasks are essential for the ViT, which often lead to the exhaustive and redundant computation, and the memory burden by extra parameters.Additionally, the ViT with multiple Transformer encoder layers often suffers the over-fitting especially when the training data is limited [8].

In essence, by treating the image patches as the temporal sequence, the core innovation of the ViT is to use a kernel connection operation, e.g., a dot product, to obtain the internal dependencies among image patches, e.g., the spatial and temporal (sequential) coherence between different portions of an image [9, 7, 24].

This motivates us to consider a brain-inspired network, i.e., Reservoir Computing (RC) [32], which incorporates intrinsic spatio-temporal dynamics, with much lower consumption on computation and memory, fewer training parameters, and much fewer training samples [18, 34].

The main body of RC is a dynamical system, called reservoir [37], which consists of randomly interacting non-linear neurons and fixed weights. The non-linearity and disordered connections give rise to the complex internal dynamics of RC. Such complex dynamics are speculated to be able to facilitate information processing, with the most benefit at the so-called ’edge of chaos’ [28]. The intrinsic dynamics project inputs to the high-dimensional representation space, facilitating the following regression or classification tasks. While the recurrent connections in a reservoir are randomly generated and fixed, readout weights are optimized according to the task at hand. Therefore, RC is a powerful and physically efficient tool in time series related tasks, in terms of the low training cost, fast training speed, few parameters, suitability for hardware, and linear training schemes.

Inspired by the excellent performance of RC on time series tasks, we propose a reservoir mechanism-based model, namely, Vision Reservoir computing (ViR), to replace the ViT for image tasks. The ViR follows a similar basic pipeline with the ViT and the whole network is shown in Fig. 2. Without pre-training, the results of the ViR are close to or even better than the ViT on small-scale datasets, but with much fewer parameters and faster training speed, when having the same network depth. Usually, the ViR can achieve considerable performance with layers fewer than the number of encoders of the ViT (Fig. 1).

The main contributions of this work can be summarized into three aspects: i) As a parallel to ViT, we propose a novel model with more efficiency, namely ViR, for image classification. ii) We design a nearly fully connected topology of the internal reservoir, with small-world characteristics, chaotic dynamics, and great memory capacity. Series ViR and parallel ViR are both developed, with the latter achieving better performance. iii) ViR achieves superior performance on small-scale benchmarks than ViT, e.g., with about 15% even 5% the number of parameters of the ViT, and $20\%\sim 40\%$ the memory footprint of the ViT.

2 Related Works

RC, mainly composed of Echo State Networks (ESNs) [18, 21] and Liquid State Machines (LSMs) [34], has been successively applied on speech recognition [47], time series prediction [2], MIMO-OFDM [36], smart grids [14], etc. Physical RC [48] such as photonic RC [52] is amenable to hardware implementation using a variety of physical systems, substrates, and devices. However, few work have so far studied the behavior of such networks on computer vision.

One of the related experiments has been carried out to classify digits of the MNIST dataset using a normal ESN [43], which shows that RC can handle image data. More applications corresponding to images with RC are detailed in [25, 22, 26]. Combining RC with other popular algorithms such as CNNs or Reinforcement Learning (RL) is a valid approach to enhance the performance of networks for image tasks. Tong and Tanaka [49] took full advantage of an untrained CNN to transform raw image data into a set of small feature maps as a preprocessing step of RC and achieved a high classification accuracy with a much smaller number of trainable parameters compared with Schaetti’s work. Then, a novel practical approach, called RL with a convolutional reservoir computing (RCRC) model [5] was proposed. A fixed random-weight CNN, used for extracting image feature maps, combined with an RC model, employed for extracting the time series features, is adopted for the evolution strategy of RL, which succeeded in decreasing computational cost to a large degree. However, such studies remain restricted because they simply treat RC as an auxiliary tool but not the core for image tasks.

Based on the above issues, we propose an RC-based model specifically for image tasks, by exploring the internal topology of the reservoir. The proposed topology is originated from the Simple Cycle Reservoir (SCR), shown in [40], which performs close to the traditional ESN. Later on, the updated one-Cycle Reservoirs with Regular Jumps (CRJ) [41] has been proposed based on SCR, which obtains superior performances in most time-series tasks.

Recently, Verzelli et al. [55] made the best of the controllability matrix to explain the encoding mechanism and memory capacity of the reservoir with the topologies mentioned above, from the perspective of the mathematical characteristics of the matrix, i.e., the rank and the nullspace of the matrix. However, higher memory capacity usually does not correspond to higher precision prediction [12], and the two are generally traded off in the ”edge of chaos” state [31].

Another recent related model is the deep RC. Taking ESN as an example, the typical architecture of deep ESN can be classified into the series ESN and the parallel ESN, deriving other deep structures [30, 13, 61].

There exist few studies combining RC with transformer [53]. Shen et al. [44] randomly initialized some of the layers in transformers without updates but obtained impressive performance in natural language processing. It only draws on the idea of initializing the parameters of the reservoir, not the ”true” reservoir in essence.

3 Methodology

In the design of ViR, we first present the topology we utilized in the reservoir and show some formulations and characteristics to expound the working mechanism. Then, we describe the proposed ViR network and further present the case of the deep ViR. Finally, we give some aspects to analyze the intrinsic properties of the ViR.

3.1 Nearly Fully Connected Topology

Based on CRJ, we argue that for an image classification task it is necessary to consider: 1) a simple fixed non-random reservoir topology–nearly full connection structure with self-loop connections; 2) the weight value $\textit{$r_{i},r_{j},r_{k}$}\in(0,1]$ for reservoir connections; 3) the signs of the weights in 2) are deterministically generated by a rule with a certain degree of randomness; 4) input neurons are randomly connected to the reservoir.

Fig. 3 depicts the nearly fully connected topology. We first generate an $N\times N$ reservoir connection matrix W matching the topology, in which all the weights have the same original value $r_{k}$ and then we set:

\textbf{{W}}_{1,\textit{N}}=\textbf{{W}}_{\textit{q}+1,\textit{q}}=\textit{}{r_{i}},\qquad\textit{q}=1,...,\textit{N}-1

(1)

$r_{j}$ is the jump weight. Consider the jump size $1<\ell<\textit{N}$ . The first jump begins from neuron 1 to $1+\ell$ , then from $1+\ell$ to $2\ell+1$ , until $n\ell+1>N$ , where n is the number of jump. These bi-directional jump weights are reset to:

		$\displaystyle\textbf{{W}}_{1,\ell+1}=\textbf{{W}}_{\ell+1,2\ell+1}=...$		(2)
		$\displaystyle=\textbf{{W}}_{\ell+1,1}=\textbf{{W}}_{2\ell+1,\ell+1}=...=\textit{$r_{j}$}$		(2)

Finally, we randomly set an element, with the symmetrical element, to 0, indicating that the two neurons are disconnected with each other. Hence, W mainly consists of $r_{k}$ , and the amount of $r_{j},r_{c}$ is relatively small, with two 0 elements.

For the randomness, there exists a random matrix with elements e randomly generated in (0,1] with the same size as W. If e $<$ 0.5 (the threshold we set), then the sign of the weight in W, corresponding to the same position as e, will be $-$ , otherwise $+$ . The input weight matrix V, compared to W, is a sparse matrix of the given shape with uniformly distributed values.

For the stability of the reservoir, W is typically scaled as $\textbf{{W}}\leftarrow\alpha\textbf{{W}}/|\lambda_{max}|$ , where $|\lambda_{max}|$ is the spectral radius of W and $0<\alpha<1$ is a scaling parameter [19].

For the training process, assuming that $\textit{{u}(t)}=(u_{1}(t),...,u_{K}(t))$ means the activation of $K$ input units at time step $t$ from flattened patches, and $\textit{{x}(t)}=(x_{1}(t),...,x_{M}(t))$ represents the $M$ internal units in the reservoir with the output $\textit{{y}(t)}=(y_{1}(t),...,y_{Q}(t))$ containing $Q$ output units. The $M\times K$ matrix V stores input weights. The $M\times(K+M+Q)$ matrix $\textbf{{W}}_{out}$ shows the connections from a reservoir to output units. Additionally, all the values mentioned above are real-valued. Apart from $\textbf{W}_{out}$ , all the matrices are not trainable and their entries are initialized with a suitable distribution and then fixed. Fig. 2 shows the described structure.

The internal units are updated according to:

\textbf{{x}}(\textit{t}+1)=\textit{f }(\textbf{{Vu}}(\textit{t}+1)+\textbf{{Wx}}(\textit{t})+\textbf{{b}}(\textit{t}+1))

(3)

where f is the activation function (typically sigmoidal); b(t+1) is an optional uniform i.i.d. noise. The readout is computed by:

	$\displaystyle\textbf{{y}}(\textit{t}+1)=\textbf{{W}}_{out}$	$\displaystyle[\textbf{{u}}(\textit{t}+1);\textbf{{x}}(\textit{t}+1);\textbf{{y}}(\textit{t});$		(4)
		$\displaystyle\textbf{{u}}(\textit{t}+1)^{2};\textbf{{x}}(\textit{t}+1)^{2};\textbf{{y}}(\textit{t})^{2}]$		(4)

where ’;’ is a concatenation operation.

As shown in Eq. 4, the essence of network is to approximate output weights $\textit{{W}}_{out}$ through training samples, to obtain the predictive ability of certain tasks. Hence, while training, it is essential to collect and store the state of the reservoir x(t) and the corresponding outputs y(t) in a matrix X and Y after warm-up [19]. The classic calculation method is the Least Square Method (LSM) [33]:

\textit{{W}}_{out}=(\textit{{X}}^{T}\textit{{X}})^{-1}\textit{{X}}^{T}\textit{{Y}}

(5)

where $(\ast)^{T}$ and $(\ast)^{-1}$ mean the transpose and the inverse of matrix operator, respectively. Usually, in case X is ill-posed and irreversible, the ridge regression method [45] is utilized (Eq. 6) instead of LSM. For other optimizing algorithms refer to [35, 39].

\textit{{W}}_{out}=(\textit{k}\textit{{I}}+\textit{{X}}^{T}\textit{{X}})^{-1}\textit{{X}}^{T}\textit{{Y}}

(6)

where k is a relatively small positive number called regular coefficient, and I is an identity matrix. In our work, we use a trainable linear layer (LL) to approximate $\textit{{W}}_{out}$ and Eq. 4 can be rewritten as:

	$\displaystyle\textbf{{y}}(\textit{t}+1)=\textbf{{LL}}$	$\displaystyle[\textbf{{u}}(\textit{t}+1);\textbf{{x}}(\textit{t}+1);\textbf{{y}}(\textit{t});$		(7)
		$\displaystyle\textbf{{u}}(\textit{t}+1)^{2};\textbf{{x}}(\textit{t}+1)^{2};\textbf{{y}}(\textit{t})^{2}]$		(7)

3.2 The Vision Reservoir (ViR)

Fig. 2 depicts the proposed model for image classification, and the key component is the ViR core which is composed of a reservoir with the internal topology mentioned above and a residual block.

The typical reservoir receives a time-series sequence as input. Similar to the treatment in ViT, some pretreatments are used to convert raw images to patches as the time series input. We reshape the original image $\textbf{x}\in\mathbb{R}^{\textit{H}\times\textit{W}\times\textit{C}}$ into a series of patches $\textbf{x}_{\textit{p}}\in\mathbb{R}^{\textit{N}\times(\textit{P}^{2}\cdot\textit{C})}$ , where $(\textit{H, W})$ and $(\textit{P, P})$ respectively represent the resolution of x and $\textbf{x}_{\textit{p}}$ , and $\textit{N}=\textit{H}\textit{W}/\textit{P}^{2}$ is the number of patches, also serving as the input time steps. With a trainable linear projection E (Eq. 8), we flatten the patches and map to D dimensions to make the inputs suitable for the reservoir. We refer to the output of this projection as the patch coding. Therefore, the input at time step t+n can be written as:

\textbf{{u}}(\textit{t+n})=\textbf{x}^{n}_{\textit{p}}\textbf{{E}},\quad\textbf{{E}}\in\mathbb{R}^{(\textit{P}^{2}\cdot\textit{C})\times\textit{D}},\textit{n}=0,1,...,\textit{N}-1

(8)

Then the ViR core receives the data. It consists of the nearly fully connected reservoir (Fig. 3) and a residual block including a layer-norm (LN) operation and a feed-forward (FF) layer. The non-linear function in FF is a Gaussian Error Linear Unit (GELU) [16]. The processing of the reservoir has been shown in Eq. 3 and Eq. 4, and LN is applied to $\textbf{y}_{\textit{r}}$ (the output of the reservoir) with the feed-forward layer with a residual connection between them, shown as:

\textbf{{y}}_{\textit{c}}=\textbf{{FF}}(\textbf{{LN}}(\textbf{{y}}_{\textit{r}}))+\textbf{y}_{\textit{r}}

(9)

A classification operation is attached to $\textbf{{y}}_{\textit{c}}$ (the output of the ViR core ), which is implemented by a MLP, and we can obtain the image representation and the final classification results y by Eq. 10:

\textbf{{y}}=\textbf{{LN}}(\textbf{{y}}_{\textit{c}})

(10)

3.3 The Deep ViR

We further stack reservoirs to get the deep ViR to enhance the network performance.

The first one is the series reservoir consisting of L reservoirs depicted in Fig. 4. Similarly, the input matrices $\textbf{{{V}}}^{(l)}$ and reservoir connection matrices $\textbf{{W}}^{(l)}$ are constant and generated as mentioned above, l $\in$ $\{$ 1,2,…,L $\}$ , with similar training process described in subsection 3.1.

$\textbf{{U}}^{(l)}$ is the output matrix. Units in reservoir l updating and the output of reservoir l are given by:

\textit{$\textbf{x}^{(l)}$}(\textit{t}+1)=\textit{f }(\textit{$\textbf{V}^{(l)}$}\textit{$\textbf{y}^{(l-1)}$}(\textit{t}+1)+\textit{$\textbf{W}^{(l)}\textbf{x}^{(l)}$}(\textit{t}))

(11)

\textit{$\textbf{y}^{(l)}$}(\textit{t}+1)=\textit{$\textbf{U}^{(l)}$}[\textit{$\textbf{y}^{(l)}$}(\textit{t});\textit{$\textbf{x}^{(l)}$}(\textit{t}+1);\textit{$\textbf{y}^{(l)}$}(\textit{t})^{2};\textit{$\textbf{x}^{(l)}$}(\textit{t}+1)^{2}]

(12)

where l $\in$ $\{$ 1,2,…,L $\}$ and $\textbf{{y}}^{(0)}(\textit{t})=\textbf{{u}}(\textit{t})$ , $\textbf{{y}}^{(L)}(\textit{t})=\textbf{{y}}(\textit{t})$ .

The other one is a parallel reservoir, depicted in Fig. 4.

L reservoirs simultaneously receive the same input sequence u(t) and all L reservoirs can be trained simultaneously referring to the above process. The units in L reservoirs and the output units are updated as follows:

\textit{$\textbf{x}^{(l)}$}(\textit{t}+1)=\textit{f }(\textit{$\textbf{V}^{(l)}$}\textbf{{u}}(\textit{t}+1)+\textit{$\textbf{W}^{(l)}\textbf{x}^{(l)}$}(\textit{t}))

(13)

\textit{$\textbf{y}^{(l)}$}(\textit{t}+1)=\textit{$\textbf{U}^{(l)}$}[\textit{$\textbf{y}^{(l)}$}(\textit{t});\textit{$\textbf{x}^{(l)}$}(\textit{t}+1);\textit{$\textbf{y}^{(l)}$}(\textit{t})^{2};\textit{$\textbf{x}^{(l)}$}(\textit{t}+1)^{2}]

(14)

The final output of a parallel reservoir is the arithmetic mean of L reservoir outputs, given by:

\textit{{y}}(\textit{t})=\sum_{l=1}^{L}\textit{$\textbf{y}^{(l)}$}(\textit{t})/\textit{L}

(15)

3.4 Theoretical Analyses

Usually, we evaluate the quality of a reservoir through three indicators: the Small-World (SW) characteristics, the Largest Lyapunov Exponent (LLE), and the Memory Capacity (MC) [37]. The reservoir with SW characteristics has a stronger information processing ability and faster information dissemination speed than the common reservoir [21, 1]. LLE of a reservoir represents the dynamics and stability, and usually the value of LLE should be approximately equal to 0 for great performance. MC is the property of a reservoir to keep the previous input information, related to the data processing capability.

Small-World Characteristics: RC is a brain-inspired neural network, which mimics the connections of cerebral neurons, cortical neural connectivity has been shown to exhibit a small-world (SW) network topology [58], which has a shorter average path length and larger clustering coefficient than regular networks. In a reservoir, a small clustering degree means that dynamic information flow through the reservoir nodes is not ‘too cluttered’. Also, a small average path length can allow for the representation of a variety of dynamic time scales.

Due to the bi-directional connections, we can view the interconnection topology as an undirected graph G=(J, E), J and E representing neurons and the connections in the reservoir, respectively. This is similar to the kernel connections in the ViT. Define $\textbf{{l}}_{\textbf{{G}}}$ to be the average path length between vertex pairs in J, which is calculated by [38]:

\textbf{{l}}_{\textbf{{G}}}=\frac{2}{M(M+1)}\sum_{\begin{subarray}{c}i\geq j\end{subarray}}d_{ij}

(16)

where $d_{ij}$ is the geodesic distance from vertex i to vertex j, and M is the number of nodes (equal to the number of neurons in the reservoir). If $i=j$ , the distance is 0 [38]. Moreover, if there are no self-loop connections, Eq. 16 should be rewritten as:

\textbf{{l}}_{\textbf{{G}}}=\frac{2}{M(M-1)}\sum_{\begin{subarray}{c}i>j\end{subarray}}d_{ij}

(17)

Assuming that the neighborhood $|\textit{M}_{i}|$ of the node $\textit{v}_{i}$ is the nodes next to $\textit{v}_{i}$ , $\textit{M}_{i}$ is the number of nodes in $|\textit{M}_{i}|$ . The local clustering coefficient $\textbf{{C}}_{\textbf{{i}}}$ of $\textit{v}_{i}$ is given by a ratio, i.e., the connecting edges between nodes in the neighborhood divided by the number of possible connecting edges between them, shown in Eq. 19 [58]:

\textbf{{M}}_{\textbf{{i}}}={\textit{v}_{j}:\textit{e}_{ij}\in\textbf{{E}}\vee\textit{e}}_{ji}\in\textbf{{E}}

(18)

\textbf{{C}}_{\textbf{{i}}}=\frac{2|{\textit{e}_{jk}:\textit{v}_{j},\textit{v}_{k}\in\textbf{{M}}_{i},\textit{e}}_{jk}\in\textbf{{E}}|}{M(M-1)}

(19)

The overall clustering level $\bar{\textit{C}}$ is $\bar{\textbf{{C}}}=\frac{1}{M}\sum_{\begin{subarray}{c}i=1\end{subarray}}^{\begin{subarray}{c}M\end{subarray}}C_{i}$ .

The degree of SW, named small-worldness, is given as:

\delta=\frac{C}{C_{0}}/\frac{l_{G}}{l_{0}}

(20)

where $\textit{C}_{0}$ and $\textit{l}_{0}$ represent the average path length and clustering level of a regular network with similar size. If $\delta>1$ , the network is a SW network [17].

Lyapunov Exponent: The Largest Lyapunov Exponent (LLE) [59] represents the stability of reservoir dynamics, expounding the sensitivity of initial conditions of a system to small perturbations, defined as [51]:

\textbf{$\lambda$}=\lim_{k\to\infty}\frac{1}{k}ln(\frac{\gamma_{k}}{\gamma_{0}})

(21)

where $\gamma_{0}$ is the initial distance between the perturbed and the unperturbed trajectory, $\gamma_{k}$ is the distance at time k. For ordered dynamic systems, $\lambda<0$ and for chaotic systems $\lambda>0$ . At $\lambda\approx 0$ , a phase transition occurs (called the critical point, or edge of chaos) with the best computational capability [3, 4].

Memory Capacity: Memory Capacity (MC) [20] is another metric to measure the learning performance of our model. With receiving a random input u(t) at a time t, the reservoir is trained to generate the desired output $\textbf{y}_{d}(\textit{t})=u(t-\tau)$ . The output y is learned from u(t) which was $\tau$ steps earlier. Then the MC is given as:

\textbf{MC}_{\tau}=\frac{cov^{2}(\textbf{u}_{\tau},\textbf{y})}{v^{2}(\textbf{u}_{\tau})v^{2}(\textbf{y})},\qquad\textbf{MC}=\sum_{\tau=1}^{T}\textbf{MC}_{\tau}

(22)

where $cov^{2}(\textbf{u}_{\tau},\textbf{y})$ indicates the covariance between the true value $\textbf{u}_{\tau}$ and the predicted value y. $v^{2}(\ast)$ means the variance of a series $\ast$ . T is the maximal time delay we set.

We have experimentally demonstrated the SW characteristics, largest Lyapunov exponents, and memory capacities of our model, in Section 4.1.

4 Experiment Results

The comparative study between the proposed ViR and the common ViT model is carried out on three classical datasets, i.e. MNIST, CIFAR10, and CIFAR100[27]. We also compare the number of parameters in the ViR with the ViT and analyze the convergence speed as well as the memory footprint of our models. Further, the robustness is tested on CIFAR10-C [15].

In this section, original ViT is named ViT-Base [11] with several changes, as shown in Table 1.

Table 1: The system parameters of the ViR and the ViT. N is the number of neurons in a reservoir, and

\alpha

is a scaling parameter of the spectral radius of W; SD is the sparse degree of input matrix V. And

r_{i},r_{j},r_{k}

and jump size are detailed in subsection 3.1. In the ViT row, patch size is the same for all tested datasets.

	parameters	values
ViR	(N, $\alpha$ , SD)	(1000, 0.9, 0.05)
ViR	( $r_{i},r_{j},r_{k}$ , jump size)	(0.05, 0.5, 0.08, 137)
ViT	Hidden size	512
	MLP size	2048
	Heads	12
	Patch size	$4\times 4$

4.1 Small-Worldness, Lyapunov Exponent and Memory Capacity

Table 2 shows the normalized average path length $\frac{l_{G}}{l_{0}}$ , clustering coefficient $\frac{C}{C_{0}}$ , and small-worldness $\delta$ , $\frac{C}{C_{0}}/\frac{l_{G}}{l_{0}}$ . The baseline is the classic regular network and random network detailed in [23].

Table 2: Small-worldness of the reservoirs with 1000 neurons.

l_{0}

and

C_{0}

are the average path length and clustering level of the regular network, correspondingly. Both are usually equal to 1.

	$l_{G}/l_{0}$	$C/C_{0}$	$\delta$
Regular Network	1.00	1.00	1.00
Ours	0.98	0.99	1.02
Random Network	0.09	0.02	0.20

Our model has a lower clustering level and a smaller average path than the regular one. It still exhibits the SW property, referring to the fact that our topology has great potential for information calculation and processing. Similar to the brain [42], a small average path indicates low wiring cost for physical implementations. By applying a standard algorithm given in [46], inputs are sampled from Gaussian distribution [54] for the memory capacity and the largest Lyapunov exponents.

Fig. 5(a) shows we get the peak Memory Capacity (MC) when the spectral radius $\rho$ (W) is around 1. Also, Fig. 5(b) shows that the largest Lyapunov Exponent (LLE, defined in subsection 3.4) is close to 0 when $\rho$ (W) is around 1 $\sim$ 1.5. Meanwhile, the influences of the input scaling (IS) to MC and LLE are also shown in Fig. 5. And the results indicate that we should constrain the $\rho$ (W) and the IS to about 1 for image tasks.

4.2 Main Results with Training from Scratch

Table 3: Comparisons between ViR and the ViT models without pre-training on prevalent image classification datasets. The digital suffix means the number of the ViR layers or encoders of the ViT. ”M” is expressed as a unit symbol of one million.

	MNIST		CIFAR10		CIFAR100
	accuracy[%]	parameters[M]	accuracy[%]	parameters[M]	accuracy[%]	parameters[M]
ViT-1	98.31	3.48	64.81	3.49	35.21	3.53
ViR-1	98.92	0.60	78.03	0.61	51.41	0.62
ViT-3	98.49	10.42	74.78	10.44	43.63	10.47
Series-ViR-3	98.27	1.79	73.39	1.80	46.91	1.81
Parallel-ViR-3	98.95	1.59	78.51	1.60	52.25	1.61
ViT-6	98.54	20.84	77.68	20.85	44.14	20.88
Series-ViR-6	98.45	3.59	74.3	3.60	46.87	3.61
Parallel-ViR-6	99.05	3.19	79.13	3.20	51.73	3.21
ViT-12	98.72	41.66	79.72	41.67	45.61	41.71
Series-ViR-12	98.68	7.17	75.63	7.18	47.02	7.19
Parallel-ViR-12	99.17	6.24	80.02	6.25	51.98	6.26

Without any pre-training, we compare our models ViR-1, ViR-3, ViR-6, and ViR-12 with ViT-1, ViT-3, ViT-6, and ViT-12 by performing image classification tasks on MNIST, CIFAR10, and CIFAR100. Table 3 shows the classification accuracy and the number of parameters. It can be seen that ViR-1 has more competitive results than ViT-12 and therefore ViR models have a promising advantage in saving parameters. According to the aforementioned analysis, the reservoir with a nearly fully connected topology has great information processing capability and rich dynamics. Hence, our model is suitable for handling image tasks.

Accuracy: All the ViR models except Series-ViR perform better than ViT models with the same depths. Depths are the layers of the ViR or the encoders of the ViT. Interestingly, the shallow model ViR-1 usually rivals ViT-1, ViT-3, and ViT-12. This indicates that shallow ViR models have a competitive performance compared with the deep ViT models, but with lower time costs and fewer parameters. The great MC of the reservoir donates the high accuracy to some degree.

Convergence Speed: As depicted in Fig. 1, the initial test accuracy of the ViR is already close to the best results of the ViT. Meanwhile, the ViT-12 suffers over-fitting. The convergence speed of most ViR models is much faster than ViT. This reflects the SW characteristics and rich dynamics from LLE.

The Number of Parameters: Due to the fact that training a reservoir is limited to the output layer, $\textbf{{W}}_{out}$ , much fewer parameters are trained compared to other models. This is one of the attractive characteristics of RC. In this aspect, the ViR has the promising advantage compared with ViT.

Table 3 shows the trainable parameters of the ViR and the ViT on different datasets with different layers. The number of parameters of the ViR is about 15% of the ViT with the same depth. But the comparison between shallow ViR models and deep ViT models shows that the number of parameters of the ViR could be about 5% of the ViT while guaranteeing an acceptable accuracy.

Table 3 also indicates that, with the same depth, the series reservoir usually performs worse than the parallel one. One possible explanation is that: the series one always randomizes the input to the next reservoir, having no substantial help for tasks. However, the parallel one randomizes the inputs only once and then gets the arithmetic mean results. As a result, the enhancement of accuracy is predictable. With the increase of reservoir layers, the gains from deep structures can’t offset the influence of time and computational costs compared to shallow ones. This reveals that it is not necessary to stack too many reservoirs.

4.3 Memory Footprint

Considering the potential crash risk from memory overloading, it is necessary to evaluate and reduce its usage in a single task while ensuring acceptable results. The memory footprint of the ViR, compared with ViT models, on MNIST and CIFAR100 datasets, is shown in Fig. 6.

For small patch size, we observed that the memory footprint of the ViT models is about 2.5 to 5 times to ours, with the same depth. Although the large patch size of the ViT could reduce the memory footprint, our models (with small patch size) still have a slight superiority in such cases. It is worth noticing that, for both the series ViR and the parallel ViR the memory footprints within the optimal number of layers are nearly equal.

Table 4: Robustness for evaluating the influence of corruptions of input images. The

\textbf{CE}_{c}

value for a certain type of corruption is shown from column

2

9

, correspondingly. The CE value is the arithmetic mean corruption error of the corruptions in Noise, Blur, Weather, and Digital columns. In these cases, models are trained only on clean CIFAR10 images and tested on CIFAR10-C.

	Noise		Blur		Weather		Digital		CE
	Gaussian	Shot	Defocus	Motion	Snow	Frost	Contrast	JPEG	CE
ViT-1	1.8471	1.9345	1.8551	1.8546	1.9271	1.9468	1.9635	1.9736	1.9128
ViT-3	1.7218	1.7263	1.6837	1.7728	1.7978	1.7761	1.7726	1.7379	1.7486
ViT-6	1.6662	1.7562	1.6608	1.6907	1.6845	1.6973	1.7612	1.7261	1.7054
ViT-12	1.5557	1.7023	1.5234	1.5343	1.5310	1.6287	1.6215	1.6779	1.5969
ViR-1	1.6441	1.8079	1.6573	1.7385	1.7294	1.8717	1.7629	1.7506	1.7447
Series-ViR-3	1.8835	1.9225	1.7164	1.8763	2.0488	2.2069	2.1572	1.8542	1.9582
Series-ViR-6	1.9820	2.2625	2.0411	2.1119	2.4030	2.4629	2.3488	2.3164	2.2411
Series-ViR-12	1.8219	2.4180	2.3383	2.3281	2.8341	2.8164	2.8712	2.4978	2.4907
Parallel-ViR-3	1.6365	1.7849	1.6159	1.7034	1.7152	1.8102	1.7603	1.8061	1.7291
Parallel-ViR-6	1.5035	1.8193	1.6361	1.6801	1.7296	1.8739	1.7829	1.6950	1.7151
Parallel-ViR-12	1.4911	1.8188	1.6006	1.6858	1.7239	1.8758	1.7242	1.6739	1.6993

4.4 Robustness

We evaluate the robustness of the ViR from two aspects: i.e. corruptions of input images and perturbations of system hyperparameters.

Corruptions of input images. The input images are selected from the CIFAR10-C dataset, which consists of imposed corruptions from noise, blur, weather and digital categories on the original CIFAR10 dataset [15].

To achieve the comprehensive evaluation of the robustness to a given type of corruption, we score the classification performance across five corruption severity levels, denoted by s ( $1\leq s\leq 5$ ), and aggregate these scores according to:

\textbf{CE}_{c}=\sum_{s=1}^{5}\textbf{E}_{s,c}-\textbf{E}_{clean},\textbf{CE}=\frac{\textbf{CE}_{c}}{n}

(23)

where $\textbf{E}_{clean}$ is the top-1 error rate on clean CIFAR10 dataset; $\textbf{E}_{s,c}$ is the top-1 error rate on CIFAR10-C, c is the type of corruptions. CE is arithmetic mean of all types of $\textbf{CE}_{c}$ , and n represents the number of corruption types. It should be pointed out that, the smaller value of $\textbf{CE}_{c}$ or CE represents the stronger robustness.

Table 4 compares the robustness performance by means of $\textbf{CE}_{c}$ and CE. The results show that the number of layers within parallel-ViR has little influence on robustness. With the increasing encoding numbers of the ViT, it has increasingly robustness, which is contrary to in the cases of the series ViR model.

Perturbations of system hyperparameters. From Fig. 1, we can see that the number of layers has a smaller impact on our models, compared with the ViT, perhaps because the spectral radius of W in the ViR is constrained to be around 1 when training [37]. Moreover, the influence of IS is also reduced because of the constrained spectral radius, shown in Fig. 7. It has the same conclusion for other hyperparameters of the ViR, such as the influence of the number of neurons N, $r_{i},r_{j},r_{k}$ and the jump size. SW characteristics support our experimental results, which make RC have strong robustness. However, the hyperparameters in ViT, such as the number of heads, patch size, and MLP size etc., will significantly impact the classification accuracy. This indicates that our model has a certain degree of robustness to the perturbations of system hyperparameters.

5 Conclusion

In this paper, we proposed a novel model, namely ViR, for image classification. Similar to the ViT, the ViR treats the image as a sequence of pixel patches. The sequences are then processed by using a reservoir with a nearly fully connected topology. This simple and low-cost network model performs well, compared with the ViT models, on MNIST, CIFAR10, and CIFAR100. The ViR has small-world characteristics, chaos dynamics, and great memory capacities. The ViR exceeds the ViT models, with the learning from scratch, on image classification, with a lower memory footprint, a smaller number of parameters, and faster training speed. Encouraging by these experimental results, further study will be focused on the following three aspects, i) theoretical explanations for the working mechanism of the reservoir, ii) pre-training on large-scale datasets, iii) applying the ViR to more computer vision tasks.

References

[1] Réka Albert and Albert-László Barabási. Statistical mechanics of complex networks. Rev. Mod. Phys., 74(1):47, 2002.
[2] Witali Aswolinskiy, René Felix Reinhart, and Jochen Steil. Time series classification in reservoir-and model-space. Neural Process. Lett., 48(2):789–809, 2018.
[3] Nils Bertschinger and Thomas Natschläger. Real-time computation at the edge of chaos in recurrent neural networks. Neural Comput., 16(7):1413–1436, 2004.
[4] Joschka Boedecker, Oliver Obst, Joseph T Lizier, N Michael Mayer, and Minoru Asada. Information processing in echo state networks at the edge of chaos. Theor. Biosci., 131(3):205–213, 2012.
[5] Hanten Chang and Katsuya Futagami. Reinforcement learning with convolutional reservoir computing. Appl. Intell., 50(8):2400–2410, 2020.
[6] Sneha Chaudhari, Varun Mithal, Gungor Polatkan, and Rohan Ramanath. An attentive survey of attention models. ACM TIST, 12(5):1–32, 2021.
[7] Xin Chen, Bin Yan, Jiawen Zhu, Dong Wang, Xiaoyun Yang, and Huchuan Lu. Transformer tracking. In CVPR, pages 8126–8135, 2021.
[8] Zhengsu Chen, Lingxi Xie, Jianwei Niu, Xuefeng Liu, Longhui Wei, and Qi Tian. Visformer: The vision-friendly transformer. In ICCV, 2021.
[9] Krzysztof Marcin Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Quincy Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. In ICLR, 2020.
[10] Alana de Santana Correia and Esther Luna Colombini. Attention, please! a survey of neural attention models in deep learning. CoRR, abs/2103.16775, 2021.
[11] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[12] Igor Farkaš, Radomír Bosák, and Peter Gergel’. Computational analysis of memory capacity in echo state networks. Neural Netw., 83:109–120, 2016.
[13] Claudio Gallicchio, Alessio Micheli, and Luca Pedrelli. Deep reservoir computing: A critical experimental analysis. Neurocomputing, 268:87–99, 2017.
[14] Kian Hamedani, Lingjia Liu, Rachad Atat, Jinsong Wu, and Yang Yi. Reservoir computing meets smart grids: Attack detection using delayed feedback networks. IEEE Trans. Industr. Inform., 14(2):734–743, 2017.
[15] Dan Hendrycks and Thomas G. Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. In ICLR. OpenReview.net, 2019.
[16] Dan Hendrycks and Kevin Gimpel. Bridging nonlinearities and stochastic regularizers with gaussian error linear units. CoRR, abs/1606.08415, 2016.
[17] Mark D Humphries, Kevin Gurney, and Tony J Prescott. The brainstem reticular formation is a small-world, not scale-free, network. Proc. Royal Soc. B: Biol. Sci., 273(1585):503–511, 2006.
[18] Herbert Jaeger. The “echo state” approach to analysing and training recurrent neural networks-with an erratum note. Bonn, Germany: German National Research Center for Information Technology GMD Technical Report, 148(34):13, 2001.
[19] Herbert Jaeger. Tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the “echo state network” approach, volume 5. GMD-Forschungszentrum Informationstechnik Bonn, 2002.
[20] Herbert Jaeger et al. Short term memory in echo state networks, volume 5. GMD-Forschungszentrum Informationstechnik, 2001.
[21] Herbert Jaeger and Harald Haas. Harnessing nonlinearity: Predicting chaotic systems and saving energy in wireless communication. Science, 304(5667):78–80, 2004.
[22] Azarakhsh Jalalvand, Kris Demuynck, Wesley De Neve, and Jean-Pierre Martens. On the application of reservoir computing networks for noisy image recognition. Neurocomputing, 277:237–248, 2018.
[23] Yuji Kawai, Jihoon Park, and Minoru Asada. A small-world topology enhances the echo state property and signal propagation in reservoir computing. Neural Netw., 112:15–23, 2019.
[24] Salman H. Khan, Muzammal Naseer, Munawar Hayat, Syed Waqas Zamir, Fahad Shahbaz Khan, and Mubarak Shah. Transformers in vision: A survey. CoRR, abs/2101.01169, 2021.
[25] Denis Kleyko, Sumeer Khan, Evgeny Osipov, and Suet-Peng Yong. Modality classification of medical images with distributed representations based on cellular automata reservoir computing. In ISBI, pages 1053–1056. IEEE, 2017.
[26] Petia D. Koprinkova-Hristova. Reservoir computing approach for gray images segmentation. CoRR, abs/2107.11077, 2021.
[27] Alex Krizhevsky. Learning multiple layers of features from tiny images. 2012.
[28] Robert Legenstein and Wolfgang Maass. Edge of chaos and prediction of computational performance for neural circuit models. Neural Netw., 20(3):323–334, 2007.
[29] Yawei Li, Kai Zhang, Jiezhang Cao, Radu Timofte, and Luc Van Gool. Localvit: Bringing locality to vision transformers. CoRR, abs/2104.05707, 2021.
[30] Xuanlin Liu, Mingzhe Chen, Changchuan Yin, and Walid Saad. Analysis of memory capacity for deep echo state networks. In ICMLA, pages 443–448. IEEE, 2018.
[31] Lorenzo Livi, Filippo Maria Bianchi, and Cesare Alippi. Determination of the edge of criticality in echo state networks through fisher information maximization. IEEE Trans. Neural Netw. Learn. Syst., 29(3):706–717, 2017.
[32] Mantas Lukoševičius and Herbert Jaeger. Reservoir computing approaches to recurrent neural network training. Computer Science Review, 3(3):127–149, 2009.
[33] Mantas Lukoševičius, Herbert Jaeger, and Benjamin Schrauwen. Reservoir computing trends. KI-Künstliche Intelligenz, 26(4):365–371, 2012.
[34] Wolfgang Maass, Thomas Natschläger, and Henry Markram. Real-time computing without stable states: A new framework for neural comput based on perturbations. Neural Comput., 14(11):2531–2560, 2002.
[35] Han Min and Wang Ya-Nan. Multivariate time series online predictor with kalman filter trained reservoir. Acta Pharm. Sin., 36(1):169–173, 2010.
[36] Somayeh Susanna Mosleh, Lingjia Liu, Cenk Sahin, Yahong Rosa Zheng, and Yang Yi. Brain-inspired wireless communications: Where reservoir computing meets mimo-ofdm. IEEE Trans. Neural Netw. Learn. Syst., 29(10):4694–4708, 2017.
[37] Kohei Nakajima and Ingo Fischer. Reservoir Computing. Springer, 2021.
[38] Mark EJ Newman. The structure and function of complex networks. SIAM REV., 45(2):167–256, 2003.
[39] Mustafa C Ozturk and Jose C Principe. Computing with transiently stable states. In IJCNN, volume 3, pages 1467–1472. IEEE, 2005.
[40] Ali Rodan and Peter Tino. Minimum complexity echo state network. IEEE Trans. Neural Netw., 22(1):131–144, 2010.
[41] Ali Rodan and Peter Tiňo. Simple deterministically constructed cycle reservoirs with regular jumps. Neural Comput., 24(7):1822–1852, 2012.
[42] Mikail Rubinov, Rolf JF Ypma, Charles Watson, and Edward T Bullmore. Wiring cost and topological participation of the mouse brain connectome. PNAS, 112(32):10032–10037, 2015.
[43] Nils Schaetti, Michel Salomon, and Raphaël Couturier. Echo state networks-based reservoir computing for mnist handwritten digits recognition. In CSE and EUC and DCABES, pages 484–491. IEEE, 2016.
[44] Sheng Shen, Alexei Baevski, Ari S. Morcos, Kurt Keutzer, Michael Auli, and Douwe Kiela. Reservoir transformer. CoRR, abs/2012.15045, 2020.
[45] Zhiwei Shi and Min Han. Support vector echo-state machine for chaotic time-series prediction. IEEE Trans. Neural Netw., 18(2):359–372, 2007.
[46] Ippei Shimada and Tomomasa Nagashima. A numerical approach to ergodic problem of dissipative dynamical systems. Progress of Theoretical Physics, 61(6):1605–1616, 1979.
[47] Mark D Skowronski and John G Harris. Minimum mean squared error time series classification using an echo state network prediction model. In ISCAS, pages 4 pp.–3156. IEEE, 2006.
[48] Gouhei Tanaka, Toshiyuki Yamane, Jean Benoit Héroux, Ryosho Nakane, Naoki Kanazawa, Seiji Takeda, Hidetoshi Numata, Daiju Nakano, and Akira Hirose. Recent advances in physical reservoir computing: A review. Neural Netw., 115:100–123, 2019.
[49] Zhiqiang Tong and Gouhei Tanaka. Reservoir computing with untrained convolutional neural networks for image recognition. In ICPR, pages 1289–1294. IEEE, 2018.
[50] Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In ICML, pages 10347–10357. PMLR, 2021.
[51] Kazuya Tsuruta, Zonghuang Yang, Yoshifumi Nishio, and Akio Ushida. Small-world cellular neural networks for image processing applications. In ECCTD, volume 1, pages 225–228. Citeseer, 2003.
[52] Guy Van der Sande, Daniel Brunner, and Miguel C Soriano. Advances in photonic reservoir computing. Nanophotonics, 6(3):561–576, 2017.
[53] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NIPS, pages 5998–6008, 2017.
[54] David Verstraeten, Joni Dambre, Xavier Dutoit, and Benjamin Schrauwen. Memory versus non-linearity in reservoirs. In IJCNN, pages 1–8. IEEE, 2010.
[55] Pietro Verzelli, Cesare Alippi, Lorenzo Livi, and Peter Tiňo. Input-to-state representation in linear reservoirs dynamics. IEEE Trans. Neural Netw. Learn. Syst., 2021.
[56] Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, and Ling Shao. Pyramid vision transformer: A versatile backbone for dense prediction without convolutions. CoRR, abs/2102.12122, 2021.
[57] Yuqing Wang, Zhaoliang Xu, Xinlong Wang, Chunhua Shen, Baoshan Cheng, Hao Shen, and Huaxia Xia. End-to-end video instance segmentation with transformers. In CVPR, pages 8741–8750, 2021.
[58] Duncan J Watts and Steven H Strogatz. Collective dynamics of ‘small-world’networks. Nature, 393(6684):440–442, 1998.
[59] Alan Wolf, Jack B Swift, Harry L Swinney, and John A Vastano. Determining lyapunov exponents from a time series. Physica D: nonlinear phenomena, 16(3):285–317, 1985.
[60] Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Francis E. H. Tay, Jiashi Feng, and Shuicheng Yan. Tokens-to-token vit: Training vision transformers from scratch on imagenet. CoRR, abs/2101.11986, 2021.
[61] Zhou Zhou, Lingjia Liu, Vikram Chandrasekhar, Jianzhong Zhang, and Yang Yi. Deep reservoir computing meets 5g mimo-ofdm systems in symbol detection. In AAAI, volume 34, pages 1266–1273, 2020.