
On the Compressive Power of Boolean Threshold Autoencoders

Avraham A. Melkman, Department of Computer Science, Ben-Gurion University of the Negev; Sini Guo, Department of Mathematics, The University of Hong Kong; Wai-Ki Ching (partially supported by Hong Kong RGC GRF Grant no. 17301519, IMR and RAE Research fund from Faculty of Science, HKU), Department of Mathematics, The University of Hong Kong; Pengyu Liu, Bioinformatics Center, Institute for Chemical Research, Kyoto University; Tatsuya Akutsu (partially supported by Grant-in-Aid #18H04113 from JSPS, Japan; corresponding author, e-mail: [email protected]), Bioinformatics Center, Institute for Chemical Research, Kyoto University
Abstract

An autoencoder is a layered neural network whose structure can be viewed as consisting of an encoder, which compresses an input vector of dimension $D$ to a vector of low dimension $d$, and a decoder, which transforms the low-dimensional vector back to the original input vector (or one that is very similar). In this paper we explore the compressive power of autoencoders that are Boolean threshold networks by studying the numbers of nodes and layers that are required to ensure that each vector in a given set of distinct input binary vectors is transformed back to its original. We show that for any set of $n$ distinct vectors there exists a seven-layer autoencoder with the smallest possible middle layer (i.e., its size is logarithmic in $n$), but that there is a set of $n$ vectors for which there is no three-layer autoencoder with a middle layer of the same size. In addition we present a kind of trade-off: if a considerably larger middle layer is permissible then a five-layer autoencoder does exist. We also study encoding by itself. The results we obtain suggest that it is the decoding that constitutes the bottleneck of autoencoding. For example, there always is a three-layer Boolean threshold encoder that compresses $n$ vectors into a dimension that is reduced to twice the logarithm of $n$.

Keywords: Neural networks, Boolean functions, threshold functions, autoencoders.

1 Introduction

Artificial neural networks have been extensively studied in recent years. Among various models, autoencoders attract much attention because of their power to generate new objects such as image data. An autoencoder is a layered neural network whose structure can be viewed as consisting of two parts, an encoder and a decoder, where the former transforms an input vector to a low-dimensional vector and the latter transforms the low-dimensional vector to an output vector which should be the same as or similar to the input vector, [1, 2, 3, 4]. An autoencoder is trained in an unsupervised manner to minimize the difference between input and output data by adjusting weights (and some other parameters). In the process it learns, therefore, a mapping from high-dimensional input data to a low-dimensional representation space. Although autoencoders have a long history [1, 3], recent studies focus on variational autoencoders [5, 6, 7] because of their generative power. Autoencoders have been applied to various areas including image processing [6, 7], natural language processing [7], and drug discovery [8].

As described above, autoencoders perform dimensionality reduction, a kind of data compression. However, how data are compressed via autoencoders is not yet very clear. Of course, extensive studies have been done on the representation power of deep neural networks [9, 10, 11, 12]. Yet, to the best of the authors’ knowledge, the quantitative relationship between the compressive power and the numbers of layers and nodes in autoencoders is still unclear.

In this paper, we study the compressive power of autoencoders using a Boolean model of layered neural networks. In this model, each node takes on values that are either 1 (active) or 0 (inactive) and the activation rule for each node is given by a Boolean function. That is, we consider a Boolean network (BN) [13] as a model of a neural network. BNs have been used as a discrete model of genetic networks, for which extensive studies have been done on inference, control, and analysis [14, 15, 16, 17, 18, 19]. It should be noted that BNs have also been used as a discrete model of neural networks in which functions are restricted to be linear Boolean threshold functions whose outputs are determined by comparison of the weighted sum of input values with a threshold [20]. Such a BN is referred to as a Boolean threshold network (BTN) in this paper. Although extensive theoretical studies have been devoted to the representational power of BTNs [20], almost none considered BN models of autoencoders with the notable exception of Baldi’s study of the computational complexity of clustering via autoencoders [4].

Table 1: Summary of Results.
 | $d$ | architecture | type | constraint
Proposition 4 | $\log n$ | $D/d/D$ | Encoder/Decoder | BN
Theorem 5 | $\lceil 8\sqrt{2M}\ln n\rceil$ | $D/d$ | Encoder | BTN, $M=\#1$s in each ${\bf x}^i$
Theorem 8 | $2\lceil\log n\rceil$ | $D/d$ | Encoder | BN with parity function
Theorem 9 | $2\lceil\log n\rceil$ | $D/D^2/d$ | Encoder | BTN
Theorem 15 | $\lceil\log n\rceil$ | 4 layers ($O(\sqrt{n}+D)$ nodes) | Encoder | BTN
Theorem 19 | $2\lceil\sqrt{n}\rceil$ | $D/(\frac{d}{2}+D)/d/\frac{dD}{2}/D$ | Encoder/Decoder | BTN
Theorem 21 | $\lceil\log n\rceil$ | $D/n/d/n/D$ | Encoder/Decoder | BTN
Theorem 22 | $2\lceil\log\sqrt{n}\rceil$ | 7 layers ($O(D\sqrt{n})$ nodes) | Encoder/Decoder | BTN

The BTNs we consider in this paper are always layered ones, and we study their compressive power in two settings. The first one focuses on encoding. We are given a set of $n$ different $D$-dimensional binary vectors $X_n=\{{\bf x}^0,\ldots,{\bf x}^{n-1}\}$ and the task is to find a BTN which maps each ${\bf x}^i$ to some $d$-dimensional binary vector ${\bf f}({\bf x}^i)$ in such a way that ${\bf f}({\bf x}^i)\neq{\bf f}({\bf x}^j)$ for all $i\neq j$. Such a BTN is called a perfect encoder for $X_n$. The second one involves both encoding and decoding. In this case we are again given a set of $D$-dimensional binary vectors $X_n=\{{\bf x}^0,\ldots,{\bf x}^{n-1}\}$, and the task is to find a BTN consisting of an encoder function ${\bf f}$ and a decoder function ${\bf g}$ satisfying ${\bf g}({\bf f}({\bf x}^i))={\bf x}^i$ for all $i=0,\ldots,n-1$. Such a BTN is called a perfect autoencoder for $X_n$. It is clear from the definition that if $({\bf f},{\bf g})$ is a perfect autoencoder, then ${\bf f}$ is a perfect encoder. In this paper, we are interested in whether or not there exists a BTN for any $X_n$ under the condition that the architecture (i.e., the number of layers and the number of nodes in each layer) is fixed. This setting is reasonable because learning of an autoencoder is usually performed, for a given set of input vectors, by fixing the architecture of the BTN and adjusting the weights of the edges. Therefore, the existence of a perfect BTN for any $X_n$ implies that a BTN can be trained for any $X_n$ so that the required conditions are satisfied if an appropriate set of initial parameters is given.

The results on existence of perfect encoders and autoencoders are summarized in Table 1, in which an entry in the architecture column lists the number of nodes in each layer (from input to output). Note that $\log n$ means $\log_2 n$ throughout the paper. In addition to these positive results, we show a negative result (Theorem 16) stating that for some $X_n$ with $d=\lceil\log n\rceil$, there does not exist a perfect autoencoder consisting of three layers.

2 Problem Definitions

Our first BN model is a three-layered one, see Fig. 1; its definitions can be extended to networks with four or more layers in a straightforward way. Let ${\bf x}$, ${\bf z}$, ${\bf y}$ be binary vectors given to the first, second, and third layers, respectively. In this case, the first, second, and third layers correspond to the input, middle, and output layers, respectively. Let ${\bf f}$ and ${\bf g}$ be lists of Boolean functions assigned to nodes in the second and third layers, respectively, which means ${\bf z}={\bf f}({\bf x})$ and ${\bf y}={\bf g}({\bf z})$. Since our primary interest is in autoencoders, we assume that ${\bf x}$ and ${\bf y}$ are $D$-dimensional binary vectors and ${\bf z}$ is a $d$-dimensional binary vector with $d\leq D$. A list of functions is also referred to as a mapping. The $i$th element of a vector ${\bf x}$ will be denoted by $x_i$, which is also used to denote the node corresponding to this element. Similarly, for each mapping ${\bf f}$, $f_i$ denotes the $i$th function.

In the following, $X_n=\{{\bf x}^0,\ldots,{\bf x}^{n-1}\}$ will always denote a set of $n$ $D$-dimensional binary input vectors that are all different.

Definition 1.

A mapping ${\bf f}:\{0,1\}^D\rightarrow\{0,1\}^d$ is called a perfect encoder for $X_n$ if ${\bf f}({\bf x}^i)\neq{\bf f}({\bf x}^j)$ holds for all $i\neq j$.

Definition 2.

A pair of mappings $({\bf f},{\bf g})$ with ${\bf f}:\{0,1\}^D\rightarrow\{0,1\}^d$ and ${\bf g}:\{0,1\}^d\rightarrow\{0,1\}^D$ is called a perfect autoencoder if ${\bf g}({\bf f}({\bf x}^i))={\bf x}^i$ holds for all ${\bf x}^i\in X_n$.

Figure 1: Architecture of an autoencoder.

Fig. 1 illustrates the architecture of an autoencoder. As mentioned before, if $({\bf f},{\bf g})$ is a perfect autoencoder, then ${\bf f}$ is a perfect encoder.

Example 3.

Let $X_4=\{{\bf x}^0,{\bf x}^1,{\bf x}^2,{\bf x}^3\}$ where ${\bf x}^0=(0,0,0)$, ${\bf x}^1=(1,0,0)$, ${\bf x}^2=(1,0,1)$, ${\bf x}^3=(1,1,1)$. Let $D=3$ and $d=2$. Define ${\bf z}={\bf f}({\bf x})$ and ${\bf y}={\bf g}({\bf z})$ by

$f_0: x_0\oplus x_1,\quad f_1: x_2,\quad g_0: z_0\lor z_1,\quad g_1: \overline{z_0}\land z_1,\quad g_2: z_1.$

This pair of mappings has the following truth table, which shows it to be a perfect Boolean autoencoder.

$x_0$ | $x_1$ | $x_2$ | $z_0$ | $z_1$ | $y_0$ | $y_1$ | $y_2$
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
1 | 0 | 0 | 1 | 0 | 1 | 0 | 0
1 | 0 | 1 | 1 | 1 | 1 | 0 | 1
1 | 1 | 1 | 0 | 1 | 1 | 1 | 1
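The truth table can also be checked mechanically. The following Python sketch (ours, not part of the original construction) evaluates the mappings of Example 3 on $X_4$:

# Verify Example 3: f compresses 3-bit inputs to 2 bits, g reconstructs them.
X4 = [(0, 0, 0), (1, 0, 0), (1, 0, 1), (1, 1, 1)]

def f(x):  # encoder: z0 = x0 XOR x1, z1 = x2
    return (x[0] ^ x[1], x[2])

def g(z):  # decoder: y0 = z0 OR z1, y1 = (NOT z0) AND z1, y2 = z1
    return (z[0] | z[1], (1 - z[0]) & z[1], z[1])

for x in X4:
    assert g(f(x)) == x  # (f, g) is a perfect autoencoder for X4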
Proposition 4.

For any $X_n$, there exists a perfect Boolean autoencoder with $d=\lceil\log n\rceil$.

Proof.

The encoder maps a vector to its index, in binary representation, and the decoder maps the index back to the vector. To implement the idea formally denote by $i2b_d(j)$ the $d$-dimensional binary vector representing $j$, for $j<2^d$ and $0<d$. For example, $i2b_3(5)=(1,0,1)$.

Given $d=\lceil\log n\rceil$ define $({\bf f},{\bf g})$ by ${\bf f}({\bf x}^i)=i2b_d(i)$ and $g_j({\bf f}({\bf x}^i))=x^i_j$. Clearly, $({\bf f},{\bf g})$ is a perfect autoencoder and can be represented by Boolean functions. ∎
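The construction is easy to mimic with table lookups; here is a minimal Python sketch of it (the helper names i2b and make_autoencoder are ours), which treats the encoder and decoder as arbitrary Boolean mappings stored in dictionaries:

# Sketch of Proposition 4: encode each vector by its index in binary (i2b_d),
# decode by table lookup. Arbitrary Boolean functions are allowed here, so the
# mappings are just dictionaries.
from math import ceil, log2

def i2b(j, d):                              # d-bit binary representation of j
    return tuple((j >> (d - 1 - k)) & 1 for k in range(d))

def make_autoencoder(X):
    d = max(1, ceil(log2(len(X))))
    enc = {x: i2b(i, d) for i, x in enumerate(X)}   # f(x^i) = i2b_d(i)
    dec = {i2b(i, d): x for i, x in enumerate(X)}   # g(i2b_d(i)) = x^i
    return enc, dec, d

X = [(0, 0, 0), (1, 0, 0), (1, 0, 1), (1, 1, 1)]
enc, dec, d = make_autoencoder(X)
assert all(dec[enc[x]] == x for x in X)             # perfect autoencoder, d = ceil(log n)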

Note that there does not exist a perfect Boolean autoencoder with $d<\lceil\log n\rceil$ because $X_n$ contains $n$ different vectors. Note, furthermore, that the Proposition permits the use of arbitrary Boolean functions. In the following, we focus on what is achievable when a neural network model with Boolean threshold functions [20] is used. A function $f:\{0,1\}^h\rightarrow\{0,1\}$ is called a Boolean threshold function if it is represented as

$f({\bf x})=\begin{cases}1, & {\bf a}\cdot{\bf x}\geq\theta,\\ 0, & \mbox{otherwise,}\end{cases}$

for some $({\bf a},\theta)$, where ${\bf a}$ is an $h$-dimensional integer vector and $\theta$ is an integer. We will also denote the same function as $[{\bf a}\cdot{\bf x}\geq\theta]$. If all activation functions in a neural network are Boolean threshold functions, the network is called a Boolean threshold network (BTN). In the following sections, we will consider BTNs with $L$ layers where $L\geq 2$ (see Fig. 2). Such a BTN is represented as ${\bf y}={\bf f}^{(L-1)}({\bf f}^{(L-2)}(\cdots{\bf f}^{(1)}({\bf x})\cdots))$, where ${\bf f}^{(i)}$ is a list of activation functions for the $(i+1)$-th layer. When a BTN is used as an autoencoder, some layer is specified as the middle layer. If the $k$th layer is specified as the middle layer, the middle vector ${\bf z}$, encoder ${\bf f}$, and decoder ${\bf g}$ are defined by

${\bf z} = {\bf f}^{(k-1)}({\bf f}^{(k-2)}(\cdots{\bf f}^{(1)}({\bf x})\cdots))={\bf f}({\bf x}),$
${\bf y} = {\bf f}^{(L-1)}({\bf f}^{(L-2)}(\cdots{\bf f}^{(k)}({\bf z})\cdots))={\bf g}({\bf z}).$
Figure 2: Architecture of a five-layer BTN. In this case, the third layer corresponds to the middle layer.

3 How Easy is it to Encode?

This section is devoted to the encoding setting. First we establish, non-constructively, the existence of two encoders for $n$ $D$-dimensional vectors: a two-layer BTN that uses an output layer with $O(\sqrt{D}\log n)$ nodes, and a three-layer BTN with a hidden layer of size $D^2$ and an output layer of size $2\lceil\log_2 n\rceil$. The third result reduces the number of output nodes further to $\lceil\log n\rceil$, using a BTN of depth 4 with $O(\sqrt{n}+D)$ hidden nodes; it also lays out the architecture of this BTN. Taken together these results suggest that encoding is relatively easy by itself.

Theorem 5.

Given $X_n$ there exists a perfect encoder that is a two-layer BTN with at most $\lceil 8\sqrt{2M}\ln n\rceil$ output nodes, where $M$ is the maximum number of 1's that any two vectors in $X_n$ have in common.

Proof.

We use the probabilistic method to establish the existence of a set $W$ of $d$ $D$-dimensional weight vectors ${\bf w}^j\in\{-1,1\}^D$ with the property that the two-layer BTN whose threshold functions are $f_j({\bf x})=[{\bf x}\cdot{\bf w}^j\geq 1]$, $j=0,\ldots,d-1$, yields output vectors that are all different. Denote by ${\bf y}^i$ the output vector for ${\bf x}^i$, with $y^i_j=f_j({\bf x}^i)$.

Consider a random $W$ such that each $w^j_i$ has the value $-1$ or 1 with probability $\frac{1}{2}$. Let us compute a lower bound on the probability that a given pair of output vectors differs in a specific coordinate. Without loss of generality (w.l.o.g.) we take the vectors to be ${\bf y}^0$ and ${\bf y}^1$, and look at the $\ell$-th coordinate. Because this probability does not depend on the coordinate, since $W$ was chosen at random, let us denote it by $p_{\{0,1\}}$. Denote by $m_i$ the number of 1's in ${\bf x}^i$, and by $m_{\{0,1\}}$ the number of coordinates with value 1 for both ${\bf x}^0$ and ${\bf x}^1$. W.l.o.g. $x^0_0=x^1_0=\cdots=x^0_{m_{\{0,1\}}-1}=x^1_{m_{\{0,1\}}-1}=1$, $x^0_{m_{\{0,1\}}}=\cdots=x^0_{m_0-1}=x^1_{m_0}=\cdots=x^1_{m_0+m_1-m_{\{0,1\}}-1}=1$, all other coordinates being 0.

Consequently $y^0_\ell=[\sum_{j=0}^{m_0-1}w^\ell_j\geq 0]$, and $y^1_\ell=[\sum_{j=0}^{m_{\{0,1\}}-1}w^\ell_j+\sum_{j=m_0}^{m_0+m_1-m_{\{0,1\}}-1}w^\ell_j\geq 0]$.

Suppose $m_{\{0,1\}}$ is even. Then a lower bound on $p_{\{0,1\}}$ is $p_{\{0,1\}}\geq Prob(\sum_{j=0}^{m_{\{0,1\}}-1}w^\ell_j=0)\cdot Q$, where $Q$ stands for

$Prob(\sum_{j=m_{\{0,1\}}}^{m_0-1}w^\ell_j\geq 0)\cdot Prob(\sum_{j=m_0}^{m_0+m_1-m_{\{0,1\}}-1}w^\ell_j<0)+Prob(\sum_{j=m_{\{0,1\}}}^{m_0-1}w^\ell_j<0)\cdot Prob(\sum_{j=m_0}^{m_0+m_1-m_{\{0,1\}}-1}w^\ell_j\geq 0).$  (2)

By the soon-to-be-stated Lemma 7, $Prob(\sum_{j=0}^{m_{\{0,1\}}-1}w^\ell_j=0)\geq\frac{1}{\sqrt{2m_{\{0,1\}}}}$. Turning to the estimation of $Q$, Lemma 7 implies that if one of $m_0-m_{\{0,1\}}$, $m_1-m_{\{0,1\}}$ is odd then $Q=\frac{1}{2}$. In the remaining case that both $r_0=m_0-m_{\{0,1\}}$ and $r_1=m_1-m_{\{0,1\}}$ are even

$Q=\frac{1}{2}-\binom{r_0}{\frac{r_0}{2}}\binom{r_1}{\frac{r_1}{2}}\frac{1}{2^{r_0+r_1+1}}.$

This expression assumes its smallest value, for $r_0+r_1\geq 1$, when one of $r_0,r_1$ is 0 and the other is 2, in which case its value is $\frac{1}{4}$. We conclude, therefore, that if the number of 1's that ${\bf x}^0,{\bf x}^1$ have in common is even then $Q\geq\frac{1}{4}$ and $p_{\{0,1\}}\geq\frac{1}{4\sqrt{2m_{\{0,1\}}}}\geq\frac{1}{4\sqrt{2D}}$.

A similar computation for the case that $m_{\{0,1\}}$ is odd shows that if the number of 1's that ${\bf x}^0,{\bf x}^1$ have in common is odd then $p_{\{0,1\}}\geq\frac{1}{4\sqrt{m_{\{0,1\}}}}$.

In summary, the probability that a specific pair of output vectors differs at a specific coordinate is at least

$p=\min_{0\leq i<j\leq n-1}p_{\{i,j\}}\geq\frac{1}{4\sqrt{2M}},$

where $M=\max_{0\leq i<j\leq n-1}m_{\{i,j\}}$.

Consequently the probability that all coordinates of two $d$-dimensional output vectors are identical is at most $(1-\frac{1}{4\sqrt{2M}})^d$. To ensure that the probability that some pair of vectors ${\bf y}^{i_0},{\bf y}^{i_1}$ (out of the $\binom{n}{2}$ pairs) is identical is less than 1 it is sufficient to choose $d$ such that

$\binom{n}{2}\left(1-\frac{1}{4\sqrt{2M}}\right)^d<1,$

i.e. we require $d\geq 8\sqrt{2M}\ln n$. ∎
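The probabilistic argument can be illustrated empirically. The following Python sketch (our illustration; the data, seed, and helper names are arbitrary) draws random $\pm 1$ weight vectors and checks that the resulting two-layer BTN maps a random set of distinct binary vectors to distinct codes:

import math
import random

random.seed(0)
D, n = 30, 50
X = set()
while len(X) < n:                          # n distinct D-dimensional binary vectors
    X.add(tuple(random.randint(0, 1) for _ in range(D)))
X = list(X)

M = max(sum(a & b for a, b in zip(x, y))   # max number of common 1's over all pairs
        for i, x in enumerate(X) for y in X[:i])
d = math.ceil(8 * math.sqrt(2 * M) * math.log(n))   # bound from Theorem 5

W = [[random.choice((-1, 1)) for _ in range(D)] for _ in range(d)]

def encode(x):                             # f_j(x) = [x . w^j >= 1]
    return tuple(int(sum(xi * wi for xi, wi in zip(x, w)) >= 1) for w in W)

codes = {encode(x) for x in X}
print(len(codes) == len(X))                # True with high probability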

Remark 6.

In many applications the feature vectors are sparse so that $M$ is small and the bound given in Theorem 5 can be lowered. Consider for example the case that the number of 1s is $D^\alpha$, $\alpha<1$. Then it is sufficient to set $d\geq 8\sqrt{2}D^{\frac{\alpha}{2}}\ln n$.

Here is the statement of the Lemma used in the foregoing proof.

Lemma 7.

Let $w_j\in\{-1,1\}$, $j=0,\ldots,m-1$, be independent random variables taking each value with probability $\frac{1}{2}$, and let $w=\sum_{j=0}^{m-1}w_j$.

If $m$ is even the following hold.

  1. $Prob(w=0)=\binom{m}{m/2}\frac{1}{2^m}$. This implies, using Stirling's approximation, $Prob(w=0)\geq\frac{1}{\sqrt{2m}}$.

  2. $Prob(w\geq 0)=\frac{1}{2}(1+Prob(w=0))$.

  3. $Prob(w\geq 2)=Prob(w<0)=\frac{1}{2}(1-Prob(w=0))$.

If $m$ is odd the following hold.

  1. $Prob(w=1)=Prob(w=-1)=\binom{m}{(m-1)/2}\frac{1}{2^m}$. This implies $Prob(w=1)\geq\frac{1}{2\sqrt{m}}$.

  2. $Prob(w\geq 0)=Prob(w<0)=\frac{1}{2}$.

Proof.

It suffices to note that if $m$ is even (odd) then $w$ is even (odd), and that $Prob(w<0)=Prob(w>0)$. ∎

Next we show that the number of output nodes needed to encode the $n$ vectors can be reduced to $2\lceil\log_2 n\rceil$ by adding a layer of $D^2$ hidden nodes to the BTN. To prove this we first present a result of independent interest, that uses the power of parity functions.

Theorem 8.

Given $X_n$ there is a two-layer network whose activation functions are parity functions that maps $X_n$ to $n$ different vectors of dimension $2\lceil\log_2 n\rceil$.

Proof.

We again use the probabilistic method to establish the existence of a network with $d$ output nodes, provided $d\geq 2\log_2 n$. Equip each output node $y_k$, $k=0,\ldots,d-1$ with an activation function $f_k$ constructed as follows: assemble a set of bits, $S_k$, by adding each one of the $D$ input variables to the set with probability $\frac{1}{2}$, and set $f_k({\bf x})$ to be the parity of the set of values that ${\bf x}$ assigns to the variables in $S_k$.

Consider an arbitrary fixed output node $y_k$. Given a pair of input vectors $({\bf x}^i,{\bf x}^j)$, define

$B^i_{i,j}=\{\ell\mid x^i_\ell=1,\ x^j_\ell=0\},\quad B^j_{i,j}=\{\ell\mid x^i_\ell=0,\ x^j_\ell=1\}.$

Then $f_k({\bf x}^i)=f_k({\bf x}^j)$ if and only if $|B^i_{i,j}\cap S_k|$ and $|B^j_{i,j}\cap S_k|$ are either both odd or both even.

If $|B^i_{i,j}|>0$, then $Prob(|B^i_{i,j}\cap S_k|\mbox{ is even})=Prob(|B^i_{i,j}\cap S_k|\mbox{ is odd})=\frac{1}{2}$, because $S_k$ is randomly selected. Therefore, if $|B^i_{i,j}|>0$ and $|B^j_{i,j}|>0$, then the probability that $f_k({\bf x}^i)=f_k({\bf x}^j)$ is $\frac{1}{2}$. If $|B^i_{i,j}|=0$ then $|B^j_{i,j}|>0$, since ${\bf x}^i\neq{\bf x}^j$, and $Prob(f_k({\bf x}^i)=f_k({\bf x}^j))=Prob(|B^j_{i,j}\cap S_k|\mbox{ is even})=\frac{1}{2}$.

We conclude that $Prob(f_k({\bf x}^i)=f_k({\bf x}^j))=\frac{1}{2}$, and hence that $Prob({\bf y}^i={\bf y}^j)=(\frac{1}{2})^d$. Therefore the probability that there is some pair ${\bf x}^i,{\bf x}^j$ for which ${\bf y}^i={\bf y}^j$ is smaller than $n^2(\frac{1}{2})^d$, which is less than 1 if $d\geq 2\log_2 n$. ∎
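A quick Monte-Carlo illustration of this argument (ours; the data and seed are arbitrary) draws a random subset $S_k$ per output bit and checks that $d=2\lceil\log_2 n\rceil$ parity bits already separate a random set of distinct vectors:

import math
import random

random.seed(1)
D, n = 20, 100
X = set()
while len(X) < n:                    # n distinct D-dimensional binary vectors
    X.add(tuple(random.randint(0, 1) for _ in range(D)))
X = list(X)

d = 2 * math.ceil(math.log2(n))
S = [[j for j in range(D) if random.random() < 0.5] for _ in range(d)]  # random subsets S_k

def encode(x):                       # f_k(x) = parity of the bits of x indexed by S_k
    return tuple(sum(x[j] for j in S[k]) % 2 for k in range(d))

codes = {encode(x) for x in X}
print(len(codes) == len(X))          # True with high probability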

Theorem 9.

Given $n$ different binary vectors of dimension $D$, there is a three-layer BTN whose activation functions are threshold functions that maps these vectors to $n$ different vectors of dimension $2\lceil\log_2 n\rceil$ using $D^2$ hidden nodes.

Proof.

It is well-known that a parity function on $D$ nodes can be implemented by threshold functions while using only one additional layer of at most $D$ nodes. A simple way of doing this is to use a hidden layer with $D$ nodes, $h_0,\ldots,h_{D-1}$, where node $h_i$ has value 1 if the input contains at least $i+1$ 1's, i.e. $h_i({\bf x})=[{\bf e}\cdot{\bf x}\geq i+1]$, where ${\bf e}=(1,\ldots,1)$. The output node that computes the parity of the input has the activation function $[\sum_{i=0}^{D-1}(-1)^i h_i\geq 1]$. The Theorem follows therefore from Theorem 8 when each of the parity functions it uses is implemented this way. ∎
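For concreteness, here is a small Python sketch (ours) of this parity-by-thresholds construction, checked against the true parity on all 4-bit inputs:

# Parity via thresholds: hidden node h_i fires iff the input contains at least
# i+1 ones, and the output thresholds the alternating sum h_0 - h_1 + h_2 - ...
def parity_btn(x):
    D = len(x)
    h = [int(sum(x) >= i + 1) for i in range(D)]          # h_i(x) = [e.x >= i+1]
    return int(sum((-1) ** i * h[i] for i in range(D)) >= 1)

for bits in range(16):                                    # check against true parity
    x = [(bits >> k) & 1 for k in range(4)]
    assert parity_btn(x) == sum(x) % 2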

We turn now to results that also provide constructions. In one form or another all use the idea of having the encoder map a vector ${\bf x}^i$ to its index, as done in Proposition 4. That Proposition permits the use of any Boolean mapping, so that the meaning of the index of the vector is of no importance. Here, however, only mappings arising from BTNs are employed, and the meaning of the index, as provided by the following simple observation, is going to be of central importance.

Lemma 10.

cf. [12]. Given $X_n$ there is a vector ${\bf a}$ such that ${\bf a}\cdot{\bf x}^i\neq{\bf a}\cdot{\bf x}^j$ if $i\neq j$. Furthermore, this vector can be assumed to have integer coordinates.

The reason is that the vectors in $X_n$ are all different. We fix ${\bf a}$, and assume henceforth that $X_n$ is sorted,

$b_0<\cdots<b_{n-1},\quad b_i={\bf a}\cdot{\bf x}^i.$  (3)

When a vector ${\bf x}^i$ is mapped to its index, $i$ will be represented either in binary, or by the $s$-dimensional step-vector ${\bf h}^i[s]$, for some $s$, defined by

${\bf h}^i[s]_j=1\mbox{ if }0\leq j\leq i,\quad 0\mbox{ if }i<j<s.$  (4)

Our first result in this direction exemplifies encoding with step-vectors.

Theorem 11.

Given $X_n$ there is a two-layer BTN that maps ${\bf x}^i$ to ${\bf h}^i[n]$, $i=0,\ldots,n-1$.

Proof.

Let the vector of threshold functions for the $n$ nodes in the output layer be $G_j({\bf x})=[{\bf a}\cdot{\bf x}\geq b_j]$. Clearly $G_j({\bf x}^i)=[{\bf a}\cdot{\bf x}^i\geq b_j]$ has value 1 if and only if $j\leq i$, i.e. ${\bf G}({\bf x}^i)={\bf h}^i[n]$. ∎
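A compact Python sketch of Lemma 10 together with Theorem 11 (ours; we take ${\bf a}$ to be the vector of powers of two, which is one valid choice of separating vector for any set of distinct binary vectors):

def step_vector_encoder(X):
    D = len(X[0])
    a = [2 ** j for j in range(D)]                    # a.x distinct for distinct binary x
    b = sorted(sum(aj * xj for aj, xj in zip(a, x)) for x in X)   # equation (3)
    def G(x):
        v = sum(aj * xj for aj, xj in zip(a, x))
        return tuple(int(v >= bj) for bj in b)        # = h^i[n] when v = b_i
    return G

X = [(0, 1, 1), (1, 0, 0), (1, 1, 0), (0, 0, 1)]
G = step_vector_encoder(X)
assert len({G(x) for x in X}) == len(X)               # the outputs are the n distinct step-vectors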

We show next that the number of output nodes of the encoder can be reduced to $2\lceil\sqrt{n}\rceil$ if a hidden layer with only $\lceil\sqrt{n}\rceil+D$ nodes is added. Henceforward we denote $r=\lceil\sqrt{n}\rceil$.

Theorem 12.

Given $X_n$ satisfying equation (3), there is a three-layer network that maps ${\bf x}^i$ to $({\bf h}^k[r],{\bf h}^\ell[r])$ where $k$ and $\ell$ are such that $i=kr+\ell$, with $0\leq\ell<r$. The network has $D$ input nodes $x_j$, $j=0,\ldots,D-1$, $r+D$ hidden nodes $\alpha^1_i$, $i=0,\ldots,r-1$, and $\alpha^2_j$, $j=0,\ldots,D-1$, and $2r$ output nodes $\beta^1_i,\beta^2_i$, $i=0,\ldots,r-1$.

Proof.

For ease of exposition we assume that $\log n=2m$, and $r=\sqrt{n}=2^m$.

The nodes $\alpha^2_j$ simply copy the input, $\alpha^2_j=x_j$, $j=0,\ldots,D-1$. To define the activation functions of $\alpha^1_i$ divide the interval $[b_0,b_{n-1}]$ into $r$ consecutive subintervals $[s_0,s_1-1],[s_1,s_2-1],\ldots,[s_{r-1},b_{n-1}]$ each containing $r$ values $b_i$, i.e. $s_0=b_0$, $s_1=b_r$, $\ldots$, $s_{r-1}=b_{(r-1)r}$, and $s_r=b_{n-1}+1$. The activation function of $\alpha^1_i$ is $[{\bf a}\cdot{\bf x}\geq s_i]$, $0\leq i\leq r-1$. Clearly, if the input is ${\bf x}^h$ with $h=kr+\ell$ then $\alpha^1_i=1$ if and only if $i\leq k$, i.e. $\boldsymbol{\alpha}^1={\bf h}^k[r]$.

The output node $\beta^1_i$ copies $\alpha^1_i$, $0\leq i\leq r-1$. The remaining output nodes, $\boldsymbol{\beta}^2$, represent the ordinal number of ${\bf a}\cdot{\bf x}$ within the $k$-th subinterval. To implement this part of the mapping we follow [21] in employing the ingenious technique of telescopic sums, introduced by [22], and define

ti=bjα0\displaystyle t_{i}=b_{j}\alpha_{0} +(br+ibi)α1+\displaystyle+(b_{r+i}-b_{i})\alpha_{1}+\ldots
\displaystyle\ldots +(b(r1)r+ib(r2)r+i)αr1,i=0,,r1.\displaystyle+(b_{(r-1)r+i}-b_{(r-2)r+i})\alpha_{r-1},i=0,\ldots,r-1.

The important thing to note here is that if ${\bf a}\cdot{\bf x}$ falls in the $k$-th subinterval, so that $s_k\leq{\bf a}\cdot{\bf x}<s_{k+1}$, then $t_i$ has the value $b_{kr+i}$, because $\alpha_0=\cdots=\alpha_k=1$, while $\alpha_i=0$, $i>k$.

The activation function of $\beta^2_i$ is $[{\bf a}\cdot{\bf x}-t_i\geq 0]$, $i=0,\ldots,r-1$. Observe that if ${\bf a}\cdot{\bf x}=b_{kr+\ell}$, then $\beta^2_i=[b_{kr+\ell}-b_{kr+i}\geq 0]$, i.e. $\beta^2_i=1$ if $i\leq\ell$, and $\beta^2_i=0$ if $i>\ell$. ∎
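The whole construction can be simulated arithmetically; the following Python sketch (ours, assuming for simplicity that $n=r^2$ and again using powers of two for ${\bf a}$) computes $\boldsymbol{\alpha}^1={\bf h}^k[r]$, the telescopic sums $t_i$, and $\boldsymbol{\beta}^2={\bf h}^\ell[r]$:

import itertools

def theorem12_encoder(X):
    r = int(round(len(X) ** 0.5))
    assert r * r == len(X)                          # simplifying assumption: n = r*r
    D = len(X[0])
    a = [2 ** j for j in range(D)]                  # separating vector (Lemma 10)
    dot = lambda x: sum(aj * xj for aj, xj in zip(a, x))
    b = sorted(dot(x) for x in X)                   # equation (3)
    s = [b[k * r] for k in range(r)]                # left endpoints of the r subintervals
    def encode(x):
        v = dot(x)
        alpha1 = [int(v >= s[k]) for k in range(r)]                 # = h^k[r]
        t = [sum((b[m * r + i] - (b[(m - 1) * r + i] if m else 0)) * alpha1[m]
                 for m in range(r)) for i in range(r)]              # telescopic sums, t_i = b_{kr+i}
        beta2 = [int(v - t[i] >= 0) for i in range(r)]              # = h^l[r]
        return tuple(alpha1), tuple(beta2)
    return encode

X = list(itertools.product((0, 1), repeat=4))       # n = 16, r = 4
enc = theorem12_encoder(X)
print(len({enc(x) for x in X}) == len(X))           # all codes distinct -> True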

It is relatively easy to convert the BTNs constructed in Theorems 11 and 12 into BTNs that map $X_n$ to a set of vectors with the minimum possible dimension, $\lceil\log n\rceil$, by adding a layer of that size. The technique for doing so is encapsulated in the following Lemma. Its formulation is slightly more general than strictly needed here in order to serve another purpose later on.

Lemma 13.

Given a set of $s\leq 2^d$ different $d$-dimensional binary vectors $Z=\{{\bf z}^i,\ i=0,\ldots,s-1\}$ there is a two-layer BTN that maps ${\bf h}^i[s]$ to ${\bf z}^i$.

Proof.

To construct the vector of threshold functions for the output nodes, ${\bf H}({\bf h}[s])$, set $w^j_0=z^0_j$, $j=0,\ldots,d-1$,

$w^j_i=z^i_j-z^{i-1}_j,\quad i=1,\ldots,s-1,\ j=0,\ldots,d-1,$

and $H_j({\bf h}[s])=[{\bf w}^j\cdot{\bf h}[s]\geq 1]$. It is easily verified that $H_j({\bf h}^i[s])=[z^i_j\geq 1]=z^i_j$, i.e. ${\bf H}({\bf h}^i[s])={\bf z}^i$ as desired. ∎
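In code the telescoping is immediate; the following Python sketch (ours) builds the weights ${\bf w}^j$ and checks the Lemma on a small example:

# Lemma 13 sketch: with w^j_0 = z^0_j and w^j_i = z^i_j - z^{i-1}_j, the
# threshold function H_j(h) = [w^j . h >= 1] maps h^i[s] to z^i_j, because the
# dot product telescopes to z^i_j.
def lemma13_decoder(Z):
    s, d = len(Z), len(Z[0])
    W = [[Z[0][j] if i == 0 else Z[i][j] - Z[i - 1][j] for i in range(s)]
         for j in range(d)]                              # W[j] = w^j
    def H(h):
        return tuple(int(sum(wi * hi for wi, hi in zip(W[j], h)) >= 1)
                     for j in range(d))
    return H

def step(i, s):                                          # h^i[s]
    return tuple(int(j <= i) for j in range(s))

Z = [(0, 1, 1), (1, 0, 0), (1, 1, 0), (0, 0, 1)]
H = lemma13_decoder(Z)
assert all(H(step(i, len(Z))) == Z[i] for i in range(len(Z)))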

Consider the layer constructed in the proof of the Lemma when ${\bf z}^i$ is the $\lceil\log n\rceil$-dimensional binary representation of $i$ and $s=n$. By adding this layer to the BTN constructed in Theorem 11 we get the following result.

Theorem 14.

Given $X_n$ there is a three-layer BTN with $n$ nodes in the second layer that maps ${\bf x}^i$ to the $\lceil\log n\rceil$-dimensional binary representation of $i$, $i=0,\ldots,n-1$.

Similarly, we can add to the BTN constructed in Theorem 12 a layer that consists of a first part and a second part, both constructed according to the Lemma. The first part connects the nodes $\boldsymbol{\beta}^1$ to a $\lceil\log r\rceil$-dimensional binary counter $\boldsymbol{\gamma}^1$, and the second part connects the nodes $\boldsymbol{\beta}^2$ to another $\lceil\log r\rceil$-dimensional binary counter $\boldsymbol{\gamma}^2$. Thus, on input ${\bf x}^i$ the nodes $\boldsymbol{\beta}^1$ contain ${\bf h}^k[r]$, which is mapped to the $\lceil\log r\rceil$-dimensional binary representation of $k$ in $\boldsymbol{\gamma}^1$, and the nodes $\boldsymbol{\beta}^2$ contain ${\bf h}^\ell[r]$, which is mapped to the $\lceil\log r\rceil$-dimensional binary representation of $\ell$ in $\boldsymbol{\gamma}^2$. Recalling that $i=kr+\ell$, with $0\leq\ell<r$, it follows that, in this case, $\boldsymbol{\gamma}^1$ and $\boldsymbol{\gamma}^2$ taken together contain the $\lceil\log n\rceil$-dimensional binary representation of $i$. This construction establishes the following Theorem.

Theorem 15.

Given $X_n$ there is a four-layer BTN with $3\lceil\sqrt{n}\rceil+D$ hidden nodes that maps ${\bf x}^i$ to the $\lceil\log n\rceil$-dimensional binary representation of $i$, $i=0,\ldots,n-1$.

4 Difficulty of Decoding

Theorem 16.

For each $d$ there exists a set of $n=2^d$ different $N$-dimensional bit-vectors $Y=\{{\bf y}^i\}$ with the following properties:

  1. There exists a 2-layer BTN that maps $Y$ to $\{0,1\}^d$.

  2. There does not exist a 2-layer BTN that maps $\{0,1\}^d$ to $Y$.

Proof.

Set $N=\binom{n}{2}$ and construct $Y$ as follows:

  1. index the coordinates of the vectors by $(i,j)$, $0\leq i<j\leq n-1$;

  2. set $y^k_{(i,j)}=1$ if and only if $i$ or $j$ is $k$.

A BTN that encodes these vectors with $d$ bits is constructed as follows. The most significant bit is generated by the threshold function

$\sum_{i=n/4}^{n/2-1}y_{(2i,2i+1)}\geq 1.$

The next most significant bit is generated by

$\sum_{i=3n/8}^{n/2-1}y_{(2i,2i+1)}+\sum_{i=n/8}^{n/4-1}y_{(2i,2i+1)}\geq 1.$

The successive digits are generated in similar fashion, each time using a sum of $\frac{n}{4}$ input bits, until at last the least significant bit is generated by the slightly different threshold function

$y_{(1,3)}+y_{(5,7)}+\cdots+y_{(n-3,n-1)}\geq 1.$

It is not difficult to verify that this 2-layer BTN maps ${\bf y}^i$ to the $d$-bit representation of $i$.

To illustrate, consider the case $d=3$, $n=8$, $N=28$. These vectors are encoded by the threshold functions $d_0$ (most significant digit), $d_1$, and $d_2$ (least significant digit):

$d_0({\bf y})=[y_{(4,5)}+y_{(6,7)}\geq 1],$
$d_1({\bf y})=[y_{(2,3)}+y_{(6,7)}\geq 1],$
$d_2({\bf y})=[y_{(1,3)}+y_{(5,7)}\geq 1].$

For example, since in ${\bf y}^7$ the coordinates $(0,7),\ldots,(6,7)$ are 1 it has $d_0({\bf y}^7)=d_1({\bf y}^7)=d_2({\bf y}^7)=1$, i.e. its representation is $(1,1,1)$. Similarly, since $y^1_{(6,7)}=y^1_{(4,5)}=y^1_{(2,3)}=0$ whereas $y^1_{(1,3)}=1$, $d_0({\bf y}^1)=d_1({\bf y}^1)=0$, $d_2({\bf y}^1)=1$, so that ${\bf y}^1$ has the representation $(0,0,1)$.

To prove the second assertion, suppose to the contrary that there does exist a BTN mapping $\{0,1\}^d$ to $Y$, and that it maps $(0,\ldots,0)$ to ${\bf y}^k$ and $(1,\ldots,1)$ to ${\bf y}^\ell$. Consider the threshold function for the bit $x_{(k,\ell)}$ of the output layer. Since by construction $y^k_{(k,\ell)}=y^\ell_{(k,\ell)}=1$ while $y^i_{(k,\ell)}=0$ for all $i\neq k,\ell$, it follows that the threshold function for this bit separates the two $d$-dimensional vectors $(0,\ldots,0)$ and $(1,\ldots,1)$ from the remainder of $\{0,1\}^d$, a contradiction to the well-known fact that such a threshold function does not exist. ∎
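The first assertion is easy to check mechanically for the illustrative case $d=3$; the Python sketch below (ours) builds the 28-dimensional vectors ${\bf y}^k$ and applies the three threshold functions:

# Theorem 16 construction for d = 3, n = 8, N = 28: y^k has a 1 exactly in the
# coordinates (i, j) with i = k or j = k; d_0, d_1, d_2 recover the bits of k.
from itertools import combinations

n = 8
coords = list(combinations(range(n), 2))                       # the N = 28 coordinates

def y(k):
    return {c: int(k in c) for c in coords}

def d0(v): return int(v[(4, 5)] + v[(6, 7)] >= 1)
def d1(v): return int(v[(2, 3)] + v[(6, 7)] >= 1)
def d2(v): return int(v[(1, 3)] + v[(5, 7)] >= 1)

for k in range(n):
    assert (d0(y(k)), d1(y(k)), d2(y(k))) == tuple((k >> b) & 1 for b in (2, 1, 0))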


Corollary 17.

In general there may not exist an autoencoding three-layer BTN if the middle layer has only $\lceil\log n\rceil$ nodes.

5 Perfect Encoding and Decoding

In this section we construct four autoencoders by adding decoders to the encoders constructed in Theorems 11, 12, 14, and 15. The autoencoder based on Theorem 11 is described in the next Theorem.

Theorem 18.

Given $X_n$ satisfying equation (3), there is a three-layer perfect BTN autoencoder with $n$ nodes in the middle layer.

Proof.

To the two-layer encoder of Theorem 11 add the decoder that maps ${\bf h}^i[n]$ to ${\bf x}^i$, $i=0,\ldots,n-1$. That decoder can be obtained from the construction of Lemma 13 by setting ${\bf z}^i={\bf x}^i$ and $s=n$. ∎

Obtaining an autoencoder based on Theorem 12 will take more doing.

Theorem 19.

Let $r=\lceil\sqrt{n}\rceil$. Given $X_n$ satisfying equation (3), there is a five-layer BTN perfect autoencoder with $r+D$ nodes in the second layer, $2r$ nodes in the middle hidden layer, and $rD$ nodes in the fourth layer.

Proof.

On top of the encoder constructed in the proof of Theorem 12 we add a decoder that is a 3-layer network with the $2r$ input nodes $\beta^1_i,\beta^2_i$, $i=0,\ldots,r-1$, $rD$ hidden nodes $\eta_{i,j}$, $i=0,\ldots,r-1$, $j=0,\ldots,D-1$, and $D$ output nodes $y_j$, $j=0,\ldots,D-1$, see Fig. 3. To simplify the description we add a dummy node $\beta^2_r=0$.

Figure 3: Perfect Encoder/Decoder BTN.

We will see that this decoder outputs ${\bf y}={\bf x}^{kr+\ell}$ when given input $({\bf h}^k[r],{\bf h}^\ell[r])$, as desired. Equip the node $\eta_{i,j}$ with the activation function

$[x^i_j\beta^1_0+(x^{r+i}_j-x^i_j)\beta^1_1+\cdots+(x^{r(r-1)+i}_j-x^{r(r-2)+i}_j)\beta^1_{r-1}+\beta^2_i-\beta^2_{i+1}\geq 2].$  (5)

Note that if $\boldsymbol{\beta}^1={\bf h}^k[r]$ then

$x^i_j\beta^1_0+(x^{r+i}_j-x^i_j)\beta^1_1+\cdots+(x^{r(r-1)+i}_j-x^{r(r-2)+i}_j)\beta^1_{r-1}=x^{kr+i}_j.$

Note further that if $\boldsymbol{\beta}^2={\bf h}^\ell[r]$ then $\beta^2_i-\beta^2_{i+1}=0$ unless $i=\ell$. Hence, when $(\boldsymbol{\beta}^1,\boldsymbol{\beta}^2)=({\bf h}^k[r],{\bf h}^\ell[r])$, the value of $\eta_{i,j}$ is $x^{kr+\ell}_j$ if $i=\ell$ and 0 otherwise, according to the activation function (5).

To complete the definition of the network we equip node $y_j$ with the activation function

$[\sum_{i=0}^{r-1}\eta_{i,j}\geq 1].$

It follows that when the network is given the input $\boldsymbol{\beta}^1$, $\boldsymbol{\beta}^2$ as specified in the Theorem, $y_j$ has value 1 if and only if $x^{kr+\ell}_j=1$, i.e. $y_j=x^{kr+\ell}_j$. ∎
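The decoder can again be simulated arithmetically. The sketch below (ours; it assumes $n=r^2$ and that $X$ is sorted as in equation (3)) implements the $\eta$ nodes of (5) and the output thresholds, and checks that $({\bf h}^k[r],{\bf h}^\ell[r])$ is decoded to ${\bf x}^{kr+\ell}$:

import itertools

def theorem19_decoder(Xsorted, r):
    D = len(Xsorted[0])
    def decode(beta1, beta2):
        beta2 = list(beta2) + [0]                    # dummy node beta^2_r = 0
        y = []
        for j in range(D):
            eta = []
            for i in range(r):
                tele = sum((Xsorted[m * r + i][j]
                            - (Xsorted[(m - 1) * r + i][j] if m else 0)) * beta1[m]
                           for m in range(r))        # = x^{kr+i}_j when beta1 = h^k[r]
                eta.append(int(tele + beta2[i] - beta2[i + 1] >= 2))
            y.append(int(sum(eta) >= 1))             # y_j = [sum_i eta_{i,j} >= 1]
        return tuple(y)
    return decode

r = 4
Xsorted = sorted(itertools.product((0, 1), repeat=4),             # n = 16, r = 4
                 key=lambda x: sum(xj * 2 ** j for j, xj in enumerate(x)))
step = lambda i: tuple(int(j <= i) for j in range(r))
decode = theorem19_decoder(Xsorted, r)
assert all(decode(step(k), step(l)) == Xsorted[k * r + l]
           for k in range(r) for l in range(r))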

We will now show how to replace the middle layer of each of the autoencoders of Theorems 18 and 19 with a three-layer autoencoder whose middle layer has the minimum possible dimension, $\lceil\log n\rceil$. Observe that the middle layer of the autoencoder constructed in the proof of Theorem 18 has $n$ nodes and that the only values that this layer assumes are ${\bf h}^i[n]$, $i=0,\ldots,n-1$. The middle layer of the autoencoder constructed in the proof of Theorem 19 has two sets of $r=\lceil\sqrt{n}\rceil$ nodes and the only values that each of these sets can assume are ${\bf h}^i[r]$, $i=0,\ldots,r-1$. In each case, therefore, the appropriate three-layer autoencoder to replace the middle layer can be constructed from one or two autoencoders on $s$ nodes which only assume the values ${\bf h}^i[s]$, $i=0,\ldots,s-1$.

Lemma 20.

There is a three-layer BTN that autoencodes the set of vectors ${\bf h}^i[s]$, $i=0,\ldots,s-1$ and has a middle layer of size $\lceil\log s\rceil$.

Proof.

For ease of exposition we assume that $s=2^m$. We construct a BTN with $s$ input nodes $\boldsymbol{\beta}$, $m$ hidden nodes $\boldsymbol{\gamma}$, and $s$ output nodes $\boldsymbol{\delta}$.

On input ${\bf h}^i[s]$ the middle layer $\boldsymbol{\gamma}$ contains the binary representation of $i$. The decoding layer is therefore straightforward: the activation function for $\delta_j$, $j=0,\ldots,s-1$ is $[\sum_{h=0}^{m-1}\gamma_h 2^h\geq j]$.

The activation functions for $\gamma_j$, $j=0,\ldots,m-1$ can be computed by the method of Lemma 13:

$\gamma_{m-1}:\ [\beta_{\frac{s}{2}}\geq 1];$
$\gamma_{m-2}:\ [\beta_{\frac{s}{4}}-\beta_{\frac{2s}{4}}+\beta_{\frac{3s}{4}}\geq 1];$
$\gamma_{m-3}:\ [\beta_{\frac{s}{8}}-\beta_{\frac{2s}{8}}+\beta_{\frac{3s}{8}}-\beta_{\frac{4s}{8}}+\cdots+\beta_{\frac{7s}{8}}\geq 1];$
$\vdots$
$\gamma_0:\ [\beta_1-\beta_2+\beta_3-\beta_4+\cdots+\beta_{s-1}\geq 1].$

Note that $\beta_0=\delta_0=1$ for all inputs ${\bf h}^i[s]$, since by definition ${\bf h}^i[s]_0=1$, and the activation function of $\delta_0$ is $[\sum_{h=0}^{m-1}\gamma_h 2^h\geq 0]$. ∎
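A Python sketch of the Lemma (ours, for $s=2^m$) computes $\boldsymbol{\gamma}$ from the alternating sums and $\boldsymbol{\delta}$ from the thresholded binary value, and verifies the round trip on all step-vectors:

# Lemma 20 sketch: the middle layer recovers the binary representation of i
# from h^i[s] via alternating sums, and the output layer thresholds that
# binary number against each position j.
def lemma20_autoencoder(m):
    s = 2 ** m
    def encode(beta):                       # beta = h^i[s] -> gamma = binary rep of i
        gamma = []
        for t in range(m):                  # gamma_{m-1-t} from an alternating sum
            u = s // 2 ** (t + 1)
            acc = sum((-1) ** (q + 1) * beta[q * u] for q in range(1, 2 ** (t + 1)))
            gamma.append(int(acc >= 1))
        return tuple(reversed(gamma))       # gamma[j] = bit j of i
    def decode(gamma):                      # delta_j = [sum_h gamma_h 2^h >= j]
        value = sum(g * 2 ** h for h, g in enumerate(gamma))
        return tuple(int(value >= j) for j in range(s))
    return encode, decode

m = 3
s = 2 ** m
enc, dec = lemma20_autoencoder(m)
step = lambda i: tuple(int(j <= i) for j in range(s))
assert all(dec(enc(step(i))) == step(i) for i in range(s))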

Replacing the middle layers of the autoencoders constructed in Theorems 18 and 19 with the appropriate three-layer BTN constructed according to the Lemma yields the following two results.

Theorem 21.

Given $X_n$ satisfying equation (3), there is a five-layer perfect BTN autoencoder with $n$ nodes in the second and fourth layers and $\lceil\log n\rceil$ nodes in the middle layer.

Theorem 22.

Given $X_n$ satisfying equation (3), there is a seven-layer perfect BTN autoencoder with $2\lceil\log\sqrt{n}\rceil$ nodes in the middle hidden layer, and $(D+5)\lceil\sqrt{n}\rceil+D$ nodes in the other hidden layers.

6 Conclusion

In this paper, we have studied the compressive power of autoencoders mainly within the Boolean threshold network model by exploring the existence of encoders and autoencoders that map distinct input vectors to distinct vectors of lower dimension. It should be noted that our results are not necessarily optimal except for Proposition 4. The establishment of lower bounds and the reduction of the number of layers or nodes are left as open problems.

Although we focused on the existence of injective mappings, conservation of distance is another important factor in dimensionality reduction. Therefore, it should be interesting and useful to study autoencoders that approximately conserve the distances (e.g., Hamming distance) between input vectors.

Another important direction is to use more practical models of neural networks such as those with sigmoid functions and Rectified Linear Unit (ReLU) functions. In such cases, real vectors need to be handled and thus conservation of the distances between vectors should also be analyzed.

References

  • [1] D. H. Ackley, G. H. Hinton, and T. J. Sejnowski, “A learning algorithm for Boltzmann machines,” Cognitive Science, vol. 9, pp. 147–169, 1985.
  • [2] P. Baldi and K. Hornik, “Neural networks and principal component analysis: Learning from examples without local minima,” Neural Networks, vol. 2, pp. 53–58, 1989.
  • [3] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, pp. 504–507, Jul. 2006.
  • [4] P. Baldi, “Autoencoders, unsupervised learning, and deep architectures,” JMLR: Workshop and Conference Proceedings, vol. 27, pp. 37–50, 2012.
  • [5] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” arXiv:1312.6114, 2013.
  • [6] C. Doersch, “Tutorial on variational autoencoders,” arXiv:1606.05908, 2016.
  • [7] M. Tschannen, O. Bachem, and M. Lucic. “Recent advances in autoencoder-based representation learning,” arXiv:1812.05069, 2018.
  • [8] R. Gómez-Bombarelli, J. N. Wei, D. Duvenaud, J. M. Hernández-Lobato, et al., “Automatic chemical design using a data-driven continuous representation of molecules,” ACS Central Science, vol. 4, no. 2, pp. 268–276, 2018.
  • [9] O. Delalleau and Y. Bengio, “Shallow vs. deep sum-product networks,” In Advances in Neural Information Processing Systems, pp. 666-674, 2011.
  • [10] G. F. Montufar, R. Pascanu, K. Cho, and Y. Bengio. “On the number of linear regions of deep neural networks,” In Advances in Neural Information Processing Systems, pp. 2924-2932, 2014.
  • [11] S. An, M. Bennamoun, and F. Boussaid, “On the compressive power of deep rectifier networks for high resolution representation of class boundaries,” in arXiv:1708.07244v1, 2017.
  • [12] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learning requires rethinking generalization,” in Proc. ICLR 2017 (arXiv:1611.03530), 2017.
  • [13] S. A. Kauffman, “Metabolic stability and epigenesis in randomly constructed genetic nets,” J. Theoret. Biol., vol. 22, no. 3, pp. 437–467, Mar. 1969.
  • [14] T. Akutsu, Algorithms for Analysis, Inference, and Control of Boolean Networks. Singapore: World Scientific, 2018.
  • [15] D. Cheng, H. Qi, and Z. Li, Analysis and Control of Boolean Networks: A Semi-tensor Product Approach. London, U.K.: Springer, 2011.
  • [16] F. Li, “Pinning control design for the stabilization of Boolean networks,” IEEE Trans. Neural Netw. Learning Syst., vol. 27, no. 7, pp. 1585–1590, 2016.
  • [17] Y. Liu, L. Sun, J. Lu and J. Liang, “Feedback controller design for the synchronization of Boolean control networks,” IEEE Trans. Neural Netw. Learning Syst., vol. 27, no. 9, pp. 1991–1996, 2016.
  • [18] J. Lu, H. Li, Y. Liu, and F. Li, “Survey on semi-tensor product method with its applications in logical networks and other finite-valued systems,” IET Control Theory & Applications, vol. 11, no. 13, pp. 2040–2047, Aug. 2017.
  • [19] Y. Zhao, B. K. Ghosh and D. Cheng, “Control of large-scale Boolean networks via network aggregation,” IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 7, pp. 1527–1536, 2016.
  • [20] M. Anthony, Discrete Mathematics of Neural Networks, Selected Topics. Philadelphia, PA, USA: SIAM, 2001.
  • [21] K. Y. Siu, V. P. Roychowdhury, and T. Kailath, “Depth-size tradeoffs for neural computation”, IEEE Transactions on Computers, vol. 40, no. 12, pp. 1402-1412, 1991.
  • [22] R. Minnick, “Linear-input logic,” IEEE Trans. Electron. Comput., vol. EC-10, pp. 6-16, 1961.