
On the Compressive Power of Boolean Threshold Autoencoders

Avraham A. Melkman, Department of Computer Science, Ben-Gurion University of the Negev; Sini Guo, Department of Mathematics, The University of Hong Kong; Wai-Ki Ching (partially supported by Hong Kong RGC GRF Grant no. 17301519, IMR and RAE Research fund from Faculty of Science, HKU), Department of Mathematics, The University of Hong Kong; Pengyu Liu, Bioinformatics Center, Institute for Chemical Research, Kyoto University; Tatsuya Akutsu (partially supported by Grant-in-Aid #18H04113 from JSPS, Japan; corresponding author, e-mail: [email protected]), Bioinformatics Center, Institute for Chemical Research, Kyoto University
Abstract

An autoencoder is a layered neural network whose structure can be viewed as consisting of an encoder, which compresses an input vector of dimension $D$ to a vector of low dimension $d$, and a decoder, which transforms the low-dimensional vector back to the original input vector (or one that is very similar). In this paper we explore the compressive power of autoencoders that are Boolean threshold networks by studying the numbers of nodes and layers that are required to ensure that each vector in a given set of distinct input binary vectors is transformed back to its original. We show that for any set of $n$ distinct vectors there exists a seven-layer autoencoder with the smallest possible middle layer (i.e., its size is logarithmic in $n$), but that there is a set of $n$ vectors for which there is no three-layer autoencoder with a middle layer of the same size. In addition we present a kind of trade-off: if a considerably larger middle layer is permissible then a five-layer autoencoder does exist. We also study encoding by itself. The results we obtain suggest that it is the decoding that constitutes the bottleneck of autoencoding. For example, there always is a three-layer Boolean threshold encoder that compresses $n$ vectors into a dimension that is reduced to twice the logarithm of $n$.

Keywords: Neural networks, Boolean functions, threshold functions, autoencoders.

1 Introduction

Artificial neural networks have been extensively studied in recent years. Among various models, autoencoders attract much attention because of their power to generate new objects such as image data. An autoencoder is a layered neural network whose structure can be viewed as consisting of two parts, an encoder and a decoder, where the former transforms an input vector to a low-dimensional vector and the latter transforms the low-dimensional vector to an output vector which should be the same as or similar to the input vector, [1, 2, 3, 4]. An autoencoder is trained in an unsupervised manner to minimize the difference between input and output data by adjusting weights (and some other parameters). In the process it learns, therefore, a mapping from high-dimensional input data to a low-dimensional representation space. Although autoencoders have a long history [1, 3], recent studies focus on variational autoencoders [5, 6, 7] because of their generative power. Autoencoders have been applied to various areas including image processing [6, 7], natural language processing [7], and drug discovery [8].

As described above, autoencoders perform dimensionality reduction, a kind of data compression. However, how data are compressed via autoencoders is not yet very clear. Of course, extensive studies have been done on the representation power of deep neural networks [9, 10, 11, 12]. Yet, to the best of the authors’ knowledge, the quantitative relationship between the compressive power and the numbers of layers and nodes in autoencoders is still unclear.

In this paper, we study the compressive power of autoencoders using a Boolean model of layered neural networks. In this model, each node takes on values that are either 1 (active) or 0 (inactive) and the activation rule for each node is given by a Boolean function. That is, we consider a Boolean network (BN) [13] as a model of a neural network. BNs have been used as a discrete model of genetic networks, for which extensive studies have been done on inference, control, and analysis [14, 15, 16, 17, 18, 19]. It should be noted that BNs have also been used as a discrete model of neural networks in which functions are restricted to be linear Boolean threshold functions whose outputs are determined by comparison of the weighted sum of input values with a threshold [20]. Such a BN is referred to as a Boolean threshold network (BTN) in this paper. Although extensive theoretical studies have been devoted to the representational power of BTNs [20], almost none considered BN models of autoencoders with the notable exception of Baldi’s study of the computational complexity of clustering via autoencoders [4].

Table 1: Summary of Results.
 | $d$ | architecture | type | constraint
Proposition 4 | $\log n$ | $D/d/D$ | Encoder/Decoder | BN
Theorem 5 | $\lceil 8\sqrt{2M}\ln n\rceil$ | $D/d$ | Encoder | BTN, $M=\#1$s in each ${\bf x}^i$
Theorem 8 | $2\lceil\log n\rceil$ | $D/d$ | Encoder | BN with parity function
Theorem 9 | $2\lceil\log n\rceil$ | $D/D^2/d$ | Encoder | BTN
Theorem 15 | $\lceil\log n\rceil$ | 4 layers ($O(\sqrt{n}+D)$ nodes) | Encoder | BTN
Theorem 19 | $2\lceil\sqrt{n}\rceil$ | $D/(\frac{d}{2}+D)/d/\frac{dD}{2}/D$ | Encoder/Decoder | BTN
Theorem 21 | $\lceil\log n\rceil$ | $D/n/d/n/D$ | Encoder/Decoder | BTN
Theorem 22 | $2\lceil\log\sqrt{n}\rceil$ | 7 layers ($O(D\sqrt{n})$ nodes) | Encoder/Decoder | BTN

The BTNs we consider in this paper are always layered ones, and we study their compressive power in two settings. The first one focuses on encoding. We are given a set of $n$ different $D$-dimensional binary vectors $X_n=\{{\bf x}^0,\ldots,{\bf x}^{n-1}\}$ and the task is to find a BTN which maps each ${\bf x}^i$ to some $d$-dimensional binary vector ${\bf f}({\bf x}^i)$ in such a way that ${\bf f}({\bf x}^i)\neq{\bf f}({\bf x}^j)$ for all $i\neq j$. Such a BTN is called a perfect encoder for $X_n$. The second one involves both encoding and decoding. In this case we are again given a set of $D$-dimensional binary vectors $X_n=\{{\bf x}^0,\ldots,{\bf x}^{n-1}\}$, and the task is to find a BTN consisting of an encoder function ${\bf f}$ and a decoder function ${\bf g}$ satisfying ${\bf g}({\bf f}({\bf x}^i))={\bf x}^i$ for all $i=0,\ldots,n-1$. Such a BTN is called a perfect autoencoder for $X_n$. It is clear from the definition that if $({\bf f},{\bf g})$ is a perfect autoencoder, then ${\bf f}$ is a perfect encoder. In this paper, we are interested in whether or not there exists a BTN for any $X_n$ under the condition that the architecture (i.e., the number of layers and the number of nodes in each layer) is fixed. This setting is reasonable because learning of an autoencoder is usually performed, for a given set of input vectors, by fixing the architecture of the BTN and adjusting the weights of the edges. Therefore, the existence of a perfect BTN for any $X_n$ implies that a BTN can be trained for any $X_n$ so that the required conditions are satisfied if an appropriate set of initial parameters is given.

The results on existence of perfect encoders and autoencoders are summarized in Table 1, in which an entry in the architecture column lists the number of nodes in each layer (from input to output). Note that $\log n$ means $\log_2 n$ throughout the paper. In addition to these positive results, we show a negative result (Theorem 16) stating that for some $X_n$ with $d=\lceil\log n\rceil$, there does not exist a perfect autoencoder consisting of three layers.

2 Problem Definitions

Our first BN model is a three-layered one, see Fig. 1; its definitions can be extended to networks with four or more layers in a straightforward way. Let ${\bf x}$, ${\bf z}$, ${\bf y}$ be binary vectors given to the first, second, and third layers, respectively. In this case, the first, second, and third layers correspond to the input, middle, and output layers, respectively. Let ${\bf f}$ and ${\bf g}$ be lists of Boolean functions assigned to nodes in the second and third layers, respectively, which means ${\bf z}={\bf f}({\bf x})$ and ${\bf y}={\bf g}({\bf z})$. Since our primary interest is in autoencoders, we assume that ${\bf x}$ and ${\bf y}$ are $D$-dimensional binary vectors and ${\bf z}$ is a $d$-dimensional binary vector with $d\leq D$. A list of functions is also referred to as a mapping. The $i$th element of a vector ${\bf x}$ will be denoted by $x_i$, which is also used to denote the node corresponding to this element. Similarly, for each mapping ${\bf f}$, $f_i$ denotes the $i$th function.

In the following, $X_n=\{{\bf x}^0,\ldots,{\bf x}^{n-1}\}$ will always denote a set of $n$ $D$-dimensional binary input vectors that are all different.

Definition 1.

A mapping ${\bf f}:\{0,1\}^D\rightarrow\{0,1\}^d$ is called a perfect encoder for $X_n$ if ${\bf f}({\bf x}^i)\neq{\bf f}({\bf x}^j)$ holds for all $i\neq j$.

Definition 2.

A pair of mappings $({\bf f},{\bf g})$ with ${\bf f}:\{0,1\}^D\rightarrow\{0,1\}^d$ and ${\bf g}:\{0,1\}^d\rightarrow\{0,1\}^D$ is called a perfect autoencoder if ${\bf g}({\bf f}({\bf x}^i))={\bf x}^i$ holds for all ${\bf x}^i\in X_n$.

Figure 1: Architecture of an autoencoder.

Fig. 1 illustrates the architecture of an autoencoder. As mentioned before, if $({\bf f},{\bf g})$ is a perfect autoencoder, then ${\bf f}$ is a perfect encoder.

Example 3.

Let $X_4=\{{\bf x}^0,{\bf x}^1,{\bf x}^2,{\bf x}^3\}$ where ${\bf x}^0=(0,0,0)$, ${\bf x}^1=(1,0,0)$, ${\bf x}^2=(1,0,1)$, ${\bf x}^3=(1,1,1)$. Let $D=3$ and $d=2$. Define ${\bf z}={\bf f}({\bf x})$ and ${\bf y}={\bf g}({\bf z})$ by

$f_0: x_0\oplus x_1,\quad f_1: x_2,\quad g_0: z_0\lor z_1,\quad g_1: \overline{z_0}\land z_1,\quad g_2: z_1.$

This pair of mappings has the following truth table, which shows it to be a perfect Boolean autoencoder.

$x_0$ | $x_1$ | $x_2$ | $z_0$ | $z_1$ | $y_0$ | $y_1$ | $y_2$
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
1 | 0 | 0 | 1 | 0 | 1 | 0 | 0
1 | 0 | 1 | 1 | 1 | 1 | 0 | 1
1 | 1 | 1 | 0 | 1 | 1 | 1 | 1
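The truth table can also be checked mechanically. The following Python sketch (ours, not part of the original construction) evaluates the mappings of Example 3 on $X_4$:

# Verify Example 3: f compresses 3-bit inputs to 2 bits, g reconstructs them.
X4 = [(0, 0, 0), (1, 0, 0), (1, 0, 1), (1, 1, 1)]

def f(x):  # encoder: z0 = x0 XOR x1, z1 = x2
    return (x[0] ^ x[1], x[2])

def g(z):  # decoder: y0 = z0 OR z1, y1 = (NOT z0) AND z1, y2 = z1
    return (z[0] | z[1], (1 - z[0]) & z[1], z[1])

for x in X4:
    assert g(f(x)) == x  # (f, g) is a perfect autoencoder for X4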
Proposition 4.

For any $X_n$, there exists a perfect Boolean autoencoder with $d=\lceil\log n\rceil$.

Proof.

The encoder maps a vector to its index, in binary representation, and the decoder maps the index back to the vector. To implement the idea formally denote by $i2b_d(j)$ the $d$-dimensional binary vector representing $j$, for $j<2^d$ and $0<d$. For example, $i2b_3(5)=(1,0,1)$.

Given $d=\lceil\log n\rceil$ define $({\bf f},{\bf g})$ by ${\bf f}({\bf x}^i)=i2b_d(i)$ and $g_j({\bf f}({\bf x}^i))=x^i_j$. Clearly, $({\bf f},{\bf g})$ is a perfect autoencoder and can be represented by Boolean functions. ∎
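The construction is easy to mimic with table lookups; here is a minimal Python sketch of it (the helper names i2b and make_autoencoder are ours), which treats the encoder and decoder as arbitrary Boolean mappings stored in dictionaries:

# Sketch of Proposition 4: encode each vector by its index in binary (i2b_d),
# decode by table lookup. Arbitrary Boolean functions are allowed here, so the
# mappings are just dictionaries.
from math import ceil, log2

def i2b(j, d):                              # d-bit binary representation of j
    return tuple((j >> (d - 1 - k)) & 1 for k in range(d))

def make_autoencoder(X):
    d = max(1, ceil(log2(len(X))))
    enc = {x: i2b(i, d) for i, x in enumerate(X)}   # f(x^i) = i2b_d(i)
    dec = {i2b(i, d): x for i, x in enumerate(X)}   # g(i2b_d(i)) = x^i
    return enc, dec, d

X = [(0, 0, 0), (1, 0, 0), (1, 0, 1), (1, 1, 1)]
enc, dec, d = make_autoencoder(X)
assert all(dec[enc[x]] == x for x in X)             # perfect autoencoder, d = ceil(log n)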

Note that there does not exist a perfect Boolean autoencoder with $d<\lceil\log n\rceil$ because $X_n$ contains $n$ different vectors. Note, furthermore, that the Proposition permits the use of arbitrary Boolean functions. In the following, we focus on what is achievable when a neural network model with Boolean threshold functions [20] is used. A function $f:\{0,1\}^h\rightarrow\{0,1\}$ is called a Boolean threshold function if it is represented as

$f({\bf x})=\begin{cases}1, & {\bf a}\cdot{\bf x}\geq\theta,\\ 0, & \mbox{otherwise,}\end{cases}$

for some $({\bf a},\theta)$, where ${\bf a}$ is an $h$-dimensional integer vector and $\theta$ is an integer. We will also denote the same function as $[{\bf a}\cdot{\bf x}\geq\theta]$. If all activation functions in a neural network are Boolean threshold functions, the network is called a Boolean threshold network (BTN). In the following sections, we will consider BTNs with $L$ layers where $L\geq 2$ (see Fig. 2). Such a BTN is represented as ${\bf y}={\bf f}^{(L-1)}({\bf f}^{(L-2)}(\cdots{\bf f}^{(1)}({\bf x})\cdots))$, where ${\bf f}^{(i)}$ is a list of activation functions for the $(i+1)$-th layer. When a BTN is used as an autoencoder, some layer is specified as the middle layer. If the $k$th layer is specified as the middle layer, the middle vector ${\bf z}$, encoder ${\bf f}$, and decoder ${\bf g}$ are defined by

${\bf z} = {\bf f}^{(k-1)}({\bf f}^{(k-2)}(\cdots{\bf f}^{(1)}({\bf x})\cdots))={\bf f}({\bf x}),$
${\bf y} = {\bf f}^{(L-1)}({\bf f}^{(L-2)}(\cdots{\bf f}^{(k)}({\bf z})\cdots))={\bf g}({\bf z}).$
Figure 2: Architecture of a five-layer BTN. In this case, the third layer corresponds to the middle layer.

3 How Easy is it to Encode?

This section is devoted to the encoding setting. First we establish, non-constructively, the existence of two encoders for $n$ $D$-dimensional vectors: a two-layer BTN that uses an output layer with $O(\sqrt{D}\log n)$ nodes, and a three-layer BTN with a hidden layer of size $D^2$ and an output layer of size $2\lceil\log_2 n\rceil$. The third result reduces the number of output nodes further to $\lceil\log n\rceil$, using a BTN of depth 4 with $O(\sqrt{n}+D)$ hidden nodes; it also lays out the architecture of this BTN. Taken together these results suggest that encoding is relatively easy by itself.

Theorem 5.

Given $X_n$ there exists a perfect encoder that is a two-layer BTN with at most $\lceil 8\sqrt{2M}\ln n\rceil$ output nodes, where $M$ is the maximum number of 1's that any two vectors in $X_n$ have in common.

Proof.

We use the probabilistic method to establish the existence of a set $W$ of $d$ $D$-dimensional weight vectors ${\bf w}^j\in\{-1,1\}^D$ with the property that the two-layer BTN whose threshold functions are $f_j({\bf x})=[{\bf x}\cdot{\bf w}^j\geq 1]$, $j=0,\ldots,d-1$, yields output vectors that are all different. Denote by ${\bf y}^i$ the output vector for ${\bf x}^i$, with $y^i_j=f_j({\bf x}^i)$.

Consider a random $W$ such that each $w^j_i$ has the value $-1$ or 1 with probability $\frac{1}{2}$. Let us compute a lower bound on the probability that a given pair of output vectors differs in a specific coordinate. Without loss of generality (w.l.o.g.) we take the vectors to be ${\bf y}^0$ and ${\bf y}^1$, and look at the $\ell$-th coordinate. Because this probability does not depend on the coordinate, since $W$ was chosen at random, let us denote it by $p_{\{0,1\}}$. Denote by $m_i$ the number of 1's in ${\bf x}^i$, and by $m_{\{0,1\}}$ the number of coordinates with value 1 for both ${\bf x}^0$ and ${\bf x}^1$. W.l.o.g. $x^0_0=x^1_0=\cdots=x^0_{m_{\{0,1\}}-1}=x^1_{m_{\{0,1\}}-1}=1$, $x^0_{m_{\{0,1\}}}=\cdots=x^0_{m_0-1}=x^1_{m_0}=\cdots=x^1_{m_0+m_1-m_{\{0,1\}}-1}=1$, all other coordinates being 0.

Consequently $y^0_\ell=[\sum_{j=0}^{m_0-1}w^\ell_j\geq 0]$, and $y^1_\ell=[\sum_{j=0}^{m_{\{0,1\}}-1}w^\ell_j+\sum_{j=m_0}^{m_0+m_1-m_{\{0,1\}}-1}w^\ell_j\geq 0]$.

Suppose $m_{\{0,1\}}$ is even. Then a lower bound on $p_{\{0,1\}}$ is $p_{\{0,1\}}\geq Prob(\sum_{j=0}^{m_{\{0,1\}}-1}w^\ell_j=0)\cdot Q$, where $Q$ stands for

$Prob(\sum_{j=m_{\{0,1\}}}^{m_0-1}w^\ell_j\geq 0)\cdot Prob(\sum_{j=m_0}^{m_0+m_1-m_{\{0,1\}}-1}w^\ell_j<0)+Prob(\sum_{j=m_{\{0,1\}}}^{m_0-1}w^\ell_j<0)\cdot Prob(\sum_{j=m_0}^{m_0+m_1-m_{\{0,1\}}-1}w^\ell_j\geq 0).$  (2)

By the soon-to-be-stated Lemma 7, $Prob(\sum_{j=0}^{m_{\{0,1\}}-1}w^\ell_j=0)\geq\frac{1}{\sqrt{2m_{\{0,1\}}}}$. Turning to the estimation of $Q$, Lemma 7 implies that if one of $m_0-m_{\{0,1\}}$, $m_1-m_{\{0,1\}}$ is odd then $Q=\frac{1}{2}$. In the remaining case that both $r_0=m_0-m_{\{0,1\}}$ and $r_1=m_1-m_{\{0,1\}}$ are even

$Q=\frac{1}{2}-\binom{r_0}{\frac{r_0}{2}}\binom{r_1}{\frac{r_1}{2}}\frac{1}{2^{r_0+r_1+1}}.$

This expression assumes its smallest value, for $r_0+r_1\geq 1$, when one of $r_0,r_1$ is 0 and the other is 2, in which case its value is $\frac{1}{4}$. We conclude, therefore, that if the number of 1's that ${\bf x}^0,{\bf x}^1$ have in common is even then $Q\geq\frac{1}{4}$ and $p_{\{0,1\}}\geq\frac{1}{4\sqrt{2m_{\{0,1\}}}}\geq\frac{1}{4\sqrt{2D}}$.

A similar computation for the case that $m_{\{0,1\}}$ is odd shows that if the number of 1's that ${\bf x}^0,{\bf x}^1$ have in common is odd then $p_{\{0,1\}}\geq\frac{1}{4\sqrt{m_{\{0,1\}}}}$.

In summary, the probability that a specific pair of output vectors differs at a specific coordinate is at least

$p=\min_{0\leq i<j\leq n-1}p_{\{i,j\}}\geq\frac{1}{4\sqrt{2M}},$

where $M=\max_{0\leq i<j\leq n-1}m_{\{i,j\}}$.

Consequently the probability that all coordinates of two $d$-dimensional output vectors are identical is at most $(1-\frac{1}{4\sqrt{2M}})^d$. To ensure that the probability that some pair of vectors ${\bf y}^{i_0},{\bf y}^{i_1}$ (out of the $\binom{n}{2}$ pairs) is identical is less than 1 it is sufficient to choose $d$ such that

$\binom{n}{2}\left(1-\frac{1}{4\sqrt{2M}}\right)^d<1,$

i.e. we require $d\geq 8\sqrt{2M}\ln n$. ∎
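The probabilistic argument can be illustrated empirically. The following Python sketch (our illustration; the data, seed, and helper names are arbitrary) draws random $\pm 1$ weight vectors and checks that the resulting two-layer BTN maps a random set of distinct binary vectors to distinct codes:

import math
import random

random.seed(0)
D, n = 30, 50
X = set()
while len(X) < n:                          # n distinct D-dimensional binary vectors
    X.add(tuple(random.randint(0, 1) for _ in range(D)))
X = list(X)

M = max(sum(a & b for a, b in zip(x, y))   # max number of common 1's over all pairs
        for i, x in enumerate(X) for y in X[:i])
d = math.ceil(8 * math.sqrt(2 * M) * math.log(n))   # bound from Theorem 5

W = [[random.choice((-1, 1)) for _ in range(D)] for _ in range(d)]

def encode(x):                             # f_j(x) = [x . w^j >= 1]
    return tuple(int(sum(xi * wi for xi, wi in zip(x, w)) >= 1) for w in W)

codes = {encode(x) for x in X}
print(len(codes) == len(X))                # True with high probability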

Remark 6.

In many applications the feature vectors are sparse so that $M$ is small and the bound given in Theorem 5 can be lowered. Consider for example the case that the number of 1s is $D^\alpha$, $\alpha<1$. Then it is sufficient to set $d\geq 8\sqrt{2}D^{\frac{\alpha}{2}}\ln n$.

Here is the statement of the Lemma used in the foregoing proof.

Lemma 7.

Let $w_j\in\{-1,1\}$, $j=0,\ldots,m-1$, be independent random variables taking each value with probability $\frac{1}{2}$, and let $w=\sum_{j=0}^{m-1}w_j$.

If $m$ is even the following hold.

  1. $Prob(w=0)=\binom{m}{m/2}\frac{1}{2^m}$. This implies, using Stirling's approximation, $Prob(w=0)\geq\frac{1}{\sqrt{2m}}$.

  2. $Prob(w\geq 0)=\frac{1}{2}(1+Prob(w=0))$.

  3. $Prob(w\geq 2)=Prob(w<0)=\frac{1}{2}(1-Prob(w=0))$.

If $m$ is odd the following hold.

  1. $Prob(w=1)=Prob(w=-1)=\binom{m}{(m-1)/2}\frac{1}{2^m}$. This implies $Prob(w=1)\geq\frac{1}{2\sqrt{m}}$.

  2. $Prob(w\geq 0)=Prob(w<0)=\frac{1}{2}$.

Proof.

It suffices to note that if $m$ is even (odd) then $w$ is even (odd), and that $Prob(w<0)=Prob(w>0)$. ∎

Next we show that the number of output nodes needed to encode the $n$ vectors can be reduced to $2\lceil\log_2 n\rceil$ by adding a layer of $D^2$ hidden nodes to the BTN. To prove this we first present a result of independent interest, that uses the power of parity functions.

Theorem 8.

Given $X_n$ there is a two-layer network whose activation functions are parity functions that maps $X_n$ to $n$ different vectors of dimension $2\lceil\log_2 n\rceil$.

Proof.

We again use the probabilistic method to establish the existence of a network with $d$ output nodes, provided $d\geq 2\log_2 n$. Equip each output node $y_k$, $k=0,\ldots,d-1$ with an activation function $f_k$ constructed as follows: assemble a set of bits, $S_k$, by adding each one of the $D$ input variables to the set with probability $\frac{1}{2}$, and set $f_k({\bf x})$ to be the parity of the set of values that ${\bf x}$ assigns to the variables in $S_k$.

Consider an arbitrary fixed output node $y_k$. Given a pair of input vectors $({\bf x}^i,{\bf x}^j)$, define

$B^i_{i,j}=\{\ell\mid x^i_\ell=1,\ x^j_\ell=0\},\quad B^j_{i,j}=\{\ell\mid x^i_\ell=0,\ x^j_\ell=1\}.$

Then $f_k({\bf x}^i)=f_k({\bf x}^j)$ if and only if $|B^i_{i,j}\cap S_k|$ and $|B^j_{i,j}\cap S_k|$ are either both odd or both even.

If $|B^i_{i,j}|>0$, then $Prob(|B^i_{i,j}\cap S_k|\mbox{ is even})=Prob(|B^i_{i,j}\cap S_k|\mbox{ is odd})=\frac{1}{2}$, because $S_k$ is randomly selected. Therefore, if $|B^i_{i,j}|>0$ and $|B^j_{i,j}|>0$, then the probability that $f_k({\bf x}^i)=f_k({\bf x}^j)$ is $\frac{1}{2}$. If $|B^i_{i,j}|=0$ then $|B^j_{i,j}|>0$, since ${\bf x}^i\neq{\bf x}^j$, and $Prob(f_k({\bf x}^i)=f_k({\bf x}^j))=Prob(|B^j_{i,j}\cap S_k|\mbox{ is even})=\frac{1}{2}$.

We conclude that $Prob(f_k({\bf x}^i)=f_k({\bf x}^j))=\frac{1}{2}$, and hence that $Prob({\bf y}^i={\bf y}^j)=(\frac{1}{2})^d$. Therefore the probability that there is some pair ${\bf x}^i,{\bf x}^j$ for which ${\bf y}^i={\bf y}^j$ is smaller than $n^2(\frac{1}{2})^d$, which is less than 1 if $d\geq 2\log_2 n$. ∎
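A quick Monte-Carlo illustration of this argument (ours; the data and seed are arbitrary) draws a random subset $S_k$ per output bit and checks that $d=2\lceil\log_2 n\rceil$ parity bits already separate a random set of distinct vectors:

import math
import random

random.seed(1)
D, n = 20, 100
X = set()
while len(X) < n:                    # n distinct D-dimensional binary vectors
    X.add(tuple(random.randint(0, 1) for _ in range(D)))
X = list(X)

d = 2 * math.ceil(math.log2(n))
S = [[j for j in range(D) if random.random() < 0.5] for _ in range(d)]  # random subsets S_k

def encode(x):                       # f_k(x) = parity of the bits of x indexed by S_k
    return tuple(sum(x[j] for j in S[k]) % 2 for k in range(d))

codes = {encode(x) for x in X}
print(len(codes) == len(X))          # True with high probability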

Theorem 9.

Given $n$ different binary vectors of dimension $D$, there is a three-layer BTN whose activation functions are threshold functions that maps these vectors to $n$ different vectors of dimension $2\lceil\log_2 n\rceil$ using $D^2$ hidden nodes.

Proof.

It is well-known that a parity function on $D$ nodes can be implemented by threshold functions while using only one additional layer of at most $D$ nodes. A simple way of doing this is to use a hidden layer with $D$ nodes, $h_0,\ldots,h_{D-1}$, where node $h_i$ has value 1 if the input contains at least $i+1$ 1's, i.e. $h_i({\bf x})=[{\bf e}\cdot{\bf x}\geq i+1]$, where ${\bf e}=(1,\ldots,1)$. The output node that computes the parity of the input has the activation function $[\sum_{i=0}^{D-1}(-1)^i h_i\geq 1]$. The Theorem follows therefore from Theorem 8 when each of the parity functions it uses is implemented this way. ∎
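For concreteness, here is a small Python sketch (ours) of this parity-by-thresholds construction, checked against the true parity on all 4-bit inputs:

# Parity via thresholds: hidden node h_i fires iff the input contains at least
# i+1 ones, and the output thresholds the alternating sum h_0 - h_1 + h_2 - ...
def parity_btn(x):
    D = len(x)
    h = [int(sum(x) >= i + 1) for i in range(D)]          # h_i(x) = [e.x >= i+1]
    return int(sum((-1) ** i * h[i] for i in range(D)) >= 1)

for bits in range(16):                                    # check against true parity
    x = [(bits >> k) & 1 for k in range(4)]
    assert parity_btn(x) == sum(x) % 2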

We turn now to results that also provide constructions. In one form or another all use the idea of having the encoder map a vector ${\bf x}^i$ to its index, as done in Proposition 4. That Proposition permits the use of any Boolean mapping, so that the meaning of the index of the vector is of no importance. Here, however, only mappings arising from BTNs are employed, and the meaning of the index, as provided by the following simple observation, is going to be of central importance.

Lemma 10.

cf. [12]. Given $X_n$ there is a vector ${\bf a}$ such that ${\bf a}\cdot{\bf x}^i\neq{\bf a}\cdot{\bf x}^j$ if $i\neq j$. Furthermore, this vector can be assumed to have integer coordinates.

The reason is that the vectors in $X_n$ are all different. We fix ${\bf a}$, and assume henceforth that $X_n$ is sorted,

$b_0<\cdots<b_{n-1},\quad b_i={\bf a}\cdot{\bf x}^i.$  (3)

When a vector ${\bf x}^i$ is mapped to its index, $i$ will be represented either in binary, or by the $s$-dimensional step-vector ${\bf h}^i[s]$, for some $s$, defined by

${\bf h}^i[s]_j=1\mbox{ if }0\leq j\leq i,\quad 0\mbox{ if }i<j<s.$  (4)

Our first result in this direction exemplifies encoding with step-vectors.

Theorem 11.

Given $X_n$ there is a two-layer BTN that maps ${\bf x}^i$ to ${\bf h}^i[n]$, $i=0,\ldots,n-1$.

Proof.

Let the vector of threshold functions for the $n$ nodes in the output layer be $G_j({\bf x})=[{\bf a}\cdot{\bf x}\geq b_j]$. Clearly $G_j({\bf x}^i)=[{\bf a}\cdot{\bf x}^i\geq b_j]$ has value 1 if and only if $j\leq i$, i.e. ${\bf G}({\bf x}^i)={\bf h}^i[n]$. ∎
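A compact Python sketch of Lemma 10 together with Theorem 11 (ours; we take ${\bf a}$ to be the vector of powers of two, which is one valid choice of separating vector for any set of distinct binary vectors):

def step_vector_encoder(X):
    D = len(X[0])
    a = [2 ** j for j in range(D)]                    # a.x distinct for distinct binary x
    b = sorted(sum(aj * xj for aj, xj in zip(a, x)) for x in X)   # equation (3)
    def G(x):
        v = sum(aj * xj for aj, xj in zip(a, x))
        return tuple(int(v >= bj) for bj in b)        # = h^i[n] when v = b_i
    return G

X = [(0, 1, 1), (1, 0, 0), (1, 1, 0), (0, 0, 1)]
G = step_vector_encoder(X)
assert len({G(x) for x in X}) == len(X)               # the outputs are the n distinct step-vectors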

We show next that the number of output nodes of the encoder can be reduced to $2\lceil\sqrt{n}\rceil$ if a hidden layer with only $\lceil\sqrt{n}\rceil+D$ nodes is added. Henceforward we denote $r=\lceil\sqrt{n}\rceil$.

Theorem 12.

Given $X_n$ satisfying equation (3), there is a three-layer network that maps ${\bf x}^i$ to $({\bf h}^k[r],{\bf h}^\ell[r])$ where $k$ and $\ell$ are such that $i=kr+\ell$, with $0\leq\ell<r$. The network has $D$ input nodes $x_j$, $j=0,\ldots,D-1$, $r+D$ hidden nodes $\alpha^1_i$, $i=0,\ldots,r-1$, and $\alpha^2_j$, $j=0,\ldots,D-1$, and $2r$ output nodes $\beta^1_i,\beta^2_i$, $i=0,\ldots,r-1$.

Proof.

For ease of exposition we assume that $\log n=2m$, and $r=\sqrt{n}=2^m$.

The nodes $\alpha^2_j$ simply copy the input, $\alpha^2_j=x_j$, $j=0,\ldots,D-1$. To define the activation functions of $\alpha^1_i$ divide the interval $[b_0,b_{n-1}]$ into $r$ consecutive subintervals $[s_0,s_1-1],[s_1,s_2-1],\ldots,[s_{r-1},b_{n-1}]$ each containing $r$ values $b_i$, i.e. $s_0=b_0$, $s_1=b_r$, $\ldots$, $s_{r-1}=b_{(r-1)r}$, and $s_r=b_{n-1}+1$. The activation function of $\alpha^1_i$ is $[{\bf a}\cdot{\bf x}\geq s_i]$, $0\leq i\leq r-1$. Clearly, if the input is ${\bf x}^h$ with $h=kr+\ell$ then $\alpha^1_i=1$ if and only if $i\leq k$, i.e. $\boldsymbol{\alpha}^1={\bf h}^k[r]$.

The output node $\beta^1_i$ copies $\alpha^1_i$, $0\leq i\leq r-1$. The remaining output nodes, $\boldsymbol{\beta}^2$, represent the ordinal number of ${\bf a}\cdot{\bf x}$ within the $k$-th subinterval. To implement this part of the mapping we follow [21] in employing the ingenious technique of telescopic sums, introduced by [22], and define

ti=bjα0\displaystyle t_{i}=b_{j}\alpha_{0} +(br+ibi)α1+\displaystyle+(b_{r+i}-b_{i})\alpha_{1}+\ldots
\displaystyle\ldots +(b(r1)r+ib(r2)r+i)αr1,i=0,,r1.\displaystyle+(b_{(r-1)r+i}-b_{(r-2)r+i})\alpha_{r-1},i=0,\ldots,r-1.

The important thing to note here is that if ${\bf a}\cdot{\bf x}$ falls in the $k$-th subinterval, so that $s_k\leq{\bf a}\cdot{\bf x}<s_{k+1}$, then $t_i$ has the value $b_{kr+i}$, because $\alpha_0=\cdots=\alpha_k=1$, while $\alpha_i=0$, $i>k$.

The activation function of $\beta^2_i$ is $[{\bf a}\cdot{\bf x}-t_i\geq 0]$, $i=0,\ldots,r-1$. Observe that if ${\bf a}\cdot{\bf x}=b_{kr+\ell}$, then $\beta^2_i=[b_{kr+\ell}-b_{kr+i}\geq 0]$, i.e. $\beta^2_i=1$ if $i\leq\ell$, and $\beta^2_i=0$ if $i>\ell$. ∎
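The whole construction can be simulated arithmetically; the following Python sketch (ours, assuming for simplicity that $n=r^2$ and again using powers of two for ${\bf a}$) computes $\boldsymbol{\alpha}^1={\bf h}^k[r]$, the telescopic sums $t_i$, and $\boldsymbol{\beta}^2={\bf h}^\ell[r]$:

import itertools

def theorem12_encoder(X):
    r = int(round(len(X) ** 0.5))
    assert r * r == len(X)                          # simplifying assumption: n = r*r
    D = len(X[0])
    a = [2 ** j for j in range(D)]                  # separating vector (Lemma 10)
    dot = lambda x: sum(aj * xj for aj, xj in zip(a, x))
    b = sorted(dot(x) for x in X)                   # equation (3)
    s = [b[k * r] for k in range(r)]                # left endpoints of the r subintervals
    def encode(x):
        v = dot(x)
        alpha1 = [int(v >= s[k]) for k in range(r)]                 # = h^k[r]
        t = [sum((b[m * r + i] - (b[(m - 1) * r + i] if m else 0)) * alpha1[m]
                 for m in range(r)) for i in range(r)]              # telescopic sums, t_i = b_{kr+i}
        beta2 = [int(v - t[i] >= 0) for i in range(r)]              # = h^l[r]
        return tuple(alpha1), tuple(beta2)
    return encode

X = list(itertools.product((0, 1), repeat=4))       # n = 16, r = 4
enc = theorem12_encoder(X)
print(len({enc(x) for x in X}) == len(X))           # all codes distinct -> True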

It is relatively easy to convert the BTNs constructed in Theorems 11 and 12 into BTNs that map $X_n$ to a set of vectors with the minimum possible dimension, $\lceil\log n\rceil$, by adding a layer of that size. The technique for doing so is encapsulated in the following Lemma. Its formulation is slightly more general than strictly needed here in order to serve another purpose later on.

Lemma 13.

Given a set of $s\leq 2^d$ different $d$-dimensional binary vectors $Z=\{{\bf z}^i,\ i=0,\ldots,s-1\}$ there is a two-layer BTN that maps ${\bf h}^i[s]$ to ${\bf z}^i$.

Proof.

To construct the vector of threshold functions for the output nodes, ${\bf H}({\bf h}[s])$, set $w^j_0=z^0_j$, $j=0,\ldots,d-1$,

$w^j_i=z^i_j-z^{i-1}_j,\quad i=1,\ldots,s-1,\ j=0,\ldots,d-1,$

and $H_j({\bf h}[s])=[{\bf w}^j\cdot{\bf h}[s]\geq 1]$. It is easily verified that $H_j({\bf h}^i[s])=[z^i_j\geq 1]=z^i_j$, i.e. ${\bf H}({\bf h}^i[s])={\bf z}^i$ as desired. ∎
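In code the telescoping is immediate; the following Python sketch (ours) builds the weights ${\bf w}^j$ and checks the Lemma on a small example:

# Lemma 13 sketch: with w^j_0 = z^0_j and w^j_i = z^i_j - z^{i-1}_j, the
# threshold function H_j(h) = [w^j . h >= 1] maps h^i[s] to z^i_j, because the
# dot product telescopes to z^i_j.
def lemma13_decoder(Z):
    s, d = len(Z), len(Z[0])
    W = [[Z[0][j] if i == 0 else Z[i][j] - Z[i - 1][j] for i in range(s)]
         for j in range(d)]                              # W[j] = w^j
    def H(h):
        return tuple(int(sum(wi * hi for wi, hi in zip(W[j], h)) >= 1)
                     for j in range(d))
    return H

def step(i, s):                                          # h^i[s]
    return tuple(int(j <= i) for j in range(s))

Z = [(0, 1, 1), (1, 0, 0), (1, 1, 0), (0, 0, 1)]
H = lemma13_decoder(Z)
assert all(H(step(i, len(Z))) == Z[i] for i in range(len(Z)))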

Consider the layer constructed in the proof of the Lemma when ${\bf z}^i$ is the $\lceil\log n\rceil$-dimensional binary representation of $i$ and $s=n$. By adding this layer to the BTN constructed in Theorem 11 we get the following result.

Theorem 14.

Given $X_n$ there is a three-layer BTN with $n$ nodes in the second layer that maps ${\bf x}^i$ to the $\lceil\log n\rceil$-dimensional binary representation of $i$, $i=0,\ldots,n-1$.

Similarly, we can add to the BTN constructed in Theorem 12 a layer that consists of a first part and a second part, both constructed according to the Lemma. The first part connects the nodes $\boldsymbol{\beta}^1$ to a $\lceil\log r\rceil$-dimensional binary counter $\boldsymbol{\gamma}^1$, and the second part connects the nodes $\boldsymbol{\beta}^2$ to another $\lceil\log r\rceil$-dimensional binary counter $\boldsymbol{\gamma}^2$. Thus, on input ${\bf x}^i$ the nodes $\boldsymbol{\beta}^1$ contain ${\bf h}^k[r]$, which is mapped to the $\lceil\log r\rceil$-dimensional binary representation of $k$ in $\boldsymbol{\gamma}^1$, and the nodes $\boldsymbol{\beta}^2$ contain ${\bf h}^\ell[r]$, which is mapped to the $\lceil\log r\rceil$-dimensional binary representation of $\ell$ in $\boldsymbol{\gamma}^2$. Recalling that $i=kr+\ell$, with $0\leq\ell<r$, it follows that, in this case, $\boldsymbol{\gamma}^1$ and $\boldsymbol{\gamma}^2$ taken together contain the $\lceil\log n\rceil$-dimensional binary representation of $i$. This construction establishes the following Theorem.

Theorem 15.

Given $X_n$ there is a four-layer BTN with $3\lceil\sqrt{n}\rceil+D$ hidden nodes that maps ${\bf x}^i$ to the $\lceil\log n\rceil$-dimensional binary representation of $i$, $i=0,\ldots,n-1$.

4 Difficulty of Decoding

Theorem 16.

For each $d$ there exists a set of $n=2^d$ different $N$-dimensional bit-vectors $Y=\{{\bf y}^i\}$ with the following properties:

  1. There exists a 2-layer BTN that maps $Y$ to $\{0,1\}^d$.

  2. There does not exist a 2-layer BTN that maps $\{0,1\}^d$ to $Y$.

Proof.

Set $N=\binom{n}{2}$ and construct $Y$ as follows:

  1. index the coordinates of the vectors by $(i,j)$, $0\leq i<j\leq n-1$;

  2. set $y^k_{(i,j)}=1$ if and only if $i$ or $j$ is $k$.

A BTN that encodes these vectors with $d$ bits is constructed as follows. The most significant bit is generated by the threshold function

$\sum_{i=n/4}^{n/2-1}y_{(2i,2i+1)}\geq 1.$

The next most significant bit is generated by

$\sum_{i=3n/8}^{n/2-1}y_{(2i,2i+1)}+\sum_{i=n/8}^{n/4-1}y_{(2i,2i+1)}\geq 1.$

The successive digits are generated in similar fashion, each time using a sum of $\frac{n}{4}$ input bits, until at last the least significant bit is generated by the slightly different threshold function

$y_{(1,3)}+y_{(5,7)}+\cdots+y_{(n-3,n-1)}\geq 1.$

It is not difficult to verify that this 2-layer BTN maps ${\bf y}^i$ to the $d$-bit representation of $i$.

To illustrate, consider the case $d=3$, $n=8$, $N=28$. These vectors are encoded by the threshold functions $d_0$ (most significant digit), $d_1$, and $d_2$ (least significant digit):

$d_0({\bf y})=[y_{(4,5)}+y_{(6,7)}\geq 1],$
$d_1({\bf y})=[y_{(2,3)}+y_{(6,7)}\geq 1],$
$d_2({\bf y})=[y_{(1,3)}+y_{(5,7)}\geq 1].$

For example, since in ${\bf y}^7$ the coordinates $(0,7),\ldots,(6,7)$ are 1 it has $d_0({\bf y}^7)=d_1({\bf y}^7)=d_2({\bf y}^7)=1$, i.e. its representation is $(1,1,1)$. Similarly, since $y^1_{(6,7)}=y^1_{(4,5)}=y^1_{(2,3)}=0$ whereas $y^1_{(1,3)}=1$, $d_0({\bf y}^1)=d_1({\bf y}^1)=0$, $d_2({\bf y}^1)=1$, so that ${\bf y}^1$ has the representation $(0,0,1)$.

To prove the second assertion, suppose to the contrary that there does exist a BTN mapping $\{0,1\}^d$ to $Y$, and that it maps $(0,\ldots,0)$ to ${\bf y}^k$ and $(1,\ldots,1)$ to ${\bf y}^\ell$. Consider the threshold function for the bit $x_{(k,\ell)}$ of the output layer. Since by construction $y^k_{(k,\ell)}=y^\ell_{(k,\ell)}=1$ while $y^i_{(k,\ell)}=0$ for all $i\neq k,\ell$, it follows that the threshold function for this bit separates the two $d$-dimensional vectors $(0,\ldots,0)$ and $(1,\ldots,1)$ from the remainder of $\{0,1\}^d$, a contradiction to the well-known fact that such a threshold function does not exist. ∎
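The first assertion is easy to check mechanically for the illustrative case $d=3$; the Python sketch below (ours) builds the 28-dimensional vectors ${\bf y}^k$ and applies the three threshold functions:

# Theorem 16 construction for d = 3, n = 8, N = 28: y^k has a 1 exactly in the
# coordinates (i, j) with i = k or j = k; d_0, d_1, d_2 recover the bits of k.
from itertools import combinations

n = 8
coords = list(combinations(range(n), 2))                       # the N = 28 coordinates

def y(k):
    return {c: int(k in c) for c in coords}

def d0(v): return int(v[(4, 5)] + v[(6, 7)] >= 1)
def d1(v): return int(v[(2, 3)] + v[(6, 7)] >= 1)
def d2(v): return int(v[(1, 3)] + v[(5, 7)] >= 1)

for k in range(n):
    assert (d0(y(k)), d1(y(k)), d2(y(k))) == tuple((k >> b) & 1 for b in (2, 1, 0))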


Corollary 17.

In general there may not exist an autoencoding three-layer BTN if the middle layer has only $\lceil\log n\rceil$ nodes.

5 Perfect Encoding and Decoding

In this section we construct four autoencoders by adding decoders to the encoders constructed in Theorems 11, 12, 14, and 15. The autoencoder based on Theorem 11 is described in the next Theorem.

Theorem 18.

Given $X_n$ satisfying equation (3), there is a three-layer perfect BTN autoencoder with $n$ nodes in the middle layer.

Proof.

To the two-layer encoder of Theorem 11 add the decoder that maps ${\bf h}^i[n]$ to ${\bf x}^i$, $i=0,\ldots,n-1$. That decoder can be obtained from the construction of Lemma 13 by setting ${\bf z}^i={\bf x}^i$ and $s=n$. ∎

Obtaining an autoencoder based on Theorem 12 will take more doing.

Theorem 19.

Let $r=\lceil\sqrt{n}\rceil$. Given $X_n$ satisfying equation (3), there is a five-layer BTN perfect autoencoder with $r+D$ nodes in the second layer, $2r$ nodes in the middle hidden layer, and $rD$ nodes in the fourth layer.

Proof.

On top of the encoder constructed in the proof of Theorem 12 we add a decoder that is a 3-layer network with the $2r$ input nodes $\beta^1_i,\beta^2_i$, $i=0,\ldots,r-1$, $rD$ hidden nodes $\eta_{i,j}$, $i=0,\ldots,r-1$, $j=0,\ldots,D-1$, and $D$ output nodes $y_j$, $j=0,\ldots,D-1$, see Fig. 3. To simplify the description we add a dummy node $\beta^2_r=0$.

Figure 3: Perfect Encoder/Decoder BTN.

We will see that this decoder outputs ${\bf y}={\bf x}^{kr+\ell}$ when given input $({\bf h}^k[r],{\bf h}^\ell[r])$, as desired. Equip the node $\eta_{i,j}$ with the activation function

$[x^i_j\beta^1_0+(x^{r+i}_j-x^i_j)\beta^1_1+\cdots+(x^{r(r-1)+i}_j-x^{r(r-2)+i}_j)\beta^1_{r-1}+\beta^2_i-\beta^2_{i+1}\geq 2].$  (5)

Note that if $\boldsymbol{\beta}^1={\bf h}^k[r]$ then

$x^i_j\beta^1_0+(x^{r+i}_j-x^i_j)\beta^1_1+\cdots+(x^{r(r-1)+i}_j-x^{r(r-2)+i}_j)\beta^1_{r-1}=x^{kr+i}_j.$

Note further that if $\boldsymbol{\beta}^2={\bf h}^\ell[r]$ then $\beta^2_i-\beta^2_{i+1}=0$ unless $i=\ell$. Hence, when $(\boldsymbol{\beta}^1,\boldsymbol{\beta}^2)=({\bf h}^k[r],{\bf h}^\ell[r])$, the value of $\eta_{i,j}$ is $x^{kr+\ell}_j$ if $i=\ell$ and 0 otherwise, according to the activation function (5).

To complete the definition of the network we equip node $y_j$ with the activation function

$[\sum_{i=0}^{r-1}\eta_{i,j}\geq 1].$

It follows that when the network is given the input $\boldsymbol{\beta}^1$, $\boldsymbol{\beta}^2$ as specified in the Theorem, $y_j$ has value 1 if and only if $x^{kr+\ell}_j=1$, i.e. $y_j=x^{kr+\ell}_j$. ∎
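The decoder can again be simulated arithmetically. The sketch below (ours; it assumes $n=r^2$ and that $X$ is sorted as in equation (3)) implements the $\eta$ nodes of (5) and the output thresholds, and checks that $({\bf h}^k[r],{\bf h}^\ell[r])$ is decoded to ${\bf x}^{kr+\ell}$:

import itertools

def theorem19_decoder(Xsorted, r):
    D = len(Xsorted[0])
    def decode(beta1, beta2):
        beta2 = list(beta2) + [0]                    # dummy node beta^2_r = 0
        y = []
        for j in range(D):
            eta = []
            for i in range(r):
                tele = sum((Xsorted[m * r + i][j]
                            - (Xsorted[(m - 1) * r + i][j] if m else 0)) * beta1[m]
                           for m in range(r))        # = x^{kr+i}_j when beta1 = h^k[r]
                eta.append(int(tele + beta2[i] - beta2[i + 1] >= 2))
            y.append(int(sum(eta) >= 1))             # y_j = [sum_i eta_{i,j} >= 1]
        return tuple(y)
    return decode

r = 4
Xsorted = sorted(itertools.product((0, 1), repeat=4),             # n = 16, r = 4
                 key=lambda x: sum(xj * 2 ** j for j, xj in enumerate(x)))
step = lambda i: tuple(int(j <= i) for j in range(r))
decode = theorem19_decoder(Xsorted, r)
assert all(decode(step(k), step(l)) == Xsorted[k * r + l]
           for k in range(r) for l in range(r))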

We will now show how to replace the middle layer of each of the autoencoders of Theorems 18 and 19 with a three-layer autoencoder whose middle layer has the minimum possible dimension, $\lceil\log n\rceil$. Observe that the middle layer of the autoencoder constructed in the proof of Theorem 18 has $n$ nodes and that the only values that this layer assumes are ${\bf h}^i[n]$, $i=0,\ldots,n-1$. The middle layer of the autoencoder constructed in the proof of Theorem 19 has two sets of $r=\lceil\sqrt{n}\rceil$ nodes and the only values that each of these sets can assume are ${\bf h}^i[r]$, $i=0,\ldots,r-1$. In each case, therefore, the appropriate three-layer autoencoder to replace the middle layer can be constructed from one or two autoencoders on $s$ nodes which only assume the values ${\bf h}^i[s]$, $i=0,\ldots,s-1$.

Lemma 20.

There is a three-layer BTN that autoencodes the set of vectors ${\bf h}^i[s]$, $i=0,\ldots,s-1$ and has a middle layer of size $\lceil\log s\rceil$.

Proof.

For ease of exposition we assume that $s=2^m$. We construct a BTN with $s$ input nodes $\boldsymbol{\beta}$, $m$ hidden nodes $\boldsymbol{\gamma}$, and $s$ output nodes $\boldsymbol{\delta}$.

On input ${\bf h}^i[s]$ the middle layer $\boldsymbol{\gamma}$ contains the binary representation of $i$. The decoding layer is therefore straightforward: the activation function for $\delta_j$, $j=0,\ldots,s-1$ is $[\sum_{h=0}^{m-1}\gamma_h 2^h\geq j]$.

The activation functions for $\gamma_j$, $j=0,\ldots,m-1$ can be computed by the method of Lemma 13:

$\gamma_{m-1}:\ [\beta_{\frac{s}{2}}\geq 1];$
$\gamma_{m-2}:\ [\beta_{\frac{s}{4}}-\beta_{\frac{2s}{4}}+\beta_{\frac{3s}{4}}\geq 1];$
$\gamma_{m-3}:\ [\beta_{\frac{s}{8}}-\beta_{\frac{2s}{8}}+\beta_{\frac{3s}{8}}-\beta_{\frac{4s}{8}}+\cdots+\beta_{\frac{7s}{8}}\geq 1];$
$\vdots$
$\gamma_0:\ [\beta_1-\beta_2+\beta_3-\beta_4+\cdots+\beta_{s-1}\geq 1].$

Note that $\beta_0=\delta_0=1$ for all inputs ${\bf h}^i[s]$, since by definition ${\bf h}^i[s]_0=1$, and the activation function of $\delta_0$ is $[\sum_{h=0}^{m-1}\gamma_h 2^h\geq 0]$. ∎
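A Python sketch of the Lemma (ours, for $s=2^m$) computes $\boldsymbol{\gamma}$ from the alternating sums and $\boldsymbol{\delta}$ from the thresholded binary value, and verifies the round trip on all step-vectors:

# Lemma 20 sketch: the middle layer recovers the binary representation of i
# from h^i[s] via alternating sums, and the output layer thresholds that
# binary number against each position j.
def lemma20_autoencoder(m):
    s = 2 ** m
    def encode(beta):                       # beta = h^i[s] -> gamma = binary rep of i
        gamma = []
        for t in range(m):                  # gamma_{m-1-t} from an alternating sum
            u = s // 2 ** (t + 1)
            acc = sum((-1) ** (q + 1) * beta[q * u] for q in range(1, 2 ** (t + 1)))
            gamma.append(int(acc >= 1))
        return tuple(reversed(gamma))       # gamma[j] = bit j of i
    def decode(gamma):                      # delta_j = [sum_h gamma_h 2^h >= j]
        value = sum(g * 2 ** h for h, g in enumerate(gamma))
        return tuple(int(value >= j) for j in range(s))
    return encode, decode

m = 3
s = 2 ** m
enc, dec = lemma20_autoencoder(m)
step = lambda i: tuple(int(j <= i) for j in range(s))
assert all(dec(enc(step(i))) == step(i) for i in range(s))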

Replacing the middle layers of the autoencoders constructed in Theorems 18 and 19 with the appropriate three-layer BTN constructed according to the Lemma yields the following two results.

Theorem 21.

Given $X_n$ satisfying equation (3), there is a five-layer perfect BTN autoencoder with $n$ nodes in the second and fourth layers and $\lceil\log n\rceil$ nodes in the middle layer.

Theorem 22.

Given $X_n$ satisfying equation (3), there is a seven-layer perfect BTN autoencoder with $2\lceil\log\sqrt{n}\rceil$ nodes in the middle hidden layer, and $(D+5)\lceil\sqrt{n}\rceil+D$ nodes in the other hidden layers.

6 Conclusion

In this paper, we have studied the compressive power of autoencoders mainly within the Boolean threshold network model by exploring the existence of encoders and autoencoders that map distinct input vectors to distinct vectors of lower dimension. It should be noted that our results are not necessarily optimal except for Proposition 4. The establishment of lower bounds and the reduction of the number of layers or nodes are left as open problems.

Although we focused on the existence of injective mappings, conservation of distance is another important factor in dimensionality reduction. Therefore, it should be interesting and useful to study autoencoders that approximately conserve the distances (e.g., Hamming distance) between input vectors.

Another important direction is to use more practical models of neural networks such as those with sigmoid functions and Rectified Linear Unit (ReLU) functions. In such cases, real vectors need to be handled and thus conservation of the distances between vectors should also be analyzed.

References

  • [1] D. H. Ackley, G. H. Hinton, and T. J. Sejnowski, “A learning algorithm for Boltzmann machines,” Cognitive Science, vol. 9, pp. 147–169, 1985.
  • [2] P. Baldi and K. Hornik, “Neural networks and principal component analysis: Learning from examples without local minima,” Neural Networks, vol. 2, pp. 53–58, 1989.
  • [3] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, pp. 504–507, Jul. 2006.
  • [4] P. Baldi, “Autoencoders, unsupervised learning, and deep architectures,” JMLR: Workshop and Conference Proceedings, vol. 27, pp. 37–50, 2012.
  • [5] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” arXiv:1312.6114, 2013.
  • [6] C. Doersch, “Tutorial on variational autoencoders,” arXiv:1606.05908, 2016.
  • [7] M. Tschannen, O. Bachem, and M. Lucic. “Recent advances in autoencoder-based representation learning,” arXiv:1812.05069, 2018.
  • [8] R. Gómez-Bombarelli, J. N. Wei, D. Duvenaud, J. M. Hernández-Lobato, et al., “Automatic chemical design using a data-driven continuous representation of molecules,” ACS Central Science, vol. 4, no. 2, pp. 268–276, 2018.
  • [9] O. Delalleau and Y. Bengio, “Shallow vs. deep sum-product networks,” In Advances in Neural Information Processing Systems, pp. 666-674, 2011.
  • [10] G. F. Montufar, R. Pascanu, K. Cho, and Y. Bengio. “On the number of linear regions of deep neural networks,” In Advances in Neural Information Processing Systems, pp. 2924-2932, 2014.
  • [11] S. An, M. Bennamoun, and F. Boussaid, “On the compressive power of deep rectifier networks for high resolution representation of class boundaries,” in arXiv:1708.07244v1, 2017.
  • [12] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learning requires rethinking generalization,” in Proc. ICLR 2017 (arXiv:1611.03530), 2017.
  • [13] S. A. Kauffman, “Metabolic stability and epigenesis in randomly constructed genetic nets,” J. Theoret. Biol., vol. 22, no. 3, pp. 437–467, Mar. 1969.
  • [14] T. Akutsu, Algorithms for Analysis, Inference, and Control of Boolean Networks. Singapore: World Scientific, 2018.
  • [15] D. Cheng, H. Qi, and Z. Li, Analysis and Control of Boolean Networks: A Semi-tensor Product Approach. London, U.K.: Springer, 2011.
  • [16] F. Li, “Pinning control design for the stabilization of Boolean networks,” IEEE Trans. Neural Netw. Learning Syst., vol. 27, no. 7, pp. 1585–1590, 2016.
  • [17] Y. Liu, L. Sun, J. Lu and J. Liang, “Feedback controller design for the synchronization of Boolean control networks,” IEEE Trans. Neural Netw. Learning Syst., vol. 27, no. 9, pp. 1991–1996, 2016.
  • [18] J. Lu, H. Li, Y. Liu, and F. Li, “Survey on semi-tensor product method with its applications in logical networks and other finite-valued systems,” IET Control Theory & Applications, vol. 11, no. 13, pp. 2040–2047, Aug. 2017.
  • [19] Y. Zhao, B. K. Ghosh and D. Cheng, “Control of large-scale Boolean networks via network aggregation,” IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 7, pp. 1527–1536, 2016.
  • [20] M. Anthony, Discrete Mathematics of Neural Networks, Selected Topics. Philadelphia, PA, USA: SIAM, 2001.
  • [21] K. Y. Siu, V. P. Roychowdhury, and T. Kailath, “Depth-size tradeoffs for neural computation”, IEEE Transactions on Computers, vol. 40, no. 12, pp. 1402-1412, 1991.
  • [22] R. Minnick, “Linear-input logic,” IEEE Trans. Electron. Comput., vol. EC-10, pp. 6-16, 1961.