On the Compressive Power of Boolean Threshold Autoencoders
Abstract
An autoencoder is a layered neural network whose structure can be viewed as consisting of an encoder, which compresses an input vector of dimension to a vector of low dimension , and a decoder, which transforms the low-dimensional vector back to the original input vector (or one that is very similar). In this paper we explore the compressive power of autoencoders that are Boolean threshold networks by studying the numbers of nodes and layers that are required to ensure that each vector in a given set of distinct input binary vectors is transformed back to its original. We show that for any set of distinct vectors there exists a seven-layer autoencoder with the smallest possible middle layer (i.e., its size is logarithmic in ), but that there is a set of vectors for which there is no three-layer autoencoder with a middle layer of the same size. In addition, we present a kind of trade-off: if a considerably larger middle layer is permissible, then a five-layer autoencoder does exist. We also study encoding by itself. The results we obtain suggest that it is the decoding that constitutes the bottleneck of autoencoding. For example, there always is a three-layer Boolean threshold encoder that compresses vectors into a dimension that is reduced to twice the logarithm of .
Keywords: Neural networks, Boolean functions, threshold functions, autoencoders.
1 Introduction
Artificial neural networks have been extensively studied in recent years. Among various models, autoencoders attract much attention because of their power to generate new objects such as image data. An autoencoder is a layered neural network whose structure can be viewed as consisting of two parts, an encoder and a decoder, where the former transforms an input vector to a low-dimensional vector and the latter transforms the low-dimensional vector to an output vector which should be the same as or similar to the input vector [1, 2, 3, 4]. An autoencoder is trained in an unsupervised manner to minimize the difference between input and output data by adjusting weights (and some other parameters). In the process it learns, therefore, a mapping from high-dimensional input data to a low-dimensional representation space. Although autoencoders have a long history [1, 3], recent studies focus on variational autoencoders [5, 6, 7] because of their generative power. Autoencoders have been applied to various areas including image processing [6, 7], natural language processing [7], and drug discovery [8].
As described above, autoencoders perform dimensionality reduction, a kind of data compression. However, how data are compressed via autoencoders is not yet very clear. Of course, extensive studies have been done on the representation power of deep neural networks [9, 10, 11, 12]. Yet, to the best of the authors’ knowledge, the quantitative relationship between the compressive power and the numbers of layers and nodes in autoencoders is still unclear.
In this paper, we study the compressive power of autoencoders using a Boolean model of layered neural networks. In this model, each node takes on values that are either 1 (active) or 0 (inactive) and the activation rule for each node is given by a Boolean function. That is, we consider a Boolean network (BN) [13] as a model of a neural network. BNs have been used as a discrete model of genetic networks, for which extensive studies have been done on inference, control, and analysis [14, 15, 16, 17, 18, 19]. It should be noted that BNs have also been used as a discrete model of neural networks in which functions are restricted to be linear Boolean threshold functions whose outputs are determined by comparison of the weighted sum of input values with a threshold [20]. Such a BN is referred to as a Boolean threshold network (BTN) in this paper. Although extensive theoretical studies have been devoted to the representational power of BTNs [20], almost none of them considered BN models of autoencoders, with the notable exception of Baldi’s study of the computational complexity of clustering via autoencoders [4].
Table 1: Summary of the results on perfect encoders and autoencoders.

 | architecture | type | constraint
---|---|---|---
Proposition 4 | | Encoder/Decoder | BN
Theorem 5 | | Encoder | BTN,
Theorem 8 | | Encoder | BN with parity function
Theorem 9 | | Encoder | BTN
Theorem 15 | 4 layers ( nodes) | Encoder | BTN
Theorem 19 | | Encoder/Decoder | BTN
Theorem 21 | | Encoder/Decoder | BTN
Theorem 22 | 7 layers ( nodes) | Encoder/Decoder | BTN
The BTNs we consider in this paper are always layered ones, and we study their compressive power in two settings. The first one focuses on encoding. We are given a set of -dimensional different binary vectors and the task is to find a BTN which maps each to some -dimensional binary vector in such a way that for all . Such a BTN is called a perfect encoder for . The second one involves both encoding and decoding. In this case we are again given a set of -dimensional binary vectors , and the task is to find a BTN consisting of an encoder function and a decoder function satisfying for all . Such a BTN is called a perfect autoencoder for . It is clear from the definition that if is a perfect autoencoder, then is a perfect encoder. In this paper, we are interested in whether or not there exists a BTN for any under the condition that the architecture (i.e., the number of layers and the number of nodes in each layer) is fixed. This setting is reasonable because learning of an autoencoder is usually performed, for a given set of input vectors, by fixing the architecture of the BTN and adjusting the weights of the edges. Therefore, the existence of a perfect BTN for any implies that a BTN can be trained for any so that the required conditions are satisfied if an appropriate set of initial parameters is given.
The results on existence of perfect encoders and autoencoders are summarized in Table 1, in which an entry in the architecture column lists the number of nodes in each layer (from input to output). Note that means throughout the paper. In addition to these positive results, we show a negative result (Theorem 16) stating that for some with , there does not exist a perfect autoencoder consisting of three layers.
2 Problem Definitions
Our first BN model is a three-layered one, see Fig. 1; its definitions can be extended to networks with four or more layers in a straightforward way. Let , , be binary vectors given to the first, second, and third layers, respectively. In this case, the first, second, and third layers correspond to the input, middle, and output layers, respectively. Let and be lists of Boolean functions assigned to nodes in the second and third layers, respectively, which means and . Since our primary interest is in autoencoders, we assume that and are -dimensional binary vectors and is a -dimensional binary vector with . A list of functions is also referred to as a mapping. The th element of a vector will be denoted by , which is also used to denote the node corresponding to this element. Similarly, for each mapping , denotes the th function.
In the following, will always denote a set of -dimensional binary input vectors that are all different.
Definition 1.
A mapping : is called a perfect encoder for if holds for all .
Definition 2.
A pair of mappings with : and : is called a perfect autoencoder if holds for all .

Fig. 1 illustrates the architecture of an autoencoder. As mentioned before, if is a perfect autoencoder, then is a perfect encoder.
Example 3.
Let the given set consist of the four 3-dimensional vectors (0,0,0), (1,0,0), (1,0,1), and (1,1,1), so that the input and output layers have three nodes each and the middle layer has two. Define the encoder and the decoder so that the pair of mappings has the following truth table, which shows it to be a perfect Boolean autoencoder.
x1 | x2 | x3 | y1 | y2 | z1 | z2 | z3
---|---|---|---|---|---|---|---
0 | 0 | 0 | 0 | 0 | 0 | 0 | 0
1 | 0 | 0 | 1 | 0 | 1 | 0 | 0
1 | 0 | 1 | 1 | 1 | 1 | 0 | 1
1 | 1 | 1 | 0 | 1 | 1 | 1 | 1
Proposition 4.
For any , there exists a perfect Boolean autoencoder with .
Proof.
The encoder maps a vector to its index, in binary representation, and the decoder maps the index back to the vector. To implement the idea formally denote by the -dimensional binary vector representing , for and . For example, .
Given define by and . Clearly, is a perfect autoencoder and can be represented by Boolean functions. ∎
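As a concrete illustration of this index-based scheme, here is a minimal sketch in Python; the function names and the example set are our own, and arbitrary Boolean functions are modelled as lookup tables, as the Proposition permits.

```python
# A minimal sketch of the idea in Proposition 4 (names and the example set are ours):
# the encoder maps each vector to the binary representation of its index in the given
# set, and the decoder maps the index back to the vector.
from math import ceil, log2

def index_autoencoder(vectors):
    """vectors: list of distinct binary tuples; returns (encode, decode)."""
    k = max(1, ceil(log2(len(vectors))))                 # middle-layer width
    to_bits = lambda i: tuple((i >> j) & 1 for j in range(k))
    enc = {x: to_bits(i) for i, x in enumerate(vectors)}
    dec = {to_bits(i): x for i, x in enumerate(vectors)}
    return enc.__getitem__, dec.__getitem__

S = [(0, 0, 0), (1, 0, 0), (1, 0, 1), (1, 1, 1)]         # the set from Example 3
encode, decode = index_autoencoder(S)
assert all(decode(encode(x)) == x for x in S)            # perfect autoencoder on S
```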
Note that there does not exist a perfect Boolean autoencoder with because contains different vectors. Note, furthermore, that the Proposition permits the use of arbitrary Boolean functions. In the following, we focus on what is achievable when a neural network model with Boolean threshold functions [20] is used. A function : is called a Boolean threshold function if it is represented as
for some , where is an -dimensional integer vector and is an integer. We will also denote the same function as . If all activation functions in a neural network are Boolean threshold functions, the network is called a Boolean threshold network (BTN). In the following sections, we will consider BTNs with layers where (see Fig. 2). Such a BTN is represented as , where is a list of activation functions for the -th layer. When a BTN is used as an autoencoder, some layer is specified as the middle layer. If the th layer is specified as the middle layer, the middle vector , encoder , and decoder are defined by

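To make these definitions concrete, the following minimal Python sketch (our own notation, not part of the paper) evaluates a Boolean threshold gate and a layered BTN built from such gates; the XOR example shows a function that a single gate cannot compute but a two-layer BTN can.

```python
# A minimal sketch (our own notation) of a Boolean threshold gate and of a layered
# BTN built from such gates: each node outputs 1 iff the weighted sum of its inputs
# reaches its threshold, exactly as in the definition above.
def threshold_gate(weights, theta):
    return lambda x: int(sum(w * xi for w, xi in zip(weights, x)) >= theta)

def btn(layers):
    """layers: list of layers; each layer is a list of (weights, theta) pairs."""
    gate_layers = [[threshold_gate(w, t) for w, t in layer] for layer in layers]
    def run(x):
        for gates in gate_layers:
            x = tuple(g(x) for g in gates)
        return x
    return run

# Example: a two-layer BTN computing XOR of two bits (OR node and AND node feeding
# a final gate); XOR itself is not a threshold function, but depth two suffices.
xor_net = btn([
    [((1, 1), 1), ((1, 1), 2)],      # hidden layer: OR, AND
    [((1, -1), 1)],                  # output gate: OR minus AND >= 1
])
assert [xor_net((a, b))[0] for a, b in ((0, 0), (0, 1), (1, 0), (1, 1))] == [0, 1, 1, 0]
```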
3 How Easy is it to Encode?
This section is devoted to the encoding setting. First we establish, non-constructively, the existence of two encoders for -dimensional vectors: a two-layer BTN that uses an output layer with nodes, and a three-layer BTN with a hidden layer of size and an output layer of size . The third result reduces the number of output nodes further to , using a BTN of depth 4 with hidden nodes; it also lays out the architecture of this BTN. Taken together these results suggest that encoding is relatively easy by itself.
Theorem 5.
Given there exists a perfect encoder that is a two-layer BTN with at most output nodes, where is the maximum number of 1’s that any two vectors in have in common.
Proof.
We use the probabilistic method to establish the existence of a set of -dimensional weight vectors with the property that the two-layer BTN whose threshold functions are , yields output vectors that are all different. Denote by the output vector for , with .
Consider a random such that each has the value or 1 with probability . Let us compute a lower bound on the probability that a given pair of output vectors differs in a specific coordinate. Without loss of generality (w.l.o.g.) we take the vectors to be and , and look at the -th coordinate. Because this probability does not depend on the coordinate, since was chosen at random, let us denote it by . Denote by the number of 1’s in , and by the number of coordinates with value 1 for both and . W.l.o.g. , , all other coordinates being 0.
Consequently , and .
Suppose is even. Then a lower bound on is , where stands for
(2)
By the soon-to-be-stated Lemma 7, . Turning to the estimation of , Lemma 7 implies that if one of is odd then . In the remaining case that both and are even
This expression assumes its smallest value, for , when one of is 0 and the other is 2, in which case its value is . We conclude, therefore, that if the number of 1’s that have in common is even then and .
A similar computation for the case that is odd shows that if the number of 1’s that have in common is odd then .
In summary, the probability that a specific pair of output vectors differs at a specific coordinate is at least
where ,
Consequently, the probability that all coordinates of two -dimensional output vectors are identical is at most . To ensure that the probability that one of the pairs of vectors and is identical is less than 1, it is sufficient to choose such that
i.e. we require . ∎
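The argument is non-constructive, but it is easy to check numerically that random ±1 weights typically work. The following sketch (the width, the threshold 1, and all names are our own choices, not the paper's) samples such a two-layer encoder and verifies that the output vectors are pairwise distinct.

```python
# A small numerical illustration of Theorem 5 (not a proof): with random +/-1 weight
# vectors, a single layer of threshold gates tends to give pairwise-distinct outputs
# once it is wide enough.
import random

def random_threshold_encoder(n, width, rng):
    W = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(width)]
    return lambda x: tuple(int(sum(w * xi for w, xi in zip(row, x)) >= 1) for row in W)

rng = random.Random(0)
n, m = 20, 50
S = [tuple((v >> j) & 1 for j in range(n)) for v in rng.sample(range(2 ** n), m)]

enc = random_threshold_encoder(n, width=200, rng=rng)
codes = [enc(x) for x in S]
print("all outputs distinct:", len(set(codes)) == len(codes))
```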
Remark 6.
In many applications the feature vectors are sparse so that is small and the bound given in Theorem 5 can be lowered. Consider for example the case that the number of 1s is . Then it is sufficient to set .
Here is the statement of the Lemma used in the foregoing proof.
Lemma 7.
Let be random variables, and let .
If is even the following hold.
1. . This implies, using Stirling’s approximation, .
2.
3.
If is odd the following hold.
1. . This implies .
2. .
Proof.
It suffices to note that if is even (odd) then is even (odd), and that . ∎
Next we show that the number of output nodes needed to encode the vectors can be reduced to by adding a layer of hidden nodes to the BTN. To prove this we first present a result of independent interest that uses the power of parity functions.
Theorem 8.
Given there is a two-layer network whose activation functions are parity functions that maps to different vectors of dimension .
Proof.
We again use the probabilistic method to establish the existence of a network with output nodes, provided . Equip each output node , with an activation function constructed as follows: assemble a set of bits, , by adding each one of the input variables to the set with probability , and set to be the parity of the set of values that assigns to the variables in .
Consider an arbitrary fixed output node . Given a pair of input vectors , define
Then if and only if and are either both odd or both even.
If , then , because is randomly selected. Therefore, if and , then the probability that is . If then , since , and .
We conclude that , and the probability that . Therefore the probability that there is some pair for which is smaller than , which is less than 1 if . ∎
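The following small sketch (the width and the names are our own choices) mirrors this construction numerically: each output node computes the parity of a random subset of coordinates, and the resulting codes are checked for distinctness.

```python
# A small numerical illustration of Theorem 8 (not a proof): each output node takes
# the parity of a random subset of input coordinates, each coordinate being included
# with probability 1/2.
import random

def random_parity_encoder(n, width, rng):
    subsets = [[j for j in range(n) if rng.random() < 0.5] for _ in range(width)]
    return lambda x: tuple(sum(x[j] for j in idx) % 2 for idx in subsets)

rng = random.Random(1)
n, m = 16, 40
S = [tuple((v >> j) & 1 for j in range(n)) for v in rng.sample(range(2 ** n), m)]

enc = random_parity_encoder(n, width=40, rng=rng)
codes = [enc(x) for x in S]
print("all parity codes distinct:", len(set(codes)) == len(codes))
```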
Theorem 9.
Given different binary vectors of dimension , there is a three-layer BTN whose activation functions are threshold functions that maps these vectors to different vectors of dimension using hidden nodes.
Proof.
It is well-known that a parity function on nodes can be implemented by threshold functions while using only one additional layer of at most nodes. A simple way of doing this is to use a hidden layer with nodes, , where node has value 1 if the input contains at least 1’s, i.e. , where . The output node that computes the parity of the input has the activation function . The Theorem follows therefore from Theorem 8 when each of the parity functions it uses is implemented this way. ∎
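A minimal sketch of this reduction (our own notation): the hidden gate with threshold t fires iff the input contains at least t ones, and the output gate combines the hidden gates with alternating ±1 weights.

```python
# Parity via one hidden layer of threshold gates, as used in the proof of Theorem 9
# (our notation): hidden gate t fires iff the input has at least t ones, and the
# output gate sums them with alternating +1/-1 weights.
from itertools import product

def parity_via_thresholds(x):
    k = len(x)
    hidden = [int(sum(x) >= t) for t in range(1, k + 1)]                      # h_1, ..., h_k
    signed_sum = sum(h if t % 2 == 0 else -h for t, h in enumerate(hidden))   # +h_1 - h_2 + ...
    return int(signed_sum >= 1)

assert all(parity_via_thresholds(x) == sum(x) % 2 for x in product((0, 1), repeat=6))
```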
We turn now to results that also provide constructions. In one form or another all use the idea of having the encoder map a vector to its index, as done in Proposition 4. That Proposition permits the use of any Boolean mapping, so that the meaning of the index of the vector is of no importance. Here, however, only mappings arising from BTNs are employed, and the meaning of the index, as provided by the following simple observation, is going to be of central importance.
Lemma 10.
cf. [12]. Given there is a vector a such that if . Furthermore, this vector can be assumed to have integer coordinates.
The reason is that the vectors in are all different. We fix , and assume henceforth that is sorted,
(3) |
When a vector is mapped to its index, will be represented either in binary, or by the -dimensional step-vector , for some , defined by
(4) |
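One concrete witness for Lemma 10, with weights of our own choosing (powers of two; many other choices work), together with the sorting assumed in equation (3):

```python
# A concrete witness for Lemma 10 (our choice of weights and names): with
# a = (2^(n-1), ..., 2, 1), the dot product a.x is the integer whose binary expansion
# is x, so distinct vectors receive distinct values. Sorting the given set by a.x
# yields the ordering assumed in equation (3).
def dot(a, x):
    return sum(ai * xi for ai, xi in zip(a, x))

def separating_vector(n):
    return tuple(2 ** (n - 1 - j) for j in range(n))

S = [(1, 0, 1), (0, 0, 0), (1, 1, 1), (1, 0, 0)]
a = separating_vector(3)
assert len({dot(a, x) for x in S}) == len(S)          # all dot products distinct
S_sorted = sorted(S, key=lambda x: dot(a, x))         # the order of equation (3)
```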
Our first result in this direction exemplifies encoding with step-vectors.
Theorem 11.
Given there is a two-layer BTN that maps to .
Proof.
Let the vector of threshold functions for the nodes in the output layer be . Clearly has value 1 if and only if , i.e. . ∎
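A minimal sketch of this two-layer encoder (our own notation): output node j carries the weight vector of Lemma 10 and a threshold equal to the j-th sorted dot product.

```python
# Sketch of the encoder of Theorem 11 (our notation): with S sorted as in equation (3),
# output node j has weight vector a and threshold a.x_j, so input x_i is mapped to a
# step-vector whose first i+1 coordinates are 1.
def dot(a, x):
    return sum(ai * xi for ai, xi in zip(a, x))

def step_encoder(S_sorted, a):
    thresholds = [dot(a, x) for x in S_sorted]
    return lambda x: tuple(int(dot(a, x) >= t) for t in thresholds)

a = (4, 2, 1)                                          # separating vector (Lemma 10)
S_sorted = [(0, 0, 0), (1, 0, 0), (1, 0, 1), (1, 1, 1)]
enc = step_encoder(S_sorted, a)
for i, x in enumerate(S_sorted):
    assert enc(x) == tuple(int(j <= i) for j in range(len(S_sorted)))
```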
We show next that the number of output nodes of the encoder can be reduced to if a hidden layer with only nodes is added. Henceforward we denote .
Theorem 12.
Given satisfying equation (3), there is a three-layer network that maps to where and are such that , with . The network has input nodes , hidden nodes and , and output nodes .
Proof.
For ease of exposition we assume that , and .
The nodes simply copy the input, . To define the activation functions of divide the interval into consecutive subintervals each containing values , i.e. , and . The activation function of is , . Clearly, if then if and only if , i.e. .
The output node copies . The remaining output nodes, , represent the ordinal number of within the -th subinterval. To implement this part of the mapping we follow [21] in employing the ingenious technique of telescopic sums, introduced by [22], and define
The important thing to note here is that if falls in the -th subinterval, so that , then has the value , because , while .
The activation function of is , . Observe that if , then , i.e. if , and if . ∎
It is relatively easy to convert the BTNs constructed in Theorems 11 and 12 into BTNs that map to a set of vectors with the minimum possible dimension, , by adding a layer of that size. The technique for doing so is encapsulated in the following Lemma. Its formulation is slightly more general than strictly needed here in order to serve another purpose later on.
Lemma 13.
Given a set of different -dimensional binary vectors there is a two-layer BTN that maps to .
Proof.
To construct the vector of threshold functions for the output nodes, , set ,
and It is easily verified that , i.e. as desired. ∎
Consider the layer constructed in the proof of the Lemma when is the -dimensional binary representation of and . By adding this layer to the BTN constructed in Theorem 11 we get the following result.
Theorem 14.
Given there is a three-layer BTN with nodes in the second layer that maps to the -dimensional binary representation of , .
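The construction can be sketched as follows (our own notation): the hidden layer holds the step-vector of the index, and the weights of the third layer telescope to the bits of that index, as in Lemma 13.

```python
# Sketch of the three-layer encoder of Theorem 14 (our notation): the hidden layer
# holds the step-vector of the index i (Theorem 11), and each bit of the binary
# representation of i is one threshold gate over the step-vector whose weights
# telescope to that bit (the method of Lemma 13).
from math import ceil, log2

def dot(a, x):
    return sum(ai * xi for ai, xi in zip(a, x))

def bit(i, j):
    return (i >> j) & 1 if i >= 0 else 0

def binary_index_encoder(S_sorted, a):
    m, K = len(S_sorted), max(1, ceil(log2(len(S_sorted))))
    thresholds = [dot(a, x) for x in S_sorted]                       # hidden layer
    W = [[bit(i, j) - bit(i - 1, j) for i in range(m)] for j in range(K)]
    def encode(x):
        s = [int(dot(a, x) >= t) for t in thresholds]                # step-vector
        return tuple(int(dot(w, s) >= 1) for w in W)                 # telescoping gates
    return encode

a = (4, 2, 1)
S_sorted = [(0, 0, 0), (1, 0, 0), (1, 0, 1), (1, 1, 1)]
enc = binary_index_encoder(S_sorted, a)
assert [enc(x) for x in S_sorted] == [(bit(i, 0), bit(i, 1)) for i in range(4)]
```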
Similarly, we can add to the BTN constructed in Theorem 12 a layer that consists of a first part and a second part, both constructed according to the Lemma. The first part connects the nodes to a -dimensional binary counter , and the second part connects the nodes to another -dimensional binary counter . Thus, on input the nodes contain which is mapped to the -dimensional binary representation of in , and the nodes contain which is mapped to the -dimensional binary representation of in . Recalling that , with , it follows that, in this case, and taken together contain the -dimensional binary representation of . This construction establishes the following Theorem.
Theorem 15.
Given there is a four-layer BTN with hidden nodes that maps to the -dimensional binary representation of , .
4 Difficulty of Decoding
Theorem 16.
For each there exists a set of different -dimensional bit-vectors with the following properties:
1. There exists a 2-layer BTN that maps to .
2. There does not exist a 2-layer BTN that maps to .
Proof.
Set and construct as follows:
1. index the coordinates of the vectors by ;
2. set if and only if or is .
A BTN that encodes these vectors with bits is constructed as follows. The most significant bit is generated by the threshold function
The next most significant bit is generated by
The successive digits are generated in similar fashion, each time using a sum of input bits, until at last the least significant digit is generated by the slightly different threshold function
It is not difficult to verify that this 2-layer BTN maps to the -bit representation of .
To illustrate, consider the case . These vectors are encoded by the threshold functions (most significant digit) and (least significant digit)
For example, since in the coordinates are 1 it has , i.e. its representation is . Similarly, since whereas , , so that has the representation .
To prove the second assertion, suppose to the contrary that there does exist a BTN mapping to , and that it maps to and to . Consider the threshold function for the bit of the output layer. Since by construction while for all , it follows that the threshold function for this bit separates the two -dimensional vectors and from the remainder of , a contradiction to the well-known fact that such a threshold function does not exist. ∎
Corollary 17.
In general, there may not exist an autoencoding three-layer BTN if the middle layer has only nodes.
5 Perfect Encoding and Decoding
In this section we construct four autoencoders by adding decoders to the encoders constructed in Theorems 11, 12, 14, and 15. The autoencoder based on Theorem 11 is described in the next Theorem.
Theorem 18.
Given satisfying equation (3), there is a three-layer perfect BTN autoencoder with nodes in the middle layer.
Proof.
To obtain an autoencoder based on Theorem 12 will take more doing.
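Before turning to that, here is a minimal sketch of one possible realization (our own notation, not necessarily the paper's construction) of the kind of three-layer autoencoder asserted by Theorem 18: the step-vector encoder of Theorem 11 followed by a decoding layer whose telescoping weights recover each coordinate of the input.

```python
# One possible realization (our notation) of a three-layer autoencoder as in
# Theorem 18: the step-vector encoder of Theorem 11, plus a decoding layer in which
# output gate k has telescoping weights, so its weighted sum over the step-vector of
# index i equals coordinate k of the i-th vector.
def dot(a, x):
    return sum(ai * xi for ai, xi in zip(a, x))

def step_autoencoder(S_sorted, a):
    m, n = len(S_sorted), len(S_sorted[0])
    thresholds = [dot(a, x) for x in S_sorted]
    W = [[S_sorted[i][k] - (S_sorted[i - 1][k] if i > 0 else 0) for i in range(m)]
         for k in range(n)]
    encode = lambda x: tuple(int(dot(a, x) >= t) for t in thresholds)
    decode = lambda s: tuple(int(dot(w, s) >= 1) for w in W)
    return encode, decode

a = (4, 2, 1)
S_sorted = [(0, 0, 0), (1, 0, 0), (1, 0, 1), (1, 1, 1)]
encode, decode = step_autoencoder(S_sorted, a)
assert all(decode(encode(x)) == x for x in S_sorted)
```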
Theorem 19.
Let . Given satisfying equation (3), there is a five-layer BTN perfect autoencoder with nodes in the second layer, nodes in the middle hidden layer, and nodes in the fourth layer.
Proof.
On top of the encoder constructed in the proof of Theorem 12 we add a decoder that is a 3-layer network with the input nodes , hidden nodes , and output nodes , see Fig. 3. To simplify the description we add a dummy node .

We will see that this decoder outputs when given input , as desired. Equip the node with the activation function
(5)
Note that if then
Note further that if then unless . Hence, when , the value of is if and 0 otherwise, according to the activation function (5).
To complete the definition of the network we equip node with the activation function
It follows that when the network is given the input , as specified in the Theorem, has value 1 if and only if , i.e. . ∎
We will now show how to replace the middle layer of each of the autoencoders of Theorems 18 and 19 with a three-layer autoencoder whose middle layer has the minimum possible dimension, . Observe that the middle layer of the autoencoder constructed in the proof of Theorem 18 has nodes and that the only values that this layer assumes are . The middle layer of the autoencoder constructed in the proof of Theorem 19 has two sets of nodes and the only values that each of these sets can assume are . In each case, therefore, the appropriate three-layer autoencoder to replace the middle layer can be constructed from one or two autoencoders on nodes which only assume the values .
Lemma 20.
There is a three-layer BTN that autoencodes the set of vectors and has a middle layer of size .
Proof.
For ease of exposition we assume that . We construct a BTN with input nodes , hidden nodes , and output nodes .
On input the middle layer contains the binary representation of . The decoding layer is therefore straightforward: the activation function for is .
The activation functions for can be computed by the method of Lemma 13.
Note that for all inputs , since by definition , and the activation function of is . ∎
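A minimal sketch in the spirit of this construction (our own notation and indexing; not necessarily identical to the paper's weights):

```python
# A three-layer autoencoder for step-vectors in the spirit of Lemma 20 (our notation):
# the middle layer holds the binary representation of the index, computed with
# telescoping weights as in Lemma 13; decoding node i fires iff the integer encoded
# in the middle layer is at least i (weights 2^j, threshold i).
from math import ceil, log2

def bit(i, j):
    return (i >> j) & 1 if i >= 0 else 0

def step_vector(i, m):
    return tuple(int(t <= i) for t in range(m))

def step_vector_autoencoder(m):
    K = max(1, ceil(log2(m)))
    enc_W = [[bit(i, j) - bit(i - 1, j) for i in range(m)] for j in range(K)]
    def encode(s):
        return tuple(int(sum(w * si for w, si in zip(row, s)) >= 1) for row in enc_W)
    def decode(b):
        value = sum(bj * (2 ** j) for j, bj in enumerate(b))
        return tuple(int(value >= i) for i in range(m))
    return encode, decode

m = 6
encode, decode = step_vector_autoencoder(m)
assert all(decode(encode(step_vector(i, m))) == step_vector(i, m) for i in range(m))
```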
Replacing the middle layers of the autoencoders constructed in Theorems 18 and 19 with the appropriate three-layer BTN constructed according to the Lemma yields the following two results.
Theorem 21.
Given satisfying equation (3), there is a five-layer perfect BTN autoencoder with nodes in the second and fourth layer and nodes in the middle layer.
Theorem 22.
Given satisfying equation (3), there is a seven-layer perfect BTN autoencoder with nodes in the middle hidden layer, and nodes in the other hidden layers.
6 Conclusion
In this paper, we have studied the compressive power of autoencoders mainly within the Boolean threshold network model by exploring the existence of encoders and of autoencoders that map distinct input vectors to distinct vectors of lower dimension. It should be noted that our results are not necessarily optimal, except for Proposition 4. The establishment of lower bounds and the reduction of the number of layers or nodes are left as open problems.
Although we focused on the existence of injective mappings, preservation of distances is another important factor in dimensionality reduction. Therefore, it would be interesting and useful to study autoencoders that approximately preserve the distances (e.g., the Hamming distance) between input vectors.
Another important direction is to use more practical models of neural networks, such as those with sigmoid functions and rectified linear unit (ReLU) functions. In such cases, real-valued vectors need to be handled, and thus preservation of the distances between vectors should also be analyzed.
References
- [1] D. H. Ackley, G. H. Hinton, and T. J. Sejnowski, “A learning algorithm for Boltzmann machines,” Cognitive Science, vol. 9, pp. 147–169, 1985.
- [2] P. Baldi and K. Hornik, “Neural networks and principal component analysis: Learning from examples without local minima,” Neural Networks, vol. 2, pp. 53–58, 1989.
- [3] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of data with neural networks,” Science, vol. 313, pp. 504–507, Jul. 2006.
- [4] P. Baldi, “Autoencoders, unsupervised learning, and deep architectures,” JMLR: Workshop and Conference Proceedings, vol. 27, pp. 37–50, 2012.
- [5] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” arXiv:1312.6114, 2013.
- [6] C. Doersch, “Tutorial on variational autoencoders,” arXiv:1606.05908, 2016.
- [7] M. Tschannen, O. Bachem, and M. Lucic, “Recent advances in autoencoder-based representation learning,” arXiv:1812.05069, 2018.
- [8] R. Gómez-Bombarelli, J. N. Wei, D. Duvenaud, J. M. Hernández-Lobato, et al., “Automatic chemical design using a data-driven continuous representation of molecules,” ACS Central Science, vol. 4, no. 2, pp. 268–276, 2018.
- [9] O. Delalleau and Y. Bengio, “Shallow vs. deep sum-product networks,” in Advances in Neural Information Processing Systems, pp. 666–674, 2011.
- [10] G. F. Montufar, R. Pascanu, K. Cho, and Y. Bengio, “On the number of linear regions of deep neural networks,” in Advances in Neural Information Processing Systems, pp. 2924–2932, 2014.
- [11] S. An, M. Bennamoun, and F. Boussaid, “On the compressive power of deep rectifier networks for high resolution representation of class boundaries,” arXiv:1708.07244, 2017.
- [12] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learning requires rethinking generalization,” in Proc. ICLR 2017 (arXiv:1611.03530), 2017.
- [13] S. A. Kauffman, “Metabolic stability and epigenesis in randomly constructed genetic nets,” J. Theoret. Biol., vol. 22, no. 3, pp. 437–467, Mar. 1969.
- [14] T. Akutsu, Algorithms for Analysis, Inference, and Control of Boolean Networks. Singapore: World Scientific, 2018.
- [15] D. Cheng, H. Qi, and Z. Li, Analysis and Control of Boolean Networks: A Semi-tensor Product Approach. London, U.K.: Springer, 2011.
- [16] F. Li, “Pinning control design for the stabilization of Boolean networks,” IEEE Trans. Neural Netw. Learning Syst., vol. 27, no. 7, pp. 1585–1590, 2016.
- [17] Y. Liu, L. Sun, J. Lu and J. Liang, “Feedback controller design for the synchronization of Boolean control networks,” IEEE Trans. Neural Netw. Learning Syst., vol. 27, no. 9, pp. 1991–1996, 2016.
- [18] J. Lu, H. Li, Y. Liu, and F. Li, “Survey on semi-tensor product method with its applications in logical networks and other finite-valued systems,” IET Control Theory & Applications, vol. 11, no. 13, pp. 2040–2047, Aug. 2017.
- [19] Y. Zhao, B. K. Ghosh and D. Cheng, “Control of large-scale Boolean networks via network aggregation,” IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 7, pp. 1527–1536, 2016.
- [20] M. Anthony, Discrete Mathematics of Neural Networks, Selected Topics. Philadelphia, PA, USA: SIAM, 2001.
- [21] K. Y. Siu, V. P. Roychowdhury, and T. Kailath, “Depth-size tradeoffs for neural computation,” IEEE Transactions on Computers, vol. 40, no. 12, pp. 1402–1412, 1991.
- [22] R. Minnick, “Linear-input logic,” IEEE Trans. Electron. Comput., vol. EC-10, pp. 6–16, 1961.