
Report on Learning Rules for Discrete-valued Networks:
Unbiased Weight Maximization

Stephen Chung
University of Massachusetts Amherst, Department of Computer Science
[email protected]
(March 2022)

1 Overview

A biologically plausible method for training an Artificial Neural Network (ANN) involves treating each unit as a stochastic Reinforcement Learning (RL) agent, thereby considering the network as a team of agents. Consequently, all units can learn via REINFORCE, a local learning rule modulated by a global reward signal, which aligns more closely with biologically observed forms of synaptic plasticity [1, Chapter 15]. Nevertheless, this learning method is often slow and scales poorly with network size due to inefficient structural credit assignment, since a single reward signal is broadcast to all units without considering individual contributions.

Weight Maximization [2], a proposed solution, replaces a unit's reward signal with the norm of its outgoing weight, so that each hidden unit maximizes the norm of its outgoing weight instead of the global reward signal. In this research report, we analyze the theoretical properties of Weight Maximization and propose a variant, Unbiased Weight Maximization. This new approach provides an unbiased learning rule that increases learning speed and improves asymptotic performance. Notably, to our knowledge, this is the first learning rule for a network of Bernoulli-logistic units that is unbiased and scales well with the number of units in the network in terms of learning speed.

2 A single Bernoulli-logistic unit

We first focus on the learning rules for a single Bernoulli-logistic unit with no inputs¹: denoting the activation value of the unit by $H$, the sigmoid function by $\sigma$, and the scalar bias (the parameter to be learned; not to be confused with the estimation bias of an estimator) by $b$, the distribution of $H$ is given by $\Pr(H = h) = \pi(h; b) = \sigma((2h - 1)b)$, where $h \in \{0, 1\}$. After sampling $H$, a scalar reward $R$ is sampled from a distribution conditioned on $H$. We are interested in learning $b$ such that the expected reward $\mathbb{E}[R]$ is maximized. We also denote the conditional expectation $\mathbb{E}_R[R \mid H = h]$ by $r(h)$, so $\mathbb{E}[R] = \mathbb{E}_H[\mathbb{E}_R[R \mid H]] = \mathbb{E}[r(H)]$.

¹ To generalize the learning rules to a unit with inputs, we can multiply the update of the bias by the input to compute the update of the weight associated with that input.

Our goal is to compute or estimate $\nabla_b \mathbb{E}_H[r(H)]$, the gradient of the expected reward w.r.t. the parameter $b$, so as to perform gradient ascent on the expected reward.

A direct computation of the gradient gives:

$\nabla_b \mathbb{E}[r(H)] = \nabla_b(\sigma(-b)\,r(0) + \sigma(b)\,r(1))$  (1)
$= (r(1) - r(0))\,\sigma(b)(1 - \sigma(b)).$  (2)

That is, the gradient is positive iff $r(1) > r(0)$, since $\sigma(b)(1 - \sigma(b))$ is always positive. In other words, if the expected reward for firing (i.e. outputting $1$) is larger than the expected reward for not firing (i.e. outputting $0$), then the unit should fire more, which is achieved by increasing $b$. In fact, we see that the optimal $b^* = \arg\max_b \mathbb{E}[r(H)]$ is infinitely positive if $r(1) > r(0)$ and infinitely negative if $r(1) < r(0)$. However, $r(1) - r(0)$ is generally not known.

2.1 A multi-layer network of Bernoulli-logistic units

Before discussing the learning rules, we consider some properties of $r(h)$ when the unit is one of many units in a multi-layer network of Bernoulli-logistic units. In a multi-layer network, units are grouped into different layers. We call the last layer the output layer and the other layers hidden layers. Units on the output layer are called output units, and units on hidden layers are called hidden units. Activation values of the hidden units are passed to the next layer as inputs, and activation values of the output units are passed to the external environment and determine the reward distribution.

Therefore, if the unit is a hidden unit in a multi-layer network of Bernoulli-logistic units, $H$ is passed to the units on the next layer to determine their activation values $D_1, D_2, \ldots, D_m$ by $\Pr(D_i = d \mid H = h) = \sigma((2d - 1)(v_i h + c_i))$, where $v_i$ is the weight connecting the unit to the next layer's unit $i$, $c_i$ is the bias of the next layer's unit $i$, $d \in \{0, 1\}$, $i \in \{1, 2, \ldots, m\}$, and $m$ is the number of units on the next layer. We call the vector $v = [v_1, \ldots, v_m]$ the outgoing weight, and the next layer's units the outgoing units.

From this, we see that $r(h)$ can be written as some differentiable function $g(v_1 h + c_1, \ldots, v_m h + c_m)$:

$r(h)$  (3)
$= \mathbb{E}[R \mid H = h]$  (4)
$= \sum_{d_{1:m}} \Pr(D_{1:m} = d_{1:m} \mid H = h)\, \mathbb{E}[R \mid D_{1:m} = d_{1:m}, H = h]$  (5)
$= \sum_{d_{1:m}} \Pr(D_{1:m} = d_{1:m} \mid H = h)\, \mathbb{E}[R \mid D_{1:m} = d_{1:m}]$  (6)
$= \sum_{d_{1:m}} \sigma((2d_1 - 1)(v_1 h + c_1)) \cdots \sigma((2d_m - 1)(v_m h + c_m))\, \mathbb{E}[R \mid D_{1:m} = d_{1:m}]$  (7)
$=: g(v_1 h + c_1, \ldots, v_m h + c_m),$  (8)

where (6) uses the assumption that $R$ is independent of $H$ when conditioned on $D_{1:m}$, since the network has a multi-layer structure.

Note that for $h \in [0, 1]$, $g$ is a differentiable function of $h$, so $g$ gives a differentiable extension of $r(h)$ to $h \in [0, 1]$. We call this extension the natural extension of $r(h)$; it extends $r(h)$ by allowing the unit to pass any value in $[0, 1]$, instead of only values in $\{0, 1\}$, to its outgoing units. Though differentiable, the natural extension can be highly non-linear, as shown in Figure 1(a). The natural extension of $r(h)$, which is only applicable to hidden units, is useful as it opens up the possibility of learning rules that are not available to output units. For example, we may have some estimates of $r'(0)$ or $r'(1)$ from training the outgoing units, which can be used to approximate $r(1) - r(0)$ by a first-order Taylor approximation.

We are now ready to discuss five different estimates of $\nabla_b \mathbb{E}[r(H)]$.

2.2 REINFORCE

The gradient can be computed as:

$\nabla_b \mathbb{E}[r(H)] = \mathbb{E}[r(H)\, \nabla_b \log \Pr(H)]$  (9)
$= \mathbb{E}[r(H)(H - \sigma(b))].$  (10)

Suppose that we have an unbiased estimate of $r(h)$ for any $h$, denoted by $\hat{r}(h)$; we can then estimate the gradient by:

$\Delta_{rei} b = \hat{r}(H)(H - \sigma(b)),$  (11)

which is an unbiased estimate of the gradient:

$\mathbb{E}[\Delta_{rei} b] = \mathbb{E}_H[\mathbb{E}[\hat{r}(H) \mid H](H - \sigma(b))]$  (12)
$= \mathbb{E}[r(H)(H - \sigma(b))]$  (13)
$= \nabla_b \mathbb{E}[r(H)].$  (14)

Note that the $R$ sampled (conditioned on $H$) is an unbiased estimate of $r(H)$, so $R(H - \sigma(b))$ is an unbiased estimate of the gradient, and this becomes the REINFORCE learning rule [3].

However, if $r(0)$ and $r(1)$ are very close, then the expected value of $\Delta_{rei} b$ is close to zero, as shown by (2). Combined with the variance of $\hat{r}(H)$, this makes the sign of $\Delta_{rei} b$ almost random, so it takes many steps of gradient ascent to reach $b^*$. This is indeed the case when the unit is a hidden unit in a large network of Bernoulli-logistic units.
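As a concrete illustration, the sketch below (hypothetical code, not taken from the paper) applies the REINFORCE update (11) to a single unit with a toy reward in which $r(1) > r(0)$; gradient ascent then drives $b$ toward large positive values, as predicted by (2). The reward function and step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def reinforce_step(b, reward_fn, lr=0.1):
    """One REINFORCE update for a single Bernoulli-logistic unit, eq. (11)."""
    p = 1.0 / (1.0 + np.exp(-b))        # sigma(b) = Pr(H = 1)
    h = float(rng.random() < p)         # sample H ~ Bernoulli(sigma(b))
    r = reward_fn(h)                    # sample R conditioned on H
    return b + lr * r * (h - p)         # Delta_rei b = R (H - sigma(b))

# Toy reward: r(1) = 1 and r(0) = -1 plus noise, so r(1) > r(0) and b should grow.
b = 0.0
for _ in range(2000):
    b = reinforce_step(b, lambda h: (2 * h - 1) + rng.normal(scale=0.5))
print(b)  # b drifts toward large positive values, i.e. Pr(H = 1) approaches 1
```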

2.3 STE Backprop

Suppose that (i) we can extend $r(h)$ to a differentiable function, and (ii) we have an estimate of $r'(h)$ for any $h$ ($f'$ denotes the derivative of $f$), denoted by $\hat{r}'(h)$; then we can apply STE backprop [4], a common method for training Bernoulli-logistic units ($1\{\cdot\}$ denotes the indicator function and $U$ denotes a uniformly distributed and independent random variable):

$\nabla_b \mathbb{E}[r(H)] = \nabla_b \mathbb{E}_U[r(1\{\sigma(b) > U\})]$  (15)
$= \mathbb{E}_U[\nabla_b\, r(1\{\sigma(b) > U\})]$  (16)
$= \mathbb{E}_U[r'(1\{\sigma(b) > U\})\, \nabla_b 1\{\sigma(b) > U\}]$  (17)
$\approx \mathbb{E}_U[r'(1\{\sigma(b) > U\})\, \nabla_b \sigma(b)]$  (18)
$= \mathbb{E}_U[r'(1\{\sigma(b) > U\})\, \sigma(b)(1 - \sigma(b))]$  (19)
$= \mathbb{E}[r'(H)]\, \sigma(b)(1 - \sigma(b)),$  (20)

where (15) uses the fact that $1\{\sigma(b) > U\}$ has the same distribution as $H$, and (18) uses the approximation $\nabla_x 1\{x > u\} \approx 1$ of STE backprop (note that the actual derivative of $f(x) = 1\{x > u\}$ is zero for $x \neq u$ and does not exist at $x = u$).

Thus, the estimate of the gradient by STE backprop is given by:

$\Delta_{ste} b = \hat{r}'(H)\, \sigma(b)(1 - \sigma(b)).$  (21)

We observe that the form of $\Delta_{ste} b$ is very similar to (2). Intuitively, STE backprop uses $\hat{r}'(H)$, i.e. the estimated derivative of $r$ at $h = H$, to approximate $r(1) - r(0)$. If $r$ is a linear function and $\hat{r}'(H)$ is unbiased, then $\mathbb{E}[\hat{r}'(H)] = \mathbb{E}[r'(H)] = r(1) - r(0)$ and the estimate $\Delta_{ste} b$ is unbiased. However, if $r$ is a non-linear function, it is possible that $\Delta_{ste} b$ always has the wrong sign, so that $b$ converges to a value that minimizes $\mathbb{E}[R]$ instead. This is illustrated in Figure 1(b): for the second $r(h)$, the slopes at $h = 0$ and $h = 1$ are both positive, so the approximated $r(1) - r(0)$ is positive. However, the true $r(1) - r(0)$ is negative, and $\Delta_{ste} b$ always has the wrong sign in this case.

The reason for this approximation error is that increasing $b$ by a small amount increases the probability weight placed on $r(1)$ and decreases the probability weight placed on $r(0)$ by a small amount; it does not increase $h$ by a small amount (we cannot 'move' along the curve $r(h)$). The latter case corresponds to $H = \sigma(b)$, i.e. the activation value of the unit being deterministic and equal to $\sigma(b)$. In fact, backprop applied to a deterministic ANN with sigmoid activation functions gives almost the same parameter update as STE backprop applied to an ANN of Bernoulli-logistic units. In other words, STE backprop treats the network as if the hidden units output expected values instead of sampled values.
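To make this failure mode concrete, the sketch below uses a hypothetical non-linear extension $r(h)$ chosen so that $r'(0), r'(1) > 0$ while $r(1) - r(0) < 0$, mirroring $r_2(h)$ in Figure 1(b); the specific function is an illustrative assumption. The STE estimate (21) then has the wrong sign for either value of $H$, while the exact gradient (2) is negative.

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical non-linear extension: r(1) - r(0) = -1.5 < 0, yet r'(0) = r'(1) = 0.5 > 0.
r  = lambda h: 0.5 * h + np.cos(np.pi * h)
dr = lambda h: 0.5 - np.pi * np.sin(np.pi * h)

b = 0.3
exact = (r(1) - r(0)) * sigma(b) * (1 - sigma(b))          # eq. (2): negative
h = float(np.random.default_rng(0).random() < sigma(b))    # sample H
ste = dr(h) * sigma(b) * (1 - sigma(b))                    # eq. (21): positive for H = 0 or 1

print(exact, ste)
```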

Figure 1: (a) The natural extension of $r(h)$ in the form of (7). Eight outgoing units ($m = 8$) are used and all parameters ($v$, $c$, $\mathbb{E}[R \mid D_{1:m} = d_{1:m}]$) are randomly sampled; four resulting $r(h)$ are shown on the graph. (b) Two different extended $r(h)$. If the extension is linear (as shown in $r_1(h)$), then the approximation of $r(1) - r(0)$ by $r'(0)$ or $r'(1)$ is exact. But if the extension is non-linear (as shown in $r_2(h)$), then the approximation of $r(1) - r(0)$ by $r'(0)$ or $r'(1)$ can have the wrong sign.

2.3.1 Natural Extension

The two assumptions of having an extension and an estimate $\hat{r}'(h)$ can be satisfied by using the natural extension $g$ defined in (8) if the unit is a hidden unit in a network. To estimate $r'(h)$, we observe:

$r'(h) = \nabla_h\, g(v_1 h + c_1, v_2 h + c_2, \ldots, v_m h + c_m)$  (22)
$= \sum_{i=1}^m v_i\, \nabla_i\, g(v_1 h + c_1, v_2 h + c_2, \ldots, v_m h + c_m)$  (23)
$= \sum_{i=1}^m v_i\, \nabla_{c_i} \mathbb{E}[R \mid H = h].$  (24)

Assuming the outgoing units are also learning by adding some estimate of $\nabla_{c_i} \mathbb{E}[R \mid H]$ to their biases $c_i$, we can use the change in $c_i$ as an estimate of $\nabla_{c_i} \mathbb{E}[R \mid H]$ and plug it into the above formula to obtain $\hat{r}'(H)$. The estimate of $\nabla_{c_i} \mathbb{E}[R \mid H]$ for the outgoing units can be obtained by STE backprop if the outgoing units are hidden units, and by REINFORCE (11) or direct computation (2) if the outgoing units are output units.

However, the natural extension of $r$ is not linear in most cases, so STE backprop gives a biased estimate. From the Taylor expansion of $g$ in (8), the estimation bias is of the order of $\|v\|^2$, the squared norm of the outgoing weight. Note that this estimation bias accumulates across layers, as $\hat{r}'(H)$ is no longer an unbiased estimator of $r'(H)$ for units whose outgoing units are themselves hidden units.

2.4 Weight Maximization

Neither the parameter update given by REINFORCE nor that given by STE backprop is necessarily zero when the unit is not firing. A desirable property of a learning rule is that the unit's parameters are not updated when the unit is not firing. This can be achieved by using $r(0)$ as a baseline in REINFORCE:

$\nabla_b \mathbb{E}[r(H)] = \mathbb{E}[(r(H) - r(0))(H - \sigma(b))].$  (25)

Note that $(R - r(0))H$ is an unbiased estimate of $r(H) - r(0)$ conditioned on $H$: if $H = 0$, then $\mathbb{E}[(R - r(0))H \mid H = 0] = 0 = r(H) - r(0)$; if $H = 1$, then $\mathbb{E}[(R - r(0))H \mid H = 1] = \mathbb{E}[R \mid H = 1] - r(0) = r(H) - r(0)$. Thus, an unbiased estimate of the gradient is:

$\Delta b = (R - r(0))\,H\,(H - \sigma(b)).$  (26)

However, the above estimate cannot be computed since $r(0)$ is not known in general. Suppose that (i) we can extend $r(h)$ to a differentiable function, and (ii) we have an estimate of $r'(h)$ for any $h$, denoted by $\hat{r}'(h)$; then we may approximate $(R - r(0))H$ by $\hat{r}'(H)H$, leading to the following estimate of the gradient:

$\Delta_{wm} b = \hat{r}'(H)\,H\,(H - \sigma(b)).$  (27)

Compared to REINFORCE, Weight Maximization replaces the global reward $R$ with the individual reward $\hat{R} := \hat{r}'(H)H$, which equals zero when the unit is not firing and equals an approximation of $r(1) - r(0)$ by $r'(1)$ when the unit is firing. As with STE backprop, this approximation requires $r$ to be linear, which does not hold in most cases.
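The layer-level form of this rule is sketched below under some illustrative assumptions: a fully connected layer whose sampled activations are the vector h, with outgoing weight matrix V, outgoing biases c, sampled outgoing activations d, and outgoing individual rewards R_next (all names are hypothetical). Each unit's individual reward is its activation times the sum of the REINFORCE-style updates its outgoing weights would receive, matching the Weight Max row of Table 1.

```python
import numpy as np

sigma = lambda x: 1.0 / (1.0 + np.exp(-x))

def weight_max_rewards(h, V, c, d, R_next):
    """Individual rewards under Weight Maximization:
    R_hat_j = sum_i R_next_i * h_j * V[i, j] * (d_i - sigma(a_i)).

    h: this layer's sampled activations (n,); V: outgoing weights (m, n);
    c: outgoing biases (m,); d: sampled outgoing activations (m,);
    R_next: individual rewards of the outgoing units (m,)."""
    a = V @ h + c                            # pre-activations of the outgoing units
    g = R_next * (d - sigma(a))              # per-outgoing-unit REINFORCE factor
    return h * (V.T @ g)                     # one individual reward per unit in this layer

def weight_max_bias_update(b, R_hat, h, lr=0.005):
    """Bias update Delta_wm b = R_hat * (H - sigma(b)) for each unit, eq. (27)."""
    return b + lr * R_hat * (h - sigma(b))
```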

2.4.1 Natural Extension

The two assumptions of having an extension and an estimate $\hat{r}'(h)$ can be satisfied by using the natural extension defined in (8) when the unit is a hidden unit in a network. We can estimate $r'(h)h$ by:

$r'(h)h = h\, \nabla_h\, g(v_1 h + c_1, v_2 h + c_2, \ldots, v_m h + c_m)$  (28)
$= \sum_{i=1}^m v_i h\, \nabla_i\, g(v_1 h + c_1, v_2 h + c_2, \ldots, v_m h + c_m)$  (29)
$= \sum_{i=1}^m v_i\, \nabla_{v_i} \mathbb{E}[R \mid H = h].$  (30)

Assuming the outgoing units are also learning by adding some estimate of $\nabla_{v_i} \mathbb{E}[R \mid H]$ to their weights $v_i$, we can use the change in $v_i$ as an estimate of $\nabla_{v_i} \mathbb{E}[R \mid H]$ and plug it into the above formula to obtain $\hat{r}'(H)H$. This leads to the alternative perspective of maximizing outgoing weights in Weight Maximization [2].

However, the natural extension of $r$ is not linear in most cases, so Weight Maximization gives a biased estimate. From the Taylor expansion of $g$ in (8), the estimation bias is of the order of $\|v\|^2$, the squared norm of the outgoing weight. Note that this estimation bias accumulates across layers, as $\hat{r}'(H)$ is no longer an unbiased estimator of $r'(H)$ for units whose outgoing units are themselves hidden units.

2.5 High-order Weight Maximization

Both STE backprop and Weight Maximization assume the extended $r$ to be linear. In particular, Weight Maximization uses $r'(1)$ to estimate $r(1) - r(0)$, which is a first-order Taylor approximation. We may use a $p$-order Taylor approximation to get a better estimate (where $p \geq 1$):

$r(1) - r(0) \approx r^{(1)}(1) - \frac{1}{2!} r^{(2)}(1) + \frac{1}{3!} r^{(3)}(1) - \cdots + (-1)^{p+1} \frac{1}{p!} r^{(p)}(1),$  (31)

where $r^{(p)}(h)$ denotes the $p$-order derivative of $r$. With this new approximation, (27) is generalized to:

$\Delta_{wm\text{-}p} b = \left(\sum_{k=1}^p (-1)^{k+1} \frac{1}{k!} \hat{r}^{(k)}(H)\right) H (H - \sigma(b)).$  (32)

This gives the parameter update for $p$-order Weight Maximization, which replaces the global reward $R$ in REINFORCE by the individual reward:

$\hat{R} := \left(\sum_{k=1}^p (-1)^{k+1} \frac{1}{k!} \hat{r}^{(k)}(H)\right) H.$  (33)

We call the learning rule of updating $b$ by $\Delta_{wm\text{-}p} b$ in (32) $p$-order Weight Maximization. Note that first-order Weight Maximization is equivalent to the Weight Maximization discussed in Section 2.4.

Computing $\hat{R}$ requires estimates of $r^{(k)}(h)$ for $k = 1, 2, \ldots, p$. We discuss how to obtain these estimates next.

2.5.1 Natural Extension

Again, we use the natural extension of $r$. Let us first consider $r^{(1)}(h)$:

$r^{(1)}(h)$  (34)
$= \nabla_h \mathbb{E}[R \mid H = h]$  (35)
$= \nabla_h \sum_{d_{1:m}} \Pr(D_{1:m} = d_{1:m} \mid H = h)\, \mathbb{E}[R \mid D_{1:m} = d_{1:m}]$  (36)
$= \sum_{d_{1:m}} \Pr(D_{1:m} = d_{1:m} \mid H = h)\, \mathbb{E}[R \mid D_{1:m} = d_{1:m}]\, \nabla_h \log \Pr(D_{1:m} = d_{1:m} \mid H = h)$  (37)
$= \mathbb{E}[R\, \nabla_h \log \Pr(D_{1:m} \mid H = h) \mid H = h]$  (38)
$= \mathbb{E}\left[\sum_{i=1}^m R v_i (D_i - \sigma(v_i h + c_i)) \,\Big|\, H = h\right].$  (39)

Thus, an unbiased estimate of $r^{(1)}(h)$ is $\sum_{i=1}^m R v_i (D_i - \sigma(v_i h + c_i))$. If the outgoing units are hidden units and are trained by Weight Maximization, each receives a different reward $\hat{R}_i$ that can be used to replace the global reward $R$ in the estimate. If the outgoing units are output units, we let $\hat{R}_i = R$. Therefore, we have²:

$\mathbb{E}\left[\sum_{i=1}^m R v_i (D_i - \sigma(v_i h + c_i)) \,\Big|\, H = h\right] = \mathbb{E}\left[\sum_{i=1}^m \hat{R}_i v_i (D_i - \sigma(v_i h + c_i)) \,\Big|\, H = h\right].$  (40)

² For simplicity, let us first assume $\nabla_{c_i} \mathbb{E}[\hat{R}_i \mid H = h] = \nabla_{c_i} \mathbb{E}[R \mid H = h]$; i.e. we ignore the approximation error of Weight Maximization applied to the outgoing units. We can still use the global reward $R$ if the outgoing units are hidden units, but this leads to a higher variance.

Note that $\hat{R}_i (D_i - \sigma(v_i h + c_i))$ is also the update to the weight $v_i$ of outgoing unit $i$. To simplify notation, we define the following terms:

$s^{(k)} = \begin{cases} \sum_{i=1}^m \hat{R}_i v_i (D_i - \sigma(v_i h + c_i)), & \text{if } k = 1 \\ \sum_{i=1}^m \hat{R}_i v_i^k (-\sigma^{(k-1)}(v_i h + c_i)), & \text{if } k = 2, 3, \ldots \end{cases}$  (41)

$t^{(k)} = \begin{cases} \sum_{i=1}^m v_i (D_i - \sigma(v_i h + c_i)), & \text{if } k = 1 \\ \sum_{i=1}^m v_i^k (-\sigma^{(k-1)}(v_i h + c_i)), & \text{if } k = 2, 3, \ldots \end{cases}$  (42)

We see that $\hat{r}^{(1)}(h) := s^{(1)}$ is an unbiased estimator of $r^{(1)}(h)$. Note that $\nabla_h s^{(k)} = s^{(k+1)}$ and $\nabla_h t^{(k)} = t^{(k+1)}$. Next, consider $r^{(2)}(h)$:

$r^{(2)}(h) = \nabla_h r^{(1)}(h)$  (43)
$= \nabla_h \mathbb{E}[s^{(1)} \mid H = h]$  (44)
$= \mathbb{E}[\nabla_h s^{(1)} + s^{(1)} \nabla_h \log \Pr(D_{1:m} \mid H = h) \mid H = h]$  (45)
$= \mathbb{E}[s^{(2)} + s^{(1)} t^{(1)} \mid H = h].$  (46)

Therefore, $\hat{r}^{(2)}(h) := s^{(2)} + s^{(1)} t^{(1)}$ is an unbiased estimator of $r^{(2)}(h)$. Next, consider $r^{(3)}(h)$:

$r^{(3)}(h) = \nabla_h r^{(2)}(h)$  (47)
$= \nabla_h \mathbb{E}[s^{(2)} + s^{(1)} t^{(1)} \mid H = h]$  (48)
$= \mathbb{E}[\nabla_h (s^{(2)} + s^{(1)} t^{(1)}) + (s^{(2)} + s^{(1)} t^{(1)}) t^{(1)} \mid H = h]$  (49)
$= \mathbb{E}[s^{(3)} + 2 s^{(2)} t^{(1)} + s^{(1)} t^{(2)} + s^{(1)} (t^{(1)})^2 \mid H = h].$  (50)

Therefore, $\hat{r}^{(3)}(h) := s^{(3)} + 2 s^{(2)} t^{(1)} + s^{(1)} t^{(2)} + s^{(1)} (t^{(1)})^2$ is an unbiased estimator of $r^{(3)}(h)$. We can continue to compute unbiased estimators of $r^{(4)}(h)$, $r^{(5)}(h)$, and so on in the same way. See Appendix A for the general formula of an unbiased estimator of $r^{(p)}(h)$. These estimates are then plugged into (32) to compute $\hat{R}$.
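The sketch below assembles these quantities for one hidden unit and returns the individual reward of third-order Weight Maximization ($p = 3$ in (33)). The variable names are illustrative assumptions; the sigmoid derivatives needed for $s^{(k)}$ and $t^{(k)}$ with $k \leq 3$ are written out explicitly.

```python
import numpy as np

sigma   = lambda x: 1.0 / (1.0 + np.exp(-x))
dsigma  = lambda x: sigma(x) * (1 - sigma(x))                       # sigma'
d2sigma = lambda x: sigma(x) * (1 - sigma(x)) * (1 - 2 * sigma(x))  # sigma''

def third_order_reward(h, v, c, d, R_next):
    """Individual reward of 3rd-order Weight Maximization for one hidden unit.

    h: the unit's sampled activation (0 or 1); v, c: outgoing weights and biases (m,);
    d: sampled outgoing activations (m,); R_next: outgoing individual rewards (m,)."""
    z = v * h + c                                    # outgoing pre-activations
    s1 = np.sum(R_next * v * (d - sigma(z)))         # eq. (41), k = 1
    s2 = np.sum(R_next * v ** 2 * (-dsigma(z)))      # derivative of s1 w.r.t. h
    s3 = np.sum(R_next * v ** 3 * (-d2sigma(z)))
    t1 = np.sum(v * (d - sigma(z)))                  # eq. (42), k = 1
    t2 = np.sum(v ** 2 * (-dsigma(z)))

    r1 = s1                                          # estimate of r'(h)
    r2 = s2 + s1 * t1                                # estimate of r''(h), eq. (46)
    r3 = s3 + 2 * s2 * t1 + s1 * t2 + s1 * t1 ** 2   # estimate of r'''(h), eq. (50)

    return (r1 - r2 / 2 + r3 / 6) * h                # truncated Taylor sum, eq. (33)
```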

If the high-order derivatives of $g$ in (7) are bounded, i.e. there exists $B$ such that $|\partial_{i_1} \partial_{i_2} \cdots \partial_{i_n} g(hv + c)| < B$ for any $h \in [0, 1]$, $i_j \in \{1, 2, \ldots, m\}$ for $j \in \{1, 2, \ldots, n\}$, and $n > 0$, then

$r(1) - r(0) = \mathbb{E}\left[\sum_{k=1}^p (-1)^{k+1} \frac{1}{k!} \hat{r}^{(k)}(1) \,\Big|\, H = 1\right] + \mathcal{O}\!\left(\frac{\|v\|^{p+1}}{(p+1)!}\right).$  (51)

That is, the estimation bias is of the order of $\frac{\|v\|^{p+1}}{(p+1)!}$, which converges to $0$ as $p \to \infty$. The proof is based on the error bound of the Taylor approximation applied to $g$.

2.5.2 Unbounded Derivatives

Unfortunately, the high-order derivatives of $g$ are not bounded for a network of Bernoulli-logistic units. This is because of the sigmoid functions in $g$, as shown in (7), and the fact that the maximum magnitude of the $p$-order derivative of the sigmoid function grows without bound as $p$ increases (see Figure 2). For example, estimating $\sigma(4)$ by a Taylor approximation centered at $x = 0$ results in a diverging sequence as $p \to \infty$. Denoting $C_{p+1} := \max_x |\sigma^{(p+1)}(x)|$, the estimation bias of $p$-order Weight Maximization applied to a network of Bernoulli-logistic units is $\mathcal{O}\!\left(\frac{C_{p+1} \|v\|^{p+1}}{(p+1)!}\right)$, which is unbounded as $p \to \infty$ if $\|v\| > \pi$ (see Appendix B).

In experiments, we observe that high-order Weight Maximization performs similarly to low-order Weight Maximization at the beginning of training. But as the norm of the outgoing weight grows during training, the estimation bias becomes larger for higher $p$, and hence the performance of high-order Weight Maximization deteriorates quickly in the middle of training.

Using activation functions other than the sigmoid function, or using weight decay to prevent the outgoing weight's norm from growing too large, might prevent this issue. But there is another way to tackle it, which we discuss next.

Figure 2: (a) $\sigma^{(p)}(x)$ for different $p$; the maximum magnitude of $\sigma^{(p)}(x)$ grows with $p$. (b) Absolute error of the $p$-order Taylor approximation of $\sigma(x)$ at $x = 0$; a higher-order Taylor approximation leads to a larger error when $|x| \geq \pi$.

2.6 Unbiased Weight Maximization

Let us consider Figure 1 again. All the previous methods (except REINFORCE) are based on using the derivative of $r$ to approximate $r(1) - r(0)$. STE backprop approximates it by $r'(0)$ or $r'(1)$; Weight Maximization approximates it by $r'(1)$; high-order Weight Maximization approximates it by a high-order Taylor series of $r$ at $h = 1$. The first two methods assume $r(h)$ to be linear, which is not true in general; the last method may lead to a divergent Taylor series if the outgoing weight is too large. Is there another simple method to approximate $r(1) - r(0)$ based on $r'(h)$?

The fundamental theorem of calculus tells us that we can integrate $r'(h)$ over $h$ to get $r(1) - r(0)$:

$\int_0^1 r'(u)\, du = r(1) - r(0).$  (52)

This can in turn be written as:

$\mathbb{E}[r'(U)] = r(1) - r(0),$  (53)

where $U$ is an independent and uniformly distributed random variable. Thus, if we have an unbiased estimate of $r'(u)$ for any $u \in [0, 1]$, denoted by $\hat{r}'(u)$, we may use the following estimate instead:

$\Delta_{uwm} b = \hat{r}'(U)\, H\, (H - \sigma(b)),$  (54)

which is an unbiased estimator of $\nabla_b \mathbb{E}[r(H)]$:

Proof.

$\mathbb{E}[\Delta_{uwm} b] = \mathbb{E}[\hat{r}'(U) H (H - \sigma(b))]$  (55)
$= \mathbb{E}[\hat{r}'(U) H (H - \sigma(b)) \mid H = 1]\, \Pr(H = 1)$  (56)
$= \mathbb{E}[\hat{r}'(U)(1 - \sigma(b)) \mid H = 1]\, \Pr(H = 1)$  (57)
$= \mathbb{E}[\hat{r}'(U) \mid H = 1]\, (1 - \sigma(b))\, \sigma(b)$  (58)
$= \mathbb{E}[r'(U)]\, (1 - \sigma(b))\, \sigma(b)$  (59)
$= (r(1) - r(0))\, (1 - \sigma(b))\, \sigma(b)$  (60)
$= \nabla_b \mathbb{E}[r(H)].$  (61)

The last line follows from (2). ∎

Note that $\Delta_{uwm} b$ is very similar to $\Delta_{wm} b$ in (27): we only replace $\hat{r}'(H)$ in (27) with $\hat{r}'(U)$, i.e. we evaluate the gradient at a random point in $[0, 1]$ instead of at $H$.

We also define the individual reward $\hat{R} := \hat{r}'(U)H$, which is an unbiased estimator of $r(H) - r(0)$ conditioned on $H$:

$\mathbb{E}[\hat{R} \mid H] = \mathbb{E}[\hat{r}'(U) H \mid H]$  (62)
$= \begin{cases} 0, & \text{if } H = 0 \\ \mathbb{E}[\hat{r}'(U)], & \text{if } H = 1 \end{cases}$  (63)
$= r(H) - r(0).$  (64)

We call the learning rule of updating $b$ by $\Delta_{uwm} b$ in (54) unbiased Weight Maximization.
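A minimal sketch of the single-unit update follows, assuming access to the exact derivative of a toy differentiable extension $r(h)$ (the same hypothetical function used earlier); the exact derivative stands in for an unbiased estimate $\hat{r}'(u)$. Averaging many updates (54) recovers the exact gradient (2) even though $r$ is highly non-linear, which is where Weight Maximization and STE backprop fail.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = lambda x: 1.0 / (1.0 + np.exp(-x))

# Toy differentiable extension of r(h) (hypothetical, for illustration only).
r  = lambda h: 0.5 * h + np.cos(np.pi * h)
dr = lambda h: 0.5 - np.pi * np.sin(np.pi * h)

b = 0.3
exact = (r(1) - r(0)) * sigma(b) * (1 - sigma(b))   # eq. (2)

# Monte Carlo check that Delta_uwm b (eq. 54) is unbiased.
n = 200_000
h = (rng.random(n) < sigma(b)).astype(float)        # H ~ Bernoulli(sigma(b))
u = rng.random(n)                                   # U ~ Uniform[0, 1]
updates = dr(u) * h * (h - sigma(b))                # Delta_uwm b
print(exact, updates.mean())                        # agree up to Monte Carlo error
```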

2.6.1 Natural Extension

The only remaining question is how we can obtain an unbiased estimate of $r'(u)$ for any $u \in [0, 1]$ when the unit can only output values in $\{0, 1\}$. Again, let us consider the natural extension of $r$. By importance sampling, we have, for any $u \in [0, 1]$:

$\mathbb{E}\left[\frac{\Pr(D_{1:m} = d_{1:m} \mid H = u)}{\Pr(D_{1:m} = d_{1:m} \mid H)} \sum_{i=1}^m \hat{R}_i\, \nabla_u \log \Pr(D_i = d_i \mid H = u)\right] = r'(u),$  (65)

where $\hat{R}_i$ denotes the individual reward of outgoing unit $i$, which equals the global reward $R$ if the outgoing unit is an output unit, or the individual reward computed by unbiased Weight Maximization applied to the outgoing unit if it is a hidden unit. We denote the term inside the expectation by $\hat{r}'(u)$; it is an unbiased estimator of $r'(u)$ and gives an iterative formula for computing the individual reward of every hidden unit in a network.

Proof.

$\mathbb{E}\left[\frac{\Pr(D_{1:m} = d_{1:m} \mid H = u)}{\Pr(D_{1:m} = d_{1:m} \mid H)} \sum_{i=1}^m \hat{R}_i\, \nabla_u \log \Pr(D_i = d_i \mid H = u)\right]$  (66)
$= \sum_{h, d_{1:m}} \Pr(H = h)\, \Pr(D_{1:m} = d_{1:m} \mid H = u) \sum_{i=1}^m \mathbb{E}[\hat{R}_i \mid D_{1:m} = d_{1:m}, H = h]\, \nabla_u \log \Pr(D_i = d_i \mid H = u).$  (67)

If the outgoing units are output units, then $\hat{R}_i = R$ by definition, and (67) becomes:

$\sum_{d_{1:m}} \Pr(D_{1:m} = d_{1:m} \mid H = u) \sum_{i=1}^m \mathbb{E}[R \mid D_{1:m} = d_{1:m}]\, \nabla_u \log \Pr(D_i = d_i \mid H = u)$  (68)
$= \sum_{d_{1:m}} \Pr(D_{1:m} = d_{1:m} \mid H = u)\, \mathbb{E}[R \mid D_{1:m} = d_{1:m}] \sum_{i=1}^m \nabla_u \log \Pr(D_i = d_i \mid H = u)$  (69)
$= \nabla_u \sum_{d_{1:m}} \Pr(D_{1:m} = d_{1:m} \mid H = u)\, \mathbb{E}[R \mid D_{1:m} = d_{1:m}]$  (70)
$= \nabla_u \mathbb{E}[R \mid H = u]$  (71)
$= r'(u),$  (72)

where (68) uses the assumption that $R$ is independent of $H$ conditioned on $D_{1:m} = d_{1:m}$, since the network has a multi-layer structure.

If the outgoing units are hidden units, then $\hat{R}_i$ is computed by unbiased Weight Maximization applied to the outgoing units, and $\mathbb{E}[\hat{R}_i \mid D_i = d_i, H = h] = \mathbb{E}[R \mid D_i = d_i, H = h] - \mathbb{E}[R \mid D_i = 0, H = h]$ by (64). So (67) becomes:

$\sum_{d_{1:m}} \Pr(D_{1:m} = d_{1:m} \mid H = u) \sum_{i=1}^m \mathbb{E}[R \mid D_{1:m} = d_{1:m}]\, \nabla_u \log \Pr(D_i = d_i \mid H = u)$
$\quad - \sum_{d_{1:m}} \Pr(D_{1:m} = d_{1:m} \mid H = u) \sum_{i=1}^m \mathbb{E}[R \mid D_i = 0, D_{1:m \backslash i} = d_{1:m \backslash i}]\, \nabla_u \log \Pr(D_i = d_i \mid H = u).$  (73)

It suffices to show that the second term is zero, since the first term is (68):

$\sum_{d_{1:m}} \Pr(D_{1:m} = d_{1:m} \mid H = u) \sum_{i=1}^m \mathbb{E}[R \mid D_i = 0, D_{1:m \backslash i} = d_{1:m \backslash i}]\, \nabla_u \log \Pr(D_i = d_i \mid H = u)$  (74)
$= \sum_{d_{1:m \backslash i}} \Pr(D_{1:m \backslash i} = d_{1:m \backslash i} \mid H = u) \sum_{i=1}^m \mathbb{E}[R \mid D_i = 0, D_{1:m \backslash i} = d_{1:m \backslash i}]\, \nabla_u \sum_{d_i} \Pr(D_i = d_i \mid H = u)$  (75)
$= 0.$  (76)

Note that the event $H = u$ is not well defined, since $H$ can only take binary values; $\Pr(D_{1:m} = d_{1:m} \mid H = u)$ is therefore defined by allowing $H$ to pass any value in $[0, 1]$ to its outgoing units (i.e. by extending $h$ to $[0, 1]$ in (7)). ∎

$\hat{r}'(u)$ can then be used to compute $\hat{R}$ as $\hat{r}'(u)H$, and $\Delta_{uwm} b$ can then be computed to update $b$.

Note that we can also use multiple i.i.d. samples $U_1, U_2, \ldots, U_M$ or numerical integration³ to estimate $\mathbb{E}[\hat{r}'(U)]$ and thereby reduce variance, but this did not give better performance in our experiments.

³ Analytical integration is possible iff each outgoing unit receives the same reward.
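For a whole layer, the individual rewards can be computed as sketched below. This is an illustrative implementation under assumed conventions (fully connected layer, weight matrix V of shape (m, n), biases c, sampled activations h and d, outgoing individual rewards R_next); it follows the Unbiased Weight Max row of Table 1, with one uniformly sampled $U$ and one importance ratio per unit.

```python
import numpy as np

sigma = lambda x: 1.0 / (1.0 + np.exp(-x))

def unbiased_wm_rewards(h, V, c, d, R_next, rng):
    """Individual rewards R_hat for a hidden layer under unbiased Weight Maximization.

    h: this layer's sampled activations (n,); V: outgoing weights (m, n);
    c: outgoing biases (m,); d: sampled outgoing activations (m,);
    R_next: individual rewards of the outgoing units (m,)."""
    a = V @ h + c                                    # pre-activations of the outgoing units
    R_hat = np.zeros(h.shape[0])
    for j in range(h.shape[0]):
        u = rng.random()                             # U ~ Uniform[0, 1]
        z = a + V[:, j] * (u - h[j])                 # pre-activations with h[j] replaced by u
        # importance ratio Pr(D | H_j = u) / Pr(D | H_j = h[j])
        ratio = np.prod(sigma((2 * d - 1) * z) / sigma((2 * d - 1) * a))
        # sum_i R_hat_i * d/du log Pr(D_i = d_i | H_j = u)
        score = np.sum(R_next * V[:, j] * (d - sigma(z)))
        R_hat[j] = ratio * score * h[j]              # R_hat := r_hat'(U) * H
    return R_hat

# The bias update for this layer is then delta_b = R_hat * (h - sigma(b)), and the
# rewards are propagated layer by layer, starting from the global reward R at the output.
```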

2.7 Relationship between the different forms of Weight Maximization

We have discussed three different forms of Weight Maximization: Weight Maximization in Section 2.4, high-order Weight Maximization in Section 2.5, and unbiased Weight Maximization in Section 2.6. The only difference between these algorithms is how the individual reward $\hat{R}$ is computed, as shown in Table 1. This individual reward $\hat{R}$ replaces the global reward $R$ in the REINFORCE learning rule.

Weight Maximization is equivalent to $p$-order Weight Maximization with $p = 1$, as discussed previously. Weight Maximization is also equivalent to unbiased Weight Maximization if we set $U = H$ instead of sampling $U$ uniformly from $[0, 1]$. In other words, Weight Maximization evaluates the gradient $r'(h)$ at the endpoints, while unbiased Weight Maximization evaluates the gradient $r'(h)$ at a random point in $[0, 1]$.

In addition, it can be shown that if the Taylor series of $r$ converges, then as $p \to \infty$ the update given by $p$-order Weight Maximization converges to that of a version of unbiased Weight Maximization that directly computes $\mathbb{E}[\hat{r}'(U)]$ instead of estimating it with $\hat{r}'(U)$. The proof is quite long and is omitted here. From this we see that Weight Maximization can also be seen as a first-order Taylor approximation of unbiased Weight Maximization.

Though we keep 'Weight Maximization' in the names of these learning rules to emphasize their origin, it should be noted that these extensions can no longer be seen as maximizing the norm of the outgoing weight, as their individual rewards are computed differently from those of Weight Maximization.

Table 1: Comparison of how the individual reward $\hat{R}$ is computed, and the estimation bias of each learning rule.

  • REINFORCE: $\hat{R} = R$; estimation bias $0$.
  • Weight Max: $\hat{R} = \sum_{i=1}^m \hat{R}_i H \nabla_H \log \Pr(D_i = d_i \mid H)$; estimation bias $\mathcal{O}(\|v\|^2)$.
  • $p$-order Weight Max: $\hat{R} = \left(\sum_{k=1}^p (-1)^{k+1} \frac{1}{k!} \hat{r}^{(k)}(H)\right) H$, where $\hat{r}^{(k)}$ is defined by (82); estimation bias $\mathcal{O}\!\left(\frac{C_{p+1} \|v\|^{p+1}}{(p+1)!}\right)$.
  • Unbiased Weight Max: $\hat{R} = \frac{\Pr(D_{1:m} = d_{1:m} \mid H = U)}{\Pr(D_{1:m} = d_{1:m} \mid H)} \sum_{i=1}^m \hat{R}_i H \nabla_U \log \Pr(D_i = d_i \mid H = U)$; estimation bias $0$.

3 Experiments

To test the above algorithms, we consider the $k$-bit multiplexer task. The input is sampled uniformly from all possible values of a binary vector of size $k + 2^k$. The output set is $\{-1, 1\}$, and we give a reward of $+1$ if the network's output equals the multiplexer's output and $-1$ otherwise. We consider $k = 4$ here, so the dimension of the input space is $20$. We call a single interaction of the network, from receiving an input to receiving the reward, an episode; the series of episodes over which the network's performance increases during training is called a run.
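A sketch of one episode of the task follows; the encoding (address bits first, selected data bit mapped to $\{-1, +1\}$) is an assumption about the standard multiplexer setup and may differ in detail from the exact implementation used in the report.

```python
import numpy as np

def multiplexer_episode(k, rng):
    """Sample one input of the k-bit multiplexer task and its target output."""
    x = rng.integers(0, 2, size=k + 2 ** k)      # k address bits followed by 2^k data bits
    address = int("".join(str(bit) for bit in x[:k]), 2)
    target = 2 * int(x[k + address]) - 1         # selected data bit, mapped to {-1, +1}
    return x, target

def reward(network_output, target):
    return 1.0 if network_output == target else -1.0

rng = np.random.default_rng(0)
x, target = multiplexer_episode(4, rng)          # k = 4: input dimension 4 + 16 = 20
```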

We used a three-layer network of Bernoulli-logistic units, with the first and second hidden layers having $N \in \{8, 16, 32, 48, 64, 96\}$ units each and the output layer having a single unit. The step size for gradient ascent was $0.005$, and the batch size was $16$ (i.e. we estimated the gradient for $16$ episodes and averaged the estimates in each update). We used the Adam optimizer [5]. These hyperparameters were chosen prior to the experiments without any tuning.

The average reward over the first 5e6 episodes for networks with different $N$ and learning rules is shown in Figure 3. The learning curve (i.e. the reward of each episode throughout a run) for each $N$ is shown in Figure 4. Note that the results are averaged over 5 independent runs, and the error bars represent the standard deviation over runs.

Figure 3: Average reward over the first 5e6 episodes in the multiplexer task for different $N$. Results are averaged over 5 independent runs, and the error bar represents the standard deviation over the runs.

Figure 4: Running average of the reward in each episode throughout a run in the multiplexer task for different $N$. Results are averaged over 5 independent runs, and the error bar represents the standard deviation over the runs.

We observe that:

  • REINFORCE: When $N = 8$, REINFORCE gives the best performance in terms of both learning speed and asymptotic performance: it is easy to correlate units' activation values with the reward when the network is small. But as $N$ grows beyond 16, the learning curve changes little: REINFORCE does not benefit from a larger network due to the increased noise.

  • STE Backprop: When $N$ is small, STE backprop has poor asymptotic performance: when there are only a few units, the norm of the weights has to be large, so the network has very different dynamics from an ANN with deterministic sigmoid units, which is essentially the assumption underlying STE backprop. When the network is large, the dynamics become similar, and STE backprop works well.

  • Weight Maximization and second-order Weight Maximization: Both learn fast in the first 1e6 episodes, but performance quickly deteriorates afterwards, since the estimation bias grows as the norm of the weights grows during training. Note that weight decay is required to make Weight Maximization work well, but we did not use weight decay in our experiments as it requires extensive tuning.

  • Unbiased Weight Maximization: It works well in terms of both learning speed and asymptotic performance for all $N$. The unbiased property makes it work well when the network is small, and the use of individual rewards makes it scale well to larger $N$.

We have done additional experiments to understand the different learning rules better. Experiments on $p$-order Weight Maximization for $p \in \{1, 2, \ldots, 8\}$ can be found in Appendix C.1. Experiments on variants of unbiased Weight Maximization with reduced variance can be found in Appendix C.2. Experiments comparing with continuous-valued ANNs trained by backprop can be found in Appendix C.3. Finally, experiments on the estimation bias and variance of the different estimators can be found in Appendix C.4.

4 Discussion

Both the theoretical and experimental results for unbiased Weight Maximization are promising. Among all the learning rules considered in this report, it is the only unbiased learning rule whose performance scales well as the network grows larger. To our knowledge, it is also the first learning rule for a network of Bernoulli-logistic units that has this property.

However, there are a few drawbacks of unbiased Weight Maximization. First, the computational cost of computing the individual rewards for a layer is $\mathcal{O}(m^2 n)$, where $m$ denotes the number of units on that layer and $n$ denotes the number of units on the next layer, since we have to compute the sampling ratio for each unit and each computation of the sampling ratio requires $\mathcal{O}(mn)$. In contrast, the computational cost of computing the individual reward or feedback signal for a layer is only $\mathcal{O}(mn)$ in Weight Maximization or STE backprop, and no such computation is required for REINFORCE. Second, the computation of the individual reward in unbiased Weight Maximization is too involved for biological neurons: it requires all units in a layer to coordinate to compute the sampling ratios for the individual reward signals sent to the upstream units. REINFORCE and Weight Maximization do not have this issue. Weight Maximization works well when the outgoing weight's norm is small, and the efficacy of synaptic transmission in biological neurons cannot be unbounded, so Weight Maximization, which can be seen as a first-order approximation of unbiased Weight Maximization, may already be sufficient for biological neurons.

An important and fundamental question in this study is the motivation for discrete-valued ANNs in general. A continuous-valued ANN trained by backprop learns faster and has better asymptotic performance than the discrete-valued ANNs trained by any of the learning rules considered in this report (see Appendix C.3), so the motivation for discrete-valued ANNs remains to be explored.

In conclusion, this report analyzes different learning rules for discrete-valued ANNs and proposes two new learning rules, called high-order Weight Maximization and unbiased Weight Maximization, that extend Weight Maximization. Possible future work includes:

  • Perform more experiments to understand the advantages or disadvantages of unbiased Weight Maximization (e.g. generalization error, dynamics of exploration) on standard supervised learning tasks and reinforcement learning tasks.

  • Generalize unbiased Weight Maximization to other discrete-valued units besides Bernoulli-logistic units, such as softmax units.

  • Adjust unbiased Weight Maximization to encourage exploration, e.g. by adding terms to the learning rule that encourage exploration, such as the use of $\lambda$ in $A_{r-p}$ [6].

  • Explore methods to reduce the computational cost of unbiased Weight Maximization or simplify unbiased Weight Maximization while maintaining the performance or the unbiased property.

  • Develop an unbiased version of STE backprop, which is straightforward using the idea of unbiased Weight Maximization since both rely on estimates of r(1)r(0)r(1)-r(0).

5 Afterword: my view on discrete-valued ANNs

The dynamics of discrete-valued ANNs are very different from those of continuous-valued ANNs and may lead to a much more powerful network. For example, a restricted Boltzmann machine (RBM) with continuous-valued hidden and visible units can only model multivariate Gaussian distributions, but an RBM with binary-valued hidden units (which become Bernoulli-logistic units) and continuous-valued visible units can model any distribution. Thus, by restricting the activation values of units to binary values, we paradoxically get a more powerful network. We also observe this phenomenon in some recent advances in deep learning. The attention module in Transformers [7], which led to great success in natural language processing (NLP) and underlies virtually all models in NLP nowadays, can be seen as a pseudo-discrete operation: the attention vector from a trained attention module is close to a one-hot vector, so the network can 'attend' to different parts of the input in a discrete manner. Nonetheless, due to the lack of methods to train discrete-valued ANNs efficiently, we can only convert discrete operations to differentiable operations, e.g. by outputting expected values instead of sampled values as in the attention modules of Transformers, and train them by backprop. However, this may not be desirable: when asked to think about our favorite fruit, we would not think about an apple with $0.5$ intensity, an orange with $0.3$ intensity, and a banana with $0.2$ intensity at the same time; we think of just one fruit at any moment! Clearly, we do not learn by making every thought differentiable. From a microscopic perspective, the output of a single biological neuron is also not differentiable, as it either fires or not. Therefore, it may be better to find methods to directly train discrete operations embedded in an ANN instead of trying to design differentiable variants of them. Some discussions of the benefits of discrete-valued representations can be found in [8, 9]. To conclude, the assumption that everything is differentiable, which underlies virtually all recent advances in deep learning, should be questioned.

6 Acknowledgment

We would like to thank Andrew G. Barto, who inspired this research and provided valuable insights and comments.

Appendix A A general formula for $r^{(p)}(h)$

To obtain a general formula for an unbiased estimator of $r^{(p)}(h)$, we first define $P(k)$ to be the set of partitions of the integer $k$, with each partition represented by a tuple counting the number of appearances of each part. For example, $4$ can be partitioned as $4$, $3+1$, $2+2$, $2+1+1$, and $1+1+1+1$. To encode $2+1+1$, note that $1$ appears twice and $2$ appears once, so it is encoded as $(2, 1, 0, 0)$. Hence $P(4) = \{(0, 0, 0, 1), (1, 0, 1, 0), (0, 2, 0, 0), (2, 1, 0, 0), (4, 0, 0, 0)\}$. Then we define $t_f^{(k)}$, for $k = 1, 2, \ldots$, by:

$t_f^{(k)} = \sum_{i \in P(k)} \frac{k!}{i_1!\, i_2! \cdots i_k!} \left(\frac{t^{(1)}}{1!}\right)^{i_1} \left(\frac{t^{(2)}}{2!}\right)^{i_2} \cdots \left(\frac{t^{(k)}}{k!}\right)^{i_k},$  (77)

and $t_f^{(0)} = 1$. For example, for $k = 1$ to $4$:

$t_f^{(1)} = t^{(1)},$  (78)
$t_f^{(2)} = t^{(2)} + (t^{(1)})^2,$  (79)
$t_f^{(3)} = t^{(3)} + 3 t^{(1)} t^{(2)} + (t^{(1)})^3,$  (80)
$t_f^{(4)} = t^{(4)} + 4 t^{(1)} t^{(3)} + 3 (t^{(2)})^2 + 6 (t^{(1)})^2 t^{(2)} + (t^{(1)})^4.$  (81)

Then $r^{(p)}(h)$ can be expressed as follows (provable by induction; the proof is omitted here):

$r^{(p)}(h) = \mathbb{E}\left[\sum_{k=1}^p C^{p-1}_{k-1} s^{(k)} t_f^{(p-k)} \,\Big|\, H = h\right],$  (82)

where $s^{(k)}$ and $t^{(k)}$ are defined in (41) and (42). Therefore, the general formula for estimating $r^{(p)}(h)$ is $\hat{r}^{(p)}(h) := \sum_{k=1}^p C^{p-1}_{k-1} s^{(k)} t_f^{(p-k)}$. For example, for $p = 1$ to $4$ (note that these are the same as the estimates derived previously):

$\hat{r}^{(1)}(h) = s^{(1)},$  (83)
$\hat{r}^{(2)}(h) = s^{(2)} + s^{(1)} t_f^{(1)},$  (84)
$\hat{r}^{(3)}(h) = s^{(3)} + 2 s^{(2)} t_f^{(1)} + s^{(1)} t_f^{(2)},$  (85)
$\hat{r}^{(4)}(h) = s^{(4)} + 3 s^{(3)} t_f^{(1)} + 3 s^{(2)} t_f^{(2)} + s^{(1)} t_f^{(3)}.$  (86)
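A small sketch of the general formula follows: it enumerates integer partitions to evaluate $t_f^{(k)}$ as in (77) and combines the $s^{(k)}$ and $t^{(k)}$ values (given here as plain dictionaries, an illustrative convention) into $\hat{r}^{(p)}(h)$ as in (82). A check against (80) is included.

```python
from math import comb, factorial

def partitions(k, max_part=None):
    """Yield the integer partitions of k as lists of parts (largest first)."""
    max_part = k if max_part is None else max_part
    if k == 0:
        yield []
        return
    for first in range(min(k, max_part), 0, -1):
        for rest in partitions(k - first, first):
            yield [first] + rest

def t_f(k, t):
    """t_f^(k) as in eq. (77); t maps each order j to the value t^(j). t_f^(0) = 1."""
    if k == 0:
        return 1.0
    total = 0.0
    for parts in partitions(k):
        counts = [parts.count(j) for j in range(1, k + 1)]   # i_1, ..., i_k
        denom = 1
        for j, i_j in enumerate(counts, start=1):
            denom *= factorial(i_j) * factorial(j) ** i_j
        term = factorial(k) // denom
        for j, i_j in enumerate(counts, start=1):
            if i_j:
                term *= t[j] ** i_j
        total += term
    return total

def r_hat(p, s, t):
    """General estimator r_hat^(p) = sum_k C(p-1, k-1) s^(k) t_f^(p-k), eq. (82)."""
    return sum(comb(p - 1, k - 1) * s[k] * t_f(p - k, t) for k in range(1, p + 1))

# Check against eq. (80): t_f^(3) = t^(3) + 3 t^(1) t^(2) + (t^(1))^3.
t = {1: 0.3, 2: -0.2, 3: 0.1}
assert abs(t_f(3, t) - (t[3] + 3 * t[1] * t[2] + t[1] ** 3)) < 1e-12
```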

Appendix B A general formula for $\sigma^{(p)}(x)$

The general formula for $\sigma^{(p)}(x)$ is given by:

$\sigma^{(p)}(x) = \sum_{k=1}^p (-1)^{k-1} A_{p, k-1}\, \sigma(x)^k (1 - \sigma(x))^{p+1-k},$  (87)

where the $A_{p, k-1}$ are the Eulerian numbers. For the proof, see [10].
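A sketch of (87) follows, computing the Eulerian numbers with their standard recurrence and checking the result against the closed forms of the first two sigmoid derivatives.

```python
import numpy as np
from functools import lru_cache

@lru_cache(maxsize=None)
def eulerian(n, k):
    """Eulerian number A(n, k) via A(n, k) = (k+1) A(n-1, k) + (n-k) A(n-1, k-1)."""
    if k < 0 or k > max(n - 1, 0):
        return 0
    if n == 0:
        return 1
    return (k + 1) * eulerian(n - 1, k) + (n - k) * eulerian(n - 1, k - 1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(p, x):
    """p-order derivative of the sigmoid, following eq. (87)."""
    s = sigmoid(x)
    return sum((-1) ** (k - 1) * eulerian(p, k - 1) * s ** k * (1 - s) ** (p + 1 - k)
               for k in range(1, p + 1))

# Sanity check against sigma' = s(1 - s) and sigma'' = s(1 - s)(1 - 2s):
x = 0.7
s = sigmoid(x)
assert np.isclose(sigmoid_derivative(1, x), s * (1 - s))
assert np.isclose(sigmoid_derivative(2, x), s * (1 - s) * (1 - 2 * s))
```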

To see that $\frac{C_{p+1} \|v\|^{p+1}}{(p+1)!}$ is unbounded if $\|v\| > \pi$, it suffices to prove that, for $|x| > \pi$,

$\lim_{n \to \infty} \left|\frac{\sigma^{(2n-1)}(0)\, x^{2n-1}}{(2n-1)!}\right| = +\infty.$  (88)

Proof.

First, we can write $\sigma^{(p)}(0)$ as:

$\sigma^{(p)}(0) = \sum_{k=1}^p (-1)^{k-1} A_{p, k-1}\, \sigma(0)^k (1 - \sigma(0))^{p+1-k}$  (89)
$= \sum_{k=1}^p (-1)^{k-1} A_{p, k-1}\, 2^{-(p+1)}$  (90)
$= 2^{-(p+1)} \sum_{k=1}^p (-1)^{k-1} A_{p, k-1}$  (91)
$= \frac{(2^{p+1} - 1) B_{p+1}}{p+1},$  (92)

where $B_{p+1}$ denotes the Bernoulli number. Substituting back, we have

$\lim_{n \to \infty} \left|\frac{\sigma^{(2n-1)}(0)\, x^{2n-1}}{(2n-1)!}\right| = \lim_{n \to \infty} \left|\frac{(2^{2n} - 1) B_{2n}\, x^{2n-1}}{(2n)!}\right|$  (93)
$\geq \lim_{n \to \infty} \left|\frac{(2^{2n} - 1)\, 2\, (2n)!\, x^{2n-1}}{(2\pi)^{2n} (1 - 2^{-2n}) (2n)!}\right|$  (94)
$\geq \lim_{n \to \infty} \left|\frac{(2^{2n} - 1)\, 2\, x^{2n-1}}{(2\pi)^{2n}}\right|$  (95)
$= \lim_{n \to \infty} \left|\frac{2}{x}\left(\frac{x}{\pi}\right)^{2n}\left(1 - \frac{1}{2^{2n}}\right)\right|,$  (96)

which is unbounded if $|x| > \pi$. (94) uses $|B_{2n}| > \frac{2\,(2n)!}{(2\pi)^{2n}(1 - 2^{-2n})}$ [11]. ∎

Appendix C Additional Experiments

C.1 High-order Weight Maximization

We repeat the experiment of Section 3 for $p$-order Weight Maximization, with $p \in \{1, 2, \ldots, 8\}$ and $N = 64$. The result is shown in Figure 5.

We see that $p$-order Weight Maximization performs worse for larger $p$. Manual inspection shows that the individual reward computed by high-order Weight Maximization can be very large when the outgoing weight's norm is large. This is essentially the problem of approximating the sigmoid function $\sigma(x)$ by a Taylor series evaluated at $x = a$, which diverges when $|x - a|$ is too large.

C.2 Variants of unbiased Weight Maximization

In unbiased Weight Maximization we estimate $\mathbb{E}[r'(U)]$ with $r'(U)$, a Monte Carlo estimate with a single sample. An estimate with lower variance is $\frac{1}{M}\sum_{i=1}^M r'(U_i)$, where the $U_i$ are i.i.d. and uniformly distributed. Alternatively, we can estimate $\mathbb{E}[r'(U)]$ by numerical integration using the rectangle rule on $M$ subintervals. To test these methods, we repeat the experiment of Section 3 with $M \in \{1, 5, 10\}$ for both Monte Carlo (MC) and numerical integration (NI), using $N = 64$. The result is shown in Figure 6.
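The two estimators of $\mathbb{E}[r'(U)] = r(1) - r(0)$ are sketched below for a toy derivative $r'$ (a hypothetical function used only for illustration); the rectangle rule is written here in its midpoint form.

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate_mc(dr, M):
    """Monte Carlo estimate of E[r'(U)] with M i.i.d. uniform samples."""
    return np.mean([dr(rng.random()) for _ in range(M)])

def estimate_rectangle(dr, M):
    """Rectangle (midpoint) rule on M subintervals of [0, 1]."""
    midpoints = (np.arange(M) + 0.5) / M
    return np.mean(dr(midpoints))

# Toy derivative of a hypothetical extension r(h) = 0.5 h + cos(pi h);
# both estimates approximate r(1) - r(0) = -1.5.
dr = lambda u: 0.5 - np.pi * np.sin(np.pi * u)
print(estimate_mc(dr, 10), estimate_rectangle(dr, 10))
```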

We see that the learning curves are almost identical: these variance-reduction methods do not yield an observable improvement in performance. Perhaps the reduction in variance is not significant enough to affect performance.

C.3 Comparison with continuous-valued ANNs

We repeat the experiment of Section 3 for continuous-valued ANNs trained by backprop: the output unit is still a Bernoulli-logistic unit trained by REINFORCE, but all hidden units are deterministic sigmoid units trained by backprop. The network architecture is the same as for the discrete-valued ANNs. Thus, the continuous-valued ANN considered here is equivalent to the discrete-valued ANN in our experiments, except that the hidden units output expected values instead of sampled values. The average rewards, analogous to Figure 3 and Figure 4, are shown in Figure 7 and Figure 8.

We observe that continuous-valued ANNs trained by backprop perform better in terms of both learning speed and asymptotic performance than discrete-valued ANNs, though the difference narrows for larger networks. This is likely because units in discrete-valued ANNs can only communicate binary values, which constrains the amount of information passed between units; e.g. a unit can only express whether there is an edge, but not the probability that there is an edge. As the network size increases, more binary units can detect the same edge, and the average value of these units can estimate the probability that there is an edge, so the difference in performance narrows.

More experiments can be conducted to evaluate and compare discrete-valued ANNs and continuous-valued ANNs on other aspects besides learning speed and network capacity, such as generalization error in supervised learning tasks and dynamics of exploration in reinforcement learning tasks.

C.4 Estimation bias and variance of the estimators

All the learning rules discussed can be seen as using different estimators of the gradient of the expected reward $\mathbb{E}[r(H)]$ w.r.t. the bias $b$ of a unit. To evaluate the quality of these estimators, we compute their estimation bias and variance on some randomly selected distributions.

We consider a five-layer network of Bernoulli-logistic units. The first hidden layer and the output layer have a single unit each; the other hidden layers have four units each. The weights and biases of these units are uniformly sampled from $[-C, +C]$, where $C > 0$. The task considered has no inputs, and the rewards for the two binary outputs are uniformly sampled from $[-10, +10]$. We then apply the estimators to estimate the gradient of $\mathbb{E}[r(H)]$ w.r.t. the bias of the unit on the first layer. The estimation bias and variance of the different estimators can be computed analytically since the network is small.

The estimation bias and variance for networks with different parameter ranges $C$ are shown in Figure 9 and Figure 10. We observe that when $C$ is small, the estimation bias of $p$-order Weight Maximization is lower for higher $p$: in that range, the Taylor series of the sigmoid function converges. However, when $C$ is large, the Taylor series diverges, and the estimation bias is larger for higher $p$. As for variance, REINFORCE has a steady variance regardless of $C$, since the variance of REINFORCE is bounded by the square of the expected reward. The variance of unbiased Weight Maximization increases fastest with $C$, since the sampling ratio in unbiased Weight Maximization can be huge when the weights are large. Nonetheless, the large variance of unbiased Weight Maximization seems to have little impact on the learning curve. This may be explained by the fact that those huge sampling ratios occur with probability close to zero and so may never be encountered during training. The relationship between variance and learning speed deserves further investigation.

Figure 5: Running average of the reward in each episode throughout a run in the multiplexer task for high-order Weight Maximization. Results are averaged over 5 independent runs, and the error bar represents the standard deviation over the runs.

Figure 6: Running average of the reward in each episode throughout a run in the multiplexer task for variants of unbiased Weight Maximization. Results are averaged over 5 independent runs, and the error bar represents the standard deviation over the runs.

Figure 7: Average reward over the first 5e6 episodes in the multiplexer task for different $N$, for both discrete-valued and continuous-valued ANNs. Results are averaged over 5 independent runs, and the error bar represents the standard deviation over the runs.

Figure 8: Running average of the reward in each episode throughout a run in the multiplexer task for different $N$, for both discrete-valued and continuous-valued ANNs. Results are averaged over 5 independent runs, and the error bar represents the standard deviation over the runs.

Figure 9: Estimation bias of different estimators in a network with different parameter ranges $C$. REINFORCE and unbiased Weight Maximization have zero estimation bias and are therefore not shown on the graph. Results are averaged over 20 randomly selected distributions.

Figure 10: Variance of different estimators in a network with different parameter ranges $C$. Results are averaged over 20 randomly selected distributions.

References

  • [1] Richard S Sutton and Andrew G Barto “Reinforcement learning: An introduction” MIT press, 2018
  • [2] Stephen Chung “Learning by competition of self-interested reinforcement learning agents” In Proceedings of the AAAI Conference on Artificial Intelligence 36.6, 2022, pp. 6384–6393
  • [3] Ronald J Williams “Simple statistical gradient-following algorithms for connectionist reinforcement learning” In Machine learning 8.3-4 Springer, 1992, pp. 229–256
  • [4] Yoshua Bengio, Nicholas Léonard and Aaron Courville “Estimating or propagating gradients through stochastic neurons for conditional computation” In arXiv preprint arXiv:1308.3432, 2013
  • [5] Diederik P Kingma and Jimmy Ba “Adam: A method for stochastic optimization” In arXiv preprint arXiv:1412.6980, 2014
  • [6] Andrew G Barto “Learning by statistical cooperation of self-interested neuron-like computing elements” In Human Neurobiology 4.4, 1985, pp. 229–256
  • [7] Ashish Vaswani et al. “Attention is all you need” In Advances in neural information processing systems 30, 2017
  • [8] Lorenzo Ferrone and Fabio Massimo Zanzotto “Symbolic, distributed, and distributional representations for natural language processing in the era of deep learning: A survey” In Frontiers in Robotics and AI Frontiers, 2020, pp. 153
  • [9] Ruben Cartuyvels, Graham Spinks and Marie-Francine Moens “Discrete and continuous representations and processing in deep learning: looking forward” In AI Open 2 Elsevier, 2021, pp. 143–159
  • [10] Ali A Minai and Ronald D Williams “On the derivatives of the sigmoid” In Neural Networks 6.6 Elsevier, 1993, pp. 845–853
  • [11] Horst Alzer “Sharp bounds for the Bernoulli numbers” In Archiv der Mathematik 74.3 Springer, 2000, pp. 207–211