
Report on Learning Rules for Discrete-valued Networks:
Unbiased Weight Maximization

Stephen Chung
University of Massachusetts Amherst, Department of Computer Science
[email protected]
(March 2022)

1 Overview

A biologically plausible method for training an Artificial Neural Network (ANN) involves treating each unit as a stochastic Reinforcement Learning (RL) agent, thereby considering the network as a team of agents. Consequently, all units can learn via REINFORCE, a local learning rule modulated by a global reward signal, which aligns more closely with biologically observed forms of synaptic plasticity [1, Chapter 15]. Nevertheless, this learning method is often slow and scales poorly with network size due to inefficient structural credit assignment, since a single reward signal is broadcast to all units without considering individual contributions.

Weight Maximization [2], a proposed solution, replaces a unit's reward signal with the norm of its outgoing weight, so that each hidden unit maximizes the norm of its outgoing weight instead of the global reward signal. In this research report, we analyze the theoretical properties of Weight Maximization and propose a variant, Unbiased Weight Maximization. This new approach provides an unbiased learning rule that increases learning speed and improves asymptotic performance. Notably, to our knowledge, this is the first learning rule for a network of Bernoulli-logistic units that is unbiased and scales well with the number of units in the network in terms of learning speed.

2 A single Bernoulli-logistic unit

We first focus on the learning rules for a single Bernoulli-logistic unit with no inputs¹: denoting the activation value of the unit by $H$, the sigmoid function by $\sigma$, and the scalar bias (the parameter to be learned; not to be confused with the estimation bias of an estimator) by $b$, the distribution of $H$ is given by $\Pr(H = h) = \pi(h; b) = \sigma((2h - 1)b)$, where $h \in \{0, 1\}$. After sampling $H$, a scalar reward $R$ is sampled from a distribution conditioned on $H$. We are interested in learning $b$ such that the expected reward $\mathbb{E}[R]$ is maximized. We also denote the conditional expectation $\mathbb{E}_R[R \mid H = h]$ by $r(h)$, so $\mathbb{E}[R] = \mathbb{E}_H[\mathbb{E}_R[R \mid H]] = \mathbb{E}[r(H)]$.

¹ To generalize the learning rules to a unit with inputs, we can multiply the update of the bias by the input to compute the update of the weight associated with that input.

Our goal is to compute or estimate $\nabla_b \mathbb{E}_H[r(H)]$, the gradient of the expected reward w.r.t. the parameter $b$, so as to perform gradient ascent on the expected reward.

A direct computation of the gradient gives:

$\nabla_b \mathbb{E}[r(H)] = \nabla_b(\sigma(-b)\,r(0) + \sigma(b)\,r(1))$  (1)
$= (r(1) - r(0))\,\sigma(b)(1 - \sigma(b)).$  (2)

That is, the gradient is positive iff $r(1) > r(0)$, since $\sigma(b)(1 - \sigma(b))$ is always positive. In other words, if the expected reward for firing (i.e. outputting $1$) is larger than the expected reward for not firing (i.e. outputting $0$), then the unit should fire more, which is achieved by increasing $b$. In fact, we see that the optimal $b^* = \arg\max_b \mathbb{E}[r(H)]$ is infinitely positive if $r(1) > r(0)$ and infinitely negative if $r(1) < r(0)$. However, $r(1) - r(0)$ is generally not known.

2.1 A multi-layer network of Bernoulli-logistic units

Before discussing the learning rules, we consider some properties of $r(h)$ when the unit is one of many units in a multi-layer network of Bernoulli-logistic units. In a multi-layer network, units are grouped into different layers. We call the last layer the output layer and the other layers hidden layers. Units on the output layer are called output units, and units on hidden layers are called hidden units. Activation values of the hidden units are passed to the next layer as inputs, and activation values of the output units are passed to the external environment and determine the reward distribution.

Therefore, if the unit is a hidden unit in a multi-layer network of Bernoulli-logistic units, $H$ is passed to the units on the next layer to determine their activation values $D_1, D_2, \ldots, D_m$ by $\Pr(D_i = d \mid H = h) = \sigma((2d - 1)(v_i h + c_i))$, where $v_i$ is the weight connecting the unit to the next layer's unit $i$, $c_i$ is the bias of the next layer's unit $i$, $d \in \{0, 1\}$, $i \in \{1, 2, \ldots, m\}$, and $m$ is the number of units on the next layer. We call the vector $v = [v_1, \ldots, v_m]$ the outgoing weight, and the next layer's units the outgoing units.

From this, we see that $r(h)$ can be written as some differentiable function $g(v_1 h + c_1, \ldots, v_m h + c_m)$:

$r(h)$  (3)
$= \mathbb{E}[R \mid H = h]$  (4)
$= \sum_{d_{1:m}} \Pr(D_{1:m} = d_{1:m} \mid H = h)\, \mathbb{E}[R \mid D_{1:m} = d_{1:m}, H = h]$  (5)
$= \sum_{d_{1:m}} \Pr(D_{1:m} = d_{1:m} \mid H = h)\, \mathbb{E}[R \mid D_{1:m} = d_{1:m}]$  (6)
$= \sum_{d_{1:m}} \sigma((2d_1 - 1)(v_1 h + c_1)) \cdots \sigma((2d_m - 1)(v_m h + c_m))\, \mathbb{E}[R \mid D_{1:m} = d_{1:m}]$  (7)
$=: g(v_1 h + c_1, \ldots, v_m h + c_m),$  (8)

where (6) uses the assumption that $R$ is independent of $H$ when conditioned on $D_{1:m}$, since the network has a multi-layer structure.

Note that for $h \in [0, 1]$, $g$ is a differentiable function of $h$, so $g$ gives a differentiable extension of $r(h)$ to $h \in [0, 1]$. We call this extension the natural extension of $r(h)$; it extends $r(h)$ by allowing the unit to pass any value in $[0, 1]$, instead of only values in $\{0, 1\}$, to its outgoing units. Though differentiable, the natural extension can be highly non-linear, as shown in Figure 1(a). The natural extension of $r(h)$, which is only applicable to hidden units, is useful as it opens up the possibility of learning rules that are not available to output units. For example, we may have some estimates of $r'(0)$ or $r'(1)$ from training the outgoing units, which can be used to approximate $r(1) - r(0)$ by a first-order Taylor approximation.

We are now ready to discuss five different estimates of $\nabla_b \mathbb{E}[r(H)]$.

2.2 REINFORCE

The gradient can be computed as:

$\nabla_b \mathbb{E}[r(H)] = \mathbb{E}[r(H)\, \nabla_b \log \Pr(H)]$  (9)
$= \mathbb{E}[r(H)(H - \sigma(b))].$  (10)

Suppose that we have an unbiased estimate of $r(h)$ for any $h$, denoted by $\hat{r}(h)$; we can then estimate the gradient by:

$\Delta_{rei} b = \hat{r}(H)(H - \sigma(b)),$  (11)

which is an unbiased estimate of the gradient:

$\mathbb{E}[\Delta_{rei} b] = \mathbb{E}_H[\mathbb{E}[\hat{r}(H) \mid H](H - \sigma(b))]$  (12)
$= \mathbb{E}[r(H)(H - \sigma(b))]$  (13)
$= \nabla_b \mathbb{E}[r(H)].$  (14)

Note that the $R$ sampled (conditioned on $H$) is an unbiased estimate of $r(H)$, so $R(H - \sigma(b))$ is an unbiased estimate of the gradient, and this becomes the REINFORCE learning rule [3].

However, if $r(0)$ and $r(1)$ are very close, then the expected value of $\Delta_{rei} b$ is close to zero, as shown by (2). Combined with the variance of $\hat{r}(H)$, this makes the sign of $\Delta_{rei} b$ almost random, so it takes many steps of gradient ascent to reach $b^*$. This is indeed the case when the unit is a hidden unit in a large network of Bernoulli-logistic units.
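As a concrete illustration, the sketch below (hypothetical code, not taken from the paper) applies the REINFORCE update (11) to a single unit with a toy reward in which $r(1) > r(0)$; gradient ascent then drives $b$ toward large positive values, as predicted by (2). The reward function and step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def reinforce_step(b, reward_fn, lr=0.1):
    """One REINFORCE update for a single Bernoulli-logistic unit, eq. (11)."""
    p = 1.0 / (1.0 + np.exp(-b))        # sigma(b) = Pr(H = 1)
    h = float(rng.random() < p)         # sample H ~ Bernoulli(sigma(b))
    r = reward_fn(h)                    # sample R conditioned on H
    return b + lr * r * (h - p)         # Delta_rei b = R (H - sigma(b))

# Toy reward: r(1) = 1 and r(0) = -1 plus noise, so r(1) > r(0) and b should grow.
b = 0.0
for _ in range(2000):
    b = reinforce_step(b, lambda h: (2 * h - 1) + rng.normal(scale=0.5))
print(b)  # b drifts toward large positive values, i.e. Pr(H = 1) approaches 1
```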

2.3 STE Backprop

Suppose that (i) we can extend $r(h)$ to a differentiable function, and (ii) we have an estimate of $r'(h)$ for any $h$ ($f'$ denotes the derivative of $f$), denoted by $\hat{r}'(h)$; then we can apply STE backprop [4], a common method for training Bernoulli-logistic units ($1\{\cdot\}$ denotes the indicator function and $U$ denotes a uniformly distributed and independent random variable):

$\nabla_b \mathbb{E}[r(H)] = \nabla_b \mathbb{E}_U[r(1\{\sigma(b) > U\})]$  (15)
$= \mathbb{E}_U[\nabla_b\, r(1\{\sigma(b) > U\})]$  (16)
$= \mathbb{E}_U[r'(1\{\sigma(b) > U\})\, \nabla_b 1\{\sigma(b) > U\}]$  (17)
$\approx \mathbb{E}_U[r'(1\{\sigma(b) > U\})\, \nabla_b \sigma(b)]$  (18)
$= \mathbb{E}_U[r'(1\{\sigma(b) > U\})\, \sigma(b)(1 - \sigma(b))]$  (19)
$= \mathbb{E}[r'(H)]\, \sigma(b)(1 - \sigma(b)),$  (20)

where (15) uses the fact that $1\{\sigma(b) > U\}$ has the same distribution as $H$, and (18) uses the approximation $\nabla_x 1\{x > u\} \approx 1$ of STE backprop (note that the actual derivative of $f(x) = 1\{x > u\}$ is zero for $x \neq u$ and does not exist at $x = u$).

Thus, the estimate of the gradient by STE backprop is given by:

$\Delta_{ste} b = \hat{r}'(H)\, \sigma(b)(1 - \sigma(b)).$  (21)

We observe that the form of $\Delta_{ste} b$ is very similar to (2). Intuitively, STE backprop uses $\hat{r}'(H)$, i.e. the estimated derivative of $r$ at $h = H$, to approximate $r(1) - r(0)$. If $r$ is a linear function and $\hat{r}'(H)$ is unbiased, then $\mathbb{E}[\hat{r}'(H)] = \mathbb{E}[r'(H)] = r(1) - r(0)$ and the estimate $\Delta_{ste} b$ is unbiased. However, if $r$ is a non-linear function, it is possible that $\Delta_{ste} b$ always has the wrong sign, so that $b$ converges to a value that minimizes $\mathbb{E}[R]$ instead. This is illustrated in Figure 1(b): for the second $r(h)$, the slopes at $h = 0$ and $h = 1$ are both positive, so the approximated $r(1) - r(0)$ is positive. However, the true $r(1) - r(0)$ is negative, and $\Delta_{ste} b$ always has the wrong sign in this case.

The reason for this approximation error is that increasing $b$ by a small amount increases the probability weight placed on $r(1)$ and decreases the probability weight placed on $r(0)$ by a small amount; it does not increase $h$ by a small amount (we cannot 'move' along the curve $r(h)$). The latter case corresponds to $H = \sigma(b)$, i.e. the activation value of the unit being deterministic and equal to $\sigma(b)$. In fact, backprop applied to a deterministic ANN with sigmoid activation functions gives almost the same parameter update as STE backprop applied to an ANN of Bernoulli-logistic units. In other words, STE backprop treats the network as if the hidden units output expected values instead of sampled values.
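To make this failure mode concrete, the sketch below uses a hypothetical non-linear extension $r(h)$ chosen so that $r'(0), r'(1) > 0$ while $r(1) - r(0) < 0$, mirroring $r_2(h)$ in Figure 1(b); the specific function is an illustrative assumption. The STE estimate (21) then has the wrong sign for either value of $H$, while the exact gradient (2) is negative.

```python
import numpy as np

def sigma(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical non-linear extension: r(1) - r(0) = -1.5 < 0, yet r'(0) = r'(1) = 0.5 > 0.
r  = lambda h: 0.5 * h + np.cos(np.pi * h)
dr = lambda h: 0.5 - np.pi * np.sin(np.pi * h)

b = 0.3
exact = (r(1) - r(0)) * sigma(b) * (1 - sigma(b))          # eq. (2): negative
h = float(np.random.default_rng(0).random() < sigma(b))    # sample H
ste = dr(h) * sigma(b) * (1 - sigma(b))                    # eq. (21): positive for H = 0 or 1

print(exact, ste)
```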

Figure 1: (a) The natural extension of $r(h)$ in the form of (7). Eight outgoing units ($m = 8$) are used and all parameters ($v$, $c$, $\mathbb{E}[R \mid D_{1:m} = d_{1:m}]$) are randomly sampled; four resulting $r(h)$ are shown on the graph. (b) Two different extended $r(h)$. If the extension is linear (as shown in $r_1(h)$), then the approximation of $r(1) - r(0)$ by $r'(0)$ or $r'(1)$ is exact. But if the extension is non-linear (as shown in $r_2(h)$), then the approximation of $r(1) - r(0)$ by $r'(0)$ or $r'(1)$ can have the wrong sign.

2.3.1 Natural Extension

The two assumptions of having an extension and an estimate $\hat{r}'(h)$ can be satisfied by using the natural extension $g$ defined in (8) if the unit is a hidden unit in a network. To estimate $r'(h)$, we observe:

$r'(h) = \nabla_h\, g(v_1 h + c_1, v_2 h + c_2, \ldots, v_m h + c_m)$  (22)
$= \sum_{i=1}^m v_i\, \nabla_i\, g(v_1 h + c_1, v_2 h + c_2, \ldots, v_m h + c_m)$  (23)
$= \sum_{i=1}^m v_i\, \nabla_{c_i} \mathbb{E}[R \mid H = h].$  (24)

Assuming the outgoing units are also learning by adding some estimate of $\nabla_{c_i} \mathbb{E}[R \mid H]$ to their biases $c_i$, we can use the change in $c_i$ as an estimate of $\nabla_{c_i} \mathbb{E}[R \mid H]$ and plug it into the above formula to obtain $\hat{r}'(H)$. The estimate of $\nabla_{c_i} \mathbb{E}[R \mid H]$ for the outgoing units can be obtained by STE backprop if the outgoing units are hidden units, and by REINFORCE (11) or direct computation (2) if the outgoing units are output units.

However, the natural extension of $r$ is not linear in most cases, so STE backprop gives a biased estimate. From the Taylor expansion of $g$ in (8), the estimation bias is of the order of $\|v\|^2$, the squared norm of the outgoing weight. Note that this estimation bias accumulates across layers, as $\hat{r}'(H)$ is no longer an unbiased estimator of $r'(H)$ for units whose outgoing units are themselves hidden units.

2.4 Weight Maximization

Neither the parameter update given by REINFORCE nor that given by STE backprop is necessarily zero when the unit is not firing. A desirable property of a learning rule is that the unit's parameters are not updated when the unit is not firing. This can be achieved by using $r(0)$ as a baseline in REINFORCE:

$\nabla_b \mathbb{E}[r(H)] = \mathbb{E}[(r(H) - r(0))(H - \sigma(b))].$  (25)

Note that $(R - r(0))H$ is an unbiased estimate of $r(H) - r(0)$ conditioned on $H$: if $H = 0$, then $\mathbb{E}[(R - r(0))H \mid H = 0] = 0 = r(H) - r(0)$; if $H = 1$, then $\mathbb{E}[(R - r(0))H \mid H = 1] = \mathbb{E}[R \mid H = 1] - r(0) = r(H) - r(0)$. Thus, an unbiased estimate of the gradient is:

$\Delta b = (R - r(0))\,H\,(H - \sigma(b)).$  (26)

However, the above estimate cannot be computed since $r(0)$ is not known in general. Suppose that (i) we can extend $r(h)$ to a differentiable function, and (ii) we have an estimate of $r'(h)$ for any $h$, denoted by $\hat{r}'(h)$; then we may approximate $(R - r(0))H$ by $\hat{r}'(H)H$, leading to the following estimate of the gradient:

$\Delta_{wm} b = \hat{r}'(H)\,H\,(H - \sigma(b)).$  (27)

Compared to REINFORCE, Weight Maximization replaces the global reward $R$ with the individual reward $\hat{R} := \hat{r}'(H)H$, which equals zero when the unit is not firing and equals an approximation of $r(1) - r(0)$ by $r'(1)$ when the unit is firing. As with STE backprop, this approximation requires $r$ to be linear, which does not hold in most cases.
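The layer-level form of this rule is sketched below under some illustrative assumptions: a fully connected layer whose sampled activations are the vector h, with outgoing weight matrix V, outgoing biases c, sampled outgoing activations d, and outgoing individual rewards R_next (all names are hypothetical). Each unit's individual reward is its activation times the sum of the REINFORCE-style updates its outgoing weights would receive, matching the Weight Max row of Table 1.

```python
import numpy as np

sigma = lambda x: 1.0 / (1.0 + np.exp(-x))

def weight_max_rewards(h, V, c, d, R_next):
    """Individual rewards under Weight Maximization:
    R_hat_j = sum_i R_next_i * h_j * V[i, j] * (d_i - sigma(a_i)).

    h: this layer's sampled activations (n,); V: outgoing weights (m, n);
    c: outgoing biases (m,); d: sampled outgoing activations (m,);
    R_next: individual rewards of the outgoing units (m,)."""
    a = V @ h + c                            # pre-activations of the outgoing units
    g = R_next * (d - sigma(a))              # per-outgoing-unit REINFORCE factor
    return h * (V.T @ g)                     # one individual reward per unit in this layer

def weight_max_bias_update(b, R_hat, h, lr=0.005):
    """Bias update Delta_wm b = R_hat * (H - sigma(b)) for each unit, eq. (27)."""
    return b + lr * R_hat * (h - sigma(b))
```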

2.4.1 Natural Extension

The two assumptions of having an extension and an estimate $\hat{r}'(h)$ can be satisfied by using the natural extension defined in (8) when the unit is a hidden unit in a network. We can estimate $r'(h)h$ by:

$r'(h)h = h\, \nabla_h\, g(v_1 h + c_1, v_2 h + c_2, \ldots, v_m h + c_m)$  (28)
$= \sum_{i=1}^m v_i h\, \nabla_i\, g(v_1 h + c_1, v_2 h + c_2, \ldots, v_m h + c_m)$  (29)
$= \sum_{i=1}^m v_i\, \nabla_{v_i} \mathbb{E}[R \mid H = h].$  (30)

Assuming the outgoing units are also learning by adding some estimate of $\nabla_{v_i} \mathbb{E}[R \mid H]$ to their weights $v_i$, we can use the change in $v_i$ as an estimate of $\nabla_{v_i} \mathbb{E}[R \mid H]$ and plug it into the above formula to obtain $\hat{r}'(H)H$. This leads to the alternative perspective of maximizing outgoing weights in Weight Maximization [2].

However, the natural extension of $r$ is not linear in most cases, so Weight Maximization gives a biased estimate. From the Taylor expansion of $g$ in (8), the estimation bias is of the order of $\|v\|^2$, the squared norm of the outgoing weight. Note that this estimation bias accumulates across layers, as $\hat{r}'(H)$ is no longer an unbiased estimator of $r'(H)$ for units whose outgoing units are themselves hidden units.

2.5 High-order Weight Maximization

Both STE backprop and Weight Maximization assume the extended $r$ to be linear. In particular, Weight Maximization uses $r'(1)$ to estimate $r(1) - r(0)$, which is a first-order Taylor approximation. We may use a $p$-order Taylor approximation to get a better estimate (where $p \geq 1$):

$r(1) - r(0) \approx r^{(1)}(1) - \frac{1}{2!} r^{(2)}(1) + \frac{1}{3!} r^{(3)}(1) - \cdots + (-1)^{p+1} \frac{1}{p!} r^{(p)}(1),$  (31)

where $r^{(p)}(h)$ denotes the $p$-order derivative of $r$. With this new approximation, (27) is generalized to:

$\Delta_{wm\text{-}p} b = \left(\sum_{k=1}^p (-1)^{k+1} \frac{1}{k!} \hat{r}^{(k)}(H)\right) H (H - \sigma(b)).$  (32)

This gives the parameter update for $p$-order Weight Maximization, which replaces the global reward $R$ in REINFORCE by the individual reward:

$\hat{R} := \left(\sum_{k=1}^p (-1)^{k+1} \frac{1}{k!} \hat{r}^{(k)}(H)\right) H.$  (33)

We call the learning rule of updating $b$ by $\Delta_{wm\text{-}p} b$ in (32) $p$-order Weight Maximization. Note that first-order Weight Maximization is equivalent to the Weight Maximization discussed in Section 2.4.

Computing $\hat{R}$ requires estimates of $r^{(k)}(h)$ for $k = 1, 2, \ldots, p$. We discuss how to obtain these estimates next.

2.5.1 Natural Extension

Again, we use the natural extension of $r$. Let us first consider $r^{(1)}(h)$:

$r^{(1)}(h)$  (34)
$= \nabla_h \mathbb{E}[R \mid H = h]$  (35)
$= \nabla_h \sum_{d_{1:m}} \Pr(D_{1:m} = d_{1:m} \mid H = h)\, \mathbb{E}[R \mid D_{1:m} = d_{1:m}]$  (36)
$= \sum_{d_{1:m}} \Pr(D_{1:m} = d_{1:m} \mid H = h)\, \mathbb{E}[R \mid D_{1:m} = d_{1:m}]\, \nabla_h \log \Pr(D_{1:m} = d_{1:m} \mid H = h)$  (37)
$= \mathbb{E}[R\, \nabla_h \log \Pr(D_{1:m} \mid H = h) \mid H = h]$  (38)
$= \mathbb{E}\left[\sum_{i=1}^m R v_i (D_i - \sigma(v_i h + c_i)) \,\Big|\, H = h\right].$  (39)

Thus, an unbiased estimate of $r^{(1)}(h)$ is $\sum_{i=1}^m R v_i (D_i - \sigma(v_i h + c_i))$. If the outgoing units are hidden units and are trained by Weight Maximization, each receives a different reward $\hat{R}_i$ that can be used to replace the global reward $R$ in the estimate. If the outgoing units are output units, we let $\hat{R}_i = R$. Therefore, we have²:

$\mathbb{E}\left[\sum_{i=1}^m R v_i (D_i - \sigma(v_i h + c_i)) \,\Big|\, H = h\right] = \mathbb{E}\left[\sum_{i=1}^m \hat{R}_i v_i (D_i - \sigma(v_i h + c_i)) \,\Big|\, H = h\right].$  (40)

² For simplicity, let us first assume $\nabla_{c_i} \mathbb{E}[\hat{R}_i \mid H = h] = \nabla_{c_i} \mathbb{E}[R \mid H = h]$; i.e. we ignore the approximation error of Weight Maximization applied to the outgoing units. We can still use the global reward $R$ if the outgoing units are hidden units, but this leads to a higher variance.

Note that $\hat{R}_i (D_i - \sigma(v_i h + c_i))$ is also the update to the weight $v_i$ of outgoing unit $i$. To simplify notation, we define the following terms:

$s^{(k)} = \begin{cases} \sum_{i=1}^m \hat{R}_i v_i (D_i - \sigma(v_i h + c_i)), & \text{if } k = 1 \\ \sum_{i=1}^m \hat{R}_i v_i^k (-\sigma^{(k-1)}(v_i h + c_i)), & \text{if } k = 2, 3, \ldots \end{cases}$  (41)

$t^{(k)} = \begin{cases} \sum_{i=1}^m v_i (D_i - \sigma(v_i h + c_i)), & \text{if } k = 1 \\ \sum_{i=1}^m v_i^k (-\sigma^{(k-1)}(v_i h + c_i)), & \text{if } k = 2, 3, \ldots \end{cases}$  (42)

We see that $\hat{r}^{(1)}(h) := s^{(1)}$ is an unbiased estimator of $r^{(1)}(h)$. Note that $\nabla_h s^{(k)} = s^{(k+1)}$ and $\nabla_h t^{(k)} = t^{(k+1)}$. Next, consider $r^{(2)}(h)$:

$r^{(2)}(h) = \nabla_h r^{(1)}(h)$  (43)
$= \nabla_h \mathbb{E}[s^{(1)} \mid H = h]$  (44)
$= \mathbb{E}[\nabla_h s^{(1)} + s^{(1)} \nabla_h \log \Pr(D_{1:m} \mid H = h) \mid H = h]$  (45)
$= \mathbb{E}[s^{(2)} + s^{(1)} t^{(1)} \mid H = h].$  (46)

Therefore, $\hat{r}^{(2)}(h) := s^{(2)} + s^{(1)} t^{(1)}$ is an unbiased estimator of $r^{(2)}(h)$. Next, consider $r^{(3)}(h)$:

$r^{(3)}(h) = \nabla_h r^{(2)}(h)$  (47)
$= \nabla_h \mathbb{E}[s^{(2)} + s^{(1)} t^{(1)} \mid H = h]$  (48)
$= \mathbb{E}[\nabla_h (s^{(2)} + s^{(1)} t^{(1)}) + (s^{(2)} + s^{(1)} t^{(1)}) t^{(1)} \mid H = h]$  (49)
$= \mathbb{E}[s^{(3)} + 2 s^{(2)} t^{(1)} + s^{(1)} t^{(2)} + s^{(1)} (t^{(1)})^2 \mid H = h].$  (50)

Therefore, $\hat{r}^{(3)}(h) := s^{(3)} + 2 s^{(2)} t^{(1)} + s^{(1)} t^{(2)} + s^{(1)} (t^{(1)})^2$ is an unbiased estimator of $r^{(3)}(h)$. We can continue to compute unbiased estimators of $r^{(4)}(h)$, $r^{(5)}(h)$, and so on in the same way. See Appendix A for the general formula of an unbiased estimator of $r^{(p)}(h)$. These estimates are then plugged into (32) to compute $\hat{R}$.
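The sketch below assembles these quantities for one hidden unit and returns the individual reward of third-order Weight Maximization ($p = 3$ in (33)). The variable names are illustrative assumptions; the sigmoid derivatives needed for $s^{(k)}$ and $t^{(k)}$ with $k \leq 3$ are written out explicitly.

```python
import numpy as np

sigma   = lambda x: 1.0 / (1.0 + np.exp(-x))
dsigma  = lambda x: sigma(x) * (1 - sigma(x))                       # sigma'
d2sigma = lambda x: sigma(x) * (1 - sigma(x)) * (1 - 2 * sigma(x))  # sigma''

def third_order_reward(h, v, c, d, R_next):
    """Individual reward of 3rd-order Weight Maximization for one hidden unit.

    h: the unit's sampled activation (0 or 1); v, c: outgoing weights and biases (m,);
    d: sampled outgoing activations (m,); R_next: outgoing individual rewards (m,)."""
    z = v * h + c                                    # outgoing pre-activations
    s1 = np.sum(R_next * v * (d - sigma(z)))         # eq. (41), k = 1
    s2 = np.sum(R_next * v ** 2 * (-dsigma(z)))      # derivative of s1 w.r.t. h
    s3 = np.sum(R_next * v ** 3 * (-d2sigma(z)))
    t1 = np.sum(v * (d - sigma(z)))                  # eq. (42), k = 1
    t2 = np.sum(v ** 2 * (-dsigma(z)))

    r1 = s1                                          # estimate of r'(h)
    r2 = s2 + s1 * t1                                # estimate of r''(h), eq. (46)
    r3 = s3 + 2 * s2 * t1 + s1 * t2 + s1 * t1 ** 2   # estimate of r'''(h), eq. (50)

    return (r1 - r2 / 2 + r3 / 6) * h                # truncated Taylor sum, eq. (33)
```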

If the high-order derivatives of $g$ in (7) are bounded, i.e. there exists $B$ such that $|\partial_{i_1} \partial_{i_2} \cdots \partial_{i_n} g(hv + c)| < B$ for any $h \in [0, 1]$, $i_j \in \{1, 2, \ldots, m\}$ for $j \in \{1, 2, \ldots, n\}$, and $n > 0$, then

$r(1) - r(0) = \mathbb{E}\left[\sum_{k=1}^p (-1)^{k+1} \frac{1}{k!} \hat{r}^{(k)}(1) \,\Big|\, H = 1\right] + \mathcal{O}\!\left(\frac{\|v\|^{p+1}}{(p+1)!}\right).$  (51)

That is, the estimation bias is of the order of $\frac{\|v\|^{p+1}}{(p+1)!}$, which converges to $0$ as $p \to \infty$. The proof is based on the error bound of the Taylor approximation applied to $g$.

2.5.2 Unbounded Derivatives

Unfortunately, the high-order derivatives of $g$ are not bounded for a network of Bernoulli-logistic units. This is because of the sigmoid functions in $g$, as shown in (7), and the fact that the maximum magnitude of the $p$-order derivative of the sigmoid function grows without bound as $p$ increases (see Figure 2). For example, estimating $\sigma(4)$ by a Taylor approximation centered at $x = 0$ results in a diverging sequence as $p \to \infty$. Denoting $C_{p+1} := \max_x |\sigma^{(p+1)}(x)|$, the estimation bias of $p$-order Weight Maximization applied to a network of Bernoulli-logistic units is $\mathcal{O}\!\left(\frac{C_{p+1} \|v\|^{p+1}}{(p+1)!}\right)$, which is unbounded as $p \to \infty$ if $\|v\| > \pi$ (see Appendix B).

In experiments, we observe that high-order Weight Maximization performs similarly to low-order Weight Maximization at the beginning of training. But as the norm of the outgoing weight grows during training, the estimation bias becomes larger for higher $p$, and hence the performance of high-order Weight Maximization deteriorates quickly in the middle of training.

Using activation functions other than the sigmoid function, or using weight decay to prevent the outgoing weight's norm from growing too large, might prevent this issue. But there is another way to tackle it, which we discuss next.

Figure 2: (a) $\sigma^{(p)}(x)$ for different $p$; the maximum magnitude of $\sigma^{(p)}(x)$ grows with $p$. (b) Absolute error of the $p$-order Taylor approximation of $\sigma(x)$ at $x = 0$; a higher-order Taylor approximation leads to a larger error when $|x| \geq \pi$.

2.6 Unbiased Weight Maximization

Let us consider Figure 1 again. All the previous methods (except REINFORCE) are based on using the derivative of $r$ to approximate $r(1) - r(0)$. STE backprop approximates it by $r'(0)$ or $r'(1)$; Weight Maximization approximates it by $r'(1)$; high-order Weight Maximization approximates it by a high-order Taylor series of $r$ at $h = 1$. The first two methods assume $r(h)$ to be linear, which is not true in general; the last method may lead to a divergent Taylor series if the outgoing weight is too large. Is there another simple method to approximate $r(1) - r(0)$ based on $r'(h)$?

The fundamental theorem of calculus tells us that we can integrate $r'(h)$ over $h$ to get $r(1) - r(0)$:

$\int_0^1 r'(u)\, du = r(1) - r(0).$  (52)

This can in turn be written as:

$\mathbb{E}[r'(U)] = r(1) - r(0),$  (53)

where $U$ is an independent and uniformly distributed random variable. Thus, if we have an unbiased estimate of $r'(u)$ for any $u \in [0, 1]$, denoted by $\hat{r}'(u)$, we may use the following estimate instead:

$\Delta_{uwm} b = \hat{r}'(U)\, H\, (H - \sigma(b)),$  (54)

which is an unbiased estimator of $\nabla_b \mathbb{E}[r(H)]$:

Proof.

$\mathbb{E}[\Delta_{uwm} b] = \mathbb{E}[\hat{r}'(U) H (H - \sigma(b))]$  (55)
$= \mathbb{E}[\hat{r}'(U) H (H - \sigma(b)) \mid H = 1]\, \Pr(H = 1)$  (56)
$= \mathbb{E}[\hat{r}'(U)(1 - \sigma(b)) \mid H = 1]\, \Pr(H = 1)$  (57)
$= \mathbb{E}[\hat{r}'(U) \mid H = 1]\, (1 - \sigma(b))\, \sigma(b)$  (58)
$= \mathbb{E}[r'(U)]\, (1 - \sigma(b))\, \sigma(b)$  (59)
$= (r(1) - r(0))\, (1 - \sigma(b))\, \sigma(b)$  (60)
$= \nabla_b \mathbb{E}[r(H)].$  (61)

The last line follows from (2). ∎

Note that $\Delta_{uwm} b$ is very similar to $\Delta_{wm} b$ in (27): we only replace $\hat{r}'(H)$ in (27) with $\hat{r}'(U)$, i.e. we evaluate the gradient at a random point in $[0, 1]$ instead of at $H$.

We also define the individual reward $\hat{R} := \hat{r}'(U)H$, which is an unbiased estimator of $r(H) - r(0)$ conditioned on $H$:

$\mathbb{E}[\hat{R} \mid H] = \mathbb{E}[\hat{r}'(U) H \mid H]$  (62)
$= \begin{cases} 0, & \text{if } H = 0 \\ \mathbb{E}[\hat{r}'(U)], & \text{if } H = 1 \end{cases}$  (63)
$= r(H) - r(0).$  (64)

We call the learning rule of updating $b$ by $\Delta_{uwm} b$ in (54) unbiased Weight Maximization.
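A minimal sketch of the single-unit update follows, assuming access to the exact derivative of a toy differentiable extension $r(h)$ (the same hypothetical function used earlier); the exact derivative stands in for an unbiased estimate $\hat{r}'(u)$. Averaging many updates (54) recovers the exact gradient (2) even though $r$ is highly non-linear, which is where Weight Maximization and STE backprop fail.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = lambda x: 1.0 / (1.0 + np.exp(-x))

# Toy differentiable extension of r(h) (hypothetical, for illustration only).
r  = lambda h: 0.5 * h + np.cos(np.pi * h)
dr = lambda h: 0.5 - np.pi * np.sin(np.pi * h)

b = 0.3
exact = (r(1) - r(0)) * sigma(b) * (1 - sigma(b))   # eq. (2)

# Monte Carlo check that Delta_uwm b (eq. 54) is unbiased.
n = 200_000
h = (rng.random(n) < sigma(b)).astype(float)        # H ~ Bernoulli(sigma(b))
u = rng.random(n)                                   # U ~ Uniform[0, 1]
updates = dr(u) * h * (h - sigma(b))                # Delta_uwm b
print(exact, updates.mean())                        # agree up to Monte Carlo error
```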

2.6.1 Natural Extension

The only remaining question is how we can obtain an unbiased estimate of $r'(u)$ for any $u \in [0, 1]$ when the unit can only output values in $\{0, 1\}$. Again, let us consider the natural extension of $r$. By importance sampling, we have, for any $u \in [0, 1]$:

$\mathbb{E}\left[\frac{\Pr(D_{1:m} = d_{1:m} \mid H = u)}{\Pr(D_{1:m} = d_{1:m} \mid H)} \sum_{i=1}^m \hat{R}_i\, \nabla_u \log \Pr(D_i = d_i \mid H = u)\right] = r'(u),$  (65)

where $\hat{R}_i$ denotes the individual reward of outgoing unit $i$, which equals the global reward $R$ if the outgoing unit is an output unit, or the individual reward computed by unbiased Weight Maximization applied to the outgoing unit if it is a hidden unit. We denote the term inside the expectation by $\hat{r}'(u)$; it is an unbiased estimator of $r'(u)$ and gives an iterative formula for computing the individual reward of every hidden unit in a network.

Proof.

$\mathbb{E}\left[\frac{\Pr(D_{1:m} = d_{1:m} \mid H = u)}{\Pr(D_{1:m} = d_{1:m} \mid H)} \sum_{i=1}^m \hat{R}_i\, \nabla_u \log \Pr(D_i = d_i \mid H = u)\right]$  (66)
$= \sum_{h, d_{1:m}} \Pr(H = h)\, \Pr(D_{1:m} = d_{1:m} \mid H = u) \sum_{i=1}^m \mathbb{E}[\hat{R}_i \mid D_{1:m} = d_{1:m}, H = h]\, \nabla_u \log \Pr(D_i = d_i \mid H = u).$  (67)

If the outgoing units are output units, then $\hat{R}_i = R$ by definition, and (67) becomes:

$\sum_{d_{1:m}} \Pr(D_{1:m} = d_{1:m} \mid H = u) \sum_{i=1}^m \mathbb{E}[R \mid D_{1:m} = d_{1:m}]\, \nabla_u \log \Pr(D_i = d_i \mid H = u)$  (68)
$= \sum_{d_{1:m}} \Pr(D_{1:m} = d_{1:m} \mid H = u)\, \mathbb{E}[R \mid D_{1:m} = d_{1:m}] \sum_{i=1}^m \nabla_u \log \Pr(D_i = d_i \mid H = u)$  (69)
$= \nabla_u \sum_{d_{1:m}} \Pr(D_{1:m} = d_{1:m} \mid H = u)\, \mathbb{E}[R \mid D_{1:m} = d_{1:m}]$  (70)
$= \nabla_u \mathbb{E}[R \mid H = u]$  (71)
$= r'(u),$  (72)

where (68) uses the assumption that $R$ is independent of $H$ conditioned on $D_{1:m} = d_{1:m}$, since the network has a multi-layer structure.

If the outgoing units are hidden units, then $\hat{R}_i$ is computed by unbiased Weight Maximization applied to the outgoing units, and $\mathbb{E}[\hat{R}_i \mid D_i = d_i, H = h] = \mathbb{E}[R \mid D_i = d_i, H = h] - \mathbb{E}[R \mid D_i = 0, H = h]$ by (64). So (67) becomes:

$\sum_{d_{1:m}} \Pr(D_{1:m} = d_{1:m} \mid H = u) \sum_{i=1}^m \mathbb{E}[R \mid D_{1:m} = d_{1:m}]\, \nabla_u \log \Pr(D_i = d_i \mid H = u)$
$\quad - \sum_{d_{1:m}} \Pr(D_{1:m} = d_{1:m} \mid H = u) \sum_{i=1}^m \mathbb{E}[R \mid D_i = 0, D_{1:m \backslash i} = d_{1:m \backslash i}]\, \nabla_u \log \Pr(D_i = d_i \mid H = u).$  (73)

It suffices to show that the second term is zero, since the first term is (68):

$\sum_{d_{1:m}} \Pr(D_{1:m} = d_{1:m} \mid H = u) \sum_{i=1}^m \mathbb{E}[R \mid D_i = 0, D_{1:m \backslash i} = d_{1:m \backslash i}]\, \nabla_u \log \Pr(D_i = d_i \mid H = u)$  (74)
$= \sum_{d_{1:m \backslash i}} \Pr(D_{1:m \backslash i} = d_{1:m \backslash i} \mid H = u) \sum_{i=1}^m \mathbb{E}[R \mid D_i = 0, D_{1:m \backslash i} = d_{1:m \backslash i}]\, \nabla_u \sum_{d_i} \Pr(D_i = d_i \mid H = u)$  (75)
$= 0.$  (76)

Note that the event $H = u$ is not well defined, since $H$ can only take binary values; $\Pr(D_{1:m} = d_{1:m} \mid H = u)$ is therefore defined by allowing $H$ to pass any value in $[0, 1]$ to its outgoing units (i.e. by extending $h$ to $[0, 1]$ in (7)). ∎

$\hat{r}'(u)$ can then be used to compute $\hat{R}$ as $\hat{r}'(u)H$, and $\Delta_{uwm} b$ can then be computed to update $b$.

Note that we can also use multiple i.i.d. samples $U_1, U_2, \ldots, U_M$ or numerical integration³ to estimate $\mathbb{E}[\hat{r}'(U)]$ and thereby reduce variance, but this did not give better performance in our experiments.

³ Analytical integration is possible iff each outgoing unit receives the same reward.
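For a whole layer, the individual rewards can be computed as sketched below. This is an illustrative implementation under assumed conventions (fully connected layer, weight matrix V of shape (m, n), biases c, sampled activations h and d, outgoing individual rewards R_next); it follows the Unbiased Weight Max row of Table 1, with one uniformly sampled $U$ and one importance ratio per unit.

```python
import numpy as np

sigma = lambda x: 1.0 / (1.0 + np.exp(-x))

def unbiased_wm_rewards(h, V, c, d, R_next, rng):
    """Individual rewards R_hat for a hidden layer under unbiased Weight Maximization.

    h: this layer's sampled activations (n,); V: outgoing weights (m, n);
    c: outgoing biases (m,); d: sampled outgoing activations (m,);
    R_next: individual rewards of the outgoing units (m,)."""
    a = V @ h + c                                    # pre-activations of the outgoing units
    R_hat = np.zeros(h.shape[0])
    for j in range(h.shape[0]):
        u = rng.random()                             # U ~ Uniform[0, 1]
        z = a + V[:, j] * (u - h[j])                 # pre-activations with h[j] replaced by u
        # importance ratio Pr(D | H_j = u) / Pr(D | H_j = h[j])
        ratio = np.prod(sigma((2 * d - 1) * z) / sigma((2 * d - 1) * a))
        # sum_i R_hat_i * d/du log Pr(D_i = d_i | H_j = u)
        score = np.sum(R_next * V[:, j] * (d - sigma(z)))
        R_hat[j] = ratio * score * h[j]              # R_hat := r_hat'(U) * H
    return R_hat

# The bias update for this layer is then delta_b = R_hat * (h - sigma(b)), and the
# rewards are propagated layer by layer, starting from the global reward R at the output.
```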

2.7 Relationship between the different forms of Weight Maximization

We have discussed three different forms of Weight Maximization: Weight Maximization in Section 2.4, high-order Weight Maximization in Section 2.5, and unbiased Weight Maximization in Section 2.6. The only difference between these algorithms is how the individual reward $\hat{R}$ is computed, as shown in Table 1. This individual reward $\hat{R}$ replaces the global reward $R$ in the REINFORCE learning rule.

Weight Maximization is equivalent to $p$-order Weight Maximization with $p = 1$, as discussed previously. Weight Maximization is also equivalent to unbiased Weight Maximization if we set $U = H$ instead of sampling $U$ uniformly from $[0, 1]$. In other words, Weight Maximization evaluates the gradient $r'(h)$ at the endpoints, while unbiased Weight Maximization evaluates the gradient $r'(h)$ at a random point in $[0, 1]$.

In addition, it can be shown that if the Taylor series of $r$ converges, then as $p \to \infty$ the update given by $p$-order Weight Maximization converges to that of a version of unbiased Weight Maximization that directly computes $\mathbb{E}[\hat{r}'(U)]$ instead of estimating it with $\hat{r}'(U)$. The proof is quite long and is omitted here. From this we see that Weight Maximization can also be seen as a first-order Taylor approximation of unbiased Weight Maximization.

Though we keep 'Weight Maximization' in the names of these learning rules to emphasize their origin, it should be noted that these extensions can no longer be seen as maximizing the norm of the outgoing weight, as their individual rewards are computed differently from those of Weight Maximization.

Table 1: Comparison of how the individual reward $\hat{R}$ is computed, and the estimation bias of each learning rule.

  • REINFORCE: $\hat{R} = R$; estimation bias $0$.
  • Weight Max: $\hat{R} = \sum_{i=1}^m \hat{R}_i H \nabla_H \log \Pr(D_i = d_i \mid H)$; estimation bias $\mathcal{O}(\|v\|^2)$.
  • $p$-order Weight Max: $\hat{R} = \left(\sum_{k=1}^p (-1)^{k+1} \frac{1}{k!} \hat{r}^{(k)}(H)\right) H$, where $\hat{r}^{(k)}$ is defined by (82); estimation bias $\mathcal{O}\!\left(\frac{C_{p+1} \|v\|^{p+1}}{(p+1)!}\right)$.
  • Unbiased Weight Max: $\hat{R} = \frac{\Pr(D_{1:m} = d_{1:m} \mid H = U)}{\Pr(D_{1:m} = d_{1:m} \mid H)} \sum_{i=1}^m \hat{R}_i H \nabla_U \log \Pr(D_i = d_i \mid H = U)$; estimation bias $0$.

3 Experiments

To test the above algorithms, we consider the $k$-bit multiplexer task. The input is sampled uniformly from all possible values of a binary vector of size $k + 2^k$. The output set is $\{-1, 1\}$, and we give a reward of $+1$ if the network's output equals the multiplexer's output and $-1$ otherwise. We consider $k = 4$ here, so the dimension of the input space is $20$. We call a single interaction of the network, from receiving an input to receiving the reward, an episode; the series of episodes over which the network's performance increases during training is called a run.
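A sketch of one episode of the task follows; the encoding (address bits first, selected data bit mapped to $\{-1, +1\}$) is an assumption about the standard multiplexer setup and may differ in detail from the exact implementation used in the report.

```python
import numpy as np

def multiplexer_episode(k, rng):
    """Sample one input of the k-bit multiplexer task and its target output."""
    x = rng.integers(0, 2, size=k + 2 ** k)      # k address bits followed by 2^k data bits
    address = int("".join(str(bit) for bit in x[:k]), 2)
    target = 2 * int(x[k + address]) - 1         # selected data bit, mapped to {-1, +1}
    return x, target

def reward(network_output, target):
    return 1.0 if network_output == target else -1.0

rng = np.random.default_rng(0)
x, target = multiplexer_episode(4, rng)          # k = 4: input dimension 4 + 16 = 20
```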

We used a three-layer network of Bernoulli-logistic units, with the first and second hidden layers having $N \in \{8, 16, 32, 48, 64, 96\}$ units each and the output layer having a single unit. The step size for gradient ascent was $0.005$, and the batch size was $16$ (i.e. we estimated the gradient for $16$ episodes and averaged the estimates in each update). We used the Adam optimizer [5]. These hyperparameters were chosen prior to the experiments without any tuning.

The average reward over the first 5e6 episodes for networks with different $N$ and learning rules is shown in Figure 3. The learning curve (i.e. the reward of each episode throughout a run) for each $N$ is shown in Figure 4. Note that the results are averaged over 5 independent runs, and the error bars represent the standard deviation over runs.

Figure 3: Average reward over the first 5e6 episodes in the multiplexer task for different $N$. Results are averaged over 5 independent runs, and the error bar represents the standard deviation over the runs.

Figure 4: Running average of the reward in each episode throughout a run in the multiplexer task for different $N$. Results are averaged over 5 independent runs, and the error bar represents the standard deviation over the runs.

We observe that:

  • REINFORCE: When $N = 8$, REINFORCE gives the best performance in terms of both learning speed and asymptotic performance: it is easy to correlate units' activation values with the reward when the network is small. But as $N$ grows beyond 16, the learning curve changes little: REINFORCE does not benefit from a larger network due to the increased noise.

  • STE Backprop: When $N$ is small, STE backprop has poor asymptotic performance: when there are only a few units, the norm of the weights has to be large, so the network has very different dynamics from an ANN with deterministic sigmoid units, which is essentially the assumption underlying STE backprop. When the network is large, the dynamics become similar, and STE backprop works well.

  • Weight Maximization and second-order Weight Maximization: Both learn fast in the first 1e6 episodes, but performance quickly deteriorates afterwards, since the estimation bias grows as the norm of the weights grows during training. Note that weight decay is required to make Weight Maximization work well, but we did not use weight decay in our experiments as it requires extensive tuning.

  • Unbiased Weight Maximization: It works well in terms of both learning speed and asymptotic performance for all $N$. The unbiased property makes it work well when the network is small, and the use of individual rewards makes it scale well to larger $N$.

We have done additional experiments to understand the different learning rules better. Experiments on $p$-order Weight Maximization for $p \in \{1, 2, \ldots, 8\}$ can be found in Appendix C.1. Experiments on variants of unbiased Weight Maximization with reduced variance can be found in Appendix C.2. Experiments comparing with continuous-valued ANNs trained by backprop can be found in Appendix C.3. Finally, experiments on the estimation bias and variance of the different estimators can be found in Appendix C.4.

4 Discussion

Both the theoretical and experimental results for unbiased Weight Maximization are promising. Among all the learning rules considered in this report, it is the only unbiased learning rule whose performance scales well as the network grows larger. To our knowledge, it is also the first learning rule for a network of Bernoulli-logistic units that has this property.

However, there are a few drawbacks of unbiased Weight Maximization. First, the computational cost of computing the individual rewards for a layer is $\mathcal{O}(m^2 n)$, where $m$ denotes the number of units on that layer and $n$ denotes the number of units on the next layer, since we have to compute the sampling ratio for each unit and each computation of the sampling ratio requires $\mathcal{O}(mn)$. In contrast, the computational cost of computing the individual reward or feedback signal for a layer is only $\mathcal{O}(mn)$ in Weight Maximization or STE backprop, and no such computation is required for REINFORCE. Second, the computation of the individual reward in unbiased Weight Maximization is too involved for biological neurons: it requires all units in a layer to coordinate to compute the sampling ratios for the individual reward signals sent to the upstream units. REINFORCE and Weight Maximization do not have this issue. Weight Maximization works well when the outgoing weight's norm is small, and the efficacy of synaptic transmission in biological neurons cannot be unbounded, so Weight Maximization, which can be seen as a first-order approximation of unbiased Weight Maximization, may already be sufficient for biological neurons.

An important and fundamental question in this study is the motivation for discrete-valued ANNs in general. A continuous-valued ANN trained by backprop learns faster and has better asymptotic performance than the discrete-valued ANNs trained by any of the learning rules considered in this report (see Appendix C.3), so the motivation for discrete-valued ANNs remains to be explored.

In conclusion, this report analyzes different learning rules for discrete-valued ANNs and proposes two new learning rules, called high-order Weight Maximization and unbiased Weight Maximization, that extend Weight Maximization. Possible future work includes:

  • Perform more experiments to understand the advantages or disadvantages of unbiased Weight Maximization (e.g. generalization error, dynamics of exploration) on standard supervised learning tasks and reinforcement learning tasks.

  • Generalize unbiased Weight Maximization to other discrete-valued units besides Bernoulli-logistic units, such as softmax units.

  • Adjust unbiased Weight Maximization to encourage exploration, e.g. by adding terms to the learning rule that encourage exploration, such as the use of $\lambda$ in $A_{r-p}$ [6].

  • Explore methods to reduce the computational cost of unbiased Weight Maximization or simplify unbiased Weight Maximization while maintaining the performance or the unbiased property.

  • Develop an unbiased version of STE backprop, which is straightforward using the idea of unbiased Weight Maximization since both rely on estimates of r(1)r(0)r(1)-r(0).

5 Afterword: my view on discrete-valued ANNs

The dynamics of discrete-valued ANNs are very different from those of continuous-valued ANNs and may lead to a much more powerful network. For example, a restricted Boltzmann machine (RBM) with continuous-valued hidden and visible units can only model multivariate Gaussian distributions, but an RBM with binary-valued hidden units (which become Bernoulli-logistic units) and continuous-valued visible units can model any distribution. Thus, by restricting the activation values of units to binary values, we paradoxically get a more powerful network. We also observe this phenomenon in some recent advances in deep learning. The attention module in Transformers [7], which led to great success in natural language processing (NLP) and underlies virtually all models in NLP nowadays, can be seen as a pseudo-discrete operation: the attention vector from a trained attention module is close to a one-hot vector, so the network can 'attend' to different parts of the input in a discrete manner. Nonetheless, due to the lack of methods to train discrete-valued ANNs efficiently, we can only convert discrete operations to differentiable operations, e.g. by outputting expected values instead of sampled values as in the attention modules of Transformers, and train them by backprop. However, this may not be desirable: when asked to think about our favorite fruit, we would not think about an apple with $0.5$ intensity, an orange with $0.3$ intensity, and a banana with $0.2$ intensity at the same time; we think of just one fruit at any moment! Clearly, we do not learn by making every thought differentiable. From a microscopic perspective, the output of a single biological neuron is also not differentiable, as it either fires or not. Therefore, it may be better to find methods to directly train discrete operations embedded in an ANN instead of trying to design differentiable variants of them. Some discussions of the benefits of discrete-valued representations can be found in [8, 9]. To conclude, the assumption that everything is differentiable, which underlies virtually all recent advances in deep learning, should be questioned.

6 Acknowledgment

We would like to thank Andrew G. Barto, who inspired this research and provided valuable insights and comments.

Appendix A A general formula for $r^{(p)}(h)$

To obtain a general formula for an unbiased estimator of $r^{(p)}(h)$, we first define $P(k)$ to be the set of partitions of the integer $k$, with each partition represented by a tuple counting the number of appearances of each part. For example, $4$ can be partitioned as $4$, $3+1$, $2+2$, $2+1+1$, and $1+1+1+1$. To encode $2+1+1$, note that $1$ appears twice and $2$ appears once, so it is encoded as $(2, 1, 0, 0)$. Hence $P(4) = \{(0, 0, 0, 1), (1, 0, 1, 0), (0, 2, 0, 0), (2, 1, 0, 0), (4, 0, 0, 0)\}$. Then we define $t_f^{(k)}$, for $k = 1, 2, \ldots$, by:

$t_f^{(k)} = \sum_{i \in P(k)} \frac{k!}{i_1!\, i_2! \cdots i_k!} \left(\frac{t^{(1)}}{1!}\right)^{i_1} \left(\frac{t^{(2)}}{2!}\right)^{i_2} \cdots \left(\frac{t^{(k)}}{k!}\right)^{i_k},$  (77)

and $t_f^{(0)} = 1$. For example, for $k = 1$ to $4$:

$t_f^{(1)} = t^{(1)},$  (78)
$t_f^{(2)} = t^{(2)} + (t^{(1)})^2,$  (79)
$t_f^{(3)} = t^{(3)} + 3 t^{(1)} t^{(2)} + (t^{(1)})^3,$  (80)
$t_f^{(4)} = t^{(4)} + 4 t^{(1)} t^{(3)} + 3 (t^{(2)})^2 + 6 (t^{(1)})^2 t^{(2)} + (t^{(1)})^4.$  (81)

Then $r^{(p)}(h)$ can be expressed as follows (provable by induction; the proof is omitted here):

$r^{(p)}(h) = \mathbb{E}\left[\sum_{k=1}^p C^{p-1}_{k-1} s^{(k)} t_f^{(p-k)} \,\Big|\, H = h\right],$  (82)

where $s^{(k)}$ and $t^{(k)}$ are defined in (41) and (42). Therefore, the general formula for estimating $r^{(p)}(h)$ is $\hat{r}^{(p)}(h) := \sum_{k=1}^p C^{p-1}_{k-1} s^{(k)} t_f^{(p-k)}$. For example, for $p = 1$ to $4$ (note that these are the same as the estimates derived previously):

$\hat{r}^{(1)}(h) = s^{(1)},$  (83)
$\hat{r}^{(2)}(h) = s^{(2)} + s^{(1)} t_f^{(1)},$  (84)
$\hat{r}^{(3)}(h) = s^{(3)} + 2 s^{(2)} t_f^{(1)} + s^{(1)} t_f^{(2)},$  (85)
$\hat{r}^{(4)}(h) = s^{(4)} + 3 s^{(3)} t_f^{(1)} + 3 s^{(2)} t_f^{(2)} + s^{(1)} t_f^{(3)}.$  (86)
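A small sketch of the general formula follows: it enumerates integer partitions to evaluate $t_f^{(k)}$ as in (77) and combines the $s^{(k)}$ and $t^{(k)}$ values (given here as plain dictionaries, an illustrative convention) into $\hat{r}^{(p)}(h)$ as in (82). A check against (80) is included.

```python
from math import comb, factorial

def partitions(k, max_part=None):
    """Yield the integer partitions of k as lists of parts (largest first)."""
    max_part = k if max_part is None else max_part
    if k == 0:
        yield []
        return
    for first in range(min(k, max_part), 0, -1):
        for rest in partitions(k - first, first):
            yield [first] + rest

def t_f(k, t):
    """t_f^(k) as in eq. (77); t maps each order j to the value t^(j). t_f^(0) = 1."""
    if k == 0:
        return 1.0
    total = 0.0
    for parts in partitions(k):
        counts = [parts.count(j) for j in range(1, k + 1)]   # i_1, ..., i_k
        denom = 1
        for j, i_j in enumerate(counts, start=1):
            denom *= factorial(i_j) * factorial(j) ** i_j
        term = factorial(k) // denom
        for j, i_j in enumerate(counts, start=1):
            if i_j:
                term *= t[j] ** i_j
        total += term
    return total

def r_hat(p, s, t):
    """General estimator r_hat^(p) = sum_k C(p-1, k-1) s^(k) t_f^(p-k), eq. (82)."""
    return sum(comb(p - 1, k - 1) * s[k] * t_f(p - k, t) for k in range(1, p + 1))

# Check against eq. (80): t_f^(3) = t^(3) + 3 t^(1) t^(2) + (t^(1))^3.
t = {1: 0.3, 2: -0.2, 3: 0.1}
assert abs(t_f(3, t) - (t[3] + 3 * t[1] * t[2] + t[1] ** 3)) < 1e-12
```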

Appendix B A general formula for $\sigma^{(p)}(x)$

The general formula for $\sigma^{(p)}(x)$ is given by:

$\sigma^{(p)}(x) = \sum_{k=1}^p (-1)^{k-1} A_{p, k-1}\, \sigma(x)^k (1 - \sigma(x))^{p+1-k},$  (87)

where the $A_{p, k-1}$ are the Eulerian numbers. For the proof, see [10].
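A sketch of (87) follows, computing the Eulerian numbers with their standard recurrence and checking the result against the closed forms of the first two sigmoid derivatives.

```python
import numpy as np
from functools import lru_cache

@lru_cache(maxsize=None)
def eulerian(n, k):
    """Eulerian number A(n, k) via A(n, k) = (k+1) A(n-1, k) + (n-k) A(n-1, k-1)."""
    if k < 0 or k > max(n - 1, 0):
        return 0
    if n == 0:
        return 1
    return (k + 1) * eulerian(n - 1, k) + (n - k) * eulerian(n - 1, k - 1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(p, x):
    """p-order derivative of the sigmoid, following eq. (87)."""
    s = sigmoid(x)
    return sum((-1) ** (k - 1) * eulerian(p, k - 1) * s ** k * (1 - s) ** (p + 1 - k)
               for k in range(1, p + 1))

# Sanity check against sigma' = s(1 - s) and sigma'' = s(1 - s)(1 - 2s):
x = 0.7
s = sigmoid(x)
assert np.isclose(sigmoid_derivative(1, x), s * (1 - s))
assert np.isclose(sigmoid_derivative(2, x), s * (1 - s) * (1 - 2 * s))
```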

To see that $\frac{C_{p+1} \|v\|^{p+1}}{(p+1)!}$ is unbounded if $\|v\| > \pi$, it suffices to prove that, for $|x| > \pi$,

$\lim_{n \to \infty} \left|\frac{\sigma^{(2n-1)}(0)\, x^{2n-1}}{(2n-1)!}\right| = +\infty.$  (88)

Proof.

First, we can write $\sigma^{(p)}(0)$ as:

$\sigma^{(p)}(0) = \sum_{k=1}^p (-1)^{k-1} A_{p, k-1}\, \sigma(0)^k (1 - \sigma(0))^{p+1-k}$  (89)
$= \sum_{k=1}^p (-1)^{k-1} A_{p, k-1}\, 2^{-(p+1)}$  (90)
$= 2^{-(p+1)} \sum_{k=1}^p (-1)^{k-1} A_{p, k-1}$  (91)
$= \frac{(2^{p+1} - 1) B_{p+1}}{p+1},$  (92)

where $B_{p+1}$ denotes the Bernoulli number. Substituting back, we have

$\lim_{n \to \infty} \left|\frac{\sigma^{(2n-1)}(0)\, x^{2n-1}}{(2n-1)!}\right| = \lim_{n \to \infty} \left|\frac{(2^{2n} - 1) B_{2n}\, x^{2n-1}}{(2n)!}\right|$  (93)
$\geq \lim_{n \to \infty} \left|\frac{(2^{2n} - 1)\, 2\, (2n)!\, x^{2n-1}}{(2\pi)^{2n} (1 - 2^{-2n}) (2n)!}\right|$  (94)
$\geq \lim_{n \to \infty} \left|\frac{(2^{2n} - 1)\, 2\, x^{2n-1}}{(2\pi)^{2n}}\right|$  (95)
$= \lim_{n \to \infty} \left|\frac{2}{x}\left(\frac{x}{\pi}\right)^{2n}\left(1 - \frac{1}{2^{2n}}\right)\right|,$  (96)

which is unbounded if $|x| > \pi$. (94) uses $|B_{2n}| > \frac{2\,(2n)!}{(2\pi)^{2n}(1 - 2^{-2n})}$ [11]. ∎

Appendix C Additional Experiments

C.1 High-order Weight Maximization

We repeat the experiment of Section 3 for $p$-order Weight Maximization, with $p \in \{1, 2, \ldots, 8\}$ and $N = 64$. The result is shown in Figure 5.

We see that $p$-order Weight Maximization performs worse for larger $p$. Manual inspection shows that the individual reward computed by high-order Weight Maximization can be very large when the outgoing weight's norm is large. This is essentially the problem of approximating the sigmoid function $\sigma(x)$ by a Taylor series evaluated at $x = a$, which diverges when $|x - a|$ is too large.

C.2 Variants of unbiased Weight Maximization

In unbiased Weight Maximization we estimate $\mathbb{E}[r'(U)]$ with $r'(U)$, a Monte Carlo estimate with a single sample. An estimate with lower variance is $\frac{1}{M}\sum_{i=1}^M r'(U_i)$, where the $U_i$ are i.i.d. and uniformly distributed. Alternatively, we can estimate $\mathbb{E}[r'(U)]$ by numerical integration using the rectangle rule on $M$ subintervals. To test these methods, we repeat the experiment of Section 3 with $M \in \{1, 5, 10\}$ for both Monte Carlo (MC) and numerical integration (NI), using $N = 64$. The result is shown in Figure 6.
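The two estimators of $\mathbb{E}[r'(U)] = r(1) - r(0)$ are sketched below for a toy derivative $r'$ (a hypothetical function used only for illustration); the rectangle rule is written here in its midpoint form.

```python
import numpy as np

rng = np.random.default_rng(0)

def estimate_mc(dr, M):
    """Monte Carlo estimate of E[r'(U)] with M i.i.d. uniform samples."""
    return np.mean([dr(rng.random()) for _ in range(M)])

def estimate_rectangle(dr, M):
    """Rectangle (midpoint) rule on M subintervals of [0, 1]."""
    midpoints = (np.arange(M) + 0.5) / M
    return np.mean(dr(midpoints))

# Toy derivative of a hypothetical extension r(h) = 0.5 h + cos(pi h);
# both estimates approximate r(1) - r(0) = -1.5.
dr = lambda u: 0.5 - np.pi * np.sin(np.pi * u)
print(estimate_mc(dr, 10), estimate_rectangle(dr, 10))
```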

We see that the learning curves are almost identical: these variance-reduction methods do not yield an observable improvement in performance. Perhaps the reduction in variance is not significant enough to affect performance.

C.3 Comparison with continuous-valued ANNs

We repeat the experiment of Section 3 for continuous-valued ANNs trained by backprop: the output unit is still a Bernoulli-logistic unit trained by REINFORCE, but all hidden units are deterministic sigmoid units trained by backprop. The network architecture is the same as for the discrete-valued ANNs. Thus, the continuous-valued ANN considered here is equivalent to the discrete-valued ANN in our experiments, except that the hidden units output expected values instead of sampled values. The average rewards, analogous to Figure 3 and Figure 4, are shown in Figure 7 and Figure 8.

We observe that continuous-valued ANNs trained by backprop perform better in terms of both learning speed and asymptotic performance than discrete-valued ANNs, though the difference narrows for larger networks. This is likely because units in discrete-valued ANNs can only communicate binary values, which constrains the amount of information passed between units; e.g. a unit can only express whether there is an edge, but not the probability that there is an edge. As the network size increases, more binary units can detect the same edge, and the average value of these units can estimate the probability that there is an edge, so the difference in performance narrows.

More experiments can be conducted to evaluate and compare discrete-valued ANNs and continuous-valued ANNs on other aspects besides learning speed and network capacity, such as generalization error in supervised learning tasks and dynamics of exploration in reinforcement learning tasks.

C.4 Estimation bias and variance of the estimators

All the learning rules discussed can be seen as using different estimators of the gradient of the expected reward $\mathbb{E}[r(H)]$ w.r.t. the bias $b$ of a unit. To evaluate the quality of these estimators, we compute their estimation bias and variance on some randomly selected distributions.

We consider a five-layer network of Bernoulli-logistic units. The first hidden layer and the output layer have a single unit each; the other hidden layers have four units each. The weights and biases of these units are uniformly sampled from $[-C, +C]$, where $C > 0$. The task considered has no inputs, and the rewards for the two binary outputs are uniformly sampled from $[-10, +10]$. We then apply the estimators to estimate the gradient of $\mathbb{E}[r(H)]$ w.r.t. the bias of the unit on the first layer. The estimation bias and variance of the different estimators can be computed analytically since the network is small.

The estimation bias and variance for networks with different parameter ranges $C$ are shown in Figure 9 and Figure 10. We observe that when $C$ is small, the estimation bias of $p$-order Weight Maximization is lower for higher $p$: in that range, the Taylor series of the sigmoid function converges. However, when $C$ is large, the Taylor series diverges, and the estimation bias is larger for higher $p$. As for variance, REINFORCE has a steady variance regardless of $C$, since the variance of REINFORCE is bounded by the square of the expected reward. The variance of unbiased Weight Maximization increases fastest with $C$, since the sampling ratio in unbiased Weight Maximization can be huge when the weights are large. Nonetheless, the large variance of unbiased Weight Maximization seems to have little impact on the learning curve. This may be explained by the fact that those huge sampling ratios occur with probability close to zero and so may never be encountered during training. The relationship between variance and learning speed deserves further investigation.

Figure 5: Running average of the reward in each episode throughout a run in the multiplexer task for high-order Weight Maximization. Results are averaged over 5 independent runs, and the error bar represents the standard deviation over the runs.

Figure 6: Running average of the reward in each episode throughout a run in the multiplexer task for variants of unbiased Weight Maximization. Results are averaged over 5 independent runs, and the error bar represents the standard deviation over the runs.

Figure 7: Average reward over the first 5e6 episodes in the multiplexer task for different $N$, for both discrete-valued and continuous-valued ANNs. Results are averaged over 5 independent runs, and the error bar represents the standard deviation over the runs.

Figure 8: Running average of the reward in each episode throughout a run in the multiplexer task for different $N$, for both discrete-valued and continuous-valued ANNs. Results are averaged over 5 independent runs, and the error bar represents the standard deviation over the runs.

Figure 9: Estimation bias of different estimators in a network with different parameter ranges $C$. REINFORCE and unbiased Weight Maximization have zero estimation bias and are therefore not shown on the graph. Results are averaged over 20 randomly selected distributions.

Figure 10: Variance of different estimators in a network with different parameter ranges $C$. Results are averaged over 20 randomly selected distributions.

References

  • [1] Richard S Sutton and Andrew G Barto “Reinforcement learning: An introduction” MIT press, 2018
  • [2] Stephen Chung “Learning by competition of self-interested reinforcement learning agents” In Proceedings of the AAAI Conference on Artificial Intelligence 36.6, 2022, pp. 6384–6393
  • [3] Ronald J Williams “Simple statistical gradient-following algorithms for connectionist reinforcement learning” In Machine learning 8.3-4 Springer, 1992, pp. 229–256
  • [4] Yoshua Bengio, Nicholas Léonard and Aaron Courville “Estimating or propagating gradients through stochastic neurons for conditional computation” In arXiv preprint arXiv:1308.3432, 2013
  • [5] Diederik P Kingma and Jimmy Ba “Adam: A method for stochastic optimization” In arXiv preprint arXiv:1412.6980, 2014
  • [6] Andrew G Barto “Learning by statistical cooperation of self-interested neuron-like computing elements” In Human Neurobiology 4.4, 1985, pp. 229–256
  • [7] Ashish Vaswani et al. “Attention is all you need” In Advances in neural information processing systems 30, 2017
  • [8] Lorenzo Ferrone and Fabio Massimo Zanzotto “Symbolic, distributed, and distributional representations for natural language processing in the era of deep learning: A survey” In Frontiers in Robotics and AI Frontiers, 2020, pp. 153
  • [9] Ruben Cartuyvels, Graham Spinks and Marie-Francine Moens “Discrete and continuous representations and processing in deep learning: looking forward” In AI Open 2 Elsevier, 2021, pp. 143–159
  • [10] Ali A Minai and Ronald D Williams “On the derivatives of the sigmoid” In Neural Networks 6.6 Elsevier, 1993, pp. 845–853
  • [11] Horst Alzer “Sharp bounds for the Bernoulli numbers” In Archiv der Mathematik 74.3 Springer, 2000, pp. 207–211