
Technical intro

1 Implementing the forward pass of a Hypernetwork with Transformers

1.1 Linear attention block

We first give a brief overview of the linear transformer architecture.

Given input tokens $E\in\mathbb{R}^{L\times d_{m}}$ for a sequence of length $L$, a transformer block consists of a self-attention layer followed by a multi-layer perceptron. The transformation is performed by first computing queries, keys and values $Q,K,V=EW_{q},EW_{k},EW_{v}$, with which we then update $E$ as

$E \leftarrow E+QK^{\top}VW_{P}$ (1)
$E \leftarrow E+\sigma(EW_{1})W_{2}$ (2)

where $W_{q},W_{k},W_{v}\in\mathbb{R}^{d_{m}\times d_{k}}$ and $W_{P}\in\mathbb{R}^{d_{k}\times d_{m}}$, as well as $W_{1}\in\mathbb{R}^{d_{m}\times d_{h}}$ and $W_{2}\in\mathbb{R}^{d_{h}\times d_{m}}$, are learnable parameter matrices, and $\sigma$ is a nonlinearity applied row-wise. In practice, there are $H$ heads that perform the attention operation in parallel, each with its own parameters $W_{q}^{(h)},W_{k}^{(h)},W_{v}^{(h)},W_{P}^{(h)}$, resulting in the following forward function

$E \leftarrow E+\sum_{h}Q^{(h)}K^{(h)\top}V^{(h)}W_{P}^{(h)}$ (3)
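
To make the block concrete, the following short NumPy sketch implements the multi-head linear attention block of Eqs. (1)-(3); the parameter shapes follow the text, while the concrete sizes, the random initialization and the choice of $\sigma$ as a ReLU are illustrative assumptions only:

import numpy as np

def linear_transformer_block(E, heads, W1, W2, sigma=lambda v: np.maximum(v, 0.0)):
    # E: (L, d_m) token matrix; heads: list of per-head (W_q, W_k, W_v, W_P).
    attn = np.zeros_like(E)
    for W_q, W_k, W_v, W_P in heads:
        Q, K, V = E @ W_q, E @ W_k, E @ W_v      # (L, d_k) each
        attn += (Q @ K.T) @ V @ W_P              # linear attention, no softmax (Eq. 3)
    E = E + attn                                 # residual update (Eq. 1, summed over heads)
    return E + sigma(E @ W1) @ W2                # position-wise MLP (Eq. 2)

# Illustrative sizes, not taken from the paper.
rng = np.random.default_rng(0)
L, d_m, d_k, d_h, H = 2, 16, 4, 32, 3
heads = [tuple(rng.normal(size=s) for s in
               [(d_m, d_k), (d_m, d_k), (d_m, d_k), (d_k, d_m)])
         for _ in range(H)]
W1, W2 = rng.normal(size=(d_m, d_h)), rng.normal(size=(d_h, d_m))
out = linear_transformer_block(rng.normal(size=(L, d_m)), heads, W1, W2)
print(out.shape)  # (2, 16)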

1.2 Construction

We now show a construction of the linear transformer above that allows it to implement the forward pass of a given hypernetwork for any input $x\in\mathbb{R}^{d}$ and latent $z\in\mathbb{R}^{M}$.

Hypernetwork

Let us consider the following linear hypernetwork:

$x,z \rightarrow A\sigma(\omega(z)x)$ (5)

where $\omega(z)=\sum_{m=1}^{M}z^{(m)}\Theta^{(m)}$ with $\Theta^{(m)}\in\mathbb{R}^{h\times d}$ for all $m$, and $A\in\mathbb{R}^{o\times h}$.
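
For reference, a direct NumPy implementation of this hypernetwork could look as follows; the concrete sizes and the ReLU choice for $\sigma$ are placeholder assumptions:

import numpy as np

def hypernetwork_forward(x, z, Theta, A, sigma=lambda v: np.maximum(v, 0.0)):
    # x: (d,), z: (M,), Theta: (M, h, d), A: (o, h).
    omega = np.einsum("m,mhd->hd", z, Theta)   # omega(z) = sum_m z^(m) Theta^(m)
    return A @ sigma(omega @ x)                # A sigma(omega(z) x), shape (o,)

rng = np.random.default_rng(0)
d, h, o, M = 5, 7, 3, 4
x, z = rng.normal(size=d), rng.normal(size=M)
Theta, A = rng.normal(size=(M, h, d)), rng.normal(size=(o, h))
print(hypernetwork_forward(x, z, Theta, A).shape)  # (3,)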

Token construction

We assume there are only $2$ tokens, $e_{1}=(x^{\top},0_{M},1_{h+o})^{\top}$ and $e_{2}=(0_{d},z^{\top},0_{h+o})^{\top}$, where $0_{k}$ and $1_{k}$ denote the $k$-dimensional row vectors of zeros and ones, respectively. The output will be computed on the residual stream of $e_{2}$.
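
In code, this token layout can be written as a small helper (a sketch; the function name and the NumPy representation are ours):

import numpy as np

def build_tokens(x, z, h, o):
    # Returns E = [e_1; e_2] with d_m = d + M + h + o columns.
    d, M = x.shape[0], z.shape[0]
    e1 = np.concatenate([x, np.zeros(M), np.ones(h + o)])   # (x, 0_M, 1_{h+o})
    e2 = np.concatenate([np.zeros(d), z, np.zeros(h + o)])  # (0_d, z, 0_{h+o})
    return np.stack([e1, e2])                               # shape (2, d_m)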

Linear attention

First, the attention layer computes the forward pass $\omega(z)x$. To do this, let us fix $H=M$ heads, $d_{q}=d_{k}=1$ and $d_{v}=h$. For each head $m$, we can construct the value matrix such that the first token has value vector $\Theta^{(m)}x$ while the second has $0$. By choosing the key and query matrices appropriately, the attention score between the first and second token can be made to be exactly $z^{(m)}$. By letting the projection matrix be constant across heads, the attention operation is then

$e_{2} \leftarrow e_{2}+\sum_{m=1}^{M}z^{(m)}(\Theta^{(m)}x)^{\top}W_{P}$ (6)

By appropriately choosing $W_{P}$, the residual stream of the second token then equals $(0_{d},z^{\top},(\omega(z)x)^{\top},0_{o})^{\top}$ after the attention layer.
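
The following NumPy sketch spells out one concrete choice of $W_{q}^{(m)},W_{k}^{(m)},W_{v}^{(m)}$ and $W_{P}$ matching this construction and checks the resulting residual stream of the second token; the dimensions and index placements are illustrative choices of ours:

import numpy as np

rng = np.random.default_rng(0)
d, M, h, o = 5, 4, 7, 3
d_m = d + M + h + o
x, z = rng.normal(size=d), rng.normal(size=M)
Theta = rng.normal(size=(M, h, d))

# Tokens as defined above.
e1 = np.concatenate([x, np.zeros(M), np.ones(h + o)])
e2 = np.concatenate([np.zeros(d), z, np.zeros(h + o)])
E = np.stack([e1, e2])

# Shared projection: write the h-dimensional value into the block after z.
W_P = np.zeros((h, d_m))
W_P[:, d + M:d + M + h] = np.eye(h)

attn = np.zeros_like(E)
for m in range(M):
    W_q = np.zeros((d_m, 1)); W_q[d + m, 0] = 1.0       # query of e_2 reads z^(m)
    W_k = np.zeros((d_m, 1)); W_k[d + M, 0] = 1.0       # key of e_1 reads a 1
    W_v = np.zeros((d_m, h)); W_v[:d, :] = Theta[m].T   # value of e_1 is Theta^(m) x
    Q, K, V = E @ W_q, E @ W_k, E @ W_v
    attn += (Q @ K.T) @ V @ W_P                         # head m adds z^(m) Theta^(m) x to e_2
E = E + attn

omega_x = np.einsum("m,mhd,d->h", z, Theta, x)          # omega(z) x
expected = np.concatenate([np.zeros(d), z, omega_x, np.zeros(o)])
print(np.allclose(E[1], expected))                      # True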

MLP

Finally, the MLP layer applies the correct nonlinearity $\sigma$ to $\omega(z)x$ and the readout weight $A$, writing the result $A\sigma(\omega(z)x)$ onto the remaining $0_{o}$ block of the residual stream.
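
Continuing the sketch above (same variables), one possible choice of MLP weights realizing this step, assuming $\sigma$ is a ReLU and $A$ is the hypernetwork readout, is:

sigma = lambda v: np.maximum(v, 0.0)        # assumed nonlinearity
A = rng.normal(size=(o, h))                 # hypernetwork readout weight

W_1 = np.zeros((d_m, h)); W_1[d + M:d + M + h, :] = np.eye(h)  # read the omega(z) x block
W_2 = np.zeros((h, d_m)); W_2[:, d + M + h:] = A.T             # write A sigma(.) to last o slots
E = E + sigma(E @ W_1) @ W_2

print(np.allclose(E[1, -o:], A @ sigma(omega_x)))              # True

The residual stream of $e_{2}$ then reads $(0_{d},z^{\top},(\omega(z)x)^{\top},(A\sigma(\omega(z)x))^{\top})^{\top}$, so its final $o$ coordinates contain the hypernetwork output of Eq. (5).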