1 Implementing the forward pass of a Hypernetwork with Transformers
1.1 Linear attention block
We first give a brief overview of the linear transformer architecture.
Given input tokens $e_1, \dots, e_T \in \mathbb{R}^{d_e}$ for a sequence of length $T$, stacked as the rows of a matrix $E \in \mathbb{R}^{T \times d_e}$, a transformer block consists of a self-attention layer followed by a multi-layer perceptron (MLP). The transformation is done by first computing queries, keys and values $Q = EW_Q$, $K = EW_K$ and $V = EW_V$, with which we then update $E$ as
$$E \;\leftarrow\; E + (QK^\top)\,V \tag{1}$$
$$E \;\leftarrow\; E + \sigma(EW_1)\,W_2 \tag{2}$$
where $W_Q, W_K, W_V$ as well as $W_1, W_2$ are learnable parameter matrices. The $\sigma$ is a non-linearity applied row-wise. In practice, there are $H$ heads that perform the first attention operation in parallel, each with its own parameters $W_Q^h, W_K^h, W_V^h, P^h$ for all $h \in [H]$, resulting in the following forward function
$$E \;\leftarrow\; E + \sum_{h=1}^{H} Q^h (K^h)^\top V^h P^h, \qquad Q^h = EW_Q^h,\; K^h = EW_K^h,\; V^h = EW_V^h, \tag{3}$$
where $P^h$ projects each head's output back into the residual stream.
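To make the block concrete, the following is a minimal NumPy sketch of equations (2) and (3) in matrix form, with tokens stacked as the rows of `E`. The function name, the tuple layout of `heads` and the default $\tanh$ non-linearity are our assumptions for illustration, not part of the text.

```python
import numpy as np

def linear_attention_block(E, heads, W1, W2, sigma=np.tanh):
    """One linear-transformer block; E has shape (T, d_e), tokens as rows.

    heads is a list of per-head parameter tuples (W_Q, W_K, W_V, P).
    Attention scores are raw dot products -- no softmax (linear attention).
    """
    # Multi-head linear self-attention with residual connection, eq. (3).
    E = E + sum((E @ W_Q) @ (E @ W_K).T @ (E @ W_V) @ P
                for (W_Q, W_K, W_V, P) in heads)
    # Token-wise MLP with residual connection, eq. (2).
    return E + sigma(E @ W1) @ W2
```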
1.2 Construction
We will now show a construction of the linear transformer above that allows it to implement the forward pass of a given hypernetwork for any input $x$ and latent $z$.
Hypernetwork
Let us consider the following linear hypernetwork:
$$f(x, z) = W\,\sigma\!\left(\sum_{h=1}^{H} z_h A_h x\right) \tag{5}$$
where $W \in \mathbb{R}^{d_y \times d_h}$, $A_h \in \mathbb{R}^{d_h \times d_x}$ for all $h \in [H]$, and $x \in \mathbb{R}^{d_x}$, $z \in \mathbb{R}^{H}$.
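A direct NumPy implementation of eq. (5) reads as follows; the concrete toy dimensions, the name `A` for the stacked matrices $A_h$ and the choice $\sigma = \tanh$ are illustrative assumptions.

```python
import numpy as np

# Toy dimensions (assumptions): input, latent, hidden and output sizes.
d_x, H, d_h, d_y = 3, 2, 4, 2
rng = np.random.default_rng(0)

A = rng.normal(size=(H, d_h, d_x))    # the matrices A_1, ..., A_H
W_out = rng.normal(size=(d_y, d_h))   # the readout weight W

def hypernet(x, z, sigma=np.tanh):
    """Forward pass of the linear hypernetwork in eq. (5)."""
    hidden = np.einsum('h,hij,j->i', z, A, x)   # sum_h z_h A_h x
    return W_out @ sigma(hidden)
```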
Token construction
We assume there are only $T = 2$ tokens, $e_1 = (x, 1, 0_H, 0_{d_h}, 0_{d_y})$ and $e_2 = (0_{d_x}, 0, z, 0_{d_h}, 0_{d_y})$, where $0_d$ indicates the $d$-dimensional zero row vector of the corresponding block; the constant entry $1$ in $e_1$ gives the keys access to a fixed scalar. The output will be computed on the token stream of $e_2$.
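Continuing the sketch above, the two tokens can be laid out as follows; the block ordering and the explicit constant coordinate are our choices for the illustration.

```python
# Residual-stream layout (our choice): [ x | 1 | z | hidden | output ].
d_e = d_x + 1 + H + d_h + d_y
x_blk = slice(0, d_x)
c_idx = d_x                                    # index of the constant entry
z_blk = slice(d_x + 1, d_x + 1 + H)
h_blk = slice(d_x + 1 + H, d_x + 1 + H + d_h)
y_blk = slice(d_x + 1 + H + d_h, d_e)

x, z = rng.normal(size=d_x), rng.normal(size=H)
e1 = np.zeros(d_e); e1[x_blk] = x; e1[c_idx] = 1.0   # token carrying x
e2 = np.zeros(d_e); e2[z_blk] = z                    # token carrying z
E = np.stack([e1, e2])                               # T = 2 tokens as rows
```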
Linear attention
First, the attention layer will compute the hidden pre-activation $\sum_{h=1}^{H} z_h A_h x$. To do this, let us fix $H$ heads, one per latent dimension, with key dimension $1$ and value dimension $d_h$. For each head $h$, we can construct the value matrix $W_V^h$ such that the first token has value vector $A_h x$ while the second has $0$. By choosing the key and query matrices correctly, the attention score between the first and second token can be made to be exactly $z_h$. By letting the projection matrix $P$ be constant across heads, the attention operation would then be
$$e_2 \;\leftarrow\; e_2 + \sum_{h=1}^{H} z_h\,(A_h x)^\top P \tag{6}$$
By appropriately choosing $P$, the residual stream of the second token would then equal $(0_{d_x}, 0, z, \sum_{h=1}^{H} z_h A_h x, 0_{d_y})$ after the attention layer.
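The head parameters of this construction can be written down explicitly. In the sketch below, continuing the snippets above, queries and keys are one-dimensional: the query of the second token reads out $z_h$, the key of the first token reads the constant entry, and the shared projection $P$ places each head's value in the hidden block.

```python
heads = []
P = np.zeros((d_h, d_e)); P[:, h_blk] = np.eye(d_h)      # shared projection into the hidden block
for h in range(H):
    W_Q = np.zeros((d_e, 1)); W_Q[d_x + 1 + h, 0] = 1.0  # query of e2 equals z_h
    W_K = np.zeros((d_e, 1)); W_K[c_idx, 0] = 1.0        # key of e1 equals the constant 1
    W_V = np.zeros((d_e, d_h)); W_V[x_blk, :] = A[h].T   # value of e1 equals A_h x
    heads.append((W_Q, W_K, W_V, P))

# The attention score from e2 to e1 is exactly z_h (checked here for h = 1) ...
Q, K = E @ heads[0][0], E @ heads[0][1]
assert np.isclose((Q @ K.T)[1, 0], z[0])

# ... so after attention the hidden block of e2 holds sum_h z_h A_h x, as in eq. (6).
E_attn = E + sum((E @ Wq) @ (E @ Wk).T @ (E @ Wv) @ P_
                 for (Wq, Wk, Wv, P_) in heads)
assert np.allclose(E_attn[1, h_blk], np.einsum('h,hij,j->i', z, A, x))
```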
MLP
Finally, the MLP layer simply applies the correct non-linearity $\sigma$ to the hidden block $\sum_{h=1}^{H} z_h A_h x$ and applies the readout weight $W$ to finally write the result $f(x, z)$ on the remaining $d_y$ zeros in the residual stream: $W_1$ selects the hidden block and $W_2$ embeds the output of $W$ into the last block of $e_2$.
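Putting everything together, the MLP weights select the hidden block and embed the readout, and the full block reproduces the hypernetwork output. This end-to-end check continues the sketches above and relies on the same illustrative assumptions.

```python
W1 = np.zeros((d_e, d_h)); W1[h_blk, :] = np.eye(d_h)   # W_1 selects the hidden block
W2 = np.zeros((d_h, d_e)); W2[:, y_blk] = W_out.T       # W_2 writes W*sigma(hidden) to the output block

E_out = linear_attention_block(E, heads, W1, W2, sigma=np.tanh)
assert np.allclose(E_out[1, y_blk], hypernet(x, z))     # e2's output block equals f(x, z)
```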