
Mitigating Transformer Overconfidence via Lipschitz Regularization
(Supplementary Material)

Wenqian Ye, Department of Computer Science, University of Virginia, Charlottesville, VA, USA; AI Lab, Shenzhen Children’s Hospital, Shenzhen, China
Yunsheng Ma, College of Engineering, Purdue University, West Lafayette, IN, USA; AI Lab, Shenzhen Children’s Hospital, Shenzhen, China
Xu Cao, Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL, USA; AI Lab, Shenzhen Children’s Hospital, Shenzhen, China
Kun Tang, T Lab, Tencent, Beijing, China

Appendix A Proof for the Lipschitz Constant of LayerNorm

The LayerNorm operation [layernorm] used in LRFormer can be expressed as:

$$\text{LN}(\mathbf{x})=\frac{\mathbf{x}-\mu(\mathbf{x})}{\sqrt{\sigma^{2}(\mathbf{x})+\epsilon}}*\boldsymbol{\gamma}+\boldsymbol{\beta}$$

where $\mathbf{x},\boldsymbol{\beta},\boldsymbol{\gamma}\in\mathbb{R}^{N}$, $\mu(\mathbf{x})=\frac{1}{N}\sum_{i=1}^{N}x_{i}$, and $\sigma^{2}(\mathbf{x})=\frac{1}{N}\sum_{i=1}^{N}(x_{i}-\mu(\mathbf{x}))^{2}$.

Without loss of generality, assume $N>2$ and that not all $x_{i}$ are equal.

The derivatives of $\mu$ and $\sigma^{2}$ with respect to $\mathbf{x}$ are:

$$\frac{\partial\mu}{\partial\mathbf{x}}=\frac{1}{N}\mathds{1}^{\top},\qquad \frac{\partial\sigma^{2}}{\partial\mathbf{x}}=\frac{2}{N}(\mathbf{x}-\mu)^{\top}$$

The derivative of $\text{LN}(\mathbf{x})_{i}$, the $i$th element of $\text{LN}(\mathbf{x})$, with respect to $\mathbf{x}$ is:

$$\frac{\partial\text{LN}(\mathbf{x})_{i}}{\partial\mathbf{x}}=\gamma_{i}(\sigma^{2}+\epsilon)^{-\frac{1}{2}}\left[\left(\mathbf{e}_{i}-\frac{1}{N}\mathds{1}\right)^{\top}-\frac{1}{N}(\sigma^{2}+\epsilon)^{-1}(x_{i}-\mu)(\mathbf{x}-\mu)^{\top}\right] \qquad (1)$$

where $\mathbf{e}_{i}\in\mathbb{R}^{N}$ is the one-hot vector with $1$ at the $i$th element. Therefore,

$$\frac{\partial\text{LN}(\mathbf{x})}{\partial\mathbf{x}}=(\sigma^{2}+\epsilon)^{-\frac{1}{2}}\left[\text{diag}(\boldsymbol{\gamma})-\frac{1}{N}\boldsymbol{\gamma}\mathds{1}^{\top}-\frac{1}{N}(\sigma^{2}+\epsilon)^{-1}\text{diag}(\boldsymbol{\gamma})(\mathbf{x}-\mu)(\mathbf{x}-\mu)^{\top}\right]$$

Note that

$$\left\|\text{diag}(\boldsymbol{\gamma})-\frac{1}{N}\boldsymbol{\gamma}\mathds{1}^{\top}\right\|_{\infty}=\frac{2(N-1)}{N}\max_{i}|\gamma_{i}|. \qquad (2)$$

Taking the infinity norm on both sides, we have:

$$\begin{aligned}
\left\|\frac{\partial\text{LN}(\mathbf{x})}{\partial\mathbf{x}}\right\|_{\infty} &=(\sigma^{2}+\epsilon)^{-\frac{1}{2}}\left\|\text{diag}(\boldsymbol{\gamma})-\frac{1}{N}\boldsymbol{\gamma}\mathds{1}^{\top}-\frac{1}{N}(\sigma^{2}+\epsilon)^{-1}\text{diag}(\boldsymbol{\gamma})(\mathbf{x}-\mu)(\mathbf{x}-\mu)^{\top}\right\|_{\infty}\\
&\leq\epsilon^{-\frac{1}{2}}\left(\frac{2(N-1)}{N}\max_{i}|\gamma_{i}|+\frac{1}{N}\max_{i}|\gamma_{i}|\,N(N-2)\right)\\
&\leq\epsilon^{-\frac{1}{2}}\max_{i}|\gamma_{i}|\,N.
\end{aligned}$$

Appendix B Proof for the Lipschitz Constant of LRSA

The pair-wise LRSA function is expressed as:

$$S_{ij}=-\frac{\alpha\left\|x_{i}^{\top}W_{Q}-x_{j}^{\top}W_{K}\right\|_{2}^{2}}{\left\|Q\right\|_{F}\left\|X^{\top}\right\|_{(\infty,2)}} \qquad (3)$$
$$P_{i}=\operatorname{softmax}(S_{i}(X)),\qquad P_{ij}=\frac{e^{S_{ij}}}{\sum_{t=1}^{n}e^{S_{it}}}\leq 1$$
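For concreteness, the following NumPy sketch shows one way the scores $S_{ij}$ of Eq. (3) and the attention weights $P_{ij}$ could be computed. It is illustrative rather than the paper's implementation: the value of $\alpha$, the projection shapes, and the reading of $\left\|X^{\top}\right\|_{(\infty,2)}$ as the largest token-wise $\ell_2$ norm are assumptions.

```python
import numpy as np

def lrsa_attention(X, W_Q, W_K, alpha=1.0):
    """X: (n, d) token features; W_Q, W_K: (d, d_k) projections (illustrative shapes)."""
    Q = X @ W_Q                                    # queries, rows are x_i^T W_Q
    K = X @ W_K                                    # keys,    rows are x_j^T W_K
    q_norm = np.linalg.norm(Q)                     # ||Q||_F
    # Assumption: ||X^T||_{(inf,2)} is the largest L2 norm among the token vectors.
    x_norm = np.linalg.norm(X, axis=1).max()
    diff = Q[:, None, :] - K[None, :, :]           # pairwise q_i - k_j, shape (n, n, d_k)
    S = -alpha * (diff ** 2).sum(-1) / (q_norm * x_norm)   # Eq. (3)
    S_shift = S - S.max(axis=1, keepdims=True)     # shift for numerical stability only
    P = np.exp(S_shift) / np.exp(S_shift).sum(axis=1, keepdims=True)
    return S, P

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_Q, W_K = rng.normal(size=(8, 8)), rng.normal(size=(8, 8))
S, P = lrsa_attention(X, W_Q, W_K, alpha=0.5)
print(P.sum(axis=1))    # each row of P sums to 1, and every P_ij <= 1
```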

To take the derivative of $P_{ij}$ with respect to $S_{it}$, there are two cases.

When $t=j$:

$$\begin{aligned}
\frac{\partial P_{ij}}{\partial S_{it}} &=\frac{\partial P_{ij}}{\partial S_{ij}}=\frac{\partial}{\partial S_{ij}}\left(\frac{e^{S_{ij}}}{\sum_{t=1}^{n}e^{S_{it}}}\right)=\frac{e^{S_{ij}}\left(\sum_{t=1}^{n}e^{S_{it}}\right)-\left(e^{S_{ij}}\right)^{2}}{\left(\sum_{t=1}^{n}e^{S_{it}}\right)^{2}}\\
&=\frac{e^{S_{ij}}}{\sum_{t=1}^{n}e^{S_{it}}}\left(1-\frac{e^{S_{ij}}}{\sum_{t=1}^{n}e^{S_{it}}}\right)=P_{ij}(1-P_{ij})
\end{aligned} \qquad (4)$$

When $t\neq j$:

$$\frac{\partial P_{ij}}{\partial S_{it}}=\frac{\partial}{\partial S_{it}}\left(\frac{e^{S_{ij}}}{\sum_{t=1}^{n}e^{S_{it}}}\right)=-\frac{e^{S_{ij}}}{\sum_{t=1}^{n}e^{S_{it}}}\cdot\frac{e^{S_{it}}}{\sum_{t=1}^{n}e^{S_{it}}}=-P_{ij}P_{it}$$

$$\frac{\partial P_{ij}}{\partial x_{k}}=\sum_{t=1}^{n}\frac{\partial P_{ij}}{\partial S_{it}}\frac{\partial S_{it}}{\partial x_{k}}=P_{ij}(1-P_{ij})\frac{\partial S_{ij}}{\partial x_{k}}-\sum_{t=1,t\neq j}^{n}P_{ij}P_{it}\frac{\partial S_{it}}{\partial x_{k}}=P_{ij}\frac{\partial S_{ij}}{\partial x_{k}}-P_{ij}\sum_{t=1}^{n}P_{it}\frac{\partial S_{it}}{\partial x_{k}} \qquad (5)$$
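Eqs. (4) and (5) use the standard softmax Jacobian. As a quick sanity check (a sketch with an arbitrary score vector, not taken from the paper), the closed form for $\partial P_{ij}/\partial S_{it}$ can be compared against finite differences:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())            # shift for numerical stability
    return e / e.sum()

rng = np.random.default_rng(0)
s = rng.normal(size=5)                 # one row S_i of scores
p = softmax(s)

# Analytic Jacobian: entry [j, t] is dP_ij/dS_it = P_ij(1 - P_ij) if t == j, else -P_ij * P_it.
analytic = np.diag(p) - np.outer(p, p)

# Finite-difference Jacobian for comparison.
h = 1e-6
numeric = np.zeros((5, 5))
for t in range(5):
    s_plus = s.copy()
    s_plus[t] += h
    numeric[:, t] = (softmax(s_plus) - p) / h

print(np.abs(analytic - numeric).max())   # tiny, up to finite-difference error
```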

Taking the infinity norm of $\partial S_{it}/\partial x_{k}$, we get:

$$\begin{aligned}
\left\|\frac{\partial S_{it}}{\partial x_{k}}\right\|_{\infty} &=\left\|\frac{\partial}{\partial x_{k}}\left(-\frac{\alpha\left\|x_{i}^{\top}W_{Q}-x_{j}^{\top}W_{K}\right\|_{2}^{2}}{\left\|Q\right\|_{F}\left\|X^{\top}\right\|_{(\infty,2)}}\right)\right\|_{\infty}\\
&=\left\|-\frac{2\alpha\left\|x_{i}^{\top}W_{Q}-x_{j}^{\top}W_{K}\right\|_{2}}{\left\|Q\right\|_{F}\left\|X^{\top}\right\|_{(\infty,2)}}\frac{\partial\left\|x_{i}^{\top}W_{Q}-x_{j}^{\top}W_{K}\right\|_{2}}{\partial x_{k}}+\frac{\alpha\left\|x_{i}^{\top}W_{Q}-x_{j}^{\top}W_{K}\right\|_{2}^{2}}{\left\|Q\right\|_{F}\left\|X^{\top}\right\|_{(\infty,2)}^{2}}\frac{\partial\left\|X^{\top}\right\|_{(\infty,2)}}{\partial x_{k}}\right\|_{\infty}\\
&\leq\left\|\frac{2\alpha\left\|x_{i}^{\top}W_{Q}-x_{j}^{\top}W_{K}\right\|_{2}}{\left\|Q\right\|_{F}\left\|X^{\top}\right\|_{(\infty,2)}}\frac{\partial\left\|x_{i}^{\top}W_{Q}-x_{j}^{\top}W_{K}\right\|_{2}}{\partial x_{k}}\right\|_{\infty}+\left\|\frac{\alpha\left\|x_{i}^{\top}W_{Q}-x_{j}^{\top}W_{K}\right\|_{2}^{2}}{\left\|Q\right\|_{F}\left\|X^{\top}\right\|_{(\infty,2)}^{2}}\frac{\partial\left\|X^{\top}\right\|_{(\infty,2)}}{\partial x_{k}}\right\|_{\infty}\\
&\leq\frac{2\alpha}{\left\|Q\right\|_{F}}\cdot\frac{\left\|x_{i}^{\top}W_{Q}\right\|_{2}+\left\|x_{j}^{\top}W_{K}\right\|_{2}}{\left\|X^{\top}\right\|_{(\infty,2)}}\left(\frac{\partial\left\|x_{j}^{\top}W_{Q}\right\|_{2}}{\partial x_{k}}+\frac{\partial\left\|x_{j}^{\top}W_{K}\right\|_{2}}{\partial x_{k}}\right)+\frac{\alpha}{\left\|Q\right\|_{F}}\left(\frac{\left\|x_{i}^{\top}W_{Q}\right\|_{2}+\left\|x_{j}^{\top}W_{K}\right\|_{2}}{\left\|X^{\top}\right\|_{(\infty,2)}}\right)^{2}\\
&\leq\frac{2\alpha\left(\left\|W_{Q}\right\|_{2}+\left\|W_{K}\right\|_{2}\right)^{2}}{\left\|Q\right\|_{F}}+\frac{\alpha\left(\left\|W_{Q}\right\|_{2}+\left\|W_{K}\right\|_{2}\right)^{2}}{\left\|Q\right\|_{F}}\\
&=\frac{3\alpha\left(\left\|W_{Q}\right\|_{2}+\left\|W_{K}\right\|_{2}\right)^{2}}{\left\|Q\right\|_{F}}
\end{aligned}$$

Thus,

$$\begin{aligned}
\left\|\frac{\partial P_{ij}}{\partial x_{k}}\right\|_{\infty} &=\left\|P_{ij}\frac{\partial S_{ij}}{\partial x_{k}}-P_{ij}\sum_{t=1}^{n}P_{it}\frac{\partial S_{it}}{\partial x_{k}}\right\|_{\infty}\leq P_{ij}\frac{3\alpha\left(\left\|W_{Q}\right\|_{2}+\left\|W_{K}\right\|_{2}\right)^{2}}{\left\|Q\right\|_{F}}+P_{ij}\sum_{t=1}^{n}P_{it}\frac{3\alpha\left(\left\|W_{Q}\right\|_{2}+\left\|W_{K}\right\|_{2}\right)^{2}}{\left\|Q\right\|_{F}}\\
&\leq\frac{6\alpha\left(\left\|W_{Q}\right\|_{2}+\left\|W_{K}\right\|_{2}\right)^{2}}{\left\|Q\right\|_{F}}\leq\frac{6\alpha}{\left\|X\right\|_{F}}\cdot\frac{\left(\left\|W_{Q}\right\|_{2}+\left\|W_{K}\right\|_{2}\right)^{2}}{\left\|W_{Q}\right\|_{F}}
\end{aligned}$$

Appendix C Gaussian Process Layer

As an optional module in LRFormer, a Gaussian Process (GP) layer with an RBF kernel, following SNGP [Liu2020SimpleAP], preserves the distance awareness between an input test sample and previously seen training data. This ensures that the model returns a uniform distribution over output labels when the input sample is OOD.

To make it end-to-end trainable, the Gaussian Process layer can be implemented as a two-layer network:

$$\operatorname{logits}(x)=\Phi(x)\beta,\quad\Phi(x)=\sqrt{\frac{2}{M}}*\cos(Wx+b) \qquad (6)$$

Here, $x$ is the input, and $W$ and $b$ are frozen weights initialized randomly from Gaussian and uniform distributions, respectively. $\Phi(x)$ is the Random Fourier Features (RFF) mapping [williams2006gaussian], and $\beta$ is the learnable kernel weight, similar to that of a Dense layer. The layer outputs the class prediction $\operatorname{logits}(x)\in\mathbb{R}^{\text{NumClasses}}$.
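A minimal PyTorch sketch of such a layer is given below, under the assumptions that $M$ random features are used, $W$ is a frozen Gaussian matrix, and $b$ is drawn uniformly from $[0, 2\pi)$ (the usual RFF convention). It is not the paper's reference implementation, and the covariance update that SNGP applies at inference time is omitted.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class RandomFeatureGPHead(nn.Module):
    """Eq. (6): logits(x) = Phi(x) beta, with Phi(x) = sqrt(2/M) * cos(W x + b)."""

    def __init__(self, in_dim: int, num_features: int, num_classes: int):
        super().__init__()
        # Frozen random projection: W ~ Gaussian, b ~ Uniform[0, 2*pi) (RFF convention).
        self.register_buffer("W", torch.randn(num_features, in_dim))
        self.register_buffer("b", 2.0 * math.pi * torch.rand(num_features))
        # beta is the only learnable weight, analogous to a Dense layer.
        self.beta = nn.Linear(num_features, num_classes, bias=False)
        self.scale = math.sqrt(2.0 / num_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        phi = self.scale * torch.cos(F.linear(x, self.W, self.b))   # Phi(x)
        return self.beta(phi)                                        # logits(x)

# Usage (illustrative sizes): an in_dim-dimensional feature mapped to 10 classes.
head = RandomFeatureGPHead(in_dim=384, num_features=1024, num_classes=10)
logits = head(torch.randn(8, 384))
print(logits.shape)   # torch.Size([8, 10])
```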

Appendix D Experimental Details

In Table 1, we provide the training details used to reproduce the main results in the tables above. The $Depth=12$ (pretraining) column corresponds to the experimental setup for ImageNet-1K pretraining. The other hyperparameters follow the same settings as DeiT III [Touvron2022ThreeTE].

Table 1: Hyperparameters for LRFormer Training.
Hyperparameters           Depth=6      Depth=12     Depth=12 (pretraining)
Layer depth               6            12           12
Input size                224×224      224×224      224×224
Batch size                128          32           32
Warm-up steps             5            5            5
Optimizer                 SGD          AdamW        AdamW
Learning rate             0.01         0.006        0.004
Weight decay              0.05         0.05         0.05
Learning rate scheduler   cosine       cosine       cosine
Training epochs           100          100          100