Mitigating Transformer Overconfidence via Lipschitz Regularization
(Supplementary Material)
Appendix A Proof for the Lipschitz Constant of LayerNorm
The LayerNorm operation [layernorm] used in LRFormer can be expressed as:

$$\mathrm{LN}(x) = \gamma \odot \frac{x - \mu}{\sigma} + \beta,$$

where $\mu = \frac{1}{d}\sum_{k=1}^{d} x_k$, $\sigma = \sqrt{\frac{1}{d}\sum_{k=1}^{d}(x_k - \mu)^2}$, and $\gamma, \beta \in \mathbb{R}^d$ are learnable parameters.

WLOG, assume $d \geq 2$ and not all $x_k$ are equal, so that $\sigma > 0$.

The derivatives of $\mu$ and $\sigma$ w.r.t. $x_j$ are:

$$\frac{\partial \mu}{\partial x_j} = \frac{1}{d}, \qquad \frac{\partial \sigma}{\partial x_j} = \frac{x_j - \mu}{d\,\sigma}.$$
Taking the derivative of $y_i$, the $i$-th element of $y = \mathrm{LN}(x)$, with respect to $x_j$ gives:

$$\frac{\partial y_i}{\partial x_j} = \frac{\gamma_i}{\sigma}\left([e_i]_j - \frac{1}{d} - \frac{(x_i - \mu)(x_j - \mu)}{d\,\sigma^2}\right), \tag{1}$$

where $e_i \in \mathbb{R}^d$ is a one-hot vector with $1$ at the $i$-th element. Therefore,

$$\frac{\partial y}{\partial x} = \frac{1}{\sigma}\,\mathrm{diag}(\gamma)\left(I - \frac{1}{d}\mathbf{1}\mathbf{1}^\top - \frac{(x - \mu\mathbf{1})(x - \mu\mathbf{1})^\top}{d\,\sigma^2}\right). \tag{2}$$
Taking the infinity-norm on both sides, we have:

$$\left\|\frac{\partial y}{\partial x}\right\|_\infty \le \frac{\|\gamma\|_\infty}{\sigma}\left(2 - \frac{2}{d} + \frac{\|x - \mu\mathbf{1}\|_\infty\,\|x - \mu\mathbf{1}\|_1}{d\,\sigma^2}\right) \le \frac{\|\gamma\|_\infty}{\sigma}\left(2 - \frac{2}{d} + \sqrt{d}\right),$$

where the last step uses $\|x - \mu\mathbf{1}\|_\infty \le \|x - \mu\mathbf{1}\|_2 = \sqrt{d}\,\sigma$ and $\|x - \mu\mathbf{1}\|_1 \le \sqrt{d}\,\|x - \mu\mathbf{1}\|_2 = d\,\sigma$. Hence LayerNorm is Lipschitz in the infinity-norm, with constant bounded by the right-hand side whenever $\sigma$ is bounded away from zero.
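As a sanity check on Eq. (2), the following sketch (a PyTorch-based verification that is not part of the paper; the dimension $d$ and random $\gamma$, $\beta$ are illustrative) compares the closed-form Jacobian with the one obtained by automatic differentiation and reports its empirical infinity-norm.

```python
import torch

torch.manual_seed(0)
d = 16
gamma, beta = torch.randn(d), torch.randn(d)  # illustrative LayerNorm parameters

def layer_norm(x):
    mu = x.mean()
    sigma = x.var(unbiased=False).sqrt()
    return gamma * (x - mu) / sigma + beta

x = torch.randn(d)

# Jacobian of LayerNorm at x via automatic differentiation.
J_auto = torch.autograd.functional.jacobian(layer_norm, x)

# Closed-form Jacobian from Eq. (2).
mu = x.mean()
sigma = x.var(unbiased=False).sqrt()
z = x - mu
J_closed = (torch.diag(gamma) / sigma) @ (
    torch.eye(d) - torch.ones(d, d) / d - torch.outer(z, z) / (d * sigma**2)
)

print(torch.allclose(J_auto, J_closed, atol=1e-4))   # True: Eq. (2) matches autograd
print(J_auto.abs().sum(dim=1).max().item())          # empirical infinity-norm of the Jacobian
```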
Appendix B Proof for the Lipschitz Constant of LRSA
The pair-wise LRSA function is expressed as:
(3)
To take the derivative of the pair-wise LRSA function with respect to its input, there are two cases.

When $i = j$:

(4)

When $i \neq j$:

(5)
Taking the infinity-norm on the Jacobian and combining the two cases above yields the bound on the Lipschitz constant of LRSA.
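Because the closed-form constant depends on the exact LRSA parameterization, a numerical cross-check can be run on any concrete instantiation. The sketch below (PyTorch, not from the paper) estimates the local Lipschitz constant, i.e., the Jacobian infinity-norm, of a generic distance-based attention map at a random input; the negative scaled squared-distance similarity and the scale `alpha` are illustrative stand-ins rather than the exact LRSA definition.

```python
import torch

torch.manual_seed(0)
n, d, alpha = 8, 16, 8.0   # sequence length, feature dim, scale (illustrative values)

def pairwise_attention(x_flat):
    # Distance-based attention: softmax over negative scaled squared Euclidean distances.
    X = x_flat.reshape(n, d)
    diff = X.unsqueeze(1) - X.unsqueeze(0)        # (n, n, d) pairwise differences
    sq_dist = (diff ** 2).sum(dim=-1)             # (n, n) squared distances
    A = torch.softmax(-sq_dist / alpha, dim=-1)   # attention weights
    return (A @ X).reshape(-1)                    # flattened output

x = torch.randn(n * d)
J = torch.autograd.functional.jacobian(pairwise_attention, x)   # (n*d, n*d)

# Local Lipschitz estimate at x in the infinity-norm: max absolute row sum of the Jacobian.
print(J.abs().sum(dim=1).max().item())
```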
Appendix C Gaussian Process Layer
As an optional module in LRFormer, a Gaussian Process (GP) layer with an RBF kernel, following SNGP [Liu2020SimpleAP], preserves distance awareness between an input test sample and the previously seen training data. This ensures that the model returns a uniform distribution over the output labels when the input sample is OOD.
To make it end-to-end trainable, the Gaussian Process layer can be implemented as a two-layer network:

$$\mathrm{logits}(h) = \Phi^\top \beta, \qquad \Phi = \sqrt{\tfrac{2}{D}}\,\cos(W h + b). \tag{6}$$

Here, $h$ is the input to the layer; $W$ and $b$ are frozen weights initialized randomly from a Gaussian and a uniform distribution, respectively; $\Phi$ is the Random Fourier Feature (RFF) mapping [williams2006gaussian] of dimension $D$; and $\beta$ is the learnable kernel weight, similar to that of a Dense layer. The layer outputs the class prediction logits.
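A minimal PyTorch sketch of this layer is given below; the framework, layer sizes, and class name are assumptions for illustration. `W` and `b` are registered as frozen buffers, `phi` is the RFF mapping with the standard $\sqrt{2/D}$ scaling, and only the kernel weight `beta` is trained.

```python
import math
import torch
import torch.nn as nn

class RandomFeatureGPHead(nn.Module):
    """SNGP-style GP output layer approximated with Random Fourier Features."""

    def __init__(self, in_dim: int, num_classes: int, num_features: int = 1024):
        super().__init__()
        # Frozen random projection: W ~ N(0, 1), b ~ U(0, 2*pi).
        self.register_buffer("W", torch.randn(num_features, in_dim))
        self.register_buffer("b", 2 * math.pi * torch.rand(num_features))
        self.scale = math.sqrt(2.0 / num_features)
        # Learnable kernel weight beta, acting like a Dense layer on phi.
        self.beta = nn.Linear(num_features, num_classes, bias=False)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # phi = sqrt(2/D) * cos(W h + b): Random Fourier Features for the RBF kernel.
        phi = self.scale * torch.cos(h @ self.W.t() + self.b)
        return self.beta(phi)   # class prediction logits

# Usage sketch with illustrative dimensions:
# head = RandomFeatureGPHead(in_dim=384, num_classes=10)
# logits = head(torch.randn(8, 384))
```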
Appendix D Experimental Details
In Table 1, we provide the training details used to reproduce the main results reported above; a code sketch of the corresponding optimizer and schedule configuration follows the table. The column labeled ImageNet-1K (pretraining) gives the experimental setup for ImageNet-1K pretraining. All other hyperparameters follow the same settings as DeiT III [Touvron2022ThreeTE].
Table 1: Training details for reproducing the main results.

| Hyperparameters | ImageNet-1K (pretraining) | | |
|---|---|---|---|
| Layer depth | 6 | 12 | 12 |
| Input size | | | |
| Batch size | 128 | 32 | 32 |
| Warm-up steps | 5 | 5 | 5 |
| Optimizer | SGD | AdamW | AdamW |
| Learning rate | 0.01 | 0.006 | 0.004 |
| Weight decay | 0.05 | 0.05 | 0.05 |
| Learning rate scheduler | cosine | cosine | cosine |
| Training epochs | 100 | 100 | 100 |
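For concreteness, a hedged PyTorch sketch of the ImageNet-1K (pretraining) column follows; the model stand-in, the SGD momentum value, and the interpretation of the 5 warm-up steps as epochs are assumptions not specified in Table 1.

```python
import torch

# Stand-in module; the actual LRFormer architecture is defined elsewhere.
model = torch.nn.Linear(384, 1000)

# ImageNet-1K (pretraining) column: SGD, lr 0.01, weight decay 0.05,
# cosine schedule over 100 epochs with a warm-up of 5 (interpreted here as epochs).
epochs, warmup = 100, 5
optimizer = torch.optim.SGD(
    model.parameters(), lr=0.01, momentum=0.9,  # momentum is an assumed common default
    weight_decay=0.05)

scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer,
    schedulers=[
        torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=0.01, total_iters=warmup),
        torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=epochs - warmup),
    ],
    milestones=[warmup],
)

for epoch in range(epochs):
    # ... one training epoch over ImageNet-1K with batch size 128 ...
    scheduler.step()
```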