
Pushing Boundaries:
Mixup’s Influence on Neural Collapse

Quinn LeBlanc Fisher*, Haoming Meng*, Vardan Papyan
University of Toronto
*Equal contribution
Abstract

Mixup is a data augmentation strategy that employs convex combinations of training instances and their respective labels to augment the robustness and calibration of deep neural networks. Despite its widespread adoption, the nuanced mechanisms that underpin its success are not entirely understood. The observed phenomenon of Neural Collapse, where the last-layer activations and classifier of deep networks converge to a simplex equiangular tight frame (ETF), provides a compelling motivation to explore whether mixup induces alternative geometric configurations and whether those could explain its success. In this study, we delve into the last-layer activations of training data for deep networks subjected to mixup, aiming to uncover insights into its operational efficacy. Our investigation (code), spanning various architectures and dataset pairs, reveals that mixup’s last-layer activations predominantly converge to a distinctive configuration different from what one might expect. In this configuration, activations from mixed-up examples of identical classes align with the classifier, while those from different classes delineate channels along the decision boundary. Moreover, activations in earlier layers exhibit patterns, as if trained with manifold mixup. These findings are unexpected, as mixed-up features are not simple convex combinations of feature class means (as one might get, for example, by training mixup with the mean squared error loss). By analyzing this distinctive geometric configuration, we elucidate the mechanisms by which mixup enhances model calibration. To further validate our empirical observations, we conduct a theoretical analysis under the assumption of an unconstrained features model, utilizing the mixup loss. Through this, we characterize and derive the optimal last-layer features under the assumption that the classifier forms a simplex ETF.

1 Introduction

Consider a classification problem characterized by an input space $\mathcal{X}=\mathbb{R}^{D}$ and an output space $\mathcal{Y}:=\{0,1\}^{C}$. Given a training set $\{(\mathbf{x}_{i},\mathbf{y}_{i})\}_{i=1}^{N}$, with $\mathbf{x}_{i}\in\mathcal{X}$ denoting the $i$-th input data point and $\mathbf{y}_{i}\in\mathcal{Y}$ representing the corresponding label, the goal is to train a model $f_{\theta}:\mathcal{X}\mapsto\mathcal{Y}$ by finding parameters $\theta$ that minimize the cross-entropy loss $\operatorname{CE}(f_{\theta}(\mathbf{x}_{i}),\mathbf{y}_{i})$ incurred by the model’s prediction $f_{\theta}(\mathbf{x}_{i})$ relative to the true target $\mathbf{y}_{i}$, averaged over the training set, $\frac{1}{N}\sum_{i=1}^{N}\operatorname{CE}(f_{\theta}(\mathbf{x}_{i}),\mathbf{y}_{i})$.

Papyan, Han, and Donoho (2020) observed that optimizing this loss leads to a phenomenon called Neural Collapse, where the last-layer activations and classifiers of the network converge to the geometric configuration of a simplex equiangular tight frame (ETF). This phenomenon reflects the natural tendency of the networks to organize the representations of different classes such that each class’s representations and classifiers become aligned, equinorm, and equiangularly spaced, providing optimal separation in the feature space. Understanding Neural Collapse is challenging due to the complex structure and inherent non-linearity of neural networks. Motivated by the expressivity of overparametrized models, the unconstrained features model (Mixon et al., 2020) and the layer-peeled model (Fang et al., 2021) have been introduced to study Neural Collapse theoretically. These mathematical models treat the last-layer features as free optimization variables along with the classifier weights, abstracting away the intricacies of the deep neural network.

1.1 Mixup

Mixup, a data augmentation strategy proposed by Zhang et al. (2017), generates new training examples through convex combinations of existing data points and labels:

\mathbf{x}_{ii^{\prime}}^{\lambda}=\lambda\mathbf{x}_{i}+(1-\lambda)\mathbf{x}_{i^{\prime}},\qquad\mathbf{y}_{ii^{\prime}}^{\lambda}=\lambda\mathbf{y}_{i}+(1-\lambda)\mathbf{y}_{i^{\prime}},

where $\lambda\in[0,\,1]$ is a randomly sampled value from a predetermined distribution $D_{\lambda}$. Conventionally, this distribution is a symmetric $\operatorname{Beta}(\alpha,\alpha)$ distribution, with $\alpha=1$ frequently set as the default. The loss associated with mixup can be mathematically represented as:

\mathbb{E}_{\lambda\sim D_{\lambda}}\frac{1}{N^{2}}\sum_{i=1}^{N}\sum_{i^{\prime}=1}^{N}\operatorname{CE}(f_{\theta}(\mathbf{x}_{ii^{\prime}}^{\lambda}),\mathbf{y}_{ii^{\prime}}^{\lambda}). \qquad (1)

A specific mixup data point, represented as $\mathbf{x}_{ii^{\prime}}^{\lambda}$, is categorized as a same-class mixup point when $\mathbf{y}_{i}=\mathbf{y}_{i^{\prime}}$, and classified as different-class when $\mathbf{y}_{i}\neq\mathbf{y}_{i^{\prime}}$.
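To make this construction concrete, the following is a minimal PyTorch-style sketch of mixup batch construction and the corresponding soft-label cross-entropy. The helper names (mixup_batch, mixup_cross_entropy) and the choice of a single lambda per batch are illustrative assumptions rather than details taken from the paper’s code.

```python
# Minimal sketch of mixup (PyTorch-style); a single lambda per batch is an
# illustrative simplification.
import torch
import torch.nn.functional as F

def mixup_batch(x, y, num_classes, alpha=1.0):
    """Return mixed inputs and soft targets for one batch (y holds class indices)."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()  # lambda ~ Beta(alpha, alpha)
    perm = torch.randperm(x.size(0))                              # pair each example with a random partner
    y_onehot = F.one_hot(y, num_classes).float()
    x_mix = lam * x + (1.0 - lam) * x[perm]                       # convex combination of inputs
    y_mix = lam * y_onehot + (1.0 - lam) * y_onehot[perm]         # convex combination of labels
    return x_mix, y_mix

def mixup_cross_entropy(logits, y_mix):
    """Cross-entropy against soft mixup targets, cf. Equation 1."""
    return -(y_mix * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
```

In a training loop, one would call mixup_batch on each minibatch and feed the mixed inputs through the network before applying mixup_cross_entropy to the resulting logits.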

1.2 Problem Statement

Despite the widespread use and demonstrated efficacy of the mixup data augmentation strategy in enhancing the generalization and calibration of deep neural networks, its underlying operational mechanisms remain not well understood. The emergence of Neural Collapse prompts the following question:

Does mixup induce its own distinct configurations in last-layer activations, differing from traditional Neural Collapse? If so, does the configuration contribute to the method’s success?

This study aims to uncover the potential geometric configurations in the last-layer activations resulting from mixup and to determine whether these configurations can offer insights into its success.

1.3 Contributions

Our contributions in this paper are twofold.

Empirical Study and Discovery

We conduct an extensive empirical study focusing on the last-layer activations of mixup training data. Our study reveals that mixup induces a geometric configuration of last-layer activations across various datasets and models. This configuration is characterized by distinct behaviours:

  • Same-Class Activations: These form a simplex ETF, aligning with their respective classifier.

  • Different-Class Activations: These form channels along the decision boundary of the classifiers, exhibiting interesting behaviors: Data points with a mixup coefficient, $\lambda$, closer to 0.5 are located nearer to the middle of the channels. The density of different-class mixup points increases as $\lambda$ approaches 0.5, indicating a collapsing behaviour towards the channels.

We investigate how this configuration varies under different training settings and the layer-wise trajectory the features take to arrive at the configuration. Additionally, the configuration offers insight into mixup’s success. Specifically, we measure the calibration induced by mixup and present an explanation for why the configuration leads to increased calibration.

Motivated by our theoretical analysis, we also examine the configuration of the last-layer activations obtained through training with mixup while fixing the classifier as a simplex ETF.

Theoretical Analysis

We provide a theoretical analysis of the discovered phenomenon, utilizing an adapted unconstrained features model tailored for the mixup training objective. Assuming the classifier forms a simplex ETF under optimality, we theoretically characterize the optimal last-layer activations for all class pairs and for every $\lambda\in[0,1]$.

1.4 Results Summary

The results of our extensive empirical investigation are presented in Figures 1, 3, 5, 10, and 12. These figures collectively illustrate a consistent identification of a unique last-layer configuration induced by mixup, observed across a diverse range of:

Architectures:

Our study incorporated the WideResNet-40-10 (Zagoruyko & Komodakis, 2017) and ViT-B (Dosovitskiy et al., 2021) architectures;

Datasets:

The datasets employed included FashionMNIST (Xiao et al., 2017), CIFAR10, and CIFAR100 (Krizhevsky & Hinton, 2009);

Optimizers:

We used stochastic gradient descent (SGD), Adam (Kingma & Ba, 2017), and AdamW (Loshchilov & Hutter, 2017) as optimizers.

The trained networks showed good generalization performance and calibration, as substantiated by the data presented in Tables 1 and 2; the values are comparable to those reported in other papers (Zhang et al., 2017; Thulasidasan et al., 2020).

Beyond our principal observations, we conducted a counterfactual experiment, the results of which are depicted in Figures 2 and 8. These reveal a notable divergence in the configuration of the last-layer features when mixup is not employed. They also show that, for the MSE loss, the last-layer activations are convex combinations of the classifiers, as one might expect.

Furthermore, we juxtaposed the findings from our empirical investigation with theoretically optimal features, which were derived from an unconstrained features model and are showcased in Figure 6.

Table 1: Test accuracy for experiments in Figures 1 and 8.
Network | Dataset | Baseline | Mixup
WideResNet-40-10 | FashionMNIST | 95.10 | 94.21
WideResNet-40-10 | CIFAR10 | 96.2 | 97.30
WideResNet-40-10 | CIFAR100 | 80.03 | 81.42
ViT-B/4 | FashionMNIST | 93.71 | 94.24
ViT-B/4 | CIFAR10 | 86.92 | 92.56
ViT-B/4 | CIFAR100 | 59.95 | 69.83

To complement these results, we train models using mixup while fixing the classifier as a simplex ETF, and we plot the last-layer features in Figure 7. This yields last-layer features that align more closely with the theoretical features.

Figure 1: (Visualization of activations outputted by networks trained with mixup). Last-layer activations of mixup training data for a randomly selected subset of three classes across various dataset and network architecture combinations trained with mixup. The first row illustrates activations generated by a WideResNet, while the second row showcases activations from a ViT. Each column corresponds to a different dataset. Coloration indicates the type of mixup (same or different class), along with the level of mixup, $\lambda$. For each plot, the relevant classifiers are plotted in black.

2 Experiments

2.1 Model Training

We consider the FashionMNIST (Xiao et al., 2017), CIFAR10, and CIFAR100 (Krizhevsky & Hinton, 2009) datasets. Unless otherwise indicated, for all experiments using mixup augmentation, $\alpha=1$ was used, meaning $\lambda$ was sampled uniformly between 0 and 1. On each dataset we train both a Vision Transformer (Dosovitskiy et al., 2021) and a wide residual network (Zagoruyko & Komodakis, 2017). For each network and dataset combination, the experiment with the highest test accuracy is repeated without mixup and is referred to as the “baseline” result. No dropout was used in any experiments. Hyperparameter details are outlined in Appendix B.1.

2.2 Visualizations of last-layer activations

For each dataset and network pair, we visualize the last-layer activations for a subset of the training dataset consisting of three randomly selected classes. After obtaining the last-layer activations, they undergo a two-step projection: first onto the classifier for the subset of three classes, then onto a two-dimensional representation of a three-dimensional simplex ETF. A more detailed explanation of the projection can be found in Appendix B.2. The results of this experiment can be seen in Figure 1. Notably, activations from mixed-up examples of the same classes closely align with a simplex ETF structure, whereas those from different classes delineate channels along the decision boundary. Additionally, in certain plots, activations from mixed-up examples of different classes become increasingly sparse as $\lambda$ approaches 0 and 1. This suggests a clustering of activations towards the channels. When generating plots, we keep the network in train mode (we use evaluation mode for the WideResNet-40-10 CIFAR100 combination because the batch statistics are highly skewed due to the high number of classes). Since ViT does not have batch normalization layers, this distinction is not applicable there.

2.3 Comparison of different loss functions

As part of our empirical investigation, we have conducted experiments utilizing the Mean Squared Error (MSE) loss instead of cross-entropy, through which we observed in Figure 2 that the features of mixed-up examples are simply convex combinations of same-class features. Initially, we anticipated a similarly uninteresting configuration for cross-entropy; however, our measurements reveal that the resulting geometric configurations are markedly more interesting and complex. Additionally, we compare the results in Figure 1 to the baseline (trained without mixup) cross-entropy loss in Figure 2. For the baseline networks, mixup data is loosely aligned with the classifier, regardless of same-class or different-class. The area in between classifiers is noisy and filled with examples where $\lambda$ is close to 0.5. Additional baseline last-layer activations can be found in Figure 8.

Figure 2: (Visualization of activations outputted by networks trained with various loss functions). Last-layer activations for WideResNet-40-10 trained on the CIFAR10 dataset, subsetted to three randomly selected classes. Projections are generated using the same method as Figure 1. Left to right: baseline cross-entropy, MSE mixup, cross-entropy mixup. Colouring indicates mixup type (same-class or different-class), and the level of mixup, $\lambda$. Relevant classifiers plotted in black. Additional dataset and architecture combinations for baseline cross-entropy are available in Appendix D.

2.4 Layer-wise trajectory of CLS token

Using the same projection method as in Figure 1, we investigate the trajectory of the CLS token for ViT models. First, we randomly select two CIFAR10 training images. Then, we create a selection of mixed-up examples using the respective images. For each mixed-up example, we project the path of the CLS token at each layer of the ViT-B/4 network.

Figure 3 presents the results for two images of the same class and for two images of different classes. For different-class mixup, the plot shows that for very small $\lambda$, the network first classifies the image as class 1, and only in deeper layers does it recognize that the image is also partially class 2.

The results in Figure 1 suggest that applying mixup to input data enforces a particularly rigid geometric structure on the last-layer activations. Manifold mixup (Verma et al., 2019), a subsequent technique, proposes mixing features across various layers of a network. The results in Figure 3 suggest that regular mixup promotes manifold mixup-like behaviour in earlier layers.

Figure 3: (Projection of CLS token at each layer). Projections of the CLS token of the mixup of two randomly selected training images for various values of $\lambda$. Trajectories start at the origin. Colouring indicates mixup type (same-class or different-class), and the level of mixup, $\lambda$.

2.5 Calibration

Thulasidasan et al. (2020) demonstrated that mixup improves calibration for networks. That is, training with mixup causes the softmax probabilities to be in closer alignment with the true probabilities of misclassification. To measure a network’s calibration, we use the expected calibration error (ECE) as proposed by Pakdaman Naeini et al. (2015). The exact definition of ECE can be found in Appendix C. Results for ECE can be found in Table 2. Last-layer activation plots for $\alpha=0.4$ are available in Figure 9 in Appendix D.

Table 2: CIFAR10 expected calibration error.
Network | Baseline | Mixup ($\alpha=1.0$) | Mixup ($\alpha=0.4$)
WideResNet-40-10 | 0.024 | 0.077 | 0.013
ViT-B/4 | 0.122 | 0.014 | 0.019
Figure 4: (Diagram showing the relationship between calibration and the configuration). As $\lambda$ approaches 0.5, the last-layer activation $h_{i,i^{\prime}}^{\lambda}$ (black) traverses the blue line of the configuration, leading to less confident predictions. Simultaneously, the variability of the activation (perforated black circle) results in an increase in misclassification, due to the increased probability of landing on the incorrect side of the decision boundary (green).

The configuration presented in Figure 1 sheds light on why mixup improves calibration. Recall that mixup promotes alignment of the model’s softmax probabilities for the training example $\mathbf{x}_{ii^{\prime}}^{\lambda}$ with its label $\lambda\mathbf{y}_{i}+(1-\lambda)\mathbf{y}_{i^{\prime}}$. Here, $\lambda$ acts as a gauge for these softmax probabilities, essentially reflecting the model’s confidence in its predictions. Turning to Figure 4, it therefore becomes evident that as $\lambda$ nears 0.5, the model’s certainty in its predictions diminishes. This reduction in confidence is manifested geometrically through the spatial distribution of features along the channel. This, in turn, causes an increase in misclassification rates, due to a greater chance of activations erroneously crossing the decision boundary. This simultaneous reduction in confidence and classification accuracy leads to enhanced calibration in the model and can be attributed purely to the geometric structure to which the model converged. The above logic applies as we traverse the mixed-up training features; we expect some test features to be noisy perturbations of these mixed-up training features.

3 Unconstrained features model for mixup

3.1 Theoretical characterization of optimal last-layer features

To study the resulting last-layer features under mixup, we consider an adaptation of the unconstrained features model to mixup training. Let $d\geq C-1$ be the dimension of the last-layer features, ${\bm{y}}_{i}\in\mathbb{R}^{C}$ be the one-hot vector in entry $i$, ${\bm{W}}\in\mathbb{R}^{C{\times}d}$ be the classifier, and ${\bm{h}}_{ii^{\prime}}^{\lambda}\in\mathbb{R}^{d}$ be the last-layer feature associated with target $\lambda{\bm{y}}_{i}+\left(1-\lambda\right){\bm{y}}_{i^{\prime}}$. Then, adapting Equation 1 to the unconstrained features setting, we consider the optimization problem given by

\min_{{\bm{W}},{\bm{h}}_{ii^{\prime}}^{\lambda}}\mathbb{E}_{\lambda\sim D_{\lambda}}\frac{1}{C^{2}}\sum_{i=1}^{C}\sum_{i^{\prime}=1}^{C}\left(\operatorname{CE}\left({\bm{W}}{\bm{h}}_{ii^{\prime}}^{\lambda},\lambda{\bm{y}}_{i}+\left(1-\lambda\right){\bm{y}}_{i^{\prime}}\right)+\frac{\lambda_{{\bm{H}}}}{2}\lVert{\bm{h}}_{ii^{\prime}}^{\lambda}\rVert_{2}^{2}\right)+\frac{\lambda_{{\bm{W}}}}{2}\lVert{\bm{W}}\rVert_{F}^{2}, \qquad (2)

where $\lambda_{{\bm{W}}},\lambda_{{\bm{H}}}>0$ are the weight decay parameters. It is reasonable to consider decay on the features ${\bm{h}}_{ii^{\prime}}^{\lambda}$, a practice frequently adopted in prior work (Zhu et al., 2021; Zhou et al., 2022), due to the implicit decay imposed on the last-layer features by the inclusion of weight decay on the previous layers’ parameters.

The following theorem characterizes the optimal last-layer features under the assumption that the optimal classifier ${\bm{W}}$ is a simplex ETF, i.e.,

m\sqrt{\frac{C}{C-1}}\left({\bm{I}}_{C}-\frac{1}{C}\mathbf{1}_{C}\mathbf{1}_{C}^{\top}\right){\bm{U}}^{\top}, \qquad (3)

where ${\bm{I}}_{C}\in\mathbb{R}^{C{\times}C}$ is the identity, $\mathbf{1}_{C}\in\mathbb{R}^{C}$ is the ones vector, ${\bm{U}}\in\mathbb{R}^{d\times C}$ is a partial orthogonal matrix (satisfying ${\bm{U}}^{\top}{\bm{U}}={\bm{I}}_{C}$), and $m\in\mathbb{R}\setminus\{0\}$ is its multiplier.

Note that we make this assumption as it holds in practice based on our empirical measurements illustrated in Figure 5.
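For concreteness, the following numpy sketch constructs a simplex ETF classifier following Equation 3. The helper name simplex_etf and the choice of obtaining the partial orthogonal matrix ${\bm{U}}$ from a QR decomposition of a random Gaussian matrix are illustrative assumptions, not part of the theoretical model itself.

```python
# Sketch: build a simplex ETF classifier W (Equation 3) and check its properties.
import numpy as np

def simplex_etf(C, d, m=1.0, seed=0):
    assert d >= C - 1
    rng = np.random.default_rng(seed)
    U, _ = np.linalg.qr(rng.standard_normal((d, C)))   # partial orthogonal: U^T U = I_C
    M = np.eye(C) - np.ones((C, C)) / C                # I_C - (1/C) 1 1^T
    return m * np.sqrt(C / (C - 1)) * M @ U.T          # rows are the C classifier vectors

W = simplex_etf(C=10, d=100, m=3.0)
# Rows are equinorm (norm ~ m) and pairwise cosines equal -1/(C-1):
print(np.linalg.norm(W, axis=1))
print(W[0] @ W[1] / (np.linalg.norm(W[0]) * np.linalg.norm(W[1])))  # ~ -1/9 for C = 10
```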

Theorem 3.1.

Assume that at optimality, ${\bm{W}}$ is a simplex ETF with multiplier $m$, and denote the $i$-th row of ${\bm{W}}$ by ${\bm{w}}_{i}$. Then, any minimizer of equation 2 satisfies:

1) Same-Class: For all $i=1,\ldots,C$ and $\lambda\in[0,1]$,

{\bm{h}}_{ii}^{\lambda}=\frac{\left(1-C\right)K}{m^{2}}{\bm{w}}_{i},

where $K<0$ is the unique solution to the equation

e^{-CK}-\frac{Cm^{2}}{(1-C)\lambda_{{\bm{H}}}K}+C-1=0.

2) Different-Class: For all $i\neq i^{\prime}$ and $\lambda\in[0,1]$,

{\bm{h}}_{ii^{\prime}}^{\lambda}=\frac{(1-C)}{Cm^{2}}\left(\left(K_{\lambda}-\left\langle{\bm{w}}_{i},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle\right){\bm{w}}_{i}+\left((C-1)K_{\lambda}+\left\langle{\bm{w}}_{i},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle\right){\bm{w}}_{i^{\prime}}\right),

where $\left\langle{\bm{w}}_{i},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle$ is of the form

\log\left(\frac{e^{K_{\lambda}}}{2}\left(2-C+\frac{Cm^{2}}{\left(1-C\right)K_{\lambda}\lambda_{{\bm{H}}}}\pm\sqrt{\left(C-2-\frac{Cm^{2}}{\left(1-C\right)K_{\lambda}\lambda_{{\bm{H}}}}\right)^{2}-4e^{-CK_{\lambda}}}\right)\right),

and $K_{\lambda}<0$ satisfies

e^{\left\langle{\bm{w}}_{i},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle}=\frac{Cm^{2}}{\left(1-C\right)K_{\lambda}\lambda_{{\bm{H}}}}e^{K_{\lambda}}\left(\frac{(1-C)\lambda_{{\bm{H}}}\left\langle{\bm{w}}_{i},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle}{Cm^{2}}+\lambda\right).

The proof of Theorem 3.1 can be found in Appendix A.1.

Interpretation of Theorem.

Theorem 3.1 establishes that, within the framework of our model’s assumptions, the optimal same-class features are independent of $\lambda$ and align with the classifier as a simplex ETF. In contrast, the optimal features for different classes are linear combinations (depending on $\lambda$) of the classifier rows corresponding to the mixed-up targets. This is consistent with the observations in Figure 1, where the same-class features consistently cluster at simplex vertices, regardless of the value of $\lambda$, while the different-class features dynamically flow between these vertices as $\lambda$ varies.

Figure 5: (Convergence of classifier to simplex ETF). Measurements on the classifier, ${\bm{W}}$, for each network architecture and dataset combination. First and third plot: coefficient of variation of the classifier norms, $\operatorname{Std}_{i}\left(\left\|{\bm{w}}_{i}\right\|_{2}\right)/\operatorname{Avg}_{i}\left(\left\|{\bm{w}}_{i}\right\|_{2}\right)$. Second and fourth plot: standard deviation of the cosines between classifiers of distinct classes, $\operatorname{Std}_{i,i^{\prime}\neq i}\left(\langle{\bm{w}}_{i},{\bm{w}}_{i^{\prime}}\rangle/\left(\left\|{\bm{w}}_{i}\right\|_{2}\left\|{\bm{w}}_{i^{\prime}}\right\|_{2}\right)\right)$ with $i\neq i^{\prime}$. As training progresses, measurements indicate that ${\bm{W}}$ is trending toward a simplex ETF configuration.

In Figure 6, we plot the last-layer features obtained from Theorem 3.1, numerically solving for the values of $K$ and $K_{\lambda}$ that satisfy their respective equations.
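As an illustration of this step, the sketch below recovers the same-class constant $K$ by bisection on the strictly decreasing function $f$ from Theorem 3.1. The parameter values mirror those used for Figure 6, while the solver itself is an illustrative choice rather than the exact implementation behind the figure.

```python
# Sketch: solve f(K) = e^{-CK} - C m^2 / ((1 - C) lam_H K) + C - 1 = 0 for K < 0
# by bisection, using the fact (shown in Appendix A.1) that f is strictly decreasing.
import numpy as np

def solve_same_class_K(C=10, m=3.0, lam_H=1e-6, lo=-10.0, hi=-1e-12, iters=200):
    f = lambda K: np.exp(-C * K) - C * m**2 / ((1 - C) * lam_H * K) + C - 1
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            lo = mid        # root lies between mid and 0
        else:
            hi = mid
    return 0.5 * (lo + hi)

K = solve_same_class_K()
# The same-class features then align with the classifier rows:
#   h_ii = (1 - C) * K / m^2 * w_i   (Theorem 3.1, same-class case).
```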

(a) Loss = 0.33457
(b) Loss = 0.33465
Figure 6: (Optimal activations from unconstrained features model). On the left are optimal last-layer activations obtained from our theoretical analysis. We set $m=3$, $C=10$, $d=100$, and $\lambda_{{\bm{H}}}=1\times 10^{-6}$, and randomly sample 5000 different $\lambda$ values from the $\operatorname{Beta}(1,1)$ distribution for a randomly selected subset of three classes. Projections are generated using the same method as depicted in Figure 1. Colouring indicates the mixup type (same-class or different-class), and the level of mixup, $\lambda$.

Similar to the empirical results in Figure 1, the density of different-class mixup points decreases as $\lambda$ approaches 0 and 1. However, the theoretically optimal features exhibit channels arranged in a hexagonal pattern, differing from the empirical features observed in the FashionMNIST and CIFAR10 datasets. In particular, in the empirical representations there is a more pronounced elongation of different-class features as the mixup parameter $\lambda$ approaches $0.5$. In an attempt to understand these differences, we introduce an amplification of these same features in the directions of the classifier rows not corresponding to the mixed-up targets, with increasing amplifications as $\lambda$ gets closer to 0.5 (details of the amplification function are outlined in Appendix A.2). This results in the plot on the right, which behaves more similarly to the other empirical outcomes, while achieving a very close (though marginally larger) loss when compared to the true optimal configuration (the loss values are indicated below each plot in Figure 6). This demonstrates that the features have some degree of flexibility while remaining in close proximity to the minimum loss.

3.2 Training with fixed simplex ETF classifier

Figure 7: (Visualization of activations outputted by network trained with mixup fixing the classifier as a simplex ETF). Last-layer activations of mixup training data are presented here for a randomly selected subset of three classes. Coloration indicates the type of mixup (same-class or different-class), along with the level of mixup, $\lambda$. This model achieves a test accuracy of 97.35%.

To enhance our comprehension of the differences between the theoretical features (depicted in Figure 6) and the empirical features (illustrated in Figure 1), we performed an experiment employing mixup within the training framework detailed in Section 2, but fixing the classifier as a simplex ETF. The resulting last-layer features are visualized in Figure 7. Prior works (Zhu et al., 2021; Yang et al., 2022; Pernici et al., 2022) have explored the effects of fixing the classifier, but not in the context of mixup. Our observations reveal that when the classifier is fixed as a simplex ETF, the different-class mixup features tend to exhibit a more hexagonal shape, aligning more closely with the theoretically optimal features. Moreover, comparable generalization performance is achieved when compared to training with a learnable classifier under the same setting.

Based on these results, a possible explanation for the variation in configuration is that, during the training process, the classifier is still being learned and requires several epochs to converge to a simplex ETF, as depicted in Figure 5. During this period, the features may traverse regions that lead to a slightly suboptimal loss, as there is flexibility in the features’ structures without much degradation in loss (as depicted in Figure 6).

4 Related work

The success of mixup has prompted many mixup variants, each successful in their own right (Guo et al., 2018; Verma et al., 2019; Yun et al., 2019; Kim et al., 2020). Additionally, various works have been devoted to better understanding the effects and success of the method.

Guo et al. (2018) identified manifold intrusion as a potential limitation of mixup, stemming from discrepancies between the mixed-up label of a mixed-up example and its true label, and they propose a method for overcoming this.

In addition to the work by Thulasidasan et al. (2020) on calibration for networks trained with mixup, Zhang et al. (2022) posits that this improvement in calibration due to mixup is correlated with the capacity of the network. Zhang et al. (2021) theoretically demonstrates that training with mixup corresponds to minimizing an upper bound of the adversarial loss.

Chaudhry et al. (2022) delved into the linearity of various representations of a deep network trained with mixup. They observed that representations nearer to the input and output layer exhibit greater linearity compared to those situated in the middle.

Carratino et al. (2022) interprets mixup as an empirical risk minimization estimator employing transformed data, leading to a process that notably enhances both model accuracy and calibration. Continuing on the same path, Park et al. (2022) offers a unified theoretical analysis that integrates various aspects of mixup methods.

Furthermore, Chidambaram et al. (2021) conducted a detailed examination of the classifier that is optimal under mixup, comparing it with the classifier obtained through standard training.

Recent work has also been devoted to studying the benefits of mixup with feature-learning based analysis by Chidambaram et al. (2023) and Zou et al. (2023). The former considers two features generated from a symmetric distribution for each class, while the latter considers a data model with two features of different frequencies, feature noise, and random noise.

The discovery of Neural Collapse by Papyan et al. (2020) has spurred investigations of this phenomenon. Recent theoretical inquiries by Mixon et al. (2020); Fang et al. (2021); Lu & Steinerberger (2020); E & Wojtowytsch (2020); Poggio & Liao (2020); Zhu et al. (2021); Han et al. (2021); Tirer & Bruna (2022); Wang et al. ; Kothapalli et al. (2022) have delved into the analysis of Neural Collapse employing both the unconstrained features model (Mixon et al., 2020) and the layer-peeled model (Fang et al., 2021). Liu et al. (2023) removes the assumption on the feature dimension and the number of classes in Neural Collapse and presents a Generalized Neural Collapse which is characterized by minimizing intra-class variability and maximizing inter-class separability.

To our knowledge, there has not been any investigation into the geometric configuration induced by mixup in the last layer.

5 Conclusion

In conclusion, through an extensive empirical investigation across various architectures and datasets, we have uncovered a distinctive geometric configuration of last-layer activations induced by mixup. This configuration exhibits intriguing behaviors, such as same-class activations forming a simplex equiangular tight frame (ETF) aligned with their respective classifiers, and different-class activations delineating channels along the decision boundary, with varying densities depending on the mixup coefficient. We also examine the layer-wise trajectory that features follow to reach this configuration in the last layer, and measure the calibration induced by mixup to provide an explanation for why this particular configuration is beneficial for calibration.

Furthermore, we have complemented our empirical findings with a theoretical analysis, adapting the unconstrained features model to mixup. Theoretical results indicate that the optimal same-class features are independent of the mixup coefficient and align with the classifier, while different-class features are dynamic linear combinations of the classifier rows corresponding to mixed-up targets, influenced by the mixup coefficient. Motivated by our theoretical analysis, we also conduct experiments investigating the configuration of the last-layer activations from training with mixup while keeping the classifier fixed as a simplex ETF, and we see that it aligns more closely with the theoretically optimal features, without degrading test performance.

These findings collectively shed light on the intricate workings of mixup in training deep networks, emphasizing its role in organizing last-layer activations for improved calibration. Understanding these geometric configurations induced by mixup opens up avenues for further research into the design of data augmentation strategies and their impact on neural network training.

6 Acknowledgements

We acknowledge the support of the Natural Sciences and Engineering Research Council of Canada (NSERC). This research was enabled in part by support provided by Compute Ontario (http://www.computeontario.ca/) and Compute Canada (http://www.computecanada.ca/).

References

  • Carratino et al. (2022) Luigi Carratino, Moustapha Cissé, Rodolphe Jenatton, and Jean-Philippe Vert. On mixup regularization, 2022.
  • Chaudhry et al. (2022) Arslan Chaudhry, Aditya Krishna Menon, Andreas Veit, Sadeep Jayasumana, Srikumar Ramalingam, and Sanjiv Kumar. When does mixup promote local linearity in learned representations?, 2022.
  • Chidambaram et al. (2021) Muthu Chidambaram, Xiang Wang, Yuzheng Hu, Chenwei Wu, and Rong Ge. Towards understanding the data dependency of mixup-style training. CoRR, abs/2110.07647, 2021. URL https://arxiv.org/abs/2110.07647.
  • Chidambaram et al. (2023) Muthu Chidambaram, Xiang Wang, Chenwei Wu, and Rong Ge. Provably learning diverse features in multi-view data with midpoint mixup, 2023.
  • Dosovitskiy et al. (2021) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2021.
  • E & Wojtowytsch (2020) Weinan E and Stephan Wojtowytsch. On the emergence of tetrahedral symmetry in the final and penultimate layers of neural network classifiers. arXiv preprint arXiv:2012.05420, 2020.
  • Fang et al. (2021) Cong Fang, Hangfeng He, Qi Long, and Weijie J Su. Exploring deep neural networks via layer-peeled model: Minority collapse in imbalanced training. Proceedings of the National Academy of Sciences, 118(43):e2103091118, 2021.
  • Guo et al. (2018) Hongyu Guo, Yongyi Mao, and Richong Zhang. Mixup as locally linear out-of-manifold regularization, 2018.
  • Han et al. (2021) XY Han, Vardan Papyan, and David L Donoho. Neural collapse under mse loss: Proximity to and dynamics on the central path. In International Conference on Learning Representations, 2021.
  • Kim et al. (2020) Jang-Hyun Kim, Wonho Choo, and Hyun Oh Song. Puzzle mix: Exploiting saliency and local statistics for optimal mixup, 2020.
  • Kingma & Ba (2017) Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017.
  • Kothapalli et al. (2022) Vignesh Kothapalli, Ebrahim Rasromani, and Vasudev Awatramani. Neural collapse: A review on modelling principles and generalization. arXiv preprint arXiv:2206.04041, 2022.
  • Krizhevsky & Hinton (2009) Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical Report 0, University of Toronto, Toronto, Ontario, 2009. URL https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf.
  • Liu et al. (2023) Weiyang Liu, Longhui Yu, Adrian Weller, and Bernhard Schölkopf. Generalizing and decoupling neural collapse via hyperspherical uniformity gap, 2023.
  • Loshchilov & Hutter (2017) Ilya Loshchilov and Frank Hutter. Fixing weight decay regularization in adam. CoRR, abs/1711.05101, 2017. URL http://arxiv.org/abs/1711.05101.
  • Lu & Steinerberger (2020) Jianfeng Lu and Stefan Steinerberger. Neural collapse with cross-entropy loss. arXiv preprint arXiv:2012.08465, 2020.
  • Mixon et al. (2020) Dustin G Mixon, Hans Parshall, and Jianzong Pi. Neural collapse with unconstrained features. arXiv preprint arXiv:2011.11619, 2020.
  • Pakdaman Naeini et al. (2015) Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using bayesian binning. Proceedings of the AAAI Conference on Artificial Intelligence, 29(1), Feb. 2015. doi: 10.1609/aaai.v29i1.9602. URL https://ojs.aaai.org/index.php/AAAI/article/view/9602.
  • Papyan et al. (2020) Vardan Papyan, X. Y. Han, and David L. Donoho. Prevalence of neural collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences, 117(40):24652–24663, sep 2020. doi: 10.1073/pnas.2015509117. URL https://doi.org/10.1073%2Fpnas.2015509117.
  • Park et al. (2022) Chanwoo Park, Sangdoo Yun, and Sanghyuk Chun. A unified analysis of mixed sample data augmentation: A loss function perspective, 2022.
  • Pernici et al. (2022) Federico Pernici, Matteo Bruni, Claudio Baecchi, and Alberto Del Bimbo. Regular polytope networks. IEEE Transactions on Neural Networks and Learning Systems, 33(9):4373–4387, September 2022. ISSN 2162-2388. doi: 10.1109/tnnls.2021.3056762. URL http://dx.doi.org/10.1109/TNNLS.2021.3056762.
  • Poggio & Liao (2020) Tomaso Poggio and Qianli Liao. Explicit regularization and implicit bias in deep network classifiers trained with the square loss. arXiv preprint arXiv:2101.00072, 2020.
  • Thulasidasan et al. (2020) Sunil Thulasidasan, Gopinath Chennupati, Jeff Bilmes, Tanmoy Bhattacharya, and Sarah Michalak. On mixup training: Improved calibration and predictive uncertainty for deep neural networks, 2020.
  • Tirer & Bruna (2022) Tom Tirer and Joan Bruna. Extended unconstrained features model for exploring deep neural collapse. In international conference on machine learning (ICML), 2022.
  • Verma et al. (2019) Vikas Verma, Alex Lamb, Christopher Beckham, Amir Najafi, Ioannis Mitliagkas, Aaron Courville, David Lopez-Paz, and Yoshua Bengio. Manifold mixup: Better representations by interpolating hidden states, 2019.
  • (26) Peng Wang, Huikang Liu, Can Yaras, Laura Balzano, and Qing Qu. Linear convergence analysis of neural collapse with unconstrained features. In OPT 2022: Optimization for Machine Learning (NeurIPS 2022 Workshop).
  • Xiao et al. (2017) Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. CoRR, abs/1708.07747, 2017. URL http://arxiv.org/abs/1708.07747.
  • Yang et al. (2022) Yibo Yang, Shixiang Chen, Xiangtai Li, Liang Xie, Zhouchen Lin, and Dacheng Tao. Inducing neural collapse in imbalanced learning: Do we really need a learnable classifier at the end of deep neural network?, 2022.
  • Yun et al. (2019) Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features, 2019.
  • Zagoruyko & Komodakis (2017) Sergey Zagoruyko and Nikos Komodakis. Wide residual networks, 2017.
  • Zhang et al. (2017) Hongyi Zhang, Moustapha Cissé, Yann N. Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. CoRR, abs/1710.09412, 2017. URL http://arxiv.org/abs/1710.09412.
  • Zhang et al. (2021) Linjun Zhang, Zhun Deng, Kenji Kawaguchi, Amirata Ghorbani, and James Zou. How does mixup help with robustness and generalization? In International Conference on Learning Representations, 2021. URL https://openreview.net/forum?id=8yKEo06dKNo.
  • Zhang et al. (2022) Linjun Zhang, Zhun Deng, Kenji Kawaguchi, and James Zou. When and how mixup improves calibration, 2022.
  • Zhou et al. (2022) Jinxin Zhou, Chong You, Xiao Li, Kangning Liu, Sheng Liu, Qing Qu, and Zhihui Zhu. Are all losses created equal: A neural collapse perspective, 2022.
  • Zhu et al. (2021) Zhihui Zhu, Tianyu Ding, Jinxin Zhou, Xiao Li, Chong You, Jeremias Sulam, and Qing Qu. A geometric analysis of neural collapse with unconstrained features, 2021.
  • Zou et al. (2023) Difan Zou, Yuan Cao, Yuanzhi Li, and Quanquan Gu. The benefits of mixup for feature learning, 2023.

Appendix A Theoretical Model

A.1 Proof of Theorem 3.1

Our proof uses similar techniques as Yang et al. (2022), but we extend these ideas to the more intricate last-layer features that arise from mixup.

Proof.

Assuming that ${\bm{W}}$ is a simplex ETF with multiplier $m$, our unconstrained features optimization problem in equation 2 becomes separable across $\lambda$ and $i,i^{\prime}$, and so it suffices to minimize

L_{ii^{\prime}}^{\lambda}=-\lambda{\bm{y}}_{i}\log\left(\frac{e^{\left\langle{\bm{w}}_{i},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle}}{\sum_{k=1}^{C}e^{\left\langle{\bm{w}}_{k},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle}}\right)-(1-\lambda){\bm{y}}_{i^{\prime}}\log\left(\frac{e^{\left\langle{\bm{w}}_{i^{\prime}},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle}}{\sum_{k=1}^{C}e^{\left\langle{\bm{w}}_{k},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle}}\right)+\frac{1}{2}\lambda_{{\bm{H}}}\left\|{\bm{h}}_{ii^{\prime}}^{\lambda}\right\|_{2}^{2}

over each ${\bm{h}}_{ii^{\prime}}^{\lambda}$ individually.

\frac{\partial L_{ii^{\prime}}^{\lambda}}{\partial{\bm{h}}_{ii^{\prime}}^{\lambda}} ={\bm{W}}^{\top}\left({\bm{p}}-\left(\lambda{\bm{y}}_{i}+(1-\lambda){\bm{y}}_{i^{\prime}}\right)\right)+\lambda_{{\bm{H}}}{\bm{h}}_{ii^{\prime}}^{\lambda}  (where ${\bm{p}}=\mathrm{softmax}({\bm{W}}{\bm{h}}_{ii^{\prime}}^{\lambda})$)
=\sum_{j=1}^{C}{\bm{w}}_{j}p_{j}-\lambda{\bm{w}}_{i}-\left(1-\lambda\right){\bm{w}}_{i^{\prime}}+\lambda_{{\bm{H}}}{\bm{h}}_{ii^{\prime}}^{\lambda}  (where $p_{j}$ is the $j$-th entry of ${\bm{p}}$)
=\sum_{j\neq i,i^{\prime}}{\bm{w}}_{j}p_{j}+\left(p_{i}-\lambda\right){\bm{w}}_{i}+\left(p_{i^{\prime}}-(1-\lambda)\right){\bm{w}}_{i^{\prime}}+\lambda_{{\bm{H}}}{\bm{h}}_{ii^{\prime}}^{\lambda}

Setting $\frac{\partial L_{ii^{\prime}}^{\lambda}}{\partial{\bm{h}}_{ii^{\prime}}^{\lambda}}=0$ gives

\sum_{j\neq i,i^{\prime}}{\bm{w}}_{j}p_{j}+\left(p_{i}-\lambda\right){\bm{w}}_{i}+\left(p_{i^{\prime}}-(1-\lambda)\right){\bm{w}}_{i^{\prime}}+\lambda_{{\bm{H}}}{\bm{h}}_{ii^{\prime}}^{\lambda}=0. \qquad (4)

Case: $i=i^{\prime}$

In this case, equation 4 reduces to

\sum_{j\neq i}{\bm{w}}_{j}p_{j}+\left(p_{i}-1\right){\bm{w}}_{i}+\lambda_{{\bm{H}}}{\bm{h}}_{ii}^{\lambda}=0. \qquad (5)

Taking the inner product with ${\bm{w}}_{j}$, $j\neq i$, in equation 5 gives

m^{2}p_{j}-\frac{m^{2}}{C-1}\sum_{k\neq i,j}p_{k}-\frac{m^{2}\left(p_{i}-1\right)}{C-1}+\lambda_{{\bm{H}}}\left\langle{\bm{w}}_{j},{\bm{h}}_{ii}^{\lambda}\right\rangle=0
m^{2}p_{j}-\frac{m^{2}}{C-1}\left(\sum_{k\neq j}p_{k}-1\right)+\lambda_{{\bm{H}}}\left\langle{\bm{w}}_{j},{\bm{h}}_{ii}^{\lambda}\right\rangle=0
m^{2}p_{j}\left(\frac{C}{C-1}\right)+\lambda_{{\bm{H}}}\left\langle{\bm{w}}_{j},{\bm{h}}_{ii}^{\lambda}\right\rangle=0.

Since $m^{2}$, $p_{j}$, $\frac{C}{C-1}$, and $\lambda_{{\bm{H}}}$ are all positive, it follows that $\left\langle{\bm{w}}_{j},{\bm{h}}_{ii}^{\lambda}\right\rangle<0$.

For all $j,k\neq i$, we have

\frac{\exp\left(\left\langle{\bm{w}}_{j},{\bm{h}}_{ii}^{\lambda}\right\rangle\right)}{\exp\left(\left\langle{\bm{w}}_{k},{\bm{h}}_{ii}^{\lambda}\right\rangle\right)}=\frac{p_{j}}{p_{k}}=\frac{\left\langle{\bm{w}}_{j},{\bm{h}}_{ii}^{\lambda}\right\rangle}{\left\langle{\bm{w}}_{k},{\bm{h}}_{ii}^{\lambda}\right\rangle}.

Since $\frac{\exp(x)}{x}$ is strictly decreasing on $(-\infty,0)$, in particular it is injective on $(-\infty,0)$, so

\left\langle{\bm{w}}_{j},{\bm{h}}_{ii}^{\lambda}\right\rangle=\left\langle{\bm{w}}_{k},{\bm{h}}_{ii}^{\lambda}\right\rangle=K,

for some $K<0$ and $p_{j}=p_{k}=p$ for all $j,k\neq i$, where

p=\lambda_{{\bm{H}}}\frac{1-C}{C}\frac{K}{m^{2}}.

We also have, writing $S=\sum_{k=1}^{C}e^{\left\langle{\bm{w}}_{k},{\bm{h}}_{ii}^{\lambda}\right\rangle}$ for the softmax normalizer,

S=\frac{e^{K}}{p}=\frac{C}{1-C}\cdot\frac{m^{2}}{K\lambda_{{\bm{H}}}}e^{K}.

Then,

{\bm{h}}_{ii}^{\lambda} =\frac{-1}{\lambda_{{\bm{H}}}}\left(\sum_{j\neq i}{\bm{w}}_{j}p_{j}+\left(p_{i}-1\right){\bm{w}}_{i}\right)
=\frac{-1}{\lambda_{{\bm{H}}}}\left(-p{\bm{w}}_{i}+\left(p_{i}-1\right){\bm{w}}_{i}\right)
=\frac{-1}{\lambda_{{\bm{H}}}}\left(-p{\bm{w}}_{i}-\left(C-1\right)p{\bm{w}}_{i}\right)
=\frac{1}{\lambda_{{\bm{H}}}}Cp{\bm{w}}_{i}
=\frac{\left(1-C\right)K}{m^{2}}{\bm{w}}_{i}.

Taking the inner product with ${\bm{w}}_{i}$ in equation 5 gives

-\frac{m^{2}}{C-1}\sum_{k\neq i}p_{k}+m^{2}\left(p_{i}-1\right)+\lambda_{{\bm{H}}}\left\langle{\bm{w}}_{i},{\bm{h}}_{ii}^{\lambda}\right\rangle=0
m^{2}p_{i}\left(\frac{1}{C-1}+1\right)-m^{2}\left(\frac{1}{C-1}+1\right)+\lambda_{{\bm{H}}}\left\langle{\bm{w}}_{i},{\bm{h}}_{ii}^{\lambda}\right\rangle=0,

and so

\left\langle{\bm{w}}_{i},{\bm{h}}_{ii}^{\lambda}\right\rangle=(1-C)K. \qquad (6)

By our definition of ${\bm{p}}$ as the softmax applied to ${\bm{W}}{\bm{h}}_{ii}^{\lambda}$, $K$ needs to satisfy

e^{\left\langle{\bm{w}}_{i},{\bm{h}}_{ii}^{\lambda}\right\rangle} =Sp_{i}
=\frac{C}{1-C}\cdot\frac{m^{2}}{K\lambda_{{\bm{H}}}}e^{K}\left(1-(C-1)p\right)
=\frac{C}{1-C}\cdot\frac{m^{2}}{K\lambda_{{\bm{H}}}}e^{K}\left(1+\frac{(C-1)^{2}\lambda_{{\bm{H}}}K}{Cm^{2}}\right).

i.e.,

e^{-CK}=\frac{Cm^{2}}{(1-C)\lambda_{{\bm{H}}}K}+1-C.

So, $K$ must satisfy $f(K)=0$, where $f\colon\left(-\infty,0\right)\to\mathbb{R}$ is defined by

f(x)=e^{-Cx}-\frac{Cm^{2}}{(1-C)\lambda_{{\bm{H}}}x}-(1-C),

(note that we only consider the domain $\left(-\infty,0\right)$ since we’ve shown that $K<0$). We will show that there exists a unique $K$ satisfying these properties.

f^{\prime}(x)=-Ce^{-Cx}+\frac{Cm^{2}}{(1-C)\lambda_{{\bm{H}}}x^{2}}<0,

since $-Ce^{-Cx}<0$ and $\frac{Cm^{2}}{(1-C)\lambda_{{\bm{H}}}x^{2}}<0$ (all terms in the product are positive except $(1-C)<0$) for all $x$. So $f$ is strictly decreasing, and thus it is injective.

$\lim_{x\to 0^{-}}f(x)=-\infty$ and $\lim_{x\to-\infty}f(x)=\infty$, so by continuity of $f$, there exists $K<0$ such that $f(K)=0$, and $K$ is unique by injectivity of $f$.

Case: $i\neq i^{\prime}$

Taking the inner product with ${\bm{w}}_{j}$, $j\neq i,i^{\prime}$, in equation 4 and using the properties of ${\bm{W}}$ as a simplex ETF gives

m^{2}p_{j}-\frac{m^{2}}{C-1}\sum_{k\neq i,i^{\prime},j}p_{k}-\frac{m^{2}\left(p_{i}-\lambda\right)}{C-1}-\frac{m^{2}\left(p_{i^{\prime}}-(1-\lambda)\right)}{C-1}+\lambda_{{\bm{H}}}\left\langle{\bm{w}}_{j},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle=0
m^{2}p_{j}-\frac{m^{2}}{C-1}\left(\sum_{k\neq j}p_{k}-1\right)+\lambda_{{\bm{H}}}\left\langle{\bm{w}}_{j},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle=0
m^{2}p_{j}\left(1+\frac{1}{C-1}\right)+\lambda_{{\bm{H}}}\left\langle{\bm{w}}_{j},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle=0
m^{2}p_{j}\left(\frac{C}{C-1}\right)+\lambda_{{\bm{H}}}\left\langle{\bm{w}}_{j},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle=0.

By the same argument as in the previous case, we get that for all $j,k\neq i,i^{\prime}$,

\left\langle{\bm{w}}_{j},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle=\left\langle{\bm{w}}_{k},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle=K_{\lambda},

for some $K_{\lambda}<0$ (we will omit the subscript $\lambda$ for brevity as we are optimizing over each $\lambda$ individually). Thus $p_{j}=p_{k}=p$ for all $j,k\neq i,i^{\prime}$, where

p=\lambda_{{\bm{H}}}\frac{1-C}{C}\frac{K}{m^{2}}.

Let $S=\sum_{k=1}^{C}\exp\left\langle{\bm{w}}_{k},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle$. Then, $\frac{e^{K}}{S}=p$ and so

S=\frac{e^{K}}{p}=\frac{C}{1-C}\cdot\frac{m^{2}}{K\lambda_{{\bm{H}}}}e^{K}. \qquad (7)

Taking the inner product with ${\bm{w}}_{i}$ in equation 4 gives

\frac{-m^{2}}{C-1}\sum_{j\neq i,i^{\prime}}p_{j}+m^{2}\left(p_{i}-\lambda\right)-\frac{m^{2}}{C-1}\left(p_{i^{\prime}}-(1-\lambda)\right)+\lambda_{{\bm{H}}}\left\langle{\bm{w}}_{i},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle=0
\frac{-m^{2}}{C-1}\left(1-p_{i}-p_{i^{\prime}}\right)+m^{2}\left(p_{i}-\lambda\right)-\frac{m^{2}}{C-1}\left(p_{i^{\prime}}-(1-\lambda)\right)+\lambda_{{\bm{H}}}\left\langle{\bm{w}}_{i},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle=0
\frac{m^{2}}{C-1}p_{i}+m^{2}\left(p_{i}-\lambda\right)-\frac{m^{2}}{C-1}\lambda+\lambda_{{\bm{H}}}\left\langle{\bm{w}}_{i},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle=0
m^{2}p_{i}\left(1+\frac{1}{C-1}\right)-m^{2}\lambda\left(1+\frac{1}{C-1}\right)+\lambda_{{\bm{H}}}\left\langle{\bm{w}}_{i},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle=0
m^{2}\left(1+\frac{1}{C-1}\right)\left(p_{i}-\lambda\right)+\lambda_{{\bm{H}}}\left\langle{\bm{w}}_{i},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle=0.

So,

\frac{Cm^{2}}{C-1}\left(p_{i}-\lambda\right)+\lambda_{{\bm{H}}}\left\langle{\bm{w}}_{i},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle=0. \qquad (8)

Similarly, taking the inner product with ${\bm{w}}_{i^{\prime}}$ in equation 4 gives us

\frac{-m^{2}}{C-1}\sum_{j\neq i,i^{\prime}}p_{j}-\frac{m^{2}}{C-1}\left(p_{i}-\lambda\right)+m^{2}\left(p_{i^{\prime}}-(1-\lambda)\right)+\lambda_{{\bm{H}}}\left\langle{\bm{w}}_{i^{\prime}},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle=0
m^{2}\left(1+\frac{1}{C-1}\right)\left(p_{i^{\prime}}-(1-\lambda)\right)+\lambda_{{\bm{H}}}\left\langle{\bm{w}}_{i^{\prime}},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle=0,

and so

\frac{Cm^{2}}{C-1}\left(p_{i^{\prime}}-(1-\lambda)\right)+\lambda_{{\bm{H}}}\left\langle{\bm{w}}_{i^{\prime}},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle=0. \qquad (9)

Summing equations 8 and 9 gives us

\frac{Cm^{2}}{C-1}\left(p_{i}+p_{i^{\prime}}-1\right)+\lambda_{{\bm{H}}}\left(\left\langle{\bm{w}}_{i},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle+\left\langle{\bm{w}}_{i^{\prime}},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle\right)=0
\frac{Cm^{2}}{C-1}\left(-(C-2)p\right)+\lambda_{{\bm{H}}}\left(\left\langle{\bm{w}}_{i},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle+\left\langle{\bm{w}}_{i^{\prime}},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle\right)=0
(C-2)\lambda_{{\bm{H}}}K+\lambda_{{\bm{H}}}\left(\left\langle{\bm{w}}_{i},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle+\left\langle{\bm{w}}_{i^{\prime}},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle\right)=0
\left\langle{\bm{w}}_{i},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle+\left\langle{\bm{w}}_{i^{\prime}},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle=-(C-2)K.

Then, using equation 7 and the definition of $S$ gives us

\frac{C}{1-C}\cdot\frac{m^{2}}{K\lambda_{{\bm{H}}}}e^{K} =S
=\sum_{k=1}^{C}e^{\left\langle{\bm{w}}_{k},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle}
=(C-2)e^{K}+e^{\left\langle{\bm{w}}_{i},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle}+e^{-(C-2)K-\left\langle{\bm{w}}_{i},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle},

and thus

\left(e^{\left\langle{\bm{w}}_{i},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle}\right)^{2}+e^{K}\left(C-2-\frac{C}{1-C}\cdot\frac{m^{2}}{K\lambda_{{\bm{H}}}}\right)e^{\left\langle{\bm{w}}_{i},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle}+e^{-(C-2)K}=0. \qquad (10)

We then solve the quadratic equation in $e^{\left\langle{\bm{w}}_{i},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle}$ to get

e^{\left\langle{\bm{w}}_{i},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle}=\frac{-e^{K}\left(C-2-\frac{Cm^{2}}{\left(1-C\right)K\lambda_{{\bm{H}}}}\right)\pm\sqrt{\left(e^{K}\left(C-2-\frac{Cm^{2}}{\left(1-C\right)K\lambda_{{\bm{H}}}}\right)\right)^{2}-4e^{-(C-2)K}}}{2}.

So, $\left\langle{\bm{w}}_{i},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle$ is of the form

\log\left(\frac{-e^{K}\left(C-2-\frac{Cm^{2}}{\left(1-C\right)K\lambda_{{\bm{H}}}}\right)\pm\sqrt{\left(e^{K}\left(C-2-\frac{Cm^{2}}{\left(1-C\right)K\lambda_{{\bm{H}}}}\right)\right)^{2}-4e^{-(C-2)K}}}{2}\right).

By our definition of ${\bm{p}}$ as the softmax applied to ${\bm{W}}{\bm{h}}_{ii^{\prime}}^{\lambda}$, $K$ must satisfy

e^{\left\langle{\bm{w}}_{i},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle} =Sp_{i}
=\frac{e^{K}}{p}p_{i}
=\frac{C}{1-C}\cdot\frac{m^{2}}{K\lambda_{{\bm{H}}}}e^{K}\left(\frac{(1-C)\lambda_{{\bm{H}}}\left\langle{\bm{w}}_{i},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle}{Cm^{2}}+\lambda\right).

We have

\sum_{j\neq i,i^{\prime}}{\bm{w}}_{j}p_{j}+\left(p_{i}-\lambda\right){\bm{w}}_{i}+\left(p_{i^{\prime}}-(1-\lambda)\right){\bm{w}}_{i^{\prime}}
=p\left(-{\bm{w}}_{i}-{\bm{w}}_{i^{\prime}}\right)+\left(p_{i}-\lambda\right){\bm{w}}_{i}+\left(p_{i^{\prime}}-(1-\lambda)\right){\bm{w}}_{i^{\prime}}  (since $\sum_{j=1}^{C}{\bm{w}}_{j}=0$)
=\left(p_{i}-\lambda-p\right){\bm{w}}_{i}+\left(p_{i^{\prime}}-(1-\lambda)-p\right){\bm{w}}_{i^{\prime}}.

Substituting this in equation 4, we get

{\bm{h}}_{ii^{\prime}}^{\lambda} =\frac{-1}{\lambda_{{\bm{H}}}}\left(\left(p_{i}-\lambda-p\right){\bm{w}}_{i}+\left(p_{i^{\prime}}-(1-\lambda)-p\right){\bm{w}}_{i^{\prime}}\right)
=\frac{1}{\lambda_{{\bm{H}}}}\left(\left(p-(p_{i}-\lambda)\right){\bm{w}}_{i}+\left(p-(p_{i^{\prime}}-(1-\lambda))\right){\bm{w}}_{i^{\prime}}\right)
=\frac{(1-C)K\lambda_{{\bm{H}}}-(1-C)\lambda_{{\bm{H}}}\left\langle{\bm{w}}_{i},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle}{\lambda_{{\bm{H}}}Cm^{2}}{\bm{w}}_{i}+\frac{(1-C)K\lambda_{{\bm{H}}}-(1-C)\lambda_{{\bm{H}}}\left\langle{\bm{w}}_{i^{\prime}},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle}{\lambda_{{\bm{H}}}Cm^{2}}{\bm{w}}_{i^{\prime}}
=\frac{(1-C)}{Cm^{2}}\left(\left(K-\left\langle{\bm{w}}_{i},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle\right){\bm{w}}_{i}+\left(K-\left\langle{\bm{w}}_{i^{\prime}},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle\right){\bm{w}}_{i^{\prime}}\right)
=\frac{(1-C)}{Cm^{2}}\left(\left(K-\left\langle{\bm{w}}_{i},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle\right){\bm{w}}_{i}+\left((C-1)K+\left\langle{\bm{w}}_{i},{\bm{h}}_{ii^{\prime}}^{\lambda}\right\rangle\right){\bm{w}}_{i^{\prime}}\right),

and that concludes our proof. ∎

A.2 Amplification of Theoretical features

In this section we provide additional details of the function used to generate the amplified features in Figure 6.

We define $\epsilon(\lambda)=\frac{4}{5}\exp(-20(\lambda-0.5)^{4})-\frac{2}{5}$. Then for the different-class features ($i\neq i^{\prime}$), define the amplified features as $\tilde{{\bm{h}}}_{ii^{\prime}}^{\lambda}={\bm{h}}_{ii^{\prime}}^{\lambda}-\epsilon(\lambda)\sum_{j\neq i,i^{\prime}}{\bm{w}}_{j}$.

The motivation for the functional form of $\epsilon(\lambda)$ is to ensure that it is symmetric about $\lambda=0.5$, increasing when $\lambda<0.5$ and decreasing when $\lambda>0.5$, with its maximum at $\lambda=0.5$. These properties correspond to a larger amplification as $\lambda$ approaches $0.5$, while preserving symmetry in the amplifications. Note that the exact function $\epsilon(\lambda)$ is not important; what matters is that it yields last-layer features that are closer to the empirical results (with elongations), while resulting in just a minor increase in loss. As mentioned in the main text, this implies that the features can deviate from the theoretical optimum without much change in the value of the loss.
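A short numpy sketch of this amplification is given below; $\epsilon(\lambda)$ and the amplified features follow the definitions above, while the classifier matrix W and the theoretical feature h are assumed to come from Theorem 3.1 (e.g., via the constructions sketched earlier).

```python
# Sketch: amplify a theoretical different-class feature away from the
# non-target classifier rows, with the amplification peaking at lambda = 0.5.
import numpy as np

def epsilon(lam):
    return 0.8 * np.exp(-20.0 * (lam - 0.5) ** 4) - 0.4   # (4/5)exp(-20(lam-0.5)^4) - 2/5

def amplify(h, W, i, i_prime, lam):
    """h: theoretical feature for classes (i, i_prime); W: (C, d) classifier rows."""
    others = [j for j in range(W.shape[0]) if j not in (i, i_prime)]
    return h - epsilon(lam) * W[others].sum(axis=0)
```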

Appendix B Experimental details

B.1 Hyperparameter settings

For the WideResNet experiments, we minimize the mixup loss using stochastic gradient descent (SGD) with momentum 0.9 and weight decay $1{\times}10^{-4}$. All datasets are trained on a WideResNet-40-10 for 500 epochs with a batch size of 128. We sweep over 10 logarithmically spaced learning rates between $0.01$ and $0.25$, picking whichever results in the highest test accuracy. The learning rate is annealed by a factor of 10 at 30%, 50%, and 90% of the total training time.

For the ViT experiments, we minimize the mixup loss using Adam optimization (Kingma & Ba, 2017). For each dataset we train a ViT-B with a patch size of 4 for 1000 epochs with a batch size of 128. We sweep over 10 logarithmically spaced learning rates from $1{\times}10^{-4}$ to $3{\times}10^{-3}$ and weight decay values from 0 to $0.05$, selecting whichever yields the highest test accuracy. The learning rate is warmed up for 10 epochs and is annealed using cosine annealing as a function of total epochs.
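As a small illustration of the sweeps described above, the candidate learning rates can be generated as logarithmically spaced grids; using np.geomspace is one convenient way to do this and is an assumption rather than the exact sweep driver used in our experiments.

```python
# Sketch: 10 logarithmically spaced learning-rate candidates for each sweep.
import numpy as np

wrn_lrs = np.geomspace(0.01, 0.25, num=10)   # WideResNet-40-10 sweep
vit_lrs = np.geomspace(1e-4, 3e-3, num=10)   # ViT-B/4 sweep
```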

B.2 Projection Method

For all of the last-layer activation plots, the same projection method is used. First, we randomly select three classes. We denote the centred last-layer activations for said classes by a matrix $H\in\mathbb{R}^{m\times n}$ and the classifier of the network for said classes as $W\in\mathbb{R}^{3\times m}$. The projection method is then as follows:

  1. Calculate $USV^{T}=\operatorname{SVD}(W^{*})$ where $W^{*}$ is the normalized classifier.

  2. Define $Q=UV^{T}$.

  3. Let $A\in\mathbb{R}^{2\times 3}$ be a two-dimensional representation of a three-dimensional simplex.

  4. Compute $X=AQH$ and plot.
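A numpy sketch of these four steps is given below; the particular two-dimensional representation $A$ of the three-dimensional simplex (three unit vectors 120 degrees apart) is one reasonable choice and is assumed rather than prescribed above.

```python
# Sketch of the projection method: align activations with the classifier span,
# then map onto a 2-D drawing of a 3-D simplex.
import numpy as np

def project_activations(H, W):
    """H: (m, n) centred activations; W: (3, m) classifier rows for the three classes."""
    W_star = W / np.linalg.norm(W, axis=1, keepdims=True)    # normalized classifier
    U, _, Vt = np.linalg.svd(W_star, full_matrices=False)    # W* = U S V^T
    Q = U @ Vt                                               # (3, m) projection onto classifier directions
    angles = np.pi / 2 + np.array([0.0, 2 * np.pi / 3, 4 * np.pi / 3])
    A = np.stack([np.cos(angles), np.sin(angles)])           # (2, 3) simplex representation
    return A @ Q @ H                                         # (2, n) points to plot
```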

Appendix C Expected calibration error

To calculate the expected calibration error, first gather the predictions into $M$ bins of equal interval size. Let $B_{m}$ be the set of predictions whose confidence is in bin $m$. We can define the accuracy and confidence of a given bin as

\operatorname{acc}(B_{m}) =\frac{1}{\left|B_{m}\right|}\sum_{i\in B_{m}}\mathbf{1}\left(\hat{y}_{i}=y_{i}\right)
\operatorname{conf}(B_{m}) =\frac{1}{\left|B_{m}\right|}\sum_{i\in B_{m}}\hat{p}_{i}

where $\hat{p}_{i}$ is the confidence of example $i$. The expected calibration error (ECE) is then calculated as

\operatorname{ECE}=\sum_{m=1}^{M}\frac{\left|B_{m}\right|}{n}\left|\operatorname{acc}\left(B_{m}\right)-\operatorname{conf}\left(B_{m}\right)\right|.
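A minimal numpy sketch of this computation, assuming $M$ equal-width confidence bins over $[0,1]$, is:

```python
# Sketch: expected calibration error with M equal-width confidence bins.
import numpy as np

def expected_calibration_error(confidences, predictions, labels, M=15):
    n = len(labels)
    ece = 0.0
    edges = np.linspace(0.0, 1.0, M + 1)
    for m in range(M):
        in_bin = (confidences > edges[m]) & (confidences <= edges[m + 1])
        if in_bin.sum() == 0:
            continue                                   # empty bins contribute nothing
        acc = (predictions[in_bin] == labels[in_bin]).mean()
        conf = confidences[in_bin].mean()
        ece += (in_bin.sum() / n) * abs(acc - conf)    # |B_m|/n * |acc(B_m) - conf(B_m)|
    return ece
```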

Appendix D Additional Last-Layer Plots

Here we provide additional plots of last-layer activations. Namely, Figure 8 provides additional baseline last-layer activation plots for mixup data for every architecture and dataset combination in Figure 1. Figure 9 provides last-layer activations for the additional $\alpha$ value in Table 2. Figure 10 shows the evolution of the last-layer activations throughout training. Figure 11 shows last-layer activations for multiple random subsets of three classes. Finally, Figure 12 shows the last-layer activations for a ViT-B/4 trained on CIFAR10 using the AdamW optimizer.

Figure 8: (Visualization of activations outputted by networks trained without mixup). Last-layer activations for a randomly selected subset of three classes of mixup training data for various dataset and network architecture combinations. Projections are generated using the same method as Figure 1. All networks are trained using empirical risk minimization (no mixup). Colouring indicates mixup type (same-class or different-class), and the level of mixup, $\lambda$.
Figure 9: (Visualization of activations outputted by networks trained with $\alpha=0.4$). Last-layer activations for WideResNet-40-10 and ViT-B/4 trained on the CIFAR10 dataset, subsetted to three randomly selected classes. Projections are generated using the same method as Figure 1. For both cases, $\alpha=0.4$ is used. Colouring indicates mixup type (same-class or different-class), and the level of mixup, $\lambda$. Relevant classifiers plotted in black.
(a) Epoch 100
(b) Epoch 300
(c) Epoch 500
Figure 10: (Activation convergence during training with mixup). Evolution of last-layer activations for WideResNet-40-10 trained on CIFAR10 throughout training. Projections are generated in the same manner as Figure 1. Coloration indicates the type of mixup (same-class or different-class), along with the level of mixup, $\lambda$. Relevant classifiers plotted in black. As training progresses, different-class mixup points are pushed towards the decision boundary, converging to the configuration depicted in Figure 1.
Figure 11: (Visualization of last-layer activations for multiple subsets of classes). Last-layer activations for randomly selected subsets of three classes of mixup training data for a WRN-40-10 trained on CIFAR10. Projections are generated using the same method as Figure 1. Colouring indicates mixup type (same-class or different-class), and the level of mixup, $\lambda$. Black indicates relevant classifiers.

Figure 12: (Visualization of activations outputted by ViT-B trained with mixup using AdamW). Last-layer activations for a ViT-B/4 trained following the same training regimen as the ViTs outlined in Section 2, except with the AdamW optimizer. The projection is generated using the same method as Figure 1. Colouring indicates mixup type (same-class or different-class), and the level of mixup, $\lambda$. Black indicates relevant classifiers.