
Generalized Neural Collapse for a Large Number of Classes

Jiachen Jiang (Ohio State University, [email protected]), Jinxin Zhou (Ohio State University, [email protected]), Peng Wang (University of Michigan, [email protected]), Qing Qu (University of Michigan, [email protected]), Dustin Mixon (Ohio State University, [email protected]), Chong You (Google Research, [email protected]), Zhihui Zhu (Ohio State University, [email protected]). The first two authors contributed equally to this work. Corresponding authors: CY and ZZ.
Abstract

Neural collapse provides an elegant mathematical characterization of learned last-layer representations (a.k.a. features) and classifier weights in deep classification models. Such results not only provide insights but also motivate new techniques for improving practical deep models. However, most existing empirical and theoretical studies of neural collapse focus on the case where the number of classes is small relative to the dimension of the feature space. This paper extends neural collapse to cases where the number of classes is much larger than the dimension of the feature space, which broadly occur in language models, retrieval systems, and face recognition applications. We show that the features and classifier exhibit a generalized neural collapse phenomenon, in which the minimum one-vs-rest margin is maximized. We provide an empirical study to verify the occurrence of generalized neural collapse in practical deep neural networks. Moreover, we provide a theoretical study showing that generalized neural collapse provably occurs under the unconstrained feature model with spherical constraints, under certain technical conditions on the feature dimension and the number of classes.

1 Introduction

Over the past decade, deep learning algorithms have achieved remarkable progress across numerous machine learning tasks and have significantly advanced the state of the art in many practical applications, ranging from computer vision to natural language processing and retrieval systems. Despite this tremendous success, a comprehensive understanding of the features learned by deep neural networks (DNNs) is still lacking. The recent work of Papyan et al. (2020); Papyan (2020) empirically uncovered an intriguing phenomenon regarding the last-layer features and classifier of DNNs, called Neural Collapse (\mathcal{NC}), which can be briefly summarized by the following characteristics:

  • Variability Collapse (\mathcal{NC}_{1}): Within-class variability of features collapses to zero.

  • Convergence to Simplex ETF (\mathcal{NC}_{2}): Class-mean features converge to a simplex Equiangular Tight Frame (ETF), achieving equal lengths, equal pair-wise angles, and maximal distance in the feature space.

  • Self-Duality (\mathcal{NC}_{3}): Linear classifiers converge to the class-mean features, up to a global rescaling.

Neural collapse provides a mathematically elegant characterization of learned representations or features in deep-learning-based classification models, independent of network architectures, dataset properties, and optimization algorithms. Building on the so-called unconstrained feature model (Mixon et al., 2020) or the layer-peeled model (Fang et al., 2021), subsequent research (Zhu et al., 2021; Lu & Steinerberger, 2020; Ji et al., 2021; Yaras et al.; Wojtowytsch et al., 2020; Ji et al.; Zhou et al.; Han et al.; Tirer & Bruna, 2022; Zhou et al., 2022a; Poggio & Liao, 2020; Thrampoulidis et al., 2022; Tirer et al., 2023; Nguyen et al., 2022) has provided theoretical evidence for the existence of the \mathcal{NC} phenomenon for a family of loss functions including the cross-entropy (CE) loss, the mean-square-error (MSE) loss, and variants of the CE loss. Theoretical results regarding \mathcal{NC} not only contribute to a new understanding of the workings of DNNs but also provide inspiration for developing new techniques to enhance their practical performance in various settings, such as imbalanced learning (Xie et al., 2023; Liu et al., 2023b), transfer learning (Galanti et al., 2022a; Li et al., 2022; Xie et al., 2022; Galanti et al., 2022b), continual learning (Yu et al., 2022; Yang et al., 2023), loss and architecture designs (Chan et al., 2022; Yu et al., 2020; Zhu et al., 2021), etc.

However, most existing empirical and theoretical studies of \mathcal{NC} focus on the case where the number of classes is small relative to the dimension of the feature space. In practice, however, there are many settings where the number of classes can be extremely large, such as

  • Person identification (Deng et al., 2019), where each identity is regarded as one class.

  • Language models (Devlin et al., 2018), where the number of classes equals the vocabulary size. (Footnote: Language models are usually trained to classify a token (or a collection of them) that is either masked in the input (as in BERT (Devlin et al., 2018)), the next one following the context (as in language modeling), or a span of masked tokens in the input (as in T5 (Raffel et al., 2020)), etc. In such cases, the number of classes is equal to the number of all possible tokens, i.e., the vocabulary size.)

  • Retrieval systems (Mitra et al., 2018), where each document in the dataset represents one class.

  • Contrastive learning (Chen et al., 2020a), where each training data can be regarded as one class.

In such cases, it is usually infeasible to have a feature dimension commensurate with the number of classes due to computational and memory constraints. Therefore, it is crucial to develop a comprehensive understanding of the characteristics of learned features in such cases, particularly with the increasing use of web-scale datasets that have a vast number of classes.

Figure 1: In Generalized Neural Collapse (\mathcal{GNC}), the optimal classifier weights \{{\bm{w}}_{k}\} form a Softmax Code, defined by maximizing the one-vs-rest distance (see Definition 2.1). (a, b) Illustration of the one-vs-rest distance using the example of the {\bm{w}}_{1}-vs-\{{\bm{w}}_{2},{\bm{w}}_{3},{\bm{w}}_{4}\} distance, under two configurations of \{{\bm{w}}_{k}\}_{k=1}^{4} in a two-dimensional space; the distance in Case 1 is larger than that in Case 2. (c) Illustration of the one-vs-one distance used to define the Tammes problem (see Eq. (11)). We prove \mathcal{GNC} under technical conditions on the Softmax Code and the Tammes problem (see Section 3).
Contributions.

This paper studies the geometric properties of the learned last-layer features and the classifiers for cases where the number of classes can be arbitrarily large compared to the feature dimension. Motivated by the use of spherical constraints in learning with a large number of classes, such as person identification and contrastive learning, we consider networks trained with spherical constraints on the features and classifiers. Our contributions can be summarized as follows.

  • The Arrangement Problem: Generalizing \mathcal{NC} to a Large Number of Classes. In Section 2 we introduce generalized \mathcal{NC} (\mathcal{GNC}) for characterizing the last-layer features and classifier. In particular, \mathcal{GNC}1 and \mathcal{GNC}3 state the same as \mathcal{NC}1 and \mathcal{NC}3, respectively. \mathcal{GNC}2 states that the classifier weights form a Softmax Code, which generalizes the notion of a simplex ETF and is defined as a collection of points on the unit hyper-sphere that maximizes the minimum one-vs-rest distance (see Figure 1(a,b) for an illustration). Empirically, we verify that \mathcal{GNC} approximately holds in practical DNNs trained with a small temperature in the CE loss. Furthermore, we conduct a theoretical study in Section 3 showing that, under the unconstrained features model (UFM) (Mixon et al., 2020; Fang et al., 2021; Zhu et al., 2021) and with a vanishing temperature, the global solutions satisfy \mathcal{GNC} under technical conditions on the Softmax Code and on solutions to the Tammes problem (Tammes, 1930), the latter defined as a collection of points on the unit hyper-sphere that maximizes the minimum one-vs-one distance (see Figure 1(c) for an illustration).

  • The Assignment Problem: Implicit Regularization of Class Semantic Similarity. Unlike the simplex ETF (as in \mathcal{NC}2), in which the distance between any pair of vectors is the same, not all pairs in a Softmax Code (as in \mathcal{GNC}2) are equally distant when the number of classes is greater than the feature space dimension. This leads to an “assignment” problem, i.e., the correspondence between the classes and the weights in a Softmax Code. In Section 4, we empirically show an implicit regularization effect of the semantic similarity of the classes: conceptually similar classes (e.g., Cat and Dog) are often assigned to closer classifier weights in the Softmax Code than conceptually dissimilar ones (e.g., Cat and Truck). Moreover, such implicit regularization is beneficial, i.e., enforcing other assignments produces inferior model quality.

  • Cost Reduction for Practical Network Training/Fine-tuning. The universality of alignment between classifier weights and class means (i.e., 𝒢𝒩𝒞\mathcal{GNC}3) implies that training the classifier is unnecessary and the weight can be simply replaced by the class-mean features. Our experiments in Section 5 demonstrate that such a strategy achieves comparable performance to classical training methods, and even better out-of-distribution performance than classical fine-tuning methods with significantly reduced parameters.

Related work.

The recent work Liu et al. (2023a) also introduces a notion of generalized \mathcal{NC} for the case of a large number of classes, which predicts equally spaced features. However, their work focuses on networks trained with weight decay, which the empirical results in Appendix B and in Yaras et al. (2023) show does not produce equal-length and equally spaced features when the number of classes is relatively large. Moreover, Liu et al. (2023a) rely on a specific choice of kernel function to describe the uniformity, whereas we concretely define \mathcal{GNC}2 through the Softmax Code. While preparing this submission, we noticed a concurrent work, Gao et al. (2023), that also provides an analysis of generalized \mathcal{NC}, but again for networks trained with weight decay. In addition, that work analyzes gradient flow for the corresponding UFM with a particular choice of weight decay, while our work studies the global optimality of the training problem. The work Zhou et al. (2022a) empirically shows that the MSE loss is inferior to the CE loss when K>d+1, but provides no formal analysis for the CE loss. Finally, the global optimality of the UFM with spherical constraints has been studied in Lu & Steinerberger (2022); Yaras et al. (2023), but only for the cases K\leq d+1 or K\rightarrow\infty.

2 Generalized Neural Collapse for A Large Number of Classes

In this section, we begin by providing a brief overview of DNNs and introducing the notation used in this study in Section 2.1, where we also introduce the UFM used in the theoretical study of the subsequent section. We then introduce the notion of a Softmax Code for describing the distribution of a collection of points on the unit sphere, which prepares us to present a formal definition of Generalized Neural Collapse and an empirical verification of its validity in Section 2.2.

2.1 Basic Concepts of DNNs

A DNN classifier aims to learn a feature mapping \phi_{{\bm{\theta}}}(\cdot):\mathbb{R}^{D}\rightarrow\mathbb{R}^{d} with learnable parameters {\bm{\theta}} that maps an input \bm{x}\in\mathbb{R}^{D} to a deep representation, called the feature, \phi_{{\bm{\theta}}}(\bm{x})\in\mathbb{R}^{d}, together with a linear classifier {\bm{W}}=\left[{\bm{w}}_{1}\quad{\bm{w}}_{2}\quad\cdots\quad{\bm{w}}_{K}\right]\in\mathbb{R}^{d\times K}, such that the output (also known as the logits) \Psi_{\bm{\Theta}}(\bm{x})={\bm{W}}^{\top}\phi_{{\bm{\theta}}}\left(\bm{x}\right)\in\mathbb{R}^{K} makes a correct prediction. Here, \bm{\Theta}=\{{\bm{\theta}},{\bm{W}}\} represents all the learnable parameters of the DNN. (Footnote: We ignore the bias term in the linear classifier since (i) the bias term compensates for the global mean of the features and vanishes when the global mean is zero (Papyan et al., 2020; Zhu et al., 2021), and (ii) this is the default setting in a wide range of applications such as person identification (Wang et al., 2018; Deng et al., 2019) and contrastive learning (Chen et al., 2020a; Chen & He, 2021).)

Given a balanced training set \{(\bm{x}_{k,i},\bm{y}_{k})\}_{i\in[n],k\in[K]}\subseteq\mathbb{R}^{D}\times\mathbb{R}^{K}, where {\bm{x}}_{k,i} is the i-th sample in the k-th class and {\bm{y}}_{k} is the corresponding one-hot label with all zero entries except for unity in the k-th entry, the network parameters \bm{\Theta} are typically optimized by minimizing the following CE loss

\min_{\bm{\Theta}}\frac{1}{nK}\sum_{k=1}^{K}\sum_{i=1}^{n}\mathcal{L}_{\text{CE}}\left(\Psi_{\bm{\Theta}}\left({\bm{x}}_{k,i}\right),{\bm{y}}_{k},\tau\right),\ \mathcal{L}_{\text{CE}}\left(\bm{z},\bm{y}_{k},\tau\right)=-\log\Big{(}\frac{\exp(z_{k}/\tau)}{\sum_{j=1}^{K}\exp(z_{j}/\tau)}\Big{)}. (1)

In the above, we assume that a spherical constraint is imposed on the features and classifier weights and that the logit z_{k} is divided by the temperature parameter \tau. This is a common practice when dealing with a large number of classes (Wang et al., 2018; Chang et al., 2019; Chen et al., 2020a). Specifically, we enforce \{{\bm{w}}_{k},\phi_{{\bm{\theta}}}({\bm{x}}_{k,i})\}\subseteq\mathbb{S}^{d-1}:=\{\bm{a}\in\mathbb{R}^{d}:\|\bm{a}\|_{2}=1\} for all i\in[n] and k\in[K]. An alternative regularization is weight decay on the model parameters \bm{\Theta}, whose effect we study in Appendix B.
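The loss in (1) with spherical constraints can be implemented by normalizing the features and classifier weights before computing temperature-scaled logits. Below is a minimal PyTorch-style sketch under our own naming (the function name and default \tau are illustrative, not taken from the paper's code).

```python
# Minimal sketch of the temperature-scaled CE loss in Eq. (1) with spherical constraints.
import torch
import torch.nn.functional as F

def spherical_ce_loss(features, weights, labels, tau=1/30):
    """features: (N, d) raw last-layer features; weights: (K, d); labels: (N,)."""
    h = F.normalize(features, dim=1)          # enforce ||h||_2 = 1
    w = F.normalize(weights, dim=1)           # enforce ||w_k||_2 = 1
    logits = h @ w.t() / tau                  # z_k / tau
    return F.cross_entropy(logits, labels)

# toy usage with random tensors
h = torch.randn(8, 10)                        # 8 samples, feature dimension d = 10
w = torch.randn(100, 10, requires_grad=True)  # K = 100 classes
y = torch.randint(0, 100, (8,))
loss = spherical_ce_loss(h, w, y)
```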

To simplify notation, we denote the oblique manifold embedded in Euclidean space by \operatorname{{\mathcal{O}B}}(d,K):=\left\{{\bm{W}}\in\mathbb{R}^{d\times K}\ |\ {\bm{w}}_{k}\in\mathbb{S}^{d-1},\ \forall k\in[K]\right\}. In addition, we denote the last-layer features by {\bm{h}}_{k,i}:=\phi_{{\bm{\theta}}}({\bm{x}}_{k,i}), and collect all the features in matrix form as

\bm{H}:=\left[\bm{H}_{1}\quad\bm{H}_{2}\quad\cdots\quad\bm{H}_{K}\right]\in\mathbb{R}^{d\times nK},\ \text{with}\ \mathbf{H}_{k}:=\begin{bmatrix}\mathbf{h}_{k,1}&\cdots&\mathbf{h}_{k,n}\end{bmatrix}\in\mathbb{R}^{d\times n}.

We also denote by \overline{{\bm{h}}}_{k}:=\frac{1}{n}\sum_{i=1}^{n}{\bm{h}}_{k,i} the class-mean feature of each class.

Unconstrained Features Model (UFM).

The UFM (Mixon et al., 2020), or layer-peeled model (Fang et al., 2021), wherein the last-layer features are treated as free optimization variables, is widely used for theoretically understanding the \mathcal{NC} phenomenon. In this paper, we consider the following UFM with spherical constraints on both the classifier weights {\bm{W}} and the unconstrained features {\bm{H}}:

\min_{{\bm{W}},{\bm{H}}}\frac{1}{nK}\sum_{k=1}^{K}\sum_{i=1}^{n}\mathcal{L}_{\text{CE}}\left({\bm{W}}^{\top}{\bm{h}}_{k,i},{\bm{y}}_{k},\tau\right)\quad\textrm{s.t.}\quad{\bm{W}}\in\operatorname{{\mathcal{O}B}}(d,K),\ {\bm{H}}\in\operatorname{{\mathcal{O}B}}(d,nK). (2)

2.2 Generalized Neural Collapse

We start by introducing the notion of a Softmax Code, which will be used to describe \mathcal{GNC}.

Definition 2.1 (Softmax Code).

Given positive integers dd and KK, a softmax code is an arrangement of KK points on a unit sphere of d\mathbb{R}^{d} that maximizes the minimal distance between one point and the convex hull of the others:

\max_{{\bm{W}}\in\operatorname{{\mathcal{O}B}}(d,K)}\rho_{\textup{one-vs-rest}}({\bm{W}}),\quad\text{where}\quad\rho_{\textup{one-vs-rest}}({\bm{W}})\doteq\min_{k}\operatorname{dist}\!\Big{(}{\bm{w}}_{k},\{{\bm{w}}_{j}\}_{j\in[K]\setminus k}\Big{)}. (3)

In the above, the distance between a point \bm{v} and a set \mathcal{W} is defined as \operatorname{dist}(\bm{v},\mathcal{W})=\inf_{\bm{w}\in\operatorname{conv}(\mathcal{W})}\{\|\bm{v}-\bm{w}\|\}, where \operatorname{conv}(\cdot) denotes the convex hull of a set.
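Since \operatorname{dist}(\bm{v},\mathcal{W}) is a distance to a convex hull, it can be computed as a small convex program over the simplex. The following NumPy/SciPy sketch (our own helper names, not the authors' code) evaluates this distance and \rho_{\textup{one-vs-rest}} for a given {\bm{W}}.

```python
# Illustrative sketch of Definition 2.1: dist to a convex hull via a simplex-constrained QP.
import numpy as np
from scipy.optimize import minimize

def dist_to_convex_hull(v, W_rest):
    """v: (d,); W_rest: (d, m). Returns min_{lam in simplex} ||v - W_rest @ lam||_2."""
    m = W_rest.shape[1]
    obj = lambda lam: np.linalg.norm(v - W_rest @ lam)
    cons = [{"type": "eq", "fun": lambda lam: lam.sum() - 1.0}]
    res = minimize(obj, np.full(m, 1.0 / m), method="SLSQP",
                   bounds=[(0.0, 1.0)] * m, constraints=cons)
    return res.fun

def rho_one_vs_rest(W):
    """W: (d, K) with unit-norm columns; minimum one-vs-rest distance over all k."""
    K = W.shape[1]
    return min(dist_to_convex_hull(W[:, k], np.delete(W, k, axis=1)) for k in range(K))

# example: 4 points uniformly spread on the unit circle (cf. Case 1 in Figure 1)
angles = 2 * np.pi * np.arange(4) / 4
W = np.stack([np.cos(angles), np.sin(angles)])   # shape (2, 4)
print(rho_one_vs_rest(W))
```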

We now extend \mathcal{NC} to Generalized Neural Collapse (\mathcal{GNC}), which captures the properties of the features and classifier at the terminal phase of training. With a vanishing temperature (i.e., \tau\rightarrow 0), the last-layer features and classifier exhibit the following \mathcal{GNC} phenomenon:

  • Variability Collapse (\mathcal{GNC}_{1}). All features of the same class collapse to the corresponding class mean. Formally, as in Papyan et al. (2020), the quantity \mathcal{GNC}_{1}\doteq\frac{1}{K}\operatorname{tr}\big{(}\bm{\Sigma}_{W}\bm{\Sigma}_{B}^{\dagger}\big{)}\rightarrow 0, where \bm{\Sigma}_{B}:=\frac{1}{K}\sum_{k=1}^{K}\overline{{\bm{h}}}_{k}\overline{{\bm{h}}}_{k}^{\top} and \bm{\Sigma}_{W}:=\frac{1}{nK}\sum_{k=1}^{K}\sum_{i=1}^{n}\left({\bm{h}}_{k,i}-\overline{{\bm{h}}}_{k}\right)\left({\bm{h}}_{k,i}-\overline{{\bm{h}}}_{k}\right)^{\top} denote the between-class and within-class covariance matrices, respectively.

  • Softmax Codes (\mathcal{GNC}_{2}). The classifier weights converge to a Softmax Code in the sense of Definition 2.1. This property may be measured by \mathcal{GNC}_{2}\doteq\rho_{\textup{one-vs-rest}}({\bm{W}})\to\max_{{\bm{W}}\in\operatorname{{\mathcal{O}B}}(d,K)}\rho_{\textup{one-vs-rest}}({\bm{W}}).

  • Self-Duality (\mathcal{GNC}_{3}). The linear classifiers converge to the class-mean features. Formally, this alignment can be measured by \mathcal{GNC}_{3}\doteq\frac{1}{K}\sum_{k=1}^{K}\left(1-\bm{w}_{k}^{\top}\overline{\bm{h}}_{k}\right)\rightarrow 0; a numerical sketch of the \mathcal{GNC}_{1} and \mathcal{GNC}_{3} metrics is given below.
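The \mathcal{GNC}_{1} and \mathcal{GNC}_{3} metrics above can be computed directly from the last-layer features and classifier weights. Below is a small NumPy sketch with an assumed array layout (features stored as a (K, n, d) tensor); it is illustrative rather than the authors' evaluation code.

```python
# Sketch of the GNC_1 and GNC_3 metrics from features H[k, i] and classifier weights W.
import numpy as np

def gnc_metrics(H, W):
    """H: (K, n, d) unit-norm features; W: (K, d) unit-norm classifier weights."""
    K, n, d = H.shape
    means = H.mean(axis=1)                                    # class means, (K, d)
    Sigma_B = means.T @ means / K                             # between-class covariance
    centered = H - means[:, None, :]                          # (K, n, d)
    Sigma_W = np.einsum("kni,knj->ij", centered, centered) / (n * K)  # within-class cov.
    gnc1 = np.trace(Sigma_W @ np.linalg.pinv(Sigma_B)) / K    # variability collapse
    gnc3 = np.mean(1.0 - np.sum(W * means, axis=1))           # self-duality
    return gnc1, gnc3
```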

Figure 2: Illustration of \mathcal{GNC} and test accuracy across different temperatures \tau when training a ResNet18 on CIFAR100 with d=10 and K=100. “Optimal” in the second-left panel refers to \max_{{\bm{W}}\in\operatorname{{\mathcal{O}B}}(d,K)}\rho_{\textup{one-vs-rest}}({\bm{W}}).

The main difference between \mathcal{GNC} and \mathcal{NC} lies in \mathcal{GNC}2 / \mathcal{NC}2, which describe the configuration of the classifier weights \bm{W}. In \mathcal{NC}2, the classifier weights of different classes form a simplex ETF, a configuration of vectors with equal pair-wise distances in which that common distance is maximized. Such a configuration does not exist in general when the number of classes is large, i.e., when K>d+1. \mathcal{GNC}2 instead introduces a new configuration described by the notion of a Softmax Code. By Definition 2.1, a Softmax Code is a configuration in which each vector is maximally separated from all the other points, as measured by its distance to their convex hull. This definition is motivated by the theoretical analysis (see Section 3). In particular, it reduces to the simplex ETF when K\leq d+1 (see Theorem 3.3).

Interpretation of Softmax Code.

A Softmax Code admits a max-distance interpretation. Specifically, consider the features \{{\bm{h}}_{k,i}\}_{k\in[K],i\in[n]} from the K classes. In multi-class classification, one commonly used distance (or margin) measure is the one-vs-rest (also called one-vs-all or one-vs-other) distance (Murphy, 2022), i.e., the distance of class k vis-a-vis the other classes. Noting that the distance between two classes is equivalent to the distance between the convex hulls of the data from each class (Murphy, 2022), the distance of class k vis-a-vis the other classes is given by \operatorname{dist}(\{{\bm{h}}_{k,i}\}_{i\in[n]},\{{\bm{h}}_{k^{\prime},i}\}_{k^{\prime}\in[K]\setminus k,i\in[n]}). From \mathcal{GNC}1 and \mathcal{GNC}3 we can rewrite this distance as

\operatorname{dist}\!\big{(}\{{\bm{h}}_{k,i}\}_{i\in[n]},\{{\bm{h}}_{k^{\prime},i}\}_{k^{\prime}\in[K]\setminus k,i\in[n]}\big{)}=\operatorname{dist}\!\big{(}\overline{{\bm{h}}}_{k},\{\overline{{\bm{h}}}_{k^{\prime}}\}_{k^{\prime}\in[K]\setminus k}\big{)}=\operatorname{dist}\!\big{(}{\bm{w}}_{k},\{{\bm{w}}_{k^{\prime}}\}_{k^{\prime}\in[K]\setminus k}\big{)}. (4)

Since a Softmax Code maximizes the minimum over k of the rightmost term, it follows from \mathcal{GNC}2 that the learned features are such that their one-vs-rest distance, minimized over all classes k\in[K], is maximized. In other words, measured by the one-vs-rest distance, the learned features are maximally separated. Finally, we mention that the separation of classes may also be characterized by other distance measures, such as the one-vs-one distance (also known as the sample margin in Cao et al. (2019); Zhou et al. (2022b)), which leads to the well-known Tammes problem. We discuss this in Section 3.2.

Experimental Verification of 𝒢𝒩𝒞\mathcal{GNC}.

We verify the occurrence of \mathcal{GNC} by training a ResNet18 (He et al., 2016) for image classification on the CIFAR100 dataset (Krizhevsky, 2009), and report the results in Figure 2. To simulate the case of K>d+1, we use a modified ResNet18 whose feature dimension is 10. From Figure 2, we observe that both \mathcal{GNC}1 and \mathcal{GNC}3 converge to 0, and that \mathcal{GNC}2 converges towards the optimal value attained by a Softmax Code when the temperature \tau is relatively small. Additionally, selecting a small \tau is necessary not only for achieving \mathcal{GNC}, but also for attaining high test performance. Due to limited space, we present experimental details and further experiments with different architectures and datasets in Appendix B. In the next section, we provide a theoretical justification for \mathcal{GNC} under the UFM in (2).

3 Theoretical Analysis of GNC

In this section, we provide a theoretical analysis of \mathcal{GNC} under the UFM in (2). We first show in Section 3.1 that, with an appropriate temperature parameter, the solution to (2) can be approximated by the solution to a “HardMax” problem of a simpler form that is amenable to subsequent analysis. We then provide a theoretical analysis of \mathcal{GNC} in Section 3.2, first proving that the optimal classifier forms a Softmax Code (\mathcal{GNC}2), and then establishing \mathcal{GNC}1 and \mathcal{GNC}3 under technical conditions on the Softmax Code and on solutions to the Tammes problem. In addition, we provide insights for the choice of feature dimension d given a number of classes K by analyzing upper and lower bounds on the one-vs-rest distance of a Softmax Code. All proofs can be found in Appendix C.

3.1 Preparation: the Asymptotic CE Loss

Due to the nature of the softmax function, which blends the entries of the output vector, analyzing the CE loss can be difficult even for the unconstrained features model. The previous work Yaras et al. (2023) analyzing the case K\leq d+1 relies on the simple structure of the global solutions, where the classifiers form a simplex ETF. This approach cannot be directly applied to the case K>d+1, however, due to the absence of an informative characterization of the global solutions. Motivated by the fact that the temperature \tau is often selected as a small value (\tau<1, e.g., \tau=1/30 in Wang et al. (2018)) in practical applications (Wang et al., 2018; Chen & He, 2021), we consider the case \tau\rightarrow 0, in which the CE loss (2) converges to the following “HardMax” problem:

\min_{\begin{subarray}{c}{\bm{W}}\in\operatorname{{\mathcal{O}B}}(d,K)\\ {\bm{H}}\in\operatorname{{\mathcal{O}B}}(d,nK)\end{subarray}}\mathcal{L}_{\text{HardMax}}({\bm{W}},{\bm{H}}),\quad\text{where}\quad\mathcal{L}_{\text{HardMax}}({\bm{W}},{\bm{H}})\doteq\max_{k\in[K]}\max_{i\in[n]}\max_{k^{\prime}\neq k}\langle\bm{w}_{k^{\prime}}-\bm{w}_{k},\bm{h}_{k,i}\rangle, (5)

where ,\langle\cdot,\cdot\rangle denotes the inner-product operator. More precisely, we have the following result.

Lemma 3.1 (Convergence to the HardMax problem).

For any positive integers K and n, we have

\limsup_{\tau\to 0}\left(\operatorname*{arg\,min}_{\begin{subarray}{c}{\bm{W}}\in\operatorname{{\mathcal{O}B}}(d,K)\\ {\bm{H}}\in\operatorname{{\mathcal{O}B}}(d,nK)\end{subarray}}\frac{1}{nK}\sum_{k=1}^{K}\sum_{i=1}^{n}\mathcal{L}_{\text{CE}}\left({\bm{W}}^{\top}{\bm{h}}_{k,i},{\bm{y}}_{k},\tau\right)\right)\subseteq\operatorname*{arg\,min}_{\begin{subarray}{c}{\bm{W}}\in\operatorname{{\mathcal{O}B}}(d,K)\\ {\bm{H}}\in\operatorname{{\mathcal{O}B}}(d,nK)\end{subarray}}\mathcal{L}_{\text{HardMax}}({\bm{W}},{\bm{H}}). (6)

Our goal is not to replace CE with the HardMax function in practice. Instead, we will analyze the HardMax problem in (5) to gain insight into the global solutions and the 𝒢𝒩𝒞\mathcal{GNC} phenomenon.
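The HardMax objective in (5) is simply the largest logit-margin violation over all samples and wrong classes. The following NumPy sketch (illustrative naming and array layout) evaluates it for given unit-norm {\bm{W}} and {\bm{H}}.

```python
# Sketch of the HardMax loss in Eq. (5): the worst-case margin violation.
import numpy as np

def hardmax_loss(W, H):
    """W: (d, K) unit-norm columns; H: (d, n, K) with H[:, i, k] = h_{k,i}."""
    logits = np.einsum("dc,dnk->cnk", W, H)          # logits[c, i, k] = <w_c, h_{k,i}>
    K = W.shape[1]
    loss = -np.inf
    for k in range(K):
        correct = logits[k, :, k]                                    # <w_k, h_{k,i}>
        wrong = np.delete(logits[:, :, k], k, axis=0).max(axis=0)    # max_{c != k} <w_c, h_{k,i}>
        loss = max(loss, np.max(wrong - correct))
    return loss
```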

3.2 Main Result: Theoretical Analysis of 𝒢𝒩𝒞\mathcal{GNC}

\mathcal{GNC}2 and Softmax Code.

Our main result for \mathcal{GNC}2 is the following.

Theorem 3.2 (\mathcal{GNC}2).

Let ({\bm{W}}^{\star},{\bm{H}}^{\star}) be an optimal solution to (5). Then {\bm{W}}^{\star} is a Softmax Code, i.e.,

{\bm{W}}^{\star}=\operatorname*{arg\,max}\limits_{{\bm{W}}\in\operatorname{{\mathcal{O}B}}(d,K)}\rho_{\textup{one-vs-rest}}({\bm{W}}). (7)

\mathcal{GNC}2 is described by the Softmax Code, which is defined through an optimization problem (see Definition 2.1). This optimization problem may not have a closed-form solution in general. Nonetheless, the one-vs-rest distance that defines the Softmax Code has a clear geometric meaning, which makes an intuitive interpretation of the Softmax Code tractable. Specifically, maximizing the one-vs-rest distance forces the classifier weight vectors \{{\bm{w}}_{k}^{\star}\} to be maximally distant. As shown in Figures 1(a) and 1(b) for a simple setting of four classes in a 2D plane, weight vectors \{{\bm{w}}_{k}\} that are uniformly distributed (and hence maximally distant) have a larger margin than in the non-uniform case.

For certain choices of (d,K), the Softmax Code takes a simple form.

Theorem 3.3.

For any positive integers K and d, let {\bm{W}}^{\star}\in\operatorname{{\mathcal{O}B}}(d,K) be a Softmax Code. Then,

  • d=2: \{{\bm{w}}_{k}^{\star}\} is uniformly distributed on the unit circle, i.e., \{{\bm{w}}_{k}^{\star}\}=\{\big{(}\cos(\frac{2\pi k}{K}+\alpha),\sin(\frac{2\pi k}{K}+\alpha)\big{)}\} for some \alpha;

  • K\leq d+1: \{{\bm{w}}_{k}^{\star}\} forms a simplex ETF, i.e., \bm{W}^{\star}=\sqrt{\frac{K}{K-1}}\bm{P}(\bm{I}_{K}-\frac{1}{K}\textbf{1}_{K}\textbf{1}_{K}^{\top}) for some orthonormal \bm{P}\in\mathbb{R}^{d\times K};

  • d+1<K\leq 2d: \rho_{\textup{one-vs-rest}}({\bm{W}}^{\star})=1, which is achieved when \{{\bm{w}}_{k}^{\star}\} is a subset of the vertices of a cross-polytope.

For the case K\leq d+1, the optimal {\bm{W}}^{\star} from Theorem 3.3 is the same as that of Lu & Steinerberger (2022). However, Theorem 3.3 analyzes the HardMax loss, whereas Lu & Steinerberger (2022) analyzed the CE loss.

\mathcal{GNC}1 and Within-class Variability Collapse.

To establish the within-class variability collapse property, we require a technical condition associated with the Softmax Code. Recall that Softmax Codes are those that maximize the minimum one-vs-rest distance over all classes. We introduce rattlers, which are classes that do not attain such a minimum.

Definition 3.4 (Rattler of Softmax Code).

Given positive integers d and K, a rattler associated with a Softmax Code {\bm{W}}^{\text{SC}}\in\operatorname{{\mathcal{O}B}}(d,K) is an index k_{\text{rattler}}\in[K] for which

\min_{k\in[K]}\operatorname{dist}({\bm{w}}_{k}^{\text{SC}},\{{\bm{w}}_{j}^{\text{SC}}\}_{j\in[K]\setminus k})\neq\operatorname{dist}({\bm{w}}^{\text{SC}}_{k_{\text{rattler}}},\{{\bm{w}}^{\text{SC}}_{j}\}_{j\in[K]\setminus k_{\text{rattler}}}). (8)

In other words, rattlers are points in a Softmax Code with no neighbors at the minimum one-vs-rest distance. This notion is borrowed from the literature on the Tammes problem (Cohn, 2022; Wang, 2009), which we will soon discuss in more detail. (Footnote: The occurrence of rattlers is rare: among the 182 pairs of (d,K) for which the solution to the Tammes problem is known, only 31 have rattlers (Cohn, 2022); this excludes the cases d=2 or K\leq 2d, where there are no rattlers. The occurrence of rattlers in Softmax Codes may be rare as well.)

We are now ready to present the main results for \mathcal{GNC}1.

Theorem 3.5 (\mathcal{GNC}1).

Let ({\bm{W}}^{\star},{\bm{H}}^{\star}) be an optimal solution to (5). For every k that is not a rattler of {\bm{W}}^{\star}, it holds that

\overline{{\bm{h}}}_{k}^{\star}\doteq{\bm{h}}_{k,1}^{\star}=\cdots={\bm{h}}_{k,n}^{\star}=\mathcal{P}_{\mathbb{S}^{d-1}}\left({\bm{w}}_{k}^{\star}-\mathcal{P}_{\{{\bm{w}}_{j}^{\star}\}_{j\in[K]\setminus k}}({\bm{w}}_{k}^{\star})\right), (9)

where 𝒫𝒲(𝐯)argmin𝐰conv(𝒲){𝐯𝐰2}\mathcal{P}_{\mathcal{W}}({\bm{v}})\doteq\operatorname*{arg\,min}_{{\bm{w}}\in\operatorname{conv}(\mathcal{W})}\{\|{\bm{v}}-{\bm{w}}\|_{2}\} denotes the projection of 𝐯{\bm{v}} on conv(𝒲)\operatorname{conv}(\mathcal{W}).

The following result shows that the requirement in Theorem 3.5 that k is not a rattler is satisfied in many cases.

Theorem 3.6.

If d=2 or K\leq d+1, the Softmax Code has no rattlers.

\mathcal{GNC}3 and Self-Duality.

To motivate our technical conditions for establishing self-duality, assume that any optimal solution ({\bm{W}}^{\star},{\bm{H}}^{\star}) to (5) satisfies self-duality as well as \mathcal{GNC}1. This implies that

\operatorname*{arg\,min}_{{\bm{W}}\in\operatorname{{\mathcal{O}B}}(d,K),{\bm{H}}\in\operatorname{{\mathcal{O}B}}(d,nK)}\mathcal{L}_{\text{HardMax}}({\bm{W}},{\bm{H}})=\operatorname*{arg\,min}_{{\bm{W}}\in\operatorname{{\mathcal{O}B}}(d,K)}\max_{k\in[K]}\max_{i\in[n]}\max_{k^{\prime}\neq k}\langle\bm{w}_{k^{\prime}}-\bm{w}_{k},\bm{w}_{k}\rangle. (10)

After simplification we may rewrite the optimization problem on the right hand side equivalently as:

\max_{{\bm{W}}\in\operatorname{{\mathcal{O}B}}(d,K)}\rho_{\textup{one-vs-one}}({\bm{W}}),\quad\text{where}\quad\rho_{\textup{one-vs-one}}({\bm{W}})\doteq\min_{k\in[K]}\min_{k^{\prime}\neq k}\operatorname{dist}({\bm{w}}_{k},{\bm{w}}_{k^{\prime}}). (11)

Eq. (11) is the well-known Tammes problem. Geometrically, it asks for a distribution of K points on the unit sphere of \mathbb{R}^{d} such that the minimum distance between any pair of points is maximized. The Tammes problem is unsolved in general, except for certain pairs (K,d).

Both the Tammes problem and the Softmax Code arrange points to be maximally separated on the unit sphere; they differ in the specific measure of separation. Comparing (11) and (3), the Tammes problem maximizes, over all k\in[K], the one-vs-one distance \min_{k^{\prime}\neq k}\operatorname{dist}({\bm{w}}_{k},{\bm{w}}_{k^{\prime}}), whereas the Softmax Code maximizes the minimum one-vs-rest distance \operatorname{dist}({\bm{w}}_{k},\{{\bm{w}}_{j}\}_{j\in[K]\setminus k}). Both the one-vs-one and one-vs-rest distances characterize the separation of the weight vector {\bm{w}}_{k} from \{{\bm{w}}_{j}\}_{j\in[K]\setminus k}. As illustrated in Figure 1 with k=1, the former is the distance between {\bm{w}}_{1} and its closest point in the set \{{\bm{w}}_{2},{\bm{w}}_{3},{\bm{w}}_{4}\}, in this case {\bm{w}}_{2} (see Figure 1(c)), whereas the latter is the minimal distance from {\bm{w}}_{1} to the convex hull of the remaining vectors \{{\bm{w}}_{2},{\bm{w}}_{3},{\bm{w}}_{4}\} (see Figure 1(b)); a small numerical sketch of the one-vs-one distance is given below.
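In contrast to the convex-hull projection required by the one-vs-rest distance of Definition 2.1, the one-vs-one distance in (11) needs only pairwise distances. A small NumPy sketch (our own helper, not from the paper) is given below.

```python
# Sketch of the one-vs-one (Tammes) distance in Eq. (11).
import numpy as np

def rho_one_vs_one(W):
    """W: (d, K) with unit-norm columns; returns min_{k != k'} ||w_k - w_k'||_2."""
    diff = W[:, :, None] - W[:, None, :]             # (d, K, K) pairwise differences
    dists = np.linalg.norm(diff, axis=0)             # (K, K) pairwise distances
    np.fill_diagonal(dists, np.inf)                  # ignore self-distances
    return dists.min()
```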

Since the Tammes problem can be derived from the self-duality constraint on the HardMax problem, it may not be surprising that the Tammes problem can be used to describe a condition for establishing self-duality. Specifically, we have the following result.

Theorem 3.7 (\mathcal{GNC}3).

For any K, d such that both the Tammes problem and the Softmax Code have no rattlers, the following two statements are equivalent:

  • Any optimal solution ({\bm{W}}^{\star},{\bm{H}}^{\star}) to (5) satisfies {\bm{h}}_{k,i}^{\star}={\bm{w}}_{k}^{\star} for all i\in[n] and k\in[K];

  • The Tammes problem and the Softmax Code are equivalent, i.e.,

    \operatorname*{arg\,max}_{{\bm{W}}\in\operatorname{{\mathcal{O}B}}(d,K)}\rho_{\textup{one-vs-rest}}({\bm{W}})=\operatorname*{arg\,max}_{{\bm{W}}\in\operatorname{{\mathcal{O}B}}(d,K)}\rho_{\textup{one-vs-one}}({\bm{W}}). (12)

In words, Theorem 3.7 states that \mathcal{GNC}3 holds if and only if the Tammes problem in (11) and the Softmax Code are equivalent. As both the Tammes problem and the Softmax Code maximize the separation between each vector and the others, albeit with different notions of separation, we conjecture that they are equivalent and share the same optimal solutions. We prove this conjecture for some special cases and leave the study of the general case as future work.

Theorem 3.8.

If d=2 or K\leq d+1, the Tammes problem and the Softmax Code are equivalent.

3.3 Insights for Choosing the Feature Dimension d Given the Class Number K

Given a class number K, how does the choice of feature dimension d affect the model performance? Intuitively, a smaller d reduces the separability between classes in a Softmax Code. We make this rigorous by providing bounds on the one-vs-rest distance of a Softmax Code in terms of d and K.

Theorem 3.9.

Assuming K\geq\sqrt{2\pi\sqrt{e}d} and letting \Gamma(\cdot) denote the Gamma function, we have

\frac{1}{2}\biggr{[}\frac{\sqrt{\pi}}{K}\frac{\Gamma\left(\frac{d+1}{2}\right)}{\Gamma\left(\frac{d}{2}+1\right)}\biggr{]}^{\frac{2}{d-1}}\leq\max_{{\bm{W}}\in\operatorname{{\mathcal{O}B}}(d,K)}\rho_{\textup{one-vs-rest}}({\bm{W}})\leq 2\biggr{[}\frac{2\sqrt{\pi}}{K}\frac{\Gamma\left(\frac{d+1}{2}\right)}{\Gamma\left(\frac{d}{2}\right)}\biggr{]}^{\frac{1}{d-1}}. (13)
Figure 3: Effect of the feature dimension d on (left y-axis) \rho_{\textup{one-vs-rest}}({\bm{W}}^{\star}) and its upper/lower bounds (from Theorem 3.9), and (right y-axis) training and test accuracies for a ResNet-50 on ImageNet.

The bounds characterize the separability of K classes in d-dimensional space. Given the number of classes K and a desired margin \rho, the minimal feature dimension is roughly of order \log(K^{2}/\rho), showing that classes separate easily in higher dimensions. This also provides a justification for applications such as face classification and self-supervised learning, where the number of classes (e.g., millions) can be significantly larger than the dimensionality of the features (e.g., d=512). The bounds of Theorem 3.9 can be evaluated numerically, as sketched below.
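The bounds in (13) can be evaluated with log-Gamma functions for numerical stability, which is useful when choosing d for a given K. The sketch below uses our own helper name and is only an illustration of Theorem 3.9, not the authors' code.

```python
# Sketch evaluating the lower/upper bounds of Theorem 3.9 via log-Gamma functions.
import numpy as np
from scipy.special import gammaln

def softmax_code_bounds(d, K):
    log_ratio_low = gammaln((d + 1) / 2) - gammaln(d / 2 + 1)   # Gamma((d+1)/2) / Gamma(d/2+1)
    log_ratio_up = gammaln((d + 1) / 2) - gammaln(d / 2)        # Gamma((d+1)/2) / Gamma(d/2)
    lower = 0.5 * np.exp((2 / (d - 1)) * (0.5 * np.log(np.pi) - np.log(K) + log_ratio_low))
    upper = 2.0 * np.exp((1 / (d - 1)) * (np.log(2 * np.sqrt(np.pi)) - np.log(K) + log_ratio_up))
    return lower, upper

# e.g., K = 1000 classes: the bounds decay only slowly as d shrinks
for d in (16, 64, 256, 512):
    print(d, softmax_code_bounds(d, 1000))
```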

By conducting experiments with a ResNet-50 with varying feature dimensions on ImageNet classification, we further corroborate the relationship between feature dimension and network performance in Figure 3. First, we observe that the curve of the optimal distance closely tracks the curve of test performance, indicating a strong correlation between the distance and test accuracy. Moreover, both the distance and performance curves exhibit only a slow (exponential) decrease as the feature dimension d decreases, which is consistent with the bounds in Theorem 3.9.

4 The Assignment Problem: An Empirical Study

Unlike the case d\geq K-1, where the optimal classifier (a simplex ETF) has equal angles between any pair of classifier weights, when d<K-1 not all pairs of classifier weights are equally distant under the optimal {\bm{W}} (a Softmax Code) predicted by Theorem 3.2. Consequently, this leads to a “class assignment” problem. To illustrate this, we train a ResNet18 network with d=2 on the four classes {Automobile, Cat, Dog, Truck} from the CIFAR10 dataset, selected for their clear semantic similarities and discrepancies. In this case, according to Theorem 3.3, the optimal classifiers are given by [1,0], [-1,0], [0,1], [0,-1], up to a rotation. Consequently, there are three distinct class assignments, as illustrated in Figures 4(b), 4(c) and 4(d).

Under standard training, the classifier consistently converges to the configuration in which Cat and Dog are closer together, across 5 different trials; Figure 4(a) shows the learned features (dots) and classifier weights (arrows) in one such trial. This demonstrates an implicit algorithmic regularization in training DNNs, which naturally attracts (semantically) similar classes and separates dissimilar ones.

We also conduct experiments with the classifier fixed to each of the three arrangements, and present the results in Figures 4(b), 4(c) and 4(d). Among them, the arrangement in which Cat and Dog are far apart achieves a test accuracy of 89.95%, lower than the other two cases with test accuracies of 91.90% and 92.13%. This demonstrates the important role of the class assignment in the generalization of DNNs, and that the implicit bias of the learned classifier is benign, i.e., it leads to more generalizable solutions.

Figure 4: Assignment of classes to classifier weights for a ResNet18 with a 2-dimensional feature space trained on the 4 classes {Automobile, Cat, Dog, Truck} from CIFAR10. (a) Learned classifier (91.45%). (b-d) Classifiers fixed to three different assignments: A1 (91.90%), A2 (92.13%), and A3 (89.95%). Test accuracy is reported in parentheses.

5 Implications for Practical Network Training/Fine-tuning

Since the classifier always converges to a simplex ETF when K\leq d+1, prior work proposes fixing the classifier as a simplex ETF to reduce the training cost (Zhu et al., 2021) and to handle imbalanced datasets (Yang et al., 2022). When K>d+1, the optimal classifier is likewise known, being a Softmax Code according to \mathcal{GNC}2. However, the same strategy as in prior work may become sub-optimal due to the class assignment problem (see Section 4). To address this, we introduce the class-mean features (CMF) classifier, in which the classifier weights are set to the exponential moving average of the mini-batch class-mean features during the training process; a sketch is given below. This approach is motivated by \mathcal{GNC}3, which states that the optimal classifier converges to the class-mean features. We explain the details of CMF in Appendix B. As in prior work, CMF also reduces the number of trainable parameters; for instance, it removes 30.91% of the total parameters of a ResNet18 for the BUPT-CBFace-50 dataset (Zhang & Deng, 2020). Here, we compare CMF with standard training, where the classifier is learned together with the feature mapping, in both training from scratch and fine-tuning.
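A minimal PyTorch-style sketch of a CMF classifier is given below. The exact momentum, initialization, and update schedule are our own assumptions for illustration; the authors' implementation details are given in Appendix B.

```python
# Sketch of a CMF classifier: non-trainable weights updated as an EMA of class means.
import torch
import torch.nn.functional as F

class CMFClassifier(torch.nn.Module):
    def __init__(self, num_classes, feat_dim, momentum=0.9, tau=1/30):
        super().__init__()
        # classifier weights are a buffer, i.e., not trainable parameters
        self.register_buffer("weight", F.normalize(torch.randn(num_classes, feat_dim), dim=1))
        self.momentum, self.tau = momentum, tau

    @torch.no_grad()
    def update(self, features, labels):
        """EMA update with the mini-batch class-mean features of the classes present."""
        h = F.normalize(features, dim=1)
        for k in labels.unique():
            mean_k = h[labels == k].mean(dim=0)
            new_w = self.momentum * self.weight[k] + (1 - self.momentum) * mean_k
            self.weight[k] = F.normalize(new_w, dim=0)   # keep the weight on the sphere

    def forward(self, features):
        h = F.normalize(features, dim=1)
        return h @ self.weight.t() / self.tau            # logits fed to the CE loss
```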

Training from Scratch.

We train a ResNet18 on CIFAR100 using either a learnable classifier or the CMF classifier. The learning curves in Figure 5 indicate that the CMF classifier achieves performance comparable to the classical training protocol.

Figure 5: Comparison of the learning curves (training and testing accuracies) with learned classifiers vs. CMF classifiers for (a) ResNet18, (b) DenseNet121, and (c) ResNeXt50 trained on the CIFAR100 dataset with d=10.
Fine-tuning.

To verify the effectiveness of the CMF classifier for fine-tuning, we follow the setting in Kumar et al. (2022) and measure the performance of the fine-tuned model on both an in-distribution (ID) task (CIFAR10, Krizhevsky (2009)) and an OOD task (STL10, Coates et al. (2011)). We compare the standard approach, which fine-tunes both the classifier (randomly initialized) and the pre-trained feature mapping, with our approach using the CMF classifier. Our experiments show that the CMF classifier achieves slightly better ID accuracy (98.00% vs. 97.00%) and better OOD performance (90.67% vs. 87.42%). The improvement in OOD performance stems from keeping the classifier aligned with the class means throughout the entire process, which better preserves the OOD properties of the pre-trained model. Our approach also simplifies the two-stage approach of linear probing followed by full fine-tuning in Kumar et al. (2022).

6 Conclusion

In this work, we have introduced generalized neural collapse (\mathcal{GNC}) for characterizing the learned last-layer features and classifiers of DNNs with an arbitrary number of classes and feature dimension. We empirically validated the \mathcal{GNC} phenomenon on practical DNNs trained with a small temperature in the CE loss and subject to spherical constraints on the features and classifiers. Building upon the unconstrained features model, we proved that \mathcal{GNC} holds under certain technical conditions. \mathcal{GNC} offers valuable insights for the design, training, and generalization of DNNs. For example, the minimal one-vs-rest distance provides guidance for choosing the feature dimension when dealing with a large number of classes. Additionally, we have leveraged \mathcal{GNC} to enhance training efficiency and fine-tuning performance by fixing the classifier to the class-mean features. Further exploration of \mathcal{GNC} in other scenarios, such as imbalanced learning, is left for future work. It is also of interest to further study the problem of optimally assigning the classifier weights of a Softmax Code to the classes, which could shed light on developing techniques for better classification performance.

Acknowledgment

JJ, JZ, and ZZ acknowledge support from NSF grants CCF-2240708 and IIS-2312840. PW and QQ acknowledge support from NSF CAREER CCF-2143904, NSF CCF-2212066, NSF CCF-2212326, NSF IIS 2312842, ONR N00014-22-1-2529, an AWS AI Award, a gift grant from KLA, and MICDE Catalyst Grant. We thank CloudBank (supported by NSF under Award #1925001) for providing the computational resources.

References

  • Boyd & Vandenberghe (2004) Stephen P Boyd and Lieven Vandenberghe. Convex optimization. Cambridge university press, 2004.
  • Braides (2006) Andrea Braides. A handbook of Γ-convergence. In Handbook of Differential Equations: stationary partial differential equations, volume 3, pp. 101–213. Elsevier, 2006.
  • Cao et al. (2019) Kaidi Cao, Colin Wei, Adrien Gaidon, Nikos Arechiga, and Tengyu Ma. Learning imbalanced datasets with label-distribution-aware margin loss. Advances in neural information processing systems, 32, 2019.
  • Carathéodory (1911) Constantin Carathéodory. Über den variabilitätsbereich der fourier’schen konstanten von positiven harmonischen funktionen. Rendiconti Del Circolo Matematico di Palermo (1884-1940), 32(1):193–217, 1911.
  • Chan et al. (2022) Kwan Ho Ryan Chan, Yaodong Yu, Chong You, Haozhi Qi, John Wright, and Yi Ma. Redunet: A white-box deep network from the principle of maximizing rate reduction. The Journal of Machine Learning Research, 23(1):4907–5009, 2022.
  • Chang et al. (2019) Wei-Cheng Chang, X Yu Felix, Yin-Wen Chang, Yiming Yang, and Sanjiv Kumar. Pre-training tasks for embedding-based large-scale retrieval. In International Conference on Learning Representations, 2019.
  • Chen et al. (2020a) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International conference on machine learning, pp. 1597–1607. PMLR, 2020a.
  • Chen & He (2021) Xinlei Chen and Kaiming He. Exploring simple siamese representation learning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  15750–15758, 2021.
  • Chen et al. (2020b) Xinlei Chen, Haoqi Fan, Ross Girshick, and Kaiming He. Improved baselines with momentum contrastive learning, 2020b.
  • Coates et al. (2011) Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Geoffrey Gordon, David Dunson, and Miroslav Dudík (eds.), Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, volume 15 of Proceedings of Machine Learning Research, pp.  215–223, Fort Lauderdale, FL, USA, 11–13 Apr 2011. PMLR. URL https://proceedings.mlr.press/v15/coates11a.html.
  • Cohn (2022) Henry Cohn. Small spherical and projective codes. 2022.
  • Deng et al. (2019) Jiankang Deng, Jia Guo, Niannan Xue, and Stefanos Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  4690–4699, 2019.
  • Devlin et al. (2018) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • Fang et al. (2021) Cong Fang, Hangfeng He, Qi Long, and Weijie J Su. Exploring deep neural networks via layer-peeled model: Minority collapse in imbalanced training. Proceedings of the National Academy of Sciences, 118(43):e2103091118, 2021.
  • Fickus et al. (2017) Matthew Fickus, John Jasper, Dustin G Mixon, and Cody E Watson. A brief introduction to equi-chordal and equi-isoclinic tight fusion frames. In Wavelets and Sparsity XVII, volume 10394, pp.  186–194. SPIE, 2017.
  • Galanti et al. (2022a) Tomer Galanti, András György, and Marcus Hutter. Generalization bounds for transfer learning with pretrained classifiers. arXiv preprint arXiv:2212.12532, 2022a.
  • Galanti et al. (2022b) Tomer Galanti, András György, and Marcus Hutter. On the role of neural collapse in transfer learning. In International Conference on Learning Representations, 2022b.
  • Gao et al. (2023) Peifeng Gao, Qianqian Xu, Peisong Wen, Huiyang Shao, Zhiyong Yang, and Qingming Huang. A study of neural collapse phenomenon: Grassmannian frame, symmetry, generalization, 2023.
  • Han et al. XY Han, Vardan Papyan, and David L Donoho. Neural collapse under mse loss: Proximity to and dynamics on the central path. In International Conference on Learning Representations.
  • He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  770–778, 2016.
  • Ji et al. Wenlong Ji, Yiping Lu, Yiliang Zhang, Zhun Deng, and Weijie J Su. An unconstrained layer-peeled perspective on neural collapse. In International Conference on Learning Representations.
  • Ji et al. (2021) Wenlong Ji, Yiping Lu, Yiliang Zhang, Zhun Deng, and Weijie J Su. An unconstrained layer-peeled perspective on neural collapse. arXiv preprint arXiv:2110.02796, 2021.
  • Krizhevsky (2009) Alex Krizhevsky. Learning multiple layers of features from tiny images. pp.  32–33, 2009. URL https://www.cs.toronto.edu/~kriz/learning-features-2009-TR.pdf.
  • Kumar et al. (2022) Ananya Kumar, Aditi Raghunathan, Robbie Jones, Tengyu Ma, and Percy Liang. Fine-tuning can distort pretrained features and underperform out-of-distribution, 2022.
  • Li et al. (2022) Xiao Li, Sheng Liu, Jinxin Zhou, Xinyu Lu, Carlos Fernandez-Granda, Zhihui Zhu, and Qing Qu. Principled and efficient transfer learning of deep models via neural collapse. arXiv preprint arXiv:2212.12206, 2022.
  • Liu et al. (2023a) Weiyang Liu, Longhui Yu, Adrian Weller, and Bernhard Schölkopf. Generalizing and decoupling neural collapse via hyperspherical uniformity gap. arXiv preprint arXiv:2303.06484, 2023a.
  • Liu et al. (2023b) Xuantong Liu, Jianfeng Zhang, Tianyang Hu, He Cao, Yuan Yao, and Lujia Pan. Inducing neural collapse in deep long-tailed learning. In International Conference on Artificial Intelligence and Statistics, pp.  11534–11544. PMLR, 2023b.
  • Lu & Steinerberger (2020) Jianfeng Lu and Stefan Steinerberger. Neural collapse with cross-entropy loss. arXiv preprint arXiv:2012.08465, 2020.
  • Lu & Steinerberger (2022) Jianfeng Lu and Stefan Steinerberger. Neural collapse under cross-entropy loss. Applied and Computational Harmonic Analysis, 59:224–241, 2022.
  • Mitra et al. (2018) Bhaskar Mitra, Nick Craswell, et al. An introduction to neural information retrieval. Foundations and Trends® in Information Retrieval, 13(1):1–126, 2018.
  • Mixon et al. (2020) Dustin G. Mixon, Hans Parshall, and Jianzong Pi. Neural collapse with unconstrained features, 2020.
  • Moore (1974) Michael H. Moore. Vector packing in finite dimensional vector spaces. Linear Algebra and its Applications, 8(3):213–224, 1974. ISSN 0024-3795. doi: https://doi.org/10.1016/0024-3795(74)90067-6. URL https://www.sciencedirect.com/science/article/pii/0024379574900676.
  • Murphy (2022) Kevin P Murphy. Probabilistic machine learning: an introduction. MIT press, 2022.
  • Nguyen et al. (2022) Duc Anh Nguyen, Ron Levie, Julian Lienen, Gitta Kutyniok, and Eyke Hüllermeier. Memorization-dilation: Modeling neural collapse under noise. arXiv preprint arXiv:2206.05530, 2022.
  • Papyan (2020) Vardan Papyan. Traces of class/cross-class structure pervade deep learning spectra. Journal of Machine Learning Research, 21(252):1–64, 2020.
  • Papyan et al. (2020) Vardan Papyan, XY Han, and David L Donoho. Prevalence of neural collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences, 117(40):24652–24663, 2020.
  • Poggio & Liao (2020) Tomaso Poggio and Qianli Liao. Explicit regularization and implicit bias in deep network classifiers trained with the square loss. arXiv preprint arXiv:2101.00072, 2020.
  • Raffel et al. (2020) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
  • Rangamani et al. (2023) Akshay Rangamani, Marius Lindegaard, Tomer Galanti, and Tomaso A Poggio. Feature learning in deep classifiers through intermediate neural collapse. In International Conference on Machine Learning, pp. 28729–28745. PMLR, 2023.
  • Rockafellar & Wets (2009) R Tyrrell Rockafellar and Roger J-B Wets. Variational analysis, volume 317. Springer Science & Business Media, 2009.
  • Tammes (1930) Pieter Merkus Lambertus Tammes. On the origin of number and arrangement of the places of exit on the surface of pollen-grains. Recueil des travaux botaniques néerlandais, 27(1):1–84, 1930.
  • Thrampoulidis et al. (2022) Christos Thrampoulidis, Ganesh Ramachandra Kini, Vala Vakilian, and Tina Behnia. Imbalance trouble: Revisiting neural-collapse geometry. Advances in Neural Information Processing Systems, 35:27225–27238, 2022.
  • Tirer & Bruna (2022) Tom Tirer and Joan Bruna. Extended unconstrained features model for exploring deep neural collapse. In International Conference on Machine Learning, pp. 21478–21505. PMLR, 2022.
  • Tirer et al. (2023) Tom Tirer, Haoxiang Huang, and Jonathan Niles-Weed. Perturbation analysis of neural collapse. In International Conference on Machine Learning, pp. 34301–34329. PMLR, 2023.
  • Wang et al. (2018) Hao Wang, Yitong Wang, Zheng Zhou, Xing Ji, Dihong Gong, Jingchao Zhou, Zhifeng Li, and Wei Liu. Cosface: Large margin cosine loss for deep face recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  5265–5274, 2018.
  • Wang (2009) Jeffrey Wang. Finding and investigating exact spherical codes. Experimental Mathematics, 18(2):249–256, 2009.
  • Wojtowytsch et al. (2020) Stephan Wojtowytsch et al. On the emergence of simplex symmetry in the final and penultimate layers of neural network classifiers. arXiv preprint arXiv:2012.05420, 2020.
  • Xie et al. (2023) Liang Xie, Yibo Yang, Deng Cai, and Xiaofei He. Neural collapse inspired attraction-repulsion-balanced loss for imbalanced learning. Neurocomputing, 2023.
  • Xie et al. (2022) Shuo Xie, Jiahao Qiu, Ankita Pasad, Li Du, Qing Qu, and Hongyuan Mei. Hidden state variability of pretrained language models can guide computation reduction for transfer learning. arXiv preprint arXiv:2210.10041, 2022.
  • Yang et al. (2022) Yibo Yang, Liang Xie, Shixiang Chen, Xiangtai Li, Zhouchen Lin, and Dacheng Tao. Do we really need a learnable classifier at the end of deep neural network? arXiv preprint arXiv:2203.09081, 2022.
  • Yang et al. (2023) Yibo Yang, Haobo Yuan, Xiangtai Li, Zhouchen Lin, Philip Torr, and Dacheng Tao. Neural collapse inspired feature-classifier alignment for few-shot class incremental learning. arXiv preprint arXiv:2302.03004, 2023.
  • Yaras et al. Can Yaras, Peng Wang, Zhihui Zhu, Laura Balzano, and Qing Qu. Neural collapse with normalized features: A geometric analysis over the riemannian manifold. In Advances in Neural Information Processing Systems.
  • Yaras et al. (2023) Can Yaras, Peng Wang, Zhihui Zhu, Laura Balzano, and Qing Qu. Neural collapse with normalized features: A geometric analysis over the riemannian manifold, 2023.
  • You (2018) Chong You. Sparse methods for learning multiple subspaces from large-scale, corrupted and imbalanced data. PhD thesis, Johns Hopkins University, 2018.
  • Yu et al. (2022) Longhui Yu, Tianyang Hu, Lanqing Hong, Zhen Liu, Adrian Weller, and Weiyang Liu. Continual learning by modeling intra-class variation. arXiv preprint arXiv:2210.05398, 2022.
  • Yu et al. (2020) Yaodong Yu, Kwan Ho Ryan Chan, Chong You, Chaobing Song, and Yi Ma. Learning diverse and discriminative representations via the principle of maximal coding rate reduction. Advances in Neural Information Processing Systems, 33:9422–9434, 2020.
  • Zhang & Deng (2020) Yaobin Zhang and Weihong Deng. Class-balanced training for deep face recognition. In Proceedings of the ieee/cvf conference on computer vision and pattern recognition workshops, pp.  824–825, 2020.
  • Zhou et al. Jinxin Zhou, Chong You, Xiao Li, Kangning Liu, Sheng Liu, Qing Qu, and Zhihui Zhu. Are all losses created equal: A neural collapse perspective. In Advances in Neural Information Processing Systems.
  • Zhou et al. (2022a) Jinxin Zhou, Xiao Li, Tianyu Ding, Chong You, Qing Qu, and Zhihui Zhu. On the optimization landscape of neural collapse under mse loss: Global optimality with unconstrained features. In International Conference on Machine Learning, pp. 27179–27202. PMLR, 2022a.
  • Zhou et al. (2022b) Xiong Zhou, Xianming Liu, Deming Zhai, Junjun Jiang, Xin Gao, and Xiangyang Ji. Learning towards the largest margins. arXiv preprint arXiv:2206.11589, 2022b.
  • Zhu et al. (2021) Zhihui Zhu, Tianyu Ding, Jinxin Zhou, Xiao Li, Chong You, Jeremias Sulam, and Qing Qu. A geometric analysis of neural collapse with unconstrained features. Advances in Neural Information Processing Systems, 34:29820–29834, 2021.

Part Appendix


Appendix A Organization and Basics

The appendix is organized as follows. We first introduce basic results that are used throughout the appendices. In Appendix B, we provide details on the datasets and computational resources used for each figure, together with additional experimental results on practical datasets. Finally, in Appendix C, we provide proofs of all the theorems stated in Section 3.

Lemma A.1 (LogSumExp).

Let 𝒙 = [x_1, x_2, ⋯, x_n] ∈ ℝ^n and define the LogSumExp function LSE(𝒙, τ) = log(∑_{i=1}^n exp(x_i/τ)). As τ → 0, we have

τLSE(𝒙,τ)maxixi\displaystyle\tau\text{LSE}({\bm{x}},\tau)\rightarrow\max_{i}x_{i}

In other words, the LogSumExp function is a smooth approximation to the maximum function.

Proof of Lemma A.1.

According to the definition of the LogSumExp function, we have

maxixiτLSE(𝒙,τ)maxixi+τlogn\displaystyle\max_{i}x_{i}\leq\tau\text{LSE}({\bm{x}},\tau)\leq\max_{i}x_{i}+\tau\log n

where the inequalities follow from exp(max_i x_i/τ) ≤ ∑_{i=1}^n exp(x_i/τ) ≤ n·exp(max_i x_i/τ); the first inequality is strict unless n = 1, and the second holds with equality only when x_1 = x_2 = ⋯ = x_n. Therefore, as τ → 0, we have τLSE(𝒙, τ) → max_i x_i. ∎
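To illustrate the lemma numerically, the following short Python snippet (our own sketch, not part of the original experiments) evaluates τLSE(𝒙, τ) in a numerically stable way and shows that it approaches max_i x_i as τ decreases.

```python
import numpy as np

def tau_lse(x, tau):
    # Numerically stable evaluation of tau * LSE(x, tau) = tau * log(sum_i exp(x_i / tau)).
    m = np.max(x)
    return m + tau * np.log(np.sum(np.exp((x - m) / tau)))

x = np.array([0.3, -1.2, 0.9, 0.1])
for tau in [1.0, 0.1, 0.01]:
    print(tau, tau_lse(x, tau))  # approaches max(x) = 0.9 as tau -> 0
```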

Theorem A.2 (Carathéodory’s theorem Carathéodory (1911)).

Given 𝐰1,,𝐰Kd{\bm{w}}_{1},\ldots,{\bm{w}}_{K}\in\mathbb{R}^{d}, if a point 𝐯{\bm{v}} lies in the convex hull conv({𝐰1,,𝐰K})\operatorname{conv}(\{{\bm{w}}_{1},\ldots,{\bm{w}}_{K}\}) of {𝐰1,,𝐰K}\{{\bm{w}}_{1},\ldots,{\bm{w}}_{K}\}, then 𝐯{\bm{v}} resides in the convex hull of at most d+1d+1 of the points in {𝐰1,,𝐰K}\{{\bm{w}}_{1},\ldots,{\bm{w}}_{K}\}.

Appendix B Experiments

In this section, we first provide additional details regarding the datasets and the computational resources used in the paper. Specifically, the CIFAR10, CIFAR100, and BUPT-CBFace datasets are publicly available for academic purposes under the MIT license, and all experiments were conducted on four V100 GPUs with 32GB of memory each. We then present supplementary experiments and implementation details for each figure.

B.1 Implementation details

We present implementation details for results in the paper.

Results in Figure 2.

To illustrate the occurrence of the 𝒢𝒩𝒞 phenomenon in practical multi-class classification problems, we trained a ResNet18 network (He et al., 2016) on the CIFAR100 dataset (Krizhevsky, 2009) using the CE loss with varying temperature parameters. In this experiment, we set the dimension of the last-layer features to 10, which is achieved by setting the number of channels in the second convolutional layer of the last residual block before the classifier layer to 10. Prior to training, we applied standard preprocessing, normalizing the images channel-wise by their mean and standard deviation, and we employed standard data augmentation. For optimization, we used SGD with a momentum of 0.9 and an initial learning rate of 0.1, decayed according to a cosine annealing schedule over 200 epochs.
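For concreteness, a minimal PyTorch sketch of the temperature-scaled CE loss used above is given below; the explicit normalization of the features and classifier weights reflects the spherical constraint in (2), and the function name and interface are ours.

```python
import torch
import torch.nn.functional as F

def temperature_ce_loss(features, weights, labels, tau=0.1):
    """CE loss with temperature tau on unit-normalized features and classifier weights."""
    h = F.normalize(features, dim=1)   # (B, d) last-layer features on the unit sphere
    w = F.normalize(weights, dim=1)    # (K, d) classifier weights on the unit sphere
    logits = h @ w.t() / tau           # (B, K) cosine similarities scaled by 1/tau
    return F.cross_entropy(logits, labels)
```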

The optimal margin (represented by the dotted line) in the second subfigure from the left is obtained by numerically optimizing the Softmax Code problem. In this optimization, we used an initial learning rate of 0.1 and decreased it by a factor of 10 every 1000 iterations, for a total of 5000 iterations.
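As a rough illustration of how this numerical optimization may be carried out, the sketch below jointly optimizes unit-norm classifier weights and one unit-norm feature per class under a temperature-scaled CE surrogate, projecting back onto the sphere after each step and following the step-decayed learning-rate schedule described above; the exact parameterization and optimizer behind the dotted line are not specified here, so this is an assumption-laden sketch rather than the precise procedure.

```python
import torch
import torch.nn.functional as F

def softmax_code_surrogate(K, d, tau=0.02, lr=0.1, iters=5000, seed=0):
    # Minimize a temperature-scaled CE surrogate of the HardMax problem over
    # unit-norm classifier weights W (K, d) and one unit-norm feature per class H (K, d).
    torch.manual_seed(seed)
    W = F.normalize(torch.randn(K, d), dim=1).requires_grad_()
    H = F.normalize(torch.randn(K, d), dim=1).requires_grad_()
    labels = torch.arange(K)
    opt = torch.optim.SGD([W, H], lr=lr, momentum=0.9)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=1000, gamma=0.1)
    for _ in range(iters):
        loss = F.cross_entropy(H @ W.t() / tau, labels)
        opt.zero_grad()
        loss.backward()
        opt.step()
        sched.step()
        with torch.no_grad():  # project back onto the unit sphere
            W.copy_(F.normalize(W, dim=1))
            H.copy_(F.normalize(H, dim=1))
    return W.detach(), H.detach()
```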

Results in Figure 4.

To demonstrate the implicit algorithmic regularization in training DNNs, we train a ResNet18 network on the four classes {Automobile, Cat, Dog, Truck} from the CIFAR10 dataset. We set the dimension of the last-layer features to 2 so that the learned features can be visualized, and we used a temperature parameter of 0.05. We optimized the network for a total of 800 epochs using SGD with a momentum of 0.9 and an initial learning rate of 0.1, decreased by a factor of 10 every 200 epochs.

Results in Figure 5.

To assess the effectiveness of the proposed CMF (Class-Mean Feature) method, we used the ResNet18 architecture (He et al., 2015) as the feature encoder on the CIFAR100 dataset (Krizhevsky, 2009), where we set the feature dimension to d = 20 and the temperature parameter to τ = 0.1. Since the model can only access a mini-batch of the dataset in each iteration, it is computationally prohibitive to calculate the class-mean features of the entire dataset. As a solution, we update the classifier with an exponential moving average of the mini-batch class-mean features, 𝑾^(t+1) ← β𝑾^(t) + (1 − β)𝑯̄^(t). Here, 𝑯̄^(t) ∈ ℝ^{K×d} denotes the class-mean features of the mini-batch at iteration t, 𝑾 ∈ ℝ^{K×d} denotes the classifier weights, and β ∈ [0, 1) denotes the momentum coefficient (in our experiment, β = 0.9). We applied the same data preprocessing and augmentation as above and used an initial learning rate of 0.1 with a cosine annealing scheduler.
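A minimal PyTorch sketch of this EMA update is given below. It is our own illustration; in particular, updating only the classes that appear in the current mini-batch is an assumption rather than a detail stated above.

```python
import torch

@torch.no_grad()
def ema_update_classifier(W, features, labels, beta=0.9):
    # W: (K, d) classifier weights; features: (B, d) mini-batch features; labels: (B,) class indices.
    K = W.shape[0]
    H_bar = torch.zeros_like(W)
    counts = torch.zeros(K, device=features.device, dtype=features.dtype)
    H_bar.index_add_(0, labels, features)                          # sum of features per class
    counts.index_add_(0, labels, torch.ones_like(labels, dtype=features.dtype))
    present = counts > 0                                           # classes appearing in this batch (assumption)
    H_bar[present] /= counts[present].unsqueeze(1)                 # mini-batch class-mean features
    W[present] = beta * W[present] + (1 - beta) * H_bar[present]   # W <- beta * W + (1 - beta) * H_bar
    return W
```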

Results in the fine-tuning experiment of Section 5.

We fine-tune a ResNet50 pretrained with MoCo v2 (Chen et al., 2020b) on CIFAR10 (Krizhevsky, 2009). The CIFAR10 test set serves as the in-distribution (ID) task, while the STL10 dataset (Coates et al., 2011) serves as the out-of-distribution (OOD) task. To preprocess the data, we resize the images to 224×224 using bicubic interpolation and normalize them by their mean and standard deviation. Since there is no “monkey” class in CIFAR10, we remove the “monkey” class from STL10, and we relabel the STL10 samples with the corresponding CIFAR10 labels so that they match the classifier output. For optimization, we use the Adam optimizer with a learning rate of 1e-5 and a cosine annealing scheduler. The models are fine-tuned for 5 epochs with a batch size of 100. We report the best ID test accuracy achieved during fine-tuning, together with the corresponding OOD accuracy.

Results in Figure B.1.

To visualize the different structures learned under weight decay versus the spherical constraint, we conducted experiments in both the practical and the unconstrained feature model settings. In the practical setting, we trained a ResNet18 network (He et al., 2016) on the first 30 classes of the CIFAR100 dataset (Krizhevsky, 2009) using either weight decay or the spherical constraint with the cross-entropy (CE) loss. For visualization purposes, we set the dimension of the last-layer features to 2, and for the spherical constraint we used a temperature parameter of 70. Prior to training, we applied standard preprocessing, normalizing the images channel-wise by their mean and standard deviation, as well as standard data augmentation. We optimized the networks for a total of 1600 epochs using SGD with a momentum of 0.9 and an initial learning rate of 0.1, decreased by a factor of 10 every 700 epochs. For the unconstrained feature model setting, we considered only one sample per class and treated the features (class-mean features) and weights as free optimization variables. In this case, we initialized the learning rate at 0.1 and decreased it by a factor of 10 every 1000 iterations, for a total of 5000 iterations. The temperature parameter was set to 0.02 for the spherical constraint.

Results in Figure B.6.

To demonstrate the implicit algorithmic regularization in training DNNs, we train a ResNet18 network on the six classes {Automobile, Cat, Deer, Dog, Horse, Truck} from the CIFAR10 dataset. We set the dimension of the last-layer features to 2 so that the learned features can be visualized, and we used a temperature parameter of 0.2. We optimized the network for a total of 600 epochs using SGD with a momentum of 0.9 and an initial learning rate of 0.1, decreased by a factor of 10 every 200 epochs.

B.2 Effect of regularization: Spherical constraint vs. weight decay

(a) CIFAR100 (WD) (b) CIFAR100 (SC) (c) UFM (WD) (d) UFM (SC)
Figure B.1: Comparison of classifier weights and last-layer features with weight decay (WD) vs. spherical constraint (SC) on features and classifiers. (a) vs. (b): results with a ResNet18 trained on CIFAR100. (c) vs. (d): results with the unconstrained feature model in (B.1) and (2), respectively. In all settings, we set K = 30 and d = 2 for visualization; the first K = 30 classes from the CIFAR100 dataset are used to train the ResNet18.

We study the features and classifiers of networks trained with weight decay for the case K > d + 1, and compare them with those obtained with the spherical constraint. For visualization, we train a ResNet18 with d = 2 on the first K = 30 classes of CIFAR100 and display the learned features and classifiers in Figure B.1(a). Below we summarize our observations from Figure B.1(a).

  • Non-equal length and non-uniform distribution. The class-mean features {𝒉̄_k} do not have equal lengths, and they do not appear to be equally spaced when normalized to the unit sphere.

  • Self-duality only in direction. Each classifier weight points in the same direction as its corresponding class-mean feature, but their lengths differ: as shown in Figure B.2, the ratio between the length of the classifier weight and that of the class-mean feature varies across classes, so no global rescaling can align them exactly.

In contrast, using the spherical constraint produces equal-length and uniformly distributed features, as well as classifier weights that are aligned with the class-mean features not only in direction but also in length; see Figure B.1(b). This result is consistent with 𝒢𝒩𝒞. It may explain the common practice of feature normalization in applications with an extremely large number of classes (Chen & He, 2021; Wang et al., 2018), and it justifies our study of networks trained with the spherical constraint in (2) rather than with weight decay as in Liu et al. (2023a).

We note that the discrepancy between the weight decay and spherical constraint approaches is not due to insufficient expressiveness of the networks. In fact, we also observe different behaviors in the unconstrained feature model with the spherical constraint as in (2) and with the following regularized form (Zhu et al., 2021; Mixon et al., 2020; Tirer & Bruna, 2022; Zhou et al., 2022a):

min𝑾,𝑯1nKk=1Ki=1nCE(𝑾𝒉k,i,𝒚k,τ)+λ2(𝑾F2+𝑯F2),\min_{{\bm{W}},{\bm{H}}}\frac{1}{nK}\sum_{k=1}^{K}\sum_{i=1}^{n}\mathcal{L}_{\text{CE}}\left({\bm{W}}^{\top}{\bm{h}}_{k,i},{\bm{y}}_{k},\tau\right)+\frac{\lambda}{2}(\left\|{\bm{W}}\right\|_{F}^{2}+\left\|{\bm{H}}\right\|_{F}^{2}), (B.1)

where λ denotes the weight decay parameter. In Figure B.1(c, d) we observe phenomena similar to those in Figure B.1(a, b): the weight decay formulation results in features with non-equal lengths and a non-uniform distribution, whose lengths also differ from those of the classifier weights. In the next section, we provide a theoretical justification of 𝒢𝒩𝒞 under the UFM for Problem (2).
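For reference, the regularized UFM objective (B.1) can be written as in the following sketch (our own PyTorch illustration; the default value of λ is an assumption, not a value taken from the paper).

```python
import torch
import torch.nn.functional as F

def ufm_weight_decay_loss(W, H, labels, tau=1.0, lam=5e-4):
    # Regularized UFM objective (B.1): CE on free features H (N, d) with classifier
    # weights W (K, d), plus weight decay on both W and H. lam = 5e-4 is an assumed value.
    ce = F.cross_entropy(H @ W.t() / tau, labels)
    return ce + 0.5 * lam * (W.pow(2).sum() + H.pow(2).sum())
```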

Figure B.2: Illustration of the lengths of the classifier weights and class-mean features in the unconstrained feature model (UFM) with the weight decay (WD) formulation in Figure B.1(c). The ratios between the lengths of classifier weights and class-mean features vary across different classes.

B.3 Additional results on prevalence of 𝒢𝒩𝒞\mathcal{GNC}

We provide additional evidence for the occurrence of the 𝒢𝒩𝒞 phenomenon in practical multi-class classification problems. Towards that, we train ResNet18, DenseNet121, and ResNeXt50 networks on the CIFAR100, Tiny-ImageNet, and BUPT-CBFace-50 datasets using the CE loss. To illustrate the case where the feature dimension is smaller than the number of classes, we insert an additional linear layer before the last-layer classifier and set the feature dimension to d = 10 for CIFAR100 and Tiny-ImageNet, and to d = 512 for BUPT-CBFace-50.

The results are reported in Figure B.3. It can be seen that in all cases the 𝒢𝒩𝒞1, 𝒢𝒩𝒞2, and 𝒢𝒩𝒞3 measures converge mostly monotonically over the training epochs towards the values predicted by 𝒢𝒩𝒞 (i.e., 0 for 𝒢𝒩𝒞1 and 𝒢𝒩𝒞3, and the objective value of the Softmax Code for 𝒢𝒩𝒞2; see Section 2.2).

Effect of the temperature τ.

The results on BUPT-CBFace-50 reported in Figure B.3 use a temperature of τ = 0.02. To examine the effect of τ, we conduct experiments with varying τ and report the 𝒢𝒩𝒞2 measure in Figure B.4. It can be seen that the 𝒢𝒩𝒞2 measure at convergence monotonically increases as τ decreases.

Figure B.3: Illustration of the 𝒢𝒩𝒞 measures and training accuracy across different network architectures on the CIFAR100 (top), Tiny-ImageNet (middle), and BUPT-CBFace-50 (bottom) datasets. We train the networks on CIFAR100 with d = 10, K = 100, on Tiny-ImageNet with d = 10, K = 200, and on BUPT-CBFace-50 with d = 512, K = 10,000.
Figure B.4: Illustration of 𝒢𝒩𝒞2 on the BUPT-CBFace-50 dataset across varying temperatures. We train the networks on BUPT-CBFace-50 with d = 512, K = 10,000.
Implementation details.

We choose the temperature τ = 0.1 for CIFAR100 and Tiny-ImageNet and τ = 0.02 for the BUPT-CBFace-50 dataset. The CIFAR100 dataset consists of 60,000 32×32 color images in 100 classes, and Tiny-ImageNet contains 100,000 64×64 color images in 200 classes. The BUPT-CBFace-50 dataset consists of 500,000 images in 10,000 classes, and all of its images are resized to 50×50. To adapt to the smaller image sizes, we modify these architectures by changing the first convolutional layer to have a kernel size of 3, a stride of 1, and a padding of 1. For data augmentation, we employ random cropping with a size of 32 and padding of 4, random horizontal flipping with a probability of 0.5, and random rotation of up to 15 degrees to increase the diversity of the training data. We then normalize the images channel-wise by their mean and standard deviation. For optimization, we use SGD with a momentum of 0.9 and an initial learning rate of 0.1, decayed according to a cosine annealing schedule over 200 epochs. The optimal margin (represented by the dashed line) in the second column from the left is obtained through numerical optimization of the Softmax Code problem.
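The augmentation and preprocessing pipeline described above can be expressed with torchvision as in the sketch below; the normalization statistics shown are the commonly used CIFAR100 values and should be treated as placeholders for each dataset's own mean and standard deviation.

```python
import torchvision.transforms as T

# Training-time augmentation and preprocessing described above.
train_transform = T.Compose([
    T.RandomCrop(32, padding=4),
    T.RandomHorizontalFlip(p=0.5),
    T.RandomRotation(degrees=15),
    T.ToTensor(),
    # Channel-wise normalization; usual CIFAR100 statistics shown as placeholder values.
    T.Normalize(mean=(0.5071, 0.4865, 0.4409), std=(0.2673, 0.2564, 0.2762)),
])
```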

Figure B.5: Illustration of NCC training accuracy with different temperatures.

B.4 The Nearest Centroid Classifier

As the learned features exhibit within-class variability collapse and are maximally distant between classes, the classifier also converges to the nearest centroid classifier (a.k.a. the nearest class-center classifier, NCC), which assigns each sample to the class with the nearest class-mean feature; this is termed 𝒩𝒞4 in Papyan et al. (2020) and exploited in (Galanti et al., 2022a;b; Rangamani et al., 2023) for studying 𝒩𝒞. To evaluate the convergence in terms of NCC accuracy, we use the same setup as in Figure 2, i.e., we train a ResNet18 network on the CIFAR100 dataset with the CE loss using different temperatures, and we then classify the features with the NCC. The result is presented in Figure B.5. We observe that with a relatively small temperature τ, the NCC accuracy converges to 100%, and hence the classifier indeed converges to an NCC.
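For clarity, the NCC accuracy reported in Figure B.5 can be computed as in the following sketch (our own illustration, assuming the features and labels of the training set have been collected into tensors).

```python
import torch

@torch.no_grad()
def ncc_accuracy(features, labels, num_classes):
    # features: (N, d) last-layer features; labels: (N,) integer class labels.
    means = torch.stack([features[labels == k].mean(dim=0) for k in range(num_classes)])
    preds = torch.cdist(features, means).argmin(dim=1)   # nearest class-mean feature
    return (preds == labels).float().mean().item()
```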

B.5 Additional results on the effect of assignment problem

We provide additional evidence for the effect of the “class assignment” problem in practical multi-class classification. To illustrate this, we train a ResNet18 network with d = 2 on the six classes {Automobile, Cat, Deer, Dog, Horse, Truck} from the CIFAR10 dataset, selected for their clear semantic similarities and discrepancies. Consequently, according to Theorem 3.3, there are fifteen distinct class assignments up to permutation.

With standard training, the classifier consistently converges to a configuration in which the Cat-Dog, Automobile-Truck, and Deer-Horse pairs are closer together, across 5 different trials; Figure B.6(a) shows the learned features (dots) and classifier weights (arrows) in one such trial. This demonstrates the implicit algorithmic regularization in training DNNs, which naturally attracts (semantically) similar classes and separates dissimilar ones.

We also conduct experiments with the classifier fixed to three of the fifteen arrangements, and we present the results in Figures B.6(b) to B.6(d). Among them, the case where the Cat-Dog, Automobile-Truck, and Deer-Horse pairs are far apart achieves a test accuracy of 86.37%, which is lower than the other two cases with test accuracies of 91.60% and 91.95%. This demonstrates the important role of the class assignment in the generalization of DNNs, and that the implicit bias of the learned classifier is benign, i.e., it leads to more generalizable solutions. Moreover, compared with the case of four classes in Figure 4, the discrepancy in test accuracy between different assignments increases from 2.18% to 5.58%. This suggests that as the number of classes grows, an appropriate class assignment becomes increasingly important.

(a) Trainable (91.82%) (b) Fixed A1 (91.60%) (c) Fixed A2 (91.95%) (d) Fixed A3 (86.37%)
Figure B.6: Assignment of classes to classifier weights for a ResNet18 with a 2-dimensional feature space trained on the 6 classes {Automobile, Cat, Deer, Dog, Horse, Truck} from CIFAR10. (a) Learned classifier. (b-d) Classifiers fixed to three different assignments. Test accuracy is reported in parentheses.

Appendix C Theoretical Proofs

C.1 Proof of Equation 6

We rewrite Equation 6 below for convenience.

Lemma C.1 (Convergence to the “HardMax” problem).

For any positive integers KK and nn, we have

lim supτ0(argmin1nKk=1Ki=1nCE(𝑾𝒉k,i,𝒚k,τ))argminHardMax(𝑾,𝑯).\limsup_{\tau\to 0}\left(\operatorname*{arg\,min}\frac{1}{nK}\sum_{k=1}^{K}\sum_{i=1}^{n}\mathcal{L}_{\text{CE}}\left({\bm{W}}^{\top}{\bm{h}}_{k,i},{\bm{y}}_{k},\tau\right)\right)\subseteq\operatorname*{arg\,min}\mathcal{L}_{\text{HardMax}}({\bm{W}},{\bm{H}}). (C.1)

In the above, all argmin are taken over 𝑾 ∈ 𝒪B(d,K) and 𝑯 ∈ 𝒪B(d,nK).

Proof.

Our proof uses the fundamental theorem of Γ\Gamma-convergence Braides (2006) (a.k.a. epi-convergence Rockafellar & Wets (2009)). Denote

0(𝑾,𝑯,τ)τlogi=1nk=1KCE(𝑾𝒉k,i,𝒚k,τ),\mathcal{L}_{0}({\bm{W}},{\bm{H}},\tau)\doteq\tau\cdot\log\sum_{i=1}^{n}\sum_{k=1}^{K}\mathcal{L}_{\text{CE}}({\bm{W}}^{\top}{\bm{h}}_{k,i},{\bm{y}}_{k},\tau), (C.2)

and let

\mathcal{WH}^{-}\doteq\{({\bm{W}},{\bm{H}})\in\operatorname{{\mathcal{O}B}}(d,K)\times\operatorname{{\mathcal{O}B}}(d,nK):({\bm{w}}_{k^{\prime}}-{\bm{w}}_{k})^{\top}{\bm{h}}_{k,i}\leq 0,\forall i\in[n],k\in[K],k^{\prime}\in[K]\setminus k\}. (C.3)

By Lemma C.2, the function ℒ_0(𝑾, 𝑯, τ) converges uniformly to ℒ_HardMax(𝑾, 𝑯) on 𝒲ℋ^− as τ → 0. Combining this with (Rockafellar & Wets, 2009, Proposition 7.15), ℒ_0(𝑾, 𝑯, τ) also Γ-converges to ℒ_HardMax(𝑾, 𝑯) on 𝒲ℋ^− as τ → 0. Applying (Braides, 2006, Theorem 2.10), we have

lim supτ0argmin(𝑾,𝑯)𝒲0(𝑾,𝑯,τ)argmin(𝑾,𝑯)𝒲HardMax(𝑾,𝑯).\limsup_{\tau\to 0}\operatorname*{arg\,min}_{({\bm{W}},{\bm{H}})\in\mathcal{WH}^{-}}\mathcal{L}_{0}({\bm{W}},{\bm{H}},\tau)\subseteq\operatorname*{arg\,min}_{({\bm{W}},{\bm{H}})\in\mathcal{WH}^{-}}\mathcal{L}_{\text{HardMax}}({\bm{W}},{\bm{H}}). (C.4)

Note that in the above, the argmin are taken over 𝒲ℋ^−, which is a strict subset of 𝒪B(d,K) × 𝒪B(d,nK). However, by Lemma C.3, we know that

\operatorname*{arg\,min}_{({\bm{W}},{\bm{H}})\in\mathcal{WH}^{-}}\mathcal{L}_{0}({\bm{W}},{\bm{H}},\tau)=\operatorname*{arg\,min}_{({\bm{W}},{\bm{H}})\in\operatorname{{\mathcal{O}B}}(d,K)\times\operatorname{{\mathcal{O}B}}(d,nK)}\mathcal{L}_{0}({\bm{W}},{\bm{H}},\tau), (C.5)

which holds for all τ\tau sufficiently small. Hence, we have

\limsup_{\tau\to 0}\operatorname*{arg\,min}_{({\bm{W}},{\bm{H}})\in\operatorname{{\mathcal{O}B}}(d,K)\times\operatorname{{\mathcal{O}B}}(d,nK)}\mathcal{L}_{0}({\bm{W}},{\bm{H}},\tau)\subseteq\operatorname*{arg\,min}_{({\bm{W}},{\bm{H}})\in\operatorname{{\mathcal{O}B}}(d,K)\times\operatorname{{\mathcal{O}B}}(d,nK)}\mathcal{L}_{\text{HardMax}}({\bm{W}},{\bm{H}}), (C.6)

which concludes the proof. ∎

Lemma C.2.

0(𝑾,𝑯,τ)\mathcal{L}_{0}({\bm{W}},{\bm{H}},\tau) converges uniformly to HardMax(𝐖,𝐇)\mathcal{L}_{\text{HardMax}}({\bm{W}},{\bm{H}}) in the domain (𝐖,𝐇)𝒲({\bm{W}},{\bm{H}})\in\mathcal{WH}^{-} as τ0\tau\to 0.

Proof.

Recall from the definition of the CE loss in (1) that

0(𝑾,𝑯,τ)τlogi=1nk=1KCE(𝑾𝒉k,i,𝒚k,τ)=τlogi=1nk=1Klog(1+k[K]kexp((𝒘k𝒘k)𝒉k,i/τ))\mathcal{L}_{0}({\bm{W}},{\bm{H}},\tau)\doteq\tau\cdot\log\sum_{i=1}^{n}\sum_{k=1}^{K}\mathcal{L}_{\text{CE}}({\bm{W}}^{\top}{\bm{h}}_{k,i},{\bm{y}}_{k},\tau)\\ =\tau\log\sum_{i=1}^{n}\sum_{k=1}^{K}\log\left(1+\sum_{k^{\prime}\in[K]\setminus k}\exp\left(({\bm{w}}_{k^{\prime}}-{\bm{w}}_{k})^{\top}{\bm{h}}_{k,i}/\tau\right)\right) (C.7)

Denote α_{i,k,k'} = (𝒘_{k'} − 𝒘_k)^⊤ 𝒉_{k,i} for all i ∈ [n], k ∈ [K], and k' ∈ [K]∖k for convenience. Fix any i ∈ [n] and k ∈ [K]. By the property that x/(1+x) ≤ log(1+x) ≤ x for all x > −1, we have

k[K]kexp(αi,k,kτ)1+k[K]kexp(αi,k,kτ)log(1+k[K]kexp(αi,k,kτ))k[K]kexp(αi,k,kτ).\frac{\sum_{k^{\prime}\in[K]\setminus k}\exp\left(\frac{\alpha_{i,k,k^{\prime}}}{\tau}\right)}{1+\sum_{k^{\prime}\in[K]\setminus k}\exp\left(\frac{\alpha_{i,k,k^{\prime}}}{\tau}\right)}\leq\log\left(1+\sum_{k^{\prime}\in[K]\setminus k}\exp\left(\frac{\alpha_{i,k,k^{\prime}}}{\tau}\right)\right)\leq\sum_{k^{\prime}\in[K]\setminus k}\exp\left(\frac{\alpha_{i,k,k^{\prime}}}{\tau}\right). (C.8)

Note that in the domain (𝑾, 𝑯) ∈ 𝒲ℋ^− we have α_{i,k,k'} ≤ 0 for all i, k, and k' ≠ k. By monotonicity of the exponential function, it follows that ∑_{k'∈[K]∖k} exp(α_{i,k,k'}/τ) ≤ K − 1 for all i, k. Hence, continuing from the inequality above, we have

k[K]kexp(αi,k,kτ)Klog(1+k[K]kexp(αi,k,kτ))k[K]kexp(αi,k,kτ).\frac{\sum_{k^{\prime}\in[K]\setminus k}\exp\left(\frac{\alpha_{i,k,k^{\prime}}}{\tau}\right)}{K}\leq\log\left(1+\sum_{k^{\prime}\in[K]\setminus k}\exp\left(\frac{\alpha_{i,k,k^{\prime}}}{\tau}\right)\right)\leq\sum_{k^{\prime}\in[K]\setminus k}\exp\left(\frac{\alpha_{i,k,k^{\prime}}}{\tau}\right). (C.9)

Summing over i,ki,k we obtain

i,kk[K]kexp(αi,k,kτ)Ki,klog(1+k[K]kexp(αi,k,kτ))i,kk[K]kexp(αi,k,kτ),\sum_{i,k}\frac{\sum_{k^{\prime}\in[K]\setminus k}\exp\left(\frac{\alpha_{i,k,k^{\prime}}}{\tau}\right)}{K}\leq\sum_{i,k}\log\left(1+\sum_{k^{\prime}\in[K]\setminus k}\exp\left(\frac{\alpha_{i,k,k^{\prime}}}{\tau}\right)\right)\leq\sum_{i,k}\sum_{k^{\prime}\in[K]\setminus k}\exp\left(\frac{\alpha_{i,k,k^{\prime}}}{\tau}\right), (C.10)

Using standard properties of the max function, we further obtain

maxi,k,k[K]kexp(αi,k,kτ)Ki,klog(1+k[K]kexp(αi,k,kτ))nK(K1)maxi,k,k[K]kexp(αi,k,kτ).\frac{\max_{i,k,k^{\prime}\in[K]\setminus k}\exp\left(\frac{\alpha_{i,k,k^{\prime}}}{\tau}\right)}{K}\leq\sum_{i,k}\log\left(1+\sum_{k^{\prime}\in[K]\setminus k}\exp\left(\frac{\alpha_{i,k,k^{\prime}}}{\tau}\right)\right)\\ \leq n\cdot K\cdot(K-1)\cdot\max_{i,k,k^{\prime}\in[K]\setminus k}\exp\left(\frac{\alpha_{i,k,k^{\prime}}}{\tau}\right). (C.11)

Taking the logarithm of all terms in (C.11) and multiplying by τ, we get

maxi,k,k[K]kαi,k,kτlogK0(𝑾,𝑯,τ)τlog(nK(K1))+maxi,k,k[K]kαi,k,k.\max_{i,k,k^{\prime}\in[K]\setminus k}\alpha_{i,k,k^{\prime}}-\tau\log K\leq\mathcal{L}_{0}({\bm{W}},{\bm{H}},\tau)\leq\tau\log(n\cdot K\cdot(K-1))+\max_{i,k,k^{\prime}\in[K]\setminus k}\alpha_{i,k,k^{\prime}}. (C.12)

Noting that maxi,k,k[K]kαi,k,k=HardMax(𝑾,𝑯)\max_{i,k,k^{\prime}\in[K]\setminus k}\alpha_{i,k,k^{\prime}}=\mathcal{L}_{\text{HardMax}}({\bm{W}},{\bm{H}}), we have

HardMax(𝑾,𝑯)τlogK0(𝑾,𝑯,τ)τlog(nK(K1))+HardMax(𝑾,𝑯).\mathcal{L}_{\text{HardMax}}({\bm{W}},{\bm{H}})-\tau\log K\leq\mathcal{L}_{0}({\bm{W}},{\bm{H}},\tau)\leq\tau\log(n\cdot K\cdot(K-1))+\mathcal{L}_{\text{HardMax}}({\bm{W}},{\bm{H}}). (C.13)

Hence, for any ϵ > 0, take τ_0 = ϵ / max{log K, log(nK(K−1))}. Then for any (𝑾, 𝑯) ∈ 𝒲ℋ^−, we have

|0(𝑾,𝑯,τ)HardMax(𝑾,𝑯)|τmax{logK,log(nK(K1))}<ϵ,|\mathcal{L}_{0}({\bm{W}},{\bm{H}},\tau)-\mathcal{L}_{\text{HardMax}}({\bm{W}},{\bm{H}})|\leq\tau\max\{\log K,\log(n\cdot K\cdot(K-1))\}<\epsilon, (C.14)

for any τ < τ_0. That is, ℒ_0(𝑾, 𝑯, τ) converges uniformly to ℒ_HardMax(𝑾, 𝑯). ∎

Lemma C.3.

Given any n,Kn,K there exists a constant τ0\tau_{0} such that for any τ<τ0\tau<\tau_{0} we have

\operatorname*{arg\,min}_{({\bm{W}},{\bm{H}})\in\operatorname{{\mathcal{O}B}}(d,K)\times\operatorname{{\mathcal{O}B}}(d,nK)}\mathcal{L}_{0}({\bm{W}},{\bm{H}},\tau)\subseteq\mathcal{WH}^{-}. (C.15)
Proof.

Note that for any (𝑾, 𝑯) ∈ 𝒪B(d,K) × 𝒪B(d,nK) with (𝑾, 𝑯) ∉ 𝒲ℋ^−, there exist ī ∈ [n], k̄ ∈ [K], and k̄' ∈ [K]∖k̄ such that (𝒘_{k̄'} − 𝒘_{k̄})^⊤ 𝒉_{k̄,ī} ≥ 0. Hence, we have

\mathcal{L}_{0}({\bm{W}},{\bm{H}},\tau)=\tau\log\sum_{i=1}^{n}\sum_{k=1}^{K}\log\left(1+\sum_{k^{\prime}\in[K]\setminus k}\exp\left(({\bm{w}}_{k^{\prime}}-{\bm{w}}_{k})^{\top}{\bm{h}}_{k,i}/\tau\right)\right)\geq\tau\log\log\left(1+\exp\left(({\bm{w}}_{\bar{k}^{\prime}}-{\bm{w}}_{\bar{k}})^{\top}{\bm{h}}_{\bar{k},\bar{i}}/\tau\right)\right)\geq\tau\log\log(2),\quad\forall\tau>0. (C.16)

On the other hand, we show that one can construct a (𝑾^*, 𝑯^*) ∈ 𝒲ℋ^− such that ℒ_0(𝑾^*, 𝑯^*, τ) < τ log log(2) for any sufficiently small τ. Towards that, take 𝑾^* to be any matrix in 𝒪B(d,K) with distinct columns, and denote by M = max_{k≠k'} ⟨𝒘^*_k, 𝒘^*_{k'}⟩ the inner product of the closest pair of columns of 𝑾^*, which by construction satisfies M < 1. Take 𝒉^*_{k,i} = 𝒘^*_k for all k ∈ [K] and i ∈ [n]. Then (𝒘^*_{k'} − 𝒘^*_k)^⊤ 𝒉^*_{k,i} ≤ −(1 − M) for all i ∈ [n], k ∈ [K], and k' ∈ [K]∖k. Plugging this into the definition of ℒ_0(𝑾, 𝑯, τ), we have

0(𝑾,𝑯,τ)=τlogi=1nk=1Klog(1+k[K]kexp((𝒘k𝒘k)𝒉k,i/τ))τlog(nKlog(1+(K1)exp(1Mτ)))<τloglog2,τ<τ0\mathcal{L}_{0}({\bm{W}}^{*},{\bm{H}}^{*},\tau)=\tau\log\sum_{i=1}^{n}\sum_{k=1}^{K}\log\left(1+\sum_{k^{\prime}\in[K]\setminus k}\exp\left(({\bm{w}}_{k^{\prime}}^{*}-{\bm{w}}_{k}^{*})^{\top}{\bm{h}}_{k,i}^{*}/\tau\right)\right)\\ \leq\tau\log\left(nK\log(1+(K-1)\exp{(-\frac{1-M}{\tau})})\right)<\tau\log\log 2,\quad\forall\tau<\tau_{0} (C.17)

for some τ_0 > 0 that depends only on M. In the above, the last inequality holds because one can always find a τ_0 > 0 such that nK log(1 + (K−1)exp(−(1−M)/τ)) < log 2 for all τ < τ_0. This implies that, for any sufficiently small τ, no (𝑾, 𝑯) outside of 𝒲ℋ^− can be a minimizer of ℒ_0(𝑾, 𝑯, τ), which finishes the proof. ∎

Figure C.1: Verifying Equation 6 under d = 3 and K = 7.

As depicted in Figure C.1, we plot the cosine of the minimal angle obtained by optimizing the CE loss (blue line) and by the “HardMax” approach (black line) for different temperature parameters. The figure shows that as the temperature parameter τ → 0, the blue line converges to the black line, which corroborates the result above.

C.2 An important lemma

We present an important lemma which will be used to prove many of the subsequent results.

Lemma C.4 (Optimal Features for Fixed Classifier).

For any k[K]k\in[K], suppose 𝐰kconv({𝐰j}j[K]k){\bm{w}}_{k}\notin\operatorname{conv}(\{{\bm{w}}_{j}\}_{j\in[K]\setminus k}), then

min𝒉𝕊d1maxkk𝒘k𝒘k,𝒉=dist(𝒘k,{𝒘j}j[K]k).\min_{{\bm{h}}\in\mathbb{S}^{d-1}}\max_{k^{\prime}\neq k}\left\langle{\bm{w}}_{k^{\prime}}-{\bm{w}}_{k},{\bm{h}}\right\rangle=-\operatorname{dist}({\bm{w}}_{k},\{{\bm{w}}_{j}\}_{j\in[K]\setminus k}). (C.18)

In addition, the optimal 𝐡{\bm{h}} is given by

𝒉=𝒫𝕊d1(𝒘k𝒫{𝒘j}j[K]k(𝒘k)){\bm{h}}=\mathcal{P}_{\mathbb{S}^{d-1}}\left({\bm{w}}_{k}-\mathcal{P}_{\{{\bm{w}}_{j}\}_{j\in[K]\setminus k}}({\bm{w}}_{k})\right) (C.19)

where 𝒫𝒲(𝐯)argmin𝐰conv(𝒲){𝐯𝐰2}\mathcal{P}_{\mathcal{W}}({\bm{v}})\doteq\operatorname*{arg\,min}_{{\bm{w}}\in\operatorname{conv}(\mathcal{W})}\{\|{\bm{v}}-{\bm{w}}\|_{2}\} denotes the projection of 𝐯{\bm{v}} on conv(𝒲)\operatorname{conv}(\mathcal{W}).

Proof.

The proof follows from combining Lemma C.5 and Lemma C.6. ∎
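As a concrete sanity check of Lemma C.4, consider d = 2, K = 3 with the simplex ETF 𝒘_1 = (1, 0), 𝒘_2 = (−1/2, √3/2), 𝒘_3 = (−1/2, −√3/2). The projection of 𝒘_1 onto conv({𝒘_2, 𝒘_3}) (the vertical segment with first coordinate −1/2) is (−1/2, 0), so dist(𝒘_1, {𝒘_2, 𝒘_3}) = 3/2, and (C.19) gives 𝒉 = 𝒘_1. Indeed, at 𝒉 = 𝒘_1 we have ⟨𝒘_2 − 𝒘_1, 𝒘_1⟩ = ⟨𝒘_3 − 𝒘_1, 𝒘_1⟩ = −3/2, which matches the value −dist(𝒘_1, {𝒘_2, 𝒘_3}) in (C.18).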

Lemma C.5.

Suppose 𝐰kconv({𝐰j}j[K]k){\bm{w}}_{k}\notin\operatorname{conv}(\{{\bm{w}}_{j}\}_{j\in[K]\setminus k}). Then

min𝒉𝕊d1maxkk𝒘k𝒘k,𝒉min𝒉21maxkk𝒘k𝒘k,𝒉\min_{{\bm{h}}\in\mathbb{S}^{d-1}}\max_{k^{\prime}\neq k}\left\langle{\bm{w}}_{k^{\prime}}-{\bm{w}}_{k},{\bm{h}}\right\rangle\equiv\min_{\|{\bm{h}}\|_{2}\leq 1}\max_{k^{\prime}\neq k}\left\langle{\bm{w}}_{k^{\prime}}-{\bm{w}}_{k},{\bm{h}}\right\rangle (C.20)

where \equiv means the two problems are equivalent, i.e., have the same optimal solutions.

Proof of Lemma C.5.

By the separating hyperplane theorem (e.g., (Boyd & Vandenberghe, 2004, Example 2.20)), there exist a nonzero vector 𝒉̄ and a scalar b ∈ ℝ such that max_{k'≠k} ⟨𝒘_{k'}, 𝒉̄⟩ < b and ⟨𝒘_k, 𝒉̄⟩ > b, i.e., max_{k'≠k} ⟨𝒘_{k'} − 𝒘_k, 𝒉̄⟩ < 0. Let 𝒉^* be any optimal solution to the RHS of (C.20). Then, it holds that

maxkk𝒘k𝒘k,𝒉maxkk𝒘k𝒘k,𝒉¯𝒉¯2<0.\max_{k^{\prime}\neq k}\left\langle{\bm{w}}_{k^{\prime}}-{\bm{w}}_{k},{\bm{h}}^{*}\right\rangle\leq\max_{k^{\prime}\neq k}\left\langle{\bm{w}}_{k^{\prime}}-{\bm{w}}_{k},\frac{\bar{{\bm{h}}}}{\|\bar{{\bm{h}}}\|_{2}}\right\rangle<0.

Hence, it must be the case that ‖𝒉^*‖_2 = 1; otherwise, taking 𝒉 = 𝒉^*/‖𝒉^*‖_2 gives a strictly lower objective value for the RHS of (C.20), contradicting the optimality of 𝒉^*. ∎

Lemma C.6.

Suppose 𝐰kconv({𝐰j}j[K]k){\bm{w}}_{k}\notin\operatorname{conv}(\{{\bm{w}}_{j}\}_{j\in[K]\setminus k}). Consider the (primal) problem

min𝒉21maxkk𝒘k𝒘k,𝒉.\min_{\|{\bm{h}}\|_{2}\leq 1}\max_{k^{\prime}\neq k}\left\langle{\bm{w}}_{k^{\prime}}-{\bm{w}}_{k},{\bm{h}}\right\rangle. (C.21)

Its dual problem is given by

\max_{{\bm{v}}\in\mathbb{R}^{K-1}}\ -\|{\bm{w}}_{k}-\sum_{k^{\prime}\neq k}v_{k^{\prime}}{\bm{w}}_{k^{\prime}}\|_{2}\quad\textrm{s.t.}\quad\sum_{k^{\prime}\neq k}v_{k^{\prime}}=1,\ \text{and}\ v_{k^{\prime}}\geq 0,\forall k^{\prime}\neq k. (C.22)

with zero duality gap. Moreover, for any primal optimal solution 𝐡{\bm{h}}^{*} there is a dual optimal solution 𝐯{\bm{v}}^{*} and they satisfy

𝒉=𝒘kkvk𝒘k𝒘kkvk𝒘k2.{\bm{h}}^{*}=\frac{{\bm{w}}_{k}-\sum_{k^{\prime}}v_{k^{\prime}}^{*}{\bm{w}}_{k^{\prime}}}{\|{\bm{w}}_{k}-\sum_{k^{\prime}}v_{k^{\prime}}^{*}{\bm{w}}_{k^{\prime}}\|_{2}}. (C.23)
Proof of Lemma C.6.

We rewrite the primal problem as

min𝒉21,𝒑IRdmaxkkpks.t.pk=𝒘k𝒘k,𝒉.\min_{\|{\bm{h}}\|_{2}\leq 1,{\bm{p}}\in I\!\!R^{d}}\max_{k^{\prime}\neq k}p_{k^{\prime}}\leavevmode\nobreak\ \leavevmode\nobreak\ \textrm{s.t.}\leavevmode\nobreak\ p_{k^{\prime}}=\left\langle{\bm{w}}_{k^{\prime}}-{\bm{w}}_{k},{\bm{h}}\right\rangle. (C.24)

Introducing the dual variable 𝒗 ∈ ℝ^{K−1}, the Lagrangian function is

(𝒉,𝒑,𝒗)=maxkkpkkkvk(pk𝒘k𝒘k,𝒉),𝒉21.\mathcal{L}(\bm{h},{\bm{p}},{\bm{v}})=\max_{k^{\prime}\neq k}p_{k^{\prime}}-\sum_{k^{\prime}\neq k}v_{k^{\prime}}(p_{k^{\prime}}-\left\langle{\bm{w}}_{k^{\prime}}-{\bm{w}}_{k},{\bm{h}}\right\rangle),\leavevmode\nobreak\ \leavevmode\nobreak\ \|{\bm{h}}\|_{2}\leq 1. (C.25)

We now derive the dual problem, defined as

\begin{split}\max_{{\bm{v}}}\min_{\|{\bm{h}}\|_{2}\leq 1,{\bm{p}}\in\mathbb{R}^{K-1}}\mathcal{L}(\bm{h},{\bm{p}},{\bm{v}})&=\max_{{\bm{v}}}\left(\min_{{\bm{p}}}\left(\max_{k^{\prime}\neq k}p_{k^{\prime}}-\sum_{k^{\prime}\neq k}v_{k^{\prime}}p_{k^{\prime}}\right)+\min_{\|\bm{h}\|_{2}\leq 1}\left\langle\sum_{k^{\prime}\neq k}v_{k^{\prime}}{\bm{w}}_{k^{\prime}}-{\bm{w}}_{k},{\bm{h}}\right\rangle\right)\\&=\max_{{\bm{v}}}\min_{\|\bm{h}\|_{2}\leq 1}\left\langle\sum_{k^{\prime}\neq k}v_{k^{\prime}}{\bm{w}}_{k^{\prime}}-{\bm{w}}_{k},{\bm{h}}\right\rangle\quad\textrm{s.t.}\ \sum_{k^{\prime}\neq k}v_{k^{\prime}}=1,\ v_{k^{\prime}}\geq 0\ \forall k^{\prime}\neq k\\&=\max_{{\bm{v}}}-\|{\bm{w}}_{k}-\sum_{k^{\prime}\neq k}v_{k^{\prime}}{\bm{w}}_{k^{\prime}}\|_{2}\quad\textrm{s.t.}\ \sum_{k^{\prime}\neq k}v_{k^{\prime}}=1,\ v_{k^{\prime}}\geq 0\ \forall k^{\prime}\neq k.\end{split} (C.26)

In the above, the second equality follows from the fact that the conjugate (see, e.g., Boyd & Vandenberghe (2004)) of the max function is the indicator function of the probability simplex. The third equality uses the assumption that 𝒘_k ∉ conv({𝒘_j}_{j∈[K]∖k}), which implies that ∑_{k'≠k} v_{k'}𝒘_{k'} − 𝒘_k ≠ 0 under the simplex constraint on 𝒗; hence the optimal 𝒉 in the second line is given by 𝒉 = (𝒘_k − ∑_{k'} v_{k'}𝒘_{k'}) / ‖𝒘_k − ∑_{k'} v_{k'}𝒘_{k'}‖_2.

Finally, the remaining claims hold since the primal problem is convex and Slater's condition is satisfied. ∎

C.3 Proof of Theorem 3.2

Here we prove the following result which is a stronger version of Theorem 3.2.

Theorem C.7.

Let (𝐖,𝐇)({\bm{W}}^{\star},{\bm{H}}^{\star}) be an optimal solution to (5). Then, it holds that 𝐖{\bm{W}}^{\star} is a Softmax Code, i.e.,

𝑾argmax𝑾𝒪B(d,K)ρone-vs-rest(𝑾).{\bm{W}}^{\star}\in\operatorname*{arg\,max}\limits_{{\bm{W}}\in\operatorname{{\mathcal{O}B}}(d,K)}\rho_{\textup{one-vs-rest}}({\bm{W}}). (C.27)

Conversely, let 𝐖SC{\bm{W}}^{\text{SC}} be any Softmax Code. Then, there exists a 𝐇SC{\bm{H}}^{\text{SC}} such that (𝐖SC,𝐇SC)({\bm{W}}^{\text{SC}},{\bm{H}}^{\text{SC}}) is an optimal solution to (5).

Proof.

The proof is divided into two parts.

Any optimal solution to (5) is a Softmax Code. Our proof is based on providing a lower bound on the objective HardMax(𝑾,𝑯)\mathcal{L}_{\text{HardMax}}({\bm{W}},{\bm{H}}). We distinguish two cases in deriving the lower bound.

  • 𝑾{\bm{W}} has distinct columns. In this case we use the following bound:

    HardMax(𝑾,𝑯)=maxk[K]maxi[n]maxkk𝒘k𝒘k,𝒉k,imaxk[K]maxi[n]min𝒉¯k,i𝕊d1maxkk𝒘k𝒘k,𝒉¯k,i=mink[K]dist(𝒘k,{𝒘j}j[K]k).\mathcal{L}_{\text{HardMax}}({\bm{W}},{\bm{H}})=\max_{k\in[K]}\max_{i\in[n]}\max_{k^{\prime}\neq k}\langle\bm{w}_{k^{\prime}}-\bm{w}_{k},\bm{h}_{k,i}\rangle\\ \geq\max_{k\in[K]}\max_{i\in[n]}\min_{\bar{{\bm{h}}}_{k,i}\in\mathbb{S}^{d-1}}\max_{k^{\prime}\neq k}\langle\bm{w}_{k^{\prime}}-\bm{w}_{k},\bar{{\bm{h}}}_{k,i}\rangle=-\min_{k\in[K]}\operatorname{dist}({\bm{w}}_{k},\{{\bm{w}}_{j}\}_{j\in[K]\setminus k}). (C.28)

    In the above, the first equality follows directly from the definition of the HardMax function, and the inequality follows from minimizing each term over the unit sphere. The last equality follows from Lemma C.4, which is applicable because the columns of 𝑾 are distinct unit vectors, so that 𝒘_k ∉ conv({𝒘_j}_{j∈[K]∖k}). Continuing with the rightmost term in (C.28), we have

    mink[K]dist(𝒘k,{𝒘j}j[K]k)=ρone-vs-rest(𝑾)ρone-vs-rest(𝑾SC),-\min_{k\in[K]}\operatorname{dist}({\bm{w}}_{k},\{{\bm{w}}_{j}\}_{j\in[K]\setminus k})=-\rho_{\textup{one-vs-rest}}({\bm{W}})\geq-\rho_{\textup{one-vs-rest}}({\bm{W}}^{\text{SC}}), (C.29)

    where 𝑾^SC is any Softmax Code. In the above, the equality follows from the definition of ρ_one-vs-rest(⋅), and the inequality follows from the definition of the Softmax Code. In particular, defining 𝑾̂ = 𝑾^SC and 𝑯̂ as

    𝒉^k,i=𝒘^kproj(𝒘^k,{𝒘^j}j[K]k)𝒘^kproj(𝒘^k,{𝒘^j}j[K]k)2,\widehat{{\bm{h}}}_{k,i}=\frac{\widehat{{\bm{w}}}_{k}-\mathrm{proj}(\widehat{{\bm{w}}}_{k},\{\widehat{{\bm{w}}}_{j}\}_{j\in[K]\setminus k})}{\|\widehat{{\bm{w}}}_{k}-\mathrm{proj}(\widehat{{\bm{w}}}_{k},\{\widehat{{\bm{w}}}_{j}\}_{j\in[K]\setminus k})\|_{2}}, (C.30)

    all inequalities in (C.28) and (C.29) hold with equality when taking 𝑾 = 𝑾̂ and 𝑯 = 𝑯̂, at which point we have ℒ_HardMax(𝑾̂, 𝑯̂) = −ρ_one-vs-rest(𝑾^SC).

  • 𝑾 does not have distinct columns. In this case, there exist k_1 ≠ k_2 such that 𝒘_{k_1} = 𝒘_{k_2}. We have

    HardMax(𝑾,𝑯)=maxk[K]maxi[n]maxkk𝒘k𝒘k,𝒉k,imaxi[n]𝒘k2𝒘k1,𝒉k1,i=0.\mathcal{L}_{\text{HardMax}}({\bm{W}},{\bm{H}})=\max_{k\in[K]}\max_{i\in[n]}\max_{k^{\prime}\neq k}\langle\bm{w}_{k^{\prime}}-\bm{w}_{k},\bm{h}_{k,i}\rangle\geq\max_{i\in[n]}\langle{\bm{w}}_{k_{2}}-{\bm{w}}_{k_{1}},{\bm{h}}_{k_{1},i}\rangle=0. (C.31)

Combining the above two cases, and noting that −ρ_one-vs-rest(𝑾^SC) < 0, we conclude that ℒ_HardMax(𝑾, 𝑯) ≥ −ρ_one-vs-rest(𝑾^SC) for all feasible (𝑾, 𝑯), and that (𝑾̂, 𝑯̂) attains this lower bound and is therefore an optimal solution to the HardMax problem in (5). Moreover, since (𝑾^*, 𝑯^*) is an optimal solution to the HardMax problem, it must also attain the lower bound, i.e.,

HardMax(𝑾,𝑯)=ρone-vs-rest(𝑾SC).\mathcal{L}_{\text{HardMax}}({\bm{W}}^{*},{\bm{H}}^{*})=-\rho_{\textup{one-vs-rest}}({\bm{W}}^{\text{SC}}). (C.32)

Hence, 𝑾{\bm{W}}^{*} has to attain the equality in (C.29). By definition, this means that 𝑾{\bm{W}}^{*} is a Softmax Code, which concludes the proof of this part.