
Label Embedding via Low-Coherence Matrices

Jianxin Zhang and Clayton Scott
Electrical Engineering and Computer Science
University of Michigan
Ann Arbor, MI 48109
{jianxinz, clayscot}@umich.edu
Abstract

Label embedding is a framework for multiclass classification problems where each label is represented by a distinct vector of some fixed dimension, and training involves matching model output to the vector representing the correct label. While label embedding has been successfully applied in extreme classification and zero-shot learning, and offers both computational and statistical advantages, its theoretical foundations remain poorly understood. This work presents an analysis of label embedding in the context of extreme multiclass classification, where the number of classes $C$ is very large. We present an excess risk bound that reveals a trade-off between computational and statistical efficiency, quantified via the coherence of the embedding matrix. We further show that under the Massart noise condition, the statistical penalty for label embedding vanishes with sufficiently low coherence. Our analysis supports an algorithm that is simple, scalable, and easily parallelizable, and experimental results demonstrate its effectiveness in large-scale applications.

1 Introduction

In standard classification, the goal is to learn from feature-label pairs $\{(x_i, y_i)\}_{i=1}^{N}$ a classifier $h: \mathcal{X} \rightarrow \mathcal{Y}$ that maps a feature vector $x$ in the feature space $\mathcal{X}$ to its label $y$ in the label space $\mathcal{Y}$. Label embedding is a framework that represents each label by a vector of fixed dimension, and learns a function to map each feature vector to the vector representing its label. At inference time, the label of a test data point is assigned to match the nearest label representative in the embedding space. The standard classification setup can be viewed as a special case of label embedding, where each label is one-hot encoded.

This work examines label embedding in the context of extreme classification, which refers to multiclass and multilabel classification problems involving thousands of classes or more (Wei et al., 2022). Extreme classification has emerged as an essential research area in machine learning, owing to an increasing number of real-world applications involving massive numbers of classes, such as image recognition (Zhou et al., 2014), natural language processing (Le and Mikolov, 2014; Jernite et al., 2017), and recommendation systems (Bhatia et al., 2015; Chang et al., 2019). Traditional classification methods often struggle to scale effectively in these scenarios due to the high computational cost and memory requirements associated with handling large label spaces. Consequently, there is a growing need for efficient and scalable algorithms that can tackle extreme classification problems without compromising on performance (Prabhu and Varma, 2014; Prabhu et al., 2018; Deng et al., 2018). Successful applications of label embedding to extreme classification include Yu et al., (2014); Jain et al., (2019); Bhatia et al., (2015); Guo et al., (2019); Evron et al., (2018); Hsu et al., (2009). Furthermore, Rodríguez et al., (2018) argue that label embedding can accelerate the convergence rate and better capture latent relationships between categories.

Despite its widespread use, the theoretical basis of label embedding has not been thoroughly explored. This paper presents a new excess risk bound that provides insight into how label embedding algorithms work. The bound establishes a trade-off between computational efficiency and classification accuracy, and explains the accuracy penalty in terms of the coherence of the embedding matrix. Our theory applies to various types of embedding: data-independent embeddings (Weston et al., 2002; Hsu et al., 2009), those anchored in auxiliary information (Akata et al., 2013), and embeddings co-trained with models (Weston et al., 2010). Intriguingly, under the multiclass noise condition of Massart and Nédélec, (2006), the statistical penalty associated with a positive matrix coherence, which is unavoidable when reducing the dimension of the label space, disappears. Guided by this theory, we investigate a simple yet efficient label embedding algorithm, and show empirically that in extreme classification tasks, this algorithm outperforms existing methods.

2 Related Work

While label embedding has also been successfully applied to zero-shot learning (Wang et al., 2019; Akata et al., 2013), we focus here on extreme classification, together with related theoretical contributions.

2.1 Extreme Classification

Besides label embedding, existing methods for extreme multiclass classification can be grouped into three main categories: label hierarchy, one-vs-all methods, and other methods.

Label Embedding. A natural approach involves representing each label as a vector in a low-dimensional space. LEML (Yu et al., 2014) leverages a low-rank assumption on linear models and effectively constrains the output space of models to a low-dimensional space. SLICE (Jain et al., 2019) is designed to train on low-dimensional dense features, with each label represented by the mean of all feature vectors associated with that label. SLEEC (Bhatia et al., 2015) is a local embedding framework that preserves the distances between label vectors. Guo et al., (2019) point out that low-dimensional embedding-based models can suffer from significant overfitting; their theoretical insights inspire a novel regularization technique to alleviate overfitting in embedding-based models. WLSTS (Evron et al., 2018) is an extreme multiclass classification framework based on error-correcting output coding, which embeds labels with codes induced by graphs. Hsu et al., (2009) use column vectors from a matrix with the restricted isometry property (RIP) to represent labels. Their analysis is primarily tailored to multilabel classification. They derive bounds on the conditional $\ell_2$-error, which measures the squared $2$-norm difference between the prediction and the label vector, a metric that is not a standard measure of classification error. In contrast, our work analyzes the standard classification error.

Label Hierarchy. Numerous methods such as Parabel (Prabhu et al., 2018), Bonsai (Khandagale et al., 2020), AttentionXML (You et al., 2019), LightXML (Jiang et al., 2021), XR-Transformer (Zhang et al., 2021), X-Transformer (Wei et al., 2019), XR-Linear (Yu et al., 2022), and ELIAS (Gupta et al., 2022) partition the label space into clusters. This is typically achieved by performing $k$-means clustering on the feature space. The training process involves training a cluster-level model to assign a cluster to a feature vector, followed by training a label-level model to assign labels within the cluster.

One-vs-all methods. One-vs-all (OVA) algorithms address extreme classification problems with $C$ labels by modeling them as $C$ independent binary classification problems. For each label, a classifier is trained to predict its presence. DiSMEC (Babbar and Schölkopf, 2017) introduces a large-scale distributed framework to train linear OVA models, albeit at an expensive computational cost. ProXML (Babbar and Schölkopf, 2019) mitigates the impact of data scarcity with adversarial perturbations. PD-Sparse (Yen et al., 2016) and PPD-Sparse (Yen et al., 2017a) propose optimization algorithms to exploit a sparsity assumption on labels and feature vectors.

Other methods. Beyond the above categories, DeepXML (Dahiya et al., 2021) uses a negative sampling procedure that shortlists $O(\log C)$ relevant labels during training and prediction. VM (Choromanska and Langford, 2015) constructs trees of $\mathcal{O}(\log C)$ depth whose leaves have low label entropy. Based on the standard random forest training algorithm, FastXML (Prabhu and Varma, 2014) proposes to directly optimize the Discounted Cumulative Gain to reduce the training cost. AnnexML (Tagami, 2017) constructs a $k$-nearest neighbor graph of the label vectors and attempts to reproduce the graph structure in a lower-dimensional feature space.

2.2 Excess risk bounds

Our theoretical contributions are expressed as excess risk bounds, which quantify how the excess risk associated with a surrogate loss relates to the excess risk for the 0-1 loss. Excess risk bounds for classification were developed by Zhang, (2004); Bartlett et al., (2006); Steinwart, (2007) and subsequently refined and extended by several authors. Ramaswamy and Agarwal, (2012) show that one needs to minimize a convex surrogate loss defined on a space of dimension at least $C-1$ to achieve consistency for the standard $C$-class classification problem, i.e., any convex surrogate loss function operating on a dimension less than $C-1$ inevitably suffers an irreducible error. We draw insights from (Steinwart, 2007) to establish a novel excess risk bound for the label embedding framework, quantifying this irreducible error.

From a different perspective, Ramaswamy et al., (2018) put forth a novel surrogate loss function for multiclass classification with an abstain option. This abstain option enables the classifier to opt out of making predictions at a certain cost. Remarkably, their proposed methods not only demonstrate consistency but also effectively reduce the multiclass problem to $\lceil\log C\rceil$ binary classification problems by encoding the classes with their binary representations. In particular, the region in $\mathcal{X}$ that causes the irreducible error in our excess risk bound is the region on which Ramaswamy et al., (2018) abstain in order to achieve lossless dimension reduction in the abstention setting.

3 Matrix Coherence

Our theory relies on the notion of the coherence of a matrix $A \in \mathbb{C}^{n \times C}$, which is the maximum magnitude of the inner products between distinct columns.

Definition 1.

Let $\{a_j\}_{j=1}^{C}$ be the columns of the matrix $A \in \mathbb{C}^{n \times C}$, where $\lVert a_j \rVert_2 = 1$ for all $j$. The coherence of $A$ is defined as $\lambda^A := \max_{1 \leq i \neq j \leq C} \left\lvert \left\langle a_i, a_j \right\rangle \right\rvert$.

If $n \geq C$, then $\lambda^A = 0$ when the columns of $A$ are orthonormal. When $n < C$, however, the coherence must be positive. Indeed, Welch, (1974) showed that for $A \in \mathbb{C}^{n \times C}$ with $n \leq C$, $\lambda^A \geq \sqrt{\frac{C-n}{n(C-1)}}$.

There are a number of known constructions of low-coherence matrices when $n < C$. A primary class of examples is random matrices with unit-norm columns that satisfy the Johnson-Lindenstrauss property (Johnson and Lindenstraus, 1984). For example, a Rademacher matrix has entries that are sampled i.i.d. from a uniform distribution on $\left\{\frac{1}{\sqrt{n}}, -\frac{1}{\sqrt{n}}\right\}$. With high probability, a Rademacher random matrix of shape $n \times C$ achieves a coherence $\lambda \leq \sqrt{\frac{c_0 \log C}{n}}$ for some constant $c_0$ (Achlioptas, 2001). While random matrices can be easily obtained and have low coherence in general, they require explicit storage (Nelson and Temlyakov, 2011) and can be outperformed in practical problems by some carefully crafted deterministic matrices (Naidu et al., 2016; Liu and Jia, 2020). Numerous deterministic constructions of low-coherence matrices have been proposed (Nelson and Temlyakov, 2011; Yu, 2011; Li et al., 2012; Xu, 2011; Naidu et al., 2016). In particular, Nelson and Temlyakov, (2011) propose a deterministic construction that can achieve $\lambda \approx C^{-\frac{1}{4}}$ with $n \approx \sqrt{C}$, which avoids explicit storage of the matrix and can achieve a lower coherence in practice. There are also algorithms that directly optimize matrices for a smaller coherence (Wei et al., 2020; Abolghasemi et al., 2010; Obermeier and Martinez-Lorenzo, 2017; Li et al., 2013; Lu et al., 2018).
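To make these quantities concrete, the following minimal sketch (numpy only; the choices $n = 64$ and $C = 1000$ are arbitrary and for illustration) builds a Rademacher embedding matrix, computes its coherence as in Definition 1, and compares it with the Welch bound.

```python
import numpy as np

rng = np.random.default_rng(0)
n, C = 64, 1000                                          # illustrative dimensions
A = rng.choice([-1.0, 1.0], size=(n, C)) / np.sqrt(n)    # unit-norm Rademacher columns

def coherence(A):
    """Maximum |<a_i, a_j>| over distinct columns (columns assumed unit norm)."""
    gram = np.abs(A.conj().T @ A)
    np.fill_diagonal(gram, 0.0)
    return gram.max()

welch = np.sqrt((C - n) / (n * (C - 1)))                 # Welch lower bound on coherence
print(f"coherence = {coherence(A):.3f}, Welch bound = {welch:.3f}")
```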

4 Label Embedding by Low-Coherence Matrices

We first introduce notation and the problem statement in Section 4.1, and then present the associated algorithm in Section 4.2. The excess risk bound and its interpretation are presented in Section 4.3. Finally, we introduce the condition for lossless label embedding in Section 4.4.

4.1 Problem Statement

We first introduce notation. Let $\mathcal{X}$ denote the feature space and $\mathcal{Y} = \{1, \dots, C\}$ denote the label space, where $C \in \mathbb{N}$. Let $(X, Y)$ be random variables in $\mathcal{X} \times \mathcal{Y}$, and let $P$ be the probability measure that governs $(X, Y)$. We use $P_{\mathcal{X}}$ to denote the marginal distribution of $P$ on $\mathcal{X}$.

To define the standard classification setting, denote the 0-1 loss $L_{01}: \mathcal{Y} \times \mathcal{Y} \rightarrow \mathbb{R}$ by $L_{01}(\hat{y}, y) = \begin{cases} 1, & \text{if } y \neq \hat{y} \\ 0, & \text{otherwise.} \end{cases}$ The risk of a classifier $h$ is $\mathbb{E}\left[L_{01}(h(X), Y)\right]$, and the goal of classification is to learn a classifier from training data whose risk is as close as possible to the Bayes risk $\min_{h \in \mathcal{H}} \mathbb{E}\left[L_{01}(h(X), Y)\right]$, where $\mathcal{H} = \left\{\text{measurable } h: \mathcal{X} \rightarrow \mathcal{Y}\right\}$ is the set of all measurable functions from $\mathcal{X}$ to $\mathcal{Y}$.

We now describe an approach to classification based on label embedding, which represents labels as vectors in the $n$-dimensional space $\mathbb{C}^n$. In particular, let $G$ be an $n \times C$ matrix with unit-norm columns, called the embedding matrix, having coherence $\lambda^G < 1$. The columns of $G$ are denoted by $g_1, g_2, \dots, g_C$, and the column $g_i$ is used to embed the $i$-th label.

Given an embedding matrix $G$, the original $C$-class classification problem may be reduced to a multi-output regression problem, where the classification instance $(x, y)$ translates to the regression instance $(x, g_y)$. Given training data $\{(x_i, y_i)\}$ for classification, we create training data $\{(x_i, g_{y_i})\}$ for regression, and apply any algorithm for multi-output regression to learn a regression function $f: \mathcal{X} \to \mathbb{C}^n$.

At test time, given a test point $x$, a label $y$ is obtained by taking the nearest neighbor to $f(x)$ among the columns of $G$. In particular, define the decoding function $\beta^G: \mathbb{C}^n \rightarrow \mathcal{Y}$, $\beta^G(p) = \min\left\{\operatorname{arg\,min}_{i \in \mathcal{Y}} \left\lVert p - g_i \right\rVert_2\right\}$, where $p$ is the output of a regression model. Then the label assigned to $x$ is $\beta^G(f(x))$. In the rare case where multiple nearest neighbors exist for $p$, this ambiguity is resolved by taking the label with the smallest index.
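As a minimal sketch (numpy; labels are 0-indexed here, unlike the 1-indexed labels in the text), the decoder $\beta^G$ can be implemented as a nearest-column search; np.argmin returns the first minimizer, which matches the rule of breaking ties toward the smallest label index.

```python
import numpy as np

def decode(p: np.ndarray, G: np.ndarray) -> int:
    """beta^G(p): index of the column of G nearest to p (ties -> smallest index)."""
    dists = np.linalg.norm(G - p[:, None], axis=0)   # distance from p to each column g_i
    return int(np.argmin(dists))
```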

Thus, label embedding searches over classifiers of the form $\beta^G \circ f$, where $f \in \mathcal{F} = \left\{\text{all measurable } f: \mathcal{X} \rightarrow \mathbb{C}^n\right\}$. Fortunately, according to the following result, no expressiveness is lost by considering classifiers of this form.

Proposition 2.

$\beta^G \circ \mathcal{F} = \mathcal{H}$.

Proof.

This stems from the following facts: (i) for any given $f \in \mathcal{F}$, the function $\beta^G \circ f$ is also measurable, (ii) for all $h \in \mathcal{H}$, the function $f(x) = g_{h(x)}$ ensures that $\beta^G \circ f = h$, and (iii) such $f$ are measurable because, for all measurable sets $\mathcal{S}$, $f^{-1}(\mathcal{S}) = \bigcup_{i: g_i \in \mathcal{S}} h^{-1}(i)$. ∎

It follows that $\min_{f \in \mathcal{F}} \mathbb{E}_P\left[L_{01}(\beta^G(f(X)), Y)\right]$ is the Bayes risk for classification as defined earlier. This allows us to focus our attention on learning $f: \mathcal{X} \to \mathbb{C}^n$. Toward that end, we now formalize notions of loss and risk for the task of learning a multi-output function $f$ for multiclass classification via label embedding.

Definition 3.

Let $\mathcal{L}: \mathbb{C}^n \times \mathcal{Y} \rightarrow \mathbb{R}$ be a loss function. Define the $\mathcal{L}$-risk of $f$ with distribution $P$ to be $\mathcal{R}_{\mathcal{L},P}: \mathcal{F} \rightarrow \mathbb{R}$, $\mathcal{R}_{\mathcal{L},P}(f) := \mathbb{E}_P\left[\mathcal{L}(f(X), Y)\right]$, and the $\mathcal{L}$-Bayes risk to be $\mathcal{R}_{\mathcal{L},P}^* := \inf_{f \in \mathcal{F}} \mathcal{R}_{\mathcal{L},P}(f)$.

Using this notation, the target loss for learning is the loss function $L^G: \mathbb{C}^n \times \mathcal{Y} \rightarrow \mathbb{R}$ associated with the embedding matrix $G$, defined by $L^G(p, y) := L_{01}(\beta^G(p), y)$. By Prop. 2, the $L^G$-risk of $f$ is the usual classification risk of the associated classifier $h = \beta^G \circ f$. Given $G$, we'd like to find an $f$ that minimizes $\mathcal{R}_{L^G,P}(f)$, or in other words, an $f$ that makes the excess risk $\mathcal{R}_{L^G,P}(f) - \mathcal{R}_{L^G,P}^*$ as small as possible.

While the target loss $L^G$ defines our learning goal, it is not practical as a training objective because of its discrete nature. Therefore, for learning purposes, Hsu et al., (2009); Akata et al., (2013); Yu et al., (2014) suggest a surrogate loss, namely, the squared distance between $f(x)$ and $g_y$. More precisely, for a given embedding matrix $G$, define $\ell^G: \mathbb{C}^n \times \mathcal{Y} \rightarrow \mathbb{R}$ as $\ell^G(p, y) := \frac{1}{2}\left\lVert p - g_y \right\rVert_2^2$. This surrogate allows us to learn $f$ by applying existing multi-output regression algorithms, as we describe next. Thus, the label embedding learning problem is to learn $f$ with small surrogate excess risk. Our subsequent analysis will connect $\mathcal{R}_{L^G,P} - \mathcal{R}_{L^G,P}^*$ to $\mathcal{R}_{\ell^G,P} - \mathcal{R}_{\ell^G,P}^*$.

4.2 Learning Algorithms

The theory presented below motivates the use of a $G$ with low coherence, and an $f$ with small surrogate risk $\mathcal{R}_{\ell^G,P}$. In light of this theory, the problem statement just introduced leads to a natural meta-algorithm for label embedding. This meta-algorithm should not be considered novel, as its essential ingredients have been previously introduced (Akata et al., 2013; Rodríguez et al., 2018). Our contribution is the theory below, and the experimental assessment of the meta-algorithm with low-coherence matrices for extreme multiclass classification.

Let the training dataset $\left\{(x_i, y_i)\right\}_{i=1}^{N}$ be realizations of $(X, Y)$. Given an embedding matrix $G = \left[g_1, g_2, \dots, g_C\right]$, replace each label $y_i$ with its embedding vector $g_{y_i}$ to form the regression dataset $\left\{(x_i, g_{y_i})\right\}_{i=1}^{N}$. Next, train a regression model $f$ on the regression dataset. For example, this may be achieved by solving $\min_{f \in \mathcal{F}_0} \frac{1}{N}\sum_{i=1}^{N} \ell^G(f(x_i), y_i)$, where $\mathcal{F}_0 \subset \mathcal{F}$ is some model class, such as a multilayer perceptron with $n$ nodes in the output layer.

$G$ can either be fixed or trained jointly with $f$. Our analysis is independent of model training, and thus applies to either case. Given a new data point $x$, to assign a label we search for the embedding vector $g_y$ nearest to $f(x)$ and annotate $x$ with the label $y$. The full meta-algorithm is presented in Alg. 1.

Algorithm 1 Label Embedding
1:  Input: a model class $\mathcal{F}_0 \subset \mathcal{F}$, the dataset $\mathcal{D} = \left\{(x_i, y_i)\right\}_{i=1}^{N}$, and an embedding matrix $G \in \mathbb{C}^{n \times C}$ whose columns are $g_1, g_2, \dots, g_C$.
2:  Form the regression dataset $\mathcal{D}_r = \left\{(x_i, g_{y_i})\right\}_{i=1}^{N}$.
3:  Train a regression model $f$ on $\mathcal{D}_r$, where $G$ may be fixed or learned.
4:  Return: $\beta^G \circ f$.
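Below is a minimal sketch of Algorithm 1 for a fixed, real-valued embedding matrix, using an MLP regressor in PyTorch; the layer sizes, optimizer, and full-batch loop are illustrative choices rather than the configuration used in our experiments. A complex-valued $G$ can be handled, for instance, by regressing onto the stacked real and imaginary parts of its columns, since this preserves the distances used by $\beta^G$.

```python
import torch
import torch.nn as nn

def train_label_embedding(X, y, G, hidden=512, epochs=10, lr=1e-3):
    # X: (N, D) float tensor; y: (N,) long tensor with labels in {0, ..., C-1};
    # G: (n, C) float tensor with unit-norm columns (fixed during training here).
    n = G.shape[0]
    targets = G.T[y]                                        # (N, n): g_{y_i} for each example
    model = nn.Sequential(nn.Linear(X.shape[1], hidden), nn.ReLU(),
                          nn.Linear(hidden, n))
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = 0.5 * ((model(X) - targets) ** 2).sum(dim=1).mean()   # empirical ell^G risk
        loss.backward()
        opt.step()
    return model

def predict(model, X, G):
    with torch.no_grad():
        P = model(X)                                        # (N, n) regression outputs
    return torch.cdist(P, G.T).argmin(dim=1)                # beta^G: nearest column of G
```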

4.3 Excess Risk Bound

We present an excess risk bound, which relates the excess surrogate risk to the excess target risk. This bound justifies the use of the squared error surrogate, and also reveals a trade-off between the reduction in dimensionality (as reflected by $\lambda^G$) and the potential penalty in accuracy.

To state the bound, define the class posterior $\eta(x) = \left(\eta_1(x), \dots, \eta_C(x)\right)$, where $\eta_i(x) = P_{Y \mid X = x}(i)$. Define $d(x) = \max_i \eta_i(x) - \max_{i \notin \operatorname{arg\,max}_j \eta_j(x)} \eta_i(x)$, which is a measure of the “noise” at $x$. We discuss this quantity further after the main result, which we now state.
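As a small illustration (not taken from the paper), $d(x)$ can be computed directly from a posterior vector; the helper below assumes at least one label lies outside the argmax set.

```python
import numpy as np

def noise_gap(eta: np.ndarray) -> float:
    """d(x): gap between the largest posterior probability and the largest
    probability among labels outside the argmax set."""
    m = eta.max()
    below = eta[eta < m]          # assumes not all entries of eta are equal
    return float(m - below.max())

print(noise_gap(np.array([0.5, 0.3, 0.2])))   # 0.2
```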

Theorem 4.

Let $G$ be an embedding matrix with coherence $\lambda^G$. Then

$$
\begin{aligned}
&\mathcal{R}_{L^G,P}(f) - \mathcal{R}^*_{L^G,P} && (1)\\
&\leq \inf_{r > \frac{2\lambda^G}{1+\lambda^G}} \Biggl\{ \frac{2\lambda^G}{1+\lambda^G}\, P_{\mathcal{X}}(d(X) < r) && (2)\\
&\quad + \sqrt{\frac{4 - 2\lambda^G}{(1+\lambda^G)^2}\, P_{\mathcal{X}}(d(X) < r)\left(\mathcal{R}_{\ell^G,P}(f) - \mathcal{R}^*_{\ell^G,P}\right)} && (3)\\
&\quad + \frac{4 - 2\lambda^G}{\left(r(1+\lambda^G) - 2\lambda^G\right)^2}\left(\mathcal{R}_{\ell^G,P}(f) - \mathcal{R}^*_{\ell^G,P}\right) \Biggr\}. && (4)
\end{aligned}
$$

We begin by sketching the proof, and after the sketch offer a discussion of the bound. Full proofs and associated lemmas are provided in the Appendix. Before delving into the proof sketch, the concept of conditional risk is needed.

Definition 5.

Let $\mathcal{L}: \mathbb{C}^n \times \mathcal{Y} \rightarrow \mathbb{R}$ be a loss function and $P$ be a probability measure on $\mathcal{X} \times \mathcal{Y}$. For a point $x \in \mathcal{X}$, define the conditional risk at $x$ as $C_{\mathcal{L},x}: \mathbb{C}^n \rightarrow \mathbb{R}$, $C_{\mathcal{L},x}(p) = \mathbb{E}_{y \sim P_{Y \mid X = x}} \mathcal{L}(p, y)$, and $C_{\mathcal{L},x}^* = \inf_{p \in \mathbb{C}^n} C_{\mathcal{L},x}(p)$.

The conditional risk represents the expected value of the loss given the model output $p$ and a feature vector $x$. Note that $\mathcal{R}_{\mathcal{L},P}(f) = \mathbb{E}_{X \sim P_{\mathcal{X}}} C_{\mathcal{L},X}(f(X))$. For brevity, we introduce the notation $C_{1,x}(p) = C_{L^G,x}(p) - C^*_{L^G,x}$ and $C_{2,x}(p) = C_{\ell^G,x}(p) - C^*_{\ell^G,x}$.

Proof sketch for Theorem 4.

We begin by demonstrating in Lemma 11 that there exists a unique $p_x^*$ such that $C^*_{\ell^G,x} = C_{\ell^G,x}(p_x^*)$. Here, $p_x^*$ represents the optimal model output at a specific point $x$. We establish in Lemma 12 that for all $x \in \mathcal{X}$, all $j, k \in [C]$, and all $p \in \mathbb{C}^n$,

$$\frac{(1+\lambda^G)\left(\eta_k(x) - \eta_j(x)\right) - 2\lambda^G}{\sqrt{2 - 2\lambda^G}} > \left\lVert p_x^* - p \right\rVert_2 \implies \left\lVert p - g_j \right\rVert_2 > \left\lVert p - g_k \right\rVert_2.$$

This implies that if a model output $p$ is sufficiently close to $p_x^*$, and the probability $\eta_k(x)$ of label $k$ exceeds the probability $\eta_j(x)$ of label $j$ by a sufficient margin, then $p$ is closer to $g_k$ than to $g_j$.

By leveraging the above property along with the convexity of $C_{2,x}$, we demonstrate in Lemma 13 that for all $x \in \mathcal{X}$, all $r > \frac{2\lambda^G}{1+\lambda^G}$, and all $p \in \mathbb{C}^n$,

$$C_{2,x}(p) < \frac{\left((1+\lambda^G)r - 2\lambda^G\right)^2}{4 - 2\lambda^G} \implies C_{1,x}(p) < r. \qquad (7)$$

This means that a small $C_{2,x}(p)$ leads to a small $C_{1,x}(p)$, up to the tolerance $r$.

The final step involves expressing the excess risk as

$$
\begin{aligned}
\mathcal{R}_{L^G,P}(f) - \mathcal{R}_{L^G,P}^* &= \int_{\mathcal{X}} C_{1,x}(f(x))\,dP_{\mathcal{X}}(x) && (8)\\
&= \int_{x: d(x) < r} C_{1,x}(f(x))\,dP_{\mathcal{X}}(x) + \int_{x: d(x) \geq r} C_{1,x}(f(x))\,dP_{\mathcal{X}}(x). && (9)
\end{aligned}
$$

By plugging (7) into the integrals, the first integral leads to terms (2) and (3), and the second integral leads to term (4). ∎

As mentioned earlier, the goal of learning is to minimize the excess target risk $\mathcal{R}_{L^G,P}(f) - \mathcal{R}^*_{L^G,P}$. The theorem shows that this goal can be achieved, up to the first (irreducible) term, by minimizing the excess surrogate risk $\mathcal{R}_{\ell^G,P}(f) - \mathcal{R}^*_{\ell^G,P}$. The excess surrogate risk can be driven to zero by any consistent algorithm for multi-output regression with the squared error loss.

The quantity $d(x)$ can be viewed as a measure of the noise (inherent in the joint distribution of $(X, Y)$) at a point $x$. While $\max_i \eta_i(x)$ represents the probability of the most likely label, $\max_{i \notin \operatorname{arg\,max}_j \eta_j(x)} \eta_i(x)$ represents the probability of the second most likely label. A large $d(x)$ implies that $\operatorname{arg\,max}_i \eta_i(x)$ is, with high confidence, the correct prediction at $x$. In contrast, if $d(x)$ is small, our confidence in predicting $\operatorname{arg\,max}_i \eta_i(x)$ is reduced, as the second most likely label has a similar probability of being correct.

As pointed out by Ramaswamy and Agarwal, (2012), any convex surrogate loss function operating on a dimension less than $C-1$ inevitably suffers an irreducible error, which, in the label embedding framework, is measured by the coherence $\lambda^G$ of the embedding matrix $G$. From the proof outline, we can see that the model $f^*$ minimizing the $\ell^G$-risk $\mathcal{R}_{\ell^G,P}(f)$ may make a suboptimal prediction at a point $x$ where $d(x) < \frac{2\lambda^G}{1+\lambda^G}$. Conversely, when $d(x) > \frac{2\lambda^G}{1+\lambda^G}$, $f^*$ will always make the optimal prediction at $x$. Given a classification problem with $C$ classes, a larger embedding dimension $n$ leads to a smaller coherence $\lambda^G$, so that $d(X) > \frac{2\lambda^G}{1+\lambda^G}$ on a larger region of $\mathcal{X}$, at the cost of increased computational complexity. On the other hand, choosing a smaller $n$ makes $d(X) < \frac{2\lambda^G}{1+\lambda^G}$ on a larger region of $\mathcal{X}$, increasing the first term in Theorem 4. This interpretation highlights the balance between the benefits of dimensionality reduction and the potential impact on prediction accuracy, as a function of the coherence $\lambda^G$ of the embedding matrix and the noise measure $d(x)$.
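The following back-of-the-envelope calculation (numpy; $C = 2000$ and the listed embedding dimensions are arbitrary illustrative choices) shows how the noise threshold $\frac{2\lambda^G}{1+\lambda^G}$ of Theorem 4 shrinks as $n$ grows, using the empirical coherence of Rademacher matrices.

```python
import numpy as np

rng = np.random.default_rng(0)
C = 2000                                           # illustrative number of classes
for n in (64, 256, 1024):
    A = rng.choice([-1.0, 1.0], size=(n, C)) / np.sqrt(n)
    gram = np.abs(A.T @ A)
    np.fill_diagonal(gram, 0.0)
    lam = gram.max()                               # empirical coherence lambda^G
    print(f"n = {n:4d}: coherence ~ {lam:.3f}, "
          f"noise threshold 2*lam/(1+lam) ~ {2 * lam / (1 + lam):.3f}")
```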

4.4 Improvement Under Low Noise

While Theorem 4 holds universally (for all distributions $P$), by considering a specific subset of distributions we can derive a more conventional form of the excess risk bound. As a direct consequence of Theorem 4, under the multiclass extension of the Massart noise condition (Massart and Nédélec, 2006), which requires $d(X) > c$ with probability 1 for some $c > 0$, the first and second terms in Theorem 4 vanish. In this case, we recover a conventional excess risk bound, where $\mathcal{R}_{L^G,P}(f) - \mathcal{R}^*_{L^G,P}$ tends to 0 with $\mathcal{R}_{\ell^G,P}(f) - \mathcal{R}^*_{\ell^G,P}$. We now formalize this.

Definition 6 (Multiclass Massart Noise Condition).

The distribution $P$ on $\mathcal{X} \times \mathcal{Y}$ is said to satisfy the Multiclass Massart Noise Condition if and only if there exists $c > 0$ such that $P_{\mathcal{X}}(d(X) > c) = 1$.

Corollary 7.

Assume $P$ satisfies the Multiclass Massart Noise Condition. If $\lambda^G \in \left(0, \frac{\operatorname{ess\,inf} d}{2 - \operatorname{ess\,inf} d}\right)$, then

$$\mathcal{R}_{L^G,P}(f) - \mathcal{R}^*_{L^G,P} \leq \frac{4 - 2\lambda^G}{\left((1+\lambda^G)\operatorname{ess\,inf} d - 2\lambda^G\right)^2}\left(\mathcal{R}_{\ell^G,P}(f) - \mathcal{R}^*_{\ell^G,P}\right), \qquad (10)$$

where $\operatorname{ess\,inf} d$ is the essential infimum of $d$, i.e., $\operatorname{ess\,inf} d = \sup\left\{a \in \mathbb{R}: P_{\mathcal{X}}(d(X) < a) = 0\right\}$.
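As an illustration of Corollary 7, one can ask how large $n$ must be for a Rademacher embedding to satisfy $\lambda^G < \frac{\operatorname{ess\,inf} d}{2 - \operatorname{ess\,inf} d}$, using the coherence estimate $\sqrt{c_0 \log C / n}$ from Section 3; the constant $c_0$ is not specified there, so the value used below is a hypothetical placeholder.

```python
import numpy as np

def required_dimension(C: int, c: float, c0: float = 4.0) -> int:
    """Rough n such that sqrt(c0 * log(C) / n) < c / (2 - c), i.e. the Rademacher
    coherence estimate falls below the threshold of Corollary 7
    (c = essinf d; c0 = 4.0 is a hypothetical placeholder constant)."""
    target_coherence = c / (2.0 - c)
    return int(np.ceil(c0 * np.log(C) / target_coherence ** 2))

print(required_dimension(C=10_000, c=0.2))   # rough n for C = 10,000 classes, essinf d = 0.2
```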

5 Experiments

In this section, we present an experimental evaluation of the label embedding framework using low-coherence embedding matrices and the squared loss on extreme multiclass classification problems where $n \ll C$. We empirically investigate the connection between the coherence of the embedding matrix and the accuracy. We also verify the computational advantage of label embedding by comparing it against PD-Sparse (Yen et al., 2016) and PPD-Sparse (Yen et al., 2017b), which are, to the best of our knowledge, the most computationally efficient methods for extreme classification. Although methods like AttentionXML (You et al., 2019), LightXML (Jiang et al., 2021), XR-Transformer (Zhang et al., 2021), X-Transformer (Wei et al., 2019), and ELIAS (Gupta et al., 2022) have been designed for language datasets and leverage language models like BERT, our focus is to assess label embedding as a general-purpose method. Thus, we compare our method with more general approaches, including Parabel (Prabhu et al., 2018), PD-Sparse (Yen et al., 2016), PPD-Sparse (Yen et al., 2017a), and AnnexML (Tagami, 2017), which are applicable across a broader range of multiclass extreme classification problems.

5.1 Experiment Setup

We conduct experiments on three large-scale datasets, DMOZ, LSHTC1, and ODP, which are extensively used for benchmarking extreme classification algorithms. The details of these datasets are provided in Table 1, with DMOZ and LSHTC1 available from Yen et al., (2016) and ODP from Medini et al., (2019). We adapt our method to fully connected neural networks implemented in PyTorch, with a 2-layer fully connected neural network used for the LSHTC1 and DMOZ datasets and a 4-layer fully connected neural network for the ODP dataset. We tune the hyperparameters for all models on a held-out dataset. For our embedding matrices, we employ the Rademacher matrix, the Gaussian matrix, the complex Gaussian matrix, and the matrix proposed by Nelson and Temlyakov, (2011), which has demonstrated superior empirical performance in our tests. For the competing methods, we use the hyperparameters suggested in their respective papers or accompanying code.

Dataset $N_{\text{train}}$ $N_{\text{test}}$ $D$ $C$
LSHTC1 83805 5000 328282 12046
DMOZ 335068 38340 561127 11879
ODP 975936 493014 493014 103361
Table 1: Summary of the datasets used in the experiments. Here, $N_{\text{train}}$ is the number of training data points, $N_{\text{test}}$ the number of test data points, $D$ the number of features, and $C$ the number of classes.

We compare our method against the following state-of-the-art methods:

  • PD-Sparse (Yen et al., 2016): an efficient solver designed to exploit the sparsity in extreme classification.

  • PPD-Sparse (Yen et al., 2017a): a multi-process extension of PD-Sparse (Yen et al., 2016).

  • Parabel (Prabhu et al., 2018): a tree-based method which builds a label hierarchy.

  • AnnexML (Tagami, 2017): a method which constructs a $k$-nearest neighbor graph of the label vectors and attempts to reproduce the graph structure in a lower-dimensional feature space.

  • MLP (CE Loss): Standard multilayer perceptron classifier with cross-entropy loss.

  • MLP (SE Loss): Standard multilayer perceptron classifier with squared error loss.

We experiment with the following types of embedding matrices; a construction sketch for the three random types appears after this list:

  • Rademacher: entries are sampled i.i.d. from the uniform distribution on $\left\{\frac{1}{\sqrt{n}}, -\frac{1}{\sqrt{n}}\right\}$.

  • Gaussian: entries are sampled i.i.d. from $\mathcal{N}(0, \frac{1}{n})$. We normalize each column to have unit norm.

  • Complex Gaussian: the real and imaginary parts of each entry are sampled i.i.d. from $\mathcal{N}(0, \frac{1}{2n})$. We normalize each column to have unit norm.

  • Nelson (Nelson and Temlyakov, 2011): a deterministic construction of low-coherence complex matrices. If $r$ is an integer, $n > r$ is a prime number, and $n^r \geq C$, then the coherence of the constructed matrix is at most $\frac{r-1}{\sqrt{n}}$. We choose $r = 2$ in the experiments.
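The sketch below (numpy; the dimensions are illustrative) generates the three random constructions listed above; the deterministic Nelson construction is omitted.

```python
import numpy as np

def rademacher(n, C, rng):
    return rng.choice([-1.0, 1.0], size=(n, C)) / np.sqrt(n)

def gaussian(n, C, rng):
    A = rng.normal(0.0, 1.0 / np.sqrt(n), size=(n, C))
    return A / np.linalg.norm(A, axis=0, keepdims=True)       # unit-norm columns

def complex_gaussian(n, C, rng):
    A = rng.normal(0.0, np.sqrt(0.5 / n), size=(n, C)) \
        + 1j * rng.normal(0.0, np.sqrt(0.5 / n), size=(n, C))
    return A / np.linalg.norm(A, axis=0, keepdims=True)       # unit-norm columns

rng = np.random.default_rng(0)
G = complex_gaussian(64, 1000, rng)                           # illustrative n x C matrix
print(np.allclose(np.linalg.norm(G, axis=0), 1.0))            # True: columns have unit norm
```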

While our theoretical framework is adaptable to any form of embedding, whether trained, fixed, or derived from auxiliary information, we focus on fixed embeddings in our empirical studies. This choice stems from the absence of a standardized approach to train or construct embeddings with auxiliary data. By centering on fixed embeddings, we ensure a controlled evaluation, minimizing confounding factors and emphasizing the role of coherence of the embeddings.

All neural network training is performed on a single NVIDIA A40 GPU with 48GB of memory. For our neural network models, we explore different embedding dimensions and provide figures showing the relationship between the coherence of the embedding matrix and the accuracy. The full details of the experiments are presented in the appendix.

5.2 Experimental Results

The experimental results presented in Tables 2, 3, and 4 highlight the superior performance of our proposed method across various datasets. In Tables 2, 3, and 4, the first column is the name of the method or embedding type, the second column is the embedding dimension for label embedding methods, and the third column is the mean and standard deviation of accuracy over 5 randomized repetitions. We highlight the best-performing method for each dataset.

We plot the accuracies as the coherence of the embedding matrix decreases in Figures 1, 2, and 3 for the LSHTC1, DMOZ, and ODP datasets, respectively. Figures 1, 2, and 3 demonstrate a negative correlation between the coherence of the embedding matrix and the accuracy, confirming our theoretical analysis.

Table 2: Accuracy on LSHTC1 Dataset
Method Dim Acc. (Mean ± Std)
Gaussian 32 0.1871 ± 0.0022
Gaussian 64 0.2464 ± 0.0023
Gaussian 128 0.2826 ± 0.0019
Gaussian 256 0.2997 ± 0.0016
Gaussian 512 0.3061 ± 0.0022
Complex Gaussian 32 0.2477 ± 0.0031
Complex Gaussian 64 0.2842 ± 0.0022
Complex Gaussian 128 0.2997 ± 0.0021
Complex Gaussian 256 0.3058 ± 0.0013
Complex Gaussian 512 0.3098 ± 0.0014
Rademacher 32 0.1866 ± 0.0020
Rademacher 64 0.2477 ± 0.0017
Rademacher 128 0.2830 ± 0.0027
Rademacher 256 0.3006 ± 0.0021
Rademacher 512 0.3066 ± 0.0022
Nelson 113 0.2966 ± 0.0012
Nelson 127 0.2971 ± 0.0013
Nelson 251 0.3050 ± 0.0016
Nelson 509 0.3108 ± 0.0015
MLP (CE Loss) N/A 0.2630 ± 0.0036
MLP (SE Loss) N/A 0.2803 ± 0.0017
AnnexML N/A 0.2906 ± 0.0035
Parabel N/A 0.2224 ± 0.0000
PD-Sparse N/A 0.2204 ± 0.0006
PPD-Sparse N/A 0.2260 ± 0.0011
Table 3: Accuracy on DMOZ Dataset
Method Dim Acc. (Mean ± Std)
Gaussian 32 0.3866 ± 0.0018
Gaussian 64 0.4416 ± 0.0001
Gaussian 128 0.4676 ± 0.0005
Gaussian 256 0.4758 ± 0.0012
Gaussian 512 0.4792 ± 0.0007
Complex Gaussian 32 0.4411 ± 0.0004
Complex Gaussian 64 0.4678 ± 0.0007
Complex Gaussian 128 0.4751 ± 0.0009
Complex Gaussian 256 0.4784 ± 0.0006
Complex Gaussian 512 0.4790 ± 0.0008
Rademacher 32 0.3841 ± 0.0017
Rademacher 64 0.4415 ± 0.0005
Rademacher 128 0.4660 ± 0.0004
Rademacher 256 0.4767 ± 0.0009
Rademacher 512 0.4782 ± 0.0006
Nelson 113 0.4757 ± 0.0006
Nelson 127 0.4771 ± 0.0004
Nelson 251 0.4801 ± 0.0004
Nelson 509 0.4802 ± 0.0006
MLP (CE Loss) N/A 0.4671 ± 0.0006
MLP (SE Loss) N/A 0.3838 ± 0.0014
AnnexML N/A 0.3982 ± 0.0014
Parabel N/A 0.3856 ± 0.0000
PD-Sparse N/A 0.3976 ± 0.0003
PPD-Sparse N/A 0.3930 ± 0.0007
Table 4: Performance on ODP Dataset
Method Dim Acc. (Mean ± Std)
Gaussian 128 0.1532 ± 0.0003
Gaussian 256 0.1762 ± 0.0007
Gaussian 512 0.1935 ± 0.0004
Gaussian 1024 0.2077 ± 0.0003
Gaussian 2048 0.2181 ± 0.0007
Gaussian 4096 0.2220 ± 0.0004
Rademacher 128 0.1533 ± 0.0001
Rademacher 256 0.1763 ± 0.0004
Rademacher 512 0.1936 ± 0.0004
Rademacher 1024 0.2075 ± 0.0005
Rademacher 2048 0.2181 ± 0.0006
Rademacher 4096 0.2213 ± 0.0008
Complex Gaussian 128 0.1763 ± 0.0004
Complex Gaussian 256 0.1939 ± 0.0007
Complex Gaussian 512 0.2074 ± 0.0008
Complex Gaussian 1024 0.2178 ± 0.0004
Complex Gaussian 2048 0.2214 ± 0.0006
Complex Gaussian 4096 0.2190 ± 0.0007
Nelson 331 0.2014 ± 0.0006
Nelson 509 0.2086 ± 0.0009
Nelson 1021 0.2191 ± 0.0006
Nelson 2039 0.2230 ± 0.0006
MLP (CE Loss) N/A 0.1399 ± 0.0011
MLP (SE Loss) N/A 0.1864 ± 0.0002
AnnexML N/A 0.2161 ± 0.0004
Parabel N/A 0.1709 ± 0.0000
PD-Sparse N/A Train Time > 50 hrs
PPD-Sparse N/A 0.1366 ± 0.0005
Figure 1: Coherence vs Accuracy for LSHTC1.
Figure 2: Coherence vs Accuracy for DMOZ.
Figure 3: Coherence vs Accuracy for ODP.

5.3 Computational Advantage

PD-Sparse (Yen et al., 2016) and PPD-Sparse (Yen et al., 2017b) are, to the best of our knowledge, among the most computationally efficient methods for training extreme classifiers. PD-Sparse and PPD-Sparse both solve an elastic net efficiently by leveraging the sparsity in extreme classification. To allow for a fair comparison, we also use the Rademacher embedding with an elastic net, which is a linear model regularized by both $\ell_1$ and $\ell_2$ penalties. To compare computational efficiency, we consistently set the embedding dimension to $n = 360$. In our distributed implementation, each node independently solves a subset of elastic nets with real-valued output, effectively spreading out the computation. We train the PD-Sparse method and the single-node Rademacher embedding on Intel Xeon Gold 6154 processors, equipped with 36 cores and 180GB of memory. Our distributed approach and the PPD-Sparse method, which is also implemented in a distributed fashion, are trained across 10 CPU nodes, harnessing 360 cores and 1.8TB of memory in total. In Table 5, we compare the Rademacher embedding with PD-Sparse and PPD-Sparse; as shown there, the Rademacher embedding not only trains faster but also achieves higher accuracy.
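The following is a minimal sketch of the linear variant used in this comparison, assuming scikit-learn's ElasticNet and a real-valued (e.g., Rademacher) embedding: one elastic net is fit per embedding coordinate, which is what makes the approach easy to distribute, since each node can own a disjoint subset of coordinates. The regularization values are illustrative, not those used in our experiments.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

def fit_embedding_elastic_net(X, y, G, alpha=1e-4, l1_ratio=0.5):
    # X: (N, D) features; y: (N,) integer labels; G: (n, C) real embedding matrix.
    targets = G.T[y]                                   # (N, n) regression targets g_{y_i}
    models = []
    for j in range(G.shape[0]):                        # one independent problem per coordinate
        m = ElasticNet(alpha=alpha, l1_ratio=l1_ratio)
        m.fit(X, targets[:, j])
        models.append(m)
    return models

def predict_labels(models, X, G):
    P = np.column_stack([m.predict(X) for m in models])             # (N, n) outputs
    dists = np.linalg.norm(P[:, :, None] - G[None, :, :], axis=1)   # (N, C) distances
    return dists.argmin(axis=1)                                     # beta^G decoding
```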

6 Conclusion and Future Work

We provide a theoretical analysis of label embedding methods in the context of extreme multiclass classification. Our analysis offers a deeper understanding of the trade-off between dimensionality reduction and the potential penalty in accuracy. We derive an excess risk bound that quantifies this trade-off in terms of the coherence of the embedding matrix, and show that the statistical penalty for label embedding vanishes under the multiclass Massart condition. Through extensive experiments, we demonstrate that label embedding with low-coherence matrices outperforms state-of-the-art techniques.

While our analysis focuses on excess risk, our reduction of classification to regression means that existing generalization error bounds (for multi-output regression with squared error loss) can be applied to analyze the generalization error in our context. The work of Zhai and Wang, (2018) shows that the generalization error grows linearly as the dimension of the output space grows. This suggests that smaller embedding dimensions lead to tighter control of the generalization error.

Future work may consider extensions to multilabel classification, online learning, zero-shot learning, and learning with rejection. Our theoretical framework provides a solid foundation for future research in this area.

Table 5: Accuracy and training time across datasets
Method LSHTC1 DMOZ ODP
PD-Sparse (Single Node)
  Accuracy 0.2204 0.3976 N/A
  Training Time 230s 829s > 50 hrs
PPD-Sparse (Distributed)
  Accuracy 0.2260 0.3930 0.1366
  Training Time 135s 656s 668s
Rademacher (Single Node)
  Accuracy 0.2342 0.4109 0.1511
  Training Time 55s 254s 2045s
Rademacher (Distributed)
  Accuracy 0.2338 0.4057 0.1506
  Training Time 14s 68s 350s

References

  • Abolghasemi et al., (2010) Abolghasemi, V., Ferdowsi, S., Makkiabadi, B., and Sanei, S. (2010). On optimization of the measurement matrix for compressive sensing. In 2010 18th European Signal Processing Conference, pages 427–431.
  • Achlioptas, (2001) Achlioptas, D. (2001). Database-friendly random projections. In Proceedings of the Twentieth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS ’01, page 274–281, New York, NY, USA. Association for Computing Machinery.
  • Akata et al., (2013) Akata, Z., Perronnin, F., Harchaoui, Z., and Schmid, C. (2013). Label-embedding for attribute-based classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • Babbar and Schölkopf, (2017) Babbar, R. and Schölkopf, B. (2017). DiSMEC: Distributed Sparse Machines for Extreme Multi-Label Classification. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining, WSDM ’17, page 721–729, New York, NY, USA. Association for Computing Machinery.
  • Babbar and Schölkopf, (2019) Babbar, R. and Schölkopf, B. (2019). Data scarcity, robustness and extreme multi-label classification. Machine Learning, 108.
  • Bartlett et al., (2006) Bartlett, P. L., Jordan, M. I., and McAuliffe, J. D. (2006). Convexity, classification, and risk bounds. Journal of the American Statistical Association, 101(473):138–156.
  • Bhatia et al., (2015) Bhatia, K., Jain, H., Kar, P., Varma, M., and Jain, P. (2015). Sparse Local Embeddings for Extreme Multi-label Classification. In Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc.
  • Chang et al., (2019) Chang, W.-C., Yu, H.-F., Zhong, K., Yang, Y., and Dhillon, I. S. (2019). A Modular Deep Learning Approach for Extreme Multi-label Text Classification. ArXiv, abs/1905.02331.
  • Choromanska and Langford, (2015) Choromanska, A. E. and Langford, J. (2015). Logarithmic Time Online Multiclass prediction. In Cortes, C., Lawrence, N., Lee, D., Sugiyama, M., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 28. Curran Associates, Inc.
  • Dahiya et al., (2021) Dahiya, K., Saini, D., Mittal, A., Dave, K., Jain, H., Agarwal, S., and Varma, M. (2021). DeepXML: A deep extreme multi-Label learning framework applied to short text documents. In ACM International Conference on Web Search and Data Mining.
  • Deng et al., (2018) Deng, H., Han, J., Zhao, Z., and Liu, H. (2018). Dual principal component pursuit: Improved analysis and efficient algorithms. In Advances in Neural Information Processing Systems, pages 1514–1524.
  • Evron et al., (2018) Evron, I., Moroshko, E., and Crammer, K. (2018). Efficient Loss-Based Decoding on Graphs for Extreme Classification. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc.
  • Guo et al., (2019) Guo, C., Mousavi, A., Wu, X., Holtmann-Rice, D. N., Kale, S., Reddi, S., and Kumar, S. (2019). Breaking the Glass Ceiling for Embedding-Based Classifiers for Large Output Spaces. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
  • Gupta et al., (2022) Gupta, N., Chen, P., Yu, H.-F., Hsieh, C.-J., and Dhillon, I. (2022). ELIAS: End-to-End Learning to Index and Search in Large Output Spaces. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A., editors, Advances in Neural Information Processing Systems, volume 35, pages 19798–19809. Curran Associates, Inc.
  • Hsu et al., (2009) Hsu, D., Kakade, S. M., Langford, J., and Zhang, T. (2009). Multi-Label Prediction via Compressed Sensing. In Proceedings of the 22nd International Conference on Neural Information Processing Systems, NIPS’09, page 772–780, Red Hook, NY, USA. Curran Associates Inc.
  • Jain et al., (2019) Jain, H., Balasubramanian, V., Chunduri, B., and Varma, M. (2019). Slice: Scalable Linear Extreme Classifiers trained on 100 Million Labels for Related Searches. In WSDM ’19, February 11–15, 2019, Melbourne, VIC, Australia. ACM. Best Paper Award at WSDM ’19.
  • Jernite et al., (2017) Jernite, Y., Choromanska, A., and Sontag, D. (2017). Simultaneous Learning of Trees and Representations for Extreme Classification and Density Estimation. In Precup, D. and Teh, Y. W., editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1665–1674. PMLR.
  • Jiang et al., (2021) Jiang, T., Wang, D., Sun, L., Yang, H., Zhao, Z., and Zhuang, F. (2021). LightXML: Transformer with Dynamic Negative Sampling for High-Performance Extreme Multi-label Text Classification. Proceedings of the AAAI Conference on Artificial Intelligence, 35(9):7987–7994.
  • Johnson and Lindenstraus, (1984) Johnson, W. B. and Lindenstraus, J. (1984). Extensions of Lipschitz mappings into Hilbert space. Contemporary mathematics, 26:189–206.
  • Khandagale et al., (2020) Khandagale, S., Xiao, H., and Babbar, R. (2020). Bonsai: Diverse and Shallow Trees for Extreme Multi-Label Classification. Mach. Learn., 109(11):2099–2119.
  • Le and Mikolov, (2014) Le, Q. and Mikolov, T. (2014). Distributed representations of sentences and documents. In International conference on machine learning, pages 1188–1196. PMLR.
  • Li et al., (2013) Li, G., Zhu, Z., Yang, D., Chang, L., and Bai, H. (2013). On projection matrix optimization for compressive sensing systems. IEEE Transactions on Signal Processing, 61(11):2887–2898.
  • Li et al., (2012) Li, S., Gao, F., Ge, G., and Zhang, S. (2012). Deterministic construction of compressed sensing matrices via algebraic curves. IEEE Transactions on Information Theory, 58(8):5035–5041.
  • Liu and Jia, (2020) Liu, X. and Jia, L. (2020). Deterministic construction of compressed sensing matrices via vector spaces over finite fields. IEEE Access, 8:203301–203308.
  • Lu et al., (2018) Lu, C., Li, H., and Lin, Z. (2018). Optimized projections for compressed sensing via direct mutual coherence minimization. Signal Processing, 151:45–55.
  • Massart and Nédélec, (2006) Massart, P. and Nédélec, E. (2006). Risk Bounds for Statistical Learning. The Annals of Statistics, 34(5):2326–2366.
  • Medini et al., (2019) Medini, T. K. R., Huang, Q., Wang, Y., Mohan, V., and Shrivastava, A. (2019). Extreme Classification in Log Memory using Count-Min Sketch: A Case Study of Amazon Search with 50M Products. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
  • Naidu et al., (2016) Naidu, R. R., Jampana, P., and Sastry, C. S. (2016). Deterministic compressed sensing matrices: Construction via euler squares and applications. IEEE Transactions on Signal Processing, 64(14):3566–3575.
  • Nelson and Temlyakov, (2011) Nelson, J. and Temlyakov, V. (2011). On the size of incoherent systems. Journal of Approximation Theory, 163(9):1238–1245.
  • Obermeier and Martinez-Lorenzo, (2017) Obermeier, R. and Martinez-Lorenzo, J. A. (2017). Sensing matrix design via mutual coherence minimization for electromagnetic compressive imaging applications. IEEE Transactions on Computational Imaging, 3(2):217–229.
  • Prabhu et al., (2018) Prabhu, Y., Kag, A., Harsola, S., Agrawal, R., and Varma, M. (2018). Parabel: Partitioned Label Trees for Extreme Classification with Application to Dynamic Search Advertising. In ACM International WWW Conference, pages 993–1002. International World Wide Web Conferences Steering Committee, International World Wide Web Conferences Steering Committee.
  • Prabhu and Varma, (2014) Prabhu, Y. and Varma, M. (2014). FastXML: A Fast, Accurate and Stable Tree-Classifier for Extreme Multi-Label Learning. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’14, page 263–272, New York, NY, USA. Association for Computing Machinery.
  • Ramaswamy and Agarwal, (2012) Ramaswamy, H. G. and Agarwal, S. (2012). Classification calibration dimension for general multiclass losses. In Pereira, F., Burges, C., Bottou, L., and Weinberger, K., editors, Advances in Neural Information Processing Systems, volume 25. Curran Associates, Inc.
  • Ramaswamy et al., (2018) Ramaswamy, H. G., Tewari, A., and Agarwal, S. (2018). Consistent algorithms for multiclass classification with an abstain option. Electronic Journal of Statistics, 12(1):530 – 554.
  • Rodríguez et al., (2018) Rodríguez, P., Bautista, M. A., Gonzàlez, J., and Escalera, S. (2018). Beyond one-hot encoding: Lower dimensional target embedding. Image and Vision Computing, 75:21–31.
  • Steinwart, (2007) Steinwart, I. (2007). How to compare different loss functions and their risks. Constructive Approximation, 26:225–287.
  • Tagami, (2017) Tagami, Y. (2017). AnnexML: Approximate Nearest Neighbor Search for Extreme Multi-Label Classification. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’17, page 455–464, New York, NY, USA. Association for Computing Machinery.
  • Wang et al., (2019) Wang, W., Zheng, V. W., Yu, H., and Miao, C. (2019). A survey of zero-shot learning: Settings, methods, and applications. ACM Trans. Intell. Syst. Technol., 10(2).
  • Wei et al., (2022) Wei, T., Mao, Z., Shi, J.-X., Li, Y.-F., and Zhang, M.-L. (2022). A survey on extreme multi-label learning.
  • Wei et al., (2019) Wei, T., Tu, W.-W., and Li, Y.-F. (2019). Learning for Tail Label Data: A Label-Specific Feature Approach. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, IJCAI’19, page 3842–3848. AAAI Press.
  • Wei et al., (2020) Wei, Z., Zhang, J., Xu, Z., and Liu, Y. (2020). Measurement matrix optimization via mutual coherence minimization for compressively sensed signals reconstruction. Mathematical Problems in Engineering, 2020:7979606.
  • Welch, (1974) Welch, L. (1974). Lower bounds on the maximum cross correlation of signals (corresp.). IEEE Transactions on Information Theory, 20(3):397–399.
  • Weston et al., (2010) Weston, J., Bengio, S., and Usunier, N. (2010). Large scale image annotation: learning to rank with joint word-image embeddings. Machine Learning, 81(1):21–35.
  • Weston et al., (2002) Weston, J., Chapelle, O., Elisseeff, A., Schölkopf, B., and Vapnik, V. (2002). Kernel dependency estimation. In Proceedings of the 15th International Conference on Neural Information Processing Systems, NIPS’02, page 897–904, Cambridge, MA, USA. MIT Press.
  • Xu, (2011) Xu, Z. (2011). Deterministic sampling of sparse trigonometric polynomials. Journal of Complexity, 27(2):133–140.
  • (46) Yen, I. E., Huang, X., Dai, W., Ravikumar, P., Dhillon, I., and Xing, E. (2017a). PPDsparse: A Parallel Primal-Dual Sparse Method for Extreme Classification. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’17, page 545–553, New York, NY, USA. Association for Computing Machinery.
  • (47) Yen, I. E., Huang, X., Dai, W., Ravikumar, P., Dhillon, I., and Xing, E. (2017b). PPDsparse: A Parallel Primal-Dual Sparse Method for Extreme Classification. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’17, page 545–553, New York, NY, USA. Association for Computing Machinery.
  • Yen et al., (2016) Yen, I. E.-H., Huang, X., Ravikumar, P., Zhong, K., and Dhillon, I. (2016). PD-Sparse : A Primal and Dual Sparse Approach to Extreme Multiclass and Multilabel Classification. In Balcan, M. F. and Weinberger, K. Q., editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 3069–3077, New York, New York, USA. PMLR.
  • You et al., (2019) You, R., Zhang, Z., Wang, Z., Dai, S., Mamitsuka, H., and Zhu, S. (2019). AttentionXML: Label Tree-Based Attention-Aware Deep Model for High-Performance Extreme Multi-Label Text Classification. Curran Associates Inc., Red Hook, NY, USA.
  • Yu et al., (2014) Yu, H.-F., Jain, P., Kar, P., and Dhillon, I. S. (2014). Large-scale multi-label learning with missing labels. In Proceedings of the 31st International Conference on International Conference on Machine Learning - Volume 32, ICML’14, page I–593–I–601. JMLR.org.
  • Yu et al., (2022) Yu, H.-F., Zhong, K., Zhang, J., Chang, W.-C., and Dhillon, I. S. (2022). PECOS: Prediction for enormous and correlated output spaces. Journal of Machine Learning Research.
  • Yu, (2011) Yu, N. Y. (2011). Deterministic compressed sensing matrices from multiplicative character sequences. In 2011 45th Annual Conference on Information Sciences and Systems, pages 1–5.
  • Zhai and Wang, (2018) Zhai, K. and Wang, H. (2018). Adaptive dropout with rademacher complexity regularization. In International Conference on Learning Representations.
  • Zhang et al., (2021) Zhang, J., Chang, W.-C., Yu, H.-F., and Dhillon, I. (2021). Fast Multi-Resolution Transformer Fine-tuning for Extreme Multi-label Text Classification. In Ranzato, M., Beygelzimer, A., Dauphin, Y., Liang, P., and Vaughan, J. W., editors, Advances in Neural Information Processing Systems, volume 34, pages 7267–7280. Curran Associates, Inc.
  • Zhang, (2004) Zhang, T. (2004). Statistical behavior and consistency of classification methods based on convex risk minimization. The Annals of Statistics, 32(1):56–85.
  • Zhou et al., (2014) Zhou, D.-X., Tang, J., and Yang, Y. (2014). Large scale multi-label learning with missing labels. In International Conference on Machine Learning, pages 593–601. PMLR.

Appendix A Proof of Theorem 4

Recall that $\beta^G(p) = \min\left\{\operatorname{arg\,min}_{i \in \mathcal{Y}} \left\lVert p - g_i \right\rVert_2\right\}$.

Lemma 8.

For all $p \in \mathbb{C}^n$ and all $x \in \mathcal{X}$, $C_{1,x}(p) = \max_i \eta_i(x) - \eta_{\beta^G(p)}(x)$.

Proof.

Recall that $L^G(p, y) = L_{01}(\beta^G(p), y) = \mathbbm{1}_{\left\{\min\left\{\operatorname{arg\,min}_{i \in \mathcal{Y}} \left\lVert p - g_i \right\rVert_2\right\} \neq y\right\}}$. The result follows from the observation that $C_{L^G,x}(p) = 1 - \eta_{\beta^G(p)}(x)$ and $C^*_{L^G,x} = 1 - \max_i \eta_i(x)$. ∎

Lemma 9.

$C_{2,x}$ is strictly convex and minimized at $p^*_x = G\eta(x) = \sum_{i=1}^{C} \eta_i(x) g_i$.

Proof.
\begin{align}
C_{\ell^{G},x}(p) &= \mathbb{E}_{y\sim P_{Y\mid X=x}}\,\ell_{2}\left(p,g_{y}\right) \tag{11}\\
&= \frac{1}{2}\sum_{i=1}^{C}\eta_{i}(x)\left\lVert p-g_{i}\right\rVert_{2}^{2} \tag{12}\\
&= \frac{1}{2}\sum_{i=1}^{C}\eta_{i}(x)\left\langle p-g_{i},p-g_{i}\right\rangle \tag{13}\\
&= \frac{1}{2}\left\langle p,p\right\rangle-\mathfrak{Re}\left\langle\sum_{i=1}^{C}\eta_{i}(x)g_{i},p\right\rangle+\frac{1}{2}\sum_{i=1}^{C}\eta_{i}(x)\left\lVert g_{i}\right\rVert_{2}^{2} \tag{14}\\
&= \frac{1}{2}\left\lVert p-\sum_{i=1}^{C}\eta_{i}(x)g_{i}\right\rVert_{2}^{2}+\frac{1}{2}\sum_{i=1}^{C}\eta_{i}(x)\left\lVert g_{i}\right\rVert_{2}^{2}-\frac{1}{2}\left\lVert\sum_{i=1}^{C}\eta_{i}(x)g_{i}\right\rVert_{2}^{2} \tag{15}
\end{align}

So $C_{2,x}(p)$ is strictly convex and minimized at $p_{x}^{*}=\sum_{i=1}^{C}\eta_{i}(x)g_{i}=G\eta(x)$. ∎
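The closed-form minimizer in Lemma 9 is easy to verify numerically. The sketch below uses a random unit-norm embedding and hypothetical dimensions, and checks that the conditional $\ell_{2}$-risk at $G\eta(x)$ is no larger than at random perturbations; it is only an illustration of the lemma.

```python
import numpy as np

rng = np.random.default_rng(0)
n, C = 16, 40                                   # hypothetical dimensions
G = rng.standard_normal((n, C))
G /= np.linalg.norm(G, axis=0, keepdims=True)   # unit-norm embedding vectors
eta = rng.dirichlet(np.ones(C))                 # a conditional distribution eta(x)

def cond_risk(p: np.ndarray) -> float:
    """Conditional ell_2-risk: (1/2) * sum_i eta_i * ||p - g_i||_2^2."""
    return 0.5 * float(np.sum(eta * np.linalg.norm(G - p[:, None], axis=0) ** 2))

p_star = G @ eta                                # claimed minimizer G eta(x)
for _ in range(1000):
    p = p_star + 0.1 * rng.standard_normal(n)   # random perturbation
    assert cond_risk(p_star) <= cond_risk(p)
```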

We will use $p^{*}_{x}$ to denote $G\eta(x)$ henceforward.

Lemma 10.

Let $V$ be a normed space with norm $\left\lVert\cdot\right\rVert_{V}$, and let $u:V\rightarrow\mathbb{R}$ be a strictly convex function minimized at $x^{*}$ with $u(x^{*})=0$. For all $x\in V$ and all $\delta_{0}>0$, if $u(x)<\delta:=\inf_{x^{\prime}:\left\lVert x^{\prime}-x^{*}\right\rVert_{V}=\delta_{0}}u(x^{\prime})$, then $\left\lVert x-x^{*}\right\rVert_{V}<\delta_{0}$.

Proof.

We first confirm that for all $t\in\left(0,1\right)$ and all $q\in V\setminus\left\{0\right\}$, $u(x^{*}+tq)<u(x^{*}+q)$:

\begin{align}
u\left(x^{*}+tq\right) &= u\left((1-t)x^{*}+t(x^{*}+q)\right) \tag{16}\\
&< (1-t)u(x^{*})+tu(x^{*}+q) \tag{17}\\
&= tu(x^{*}+q) \tag{18}\\
&< u(x^{*}+q). \tag{19}
\end{align}

Assume the opposite: for some $x\in V$ and some $\delta_{0}>0$, $u(x)<\delta=\inf_{x^{\prime}:\left\lVert x^{\prime}-x^{*}\right\rVert_{V}=\delta_{0}}u(x^{\prime})$ and $\left\lVert x-x^{*}\right\rVert_{V}\geq\delta_{0}$. Then

$$u(x)=u\left(x^{*}+(x-x^{*})\right)\geq u\left(x^{*}+\frac{\delta_{0}}{\left\lVert x-x^{*}\right\rVert_{V}}(x-x^{*})\right)\geq\delta,$$

which results in a contradiction. ∎

Lemma 11.

Fix $x\in\mathcal{X}$. For all $\delta_{0}>0$ and all $p\in\mathbb{C}^{n}$, $C_{2,x}(p)<\frac{1}{2}\delta_{0}^{2}\implies\left\lVert p-p^{*}_{x}\right\rVert_{2}<\delta_{0}$.

Proof.

Applying Lemma 10 with $u=C_{2,x}$, $V=\mathbb{C}^{n}$, $\left\lVert\cdot\right\rVert_{V}=\left\lVert\cdot\right\rVert_{2}$, and $x^{*}=p_{x}^{*}$, we have: for all $p\in\mathbb{C}^{n}$, $C_{2,x}(p)<\inf_{s:\left\lVert s-p_{x}^{*}\right\rVert_{2}=\delta_{0}}C_{2,x}(s)\implies\left\lVert p-p_{x}^{*}\right\rVert_{2}<\delta_{0}$. Furthermore,

\begin{align}
\inf_{s:\left\lVert s-p_{x}^{*}\right\rVert_{2}=\delta_{0}}C_{2,x}(s) &= \inf_{s:\left\lVert s-p_{x}^{*}\right\rVert_{2}=\delta_{0}}\left(\frac{1}{2}\left\langle s,s\right\rangle-\mathfrak{Re}\left\langle p^{*}_{x},s\right\rangle\right)-\left(\frac{1}{2}\left\langle p_{x}^{*},p_{x}^{*}\right\rangle-\mathfrak{Re}\left\langle p^{*}_{x},p_{x}^{*}\right\rangle\right) \tag{20}\\
&= \inf_{s:\left\lVert s-p_{x}^{*}\right\rVert_{2}=\delta_{0}}\frac{1}{2}\left\lVert s-p_{x}^{*}\right\rVert_{2}^{2} \tag{21}\\
&= \frac{1}{2}\delta_{0}^{2}. \tag{22}
\end{align}
∎

To facilitate our proofs, we denote $\lambda_{j,k}=\left\langle g_{j},g_{k}\right\rangle$, $\lambda_{j,k}^{\mathfrak{Re}}=\mathfrak{Re}\,\lambda_{j,k}$, and $\lambda^{G}=\max_{j\neq k}\left\lvert\lambda_{j,k}\right\rvert$.
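For concreteness, $\lambda^{G}$ is the largest off-diagonal entry (in absolute value) of the Gram matrix of the embedding vectors. The following NumPy sketch computes it for a hypothetical embedding matrix G whose columns are $g_{1},\dots,g_{C}$; the dimensions below are placeholders.

```python
import numpy as np

def coherence(G: np.ndarray) -> float:
    """Coherence lambda^G of an embedding matrix G with unit-norm columns:
    the largest |<g_j, g_k>| over all pairs j != k."""
    G = G / np.linalg.norm(G, axis=0, keepdims=True)  # ensure unit-norm columns
    gram = np.abs(np.conj(G).T @ G)                   # |<g_j, g_k>| for all pairs
    np.fill_diagonal(gram, 0.0)                       # ignore the diagonal (j = k)
    return float(gram.max())

# Example: random complex embeddings of C = 1000 labels into n = 64 dimensions.
rng = np.random.default_rng(0)
G = rng.standard_normal((64, 1000)) + 1j * rng.standard_normal((64, 1000))
print(coherence(G))  # typically well below 1 when n << C
```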

Lemma 12.

For all $x\in\mathcal{X}$, all $j,k\in\left[C\right]$, and all $p\in\mathbb{C}^{n}$,

$$\frac{(1+\lambda^{G})(\eta_{k}(x)-\eta_{j}(x))-2\lambda^{G}}{\sqrt{2-2\lambda^{G}}}>\left\lVert p_{x}^{*}-p\right\rVert_{2}\implies\left\lVert p-g_{j}\right\rVert_{2}>\left\lVert p-g_{k}\right\rVert_{2}.$$
Proof.

We first consider the general case where $p\neq p_{x}^{*}$. Write $p=p_{x}^{*}+\delta_{0}v$, where $v=\frac{p-p_{x}^{*}}{\left\lVert p-p_{x}^{*}\right\rVert_{2}}$ and $\delta_{0}=\left\lVert p-p_{x}^{*}\right\rVert_{2}$. Recall that $p_{x}^{*}=G\eta(x)$, and note that the assumed inequality immediately implies $\eta_{k}(x)>\eta_{j}(x)$.

\begin{align}
& \frac{(1+\lambda^{G})(\eta_{k}(x)-\eta_{j}(x))-2\lambda^{G}}{\sqrt{2-2\lambda^{G}}}>\delta_{0} \tag{23}\\
\implies{}& (1-\lambda^{G})(\eta_{k}(x)-\eta_{j}(x))-2\lambda^{G}\left(1-(\eta_{k}(x)-\eta_{j}(x))\right)>\delta_{0}\sqrt{2-2\lambda^{G}} \tag{24}\\
\implies{}& (1-\lambda^{G})(\eta_{k}(x)-\eta_{j}(x))-2\lambda^{G}\left(1-\eta_{k}(x)-\eta_{j}(x)\right)>\delta_{0}\sqrt{2-2\lambda^{G}} \tag{25}\\
\implies{}& \sqrt{1-\lambda^{G}}(\eta_{k}(x)-\eta_{j}(x))-\frac{2\lambda^{G}}{\sqrt{1-\lambda^{G}}}\left(1-\eta_{k}(x)-\eta_{j}(x)\right)>\sqrt{2}\,\delta_{0} \tag{26}\\
\implies{}& \sqrt{1-\lambda_{j,k}^{\mathfrak{Re}}}(\eta_{k}(x)-\eta_{j}(x))+\frac{1}{\sqrt{1-\lambda_{j,k}^{\mathfrak{Re}}}}\sum_{i\neq j,k}(\lambda_{i,k}^{\mathfrak{Re}}-\lambda_{i,j}^{\mathfrak{Re}})\eta_{i}(x)>\sqrt{2}\,\delta_{0} \tag{27}\\
\implies{}& (1-\lambda_{j,k}^{\mathfrak{Re}})(\eta_{k}(x)-\eta_{j}(x))+\sum_{i\neq j,k}(\lambda_{i,k}^{\mathfrak{Re}}-\lambda_{i,j}^{\mathfrak{Re}})\eta_{i}(x)>\delta_{0}\sqrt{2-2\lambda_{j,k}^{\mathfrak{Re}}} \tag{28}\\
\implies{}& \mathfrak{Re}\left\langle p_{x}^{*},g_{k}-g_{j}\right\rangle>\delta_{0}\left\lVert g_{j}-g_{k}\right\rVert_{2} \tag{29}\\
\implies{}& \mathfrak{Re}\left\langle p_{x}^{*},g_{k}-g_{j}\right\rangle>\mathfrak{Re}\left\langle\delta_{0}v,g_{j}-g_{k}\right\rangle \tag{30}\\
\implies{}& \mathfrak{Re}\left\langle p_{x}^{*}+\delta_{0}v,g_{k}\right\rangle>\mathfrak{Re}\left\langle p_{x}^{*}+\delta_{0}v,g_{j}\right\rangle \tag{31}\\
\implies{}& \left\lVert p-g_{j}\right\rVert_{2}^{2}>\left\lVert p-g_{k}\right\rVert_{2}^{2} \tag{32}
\end{align}

Inequality (27) follows from the fact that for all $i,i^{\prime}$, $\left\lvert\lambda_{i,i^{\prime}}^{\mathfrak{Re}}\right\rvert\leq\lambda^{G}$. Inequality (29) follows from $p_{x}^{*}=\sum_{i=1}^{C}\eta_{i}(x)g_{i}$ and $\left\lVert g_{j}-g_{k}\right\rVert_{2}=\sqrt{2-2\lambda_{j,k}^{\mathfrak{Re}}}$. Inequality (30) follows from the Cauchy–Schwarz inequality. In the last inequality (32), we use the fact that $\left\lVert g_{j}\right\rVert_{2}=\left\lVert g_{k}\right\rVert_{2}=1$.

Now let $p=p_{x}^{*}$, and suppose $\frac{(1+\lambda^{G})(\eta_{k}(x)-\eta_{j}(x))-2\lambda^{G}}{\sqrt{2-2\lambda^{G}}}>0$. Then

\begin{align}
& \left\lVert p-g_{j}\right\rVert_{2}^{2}-\left\lVert p-g_{k}\right\rVert_{2}^{2} \tag{34}\\
={}& 2\,\mathfrak{Re}\left\langle p_{x}^{*},g_{k}-g_{j}\right\rangle \tag{35}\\
={}& 2(1-\lambda_{j,k}^{\mathfrak{Re}})(\eta_{k}(x)-\eta_{j}(x))+2\sum_{i\neq j,k}(\lambda_{i,k}^{\mathfrak{Re}}-\lambda_{i,j}^{\mathfrak{Re}})\eta_{i}(x) \tag{36}\\
\geq{}& 2(1-\lambda^{G})(\eta_{k}(x)-\eta_{j}(x))-4\lambda^{G}\left(1-\eta_{k}(x)-\eta_{j}(x)\right) \tag{37}\\
\geq{}& 2(1-\lambda^{G})(\eta_{k}(x)-\eta_{j}(x))-4\lambda^{G}\left(1-(\eta_{k}(x)-\eta_{j}(x))\right) \tag{38}\\
={}& 2(1+\lambda^{G})(\eta_{k}(x)-\eta_{j}(x))-4\lambda^{G}>0. \tag{39}
\end{align}
∎
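The implication in Lemma 12 can also be spot-checked numerically. The sketch below uses a random real unit-norm embedding (a special case of vectors in $\mathbb{C}^{n}$), a peaked conditional distribution $\eta(x)$, and hypothetical dimensions; it is a sanity check of the statement, not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
n, C = 256, 500                                   # hypothetical dimensions

G = rng.standard_normal((n, C))
G /= np.linalg.norm(G, axis=0, keepdims=True)     # unit-norm embedding vectors
lam = np.abs(G.T @ G - np.eye(C)).max()           # lambda^G, the coherence

eta = np.full(C, 0.1 / (C - 1)); k = 7; eta[k] = 0.9   # peaked eta(x), sums to 1
p_star = G @ eta                                        # p*_x = G eta(x)

checked = 0
for j in range(C):
    if j == k:
        continue
    radius = ((1 + lam) * (eta[k] - eta[j]) - 2 * lam) / np.sqrt(2 - 2 * lam)
    if radius <= 0:
        continue                                        # premise of Lemma 12 cannot hold
    v = rng.standard_normal(n)
    p = p_star + 0.99 * radius * v / np.linalg.norm(v)  # any p with ||p - p*_x||_2 < radius
    assert np.linalg.norm(p - G[:, j]) > np.linalg.norm(p - G[:, k])
    checked += 1
print(f"coherence = {lam:.3f}; implication verified for {checked} pairs (j, k)")
```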

Lemma 13.

For all $x\in\mathcal{X}$, all $r>\frac{2\lambda^{G}}{1+\lambda^{G}}$, and all $p\in\mathbb{C}^{n}$,

$$C_{2,x}(p)<\frac{\left((1+\lambda^{G})r-2\lambda^{G}\right)^{2}}{4-2\lambda^{G}}\implies C_{1,x}(p)<r.$$

Proof.

By Lemma 11, $C_{2,x}(p)<\frac{\left((1+\lambda^{G})r-2\lambda^{G}\right)^{2}}{4-2\lambda^{G}}\implies\left\lVert p-p_{x}^{*}\right\rVert_{2}<\frac{(1+\lambda^{G})r-2\lambda^{G}}{\sqrt{2-\lambda^{G}}}$. Fix $x\in\mathcal{X}$, and recall that $\beta^{G}(p)=\min\left\{\operatorname*{arg\,min}_{i\in\mathcal{Y}}\left\lVert p-g_{i}\right\rVert_{2}\right\}$. We claim that

$$\left\lVert p-p_{x}^{*}\right\rVert_{2}<\frac{(1+\lambda^{G})r-2\lambda^{G}}{\sqrt{2-\lambda^{G}}}\implies\max_{i}\eta_{i}(x)-\eta_{\beta^{G}(p)}(x)<r.$$

Assume, for contradiction, that $\left\lVert p-p_{x}^{*}\right\rVert_{2}<\frac{(1+\lambda^{G})r-2\lambda^{G}}{\sqrt{2-\lambda^{G}}}$ and $\max_{i}\eta_{i}(x)-\eta_{\beta^{G}(p)}(x)\geq r$. By Lemma 12, $\left\lVert p-g_{\beta^{G}(p)}\right\rVert_{2}>\left\lVert p-g_{\min\left\{\operatorname*{arg\,max}_{i}\eta_{i}(x)\right\}}\right\rVert_{2}$, contradicting the definition of $\beta^{G}(p)$. Hence, by Lemma 8, $C_{1,x}(p)=\max_{i}\eta_{i}(x)-\eta_{\beta^{G}(p)}(x)<r$. ∎

We are now ready to prove Theorem 4.

Proof of Theorem 4.
\begin{align}
\mathcal{R}_{L^{G},P}(f)-\mathcal{R}_{L^{G},P}^{*} &= \int_{\mathcal{X}}C_{1,x}(f(x)) \tag{40}\\
&= \int_{x:d(x)<r}C_{1,x}(f(x))+\int_{x:d(x)\geq r}C_{1,x}(f(x)). \tag{41}
\end{align}

We bound each integral individually.

By Lemma 13, for all $x\in\mathcal{X}$, all $r>\frac{2\lambda^{G}}{1+\lambda^{G}}$, and all $p\in\mathbb{C}^{n}$,

\begin{align}
C_{1,x}(p)\geq r\implies C_{2,x}(p)\geq\frac{\left((1+\lambda^{G})r-2\lambda^{G}\right)^{2}}{4-2\lambda^{G}}. \tag{42}
\end{align}

Hence,

\begin{align}
C_{1,x}(p)>\frac{2\lambda^{G}}{1+\lambda^{G}} &\implies C_{2,x}(p)\geq\frac{\left((1+\lambda^{G})C_{1,x}(p)-2\lambda^{G}\right)^{2}}{4-2\lambda^{G}} \tag{43}\\
&\implies C_{1,x}(p)\leq\frac{2\lambda^{G}}{1+\lambda^{G}}+\frac{1}{1+\lambda^{G}}\sqrt{(4-2\lambda^{G})\,C_{2,x}(p)}. \tag{44}
\end{align}

Note that the last inequality in fact holds for all $p\in\mathbb{C}^{n}$; that is, it holds even when $C_{1,x}(p)\leq\frac{2\lambda^{G}}{1+\lambda^{G}}$, since in that case the right-hand side is at least $\frac{2\lambda^{G}}{1+\lambda^{G}}$. Then,

\begin{align}
\int_{x:d(x)<r}C_{1,x}(f(x)) &\leq \int_{x:d(x)<r}\frac{2\lambda^{G}}{1+\lambda^{G}}+\frac{1}{1+\lambda^{G}}\sqrt{(4-2\lambda^{G})\,C_{2,x}(f(x))} \tag{45}\\
&= \frac{2\lambda^{G}}{1+\lambda^{G}}P_{\mathcal{X}}(d(X)<r)+\frac{\sqrt{4-2\lambda^{G}}}{1+\lambda^{G}}\int_{x:d(x)<r}\sqrt{C_{2,x}(f(x))} \tag{46}\\
&= \frac{2\lambda^{G}}{1+\lambda^{G}}P_{\mathcal{X}}(d(X)<r)+\frac{\sqrt{4-2\lambda^{G}}}{1+\lambda^{G}}\left\lVert\mathbbm{1}_{d(x)<r}\sqrt{C_{2,x}(f(x))}\right\rVert_{P_{\mathcal{X}},1} \tag{47}\\
&\leq \frac{2\lambda^{G}}{1+\lambda^{G}}P_{\mathcal{X}}(d(X)<r)+\frac{\sqrt{4-2\lambda^{G}}}{1+\lambda^{G}}\left\lVert\mathbbm{1}_{d(x)<r}\right\rVert_{P_{\mathcal{X}},2}\left\lVert\sqrt{C_{2,x}(f(x))}\right\rVert_{P_{\mathcal{X}},2} \tag{48}\\
&= \frac{2\lambda^{G}}{1+\lambda^{G}}P_{\mathcal{X}}(d(X)<r)+\frac{\sqrt{4-2\lambda^{G}}}{1+\lambda^{G}}\sqrt{P_{\mathcal{X}}(d(X)<r)\left(\mathcal{R}_{\ell^{G},P}(f)-\mathcal{R}^{*}_{\ell^{G},P}\right)}. \tag{49}
\end{align}

In inequality (48), we apply Hölder's inequality.

When $C_{1,x}(p)>0$, $C_{1,x}(p)=\max_{i}\eta_{i}(x)-\eta_{\beta^{G}(p)}(x)\geq\max_{i}\eta_{i}(x)-\max_{i\notin\operatorname*{arg\,max}_{j}\eta_{j}(x)}\eta_{i}(x)=d(x)$. By (42), if $d(x)\geq r$ and $C_{1,x}(p)>0$, then $C_{2,x}(p)\geq\frac{\left((1+\lambda^{G})r-2\lambda^{G}\right)^{2}}{4-2\lambda^{G}}$. Since $C_{1,x}(p)\in\left[0,1\right]$, it follows that $d(x)\geq r$ and $C_{1,x}(p)>0$ imply $C_{2,x}(p)\geq\frac{\left((1+\lambda^{G})r-2\lambda^{G}\right)^{2}}{4-2\lambda^{G}}C_{1,x}(p)$, and this last inequality trivially also holds when $C_{1,x}(p)=0$. Thus, for all $x\in\mathcal{X}$ and all $p\in\mathbb{C}^{n}$,

$$d(x)\geq r\implies C_{2,x}(p)\geq\frac{\left((1+\lambda^{G})r-2\lambda^{G}\right)^{2}}{4-2\lambda^{G}}C_{1,x}(p)\implies C_{1,x}(p)\leq\frac{4-2\lambda^{G}}{\left((1+\lambda^{G})r-2\lambda^{G}\right)^{2}}C_{2,x}(p).$$

Therefore,

\begin{align}
\int_{x:d(x)\geq r}C_{1,x}(f(x)) &\leq \int_{x:d(x)\geq r}\frac{4-2\lambda^{G}}{\left((1+\lambda^{G})r-2\lambda^{G}\right)^{2}}C_{2,x}(f(x)) \tag{50}\\
&\leq \frac{4-2\lambda^{G}}{\left((1+\lambda^{G})r-2\lambda^{G}\right)^{2}}\int_{\mathcal{X}}C_{2,x}(f(x)) \tag{51}\\
&= \frac{4-2\lambda^{G}}{\left((1+\lambda^{G})r-2\lambda^{G}\right)^{2}}\left(\mathcal{R}_{\ell^{G},P}(f)-\mathcal{R}^{*}_{\ell^{G},P}\right). \tag{52}
\end{align}

Summing the bounds (49) and (52) and substituting into (41) completes the proof. ∎

Appendix B Experiment Details

In this section, we provide the details of our experiments.

B.1 Neural Networks

Here we provide details on the architectures and hyperparameter choices for the neural networks used in our experiments. The architectures and hyperparameters were selected by trial and error on a held-out dataset.

B.1.1 LSHTC1

The proposed embedding strategy adopts a 2-layer neural network architecture with a hidden layer of 4096 neurons and ReLU activation, and the output of the network is normalized to have unit Euclidean norm (see the sketch below). We train with the Adamax optimizer, a learning rate of 0.001, and a batch size of 128, for a total of 5 epochs. A scheduler scales the learning rate down by a factor of 0.1 at the second epoch.
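The following is a minimal PyTorch-style sketch of the setup just described, not the code used in our experiments; the input dimension D and embedding dimension n are hypothetical placeholders, and the training loss (matching outputs to label embedding vectors) is only indicated in a comment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D, n = 50_000, 512  # hypothetical input and embedding dimensions

class EmbeddingNet(nn.Module):
    """2-layer MLP: 4096-unit ReLU hidden layer, output normalized to unit norm."""
    def __init__(self, in_dim: int, emb_dim: int, hidden: int = 4096):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden)
        self.fc2 = nn.Linear(hidden, emb_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = F.relu(self.fc1(x))
        return F.normalize(self.fc2(h), p=2, dim=-1)  # unit Euclidean norm

model = EmbeddingNet(D, n)
optimizer = torch.optim.Adamax(model.parameters(), lr=1e-3)
# Scale the learning rate down by a factor of 0.1 at the second epoch.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[2], gamma=0.1)
# Training (sketch): for each batch (x, y), minimize the squared distance between
# model(x) and the embedding vector g_y of the correct label; call scheduler.step()
# once per epoch.
```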

Our cross-entropy baseline retains a similar network architecture to the embedding strategy, with the output layer adjusted to match the number of classes. Here, the learning rate is 0.01 and the batch size is 128. The model is trained for a total of 5 epochs, with a scheduler that decreases the learning rate after the third epoch.

Finally, the squared loss baseline follows the architecture of our cross-entropy baseline, with the learning rate and batch size mirroring the parameters of the embedding strategy. As with the embedding strategy, the output is normalized. The model is trained for a total of 5 epochs, and the learning rate is scheduled to decrease after the third epoch.

B.1.2 DMOZ

For the DMOZ dataset, our proposed label embedding strategy employed a 2-layer neural network with a hidden layer of 2500 neurons activated by the ReLU function. The output of the network is normalized to have a norm of 1. We trained the model using the Adamax optimizer with a learning rate of 0.001 and a batch size of 256. The model was trained for 5 epochs, and the learning rate was scheduled to decrease at the second and fourth epochs by a factor of 0.1.

For the cross-entropy loss baseline, we used the same network architecture with the output layer adjusted to match the number of classes. The learning rate was 0.01 and the batch size was 256. The model underwent training for a total of 5 epochs, with the learning rate decreasing after the third epoch as determined by the scheduler.

Lastly, the squared loss baseline utilized the same architecture, learning rate, and batch size as the proposed label embedding strategy. Similarly, the model’s output was normalized. The model was trained for 5 epochs, with the learning rate scheduled to decrease after the third epoch.

B.1.3 ODP

For the ODP dataset, the experiments utilized a neural network model composed of 4 layers. The size of the hidden layers progressively increased from $2^{10}$ to $2^{14}$, then decreased to $2^{13}$. Each of these layers employed the ReLU activation function and was followed by batch normalization to promote faster, more stable convergence. The final layer output size corresponded with the embedding dimension for the label embedding strategy and the number of classes for the cross-entropy and squared loss baselines.

In the label embedding framework, the output was normalized to yield a norm of 1. This model was trained using the Adamax optimizer, a learning rate of 0.001, and a batch size of 2048. The training spanned 20 epochs, with a learning rate decrease scheduled after the 10th epoch by a factor of 0.1.

For the cross-entropy loss baseline, the same network architecture was preserved, with the penultimate layer reduced to half its size and the final output layer resized to match the number of classes. This slight modification of the penultimate layer was necessary to fit the model within 48GB of GPU memory. Notably, the neural network output was normalized by dividing each output vector by its Euclidean norm before applying the softmax function, a non-standard operation that significantly enhanced performance. This model was trained using a learning rate of 0.01 over 20 epochs, following a similar learning rate schedule.
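The normalize-before-softmax operation mentioned above amounts to a single line; the following PyTorch-style sketch is illustrative only and assumes a generic model and a standard cross-entropy training step.

```python
import torch
import torch.nn.functional as F

def normalized_logits(outputs: torch.Tensor, eps: float = 1e-12) -> torch.Tensor:
    """Divide each output vector by its Euclidean norm before the softmax,
    as described for the ODP cross-entropy baseline above."""
    return F.normalize(outputs, p=2, dim=-1, eps=eps)

# Usage inside a training step (sketch), assuming `model`, `x`, `y` exist:
# logits = normalized_logits(model(x))
# loss = F.cross_entropy(logits, y)  # cross_entropy applies log-softmax internally
```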

Finally, the squared loss baseline used the same architecture as the cross-entropy baseline and the same learning rate and batch size as the label embedding model. Here, the output was also normalized. The model underwent training for 20 epochs, with a learning rate decrease scheduled after the 10th epoch.

B.2 Elastic Net

We aim to solve

\begin{align}
\min_{W\in\mathbb{R}^{D\times n}}\left\lVert XW-Y\right\rVert_{\mathrm{fro}}^{2}+\lambda_{1}\left\lVert W\right\rVert_{1,1}+\lambda_{2}\left\lVert W\right\rVert_{\mathrm{fro}}^{2}, \tag{53}
\end{align}

where $X\in\mathbb{R}^{N\times D}$ is the data matrix and the rows of $Y\in\mathbb{R}^{N\times n}$ are the label embedding vectors of the corresponding training examples.

Problem (53) can be broken down into $n$ independent real-output regression problems of the form

\begin{align}
\min_{W_{j}\in\mathbb{R}^{D}}\left\lVert XW_{j}-Y_{j}\right\rVert_{2}^{2}+\lambda_{1}\left\lVert W_{j}\right\rVert_{1}+\lambda_{2}\left\lVert W_{j}\right\rVert_{2}^{2}, \tag{54}
\end{align}

where $W_{j}$ is the $j$-th column of $W$ and $Y_{j}$ is the $j$-th column of $Y$. Consequently, we can distribute the $n$ real-output regression problems across multiple cores, as sketched below.
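To illustrate the column-wise decomposition, the sketch below distributes the $n$ regressions across cores with joblib and uses scikit-learn's ElasticNet as a stand-in solver (our experiments instead use the FC-BCFW variant described next). The rescaling of $\lambda_{1},\lambda_{2}$ into scikit-learn's (alpha, l1_ratio) parameterization is our own mapping, accounting for the $1/(2N)$ factor in scikit-learn's objective.

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.linear_model import ElasticNet

def fit_column(X: np.ndarray, y_col: np.ndarray, lam1: float, lam2: float) -> np.ndarray:
    """Solve one column problem of (54), here with scikit-learn's ElasticNet
    as a stand-in solver. scikit-learn minimizes
    (1/2N)||y - Xw||^2 + alpha*l1_ratio*||w||_1 + 0.5*alpha*(1-l1_ratio)*||w||^2,
    so lambda_1, lambda_2 are rescaled accordingly."""
    N = X.shape[0]
    alpha = (lam1 + 2.0 * lam2) / (2.0 * N)
    l1_ratio = lam1 / (lam1 + 2.0 * lam2)
    model = ElasticNet(alpha=alpha, l1_ratio=l1_ratio, fit_intercept=False, max_iter=10_000)
    model.fit(X, y_col)
    return model.coef_

def fit_all_columns(X: np.ndarray, Y: np.ndarray, lam1: float, lam2: float,
                    n_jobs: int = -1) -> np.ndarray:
    """Distribute the n independent column regressions across cores."""
    cols = Parallel(n_jobs=n_jobs)(
        delayed(fit_column)(X, Y[:, j], lam1, lam2) for j in range(Y.shape[1])
    )
    return np.stack(cols, axis=1)  # W of shape (D, n)
```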

We develop a regression variant of the Fully-Corrective Block-Coordinate Frank-Wolfe (FC-BCFW) algorithm (Yen et al., 2016) and use it to solve the real-output regression problems. As the solver operates iteratively, we run it for a predefined number of iterations, denoted $N_{\text{iter}}$. The chosen hyperparameters are outlined in Table 6.

Dataset | $\lambda_{1}$ | $\lambda_{2}$ | $N_{\text{iter}}$
LSHTC1 | 0.1 | 1 | 20
DMOZ | 0.01 | 0.01 | 20
ODP | 0.01 | 0.1 | 40

Table 6: Hyperparameters for the elastic net.