Deep Imbalanced Regression via Hierarchical Classification Adjustment

Haipeng Xiong National University of Singapore Angela Yao National University of Singapore

Abstract

Regression tasks in computer vision, such as age estimation or counting, are often formulated into classification by quantizing the target space into classes. Yet real-world data is often imbalanced – the majority of training samples lie in a head range of target values, while a minority of samples span a usually larger tail range. By selecting the class quantization, one can adjust imbalanced regression targets into balanced classification outputs, though there are trade-offs in balancing classification accuracy and quantization error. To improve regression performance over the entire range of data, we propose to construct hierarchical classifiers for solving imbalanced regression tasks. The fine-grained classifiers limit the quantization error while being modulated by the coarse predictions to ensure high accuracy. Standard hierarchical classification approaches, however, when applied to the regression problem, fail to ensure that predicted ranges remain consistent across the hierarchy. As such, we propose a range-preserving distillation process that can effectively learn a single classifier from the set of hierarchical classifiers. Our novel hierarchical classification adjustment (HCA) for imbalanced regression shows superior results on three diverse tasks: age estimation, crowd counting and depth estimation. We will release the source code upon acceptance.

Figure 1: Hierarchical classification adjustment (HCA). At the top is the finest classifier $H$. Progressing from the top to the middle and bottom, classifier $H$ is adjusted by coarser classifiers ($1\sim H-1$), which bring the prediction closer to the ground truth.

1 Introduction

In machine learning, classification is used to predict categorical targets, while regression is used for predicting continuous targets. Yet for many regression tasks in computer vision, such as depth estimation [4], age estimation [24] and crowd-counting [18], performance is better under a classification formulation. Most commonly, the continuous output is quantized, and each bin is treated as a class.

Depending on the data distribution of the output space, different quantization schemes are more suitable, i.e., linear [39] if the data is approximately balanced, versus logarithmic [7] or adaptive [35] for long-tail and imbalanced data. An adaptive quantization can transform imbalanced regression targets into more balanced class labels. The number of classes and the target range each class covers should be selected to ensure a sufficient number of samples for learning. However, the larger the interval, i.e. to ensure balanced and sufficient class samples, the greater the quantization error when recovering the target regression values. In practice, the number of classes and quantization scheme is chosen to trade off the classification accuracy and quantization error [4, 40].

By transforming regression to classification, can we apply imbalanced classification approaches? Existing long-tail classification works try to balance the input samples [3, 43], features [22, 13, 12], losses [37, 29] or output logits [21, 45, 16] during training. It is challenging to re-balance a single classifier to cover both the head and the tail; often, improvements on the tail classes come at the expense of harming the head.

To capture the entire target range, one alternative is to ensemble a set of classifiers [19, 11, 15]. For regression converted to classification settings, a convenient way to create an ensemble of diverse classifiers is to apply different quantizations. When merging the ensemble, we wish to benefit from the higher performance of coarse classifiers while preserving the resolution of fine classifiers.

In this work, we advocate for the adjustment of fine-grained classifiers with progressively coarser ones. We refer to this procedure as Hierarchical Classification Adjustment, or HCA. HCA works with the logits of the classifier ensemble; as shown in Fig. 1, adding the coarser (but more accurate) logits progressively improves the accuracy.

In addition, we propose to distill the entire hierarchical ensemble into a single classifier; we refer to this process as HCA-d. HCA-d is inspired by [1, 8], which distill image classifiers from a set of classifiers learned with multi-granular image labels. However, such approaches applied to the regression setting are not range-preserving. Specifically, across the classifiers in the ensemble, the estimated target ranges do not remain consistent (see detailed example in Fig. 2). To mitigate the errors that may arise, we propose a range-preserving adjustment to ensure that HCA-d remains consistent in the distillation process. Overall, HCA-d is simple but efficient, showing improvements over the whole range of the target space. Our contributions can be summarized as:

• A novel Hierarchical Classification Adjustment (HCA) to merge a finely-quantized classifier with an ensemble of progressively coarser classifiers over an imbalanced target range;

• A range-preserving distillation technique, HCA-d, which ensures consistent class (range) predictions across the hierarchy of classifiers;

• HCA shows comparable or superior performance on imbalanced visual regression tasks, including age estimation, crowd counting and depth estimation.

2 Related Works

Real-world data [26, 44] is commonly imbalanced, so standard approaches try to re-weight the importance between head and tail samples [14, 21]. Most works [3, 43, 22, 13, 12, 37, 29, 19, 11, 15] focus on imbalanced classification, while only a few [42, 9, 30, 23] study imbalanced settings in regression.

Imbalanced Classification Current research working with single classifiers tries to improve tail class performance via sample weighting [3, 43, 29], upsampling [22, 12] or adjusting class margins [21, 45, 16]. However, a good trade-off is hard to achieve with a single classifier, and gains on the tail invariably come at the expense of the head.

While a single classifier may not be suitable, ensembling multiple classifiers can cover all the classes [11, 14, 36, 15, 41]. In an ensemble, a critical issue is ensuring the diversity of the individual learners. A classic approach to introduce diversity is bagging [2]: sampling with replacement to form a different training data partition for each classifier. Hido et al. [11] sample subsets, which are individually imbalanced but globally balanced when averaging all subsets, to ensure balance and diversity. Li et al. [15] construct individual learners by hard example mining. Xu et al. [41] progressively split head and tail classes to alleviate imbalance and learn diverse classifiers. For deep imbalanced regression, it is convenient to get diverse classifiers by applying different quantization strategies without splitting the dataset.

From Imbalanced Classification to Regression Imbalanced regression [42, 9, 30, 23] is less explored than classification. Most works [42, 30, 23] are inspired by classification techniques. Yang et al. [42] showed that regression values are locally correlated and can benefit from a smoothed version of sample weighting. Ren et al. [23] were inspired by logit adjustment [21] and proposed a balanced Mean Square Error (BMSE) loss. Conversely, Liu et al. [18] and Wang et al. [35] choose distribution-aware quantization to transform imbalanced regression into a less imbalanced classification problem. We also transform imbalanced regression into classification but explore hierarchical classifiers, where each classifier trades off head and tail performance differently while the combination is suitable for the whole range.

Hierarchical classification [1, 5, 8, 38] leverages the taxonomy or hierarchical structure of classes to ensure more semantically meaningful mistakes. For example, misclassifying a poodle as a dog is preferable to misclassifying it as a cat. To learn the hierarchy, hierarchical classifiers are trained together in [1]. A key issue is how to align hierarchical outputs and propagate supervision from the coarse to the fine classifiers [1, 5, 8, 38]. The standard approaches [1, 8] treat classifier outputs after the softmax as posterior probabilities and sum them. Such a paradigm does not ensure consistent predictions when pooling fine-grained predictions to a coarser level. For regression, such inconsistencies adversely affect the learning and serve as the motivation for our proposed range-preserving distillation.

3 Hierarchical Classification Adjustment

We propose a Hierarchical Classification Adjustment (HCA) for imbalanced regression. It learns an ensemble of hierarchical classifiers and then leverages the set of predictions to improve the performance of few-shot ranges while maintaining the performance of many-shot ranges. In this section, we first describe how to set a single classifier for a regression problem (Sec. 3.1), then partition the continuous label space into an ensemble of discrete hierarchical classes (Sec. 3.2), followed by the hierarchical adjustment (Sec. 3.3) and distillation (Sec. 3.4).

3.1 A Vanilla Classifier for Continuous Targets

Consider a continuous dataset $D=\{x,v\}$ , where $x$ and $v$ denote the input and target value, respectively. Let $V_{min}$ and $V_{max}$ denote the minimal and maximal values of $v$ in the training set. Like [39, 18], we divide the target range $[V_{min},V_{max}]$ into $C$ intervals $(V_{min},V_{1}]$ , $(V_{1},V_{2}]$ , $...$ , $(V_{C-1},V_{max}]$ and treat samples within each interval as samples belonging to classes $c=1...C$ . A standard classifier can be trained to estimate the interval index $c$ based on feature representations of $x$ . Consider an input sample $x$ , represented by a feature $f\in\mathbbm{R}^{d}$ extracted by network $F$ :

f=F(x),

(1)

with a predicted class distribution $\hat{p}\in\mathbbm{R}^{C}$ where

\hat{p}=\text{Softmax}\{G(f)\}

(2)

and $G$ is a mapping function with learnable weights.

For learning $F$ and $G$, we apply a cross entropy loss $L_{ce}$ to $\hat{p}$:

L_{ce}=-\sum_{j=1}^{C}p[j]\times\log(\hat{p}[j]),

(3)

where $p\in\mathbbm{R}^{C}$ is the one-hot ground-truth. We can also apply label smoothing to $p$; the soft ordinal loss (SORD) [6] applies a Gaussian smoothing to ensure that ordinal relationships are partially preserved in the target classes.
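To make the label smoothing concrete, the following is a minimal sketch of a Gaussian-smoothed target over class indices, plugged into the cross entropy of Eq. (3). The kernel width `sigma`, the use of class-index distance, and the function names are assumptions of this illustration; the exact distance and normalization used in SORD [6] may differ.

```python
import math

def gaussian_soft_label(true_class, num_classes, sigma=1.0):
    """Soft target that peaks at true_class and decays with class-index distance.

    Assumption: a Gaussian kernel over class indices with bandwidth sigma.
    """
    weights = [math.exp(-((j - true_class) ** 2) / (2.0 * sigma ** 2))
               for j in range(num_classes)]
    total = sum(weights)
    return [w / total for w in weights]  # normalize so the entries sum to 1

def cross_entropy(target, predicted, eps=1e-12):
    """Eq. (3) with an arbitrary (one-hot or soft) target distribution."""
    return -sum(t * math.log(q + eps) for t, q in zip(target, predicted))

# Toy example: 5 classes, ground-truth class index 2 (0-based).
soft = gaussian_soft_label(true_class=2, num_classes=5)
uniform_prediction = [0.2] * 5
loss = cross_entropy(soft, uniform_prediction)
```

Neighbouring classes receive non-zero target mass, so a prediction that is off by one bin is penalized less than one that is off by many bins.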

After training, the predicted class is given by the largest entry of $\hat{p}$. The class is then mapped back to a representative regression value for evaluation. Following [35], we choose the mean value of the training samples falling in each class interval. For age estimation, we adopt linear intervals with length $1$, since ages increase with step $1$; for counting and depth estimation, we choose log-spaced intervals as per [18, 40] for fair comparison.
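The quantize-then-decode pipeline described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the helper names `make_edges`, `to_class` and `class_means`, the toy targets, and the midpoint fallback for empty classes are all hypothetical.

```python
import bisect
import math

def make_edges(v_min, v_max, num_classes, log_spaced=False):
    """Return the C+1 boundaries V_min = V_0 < V_1 < ... < V_C = V_max."""
    if log_spaced:
        # Log-spaced boundaries require strictly positive endpoints.
        lo, hi = math.log(v_min), math.log(v_max)
        return [math.exp(lo + (hi - lo) * i / num_classes)
                for i in range(num_classes + 1)]
    return [v_min + (v_max - v_min) * i / num_classes
            for i in range(num_classes + 1)]

def to_class(v, edges):
    """Map a target value to a 0-based class index using intervals (V_{c-1}, V_c]."""
    c = bisect.bisect_left(edges, v) - 1
    return min(max(c, 0), len(edges) - 2)  # clamp so V_min lands in the first class

def class_means(targets, edges):
    """Representative value per class: mean of the training targets inside it."""
    sums = [0.0] * (len(edges) - 1)
    counts = [0] * (len(edges) - 1)
    for v in targets:
        c = to_class(v, edges)
        sums[c] += v
        counts[c] += 1
    # Assumption of this sketch: empty classes fall back to the interval midpoint.
    return [s / n if n else 0.5 * (edges[c] + edges[c + 1])
            for c, (s, n) in enumerate(zip(sums, counts))]

train_targets = [1.0, 2.0, 3.0, 12.0, 14.0, 29.0]
edges = make_edges(0.0, 30.0, 3)            # three linear intervals
reps = class_means(train_targets, edges)    # one representative value per class
decoded = reps[to_class(11.0, edges)]       # value reported for a sample in class 1
```

Decoding with the class mean rather than the interval midpoint matters most for wide tail intervals, where samples are rarely centred in the bin.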

3.2 Hierarchical Classifier Ensemble

Consider $H$ classifiers; these classifiers are hierarchical, in that each covers a progressively coarser quantization. Consider the finest quantization, which we designate as the setting of the $H$-th classifier. We can merge classes of the $H$-th classifier to form coarser quantized classifiers. For $h=1$ to $H-1$, the $h$-th classifier has $C_{h}=2^{h}$ classes, where each class interval’s range is chosen to equalize the number of data samples per class. For example, for the first classifier ($h\!=\!1$), the two classes cover ranges $(V_{min},V_{med}]$ and $(V_{med},V_{max}]$, where $V_{med}$ is the value selected from $V_{i}$ that is closest to the median; for $h\!=\!2$, the 4 intervals are selected from $V_{i}$ to cover quartiles of the data samples. Fig. 3 (a) shows an example of $H=3$ hierarchical classifiers. We can observe that the label distributions of hierarchical classifiers $1\sim(H-1)$ are more balanced than that of the $H$-th classifier.
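As a rough sketch of how the equal-sample coarse classes could be derived from the fine ones, the snippet below picks, for each quantile $q/2^{h}$, the fine boundary whose cumulative sample count is closest to it, and then builds the binary mapping between coarse and fine classes. The function names, the nearest-boundary rule with first-index tie-breaking, and the toy histogram are assumptions; when the head class is very large, two quantiles can select the same boundary and the level ends up with fewer than $2^{h}$ classes.

```python
def coarse_assignment(fine_counts, h):
    """Assign each fine class to one of (at most) 2**h coarse classes holding
    roughly equal numbers of samples. Returns a list: fine index -> coarse index."""
    total = sum(fine_counts)
    cum, running = [], 0
    for n in fine_counts:
        running += n
        cum.append(running)          # cum[i] = samples in fine classes 0..i
    cuts = set()
    for q in range(1, 2 ** h):
        target = total * q / 2 ** h
        # Cut after the fine class whose cumulative count is closest to the quantile.
        i = min(range(len(cum) - 1), key=lambda k: abs(cum[k] - target))
        cuts.add(i)
    assignment, coarse = [], 0
    for u in range(len(fine_counts)):
        assignment.append(coarse)
        if u in cuts:
            coarse += 1
    return assignment

def mapping_matrix(assignment):
    """Binary matrix with entry [v][u] = 1 iff fine class u lies in coarse class v."""
    num_coarse = max(assignment) + 1
    return [[1 if a == v else 0 for a in assignment] for v in range(num_coarse)]

# Toy long-tailed histogram over 8 fine classes (100 samples in total).
fine_counts = [30, 20, 15, 10, 10, 6, 5, 4]
level_1 = coarse_assignment(fine_counts, 1)   # two classes of 50 samples each
level_2 = coarse_assignment(fine_counts, 2)   # four classes: 30, 20, 25, 25 samples
T_1 = mapping_matrix(level_1)
```

Because the coarse boundaries are always chosen among the fine boundaries, every fine class falls entirely inside one coarse class.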

The $h$ -th classifier predicts $\hat{p}^{h}\in\mathbbm{R}^{C_{h}}$ based on

\hat{p}^{h}=\text{Softmax}\{G_{h}(f)\},

(4)

where $G_{h}$ is a mapping function with learnable weights.

Its cross entropy $L_{ce}^{h}$ can be given as

L_{ce}^{h}=-\sum_{j=1}^{C_{h}}p^{h}[j]\times\log(\hat{p}^{h}[j]),

(5)

where $p^{h}\in\mathbbm{R}^{C_{h}}$ is the ground-truth for the $h$-th classifier.

The overall loss for training feature network $F$ and hierarchical classifiers $G_{h}$ is the sum of all the cross-entropies:

L=\sum_{h=1}^{H}L_{ce}^{h}.

(6)

Note that we do not weight each $L_{ce}^{h}$ differently since they have the same scale.
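The unweighted sum in Eq. (6) can be illustrated with a short sketch. It assumes one-hot targets and that the coarse label of a sample is obtained by looking up its fine label in a per-level assignment list; the names `hierarchical_loss` and `assignments` are hypothetical.

```python
import math

def cross_entropy(target, predicted, eps=1e-12):
    """Eq. (5): cross entropy between a target and a predicted distribution."""
    return -sum(t * math.log(q + eps) for t, q in zip(target, predicted))

def one_hot(index, size):
    return [1.0 if i == index else 0.0 for i in range(size)]

def hierarchical_loss(head_probs, fine_label, assignments):
    """Eq. (6): unweighted sum of the per-head cross entropies.

    head_probs[h]  : softmax output of head h (coarsest first, finest last)
    assignments[h] : list mapping each fine class to its class at level h
    """
    total = 0.0
    for probs, assign in zip(head_probs, assignments):
        label = assign[fine_label]              # ground-truth class at this level
        total += cross_entropy(one_hot(label, len(probs)), probs)
    return total

# Two heads: a 2-class coarse head and the 4-class finest head.
assignments = [[0, 0, 1, 1], [0, 1, 2, 3]]
head_probs = [[0.5, 0.5], [0.25, 0.25, 0.25, 0.25]]
loss = hierarchical_loss(head_probs, fine_label=2, assignments=assignments)
```

For uniform predictions the two terms are $\log 2$ and $\log 4$, so every head contributes on a comparable logarithmic scale.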

3.3 Hierarchical Classifier Adjustment (HCA)

In the ensemble of classifiers learned by Eq. (6), classifier $H$ has the finest quantization (and therefore the lowest quantization error) but is also the least accurate. In contrast, as the classifier gets progressively coarser, it gets more accurate, but also has higher quantization error (see Fig. 3). To merge these results, we can adjust the prediction of classifier $H$ with the coarser classifiers $H-1$ to $1$.

Specifically, from the hierarchical predictions $\hat{p}^{h}$, we can estimate an adjusted prediction through a summation

\hat{p}^{a}=\hat{p}^{H}+\sum_{h=1}^{H-1}T_{h,H}^{T}\cdot\hat{p}^{h},

(7)

or a multiplication

\hat{p}^{m}=\log(\hat{p}^{H})+\sum_{h=1}^{H-1}T_{h,H}^{T}\cdot\log(\hat{p}^{h}),

(8)

operation. In Eqs. (7) and (8), $\hat{p}^{a}$ and $\hat{p}^{m}$ are the addition- and multiplication-adjusted predictions, which keep the finest quantization of the $H$-th classifier. $T_{h,H}\in\mathbbm{R}^{C_{h}\times C_{H}}$ is the class mapping from the $h$-th classifier to the $H$-th classifier. If the $u$-th class of the $H$-th classifier belongs to the $v$-th class of the $h$-th classifier, then $T_{h,H}[v,u]=1$; otherwise $T_{h,H}[v,u]=0$. Fig. 3 (a) visualizes an example of $T_{h,H}$ for $H=3$ hierarchical classifiers. Note that the multiplication merging in Eq. (8) has a similar form to logit adjustment, but here we use the hierarchical prediction $\hat{p}^{h}$ to adjust $\hat{p}^{H}$ rather than the frequency of each class. The final class is recovered by taking the argmax over $\hat{p}^{a}$ or $\hat{p}^{m}$ for addition or multiplication adjustments, respectively.
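A small sketch of the two learning-free adjustments in Eqs. (7) and (8) follows. The helper names (`lift`, `hca_add`, `hca_mul`) and the toy probabilities are illustrative assumptions; a small `eps` is added inside the logarithm for numerical safety, which the equations themselves do not include.

```python
import math

def lift(T, coarse_vec):
    """Compute T^T . coarse_vec: copy each coarse score to the fine classes it covers."""
    num_fine = len(T[0])
    return [sum(T[v][u] * coarse_vec[v] for v in range(len(T)))
            for u in range(num_fine)]

def hca_add(p_fine, coarse_preds, mappings):
    """Eq. (7): add the lifted coarse probabilities to the finest prediction."""
    out = list(p_fine)
    for p_h, T in zip(coarse_preds, mappings):
        out = [a + b for a, b in zip(out, lift(T, p_h))]
    return out

def hca_mul(p_fine, coarse_preds, mappings, eps=1e-12):
    """Eq. (8): add the lifted coarse log-probabilities to the finest log-prediction."""
    out = [math.log(p + eps) for p in p_fine]
    for p_h, T in zip(coarse_preds, mappings):
        log_p_h = [math.log(p + eps) for p in p_h]
        out = [a + b for a, b in zip(out, lift(T, log_p_h))]
    return out

def argmax(xs):
    return max(range(len(xs)), key=lambda i: xs[i])

# One coarse head (2 classes) adjusting a 4-class finest head.
T_1 = [[1, 1, 0, 0],
       [0, 0, 1, 1]]
p_fine = [0.35, 0.05, 0.32, 0.28]    # finest head alone picks class 0
p_coarse = [0.30, 0.70]              # coarse head prefers the upper range

before = argmax(p_fine)
after_add = argmax(hca_add(p_fine, [p_coarse], [T_1]))
after_mul = argmax(hca_mul(p_fine, [p_coarse], [T_1]))
```

In this toy case the coarse head moves the decision from fine class 0 to fine class 2, which lies inside the range the coarse head favours, while the fine head still decides which class within that range is chosen.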

HCA, while proposed with the concept of adjusting the finest-quantized classifier with coarser ones, is effectively an ensembling approach, voting with either the softmax outputs (Eq. (7)) or their logarithms (Eq. (8)). However, such an ensembling approach cannot ensure that the adjusted or ensembled result $\hat{p}^{a}$ or $\hat{p}^{m}$ will predict a final result consistent with $\hat{p}^{h}$.

3.4 Range-Preserving Distillation (HCA-d)

In addition to the inconsistencies, the adjustment procedure, like other ensembling methods, is inefficient because it requires running $H$ classifiers during testing. Alternatively, we propose to distill the ensemble of classifiers into a single adjusted classifier. The ensemble is learned during training in a first stage, frozen, and then distilled into a single classifier during a second training stage; during inference, only the adjusted classifier is applied. Such an approach is motivated by hierarchical classification [1, 8], which also distills hierarchical classifiers, though their aim is to learn hierarchy-aware features.

Consider a classifier $T$ which predicts $\hat{p}^{T}\in R^{C_{H}}$ with a mapping function $G_{T}$:

\hat{p}^{T}=\text{Softmax}(G_{T}(f)),

(9)

where $\hat{p}^{T}$ distills the hierarchical information of $\hat{p}^{h}$. This can be achieved by adopting a Kullback–Leibler divergence loss between the softmax-normalized logits $\hat{p}^{T}$ and $\hat{p}^{h}$.

As $\hat{p}^{T}\in R^{C_{H}}$ and $\hat{p}^{h}\in R^{C_{h}}$ have different resolutions, they must be aligned before the distillation.

Previous works on hierarchical classification [1, 8] view $\hat{p}^{T}\in R^{C_{H}}$ as posterior probabilities and thus simply sum the corresponding dimensions in $\hat{p}^{T}$ to get a down-sampled version $\overline{p}^{T,h}\in R^{C_{h}}$ to match with $\hat{p}^{h}$, i.e.

\overline{p}^{T,h}[j]=\sum_{k=1}^{C_{H}}T_{h,H}[j,k]\times\hat{p}^{T}[k].

(10)

After aligning $\hat{p}^{T}$ with the individual $\hat{p}^{h}$, we can apply the Kullback–Leibler (KL) divergence between $\hat{p}^{h}$ and $\overline{p}^{T,h}$:

L_{\text{hd}}^{h}=\text{KL}\{\hat{p}^{h}||\overline{p}^{T,h}\},

(11)

and an overall hierarchical distillation by summing over all the classifiers:

L_{\text{hd}}=\sum_{h=1}^{H}L_{\text{hd}}^{h}.

(12)
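The summation alignment of Eq. (10) and the distillation loss of Eqs. (11) and (12) can be sketched as below. The function names and toy distributions are assumptions, the mapping for the finest level is taken to be the identity, and `eps` is added only for numerical safety.

```python
import math

def sum_align(T, p_T):
    """Eq. (10): pool fine probabilities into coarse classes by summation."""
    return [sum(t_jk * p_k for t_jk, p_k in zip(row, p_T)) for row in T]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(p_j * math.log((p_j + eps) / (q_j + eps)) for p_j, q_j in zip(p, q))

def distillation_loss(teacher_preds, mappings, p_T):
    """Eqs. (11)-(12): sum over levels of KL(frozen head h || aligned student)."""
    return sum(kl_divergence(p_h, sum_align(T, p_T))
               for p_h, T in zip(teacher_preds, mappings))

T_1 = [[1, 1, 0, 0],
       [0, 0, 1, 1]]
T_2 = [[1, 0, 0, 0],
       [0, 1, 0, 0],
       [0, 0, 1, 0],
       [0, 0, 0, 1]]                 # finest level: identity mapping
p_T = [0.1, 0.2, 0.3, 0.4]           # student prediction

# Teachers that agree with the pooled student give (numerically) zero loss.
matching = distillation_loss([[0.3, 0.7], p_T], [T_1, T_2], p_T)
# A coarse teacher that disagrees produces a positive loss.
mismatched = distillation_loss([[0.5, 0.5], p_T], [T_1, T_2], p_T)
```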

The hierarchical loss in Eq. (11) is not range-preserving when we choose Eq. (10) as the hierarchical alignment. As indicated in Fig. 3 (c), $L_{hd}=0$ does not imply that the class predicted by $\hat{p}^{T}$ is within the range of classes predicted by $\hat{p}^{h}$. We can make the alignment range-preserving by replacing the sum in Eq. (10) with a maximum over the fine classes covered by each coarse class:

\ddot{p}^{T,h}[j]=\max_{k=1,\ldots,C_{H}}T_{h,H}[j,k]\times\hat{p}^{T}[k],

(13)

and then $\ddot{p}^{T,h}$ is normalized to get $\overline{p}^{T,h}\in R^{C_{h}}$:

\overline{p}^{T,h}[j]=\frac{\ddot{p}^{T,h}[j]}{\sum_{l=1}^{C_{h}}\ddot{p}^{T,h}[l]}.

(14)

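Proposition (Range-Preserving): Let $v=\text{argmax}_{j}\overline{p}^{T,h}[j]$, $u=\text{argmax}_{k}\hat{p}^{T}[k]$. If $\overline{p}^{T,h}$ is computed by Eqs. (13) and (14), then $T_{h,H}[v,u]=1$, which indicates that the class predicted by $\hat{p}^{T}$ is within the range of that predicted by $\overline{p}^{T,h}$. This holds because the maximum in Eq. (13) for the coarse class containing $u$ equals $\hat{p}^{T}[u]$, the global maximum of $\hat{p}^{T}$, and the normalization in Eq. (14) does not change the ordering of the entries.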
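The difference between the two alignments can be checked numerically. The sketch below uses a hypothetical 4-class student and a single 2-class coarse level; it shows a case where the summation alignment of Eq. (10) points to a coarse class that does not contain the student's own argmax, while the max alignment of Eqs. (13) and (14) does, and then tests the proposition empirically on random distributions. Names and numbers are illustrative assumptions.

```python
import random

def sum_align(T, p_T):
    """Eq. (10): summation alignment."""
    return [sum(t * p for t, p in zip(row, p_T)) for row in T]

def max_align(T, p_T):
    """Eqs. (13)-(14): per-coarse-class maximum, then normalization."""
    raw = [max(t * p for t, p in zip(row, p_T)) for row in T]
    total = sum(raw)
    return [r / total for r in raw]

def argmax(xs):
    return max(range(len(xs)), key=lambda i: xs[i])

T = [[1, 1, 0, 0],
     [0, 0, 1, 1]]
p_T = [0.40, 0.05, 0.30, 0.25]

u = argmax(p_T)                       # student's fine class
v_sum = argmax(sum_align(T, p_T))     # coarse class under summation alignment
v_max = argmax(max_align(T, p_T))     # coarse class under max alignment

# Empirical check of the proposition on random student predictions.
rng = random.Random(0)
always_consistent = True
for _ in range(1000):
    raw = [rng.random() for _ in range(4)]
    p = [r / sum(raw) for r in raw]
    if T[argmax(max_align(T, p))][argmax(p)] != 1:
        always_consistent = False
```

Here the two upper fine classes jointly outweigh the single most likely fine class under summation, so the pooled prediction lands in a range that excludes the student's own top class; taking the maximum avoids this by construction.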

4 Experiment and Discussion

4.1 Implementation Details

We conduct experiments on three imbalanced regression tasks: IMDB-WIKI-DIR [42] for age estimation, SHTech [44] for crowd counting, and NYUDv2-DIR [42] for depth estimation. For the age and depth datasets, we follow the same ResNet50 [10] backbone for feature extraction and training setting as [42]. For SHTech, we use a VGG16 [27] backbone and the same training setting as [40]. The finest class numbers $C_{H}$ are $121$ for ages and $100$ for depth and counting. Since $C_{H}\leq 2^{7}$ for all datasets, we set $H$ to 7 for all datasets. For $G_{h}$ $(h=1,...,H)$, we adopt one linear layer, which maps features $f\in R^{d}$ to outputs $\hat{p}^{h}\in R^{C_{h}}$. For $G_{T}$, a linear layer is also feasible, while a non-linear mapping is more adequate to distill the hierarchical information. Specifically, we adopt two fully connected layers with hidden dimension $\frac{d}{4}$ and softplus activation. Detailed experiments on $G_{T}$ can be found in the supplementary.
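For illustration, a distillation head of the kind described above could look like the following pure-Python sketch. It assumes the layout linear, softplus, linear, softmax, with the softplus applied only to the hidden layer; the tiny dimensions, the uniform weight initialization and the function names are hypothetical and chosen only to keep the example readable.

```python
import math
import random

def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def linear(weights, bias, x):
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def init_layer(rng, n_out, n_in):
    """Small random weights and zero bias (an arbitrary choice for this sketch)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def g_t(f, layer1, layer2):
    """Two fully connected layers with a softplus in between, then softmax (Eq. (9))."""
    hidden = [softplus(v) for v in linear(layer1[0], layer1[1], f)]
    return softmax(linear(layer2[0], layer2[1], hidden))

rng = random.Random(0)
d, num_fine = 8, 5                        # toy sizes; hidden width is d / 4
layer1 = init_layer(rng, d // 4, d)
layer2 = init_layer(rng, num_fine, d // 4)
feature = [rng.uniform(-1.0, 1.0) for _ in range(d)]
p_T = g_t(feature, layer1, layer2)
```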

We first train the hierarchical classifiers with the summed cross-entropy loss in Eq. (6). We can then obtain the learning-free HCA results, HCA-add and HCA-mul, with Eq. (7) and Eq. (8), respectively. For range-preserving HCA (HCA-d), classifiers $1\sim H$ and the feature extraction network $F$ are fixed. Only classifier $T$ is trained with $L_{hd}$ in Eq. (12), for an additional $20\%$ of the stage-1 epochs until convergence.

4.2 Ablation Studies

We first conduct ablation studies on several design factors of HCA, including the ground-truth labels, the hierarchical class settings and the two variants of HCA. The IMDB-WIKI-DIR [42] and SHTech Part A (SHA) [44] datasets are chosen for the ablation studies. Mean absolute error (MAE) and its balanced version bMAE [23] are adopted as evaluation metrics for SHA and IMDB-WIKI-DIR, respectively. Lower MAE and bMAE denote better performance.
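To clarify how the two metrics differ, here is a hedged sketch. It assumes that bMAE averages the per-label MAE over the distinct label values present in the test set, so that rare labels weigh as much as frequent ones; the precise binning used in [23] may differ, and the function names and toy numbers are hypothetical.

```python
from collections import defaultdict

def mae(targets, preds):
    """Plain mean absolute error over all samples."""
    return sum(abs(t - p) for t, p in zip(targets, preds)) / len(targets)

def balanced_mae(targets, preds, bin_of=round):
    """Average of per-bin MAEs (assumed definition of bMAE for this sketch)."""
    errors = defaultdict(list)
    for t, p in zip(targets, preds):
        errors[bin_of(t)].append(abs(t - p))
    per_bin = [sum(e) / len(e) for e in errors.values()]
    return sum(per_bin) / len(per_bin)

# Four samples of a frequent age and one sample of a rare age.
targets = [25, 25, 25, 25, 70]
preds = [26, 24, 25, 27, 60]
plain = mae(targets, preds)              # dominated by the frequent label
balanced = balanced_mae(targets, preds)  # each label value counts equally
```

In this toy case the large error on the single rare sample is diluted in the plain MAE but accounts for half of the balanced score.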

$i$) One-hot or Gaussian-smoothed ground-truth labels One-hot and Gaussian-smoothed [6] ground truths $p^{h}$ are two common choices for the cross-entropy losses in Eq. (5). Compared to one-hot $p^{h}$, Gaussian-smoothed ground truths further encode the ordinal relationship among labels. We compare both of them in Table 1. HCA shows improvements with both hard and soft ground truths, and HCA with soft ground truths delivers better performance. We use soft labels by default in all of the remaining experiments.

Method	GT	IMDB-WIKI-DIR All (bMAE)	Many (bMAE)	Med. (bMAE)	Few (bMAE)	SHA (MAE)
CLS	one-hot	13.48	7.25	13.65	32.57	58.8
HCA-d	one-hot	12.93	7.20	12.81	30.71	55.0
CLS	soft [6]	13.58	7.13	13.95	33.21	58.2
HCA-d	soft [6]	12.70	7.00	13.18	29.94	53.7

Table 1: Soft vs. hard one-hot ground truth of classification.

$ii$) Hierarchical Class Settings Combining extra classifiers could improve a single vanilla classifier, but could we just duplicate the vanilla classifier at the finest level rather than setting up hierarchical classifiers? Besides, what about splitting hierarchical classes to equalize the interval length of each class rather than the number of samples? We compare hierarchical class settings in Table 2. Compared with a single classifier, ensembling duplicated classifiers can be helpful (overall bMAE from $13.58$ to $13.42$), but the improvement is limited compared to that of HCA. Moreover, it is more beneficial to split hierarchical classes by equalizing the number of samples within each class rather than equalizing the length of the class intervals.

Configuration	IMDB-WIKI-DIR All (bMAE)	Many (bMAE)	Med. (bMAE)	Few (bMAE)	SHA (MAE)
Single CLS	13.58	7.13	13.95	33.21	58.2
Same CLSs	13.42	7.10	14.38	32.22	57.9
E-Num HCA-d	12.70	7.00	13.18	29.94	53.7
E-Len HCA-d	12.77	7.23	12.92	29.77	56.2

Table 2: Comparison of various (hierarchical) class settings. “Same CLSs” means that all $H$ classifiers adopt the same class splitting as the $H$-th classifier. “E-Num” equalizes the number of samples within each class during hierarchical class splitting, while “E-Len” equalizes the length of each class interval.

$iii$) Comparing two variants of HCA Learning-free HCA and range-preserving HCA (HCA-d) are compared in Table 3. We observe that all variants of HCA are clearly better than a single classifier or an ensemble of identical classifiers in all shots. HCA-d is better than HCA-add and HCA-mul, suggesting that learning-free HCA cannot fully exploit the hierarchical information in $\hat{p}^{h}$ and that explicit hierarchical distillation is more beneficial.

Combine	IMDB-WIKI-DIR All (bMAE)	Many (bMAE)	Med. (bMAE)	Few (bMAE)	SHA (MAE)
Single CLS	13.58	7.13	13.95	33.21	58.2
Same CLSs	13.42	7.10	14.38	32.22	57.9
Average	14.85	7.18	17.83	36.24	106.1
HCA-add	12.86	6.98	13.15	30.80	55.9
HCA-mul	12.89	7.00	13.36	30.74	54.7
HCA-d	12.70	7.00	13.18	29.94	53.7

Table 3: Comparison of two hierarchical adjustment approaches.

4.3 Analysis of HCA

$i$) Coarse classifiers perform “worse” due to quantization errors: Fig. 4 (a) compares the performance of the classifiers individually versus their quantization error. The coarse classifiers ($1\sim 3$) perform worse individually compared to the vanilla $H$-th classifier, due to the quantization error of representing the entire interval with a single value.

$ii$) Coarse classifiers provide better range estimation, while combining fine classifiers mitigates quantization errors: In Fig. 4 (b), a coarse $h$-th classifier and the finest $H$-th classifier are combined in a coarse-to-fine manner. Specifically, we first get a coarse range prediction from the $h$-th classifier and then select a finer class within this coarse range according to $\hat{p}^{H}$ predicted by the $H$-th classifier. Merging coarse predictions significantly decreases the error of the $H$-th classifier, suggesting that coarse classifiers provide better range estimation than the finest $H$-th classifier. Meanwhile, selecting a finer class within the coarse range decreases the bMAE of the coarse classifiers ($1\sim 3$), implying that combining fine classifiers can mitigate the quantization error of coarse classifiers.
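The coarse-to-fine combination used in this analysis can be sketched in a few lines: the coarse head fixes the range, and the finest head only chooses among the fine classes inside it. The function name `coarse_to_fine`, the assignment list and the toy probabilities are illustrative assumptions.

```python
def argmax(xs):
    return max(range(len(xs)), key=lambda i: xs[i])

def coarse_to_fine(p_coarse, p_fine, assignment):
    """Pick the coarse range first, then the most likely fine class within it.

    assignment[u] is the coarse class that fine class u belongs to.
    """
    v = argmax(p_coarse)
    candidates = [u for u, a in enumerate(assignment) if a == v]
    return max(candidates, key=lambda u: p_fine[u])

assignment = [0, 0, 1, 1]
p_coarse = [0.30, 0.70]
p_fine = [0.35, 0.05, 0.32, 0.28]

fine_only = argmax(p_fine)                                # ignores the coarse head
combined = coarse_to_fine(p_coarse, p_fine, assignment)   # restricted to coarse class 1
```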

$iii$) Range-preserving distillation is the key to successful HCA: In Sec. 3.4, we argue that the summation alignment in Eq. (10) is not range-preserving. Here we experimentally compare the summation alignment with the range-preserving alignment in Eq. (13) in Table 4. Using the sum alignment harms the performance of HCA in all shots, while the range-preserving alignment improves over vanilla classification. We further analyze the percentage of samples that yield inconsistent range estimates when using Eq. (10). Fig. 5 shows the percentage of inconsistent samples for each classifier head. As the number of classes increases, the inconsistency increases, which can be explained by the decreased maximum value of the prediction $\hat{p}^{h}$ in finer classifiers. Ideally, if the maximum value of $\hat{p}^{h}$ is $1$, then summing $\hat{p}^{h}$ will not change the range predicted by $\hat{p}^{h}$; however, the maximum value of $\hat{p}^{h}$ is much less than $1$ for finer classifiers in regression, so the sum operation in Eq. (10) cannot ensure consistent ranges across the hierarchy, as shown in Fig. 2.

To isolate the influence of the second-stage training, we add a “CLS+GT sup” baseline, which uses the ground-truth labels to train the classifier $T$. “CLS+GT sup” does not show a consistent improvement over vanilla classification, indicating that it is the hierarchical distillation, rather than the extra training stage, that helps imbalanced regression.

Combine	IMDB-WIKI-DIR All (bMAE)	Many (bMAE)	Med. (bMAE)	Few (bMAE)	SHA (MAE)
CLS	13.58	7.13	13.95	33.21	58.2
CLS+GT sup	13.64	7.20	14.94	32.54	57.0
HCA sum (10)	27.08	15.14	38.57	55.11	150.3
HCA max (13)	12.70	7.00	13.18	29.94	53.7

Table 4: Comparison of the summation and range-preserving alignments of hierarchical predictions. “CLS+GT sup” denotes using the ground-truth labels rather than hierarchical predictions to supervise the classifier $T$.

Configuration	Balanced 1000:1000	Balanced 100:100	Balanced 10:10	Imbalanced 1900:100	Imbalanced 190:10	Imbalanced 19:1
Regression	6.00 $\pm$ 0.10	7.50 $\pm$ 0.04	7.56 $\pm$ 0.07	6.78 $\pm$ 0.04	7.68 $\pm$ 0.05	7.74 $\pm$ 0.12
CLS	6.09 $\pm$ 0.03	7.63 $\pm$ 0.05	7.61 $\pm$ 0.07	6.78 $\pm$ 0.03	7.74 $\pm$ 0.12	7.90 $\pm$ 0.07
HCA-d	6.06 $\pm$ 0.04	7.53 $\pm$ 0.03	7.53 $\pm$ 0.03	6.72 $\pm$ 0.04	7.54 $\pm$ 0.03	7.54 $\pm$ 0.05

Table 5: Comparison on subsampled balanced and imbalanced subsets of IMDB-WIKI-DIR. Each method is repeated $5$ times.

$iv$) Imbalanced or Insufficient? In imbalanced regression tasks, like age estimation and counting, few-shot samples are always rare ($\leq 20$ samples per class). HCA is helpful in this imbalanced and insufficient scenario, but will it also be helpful in an imbalanced but sufficient dataset, where relatively rare classes still have plenty of samples? Moreover, is HCA applicable to balanced regression? To verify the influence of imbalance and insufficiency, we resample the IMDB-WIKI-DIR dataset to create these scenarios. Specifically, we first generate balanced subsets with $N$ samples per age and ages ranging from $20$ to $49$. $N$ can be $1000$, $100$ and $10$, covering sufficient to insufficient scenarios. Then, for imbalanced subsets, we take ages $20\sim 34$ as many-shot and ages $35\sim 49$ as few-shot, while keeping the ratio of many to few at $19$. To make a fair comparison, we keep the total sample number the same among balanced and imbalanced subsets, thus having $1900:100$, $190:10$ and $19:1$ imbalanced subsets covering sufficient and insufficient cases. Table 5 presents quantitative results. We can observe that: $i$) HCA does not show significant improvement when the training set is balanced, or imbalanced but with sufficient samples ($1900:100$); $ii$) HCA outperforms vanilla classification or regression by a clear margin when the training set is both imbalanced and insufficient.
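The subset sizes quoted above follow from simple arithmetic, sketched here. The snippet only computes per-age sample budgets; the function names are hypothetical, and the actual sampling of images from IMDB-WIKI-DIR is not shown.

```python
def balanced_budget(n_per_age, ages=range(20, 50)):
    """N samples for every age from 20 to 49."""
    return {age: n_per_age for age in ages}

def imbalanced_budget(n_per_age, many_ages=range(20, 35),
                      few_ages=range(35, 50), ratio=19):
    """Per-age budgets with many:few = ratio:1 and the same total as the balanced subset.

    With equally many 'many' and 'few' ages, one many-age plus one few-age
    must together hold 2 * n_per_age samples.
    """
    few = 2 * n_per_age // (ratio + 1)
    many = ratio * few
    budget = {age: many for age in many_ages}
    budget.update({age: few for age in few_ages})
    return budget

configs = {}
for n in (1000, 100, 10):
    bal = balanced_budget(n)
    imb = imbalanced_budget(n)
    # Store (many per age, few per age, balanced total, imbalanced total).
    configs[n] = (imb[20], imb[49], sum(bal.values()), sum(imb.values()))
```

For $N=1000$, $100$ and $10$ this gives the $1900:100$, $190:10$ and $19:1$ settings, each with the same total as its balanced counterpart.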

4.4 Comparison with SOTA on Regression Tasks

SHTech Dataset SHTech [44] is a crowd-counting dataset with a severely imbalanced distribution [39, 18, 40]. It has two subsets, part A and part B. Part A presents crowded scenes captured in arbitrary camera views, while part B presents relatively sparse scenes captured by surveillance cameras. We follow the same network setting as [40], where $100$ logarithmically spaced classes are adopted for $C_{H}$. Mean absolute error (MAE) and root mean square error (RMSE) are adopted as evaluation metrics; lower is better for both. Quantitative results are presented in Table 6. HCA achieves the best MAE on both subsets and improves on plain classification by a large margin on part A.

Method	SHA MAE $\downarrow$	SHA RMSE $\downarrow$	SHB MAE $\downarrow$	SHB RMSE $\downarrow$
CSRNet [17]	68.2	115.0	10.6	16.0
DRCN [28]	64.0	98.4	8.5	14.4
BL [20]	62.8	101.8	7.7	12.7
PaDNet [31]	59.2	98.1	8.1	12.2
MNA [32]	61.9	99.6	7.4	11.3
OT [34]	59.7	95.7	7.4	11.8
GL [33]	61.3	95.4	7.3	11.7
Regression [40]	65.4	103.3	10.7	19.5
DC-regression [40]	60.7	101.0	7.1	11.0
CLS	58.2	96.7	7.0	11.8
HCA-add	55.9	92.8	6.7	11.4
HCA-mul	54.7	91.6	6.8	11.4
HCA-d	53.7	87.8	6.8	11.8

Table 6: Comparison on SHTech dataset [44]. Methods are grouped as density map regression, local count regression and classification approaches.

IMDB-WIKI-DIR Dataset IMDB-WIKI-DIR [42] is a large age estimation dataset, which is an imbalanced subset sampled from IMDB-WIKI [25]. There are $191509$ training samples, $11022$ validation samples, and $11022$ testing samples.

We choose three classification baselines: $i$) vanilla classification, which is the $H$-th classifier of HCA; $ii$) classification with label distribution smoothing (LDS) [42], which re-weights samples with the inverse class frequency; $iii$) classification with label distribution smoothing (LDS) and ranksim [9] regularization, which regularizes the feature space to have the same ordering as the label space. Their HCA counterparts are also included.

Table 7 presents the quantitative results. We can observe that: $i$) HCA shows clear improvement in bMAE over naive classification baselines. Specifically, HCA-d improves all the shots for the “CLS” and “CLS+LDS” baselines, while for the strong baseline “CLS+LDS+ranksim”, whose many-shot results are already saturated, there is still a slight trade-off between many- and few-shot performance (many-shot bMAE increases from $6.70$ to $6.88$). $ii$) HCA outperforms its regression baselines and other regression approaches. Note that Balanced MSE [23], a logit-adjustment variant for regression, improves few/medium-shot performance at the cost of significantly harming the many-shot (bMAE from $7.32$ to $7.65$), whereas HCA-d roughly maintains or improves many-shot performance.

Methods	bMAE $\downarrow$ All	bMAE $\downarrow$ Many	bMAE $\downarrow$ Med.	bMAE $\downarrow$ Few	MAE $\downarrow$ All	MAE $\downarrow$ Many	MAE $\downarrow$ Med.	MAE $\downarrow$ Few
Regression [42]	13.92	7.32	15.93	32.78	8.06	7.23	15.12	26.33
Regression+LDS [42]	13.37	7.55	13.96	30.92	8.11	7.47	13.41	23.50
Regression+LDS+ranksim [9]	12.83	7.00	13.28	30.51	7.56	6.94	12.61	23.43
Regression+FDS +ranksim [9]	12.39	6.91	12.82	29.01	7.35	6.81	11.50	22.75
Balanced MSE [23]	12.66	7.65	12.68	28.14	8.12	7.58	12.27	23.05
DC-regression [40]	14.18	7.30	16.04	34.00	8.05	7.18	15.40	26.48
DC-regression+LDS [40]	13.04	8.11	13.62	27.82	8.62	8.04	13.50	22.04
CLS	13.58	7.13	13.95	33.21	7.75	7.04	13.60	25.17
HCA-add	12.86	6.98	13.15	30.80	7.53	6.90	12.70	23.53
HCA-mul	12.89	7.00	13.36	30.74	7.57	6.92	12.91	23.52
HCA-d	12.70	7.00	13.18	29.94	7.54	6.91	12.69	22.96
CLS+LDS	12.85	7.31	13.40	29.54	7.84	7.25	12.53	23.56
HCA-add+LDS	12.64	7.15	12.83	29.47	7.66	7.09	12.20	23.31
HCA-mul+LDS	12.68	7.18	13.03	29.42	7.70	7.11	12.35	23.34
HCA-d+LDS	12.42	7.28	12.47	28.24	7.77	7.21	12.25	22.43
CLS+LDS+ranksim	12.33	6.70	13.16	29.10	7.25	6.63	12.26	22.77
HCA-add+LDS+ranksim	12.15	6.77	12.09	28.80	7.26	6.72	11.39	23.48
HCA-mul+LDS+ranksim	12.24	6.69	12.69	29.01	7.22	6.63	11.84	23.22
HCA-d+LDS+ranksim	11.92	6.88	11.67	27.72	7.31	6.82	10.99	22.04

Table 7: Comparison on IMDB-WIKI-DIR Dataset. Methods are grouped as regression and classification approaches.

AgeDB-DIR Dataset AgeDB-DIR [42] is an imbalanced re-sampled version of the AgeDB dataset [moschoglou2017agedb]. It contains $12208$ training samples, $2140$ validation samples and $2140$ testing samples, with ages ranging from $0$ to $101$. Table 8 presents the quantitative results. HCA approaches show consistent improvement in overall bMAE over their classification baselines.

Methods	bMAE $\downarrow$				MAE $\downarrow$
Methods	All	Many	Med.	Few	All	Many	Med.	Few
Regression [42]	9.72	6.62	8.80	16.66	7.57	6.61	8.73	13.48
Regression+LDS [42]	9.12	6.98	8.87	13.66	7.67	6.98	8.87	10.91
Regression+LDS+ranksim [9]	7.96	6.34	7.84	11.35	6.91	6.34	7.80	9.92
Balanced MSE [23]	8.97	7.65	7.43	12.65	7.78	7.65	7.45	9.99
DC-regression [40]	9.70	6.82	8.77	16.16	7.65	6.82	8.70	12.55
DC-regression+LDS [40]	9.48	7.36	9.14	14.04	8.03	7.36	9.13	11.26
CLS	9.14	6.89	8.62	14.08	7.58	6.89	8.51	11.60
HCA-add	8.95	6.91	8.26	13.53	7.49	6.91	8.17	11.05
HCA-mul	8.97	6.93	8.35	13.52	7.52	6.93	8.25	11.10
HCA-d	8.85	6.86	8.31	13.26	7.45	6.86	8.22	10.90
CLS+LDS	8.75	7.17	8.29	12.27	7.63	7.17	8.30	10.14
HCA-add+LDS	8.40	7.22	7.83	11.18	7.53	7.22	7.82	9.61
HCA-mul+LDS	8.54	7.25	8.02	11.49	7.60	7.25	8.02	9.70
HCA-d+LDS	8.46	7.11	7.80	11.64	7.47	7.11	7.77	10.06
CLS+LDS+ranksim	7.99	6.66	7.21	11.20	6.97	6.66	7.16	9.34
HCA-add+LDS+ranksim	7.82	6.67	7.12	10.59	6.94	6.67	7.07	9.10
HCA-mul+LDS+ranksim	7.85	6.68	7.14	10.71	6.95	6.68	7.10	9.17
HCA-d+LDS+ranksim	7.87	6.74	7.14	10.66	7.01	6.74	7.13	9.22

Table 8: Comparison on AgeDB-DIR Dataset.

Methods	MAE $\downarrow$	RMSE $\downarrow$	AbsRel $\downarrow$	$\delta_{1}$ $\uparrow$	$\delta_{2}$ $\uparrow$	$\delta_{3}$ $\uparrow$
Regression [42]	1.004	1.486	0.179	0.678	0.908	0.975
Regression+LDS [42]	0.968	1.387	0.188	0.672	0.907	0.976
Regression+LDS+ranksim [9]	0.931	1.389	0.183	0.699	0.905	0.969
Balanced MSE [23]	0.922	1.279	0.219	0.695	0.878	0.947
CLS	1.011	1.512	0.184	0.678	0.906	0.958
HCA-add	0.987	1.470	0.180	0.686	0.909	0.961
HCA-mul	0.991	1.478	0.181	0.685	0.909	0.960
HCA-d	0.987	1.475	0.181	0.689	0.915	0.961
CLS+LDS	0.924	1.383	0.181	0.711	0.909	0.965
HCA-add+LDS	0.919	1.375	0.180	0.710	0.910	0.965
HCA-mul+LDS	0.920	1.377	0.180	0.710	0.910	0.965
HCA-d+LDS	0.911	1.367	0.179	0.714	0.911	0.966
CLS+LDS+ranksim	0.904	1.335	0.182	0.715	0.916	0.972
HCA-add+LDS+ranksim	0.901	1.330	0.181	0.714	0.919	0.972
HCA-mul+LDS+ranksim	0.902	1.332	0.181	0.714	0.918	0.972
HCA-d+LDS+ranksim	0.895	1.321	0.180	0.715	0.919	0.972

Table 9: Comparison on NYUDv2-DIR dataset. Methods are grouped as regression and classification approaches.

NYUDv2-DIR Dataset NYUDv2-DIR [42] is an imbalanced version sampled from the NYU Depth Dataset V2 [26]. The depth values range from $0$ to $10$ meters and are divided into $100$ logarithmically spaced classes for $C_{H}$. Mean absolute error (MAE), root mean squared error (RMSE), absolute relative error (AbsRel), $\delta_{1}$, $\delta_{2}$ and $\delta_{3}$ are adopted as evaluation metrics. Note that all classes in NYUDv2-DIR have more than $10^{7}$ samples, so all of them count as many-shot classes under the IMDB-WIKI-DIR criterion [42] ($>100$ samples). We report the overall results in Table 9; detailed results can be found in the supplementary. HCA improves over its naive classification baselines and is comparable to or better than other regression methods. The improvement of HCA over CLS on NYUDv2-DIR is small because the dataset, although imbalanced, has sufficient samples per class. This result agrees with the simulated experiments in Table 5.
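The depth evaluation metrics can be sketched as follows. The threshold accuracy $\delta_{k}$ is taken here to be the fraction of pixels with $\max(\hat{d}/d, d/\hat{d}) < 1.25^{k}$, the usual convention in depth estimation; this definition and the function name `depth_metrics` are assumptions, since the text does not spell them out. Predicted and ground-truth depths are assumed to be strictly positive.

```python
import math


def depth_metrics(preds, targets):
    """Compute MAE, RMSE, AbsRel and delta_1..3 for flat lists of depths.

    Assumes every prediction and target is strictly positive, and that
    delta_k is the share of pixels with max(p/t, t/p) < 1.25**k.
    """
    n = len(targets)
    abs_err = [abs(p - t) for p, t in zip(preds, targets)]

    mae = sum(abs_err) / n
    rmse = math.sqrt(sum(e * e for e in abs_err) / n)
    abs_rel = sum(e / t for e, t in zip(abs_err, targets)) / n

    # Symmetric ratio between prediction and ground truth, always >= 1.
    ratios = [max(p / t, t / p) for p, t in zip(preds, targets)]
    deltas = {
        f"delta_{k}": sum(r < 1.25 ** k for r in ratios) / n
        for k in (1, 2, 3)
    }
    return {"MAE": mae, "RMSE": rmse, "AbsRel": abs_rel, **deltas}


# Toy example with three pixels (depths in meters).
toy_metrics = depth_metrics(preds=[1.0, 2.4, 6.0], targets=[1.0, 2.0, 4.0])
```

MAE, RMSE and AbsRel are errors (lower is better), while the $\delta$ accuracies are fractions in $[0, 1]$ (higher is better), matching the arrows in Table 9.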

5 Conclusion

This paper proposes a hierarchical classification adjustment (HCA) for imbalanced regression. HCA leverages hierarchical class predictions to adjust the vanilla classifiers and improves regression performance over the whole target space without introducing extra quantization error. On imbalanced regression tasks including age estimation, crowd counting and depth estimation, HCA shows superior results to regression and vanilla classification approaches. HCA is most helpful when training data is imbalanced and insufficient, and it remains helpful when data is balanced and sufficient.

References

[1] Luca Bertinetto, Romain Mueller, Konstantinos Tertikas, Sina Samangooei, and Nicholas A Lord. Making better mistakes: Leveraging class hierarchies with deep networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12506–12515, 2020.
[2] Leo Breiman. Bagging predictors. Machine learning, 24:123–140, 1996.
[3] Jonathon Byrd and Zachary Lipton. What is the effect of importance weighting in deep learning? In International conference on machine learning, pages 872–881. PMLR, 2019.
[4] Yuanzhouhan Cao, Zifeng Wu, and Chunhua Shen. Estimating depth from monocular images as classification using deep fully convolutional residual networks. IEEE Transactions on Circuits and Systems for Video Technology, 28(11):3174–3182, 2017.
[5] Dongliang Chang, Kaiyue Pang, Yixiao Zheng, Zhanyu Ma, Yi-Zhe Song, and Jun Guo. Your “flamingo” is my “bird”: Fine-grained, or not. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11476–11485, 2021.
[6] Raul Diaz and Amit Marathe. Soft labels for ordinal regression. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4738–4747, 2019.
[7] Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2002–2011, 2018.
[8] Ashima Garg, Depanshu Sani, and Saket Anand. Learning hierarchy aware features for reducing mistake severity. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXIV, pages 252–267. Springer, 2022.
[9] Yu Gong, Greg Mori, and Fred Tung. RankSim: Ranking similarity regularization for deep imbalanced regression. In Proceedings of the 39th International Conference on Machine Learning, pages 7634–7649, 2022.
[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
[11] Shohei Hido, Hisashi Kashima, and Yutaka Takahashi. Roughly balanced bagging for imbalanced data. Statistical Analysis and Data Mining: The ASA Data Science Journal, 2(5-6):412–426, 2009.
[12] Yan Hong, Jianfu Zhang, Zhongyi Sun, and Ke Yan. SAFA: Sample-adaptive feature augmentation for long-tailed image classification. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXIV, pages 587–603. Springer, 2022.
[13] Bingyi Kang, Yu Li, Sa Xie, Zehuan Yuan, and Jiashi Feng. Exploring balanced feature spaces for representation learning. In International Conference on Learning Representations, 2021.
[14] Taghi M Khoshgoftaar, Jason Van Hulse, and Amri Napolitano. Comparing boosting and bagging techniques with noisy and imbalanced data. IEEE Transactions on Systems, Man, and Cybernetics-Part A: Systems and Humans, 41(3):552–568, 2010.
[15] Jun Li, Zichang Tan, Jun Wan, Zhen Lei, and Guodong Guo. Nested collaborative learning for long-tailed visual recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6949–6958, 2022.
[16] Mengke Li, Yiu-ming Cheung, and Yang Lu. Long-tailed visual recognition via Gaussian clouded logit adjustment. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6929–6938, 2022.
[17] Yuhong Li, Xiaofan Zhang, and Deming Chen. CSRNet: Dilated convolutional neural networks for understanding the highly congested scenes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1091–1100, 2018.
[18] Liang Liu, Hao Lu, Haipeng Xiong, Ke Xian, Zhiguo Cao, and Chunhua Shen. Counting objects by blockwise classification. IEEE Trans. on Circuits and Systems for Video Technology, 30(10):3513–3527, 2019.
[19] Xu-Ying Liu, Jianxin Wu, and Zhi-Hua Zhou. Exploratory undersampling for class-imbalance learning. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(2):539–550, 2008.
[20] Zhiheng Ma, Xing Wei, Xiaopeng Hong, and Yihong Gong. Bayesian loss for crowd count estimation with point supervision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6142–6151, 2019.
[21] Aditya Krishna Menon, Sadeep Jayasumana, Ankit Singh Rawat, Himanshu Jain, Andreas Veit, and Sanjiv Kumar. Long-tail learning via logit adjustment. In International Conference on Learning Representations, 2021.
[22] Seulki Park, Youngkyu Hong, Byeongho Heo, Sangdoo Yun, and Jin Young Choi. The majority can help the minority: Context-rich minority oversampling for long-tailed classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6887–6896, 2022.
[23] Jiawei Ren, Mingyuan Zhang, Cunjun Yu, and Ziwei Liu. Balanced MSE for imbalanced visual regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.
[24] Rasmus Rothe, Radu Timofte, and Luc Van Gool. DEX: Deep expectation of apparent age from a single image. In Proceedings of the IEEE international conference on computer vision workshops, pages 10–15, 2015.
[25] Rasmus Rothe, Radu Timofte, and Luc Van Gool. Deep expectation of real and apparent age from a single image without facial landmarks. International Journal of Computer Vision, 126(2-4):144–157, 2018.
[26] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support inference from RGBD images. ECCV (5), 7576:746–760, 2012.
[27] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
[28] Vishwanath A Sindagi, Rajeev Yasarla, and Vishal M Patel. JHU-CROWD++: Large-scale crowd counting dataset and a benchmark method. Technical Report, 2020.
[29] Saptarshi Sinha, Hiroki Ohashi, and Katsuyuki Nakamura. Class-wise difficulty-balanced loss for solving class-imbalance. In Proceedings of the Asian conference on computer vision, 2020.
[30] Michael Steininger, Konstantin Kobs, Padraig Davidson, Anna Krause, and Andreas Hotho. Density-based weighting for imbalanced regression. Machine Learning, 110:2187–2211, 2021.
[31] Yukun Tian, Yiming Lei, Junping Zhang, and James Z Wang. PaDNet: Pan-density crowd counting. IEEE Transactions on Image Processing, 29:2714–2727, 2019.
[32] Jia Wan and Antoni Chan. Modeling noisy annotations for crowd counting. Advances in Neural Information Processing Systems, 33, 2020.
[33] Jia Wan, Ziquan Liu, and Antoni B Chan. A generalized loss function for crowd counting and localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1974–1983, 2021.
[34] Boyu Wang, Huidong Liu, Dimitris Samaras, and Minh Hoai. Distribution matching for crowd counting. In Conference on Neural Information Processing Systems (NeurIPS), 2020.
[35] Changan Wang, Qingyu Song, Boshen Zhang, Yabiao Wang, Ying Tai, Xuyi Hu, Chengjie Wang, Jilin Li, Jiayi Ma, and Yang Wu. Uniformity in heterogeneity: Diving deep into count interval partition for crowd counting. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3234–3242, 2021.
[36] Xudong Wang, Long Lian, Zhongqi Miao, Ziwei Liu, and Stella Yu. Long-tailed recognition by routing diverse distribution-aware experts. In International Conference on Learning Representations, 2021.
[37] Yu-Xiong Wang, Deva Ramanan, and Martial Hebert. Learning to model the tail. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017.
[38] Hui Wu, Michele Merler, Rosario Uceda-Sosa, and John R Smith. Learning to make better mistakes: Semantics-aware visual food recognition. In Proceedings of the 24th ACM international conference on Multimedia, pages 172–176, 2016.
[39] Haipeng Xiong, Hao Lu, Chengxin Liu, Liang Liu, Zhiguo Cao, and Chunhua Shen. From open set to closed set: Counting objects by spatial divide-and-conquer. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8362–8371, 2019.
[40] Haipeng Xiong and Angela Yao. Discrete-constrained regression for local counting models. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXIV, pages 621–636. Springer, 2022.
[41] Yue Xu, Yong-Lu Li, Jiefeng Li, and Cewu Lu. Constructing balance from imbalance for long-tailed image recognition. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XX, pages 38–56. Springer, 2022.
[42] Yuzhe Yang, Kaiwen Zha, Ying-Cong Chen, Hao Wang, and Dina Katabi. Delving into deep imbalanced regression. In International Conference on Machine Learning (ICML), 2021.
[43] Sihao Yu, Jiafeng Guo, Ruqing Zhang, Yixing Fan, Zizhen Wang, and Xueqi Cheng. A re-balancing strategy for class-imbalanced classification based on instance difficulty. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 70–79, 2022.
[44] Yingying Zhang, Desen Zhou, Siqin Chen, Shenghua Gao, and Yi Ma. Single-image crowd counting via multi-column convolutional neural network. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 589–597, 2016.
[45] Zhisheng Zhong, Jiequan Cui, Shu Liu, and Jiaya Jia. Improving calibration for long-tailed recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16489–16498, 2021.