
Bottom-up Hierarchical Classification
Using Confusion-based Logit Compression

Tong Liang, Jim Davis
Ohio State University
Columbus, Ohio 43210
{liang.693, davis.1719}@osu.edu
      Roman Ilin
AFRL/RYAP
Wright-Patterson Air Force Base, Ohio 45433
[email protected]
Abstract

In this work, we propose a method to efficiently compute label posteriors of a base flat classifier in the presence of few validation examples within a bottom-up hierarchical inference framework. A stand-alone validation set (not used to train the base classifier) is preferred for posterior estimation to avoid overfitting to the base classifier; however, a small validation set limits the number of features one can effectively use. We propose a simple, yet robust, logit vector compression approach based on generalized logits [6] and label confusions for the task of label posterior estimation within the context of hierarchical classification. Extensive comparative experiments with other compression techniques are provided across multiple validation set sizes, and a comparison with related hierarchical classification approaches is also conducted. The proposed approach mitigates the problem of having too few validation examples for reliable posterior estimation while maintaining strong hierarchical classification performance.

1 Introduction

In comparison with flat classification approaches, hierarchical classification commonly utilizes a (semantic) label hierarchy that can be manually defined or automatically derived from the semantic relationships among the set of terminal labels. This enables confident hierarchical predictions at different levels of granularity (e.g., terrier $\rightarrow$ Dog $\rightarrow$ Animal) instead of enforcing flat predictions only at the terminal level with potentially low confidence. Hierarchical classification typically generalizes (bottom-up) or specializes (top-down) the initial prediction based on a given confidence threshold. Thus, estimation of reliable label posteriors is critical.

Proper posterior estimation on the output of a base classifier should employ a validation set so as not to overfit to the base classifier's predictions on the training set [17]. Therefore, the number of available validation examples is of particular importance. Many popular datasets either do not have a separately provided validation set or only have a small number of validation examples. Moreover, for posterior estimation techniques based on the output logits of a neural network base classifier, the associated logit vector dimensionality may be too large relative to the small number of validation examples. To address these issues, we propose a novel compression approach for the logits produced by a pretrained base classifier. The method harnesses relevant label confusion information from terminal predictions of the classifier on a given validation set and compresses the least confusable labels into a single derived logit value. The compressed logit vector is then used directly for label posterior estimation and subsequent hierarchical classification. Overall, the proposed method is a post-processing approach since we are modeling the confusion behavior of the base classifier instead of the inherent semantic relationship across the terminal labels. Therefore, the semantic label hierarchy is only used during inference.

2 Related Work

Hierarchical classification typically employs a label hierarchy in the form of a tree [8, 6, 5, 22, 1, 18] or directed acyclic graph [7, 12] that explicitly injects prior knowledge of the label relationships into the model. The relationship between the labels can be either ‘IS-A’ or not depending on the associated classification problem [1, 8, 7, 6, 5, 22, 18, 3, 2]. One could either define the label hierarchy manually with domain knowledge [1, 12, 18] or derive the hierarchy automatically [8, 6, 5, 7, 22] from a well established semantic lexical database, such as WordNet [16].

There are multiple approaches that combine hierarchical classification with deep learning networks. In [9], a convolutional neural network (CNN) is trained for feature extraction followed by a recurrent neural network to refine the hierarchical predictions. Hierarchical losses are studied in [3, 21, 2] for training a flat CNN classification model to examine the influence of prior knowledge of the label hierarchy expressed in the training loss. Similarly, [7] defines a conditional random field to model posteriors of the labels in a hierarchy, which induces a hierarchical loss for training a common CNN model. In [22], a stochastic tree sampling method is proposed to dynamically share parameters among local classifiers in a label hierarchy trained with features extracted by a CNN backbone, tackling both hierarchical classification and imbalanced data classification problems. Another dynamic model based on a CNN is proposed in [15] that incorporates information of a label hierarchy into the network architecture for semantic segmentation.

In [9, 22, 15], a hierarchical label is predicted in a top-down manner, where any mistakes made by a non-terminal local classifier are propagated downward to the terminal level. A combined model of both flat and top-down hierarchical classifications is proposed to tackle this issue and novelty detection in [14]. The methods proposed by [8, 5, 6] instead post-process a given flat classifier’s terminal predictions by generalizing the predicted terminal labels in a bottom-up fashion through the hierarchy, where the label generalization depends on estimated label posteriors in the hierarchy.

Our proposed method is most related to the bottom-up approaches of [8, 5, 6] where no hierarchical information is used in training the base classifier. In [8], a search algorithm is proposed to balance the trade-off between the specificity of the predictions and the overall hierarchical accuracy. The logit of a one-vs-rest support vector machine for each terminal label is employed for posterior estimation with Platt scaling [19]. The label posterior of a non-terminal 'Super-class' is approximated via summation across all terminal descendants (assuming independence). In [5], a non-parametric histogram binning approach is proposed to approximate posteriors of all labels in the hierarchy, but monotonicity of the estimated posteriors along the ancestral path between a predicted terminal label and the root is not guaranteed. Unlike [8, 5], which use a single logit or softmax value of the associated label for posterior estimation, a tree-based logit vector compression approach is proposed by [6] for label posterior estimation. Our approach employs the generalized logit technique proposed in [6], but does not depend on a label hierarchy and instead leverages label confusions from the base classifier.

3 Method

For a dataset with terminal label set $C$, a given neural network flat classifier is denoted as $f(x)$ with output logit vector $L_{x}\in\mathbb{R}^{|C|}$, i.e., $L_{x}=f(x)$. We refer to this flat classifier $f$ as the 'base classifier'. The proposed generalized hierarchical classifier $h$ serves as a post-processing step after the base classifier, given by

$$(y,\hat{p}_{y})=h(L_{x},H,T)$$ (1)

with label hierarchy $H$ and confidence threshold $T$ ($0\leq T\leq 1$). The output is the label prediction $y$ and its posterior $\hat{p}_{y}$, where $\hat{p}_{y}\geq T$. The prediction $y$ in Eqn. 1 could either be a hierarchical label in $H$ or a subset of terminal labels in $C$ depending on the choice of label generalization.

3.1 Generalized Logit

A generalized logit [6] is a derived logit for an aggregation of terminal label logits from a flat classifier. It assumes the softmax value of a Super-class label is the sum of the softmax values of its terminal descendant labels in the label hierarchy. For instance, a terminal label set $C=\{\textit{bird},\textit{dog},\textit{fish},\textit{cat},\textit{truck},\textit{car}\}$ with a Super-class Vehicle consisting of terminal labels truck and car has a softmax value $s_{\textit{Vehicle}}=s_{\textit{truck}}+s_{\textit{car}}=\sum_{i\in\{\textit{truck},\textit{car}\}}s_{i}$. Expanding $s_{\textit{Vehicle}}$ based on the logit-softmax relationship, we acquire

$$\begin{split}s_{\textit{Vehicle}}&=\frac{e^{l_{\textit{truck}}}}{\sum_{j\in C}e^{l_{j}}}+\frac{e^{l_{\textit{car}}}}{\sum_{j\in C}e^{l_{j}}}\\ &=\frac{\sum_{i\in\{\textit{truck},\textit{car}\}}e^{l_{i}}}{\sum_{i\in\{\textit{truck},\textit{car}\}}e^{l_{i}}+\sum_{j\in C\setminus\{\textit{truck},\textit{car}\}}e^{l_{j}}}\\ &=\frac{e^{\hat{l}_{\textit{Vehicle}}}}{e^{\hat{l}_{\textit{Vehicle}}}+\sum_{j\in C\setminus\textit{Vehicle}}e^{l_{j}}}\end{split}$$ (2)

where $l$ denotes the logit of a terminal label and $\hat{l}$ denotes the generalized logit of a Super-class label. The generalized logit relationship between the Super-class Vehicle and its children is given by $e^{\hat{l}_{\textit{Vehicle}}}=\sum_{i\in\{\textit{truck},\textit{car}\}}e^{l_{i}}$. Therefore, the corresponding logit value of the non-terminal label Vehicle can be derived by $\hat{l}_{\textit{Vehicle}}=\ln(e^{\hat{l}_{\textit{Vehicle}}})=\ln(\sum_{i\in\{\textit{truck},\textit{car}\}}e^{l_{i}})$. In this work, we will adopt this generalized logit formulation to acquire the logit value corresponding to a subset of terminal labels that are least confusing with the argmax predicted label.
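To make this concrete, the following minimal sketch computes a generalized logit as a (numerically stable) log-sum-exp over a chosen set of terminal logits. The label indices and logit values are illustrative, not taken from the paper's experiments.

```python
import numpy as np

def generalized_logit(logits, label_indices):
    """Derived logit for an aggregation of terminal labels:
    log(sum_i exp(l_i)) over the aggregated labels, computed stably."""
    selected = np.asarray(logits)[list(label_indices)]
    m = selected.max()
    return float(m + np.log(np.exp(selected - m).sum()))

# Example: Super-class Vehicle = {truck, car} (indices 4 and 5 are illustrative)
logits = np.array([1.2, 0.3, -0.5, 0.1, 2.0, 1.7])  # bird, dog, fish, cat, truck, car
l_vehicle = generalized_logit(logits, [4, 5])
```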

3.2 Compressed Logits via Label Confusion

Unlike data-driven compression techniques that are usually designed to compress large amounts of data, we are motivated to compress the logit vector due to a lack of validation data relative to a large logit dimensionality. We propose to construct a subset of labels (to be compressed) using the confusion matrix of the base classifier $f$ evaluated on the validation set. For a given predicted label, we aggregate the logit values of its least confused labels (not including the ground truth label) and compress them into a single generalized logit. The intuition is that logit values of the least confusing labels carry little information for posterior estimation. Furthermore, logit values preserve more residual confusion information than softmax values, whose exponential operation 'squashes' the probability mass of the non-argmax classes toward 0.

The columns in the confusion matrix $M$ of $f$ correspond to the predicted labels, and the rows in $M$ are the associated ground truth labels. The aggregation procedure takes $M_{i}$, the $i^{th}$ column of $M$ corresponding to predicted label $i$, and a given compression level $c$ ($1\leq c\leq|C|$) as input, and outputs $Q^{c}_{i}$, a list containing the ground truth label, the $(c-2)$ most confusing labels, and the single subset of the remaining least confusing labels (for label $i$). For a real application scenario, proper selection of the compression level $c$ can be achieved via cross validation using the validation set (see Sect. 4.5).

To acquire $Q^{c}_{i}$, we first apply a stable sort (which retains the order of ties) to the column vector $M_{i}$ in descending order while tracking the corresponding labels (i.e., ground truth row indices in $M_{i}$). Then, we set $Q^{c}_{i}$ to include the first $(c-1)$ individual labels in the sorted list (the ground truth label and the most confusing labels for class $i$). Finally, we aggregate the remaining labels (the 'tail' labels whose logits are to be compressed into a single generalized logit) into one set and append it to $Q^{c}_{i}$. Employing this procedure for each column in $M$, we acquire $Q^{c}$.
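A sketch of this aggregation procedure, assuming a NumPy confusion matrix with rows as ground truth and columns as predicted labels (the function and variable names are our own):

```python
import numpy as np

def build_label_aggregation(M, c):
    """Build Q^c from confusion matrix M (rows = ground truth, cols = predicted)
    for compression level c. Q^c[i] holds the (c-1) labels most often confused
    with predicted label i (including the matching ground-truth label),
    followed by one set of the remaining 'tail' labels."""
    num_labels = M.shape[0]
    Q = {}
    for i in range(num_labels):
        # stable descending sort of column i, tracking row (label) indices
        order = np.argsort(-M[:, i], kind="stable")
        head = list(order[: c - 1])            # kept as individual logits
        tail = set(order[c - 1:].tolist())     # compressed into one generalized logit
        Q[i] = head + [tail]
    return Q
```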

For a given example $x$ and its corresponding logit vector $L_{x}=f(x)$, the initial label hypothesis is given by $\hat{y}=\operatorname{argmax}(L_{x})$. We then select $Q^{c}_{\hat{y}}$ from $Q^{c}$ and use the generalized logit formulation to compress the appropriate least confusing labels. For example, if $C=\{\textit{bird},\textit{dog},\textit{fish},\textit{cat},\textit{truck},\textit{car}\}$ and $\hat{y}=\textit{dog}$, then perhaps $Q^{c}[\hat{y}]=Q^{c}_{\textit{dog}}=\{\textit{dog},\textit{cat},\{\textit{bird},\textit{fish},\textit{truck},\textit{car}\}\}$ with $c=3$. The initial logit vector $L_{x}=[l_{\textit{bird}},l_{\textit{dog}},l_{\textit{fish}},l_{\textit{cat}},l_{\textit{truck}},l_{\textit{car}}]$ would be compressed to $L^{Q^{c}_{\textit{dog}}}_{x}=[l_{\textit{dog}},l_{\textit{cat}},\hat{l}_{\{\textit{bird},\textit{fish},\textit{truck},\textit{car}\}}]$, reducing the dimensionality from 6 to 3. The compressed logit vector $L^{Q^{c}_{\textit{dog}}}_{x}$ will be used in label posterior estimation and inference for the examples initially classified as dog by the base classifier.
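Continuing the running example, a minimal sketch that applies $Q^{c}_{\hat{y}}$ to compress a logit vector, reusing the generalized_logit and build_label_aggregation helpers sketched above (the label indices in the comment are illustrative):

```python
import numpy as np

def compress_logits(logits, Q_i):
    """Compress a logit vector according to Q_i = Q^c[i]: keep individual
    logits for the head labels and replace the tail label set with a single
    generalized logit."""
    compressed = []
    for entry in Q_i:
        if isinstance(entry, set):
            compressed.append(generalized_logit(logits, sorted(entry)))
        else:
            compressed.append(float(logits[entry]))
    return np.array(compressed)

# e.g., with label order [bird, dog, fish, cat, truck, car] and y_hat = dog:
# Q_dog (c = 3) would be [1, 3, {0, 2, 4, 5}], compressing 6 logits to 3.
```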

3.3 Label Posterior Estimation

For the label posterior estimation process, we only need to estimate the posteriors of the terminal labels conditioned on the associated compressed logit vector. The posterior of any Super-class can be acquired by summing over the posteriors of its associated terminal descendants.

The posterior estimator $\hat{p}_{j|i}$ approximates the conditional probability $P(y=j\,|\,\hat{y}=i,L^{Q^{c}_{i}}_{x})$, where $y=j$ denotes the ground truth of example $x$ being label $j$, $\hat{y}=i$ denotes the argmax-selected base prediction for $x$ being label $i$, and $L^{Q^{c}_{i}}_{x}$ is the compressed logit vector with dimension $c$ computed from $L_{x}$. The posterior estimator $\hat{p}_{j|i}$ for any $i,j\in C$ is modeled independently with a logistic sigmoid function given by

$$\hat{p}_{j|i}(L^{Q^{c}_{i}}_{x})=\frac{1}{1+\exp(-W^{T}_{j|i}L^{Q^{c}_{i}}_{x}+B_{j|i})}$$ (3)

where $W_{j|i}\in\mathbb{R}^{c}$ and $B_{j|i}\in\mathbb{R}$ denote the parameters of the posterior estimator. To estimate $W_{j|i}$ and $B_{j|i}$, we employed logistic regression with the Limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm (L-BFGS-B) [27].

During posterior estimation, each terminal label $i$ is assigned its own subset $V_{i}$ of the validation set $V$ based on the initial argmax predictions from the base classifier on $V$, i.e., $V_{i}=\{x\;|\;i=\operatorname{argmax}(L_{x})\ \text{and}\ L_{x}\in V\}$. This argmax-selected validation subset inherently captures the confusing examples directly for the posterior estimation process. A set of posterior estimators for all terminal labels ($\forall j\in C$) conditioned on the same base prediction $i$ is trained with the same subset of validation data $V_{i}$, using the ground truth to indicate whether each example actually belongs to the associated label ($j$). This is done for all combinations of terminal label posteriors ($\forall j\in C$) conditioned on any possible initial base prediction ($\forall i\in C$), leading to a total of $|C|\times|C|$ posterior estimators. In the unlikely case of $V_{i}=\emptyset$, the base classifier's terminal softmax output can be used for the label posteriors.
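A sketch of how the $|C|\times|C|$ estimators could be fit with scikit-learn's L-BFGS-based logistic regression on the argmax-selected validation subsets. The compress_logits helper is from the earlier sketch, the function name is our own, and sklearn's default L2 regularization is a small departure from plain maximum likelihood:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_posterior_estimators(val_logits, val_labels, Q, num_labels):
    """Fit one binary estimator p_hat[j|i] per (base prediction i, target label j).
    val_logits: (N, |C|) validation logit vectors; val_labels: (N,) ground truth."""
    preds = val_logits.argmax(axis=1)
    estimators = {}
    for i in range(num_labels):
        idx = np.where(preds == i)[0]          # argmax-selected subset V_i
        if len(idx) == 0:
            continue                           # fall back to softmax at test time
        X = np.stack([compress_logits(val_logits[n], Q[i]) for n in idx])
        for j in range(num_labels):
            y = (val_labels[idx] == j).astype(int)
            if y.min() == y.max():
                estimators[(i, j)] = float(y.mean())   # constant fall-back
                continue
            # note: sklearn applies L2 regularization by default (C=1.0)
            clf = LogisticRegression(solver="lbfgs", max_iter=1000)
            estimators[(i, j)] = clf.fit(X, y)
    return estimators
```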

For inference with a test example $x$, we acquire the original logit vector from the base classifier, $L_{x}=f(x)$, and use its base prediction label $\hat{y}=\operatorname{argmax}(L_{x})$ to retrieve the label aggregation result $Q^{c}_{\hat{y}}$. Next, we compress its logit vector from $L_{x}$ to $L^{Q^{c}_{\hat{y}}}_{x}$. Lastly, the estimated terminal label posterior value $\hat{p}_{i}$ for any terminal label $i\in C$ (previously learned using validation set $V_{\hat{y}}$) is given by

$$\hat{p}_{i}=\frac{\hat{p}_{i|\hat{y}}(L^{Q^{c}_{\hat{y}}}_{x})}{\sum_{j\in C}\hat{p}_{j|\hat{y}}(L^{Q^{c}_{\hat{y}}}_{x})}$$ (4)

The L1-normalization mitigates numerical imprecision and ensures the probabilities across the terminal labels sum to 1. At this point, we could either return the terminal label with the largest posterior (flat classification) or perform hierarchical inference.

3.4 Bottom-Up Hierarchical Classification

The proposed hierarchical inference approach starts by checking whether $\hat{p}_{\hat{y}}$ surpasses the given confidence threshold $T$. If so, $\hat{y}$ is returned as the prediction result along with $\hat{p}_{\hat{y}}$. Otherwise, we set our label hypothesis to the parent label of $\hat{y}$ using the semantic label hierarchy $H$. We compute the label posterior of the non-terminal label hypothesis via summation of its associated terminal descendant posteriors and check if this posterior surpasses the confidence threshold. This upward label generalization process is repeated until we find a label on the ancestral path of $\hat{y}$ with posterior $\geq T$. As the label posterior of any non-terminal label is computed from the summation of its L1-normalized terminal descendant posteriors, monotonically non-decreasing posterior behavior upward through the hierarchy is ensured (not guaranteed in [5]). The overall hierarchical classification approach is provided in Algorithm 1.

Input:
    set of terminal labels $C$
    semantic label hierarchy $H$
    base flat classifier $f$
    user-defined confidence threshold $T$
    label posterior estimators $\hat{p}_{j|i}$ for $i, j \in C$
    confusion aggregation $Q^{c}$ with given compression level $c$
    testing example $x$
Output:
    hierarchical prediction and its estimated label posterior $(y, \hat{p}_{y})$
# get original logits and label hypothesis
$L_{x} = f(x)$
$\hat{y} = \operatorname{argmax}(L_{x})$
# convert to compressed logits
$Q^{c}_{\hat{y}} = Q^{c}[\hat{y}]$
$L^{Q^{c}_{\hat{y}}}_{x} = CompressLogits(L_{x}, Q^{c}_{\hat{y}})$
# get estimated terminal label posteriors
$p_{t} = zeros([len(L_{x}), 1])$
for $j \in C$ do
    $p_{t}[j] = \hat{p}_{j|\hat{y}}(L^{Q^{c}_{\hat{y}}}_{x})$
end for
# L1-normalization
$p_{t} = p_{t} / sum(p_{t})$
# iteratively check labels on ancestral path from $\hat{y}$ to root of $H$
$\hat{p}_{y} = 0$
$path = AncestralPath(\hat{y}, H)$
for $k$ in $path$ do
    $td = TerminalDescendants(k, H)$
    $\hat{p}_{k} = sum(p_{t}[td])$
    if $\hat{p}_{k} \geq T$ then
        $y = k$
        $\hat{p}_{y} = \hat{p}_{k}$
        return $(y, \hat{p}_{y})$
    end if
end for
Algorithm 1: Bottom-up Hierarchical Classification
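A Python sketch of Algorithm 1, reusing compress_logits and the fitted estimators from the earlier sketches. The hierarchy interface (ancestral_path, terminal_descendants) is assumed for illustration, and the softmax fall-back for an empty $V_{i}$ is omitted for brevity:

```python
import numpy as np

def eval_estimator(est, x):
    """Evaluate a fitted logistic estimator (or a stored constant fall-back)."""
    if isinstance(est, float):
        return est
    return est.predict_proba(x.reshape(1, -1))[0, 1]

def hierarchical_predict(x, f, H, T, Q, estimators, num_labels):
    """Bottom-up hierarchical inference (sketch of Algorithm 1)."""
    L_x = f(x)                                   # original logits
    y_hat = int(np.argmax(L_x))                  # base label hypothesis
    L_c = compress_logits(L_x, Q[y_hat])         # compressed logits

    # estimated terminal posteriors conditioned on y_hat, then L1-normalized (Eqn. 4)
    p_t = np.array([eval_estimator(estimators[(y_hat, j)], L_c)
                    for j in range(num_labels)])
    p_t /= p_t.sum()

    # walk up the ancestral path (y_hat, parent, ..., root) until posterior >= T
    for node in H.ancestral_path(y_hat):
        p_node = p_t[list(H.terminal_descendants(node))].sum()
        if p_node >= T:
            return node, float(p_node)
```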

4 Experiments

We first compared our logit compression method to alternative compression approaches across multiple validation set sizes within hierarchical classification tasks. We selected two ideal datasets that have a very large number of validation examples to fully explore and compare the approaches. CINIC-10 [4] has 10 terminal labels with 9K examples each for training, validation, and testing per label. Places-20 is derived from the training set of Places365-Challenge [26]; it consists of 20 terminal labels and has 20K training, 10K validation, and 10K testing examples per label. In our experiments, we examined the hierarchical classification performance and posterior calibration quality as the number of validation examples is reduced.

For all of the experiments on CINIC-10, we used a base flat classifier with a ResNet-20 architecture [10] trained to 82% test accuracy. For Places-20, a WideResNet-18 [25] base classifier trained to 80% test accuracy was employed. The label hierarchy for CINIC-10 was adopted from [6]. The label hierarchy for Places-20 is a subtree derived from the full label hierarchy of Places365-Challenge [26] and is shown in Fig. 1.

4.1 Validation Set

We seek to evaluate our approach on different sizes of the validation set to examine its robustness against fewer examples. For CINIC-10, we randomly shuffled the original validation set $V$ in a class-wise manner and then split it into eight validation sets ($V^{i}$, $i=1,\ldots,8$) with decreasing sizes of 9K (original size), 5K, 1K, 500, 250, 100, 50, and 25 validation examples per label. Similar shuffling and splitting was applied to the validation set of Places-20 to acquire validation sets with decreasing sizes of 10K (original size), 5K, 1K, 500, 250, 100, 50, and 25 validation examples per label.

4.2 Alternative Compression Techniques

We compared our logit compression approach with the following alternative compression methods:

  • Global PCA compression. A global PCA transformation is computed using all of the logit data from a given validation set. Logit vectors are then compressed (projected onto the retained principal components) to the specified dimensionality.

  • Local PCA compression. A local PCA transformation is associated with each terminal label $i\in C$ and computed using the validation examples initially predicted as terminal label $i$. The base prediction for any new input is used to index the corresponding local PCA transformation for logit compression to the specified dimensionality.

  • Tree-based compression. The tree-based logit compression method proposed by [6] reduces the original logit vector to a smaller set containing the logit of the base prediction label and the (generalized) logits of the nodes/labels branching off the ancestral path of the base prediction. The compressed logit vector size is dependent on both the given tree structure and the associated base prediction.

For all compression methods, we used the same label posterior estimation and hierarchical classification procedures introduced in Sect. 3.3-3.4.

Figure 1: Label hierarchy for Places-20.

4.3 Evaluation Metrics

We employed the bottom-up hierarchical evaluation metrics from [5, 6] that focus on the two subsets of test examples that were either correctly ($S_{c}$) or incorrectly ($S_{ic}$) classified by the base classifier $f$, along with a measure that correlates with their label depth in the tree:

  • C-persist, C-corrupt, and C-withdrawn are the fractions of the correct base predictions ($S_{c}$) that are unchanged as terminal labels, corrupted to incorrect predictions, or assigned to the whole set of terminal labels $C$ (i.e., the root Unknown), respectively. Unlike [8], our hierarchical classification approach does not corrupt any base predictions (C-corrupt = 0). C-soft is the remaining fraction in $S_{c}$ generalized to a non-root Super-class that has the ground truth label as a descendant.

  • IC-persist and IC-withdrawn are similar to C-persist and C-withdrawn but for the incorrect base predictions ($S_{ic}$). IC-reform is the fraction of base predictions in $S_{ic}$ that are generalized to a correct non-root label. IC-remain is the remaining fraction of examples in $S_{ic}$ generalized to a non-root Super-class that does not have the ground truth as a descendant.

  • Avg-sIG is the average scaled Information Gain, adapted from [8]’s metric of Information Gain (IG). Scaled Information Gain (sIG) is IG normalized with the maximum possible IG for a given terminal label set, and Avg-sIG is the average sIG across all test examples and correlates with the average label depth. Avg-sIG is given by

    $$\text{Avg-sIG}=\frac{1}{|S|}\sum_{x\in S}\frac{\log_{2}|C|-\log_{2}|y|}{\log_{2}|C|}\,\mathbb{1}_{\{t_{x}\in y\}}$$ (5)

    where $y$ is the final prediction of input $x$, $|y|$ is the number of terminal descendants of $y$, $t_{x}$ is the ground truth of test example $x$, $S=S_{c}\cup S_{ic}$, and $\mathbb{1}_{\{\cdot\}}$ is the indicator function. An incorrect prediction has an Information Gain of 0, as defined in [8] (a sketch of this computation is given after the list).
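A sketch of the Avg-sIG computation, assuming each final prediction is represented as the set of terminal labels it covers (a terminal prediction is a singleton set, and the root covers all of $C$):

```python
import numpy as np

def avg_scaled_information_gain(predictions, ground_truths, num_terminals):
    """Avg-sIG (Eqn. 5): scaled Information Gain averaged over all test examples.
    Each prediction is the set of terminal labels covered by the final label y."""
    total = 0.0
    for y, t in zip(predictions, ground_truths):
        if t in y:  # incorrect predictions contribute an Information Gain of 0
            total += (np.log2(num_terminals) - np.log2(len(y))) / np.log2(num_terminals)
    return total / len(predictions)
```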

We adopted the Technique for Order Preference by Similarity to Ideal Solution (TOPSIS) [11] as an overall evaluation metric that unifies the 9 individual hierarchical metrics for comparison across the alternative approaches. TOPSIS is a commonly used method for multiple criteria decision analysis. It treats the multi-criteria evaluation results (e.g., the 9 hierarchical metrics listed above) of an approach as a vector $a$ and computes the L2-distance to the best alternative ($b$) and worst alternative ($w$) vectors. The final TOPSIS score for $a$ ranges from zero to one (larger is better) and is given by

$$s_{a}=\frac{d_{aw}}{d_{aw}+d_{ab}}$$ (6)

where $d_{ab}$ and $d_{aw}$ are the L2-distances from $a$ to the best ($b$) and worst ($w$) alternatives, respectively.

For the best case scenario ($b$), C-persist = IC-reform = Avg-sIG = 1, and the remaining criteria/metrics are 0. For the worst case scenario ($w$), C-corrupt = IC-persist = 1, and the remaining criteria are 0. We note that the metric/criterion values are typically rescaled to the range between 0 and 1 and can be weighted before computing the L2-distances. Our hierarchical metrics are naturally defined within the same range and are treated equally.
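A minimal sketch of the TOPSIS score computation for the 9 equally weighted hierarchical metrics; the ordering of the criteria in the comment is our own choice for illustration:

```python
import numpy as np

def topsis_score(metrics, best, worst):
    """TOPSIS score (Eqn. 6) for criteria values already scaled to [0, 1]
    and equally weighted."""
    a, b, w = map(np.asarray, (metrics, best, worst))
    d_ab = np.linalg.norm(a - b)   # distance to best alternative
    d_aw = np.linalg.norm(a - w)   # distance to worst alternative
    return d_aw / (d_aw + d_ab)

# Best/worst alternatives following the text, with criteria ordered as
# [C-persist, C-corrupt, C-withdrawn, C-soft,
#  IC-persist, IC-withdrawn, IC-reform, IC-remain, Avg-sIG]
best  = [1, 0, 0, 0, 0, 0, 1, 0, 1]
worst = [0, 1, 0, 0, 1, 0, 0, 0, 0]
```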

Lastly, we evaluated the calibration quality of the estimated label posteriors with the Expected Calibration Error (ECE) [17], as the inference approach is highly dependent on the label posteriors. ECE is a scalar statistical summary of the quality of the estimated label posteriors. It is evaluated through histogram binning of the posterior values, comparing the average of the estimated label posteriors in each bin to the associated precision of the hierarchical predictions in the bin. Since our hierarchical predictions are acquired under a given confidence threshold, we apply the histogram binning for ECE to the probability range between the confidence threshold $T$ and 1 (with 10 bins), and only to predictions that are not withdrawn/generalized to the root of the hierarchy (which always has a posterior of 1). Lower values of this metric are preferred.
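A sketch of the ECE computation as described above, binning only the non-withdrawn predictions over $[T, 1]$ with 10 equal-width bins (the array names are our own):

```python
import numpy as np

def expected_calibration_error(posteriors, correct, T, num_bins=10):
    """ECE over [T, 1]. `posteriors` are the estimated posteriors of the
    non-withdrawn hierarchical predictions; `correct` marks whether each
    prediction contains the ground truth label."""
    posteriors = np.asarray(posteriors)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(T, 1.0, num_bins + 1)
    ece, n = 0.0, len(posteriors)
    for k in range(num_bins):
        lo, hi = edges[k], edges[k + 1]
        if k < num_bins - 1:
            mask = (posteriors >= lo) & (posteriors < hi)
        else:
            mask = (posteriors >= lo) & (posteriors <= hi)
        if mask.any():
            gap = abs(posteriors[mask].mean() - correct[mask].mean())
            ece += (mask.sum() / n) * gap
    return ece
```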

Figure 2: Comparison of compression techniques (Proposed, Global PCA, Local PCA, Tree-Based [6]) on (a) CINIC-10 and (b) Places-20 with the normal semantic tree.
Figure 3: Comparison of compression techniques (Proposed, Global PCA, Local PCA, Tree-Based [6]) on Places-20 with shuffled trees.
Figure 4: Comparison of compression techniques (Proposed, Global PCA, Local PCA) on (a) CINIC-10 and (b) Places-20 with no tree label generalization.

4.4 Results

We compared the results on CINIC-10 and Places-20 under the various compression techniques with decreasing sizes of the validation set and report the TOPSIS and ECE scores of each method using the confidence threshold $T=0.9$. We show the TOPSIS and ECE scores of each method corresponding to the compression level giving the best TOPSIS results (ranging from 1 to 10). To automatically select the compression level in an application scenario, cross validation could be used (demonstrated in Sect. 4.5). We note that the tree-based compression method does not support different choices of compression level, as it is specified by the tree itself.

The comparison results for experiments on CINIC-10 and Places-20 are shown in Fig. 2. The left y-axis of each subplot denotes the TOPSIS score (red), and the right y-axis denotes the associated ECE score (blue). The x-axis of each subplot denotes the size of the validation set used to acquire the associated TOPSIS and ECE scores. Axes of all subplots share the same range for easier visual comparison. We only show the results derived from the smaller validation sets ($V^{3}$–$V^{8}$), since all methods stabilized with any further increase of the validation data.

Notably, the TOPSIS and ECE results derived from our proposed compression method and the tree-based compression approach are better than both PCA-based approaches when the validation set is small (i.e., 25, 50, or 100 examples per label). The results of our method are better than the tree-based approach on CINIC-10 and on par with it for Places-20. This suggests that our compression approach derived from the confusion matrix of the base classifier is comparable to the tree-based compression approach, which has injected knowledge of the semantic label relationships. Both methods employ extra knowledge (classifier confusions or label semantics) and perform better than the purely data-driven PCA-based approaches when the validation set is small. Our method reached the highest TOPSIS score before saturation among the methods. Similar saturation is also observed for the ECE scores, where the tree-based compression method reached saturation first, with the lowest ECE score, followed closely by our compression approach. However, our method had the best ECE score for the smallest validation set on CINIC-10.

Since all four methods in the comparison rely on a given semantic hierarchy for their hierarchical classification, we next conducted experiments with the same settings but randomly corrupted semantic hierarchies to study the influence of the tree itself. We applied a random shuffling of the terminal label positions within each hierarchy while keeping the tree structure fixed. The results for 3 random terminal shufflings of the Places-20 hierarchy are shown in Fig. 3. Comparing the results from the original label tree with the shuffled trees, the semantic information of the label relationships provided by the hierarchy does indeed play an important role in hierarchical classification, as all methods show degraded performance when an improper tree is used. Notably, the ECE scores for our approach and the tree-based method are still better than the PCA-based approaches for smaller validation sets. Similar results were seen with shuffled trees on CINIC-10.

A similar comparative experiment among these methods without the use of a hierarchy was also conducted. The tree-based compression approach is not applicable here due to its requirement of a tree, so it is excluded from this comparison. We employed a basic label generalization method that takes the estimated label posteriors of all terminal labels, sorts them in descending order, and then sums the sorted posteriors until reaching the given confidence threshold $T$. The set of aggregated terminal labels is then output as a (generalized) hierarchical prediction (i.e., a subset of terminal labels in $C$), as sketched below. Our evaluation metrics are still applicable to these set-based predictions. The results on CINIC-10 and Places-20 are shown in Fig. 4. Comparing the results to Fig. 2, the TOPSIS and ECE scores for each approach without using a tree are worse for smaller validation sets, though the proposed approach is again better than the PCA methods. Additionally, the TOPSIS scores do not saturate as much as before (when employing a tree for inference).
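A minimal sketch of this tree-free label generalization, assuming p_t is the L1-normalized terminal posterior vector from Eqn. 4:

```python
import numpy as np

def generalize_without_tree(p_t, T):
    """Accumulate terminal posteriors in descending order until the confidence
    threshold T is reached; return the label subset and its summed posterior."""
    order = np.argsort(-p_t, kind="stable")
    cum = np.cumsum(p_t[order])
    k = min(int(np.searchsorted(cum, T)) + 1, len(p_t))  # smallest prefix reaching T
    labels = set(order[:k].tolist())
    return labels, float(cum[k - 1])
```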

configuration                   original     worst       best
25 val. examples per class      .633 (10)    .630 (1)    .690 (2)
50 val. examples per class      .650 (10)    .644 (1)    .701 (3)
100 val. examples per class     .660 (10)    .656 (1)    .711 (3)
250 val. examples per class     .673 (10)    .670 (1)    .721 (6)
Table 1: TOPSIS scores and the number of compressed logits used for hierarchical classification at the 90% confidence threshold. The number in parentheses is the associated compression level used.

We conducted an ablation study on the influence of the proposed confusion-based compression approach by comparing the associated hierarchical classification performance with and without the proposed compression method on CINIC-10 when the validation set is small (i.e., 25, 50, 100, and 250 examples per label). For the proposed approach, we show its best and worst TOPSIS scores across compression levels ranging from 1 to 9 in Table 1. In comparison, we also show the TOPSIS score of hierarchical classification using the original logit features with no compression. The results clearly show that the proposed compression approach improves the hierarchical classification performance significantly in comparison with using the original logits without compression. The results also indicate that a smaller validation set will likely require a smaller compression level to reach better performance.

4.4.1 Discussion

The use of a label hierarchy appears to have a regularizing effect on the inference procedure. The semantic relationship of labels is therefore helpful during inference when the given validation set is relatively small, but tree-based inference becomes more of a constraint and limits the performance (as compared with label generalization without a tree) when a sufficient amount of validation examples is available. Therefore, the proposed overall hierarchical classification framework with compressed logits is most suitable when relatively few validation examples are given or when a label hierarchy is unavailable.

4.5 Scalability

We next compared our proposed framework against the related bottom-up hierarchical classification methods of [8], [5], and [6] (described in Sect. 2) under the same confidence threshold $T=0.9$. The experiments were conducted on Fashion-MNIST [23] (10 terminal labels), CIFAR-100 [13] (100 terminal labels), and ImageNet-Animal [5] (398 terminal labels), with increasing scale of the dataset and the related label hierarchy (provided in [6]). We applied a similar random partitioning on the original test sets of Fashion-MNIST and CIFAR-100 to obtain the associated validation and test sets. The validation and test sets for ImageNet-Animal were acquired by a class-wise random 1:1 partition of the original validation set, since no test set is given. The validation sets of all three datasets were balanced with 500, 50, and 25 validation examples per terminal label for Fashion-MNIST, CIFAR-100, and ImageNet-Animal, respectively. A model based on VGGNet [20] was used for Fashion-MNIST, a ResNeXt model [24] was trained and used for CIFAR-100, and a pretrained ResNet-152 model [10] was employed as the base classifier for ImageNet-Animal.

For [8], the single logit feature of each label from the base (CNN) classifier was used to produce the hierarchical predictions instead of the SVM logits used in the original work. For [5], we employed the setting of 10 histogram bins per label. To automatically determine the proper compression level for our method on each dataset, we applied class-wise Leave-One-Out Cross Validation (LOOCV) on each validation set. The search range was from 1 to the minimum of $|C|$ and the number of validation examples per label. The compression level that led to the highest TOPSIS score for each dataset was 5, 9, and 12 for Fashion-MNIST, CIFAR-100, and ImageNet-Animal, respectively. These compression levels were employed by our approach to produce the final hierarchical classification results.

The TOPSIS scores corresponding to each method are shown in Table 2, where 'Base' refers to the flat classification results of the associated base classifiers. Our proposed approach reached the highest TOPSIS score on all 3 datasets. Across the datasets, as the number of validation examples decreased (Fashion-MNIST > CIFAR-100 > ImageNet-Animal), the number of terminal labels increased. With this inverse relationship, the conditions for posterior estimation worsen. Nevertheless, our proposed approach demonstrated robustness in each case.

The non-parametric histogram binning approach [5] demonstrated good performance, closest to our proposed method across the datasets, though monotonicity of label posteriors along the hierarchy is not guaranteed by this approach. The lower performance of [8] was due to its consistently high IC-persist, indicating that more of the original misclassifications made by the base classifier were not corrected in their hierarchical process. The lower TOPSIS scores of the tree-based compression approach [6] on CIFAR-100 and ImageNet-Animal were due to a higher overall amount of label generalization during inference (i.e., high C-soft and correspondingly low Avg-sIG). We found that using argmax-selected validation data (as employed in the proposed approach) for [6] would significantly improve its performance, though [6] still requires the use of a specified hierarchy for both compression and inference, unlike the proposed approach.

Dataset                   Base     [8]      [5]      [6]      Ours
Fashion-MNIST (10)        .544     .615     .844     .821     .852
CIFAR-100 (100)           .534     .690     .735     .672     .744
ImageNet-Animal (398)     .537     .697     .715     .569     .720
Table 2: TOPSIS scores of hierarchical classification results at the 90% confidence threshold. The number in parentheses indicates the number of terminal labels.

5 Conclusion

In the context of hierarchical classification where posterior estimation must be performed from limited validation examples, we proposed a robust and effective logit vector compression approach that preserves as much useful information as possible in the original logit vector from the perspective of label confusions. By examining the confusion matrix of a given flat classifier, we constructed a subset of the terminal labels that are least confusing with respect to each predicted terminal label and employed the generalized logit formulation [6] to compress the associated logits. We compared the proposed approach to data-driven compression techniques and a tree-based compression technique on the CINIC-10 and Places-20 datasets under (a) different sizes of the validation set, (b) randomization of the label hierarchy, and (c) absence of a label hierarchy. The proposed approach demonstrated strong performance across these evaluations and maintained robust performance in extended comparisons with related hierarchical classification approaches on Fashion-MNIST, CIFAR-100, and ImageNet-Animal, scaling from small to large.

Acknowledgement

This research was supported by the U.S. Air Force Research Laboratory under Contract #GRT00054740.

References

  • [1] Jayme Barbedo and Lopes Amauri. Automatic Genre Classification of Musical Signals. EURASIP Journal on Advances in Signal Processing, 2007, 01 2007.
  • [2] Luca Bertinetto, Romain Mueller, Konstantinos Tertikas, Sina Samangooei, and Nicholas A. Lord. Making better mistakes: Leveraging class hierarchies with deep networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
  • [3] Alsallakh Bilal, Amin Jourabloo, Mao Ye, Xiaoming Liu, and Liu Ren. Do convolutional neural networks learn class hierarchy? IEEE Transactions on Visualization and Computer Graphics, 24(1):152–162, 2018.
  • [4] Luke N. Darlow, Elliot J. Crowley, Antreas Antoniou, and Amos J. Storkey. CINIC-10 is not ImageNet or CIFAR-10, 2018.
  • [5] Jim Davis, Tong Liang, James Enouen, and Roman Ilin. Hierarchical Semantic Labeling with Adaptive Confidence. In the International Symposium on Visual Computing, 2019.
  • [6] Jim Davis, Tong Liang, James Enouen, and Roman Ilin. Hierarchical Classification with Confidence Using Generalized Logits. In the International Conference on Pattern Recognition, 2020.
  • [7] Jia Deng, Nan Ding, Yangqing Jia, Andrea Frome, Kevin Murphy, Samy Bengio, Yuan Li, Hartmut Neven, and Hartwig Adam. Large-Scale Object Classification Using Label Relation Graphs. In European Conference on Computer Vision (ECCV), volume 8689, 2014.
  • [8] Jia Deng, Jonathan Krause, Alexander Berg, and Li Fei-Fei. Hedging Your Bets: Optimizing Accuracy-Specificity Trade-offs in Large Scale Visual Recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI, USA, 06 2012.
  • [9] Yanming Guo, Yu Liu, Erwin Bakker, Yuanhao Guo, and Michael Lew. CNN-RNN: A Large-scale Hierarchical Image Classification Framework. Multimedia Tools and Applications, 12 2017.
  • [10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 06 2016.
  • [11] Ching-Lai Hwang and Kwangsun Yoon. Methods for Multiple Attribute Decision Making. Springer, 1981.
  • [12] Bo Jin, Brian Muller, Chengxiang Zhai, and Xinghua Lu. Multi-label Literature Classification based on the Gene Ontology Graph. BMC bioinformatics, 9:525, 01 2009.
  • [13] Alex Krizhevsky. Learning Multiple Layers of Features from Tiny Images. University of Toronto, 05 2012.
  • [14] Kibok Lee, Kimin Lee, Kyle Min, Yuting Zhang, Jinwoo Shin, and Honglak Lee. Hierarchical novelty detection for visual object recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [15] Xiaodan Liang, Hongfei Zhou, and Eric Xing. Dynamic-structured Semantic Propagation Network. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [16] George A. Miller. WordNet: A Lexical Database for English. Commun. ACM, 38(11):39–41, Nov. 1995.
  • [17] Mahdi Pakdaman Naeini, Gregory F. Cooper, and Milos Hauskrecht. Obtaining Well Calibrated Probabilities Using Bayesian Binning. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence, AAAI’15, page 2901–2907. AAAI Press, 2015.
  • [18] Neeru Narang, Michael Martin, Dimitris Metaxas, and Thirimachos Bourlai. Learning Deep Features for Hierarchical Classification of Mobile Phone Face Datasets in Heterogeneous Environments. In IEEE International Conference on Automatic Face & Gesture Recognition, pages 186–193, 05 2017.
  • [19] John Platt. Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. Advances in Large Margin Classifiers, 10, 06 2000.
  • [20] Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. In International Conference on Learning Representations, 2015.
  • [21] Cinna Wu, M. Tygert, and Y. LeCun. A Hierarchical Loss and Its Problems when Classifying Non-hierarchically. PLoS ONE, 14, 2019.
  • [22] Tz-Ying Wu, Pedro Morgado, Pei Wang, Chih-Hui Ho, and Nuno Vasconcelos. Solving Long-tailed Recognition with Deep Realistic Taxonomic Classifier. In European Conference on Computer Vision (ECCV), 07 2020.
  • [23] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms, 2017.
  • [24] Saining Xie, Ross Girshick, Piotr Dollar, Z. Tu, and Kaiming He. Aggregated Residual Transformations for Deep Neural Networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5987–5995, 07 2017.
  • [25] Sergey Zagoruyko and Nikos Komodakis. Wide Residual Networks. In Edwin R. Hancock Richard C. Wilson and William A. P. Smith, editors, Proceedings of the British Machine Vision Conference (BMVC), pages 87.1–87.12. BMVA Press, September 2016.
  • [26] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million Image Database for Scene Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
  • [27] Ciyou Zhu, Richard H. Byrd, Peihuang Lu, and Jorge Nocedal. Algorithm 778: L-BFGS-B: Fortran Subroutines for Large-Scale Bound-Constrained Optimization. ACM Transactions on Mathematical Software, 23(4):550–560, Dec. 1997.