
Deep Neural Networks for Rank-Consistent Ordinal Regression Based On Conditional Probabilities

Xintong Shi1  Wenzhi Cao1  Sebastian Raschka1,2
1Department of Statistics, University of Wisconsin-Madison  2Lightning AI
Corresponding author: [email protected]
Abstract

In recent years, deep neural networks have achieved outstanding predictive performance on various classification and pattern recognition tasks. However, many real-world prediction problems have ordinal response variables, and this ordering information is ignored by conventional classification losses such as the multi-category cross-entropy. Ordinal regression methods for deep neural networks address this shortcoming. One such method is CORAL, which is based on an earlier binary label extension framework and achieves rank consistency among its output layer tasks by imposing a weight-sharing constraint. However, while earlier experiments showed that CORAL's rank consistency is beneficial for performance, the weight-sharing constraint in the fully connected output layer may restrict the expressiveness and capacity of a network trained with CORAL. We propose a new method for rank-consistent ordinal regression without this limitation. Our rank-consistent ordinal regression framework (CORN) achieves rank consistency through a novel training scheme that uses conditional training sets to obtain the unconditional rank probabilities by applying the chain rule for conditional probability distributions. Experiments on various datasets demonstrate the efficacy of the proposed method in utilizing the ordinal target information, and the absence of the weight-sharing restriction improves the performance substantially compared to the CORAL reference approach. Additionally, the proposed CORN method is not tied to any specific architecture and can be used to train any deep neural network classifier for ordinal regression tasks.

1 Introduction

Many real-world prediction tasks involve ordinal target labels. Popular examples of such ordinal tasks are customer ratings (e.g., a product rating system from 1 to 5 stars) and medical diagnoses (e.g., disease severity labels such as none, mild, moderate, and severe). While we can apply conventional classification losses, such as the multi-category cross-entropy, to such problems, they are suboptimal since they ignore the intrinsic order among the ordinal targets. For example, for a patient with severe disease status, predicting none and moderate would incur the same loss even though the difference between none and severe is more significant than the difference between moderate and severe. Moreover, unlike in metric regression, we cannot quantify the distance between the ordinal ranks. For instance, the difference between a disease status of none and mild cannot be quantitatively compared to the difference between mild and moderate. Hence, ordinal regression (also called ordinal classification or ranking learning) can be considered as an intermediate problem between classification and regression.

Among the most common machine learning-based approaches to ordinal regression is Li and Lin’s extended binary classification framework [9] that was adopted for deep neural networks by Niu et al. in 2016 [14]. In this work, we solve the rank inconsistency problem (Fig. 1) of this ordinal regression framework without imposing constraints that could limit the expressiveness of the neural network and without substantially increasing the computational complexity.

The contributions of our paper are as follows:

  1. A new rank-consistent ordinal regression framework, CORN (Conditional Ordinal Regression for Neural Networks), based on the chain rule for conditional probability distributions;

  2. Rank consistency guarantees without imposing the weight-sharing constraint used in the CORAL reference framework [1];

  3. Experiments with different neural network architectures and datasets showing that CORN's removal of the weight-sharing constraint improves the predictive performance compared to the more restrictive reference framework.


Figure 1: Illustration of the difference between rank-consistent and rank-inconsistent methods.

2 Related Work

2.1 Ordinal Regression Based on Extended Binary Classification Subtasks

Ordinal regression is a classic problem in statistics, going back to early proportional hazards and proportional odds models [13]. To take advantage of well-studied and well-tuned binary classifiers, the machine learning field developed ordinal regression methods based on extending the rank prediction to multiple binary label classification subtasks [9]. This approach relies on three steps: (1) extending rank labels to binary vectors, (2) training binary classifiers on the extended labels, and (3) computing the predicted rank label from the binary classifiers. Modified versions of this approach have been proposed in connection with perceptrons [3] and support vector machines [22, 17, 2]. In 2007, Li and Lin presented a reduction framework unifying these extended binary classification approaches [9].

2.2 Addressing Rank Consistency in Neural Networks for Ordinal Regression

In 2016, Niu et al. adapted Li and Lin's extended binary classification framework to train deep neural networks for ordinal regression [14]; we refer to this method as OR-NN. Across different image datasets, OR-NN was able to outperform other reference methods. However, Niu et al. pointed out that OR-NN suffers from rank inconsistencies among the binary tasks and that addressing this limitation might raise the training complexity substantially. Cao et al. [1] recently addressed this rank inconsistency limitation via the CORAL method. To avoid increasing the training complexity, CORAL achieves rank consistency by imposing a weight-sharing constraint in the last layer, such that the binary classifiers only differ in their bias units. However, while CORAL outperformed the OR-NN method across several face image datasets for age prediction, the weight-sharing constraint may impose a severe limitation in terms of the functions that the neural network can approximate. In this paper, we investigate an alternative approach that guarantees rank consistency without increasing the training complexity or restricting the neural network's expressiveness and capacity.

2.3 Other Neural Network-Based Methods for Ordinal Regression

Several deep neural networks for ordinal regression do not build on the extended binary classification framework. These methods include Zhu et al.'s [25] convolutional ordinal regression forest for image data, which combines a convolutional neural network with differentiable decision trees. Diaz and Marathe [5] proposed a soft ordinal label representation obtained from a softmax layer, which can be used for scenarios where interclass distances are known. Another method that does not rely on the extended binary classification framework is Suarez et al.'s distance metric learning algorithm [24]. Petersen et al. [16] developed a method based on differentiable sorting networks that use pairwise swapping operations with relaxed sorting operations, which can be used for ranking where the relative ordering is known but the absolute target values are unknown. Liu et al. adapted pairwise ranking constraints from RankingSVM [7] to reformulate the multi-category loss as a constrained optimization problem for ordinal regression [10].

This paper focuses on addressing the rank inconsistency of OR-NN without imposing the weight-sharing constraint of CORAL [1], which is why an exhaustive study of the methods mentioned above is outside the scope of this paper. However, additional experiments and comparisons with SORD [5] and CNNPOR [10] are included in the Supplementary Material in section Comparison with Other Deep Learning Methods for Ordinal Regression.

3 Proposed Method

This section describes the details of our CORN method, which addresses the rank inconsistency in Niu et al.’s OR-NN [14] without requiring CORAL’s [1] weight-sharing constraint.

3.1 Preliminaries

Let D=\left\{\mathbf{x}^{[i]},y^{[i]}\right\}_{i=1}^{N} denote a dataset for supervised learning consisting of N training examples, where \mathbf{x}^{[i]}\in\mathcal{X} denotes the inputs of the i-th training example and y^{[i]} its corresponding class label. In an ordinal regression context, we refer to y^{[i]} as the rank, where y^{[i]}\in\mathcal{Y}=\{r_{1},r_{2},...,r_{K}\} with rank order r_{K}\succ r_{K-1}\succ\ldots\succ r_{1}. The objective of an ordinal regression model is then to find a mapping h:\mathcal{X}\rightarrow\mathcal{Y} that minimizes a loss function L(h).

3.2 Motivation

With CORAL, Cao et al. [1] proposed a deep neural network for ordinal regression that addressed the rank inconsistency of Niu et al.’s OR-NN [14], and experiments showed that addressing rank consistency had a positive effect on predictive performance.

Both CORAL and OR-NN build on an extended binary classification framework [9], where the rank labels are recast into a set of binary tasks, such that y^{[i]}_{k}\in\{0,1\} indicates whether y^{[i]} exceeds rank r_{k}. The label predictions are then obtained via h\left(\mathbf{x}^{[i]}\right)=r_{q}, where q\in\{1,2,...,K\} is the rank index, which is computed as

q=1+\sum_{k=1}^{K-1}\mathbbm{1}\left\{f_{k}\left(\mathbf{x}^{[i]}\right)>0.5\right\}. (1)

Here, f_{k}\left(\mathbf{x}^{[i]}\right)\in[0,1] is the probability prediction of the k-th binary classifier in the output layer, and \mathbbm{1}\{\cdot\} is an indicator function that returns 1 if the inner condition is true and 0 otherwise.
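As a minimal numeric illustration of Eq. 1, the following sketch converts K-1 binary task probabilities into rank indices; the probability values and variable names are hypothetical and only serve to illustrate the thresholding rule.

import torch

# Hypothetical per-task probabilities f_k(x) for a batch of 3 examples and K-1 = 4 binary tasks
probas = torch.tensor([[0.9, 0.8, 0.6, 0.2],
                       [0.7, 0.4, 0.3, 0.1],
                       [0.9, 0.9, 0.8, 0.7]])

# Eq. 1: rank index q = 1 + number of tasks predicting "exceeds rank r_k"
rank_index = 1 + (probas > 0.5).sum(dim=1)
print(rank_index)  # tensor([4, 2, 5])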

The CORAL method ensures that the \{f_{k}\}_{k=1}^{K-1} predictions are rank-monotonic, that is, f_{1}\left(\mathbf{x}^{[i]}\right)\geq f_{2}\left(\mathbf{x}^{[i]}\right)\geq\dots\geq f_{K-1}\left(\mathbf{x}^{[i]}\right), which provides rank consistency to the ordinal regression model. While the rank label calculation via Eq. 1 does not strictly require consistency among the K-1 task predictions f_{k}\left(\mathbf{x}^{[i]}\right), it is intuitive to see why rank consistency can be theoretically beneficial and can lead to more interpretable results via the binary subtasks. While CORAL provides this rank consistency, its limitation is a weight-sharing constraint in the output layer. Consequently, all binary classification tasks use the same weight parameters and only differ in their bias units, which may limit the flexibility and expressiveness of an ordinal regression neural network based on CORAL.

The proposed CORN model is a neural network for ordinal regression that guarantees rank consistency without any weight-sharing constraint in the output layer (Fig. 2). Instead, CORN uses a new training procedure with conditional training subsets that ensures rank consistency through applying the chain rule of probability.


Figure 2: Outline of the neural network architecture used for CORN.

3.3 Rank-consistent Ordinal Regression based on Conditional Probabilities

Given a training set D=\left\{\mathbf{x}^{[i]},y^{[i]}\right\}_{i=1}^{N}, CORN applies a label extension to the rank labels y^{[i]} similar to CORAL, such that the resulting binary label y^{[i]}_{k}\in\{0,1\} indicates whether y^{[i]} exceeds rank r_{k}. Similar to CORAL, CORN also uses K-1 learning tasks associated with ranks r_{1},r_{2},...,r_{K} in the output layer as illustrated in Fig. 2.

However, in contrast to CORAL, CORN estimates a series of conditional probabilities using conditional training subsets (described in Section 3.4) such that the output of the k-th binary task f_{k}\left(\mathbf{x}^{[i]}\right) represents the conditional probability

f_{k}\left(\mathbf{x}^{[i]}\right)=\hat{P}\left(y^{[i]}>r_{k}\,|\,y^{[i]}>r_{k-1}\right), (2)

where the events are nested: \left\{y^{[i]}>r_{k}\right\}\subseteq\left\{y^{[i]}>r_{k-1}\right\}. (When k=1, f_{1}\left(\mathbf{x}^{[i]}\right) represents the initial unconditional probability f_{1}\left(\mathbf{x}^{[i]}\right)=\hat{P}\left(y^{[i]}>r_{1}\right).)

The transformed, unconditional probabilities can then be computed by applying the chain rule for probabilities to the model outputs:

\hat{P}\left(y^{[i]}>r_{k}\right)=\prod^{k}_{j=1}f_{j}\left(\mathbf{x}^{[i]}\right). (3)

Since \forall j,\;0\leq f_{j}\left(\mathbf{x}^{[i]}\right)\leq 1, we have

\hat{P}\left(y^{[i]}>r_{1}\right)\geq\hat{P}\left(y^{[i]}>r_{2}\right)\geq\ldots\geq\hat{P}\left(y^{[i]}>r_{K-1}\right), (4)

which guarantees rank consistency among the K-1 binary tasks.
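The chain rule in Eq. 3 translates directly into a cumulative product over the conditional task outputs. Below is a minimal sketch (with hypothetical probability values) illustrating that the resulting unconditional probabilities are automatically non-increasing, as stated in Eq. 4:

import torch

# Hypothetical conditional probabilities f_1, ..., f_{K-1} for one example (K-1 = 4 tasks)
cond_probas = torch.tensor([0.9, 0.8, 0.7, 0.6])

# Eq. 3: unconditional probabilities P(y > r_k) via the chain rule (cumulative product)
uncond_probas = torch.cumprod(cond_probas, dim=0)
print(uncond_probas)  # tensor([0.9000, 0.7200, 0.5040, 0.3024]) -- non-increasing, as in Eq. 4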

3.4 Conditional Training Subsets

Our model aims to estimate f_{1}\left(\mathbf{x}^{[i]}\right) and the conditional probabilities f_{2}\left(\mathbf{x}^{[i]}\right),...,f_{K-1}\left(\mathbf{x}^{[i]}\right). Estimating f_{1}\left(\mathbf{x}^{[i]}\right) is a classic binary classification task under the extended binary classification framework with the binary labels y_{1}^{[i]}. To estimate the conditional probabilities such as \hat{P}\left(y^{[i]}>r_{2}\,|\,y^{[i]}>r_{1}\right), we focus only on the subset of the training data where y^{[i]}>r_{1}. As a result, when we minimize the binary cross-entropy loss on these conditional subsets, for each binary task, the estimated output probability has a proper conditional probability interpretation. (When training a neural network using backpropagation, instead of minimizing the K-1 loss functions corresponding to the K-1 conditional probabilities on each conditional subset separately, we can minimize their sum, as shown in the loss function we propose in Section 3.5, to optimize the binary tasks simultaneously.)

In order to model the conditional probabilities in Eq. 3, we construct conditional training subsets for training, which are used in the loss function (Section 3.5) that is minimized via backpropagation. The conditional training subsets are obtained from the original training set as follows:

S_{1}: \text{ all }\left\{\left(\mathbf{x}^{[i]},y^{[i]}\right)\right\}, \text{ for } i\in\{1,...,N\},
S_{2}: \left\{\left(\mathbf{x}^{[i]},y^{[i]}\right)\;|\;y^{[i]}>r_{1}\right\},
\dots
S_{K-1}: \left\{\left(\mathbf{x}^{[i]},y^{[i]}\right)\;|\;y^{[i]}>r_{K-2}\right\},

where N=|S_{1}|\geq|S_{2}|\geq\ldots\geq|S_{K-1}|, and |S_{k}| denotes the size of S_{k}. Note that the labels y^{[i]} are subject to the binary label extension as described in Section 3.3. Each conditional training subset S_{k} is used for training the conditional probability prediction \hat{P}\left(y^{[i]}>r_{k}\,|\,y^{[i]}>r_{k-1}\right) for k\geq 2.

Additional theoretical justification for constructing the conditional training subsets is provided in the Supplementary Material in section Theoretical Analysis of Conditional Probability Estimation. Section 5.1 compares the predictive performance of the CORN method with and without training subsets.
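To make the construction of the conditional subsets concrete, the following sketch (our own illustration with hypothetical labels, not the reference implementation) derives the per-task example masks from integer rank labels; in practice, the subsets only need to be realized as boolean masks inside each training batch:

import torch

K = 5  # number of ordinal ranks; labels are encoded as integers 0, ..., K-1
labels = torch.tensor([0, 2, 4, 1, 3])  # hypothetical rank labels for a batch

# Task t (t = 0, ..., K-2) predicts whether the label exceeds rank t.
# Its conditional training subset contains all examples whose label exceeds rank t-1,
# which is the full batch for the first task (t = 0).
for t in range(K - 1):
    subset_mask = labels >= t              # conditional subset S_{t+1}
    binary_targets = (labels > t).float()  # extended binary labels for task t
    print(f"task {t}: |S| = {subset_mask.sum().item()}, "
          f"targets = {binary_targets[subset_mask].tolist()}")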

3.5 Loss Function

Let f_{j}\left(\mathbf{x}^{[i]}\right) denote the predicted value of the j-th node in the output layer of the network (Fig. 2), and let |S_{j}| denote the size of the j-th conditional training set. To train a CORN neural network using backpropagation, we minimize the following loss function:

L(\mathbf{X},\mathbf{y})=-\frac{1}{\sum_{j=1}^{K-1}|S_{j}|}\sum_{j=1}^{K-1}\sum_{i=1}^{|S_{j}|}\Big[\log\left(f_{j}\left(\mathbf{x}^{[i]}\right)\right)\cdot\mathbbm{1}\left\{y^{[i]}>r_{j}\right\}+\log\left(1-f_{j}\left(\mathbf{x}^{[i]}\right)\right)\cdot\mathbbm{1}\left\{y^{[i]}\leq r_{j}\right\}\Big]. (5)

We note that in f_{j}\left(\mathbf{x}^{[i]}\right), \mathbf{x}^{[i]} represents the i-th training example in S_{j}. To simplify the notation, we omit an additional index j to distinguish between \mathbf{x}^{[i]} in different conditional training sets.

To improve the numerical stability of the loss gradients during training, we implement the following alternative formulation of the loss, where \mathbf{Z} are the net inputs of the last layer (aka logits), as shown in Fig. 2, and \log\left(\sigma\left(\mathbf{z}^{[i]}\right)\right)=\log\left(f_{j}\left(\mathbf{x}^{[i]}\right)\right):

L(\mathbf{Z},\mathbf{y})=-\frac{1}{\sum_{j=1}^{K-1}|S_{j}|}\sum_{j=1}^{K-1}\sum_{i=1}^{|S_{j}|}\Big[\log\left(\sigma\left(\mathbf{z}^{[i]}\right)\right)\cdot\mathbbm{1}\left\{y^{[i]}>r_{j}\right\}+\left(\log\left(\sigma\left(\mathbf{z}^{[i]}\right)\right)-\mathbf{z}^{[i]}\right)\cdot\mathbbm{1}\left\{y^{[i]}\leq r_{j}\right\}\Big]. (6)

A derivation showing that the two loss equations are equivalent and a PyTorch implementation are included in the Supplementary Material in the section Numerically Stable Loss Function. In addition, the Supplementary Material includes a visual illustration of the loss computation based on the conditional training subsets (Figure S1) and a theoretical Generalization Bounds analysis.

3.6 Rank Prediction

To obtain the rank index q of the i-th training example (or of any new data record during inference), we threshold the predicted probabilities corresponding to the K-1 binary tasks and sum the binary labels as follows:

q^{[i]}=1+\sum_{j=1}^{K-1}\mathbbm{1}\left(\hat{P}\left(y^{[i]}>r_{j}\right)>0.5\right),

where the predicted rank is r_{q^{[i]}}.
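Putting Eq. 3 and the thresholding rule together, CORN inference can be sketched as follows (a minimal illustration with hypothetical logit values; the output layer has K-1 units as in Fig. 2, and the conditional probabilities are obtained via the logistic sigmoid as in Section 3.5):

import torch

def corn_predict_ranks(logits):
    """Convert CORN logits of shape (batch_size, K-1) into rank indices 1, ..., K."""
    cond_probas = torch.sigmoid(logits)                # conditional probabilities f_1, ..., f_{K-1}
    uncond_probas = torch.cumprod(cond_probas, dim=1)  # Eq. 3: P(y > r_1), ..., P(y > r_{K-1})
    return 1 + (uncond_probas > 0.5).sum(dim=1)        # threshold and sum the binary decisions

logits = torch.tensor([[2.0, 1.0, -0.5, -2.0],
                       [3.0, 2.5, 1.0, 0.2]])
print(corn_predict_ranks(logits))  # tensor([3, 4])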

4 Experiments

4.1 Datasets and Preprocessing

The MORPH-2 dataset (https://www.faceaginggroup.com/morph/) [19] contains 55,608 face images, which were processed as described in [1]: facial landmark detection [20] was used to compute the average eye location, which was then used by the EyepadAlign function in MLxtend v0.14 [18] to align the face images. The original MORPH-2 dataset contains age labels in the range of 16-70 years. In this study, we use a balanced version of the MORPH-2 dataset containing 20,625 face images with 33 evenly distributed age labels within the range of 16-48 years.

The Asian Face dataset (AFAD) (https://github.com/afad-dataset/tarball) [14] contains 165,501 faces in the age range of 15-40 years. No additional preprocessing was applied to this dataset since the faces were already centered. In this study, we use a balanced version of the AFAD dataset with 13 age labels in the age range of 18-30 years.

The Image Aesthetic dataset (AES) (http://www.di.unito.it/~schifane/dataset/beauty-icwsm15/) [21] used in this study contains 13,868 images, each with a list of beauty scores ranging from 1 to 5. To create ordinal regression labels, we replaced the beauty score list of each image with its average score rounded to the nearest integer in the range 1-5. Compared to the other image datasets, MORPH-2 and AFAD, the AES dataset is relatively small, and we did not attempt to create a class-balanced version of it for this study.

The Fireman dataset (https://github.com/gagolews/ordinal_regression_data) is a tabular dataset that contains 40,768 instances, 10 numeric features, and an ordinal response variable with 16 categories. We created a balanced version of this dataset consisting of 2,543 instances per class, or 40,688 instances across the 16 ordinal classes in total.

Each dataset was randomly divided into 75% training data, 5% validation data, and 20% test data. We share the partitions for all datasets, along with all preprocessing code used in this paper, in the code repository (see Section 4.4).

4.2 Neural Network Architectures

4.2.1 Comparison with binary label extension frameworks for ordinal regression

For the main method comparisons with other binary extension frameworks for ordinal regression on the image datasets (MORPH-2 and AFAD), we used ResNet34 [6] as the backbone architecture since it is an established architecture known to achieve good performance on a variety of image classification datasets. Besides the hyperparameter settings listed in Tables 1 and 2, we adopt all other settings from the ResNet34 paper.

For the tabular Fireman dataset, we used a simple multilayer perceptron architecture (MLP) with leaky ReLU [12] activation functions (negative slope 0.01). Since the MLP architectures were prone to overfitting, a dropout layer with drop probability 0.2 was added after the leaky ReLU activations in each hidden layer. In addition, we used the AdamW [11] optimizer with a weight decay rate of 0.2. The number of hidden layers (one or two) and the number of units per hidden layer were determined by hyperparameter tuning (see Section 4.3 for more details).
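As a rough sketch (not the exact architecture used in the experiments, whose layer sizes and learning rates were tuned as described below), an MLP backbone of this kind with a CORN output layer could look as follows; the function name and default layer sizes are our own assumptions for illustration:

import torch.nn as nn

def make_corn_mlp(num_features, num_classes, hidden_units=(300, 200), dropout=0.2):
    """Hypothetical MLP backbone with a (num_classes - 1)-unit CORN output layer."""
    layers = []
    in_units = num_features
    for out_units in hidden_units:
        layers += [nn.Linear(in_units, out_units),
                   nn.LeakyReLU(negative_slope=0.01),
                   nn.Dropout(p=dropout)]
        in_units = out_units
    layers.append(nn.Linear(in_units, num_classes - 1))  # one logit per binary task
    return nn.Sequential(*layers)

model = make_corn_mlp(num_features=10, num_classes=16)  # e.g., the Fireman dataset dimensions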

In this paper, we focus on comparing the performance of a neural network trained via the rank-consistent CORN approach to the two prominent binary extension-based ordinal regression frameworks for deep learning: Niu et al.'s OR-NN method [14] (no rank consistency) and CORAL (rank consistency via identical weight parameters for all nodes in the output layer). As a performance baseline, we also implement neural network classifiers trained with the standard multi-category cross-entropy loss, which we refer to as CE-NN. While all methods (CE-NN, OR-NN, CORAL, and CORN) use different loss functions during training, it is worth emphasizing that they can share similar backbone architectures and only require small changes in the output layer. For instance, to implement a neural network for ordinal regression using the proposed CORN method, we replaced the network's output layer with the corresponding binary conditional probability task layer.

4.3 Training and Evaluation

The model evaluations and comparisons are based on the mean absolute error (MAE) and root mean squared error (RMSE), which are defined as follows:

\mathrm{MAE}=\frac{1}{N}\sum_{i=1}^{N}\left|y^{[i]}-h\left(\mathbf{x}^{[i]}\right)\right| \quad\text{ and }\quad \mathrm{RMSE}=\sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y^{[i]}-h\left(\mathbf{x}^{[i]}\right)\right)^{2}},

where y^{[i]} is the ground truth rank of the i-th test example and h\left(\mathbf{x}^{[i]}\right) is the predicted rank.
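For completeness, a minimal sketch of how these two metrics can be computed from predicted and ground-truth rank tensors (hypothetical values and variable names):

import torch

y_true = torch.tensor([3, 1, 4, 2], dtype=torch.float)
y_pred = torch.tensor([2, 1, 5, 2], dtype=torch.float)

mae = torch.mean(torch.abs(y_true - y_pred))
rmse = torch.sqrt(torch.mean((y_true - y_pred) ** 2))
print(mae.item(), rmse.item())  # 0.5, ~0.707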

For each method, we considered the exact same hyperparameter ranges (a detailed list of the hyperparameter configurations we considered is shown in Table 1) and selected the best configuration via grid search based on its validation set performance. Then, using the best hyperparameter setting for each method, we repeated the model training five times using different random seeds (0, 1, 2, 3, and 4) for the random weight initialization and dataset shuffling. Note that both the hyperparameter configuration and the best training epoch were determined based on the validation set before computing the final model performance on the independent test set. The best hyperparameter values for each method are listed in Table 2.

Table 1: Configurations for hyperparameter tuning.
Backbone Learning rates Batch sizes Layer sizes
ResNet34 5e-5, 1e-4, 2.5e-4, 5e-4, 1e-3, 5e-3 16, 32, 64, 128, 256, 512 NA
MLP 1e-5, 5e-5, 1e-4, 5e-4, 1e-3, 5e-3 16, 32, 64, 128, 256, 512 Layer 1: 50, 100, 200, 300; Layer 2: 50, 100, 200, 300
Table 2: Best hyperparameter settings for image and tabular datasets.
Datasets Backbone Methods Learning rates Batch sizes Number of layers Layer hidden units
Image Datasets ResNet34 CE-NN 5e-4 256 - -
Image Datasets ResNet34 OR-NN 5e-4 256 - -
Image Datasets ResNet34 CORAL 5e-4 256 - -
Image Datasets ResNet34 CORN 5e-4 16 - -
Fireman MLP CE-NN 5e-4 64 2 300×200
Fireman MLP OR-NN 5e-4 128 2 300×300
Fireman MLP CORAL 5e-4 64 2 300×200
Fireman MLP CORN 1e-3 128 2 300×300

The models were trained for 200 epochs using stochastic gradient descent via adaptive moment estimation [8] with the default decay rates. We carefully monitored the training runs for convergence, that is, until the training and validation MAE started to diverge and the validation MAE started to stagnate or decline. The complete training logs for all methods are provided in the code repository (Section 4.4).

4.4 Hardware and Software

All neural networks were implemented in PyTorch 1.8 [15]. The models were trained on NVIDIA GeForce RTX 2080Ti graphics cards on a private workstation as well as T4 graphics cards using the Grid.ai platform. We make all source code used for the experiments available (https://github.com/Raschka-research-group/corn-ordinal-neuralnet) and provide a user-friendly implementation of CORN in the coral-pytorch Python package (https://github.com/Raschka-research-group/coral-pytorch).

5 Results and Discussion

To compare deep neural networks trained with our proposed CORN method to CORAL [1], Niu et al.’s OR-NN [14], and the baseline cross-entropy loss (CE-NN), we conducted a series of experiments on three image datasets and one tabular dataset. As detailed in Section 4.2, the experiments on the MORPH and AFAD image datasets were based on the ResNet34 architecture. We used a multilayer perceptron for the tabular Fireman dataset.

An additional study using a VGG16 backbone pre-trained on ImageNet and comparisons with SORD and CNNPOR can be found in the Supplementary Material in section Comparison with Other Deep Learning Methods for Ordinal Regression. In addition, results on text datasets and recurrent neural networks are included in the Supplementary Material in section Additional Results on Text Datasets using Recurrent Neural Networks.

As the main results in Table 3 show, CORN outperforms all other binary label extension methods for ordinal regression on the MORPH-2 and AFAD image datasets and is tied with OR-NN on the Fireman tabular dataset. We repeated the experiments with different random seeds for model weight initialization and data shuffling, which ensures that the results are not coincidental.

Table 3: Prediction errors on the test sets (lower is better). Each cell shows the average (AVG) and standard deviation (SD) across five random seed runs. Best results are highlighted in bold. A ResNet34 backbone was used for the MORPH-2 and AFAD image datasets. A multilayer perceptron backbone was used for the Fireman tabular dataset. The class labels in all datasets were balanced. The full table of all random seed runs can be found in the Supplementary Material (Table S5).
Method Metrics format MORPH-2 (Balanced) AFAD (Balanced) Fireman
MAE RMSE MAE RMSE MAE RMSE
CE-NN AVG±SD 3.73 ± 0.12 5.04 ± 0.20 3.28 ± 0.04 4.19 ± 0.06 0.80 ± 0.01 1.14 ± 0.01
OR-NN [14] AVG±SD 3.13 ± 0.09 4.23 ± 0.10 2.85 ± 0.03 3.48 ± 0.04 0.76 ± 0.01 1.08 ± 0.01
CORAL [1] AVG±SD 2.99 ± 0.04 4.01 ± 0.03 2.99 ± 0.03 3.70 ± 0.07 0.82 ± 0.01 1.15 ± 0.01
CORN (ours) AVG±SD 2.98 ± 0.02 3.99 ± 0.05 2.81 ± 0.02 3.46 ± 0.02 0.76 ± 0.01 1.08 ± 0.01

It is worth noting that even though CORAL's rank consistency was found to be beneficial for model performance [1], it performs noticeably worse than OR-NN on the balanced AFAD and Fireman datasets. This is likely due to CORAL's weight-sharing constraint in the output layer, which could affect the expressiveness of the neural network and thus limit the complexity of what it can learn. In contrast, the CORN method, which is also rank-consistent, performs better than OR-NN on MORPH-2 and AFAD.

We found that OR-NN and CORN have identical performance on the tabular Fireman dataset (Table 3), outperforming both CE-NN and CORAL in both test MAE and test RMSE. Here, the performances of all methods are relatively close, and the 16-category prediction task appears to be relatively easy for a fully connected neural network regardless of the loss function.

5.1 Ablation Study

Given the superior performance of CORN across several datasets, we studied the importance of the conditional training subsets. In this ablation study, we created an alternative CORN method without training subsets. Here, the output of the k-th binary task is computed as

f_{k}\left(\mathbf{x}^{[i]}\right)=\hat{P}\left(y^{[i]}>r_{k}\right), (7)

which is a modified version of Eq. 2. Note that this modification results in meaningless probability scores; however, rank consistency via Eq. 4 is still guaranteed since the scores are still combined via Eq. 3, and each score cannot be greater than 1.

We note that the modified CORN method without training subsets sees at least as many training examples as the regular CORN method, because each task now has access to the full training batch rather than a subset.

As the results in Table 4 show, the conditional training subsets not only play a crucial role in yielding meaningful and theoretically justified rank probability values in CORN, but they also improve the predictive performance. Across all datasets except MORPH-2, the neural network trained with the regular CORN method outperforms the alternative version without subsets.

Table 4: MAE prediction errors on the test sets for the ResNet34 backbone. The class labels in all datasets were balanced. Best results are highlighted in bold.
Dataset CORN CORN w/o subsets
MORPH-2 2.98 ± 0.02 2.93 ± 0.04
AFAD 2.81 ± 0.02 3.06 ± 0.02
AES 0.43 ± 0.01 0.68 ± 0.01
Fireman 0.76 ± 0.01 0.81 ± 0.01

6 Conclusions

In this paper, we developed the rank-consistent CORN framework for ordinal regression via conditional training datasets. We used CORN to train convolutional and fully connected neural architectures on ordinal response variables. Our experimental results showed that the CORN method improved the predictive performance compared to the rank-consistent reference framework CORAL. While our experiments focused on image and tabular datasets, the generality of our CORN method allows it to be readily applied to other types of datasets to solve ordinal regression problems with various neural network structures.

7 Acknowledgements

This research was supported by the Office of the Vice Chancellor for Research and Graduate Education at the University of Wisconsin-Madison with funding from the Wisconsin Alumni Research Foundation.

References

  • [1] W. Cao, V. Mirjalili, and S. Raschka. Rank consistent ordinal regression for neural networks with application to age estimation. Pattern Recognition Letters, 140:325–331, 2020.
  • [2] W. Chu and S. S. Keerthi. New approaches to support vector ordinal regression. In Proceedings of the International Conference on Machine Learning, pages 145–152. ACM, 2005.
  • [3] K. Crammer and Y. Singer. Pranking with ranking. In Advances in Neural Information Processing Systems, pages 641–647, 2002.
  • [4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
  • [5] R. Diaz and A. Marathe. Soft labels for ordinal regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4738–4747, 2019.
  • [6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • [7] T. Joachims. Optimizing search engines using clickthrough data. In Proceedings of the eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 133–142, 2002.
  • [8] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Y. Bengio and Y. LeCun, editors, International Conference on Learning Representations, pages 1–8, 2015.
  • [9] L. Li and H.-T. Lin. Ordinal regression by extended binary classification. In Advances in Neural Information Processing Systems, pages 865–872, 2007.
  • [10] Y. Liu, A. W. K. Kong, and C. K. Goh. A constrained deep neural network for ordinal regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 831–839, 2018.
  • [11] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations (Poster), 2019.
  • [12] A. L. Maas, A. Y. Hannun, A. Y. Ng, et al. Rectifier nonlinearities improve neural network acoustic models. In Proc. icml, volume 30, page 3. Citeseer, 2013.
  • [13] P. McCullagh. Regression models for ordinal data. Journal of the Royal Statistical Society. Series B (Methodological), pages 109–142, 1980.
  • [14] Z. Niu, M. Zhou, L. Wang, X. Gao, and G. Hua. Ordinal regression with multiple output CNN for age estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4920–4928, 2016.
  • [15] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035, 2019.
  • [16] F. Petersen, C. Borgelt, H. Kuehne, and O. Deussen. Differentiable sorting networks for scalable sorting and ranking supervision. In International Conference on Machine Learning, 2021.
  • [17] S. Rajaram, A. Garg, X. S. Zhou, and T. S. Huang. Classification approach towards ranking and sorting problems. In Proceedings of the European Conference on Machine Learning, pages 301–312. Springer, 2003.
  • [18] S. Raschka. MLxtend: Providing machine learning and data science utilities and extensions to Python’s scientific computing stack. The Journal of Open Source Software, 3(24):1–2, 2018.
  • [19] K. Ricanek and T. Tesafaye. Morph: A longitudinal image database of normal adult age-progression. In Proceedings of the IEEE Conference on Automatic Face and Gesture Recognition, pages 341–345, 2006.
  • [20] C. Sagonas, E. Antonakos, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic. 300 faces in-the-wild challenge: database and results. Image and Vision Computing, 47:3–18, 2016.
  • [21] R. Schifanella, M. Redi, and L. M. Aiello. An image is worth more than a thousand favorites: Surfacing the hidden beauty of flickr pictures. In International AAAI Conference on Web and Social Media, 2015.
  • [22] A. Shashua, A. Levin, et al. Ranking with large margin principle: Two approaches. Advances in Neural Information Processing Systems, pages 961–968, 2003.
  • [23] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [24] J. L. Suárez, S. García, and F. Herrera. Ordinal regression with explainable distance metric learning based on ordered sequences. Machine Learning, pages 1–34, 2021.
  • [25] H. Zhu, H. Shan, Y. Zhang, L. Che, X. Xu, J. Zhang, J. Shi, and F.-Y. Wang. Convolutional ordinal regression forest for image ordinal estimation. IEEE Transactions on Neural Networks and Learning Systems, 2021.

8 Supplementary Material

8.1 Theoretical Analysis of Conditional Probability Estimation

Suppose we are interested in estimating a series of conditional probabilities

f_{1}(\mathbf{x})=P\left(y>r_{1}\;|\;\mathbf{x}\right),
f_{2}(\mathbf{x})=P\left(y>r_{2}\;|\;y>r_{1},\mathbf{x}\right),
\ldots,
f_{K-1}(\mathbf{x})=P\left(y>r_{K-1}\;|\;y>r_{K-2},\mathbf{x}\right),

with the observed dataset D=\left\{\mathbf{x}^{[i]},y^{[i]}\right\}_{i=1}^{N}, where f_{k}(\mathbf{x}) is the functional form of the neural network model outputs that depend on the neural network model weights. The likelihood of the model weights can be written as

L=\prod_{j=1}^{K-1}\prod_{i=1}^{|S_{j}|}\left[f_{j}\left(\mathbf{x}^{[i]}\right)^{\mathbbm{1}\left\{y^{[i]}>r_{j}\right\}}\cdot\left(1-f_{j}\left(\mathbf{x}^{[i]}\right)\right)^{\mathbbm{1}\left\{y^{[i]}\leq r_{j}\right\}}\right]. (8)

Hence, minimizing the loss function (Eq. 5) is equivalent to solving the maximum likelihood estimate of the functional form representations of the conditional probabilities. This also provides the theoretical justification for constructing the conditional training sets in the data preparation for the CORN loss function. Without using the conditional sets in the loss function, the estimated probabilities do not have a conditional probability maximum likelihood interpretation. After solving the maximum likelihood estimates of the conditional probabilities, it is natural to use the probability chain rule to find the unconditional probabilities of exceeding rank r_{k} in Eq. 3 given each input \mathbf{x}.

8.2 Generalization Bounds

Analogous to CORAL [1] and based on established generalization bounds for binary classification, Theorem 1 shows that the final rank prediction by CORN generalizes well when the binary classification tasks generalize well.

Theorem 1 (reduction of generalization error).

Let \mathcal{C} be the cost matrix for the ordinal label predictions, where \mathcal{C}_{y,y}=0 and \mathcal{C}_{y,r_{k}}>0 for r_{k}\neq y. P is the underlying distribution of (\mathbf{x},y), i.e., (\mathbf{x},y)\sim P. Furthermore, let h(\mathbf{x}) be the model output yielding the predicted rank r_{q}; that is, h(\mathbf{x})=r_{q}. Let y^{(k)}=\mathbbm{1}\{y>r_{k}\}, and let \hat{y}^{(k)}=\mathbbm{1}\{\hat{P}(y>r_{k})>0.5\}=\mathbbm{1}\{f_{1}f_{2}\ldots f_{k}>0.5\} be the prediction of y^{(k)}. Given the binary classification tasks \{f_{k}\}_{k=1}^{K-1}, which we obtain from minimizing the loss in Eq. 5, and the rank-monotonic \hat{y}^{(k)}, we have

\underset{(\mathbf{x},y)\sim P}{\mathbb{E}}\,\mathcal{C}_{y,h(\mathbf{x})}\leq\sum_{k=1}^{K-1}\big|\mathcal{C}_{y,r_{k}}-\mathcal{C}_{y,r_{k+1}}\big|\underset{(\mathbf{x},y)\sim P}{\mathbb{E}}\,\mathbbm{1}\{\hat{y}^{(k)}\neq y^{(k)}\}. (9)
Proof.

For any \mathbf{x}\in\mathcal{X}, by Eq. 4 we have

\hat{y}^{(1)}\geq\hat{y}^{(2)}\geq\ldots\geq\hat{y}^{(K-1)}.

If h(\mathbf{x})=y, then \mathcal{C}_{y,h(\mathbf{x})}=0.
If h(\mathbf{x})=r_{q}\prec y=r_{s}, then q<s. We have

\hat{y}^{(1)}=\hat{y}^{(2)}=\ldots=\hat{y}^{(q-1)}=1

and

\hat{y}^{(q)}=\hat{y}^{(q+1)}=\ldots=\hat{y}^{(K-1)}=0.

Also,

y^{(1)}=y^{(2)}=\ldots=y^{(s-1)}=1

and

y^{(s)}=y^{(s+1)}=\ldots=y^{(K-1)}=0.

Thus, \mathbbm{1}\{\hat{y}^{(k)}\neq y^{(k)}\}=1 if and only if q\leq k\leq s-1. Since \mathcal{C}_{y,y}=0,

\mathcal{C}_{y,h(\mathbf{x})}=\sum_{k=q}^{s-1}\left(\mathcal{C}_{y,r_{k}}-\mathcal{C}_{y,r_{k+1}}\right)\cdot\mathbbm{1}\{\hat{y}^{(k)}\neq y^{(k)}\}
\leq\sum_{k=q}^{s-1}\big|\mathcal{C}_{y,r_{k}}-\mathcal{C}_{y,r_{k+1}}\big|\cdot\mathbbm{1}\{\hat{y}^{(k)}\neq y^{(k)}\}
\leq\sum_{k=1}^{K-1}\big|\mathcal{C}_{y,r_{k}}-\mathcal{C}_{y,r_{k+1}}\big|\cdot\mathbbm{1}\{\hat{y}^{(k)}\neq y^{(k)}\}.

Similarly, if h(\mathbf{x})=r_{q}\succ y=r_{s}, then q>s and

\mathcal{C}_{y,h(\mathbf{x})}=\sum_{k=s}^{q-1}\left(\mathcal{C}_{y,r_{k+1}}-\mathcal{C}_{y,r_{k}}\right)\cdot\mathbbm{1}\{\hat{y}^{(k)}\neq y^{(k)}\}
\leq\sum_{k=1}^{K-1}\big|\mathcal{C}_{y,r_{k+1}}-\mathcal{C}_{y,r_{k}}\big|\cdot\mathbbm{1}\{\hat{y}^{(k)}\neq y^{(k)}\}.

In any case, we have

\mathcal{C}_{y,h(\mathbf{x})}\leq\sum_{k=1}^{K-1}\big|\mathcal{C}_{y,r_{k}}-\mathcal{C}_{y,r_{k+1}}\big|\cdot\mathbbm{1}\{\hat{y}^{(k)}\neq y^{(k)}\}. (10)

By taking the expectation on both sides with (\mathbf{x},y)\sim P, we arrive at Eq. 9. ∎

8.3 Comparison with Other Deep Learning Methods for Ordinal Regression

We compare CORN with two additional, recent ordinal regression methods that do not rely on the binary extension framework:

  1. the convolutional neural network with pairwise regularization for ordinal regression (CNNPOR) method by Liu, Long, and Goh [10];

  2. the soft ordinal vectors (SORD) method by Diaz and Marathe [5].

To facilitate a fair comparison, we adopted the exact same architecture and preprocessing steps from [5] and [10]. Similar to CNNPOR and SORD, we used a VGG16 [23] backbone pre-trained on ImageNet [4], where only the last layer (output layer) was re-initialized with random weights. Also, following the preprocessing steps in CNNPOR and SORD, the training images in the AES dataset were resized to 256×256 pixels, randomly cropped to 224×224 pixels, and randomly flipped along the horizontal axis.
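A minimal sketch of this training-time preprocessing pipeline using torchvision (our own illustration; any transform parameters beyond those stated above are assumptions):

from torchvision import transforms

train_transform = transforms.Compose([
    transforms.Resize((256, 256)),       # resize to 256x256 pixels
    transforms.RandomCrop(224),          # random 224x224 crop
    transforms.RandomHorizontalFlip(),   # random flip along the horizontal axis
    transforms.ToTensor(),
])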

As these additional results on the AES dataset show, CORN also outperforms other recent ordinal regression methods for deep learning (CNNPOR [10] and SORD [5]) overall when trained with a VGG16 backbone that was pre-trained on ImageNet (Table S2).

Table S1: Best hyperparameter settings for the AES datasets. For CNNPOR [10] and SORD [5] settings, please refer to the respective papers.
Datasets Backbone Methods Learning rates Batch sizes
AES Nature, Animals, Urban, People VGG16 CE-NN 5e-5, 5e-5, 5e-5, 5e-5 32, 32, 32, 16
AES Nature, Animals, Urban, People VGG16 OR-NN 1e-4, 1e-4, 1e-4, 5e-4 32, 32, 16, 16
AES Nature, Animals, Urban, People VGG16 CORAL 5e-4, 1e-3, 1e-3, 5e-4 16, 16, 16, 32
AES Nature, Animals, Urban, People VGG16 CORN 5e-5, 5e-5, 5e-5, 5e-5 64, 64, 64, 32
Table S2: Prediction errors on the test sets for the VGG16 backbone pre-trained on ImageNet. Best results are highlighted in bold.
MAE (lower is better)
CE-NN OR-NN [14] CORAL [1] CORN (ours) CNNPOR [10] SORD [5]
Nature 0.29 0.29 0.28 0.29 0.29 0.27
Animals 0.28 0.25 0.30 0.26 0.32 0.31
Urban 0.26 0.27 0.27 0.25 0.33 0.28
People 0.29 0.28 0.29 0.26 0.32 0.31
Overall 0.28 0.27 0.29 0.27 0.32 0.29

8.4 Additional Results on Text Datasets using Recurrent Neural Networks

This section describes additional results we obtained from comparing CORN to other methods on text datasets using recurrent neural networks (RNNs) with long short-term memory (LSTM) cells.

Table S4: Prediction errors on the test sets for the RNN backbone (lower is better). The class labels for both the Coursera and TripAdvisor were balanced. Best results are highlighted in bold.
Method Seed TripAdvisor Coursera
MAE RMSE MAE RMSE
CE-RNN 0 1.13 1.56 1.01 1.48
1 1.04 1.53 0.97 1.05
2 1.05 1.54 1.12 1.65
3 1.23 1.81 1.18 1.76
4 1.03 1.52 0.84 1.26
AVG±SD 1.10 ± 0.09 1.59 ± 0.12 1.02 ± 0.13 1.53 ± 0.19
OR-RNN [14] 0 1.06 1.53 0.98 1.34
1 1.09 1.50 0.93 1.24
2 1.11 1.53 1.12 1.47
3 1.23 1.52 1.11 1.53
4 1.07 1.40 0.85 1.16
AVG±SD 1.11 ± 0.07 1.50 ± 0.06 1.00 ± 0.12 1.35 ± 0.15
CORAL [1] 0 1.15 1.58 0.99 1.29
1 1.14 1.49 1.03 1.39
2 1.16 1.46 1.14 1.40
3 1.19 1.41 1.20 1.40
4 1.13 1.47 0.82 1.11
AVG±SD 1.15 ± 0.02 1.48 ± 0.06 1.04 ± 0.15 1.33 ± 0.13
CORN (ours) 0 1.09 1.55 0.95 1.37
1 1.09 1.53 0.90 1.32
2 1.01 1.45 1.07 1.49
3 1.12 1.51 1.05 1.47
4 1.03 1.46 0.78 1.14
AVG±SD 1.07 ± 0.05 1.50 ± 0.04 0.95 ± 0.12 1.36 ± 0.14

Both the 100K Coursera course reviews dataset (https://www.kaggle.com/septa97/100k-courseras-course-reviews-dataset) and the TripAdvisor hotel reviews dataset (https://www.kaggle.com/andrewmvd/trip-advisor-hotel-reviews) contain reviews with 5 rating labels ranging from 1 to 5 stars. We used balanced versions of these datasets to distribute the ratings evenly. The balanced Coursera dataset contains 11,852 reviews, and the balanced TripAdvisor dataset contains 7,000 reviews. Each dataset was randomly divided into 75% training data, 5% validation data, and 20% test data. The dataset splits and preprocessing code can be found in the code repository (see Section 4.4 of the main manuscript).

For method comparisons on the text datasets, we use a standard RNN with one LSTM cell. Similar to the image datasets, we compare the performance of a neural network trained via the rank-consistent CORN approach to both Niu et al.’s OR-RNN method (no rank consistency) and CORAL (rank consistency by using identical weight parameters for all nodes in the output layer). We also implemented RNN classifiers trained with standard multicategory cross-entropy loss as a baseline, which we refer to as CE-RNN. All methods share similar backbone architectures and only require minor changes in the output layer.
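As a rough sketch (our own illustration; the embedding and hidden sizes are assumptions, not the values used in the experiments), an LSTM-based backbone with a (K-1)-unit output layer for CORN could look like this:

import torch
import torch.nn as nn

class CornRNN(nn.Module):
    """Hypothetical LSTM backbone with a (num_classes - 1)-unit CORN output layer."""
    def __init__(self, vocab_size, num_classes, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.output_layer = nn.Linear(hidden_dim, num_classes - 1)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)      # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(embedded)      # hidden: (1, batch, hidden_dim)
        return self.output_layer(hidden[-1])      # (batch, num_classes - 1) logits

model = CornRNN(vocab_size=20000, num_classes=5)
logits = model(torch.randint(0, 20000, (8, 50)))  # a batch of 8 token sequences of length 50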

The training and evaluation steps are similar to those of the image datasets in the main manuscript. The RNN models were trained for 200 epochs using ADAM with default settings. The model with the best validation set performance was then chosen as the final model for evaluation on the test set. The training logs for all runs are available in the CORN GitHub repository (see Section 4.4). The learning rates considered for hyperparameter tuning were 1e-5, 5e-5, 1e-4, 5e-4, 1e-3, 5e-3, and we considered batch sizes 16, 32, 64, 128, 256, 512.

As the results in Table S4 show, CORN outperforms all other methods on the two text datasets, TripAdvisor and Coursera, in terms of the test set MAE. All experiments were repeated for different random seeds to ensure that the results were not coincidental.

It is worth noting that while the CORN method showed superior performance in terms of test MAE, the CORAL method performed better in terms of test RMSE compared with all other methods. One possible explanation is that since RMSE penalizes large gaps more harshly than MAE, CORAL may behave slightly better on outliers, while CORN may make fewer mistakes in total. However, both methods show reliable performance on the text datasets.

8.5 Detailed Performance Table

Table S5 is a more detailed version of the results table shown in the main paper, listing the performance for each individual random seed.

Table S5: Prediction errors on the test sets (lower is better). A ResNet34 backbone was used for the MORPH-2 and AFAD image datasets. A multilayer perceptron backbone was used for the Fireman tabular dataset. The class labels in all datasets were balanced. Best results are highlighted in bold.
Method Seed MORPH-2 AFAD Fireman
MAE RMSE MAE RMSE MAE RMSE
CE-NN 0 3.81 5.19 3.31 4.27 0.80 1.14
1 3.60 4.8 3.28 4.19 0.80 1.14
2 3.61 4.84 3.32 4.22 0.79 1.13
3 3.85 5.21 3.24 4.15 0.80 1.16
4 3.80 5.14 3.24 4.13 0.80 1.15
AVG±SD 3.73 ± 0.12 5.04 ± 0.20 3.28 ± 0.04 4.19 ± 0.06 0.80 ± 0.01 1.14 ± 0.01
OR-NN [14] 0 3.21 4.25 2.81 3.45 0.75 1.07
1 3.16 4.25 2.87 3.54 0.76 1.08
2 3.16 4.31 2.82 3.46 0.77 1.10
3 2.98 4.05 2.89 3.49 0.76 1.08
4 3.13 4.27 2.86 3.45 0.74 1.07
AVG±SD 3.13 ± 0.09 4.23 ± 0.10 2.85 ± 0.03 3.48 ± 0.04 0.76 ± 0.01 1.08 ± 0.01
CORAL [1] 0 2.94 3.98 2.95 3.60 0.82 1.14
1 2.97 4.03 2.99 3.69 0.83 1.16
2 3.01 3.98 2.98 3.70 0.81 1.13
3 2.98 4.01 3.00 3.78 0.82 1.16
4 3.03 4.06 3.04 3.75 0.82 1.15
AVG±SD 2.99 ± 0.04 4.01 ± 0.03 2.99 ± 0.03 3.70 ± 0.07 0.82 ± 0.01 1.15 ± 0.01
CORN (ours) 0 2.98 4 2.80 3.45 0.75 1.07
1 2.99 4.01 2.81 3.44 0.76 1.08
2 2.97 3.97 2.84 3.48 0.77 1.10
3 3.00 4.06 2.80 3.48 0.76 1.08
4 2.95 3.92 2.79 3.45 0.74 1.07
AVG±SD 2.98 ± 0.02 3.99 ± 0.05 2.81 ± 0.02 3.46 ± 0.02 0.76 ± 0.01 1.08 ± 0.01

8.6 Numerically Stable Loss Function

We can convert the CORN loss function,

L(\mathbf{X},\mathbf{y})=-\frac{1}{\sum_{j=1}^{K-1}|S_{j}|}\sum_{j=1}^{K-1}\sum_{i=1}^{|S_{j}|}\Big[\log\left(f_{j}\left(\mathbf{x}^{[i]}\right)\right)\cdot\mathbbm{1}\left\{y^{[i]}>r_{j}\right\}+\log\left(1-f_{j}\left(\mathbf{x}^{[i]}\right)\right)\cdot\mathbbm{1}\left\{y^{[i]}\leq r_{j}\right\}\Big], (11)

into an alternative version

L(\mathbf{Z},\mathbf{y})=-\frac{1}{\sum_{j=1}^{K-1}|S_{j}|}\sum_{j=1}^{K-1}\sum_{i=1}^{|S_{j}|}\Big[\log\left(\sigma\left(\mathbf{z}^{[i]}\right)\right)\cdot\mathbbm{1}\left\{y^{[i]}>r_{j}\right\}+\left(\log\left(\sigma\left(\mathbf{z}^{[i]}\right)\right)-\mathbf{z}^{[i]}\right)\cdot\mathbbm{1}\left\{y^{[i]}\leq r_{j}\right\}\Big], (12)

where \mathbf{Z} are the net inputs of the last layer (aka logits) and \log\left(\sigma\left(\mathbf{z}^{[i]}\right)\right)=\log\left(f_{j}\left(\mathbf{x}^{[i]}\right)\right), since

\log\left(1-\frac{1}{1+e^{-z}}\right)
=\log\left(1-\frac{e^{z}}{1+e^{z}}\right)
=\log\left(\frac{1}{1+e^{z}}\right)
=\log\left(\frac{e^{z}}{1+e^{z}}\right)-\log\left(e^{z}\right)
=\log\left(\frac{1}{1+e^{-z}}\right)-z
=\log\left(\sigma(z)\right)-z.

This allows us to use the logsigmoid(z) function that is implemented in deep learning libraries such as PyTorch as opposed to using log(1-sigmoid(z)); the former yields numerically more stable gradients during backpropagation. A PyTorch implementation of the CORN loss function is shown in Fig. S1.


Figure S1: CORN loss function implemented in PyTorch v. 1.8.
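Because the code in Fig. S1 is provided as an image, the following is a minimal re-sketch of the CORN loss (Eq. 12) under the conventions of integer labels 0, ..., K-1 and logits of shape (batch_size, K-1); it is our own hedged reconstruction for illustration, not the reference implementation. In practice, the coral-pytorch package referenced in Section 4.4 provides a maintained implementation.

import torch
import torch.nn.functional as F

def corn_loss(logits, labels, num_classes):
    """Sketch of the CORN loss with conditional training subsets (see Eq. 12)."""
    loss = 0.0
    num_examples = 0
    for t in range(num_classes - 1):
        subset_mask = labels >= t                    # conditional subset for task t
        if subset_mask.sum() == 0:
            continue
        z = logits[subset_mask, t]                   # logits of task t on its subset
        targets = (labels[subset_mask] > t).float()  # binary labels for task t
        # log(sigmoid(z)) if target is 1; log(sigmoid(z)) - z = log(1 - sigmoid(z)) otherwise
        loss += -torch.sum(F.logsigmoid(z) * targets
                           + (F.logsigmoid(z) - z) * (1 - targets))
        num_examples += subset_mask.sum()
    return loss / num_examples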

8.7 Additional Figures Explaining the CORN Method


Figure S2: Visual explanation of how the CORN loss is computed using the conditional training subsets.


Figure S3: Outline of a neural network architecture that can be trained using CORN. Compared to a regular classification network, the only architecture modification is that the output layer consists of k-1 instead of k nodes, where k represents the number of unique ordinal labels in the dataset. The hidden layers represent the layers of an existing backbone architecture, such as a standard ResNet-34.