Deep Neural Networks for Rank-Consistent Ordinal Regression Based On Conditional Probabilities
Abstract
In recent years, deep neural networks have achieved outstanding predictive performance on various classification and pattern recognition tasks. However, many real-world prediction problems have ordinal response variables, and this ordering information is ignored by conventional classification losses such as the multi-category cross-entropy. Ordinal regression methods for deep neural networks address this shortcoming. One such method is the CORAL method, which is based on an earlier binary label extension framework and achieves rank consistency among its output layer tasks by imposing a weight-sharing constraint. However, while CORAL's rank consistency was shown to be beneficial for predictive performance, the weight-sharing constraint in a neural network's fully connected output layer may restrict the expressiveness and capacity of a network trained using CORAL. We propose a new method for rank-consistent ordinal regression without this limitation. Our rank-consistent ordinal regression framework (CORN) achieves rank consistency through a novel training scheme, which uses conditional training sets to obtain the unconditional rank probabilities by applying the chain rule for conditional probability distributions. Experiments on various datasets demonstrate the efficacy of the proposed method in utilizing the ordinal target information, and the absence of the weight-sharing restriction improves performance substantially compared to the CORAL reference approach. Additionally, the suggested CORN method is not tied to any specific architecture and can be utilized with any deep neural network classifier to train it for ordinal regression tasks.
1 Introduction
Many real-world prediction tasks involve ordinal target labels. Popular examples of such ordinal tasks are customer ratings (e.g., a product rating system from 1 to 5 stars) and medical diagnoses (e.g., disease severity labels such as none, mild, moderate, and severe). While we can apply conventional classification losses, such as the multi-category cross-entropy, to such problems, they are suboptimal since they ignore the intrinsic order among the ordinal targets. For example, for a patient with severe disease status, predicting none and moderate would incur the same loss even though the difference between none and severe is more significant than the difference between moderate and severe. Moreover, unlike in metric regression, we cannot quantify the distance between the ordinal ranks. For instance, the difference between a disease status of none and mild cannot be quantitatively compared to the difference between mild and moderate. Hence, ordinal regression (also called ordinal classification or ranking learning) can be considered as an intermediate problem between classification and regression.
Among the most common machine learning-based approaches to ordinal regression is Li and Lin’s extended binary classification framework [9] that was adopted for deep neural networks by Niu et al. in 2016 [14]. In this work, we solve the rank inconsistency problem (Fig. 1) of this ordinal regression framework without imposing constraints that could limit the expressiveness of the neural network and without substantially increasing the computational complexity.
The contributions of our paper are as follows:
1. A new rank-consistent ordinal regression framework, CORN (Conditional Ordinal Regression for Neural Networks), based on the chain rule for conditional probability distributions;

2. Rank consistency guarantees without imposing the weight-sharing constraint used in the CORAL reference framework [1];

3. Experiments with different neural network architectures and datasets showing that CORN's removal of the weight-sharing constraint improves the predictive performance compared to the more restrictive reference framework.
2 Related Work
2.1 Ordinal Regression Based on Extended Binary Classification Subtasks
Ordinal regression is a classic problem in statistics, going back to early proportional hazards and proportional odds models [13]. To take advantage of well-studied and well-tuned binary classifiers, the machine learning field developed ordinal regression methods based on extending the rank prediction to multiple binary label classification subtasks [9]. This approach relies on three steps: (1) extending rank labels to binary vectors, (2) training binary classifiers on the extended labels, and (3) computing the predicted rank label from the binary classifiers. Modified versions of this approach have been proposed in connection with perceptrons [3] and support vector machines [22, 17, 2]. In 2007, Li and Lin presented a reduction framework unifying these extended binary classification approaches [9].
2.2 Addressing Rank Consistency in Neural Networks for Ordinal Regression
In 2016, Niu et al. adapted Li and Lin’s extended binary classification framework to train deep neural networks for ordinal regression [14]; we refer to this method as OR-NN. Across different image datasets, OR-NN was able to outperform other reference methods. However, Niu et al. pointed out that OR-NN suffers from rank inconsistencies among the binary tasks and that addressing this limitation might raise the training complexity substantially. Cao et al. [1] recently addressed this rank inconsistency limitation via the CORAL method. To avoid increasing the training complexity, CORAL achieves rank consistency by imposing a weight-sharing constraint in the last layer, such that the binary classifiers only differ in their bias units. However, while CORAL outperformed the OR-NN method across several face image datasets for age prediction, the weight-sharing constraint may impose a severe limitation in terms of the functions that the neural network can approximate. In this paper, we investigate an alternative approach to guarantee rank consistency without increasing the training complexity or restricting the neural network’s expressiveness and capacity.
2.3 Other Neural Network-Based Methods for Ordinal Regression
Several deep neural networks for ordinal regression do not build on the extended binary classification framework. These methods include Zhu et al.’s [25] convolutional ordinal regression forest for image data, which combines a convolutional neural network with differentiable decision trees. Diaz and Marathe [5] proposed a soft ordinal label representation obtained from a softmax layer, which can be used for scenarios where interclass distances are known. Another method that does not rely on the extended binary classification framework is Suarez et al.’s distance metric learning algorithm [24]. Petersen et al. [16] developed a method based on differentiable sorting networks that uses pairwise swapping operations with relaxed sorting functions, which can be used for ranking where the relative ordering is known but the absolute target values are unknown. Liu et al. adapted pairwise ranking constraints from RankingSVM [7] to reformulate the multi-category loss as a constrained optimization problem for ordinal regression [10].
This paper focuses on addressing the rank inconsistency of OR-NN without imposing the weight-sharing constraint of CORAL [1], which is why an exhaustive study of the methods mentioned above is outside the scope of this paper. However, additional experiments and comparisons with SORD [5] and CNNPOR [10] are included in the Supplementary Material in section Comparison with Other Deep Learning Methods for Ordinal Regression.
3 Proposed Method
This section describes the details of our CORN method, which addresses the rank inconsistency in Niu et al.’s OR-NN [14] without requiring CORAL’s [1] weight-sharing constraint.
3.1 Preliminaries
Let $D = \{(\mathbf{x}^{[i]}, y^{[i]})\}_{i=1}^{N}$ denote a dataset for supervised learning consisting of $N$ training examples, where $\mathbf{x}^{[i]} \in \mathcal{X}$ denotes the inputs of the $i$-th training example and $y^{[i]}$ its corresponding class label. In an ordinal regression context, we refer to $y^{[i]}$ as the rank, where $y^{[i]} \in \mathcal{Y} = \{r_1, r_2, \ldots, r_K\}$ with rank order $r_K \succ r_{K-1} \succ \ldots \succ r_1$. The objective of an ordinal regression model is then to find a mapping $h: \mathcal{X} \rightarrow \mathcal{Y}$ that minimizes a loss function $L(h)$.
3.2 Motivation
With CORAL, Cao et al. [1] proposed a deep neural network for ordinal regression that addressed the rank inconsistency of Niu et al.’s OR-NN [14], and experiments showed that addressing rank consistency had a positive effect on predictive performance.
Both CORAL and OR-NN build on an extended binary classification framework [9], where the rank labels are recast into a set of binary tasks, such that $y_k^{[i]} = \mathbb{1}\{y^{[i]} > r_k\}$ indicates whether $y^{[i]}$ exceeds rank $r_k$. The label predictions are then obtained via $h(\mathbf{x}^{[i]}) = r_q$, where $q$ is the rank index, which is computed as

$$q = 1 + \sum_{k=1}^{K-1} \mathbb{1}\big\{f_k(\mathbf{x}^{[i]}) > 0.5\big\}. \tag{1}$$

Here, $f_k(\mathbf{x}^{[i]}) \in [0, 1]$ is the probability prediction of the $k$-th binary classifier in the output layer, and $\mathbb{1}\{\cdot\}$ is an indicator function that returns $1$ if the inner condition is true and $0$ otherwise.
The CORAL method ensures that the predictions are rank-monotonic, that is, $f_1(\mathbf{x}^{[i]}) \ge f_2(\mathbf{x}^{[i]}) \ge \ldots \ge f_{K-1}(\mathbf{x}^{[i]})$, which provides rank consistency to the ordinal regression model. While the rank label calculation via Eq. 1 does not strictly require consistency among the task predictions $f_1(\mathbf{x}^{[i]}), f_2(\mathbf{x}^{[i]}), \ldots, f_{K-1}(\mathbf{x}^{[i]})$, it is intuitive to see why rank consistency can be theoretically beneficial and can lead to more interpretable results via the binary subtasks. While CORAL provides this rank consistency, CORAL’s limitation is a weight-sharing constraint in the output layer. Consequently, all binary classification tasks use the same weight parameters and only differ in their bias units, which may limit the flexibility and expressiveness of an ordinal regression neural network based on CORAL.
The proposed CORN model is a neural network for ordinal regression that guarantees rank consistency without any weight-sharing constraint in the output layer (Fig. 2). Instead, CORN uses a new training procedure with conditional training subsets that ensures rank consistency through applying the chain rule of probability.
3.3 Rank-consistent Ordinal Regression based on Conditional Probabilities
Given a training set $D = \{(\mathbf{x}^{[i]}, y^{[i]})\}_{i=1}^{N}$, CORN applies a label extension to the rank labels similar to CORAL, such that the resulting binary label $y_k^{[i]} \in \{0, 1\}$ indicates whether $y^{[i]}$ exceeds rank $r_k$. Similar to CORAL, CORN also uses $K-1$ learning tasks associated with the ranks $r_1, \ldots, r_{K-1}$ in the output layer, as illustrated in Fig. 2.

However, in contrast to CORAL, CORN estimates a series of conditional probabilities using conditional training subsets (described in Section 3.4), such that the output of the $k$-th binary task $f_k(\mathbf{x}^{[i]})$ represents the conditional probability

$$f_k(\mathbf{x}^{[i]}) = \hat{P}\big(y^{[i]} > r_k \mid y^{[i]} > r_{k-1}\big), \tag{2}$$

where the events are nested: $\{y^{[i]} > r_k\} \subseteq \{y^{[i]} > r_{k-1}\}$. (When $k = 1$, $f_1(\mathbf{x}^{[i]})$ represents the initial unconditional probability $\hat{P}(y^{[i]} > r_1)$.)

The transformed, unconditional probabilities can then be computed by applying the chain rule for probabilities to the model outputs:

$$\hat{P}\big(y^{[i]} > r_k\big) = \prod_{j=1}^{k} f_j\big(\mathbf{x}^{[i]}\big). \tag{3}$$

Since $f_j(\mathbf{x}^{[i]}) \in [0, 1]$, we have

$$\hat{P}\big(y^{[i]} > r_1\big) \ge \hat{P}\big(y^{[i]} > r_2\big) \ge \ldots \ge \hat{P}\big(y^{[i]} > r_{K-1}\big), \tag{4}$$

which guarantees rank consistency among the $K-1$ binary tasks.
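As a concrete illustration of Eqs. 3 and 4, the chain rule corresponds to a cumulative product over the sigmoid outputs of the $K-1$ tasks. The following minimal sketch (our own, not from the paper's code) also verifies the resulting monotonicity:

```python
import torch

# `logits`: outputs of the K-1 binary tasks for a batch, shape (batch_size, K-1).
logits = torch.randn(4, 5)                          # e.g., 4 examples, K-1 = 5 tasks
cond_probas = torch.sigmoid(logits)                 # f_k(x) = P(y > r_k | y > r_{k-1})
uncond_probas = torch.cumprod(cond_probas, dim=1)   # P(y > r_k) = prod_{j<=k} f_j(x), Eq. 3

# Every factor lies in [0, 1], so the cumulative product is non-increasing
# across tasks, i.e., the probabilities are rank-monotonic (Eq. 4).
assert torch.all(uncond_probas[:, :-1] >= uncond_probas[:, 1:])
```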
3.4 Conditional Training Subsets
Our model aims to estimate $f_1(\mathbf{x}^{[i]})$ and the conditional probabilities $f_2(\mathbf{x}^{[i]}), \ldots, f_{K-1}(\mathbf{x}^{[i]})$. Estimating $f_1(\mathbf{x}^{[i]}) = \hat{P}(y^{[i]} > r_1)$ is a classic binary classification task under the extended binary classification framework with the binary labels $y_1^{[i]}$. To estimate conditional probabilities such as $\hat{P}(y^{[i]} > r_2 \mid y^{[i]} > r_1)$, we focus only on the subset of the training data where $y^{[i]} > r_1$. As a result, when we minimize the binary cross-entropy loss on these conditional subsets, the estimated output probability of each binary task has a proper conditional probability interpretation. (When training a neural network using backpropagation, instead of minimizing the loss functions corresponding to the conditional probabilities on each conditional subset separately, we can minimize their sum, as shown in the loss function we propose in Section 3.5, to optimize the binary tasks simultaneously.)
In order to model the conditional probabilities in Eq. 3, we construct conditional training subsets for training, which are used in the loss function (Section 3.5) that is minimized via backpropagation. The conditional training subsets are obtained from the original training set as follows:
$$\begin{aligned}
S_1 &: \text{all } \{(\mathbf{x}^{[i]}, y^{[i]})\},\; i \in \{1, \ldots, N\}, \\
S_2 &: \{(\mathbf{x}^{[i]}, y^{[i]}) \mid y^{[i]} > r_1\}, \\
&\;\;\vdots \\
S_{K-1} &: \{(\mathbf{x}^{[i]}, y^{[i]}) \mid y^{[i]} > r_{K-2}\},
\end{aligned}$$

where $N = |S_1|$, and $|S_k|$ denotes the size of $S_k$. Note that the labels are subject to the binary label extension as described in Section 3.3. Each conditional training subset $S_k$ is used for training the conditional probability prediction $\hat{P}(y^{[i]} > r_k \mid y^{[i]} > r_{k-1})$ for $k \ge 2$.
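In practice, the subsets can be represented as boolean masks over a (mini-)batch. The following is a minimal sketch (not the paper's reference code), assuming the rank labels are encoded as integers $0, \ldots, K-1$ so that rank $r_k$ corresponds to integer label $k-1$:

```python
import torch

def corn_labels_and_subset_masks(y, num_classes):
    """Sketch: extended binary labels (Section 3.3) and conditional-subset
    masks (Section 3.4) for integer rank labels y in {0, ..., num_classes - 1}."""
    # levels[i, k] = 1 if example i exceeds the (k+1)-th rank threshold, i.e., y[i] > k
    levels = (y.unsqueeze(1) > torch.arange(num_classes - 1, device=y.device)).float()
    # Task k is trained only on examples that exceed the previous rank,
    # i.e., y >= k; task 0 therefore uses the full (mini-)batch (subset S_1).
    masks = [y >= k for k in range(num_classes - 1)]
    return levels, masks

# Example with K = 4 ranks and a small batch of labels:
levels, masks = corn_labels_and_subset_masks(torch.tensor([0, 1, 2, 3]), num_classes=4)
```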
Additional theoretical justification for constructing the conditional training subsets is provided in the Supplementary Material in section Theoretical Analysis of Conditional Probability Estimation. Section 5.1 compares the predictive performance of the CORN method with and without training subsets.
3.5 Loss Function
Let $f_j(\mathbf{x}^{[i]})$ denote the predicted value of the $j$-th node in the output layer of the network (Fig. 2), and let $|S_j|$ denote the size of the $j$-th conditional training set. To train a CORN neural network using backpropagation, we minimize the following loss function:

$$L(\mathbf{X}, \mathbf{y}) = -\frac{1}{\sum_{j=1}^{K-1} |S_j|} \sum_{j=1}^{K-1} \sum_{i=1}^{|S_j|} \Big[ \log\big(f_j(\mathbf{x}^{[i]})\big) \cdot \mathbb{1}\{y^{[i]} > r_j\} + \log\big(1 - f_j(\mathbf{x}^{[i]})\big) \cdot \mathbb{1}\{y^{[i]} \le r_j\} \Big]. \tag{5}$$

We note that in Eq. 5, $\mathbf{x}^{[i]}$ represents the $i$-th training example in $S_j$. To simplify the notation, we omit an additional index $j$ to distinguish between $\mathbf{x}^{[i]}$ in different conditional training sets.

To improve the numerical stability of the loss gradients during training, we implement the following alternative formulation of the loss, where $z_j^{[i]}$ are the net inputs of the last layer (aka logits), as shown in Fig. 2, and $f_j(\mathbf{x}^{[i]}) = \sigma(z_j^{[i]})$ with the logistic sigmoid function $\sigma(\cdot)$:

$$L(\mathbf{Z}, \mathbf{y}) = -\frac{1}{\sum_{j=1}^{K-1} |S_j|} \sum_{j=1}^{K-1} \sum_{i=1}^{|S_j|} \Big[ \log\big(\sigma(z_j^{[i]})\big) \cdot \mathbb{1}\{y^{[i]} > r_j\} + \big(\log\big(\sigma(z_j^{[i]})\big) - z_j^{[i]}\big) \cdot \mathbb{1}\{y^{[i]} \le r_j\} \Big]. \tag{6}$$
A derivation showing that the two loss equations are equivalent and a PyTorch implementation are included in the Supplementary Material in the section Numerically Stable Loss Function. In addition, the Supplementary Material includes a visual illustration of the loss computation based on the conditional training subsets (Figure S1) and a theoretical Generalization Bounds analysis.
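The coral-pytorch package referenced in Section 4.4 provides a reference implementation of this loss; the following is only a compact sketch of Eq. 6 (our own variable names, assuming rank labels encoded as integers $0, \ldots, K-1$) to illustrate the core computation:

```python
import torch
import torch.nn.functional as F

def corn_loss_sketch(logits, y, num_classes):
    """Sketch of the CORN loss (Eq. 6); not the coral-pytorch reference code.
    `logits` has shape (batch_size, K-1); `y` holds integer ranks in {0, ..., K-1}."""
    loss, n_examples = 0.0, 0
    for k in range(num_classes - 1):
        mask = y >= k                         # conditional subset S_{k+1}
        if mask.sum() == 0:
            continue
        z = logits[mask][:, k]                # logits of task k on its subset
        target = (y[mask] > k).float()        # binary label: does y exceed rank r_{k+1}?
        # log(sigmoid(z)) for positive labels; log(1 - sigmoid(z)) = logsigmoid(z) - z otherwise
        log_p = F.logsigmoid(z) * target + (F.logsigmoid(z) - z) * (1 - target)
        loss += -log_p.sum()
        n_examples += mask.sum()
    return loss / n_examples                  # normalize by the total subset size (Eq. 5/6)
```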
3.6 Rank Prediction
To obtain the rank index $q$ of the $i$-th training example, or of any new data record during inference, we threshold the predicted probabilities corresponding to the $K-1$ binary tasks and sum the binary labels as follows:

$$q = 1 + \sum_{k=1}^{K-1} \mathbb{1}\big\{\hat{P}\big(y^{[i]} > r_k\big) > 0.5\big\},$$

where the predicted rank is $h(\mathbf{x}^{[i]}) = r_q$.
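A minimal sketch of this prediction rule (our own, returning 0-indexed integer rank indices rather than the paper's 1-indexed $q$):

```python
import torch

def corn_predict_rank_index(logits):
    """Sketch of CORN rank prediction: Eq. 3 followed by thresholding at 0.5."""
    cond_probas = torch.sigmoid(logits)                 # f_k(x)
    uncond_probas = torch.cumprod(cond_probas, dim=1)   # P(y > r_k), Eq. 3
    # Count how many rank thresholds are exceeded with probability > 0.5;
    # adding 1 would recover the 1-indexed rank index q from the paper.
    return (uncond_probas > 0.5).sum(dim=1)
```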
4 Experiments
4.1 Datasets and Preprocessing
The MORPH-2 dataset (https://www.faceaginggroup.com/morph/) [19] contains 55,608 face images, which were processed as described in [1]: facial landmark detection [20] was used to compute the average eye location, which was then used by the EyepadAlign function in MLxtend v0.14 [18] to align the face images. The original MORPH-2 dataset contains age labels in the range of 16-70 years. In this study, we use a balanced version of the MORPH-2 dataset containing 20,625 face images with 33 evenly distributed age labels within the range of 16-48 years.
The Asian Face dataset (AFAD; https://github.com/afad-dataset/tarball) [14] contains 165,501 faces in the age range of 15-40 years. No additional preprocessing was applied to this dataset since the faces were already centered. In this study, we use a balanced version of the AFAD dataset with 13 age labels in the age range of 18-30 years.
The Image Aesthetic dataset (AES; http://www.di.unito.it/~schifane/dataset/beauty-icwsm15/) [21] used in this study contains 13,868 images, each with a list of beauty scores ranging from 1 to 5. To create ordinal regression labels, we replaced the beauty score list of each image with its average score rounded to the nearest integer in the range 1-5. Compared to the other image datasets MORPH-2 and AFAD, the size of the AES dataset was relatively small, and we did not attempt to create a class-balanced version of this dataset for this study.
The Fireman dataset (https://github.com/gagolews/ordinal_regression_data) is a tabular dataset that contains 40,768 instances, 10 numeric features, and an ordinal response variable with 16 categories. We created a balanced version of this dataset consisting of 2,543 instances per class (40,688 instances in total across the 16 ordinal classes).
Each dataset was randomly divided into 75% training data, 5% validation data, and 20% test data. We share the partitions for all datasets, along with all preprocessing code used in this paper, in the code repository (see Section 4.4).
4.2 Neural Network Architectures
4.2.1 Comparison with binary label extension frameworks for ordinal regression
For the main method comparisons to other binary extension frameworks for ordinal regression on the image datasets (MORPH-2 and AFAD), we used ResNet34 [6] as the backbone architecture since it is an established architecture that is known to achieve good performance on a variety of image classification datasets. Besides the hyperparameter settings listed in Tables 1 and 2, we adopted all other settings from the ResNet34 paper.
For the tabular Fireman dataset, we used a simple multilayer perceptron architecture (MLP) with leaky ReLU [12] activation functions (negative slope 0.01). Since the MLP architectures were prone to overfitting, a dropout layer with drop probability 0.2 was added after the leaky ReLU activations in each hidden layer. In addition, we used the AdamW [11] optimizer with a weight decay rate of 0.2. The number of hidden layers (one or two) and the number of units per hidden layer were determined by hyperparameter tuning (see Section 4.3 for more details).
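A sketch of such an MLP backbone is shown below (hypothetical helper; the hidden-layer sizes were tuned as described in Section 4.3, so the size of 300 used here is only illustrative, and the number of output nodes depends on the method, e.g., $K-1$ for CORN):

```python
import torch.nn as nn

def make_mlp(num_features, num_hidden, num_outputs, num_hidden_layers=2):
    """Hypothetical helper sketching the tabular MLP backbone:
    leaky ReLU (negative slope 0.01) and dropout (p=0.2) after each hidden layer."""
    layers, in_dim = [], num_features
    for _ in range(num_hidden_layers):
        layers += [nn.Linear(in_dim, num_hidden),
                   nn.LeakyReLU(negative_slope=0.01),
                   nn.Dropout(p=0.2)]
        in_dim = num_hidden
    layers.append(nn.Linear(in_dim, num_outputs))
    return nn.Sequential(*layers)

# e.g., Fireman: 10 features, 16 ranks -> K-1 = 15 output nodes for CORN
model = make_mlp(num_features=10, num_hidden=300, num_outputs=15)
```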
In this paper, we focus on comparing the performance of a neural network trained via the rank-consistent CORN approach to the two prominent binary extension-based ordinal regression frameworks for deep learning, the Niu et al. [14] OR-NN method (no rank consistency) and CORAL (rank consistency by using identical weight parameters for all nodes in the output layer). As a performance baseline, we also implemented neural network classifiers trained with the standard multi-category cross-entropy loss, which we refer to as CE-NN. While all methods (CE-NN, OR-NN, CORAL, and CORN) use different loss functions during training, it is worth emphasizing that they can share similar backbone architectures and only require small changes in the output layer. For instance, to implement a neural network for ordinal regression using the proposed CORN method, we replaced the network’s output layer with the corresponding binary conditional probability task layer.
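For instance, assuming the torchvision implementation of ResNet-34, this change amounts to swapping out the final fully connected layer (a sketch, not the paper's exact code):

```python
import torch
from torchvision import models

num_classes = 33                      # e.g., the 33 age labels of the balanced MORPH-2 setup
model = models.resnet34()             # a CE-NN baseline would instead use num_classes output units
model.fc = torch.nn.Linear(model.fc.in_features, num_classes - 1)  # K-1 binary tasks for CORN
```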
4.3 Training and Evaluation
The model evaluations and comparisons are based on the mean absolute error (MAE) and root mean squared error (RMSE), which are defined as follows:
$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \big| y^{[i]} - h\big(\mathbf{x}^{[i]}\big) \big| \quad \text{and} \quad \mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \big( y^{[i]} - h\big(\mathbf{x}^{[i]}\big) \big)^2},$$

where $y^{[i]}$ is the ground truth rank of the $i$-th test example and $h(\mathbf{x}^{[i]})$ is the predicted rank, respectively.
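Both metrics operate directly on the integer rank indices; a minimal sketch (ours, not from the paper's code):

```python
import torch

def mae_rmse(y_true, y_pred):
    """MAE and RMSE between ground-truth and predicted rank indices."""
    diff = (y_true - y_pred).float()
    mae = diff.abs().mean()
    rmse = diff.pow(2).mean().sqrt()
    return mae.item(), rmse.item()
```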
We considered the exact same hyperparameter ranges for each method (a detailed list of the hyperparameter configurations we considered is shown in Table 1) and selected the best hyperparameter configuration for each method via grid search based on its validation set performance. Then, using the best hyperparameter setting for each method, we repeated the model training five times using different random seeds (0, 1, 2, 3, and 4) for the random weight initialization and dataset shuffling. Note that both the hyperparameter configuration and the best training epoch were determined based on the validation set before computing the final model performance on the independent test set. The best hyperparameter values for each method are listed in Table 2.
Backbone | Learning rates | Batch sizes | Layer sizes
---|---|---|---
ResNet34 | 5e-5, 1e-4, 2.5e-4, 5e-4, 1e-3, 5e-3 | 16, 32, 64, 128, 256, 512 | NA
MLP | 1e-5, 5e-5, 1e-4, 5e-4, 1e-3, 5e-3 | 16, 32, 64, 128, 256, 512 | Layer 1: 50, 100, 200, 300; Layer 2: 50, 100, 200, 300
Datasets | Backbone | Methods | Learning rates | Batch sizes | Number of layers | Layer hidden units
---|---|---|---|---|---|---
Image Datasets | ResNet34 | CE-NN | 5e-4 | 256 | - | -
Image Datasets | ResNet34 | OR-NN | 5e-4 | 256 | - | -
Image Datasets | ResNet34 | CORAL | 5e-4 | 256 | - | -
Image Datasets | ResNet34 | CORN | 5e-4 | 16 | - | -
Fireman | MLP | CE-NN | 5e-4 | 64 | 2 | 
Fireman | MLP | OR-NN | 5e-4 | 128 | 2 | 
Fireman | MLP | CORAL | 5e-4 | 64 | 2 | 
Fireman | MLP | CORN | 1e-3 | 128 | 2 | 
The models were trained for 200 epochs using stochastic gradient descent via adaptive moment estimation [8] with the default decay rates. We carefully checked all runs for convergence, i.e., that training continued until the training and validation MAE started to diverge and the validation MAE began to stagnate or decline. The complete training logs for all methods are provided in the code repository (Section 4.4).
4.4 Hardware and Software
All neural networks were implemented in PyTorch 1.8 [15]. The models were trained on NVIDIA GeForce RTX 2080Ti graphics cards on a private workstation as well as T4 graphics cards using the Grid.ai platform. We make all source code used for the experiments available at https://github.com/Raschka-research-group/corn-ordinal-neuralnet and provide a user-friendly implementation of CORN in the coral-pytorch Python package (https://github.com/Raschka-research-group/coral-pytorch).
5 Results and Discussion
To compare deep neural networks trained with our proposed CORN method to CORAL [1], Niu et al.’s OR-NN [14], and the baseline cross-entropy loss (CE-NN), we conducted a series of experiments on three image datasets and one tabular dataset. As detailed in Section 4.2, the experiments on the MORPH-2 and AFAD image datasets were based on the ResNet34 architecture. We used a multilayer perceptron for the tabular Fireman dataset.
An additional study using a VGG16 backbone pre-trained on ImageNet and comparisons with SORD and CNNPOR can be found in the Supplementary Material in section Comparison with Other Deep Learning Methods for Ordinal Regression. In addition, results on text datasets and recurrent neural networks are included in the Supplementary Material in section Additional Results on Text Datasets using Recurrent Neural Networks.
As the main results in Table 3 show, CORN outperforms all other binary label extension methods for ordinal regression on the MORPH-2 and AFAD image datasets and is tied with OR-NN on the Fireman tabular dataset. We repeated the experiments with different random seeds for model weight initialization and data shuffling, which ensures that the results are not coincidental.
Method | Metrics format | MORPH-2 (Balanced) MAE | MORPH-2 (Balanced) RMSE | AFAD (Balanced) MAE | AFAD (Balanced) RMSE | Fireman MAE | Fireman RMSE
---|---|---|---|---|---|---|---
CE-NN | AVG ± SD | 3.73 ± 0.12 | 5.04 ± 0.20 | 3.28 ± 0.04 | 4.19 ± 0.06 | 0.80 ± 0.01 | 1.14 ± 0.01
OR-NN [14] | AVG ± SD | 3.13 ± 0.09 | 4.23 ± 0.10 | 2.85 ± 0.03 | 3.48 ± 0.04 | 0.76 ± 0.01 | 1.08 ± 0.01
CORAL [1] | AVG ± SD | 2.99 ± 0.04 | 4.01 ± 0.03 | 2.99 ± 0.03 | 3.70 ± 0.07 | 0.82 ± 0.01 | 1.15 ± 0.01
CORN (ours) | AVG ± SD | 2.98 ± 0.02 | 3.99 ± 0.05 | 2.81 ± 0.02 | 3.46 ± 0.02 | 0.76 ± 0.01 | 1.08 ± 0.01
It is worth noting that even though CORAL’s rank consistency was found to be beneficial for model performance [1], it performs noticeably worse than OR-NN on the balanced MORPH-2 and AFAD datasets. This is likely due to CORAL’s weight-sharing constraint in the output layer, which could affect the expressiveness of the neural network and thus limit the complexity of the functions it can learn. In contrast, the CORN method, which is also rank-consistent, performs better than OR-NN on MORPH-2 and AFAD.
We found that OR-NN and CORN have identical performance on the tabular Fireman dataset (Table 3), outperforming both CE-NN and CORAL in both test MAE and test RMSE. Here, the performances of all methods are relatively close, which suggests that the 16-category prediction task is relatively easy for a fully connected neural network regardless of the loss function.
5.1 Ablation Study
Given the superior performance of CORN across several datasets, we studied the importance of the conditional training subsets. In this ablation study, we created an alternative CORN method without the conditional training subsets. Here, the output of the $k$-th binary task is computed as

$$f_k(\mathbf{x}^{[i]}) = \hat{P}\big(y^{[i]} > r_k\big), \tag{7}$$

which is a modified version of Eq. 2, estimated on the full training set rather than the conditional subset $S_k$. Note that this modification results in meaningless probability scores; however, the rank consistency via Eq. 4 is still guaranteed since the probability scores are still computed via Eq. 3, and each score cannot be greater than 1.
We note that the modified CORN method without training subsets sees at least as many training examples as the regular CORN method, because each task now has access to the full training batch rather than a subset.

As the results in Table 4 show, the conditional training subsets not only play a crucial role in yielding meaningful and theoretically justified rank probability values in CORN, but they also improve the predictive performance. Across all datasets, with the exception of MORPH-2, the neural network trained with the regular CORN method outperforms the alternative version without subsets.
Dataset | CORN (with conditional training subsets) | CORN (without conditional training subsets)
---|---|---
MORPH-2 | 2.98 ± 0.02 | 
AFAD |  | 
AES |  | 
Fireman |  | 
6 Conclusions
In this paper, we developed the rank-consistent CORN framework for ordinal regression via conditional training datasets. We used CORN to train convolutional and fully connected neural architectures on ordinal response variables. Our experimental results showed that the CORN method improved the predictive performance compared to the rank-consistent reference framework CORAL. While our experiments focused on image and tabular datasets, the generality of our CORN method allows it to be readily applied to other types of datasets to solve ordinal regression problems with various neural network structures.
7 Acknowledgements
This research was supported by the Office of the Vice Chancellor for Research and Graduate Education at the University of Wisconsin-Madison with funding from the Wisconsin Alumni Research Foundation.
References
- [1] W. Cao, V. Mirjalili, and S. Raschka. Rank consistent ordinal regression for neural networks with application to age estimation. Pattern Recognition Letters, 140:325–331, 2020.
- [2] W. Chu and S. S. Keerthi. New approaches to support vector ordinal regression. In Proceedings of the International Conference on Machine Learning, pages 145–152. ACM, 2005.
- [3] K. Crammer and Y. Singer. Pranking with ranking. In Advances in Neural Information Processing Systems, pages 641–647, 2002.
- [4] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
- [5] R. Diaz and A. Marathe. Soft labels for ordinal regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4738–4747, 2019.
- [6] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- [7] T. Joachims. Optimizing search engines using clickthrough data. In Proceedings of the eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 133–142, 2002.
- [8] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Y. Bengio and Y. LeCun, editors, International Conference on Learning Representations, pages 1–8, 2015.
- [9] L. Li and H.-T. Lin. Ordinal regression by extended binary classification. In Advances in Neural Information Processing Systems, pages 865–872, 2007.
- [10] Y. Liu, A. W. K. Kong, and C. K. Goh. A constrained deep neural network for ordinal regression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 831–839, 2018.
- [11] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations (Poster), 2019.
- [12] A. L. Maas, A. Y. Hannun, and A. Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio, Speech and Language Processing, 2013.
- [13] P. McCullagh. Regression models for ordinal data. Journal of the Royal Statistical Society. Series B (Methodological), pages 109–142, 1980.
- [14] Z. Niu, M. Zhou, L. Wang, X. Gao, and G. Hua. Ordinal regression with multiple output CNN for age estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4920–4928, 2016.
- [15] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035, 2019.
- [16] F. Petersen, C. Borgelt, H. Kuehne, and O. Deussen. Differentiable sorting networks for scalable sorting and ranking supervision. In International Conference on Machine Learning, 2021.
- [17] S. Rajaram, A. Garg, X. S. Zhou, and T. S. Huang. Classification approach towards ranking and sorting problems. In Proceedings of the European Conference on Machine Learning, pages 301–312. Springer, 2003.
- [18] S. Raschka. MLxtend: Providing machine learning and data science utilities and extensions to Python’s scientific computing stack. The Journal of Open Source Software, 3(24):1–2, 2018.
- [19] K. Ricanek and T. Tesafaye. Morph: A longitudinal image database of normal adult age-progression. In Proceedings of the IEEE Conference on Automatic Face and Gesture Recognition, pages 341–345, 2006.
- [20] C. Sagonas, E. Antonakos, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic. 300 faces in-the-wild challenge: database and results. Image and Vision Computing, 47:3–18, 2016.
- [21] R. Schifanella, M. Redi, and L. M. Aiello. An image is worth more than a thousand favorites: Surfacing the hidden beauty of flickr pictures. In International AAAI Conference on Web and Social Media, 2015.
- [22] A. Shashua, A. Levin, et al. Ranking with large margin principle: Two approaches. Advances in Neural Information Processing Systems, pages 961–968, 2003.
- [23] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
- [24] J. L. Suárez, S. García, and F. Herrera. Ordinal regression with explainable distance metric learning based on ordered sequences. Machine Learning, pages 1–34, 2021.
- [25] H. Zhu, H. Shan, Y. Zhang, L. Che, X. Xu, J. Zhang, J. Shi, and F.-Y. Wang. Convolutional ordinal regression forest for image ordinal estimation. IEEE Transactions on Neural Networks and Learning Systems, 2021.
8 Supplementary Material
8.1 Theoretical Analysis of Conditional Probability Estimation
Suppose we are interested in estimating a series of conditional probabilities $\hat{P}(y^{[i]} > r_k \mid y^{[i]} > r_{k-1})$ with the observed dataset $D$, where $f_k(\mathbf{x}^{[i]})$ is the functional form of the neural network model outputs that depends on the neural network model weights $\mathbf{w}$. The likelihood of the model weights can be written as

$$L(\mathbf{w}) = \prod_{j=1}^{K-1} \prod_{i=1}^{|S_j|} \Big[ f_j\big(\mathbf{x}^{[i]}\big) \Big]^{\mathbb{1}\{y^{[i]} > r_j\}} \Big[ 1 - f_j\big(\mathbf{x}^{[i]}\big) \Big]^{\mathbb{1}\{y^{[i]} \le r_j\}}. \tag{8}$$

Hence, minimizing the loss function (Eq. 5), which is proportional to the negative logarithm of this likelihood, is equivalent to finding the maximum likelihood estimate of the functional form representations of the conditional probabilities. This is also the theoretical justification for constructing the conditional training sets in the data preparation for the CORN loss function. Without using the conditional subsets in the loss function, the estimated probabilities do not have a conditional probability maximum likelihood interpretation. After obtaining the maximum likelihood estimates of the conditional probabilities, it is natural to use the probability chain rule to find the unconditional probabilities of exceeding rank $r_k$ in Eq. 3 for each input $\mathbf{x}^{[i]}$.
8.2 Generalization Bounds
Analogous to CORAL [1] and based on established generalization bounds for binary classification, Theorem 1 shows that the final rank prediction by CORN generalizes well when the binary classification tasks generalize well.
Theorem 1 (reduction of generalization error).
Let $C$ be the cost matrix for the ordinal label predictions, where $C_{y,y} = 0$ and $C_{y,r_k} > 0$ for $r_k \ne y$. $P$ is the underlying distribution of $(\mathbf{x}, y)$, i.e., $(\mathbf{x}, y) \sim P$. Furthermore, let $h$ be the model output yielding the predicted rank $r_q$; that is, $h(\mathbf{x}) = r_q$. Let $y^{(k)} = \mathbb{1}\{y > r_k\}$, and let $f_k(\mathbf{x})$ be the prediction of the $k$-th binary task. Given the binary classification tasks $f_1, \ldots, f_{K-1}$, which we obtain from minimizing the loss in Eq. 5, and the rank-monotonic probabilities $\hat{P}(y > r_k)$, we have

$$\mathbb{E}_{(\mathbf{x}, y) \sim P}\big[ C_{y,\, h(\mathbf{x})} \big] \;\le\; \sum_{k=1}^{K-1} \mathbb{E}_{(\mathbf{x}, y) \sim P}\Big[ \big| C_{y, r_k} - C_{y, r_{k+1}} \big| \cdot \mathbb{1}\big\{ f_k(\mathbf{x}) \ne y^{(k)} \big\} \Big]. \tag{9}$$
8.3 Comparison with Other Deep Learning Methods for Ordinal Regression
We compare CORN with two additional, recent ordinal regression methods that do not rely on the binary extension framework:

1. the convolutional neural network with pairwise regularization for ordinal regression (CNNPOR) method by Liu et al. [10];

2. the soft ordinal vectors (SORD) method by Diaz and Marathe [5].
To facilitate a fair comparison, we adopted the exact same architecture and preprocessing steps as in [5] and [10]. Similar to CNNPOR and SORD, we used a VGG16 [23] backbone pre-trained on ImageNet [4], where only the last layer (the output layer) was re-initialized with random weights. Also, following the preprocessing steps in CNNPOR and SORD, the training images in the AES dataset were resized, randomly cropped, and randomly flipped across the horizontal axis.
As these additional results on the AES dataset show, CORN also outperforms other recent ordinal regression methods for deep learning (CNNPOR [10] and SORD [5]) overall when trained with a VGG16 backbone that was pre-trained on ImageNet (Table S2).
Datasets | Backbone | Methods | Learning rates | Batch sizes
---|---|---|---|---
AES Nature, Animals, Urban, People | VGG16 | CE-NN | 5e-5, 5e-5, 5e-5, 5e-5 | 32, 32, 32, 16
AES Nature, Animals, Urban, People | VGG16 | OR-NN | 1e-4, 1e-4, 1e-4, 5e-4 | 32, 32, 16, 16
AES Nature, Animals, Urban, People | VGG16 | CORAL | 5e-4, 1e-3, 1e-3, 5e-4 | 16, 16, 16, 32
AES Nature, Animals, Urban, People | VGG16 | CORN | 5e-5, 5e-5, 5e-5, 5e-5 | 64, 64, 64, 32
8.4 Additional Results on Text Datasets using Recurrent Neural Networks
This section describes additional results we obtained from comparing CORN to other methods on text datasets using recurrent neural networks (RNNs) with long short-term memory (LSTM) cells.
Method | Seed | TripAdvisor MAE | TripAdvisor RMSE | Coursera MAE | Coursera RMSE
---|---|---|---|---|---
CE-RNN | 0 | 1.13 | 1.56 | 1.01 | 1.48 |
1 | 1.04 | 1.53 | 0.97 | 1.05 | |
2 | 1.05 | 1.54 | 1.12 | 1.65 | |
3 | 1.23 | 1.81 | 1.18 | 1.76 | |
4 | 1.03 | 1.52 | 0.84 | 1.26 | |
AVG ± SD | 1.10 ± 0.09 | 1.59 ± 0.12 | 1.02 ± 0.13 | 1.53 ± 0.19 |
OR-RNN [14] | 0 | 1.06 | 1.53 | 0.98 | 1.34 |
1 | 1.09 | 1.50 | 0.93 | 1.24 | |
2 | 1.11 | 1.53 | 1.12 | 1.47 | |
3 | 1.23 | 1.52 | 1.11 | 1.53 | |
4 | 1.07 | 1.40 | 0.85 | 1.16 | |
AVG ± SD | 1.11 ± 0.07 | 1.50 ± 0.06 | 1.00 ± 0.12 | 1.35 ± 0.15 |
CORAL [1] | 0 | 1.15 | 1.58 | 0.99 | 1.29 |
1 | 1.14 | 1.49 | 1.03 | 1.39 | |
2 | 1.16 | 1.46 | 1.14 | 1.40 | |
3 | 1.19 | 1.41 | 1.20 | 1.40 | |
4 | 1.13 | 1.47 | 0.82 | 1.11 | |
AVG ± SD | 1.15 ± 0.02 | 1.48 ± 0.06 | 1.04 ± 0.15 | 1.33 ± 0.13 |
CORN (ours) | 0 | 1.09 | 1.55 | 0.95 | 1.37 |
1 | 1.09 | 1.53 | 0.90 | 1.32 | |
2 | 1.01 | 1.45 | 1.07 | 1.49 | |
3 | 1.12 | 1.51 | 1.05 | 1.47 | |
4 | 1.03 | 1.46 | 0.78 | 1.14 | |
AVG ± SD | 1.07 ± 0.05 | 1.50 ± 0.04 | 0.95 ± 0.12 | 1.36 ± 0.14
Both the 100K Coursera course reviews dataset (https://www.kaggle.com/septa97/100k-courseras-course-reviews-dataset) and the TripAdvisor hotel reviews dataset (https://www.kaggle.com/andrewmvd/trip-advisor-hotel-reviews) contain reviews with rating labels ranging from 1 to 5 stars. We used balanced versions of these datasets to distribute the ratings evenly. The balanced Coursera dataset contains 11,852 reviews, and the balanced TripAdvisor dataset contains 7,000 reviews. Each dataset was randomly divided into 75% training data, 5% validation data, and 20% test data. The dataset splits and preprocessing code can be found in the code repository (see Section 4.4 of the main manuscript).
For method comparisons on the text datasets, we use a standard RNN with one LSTM cell. Similar to the image datasets, we compare the performance of a neural network trained via the rank-consistent CORN approach to both Niu et al.’s OR-RNN method (no rank consistency) and CORAL (rank consistency by using identical weight parameters for all nodes in the output layer). We also implemented RNN classifiers trained with standard multicategory cross-entropy loss as a baseline, which we refer to as CE-RNN. All methods share similar backbone architectures and only require minor changes in the output layer.
The training and evaluation steps are similar to those of the image datasets in the main manuscript. The RNN models were trained for 200 epochs using ADAM with default settings. The model with the best validation set performance was then chosen as the final model for evaluation on the test set. The training logs for all runs are available in the CORN GitHub repository (see Section 4.4). The learning rates considered for hyperparameter tuning were 1e-5, 5e-5, 1e-4, 5e-4, 1e-3, 5e-3, and we considered batch sizes 16, 32, 64, 128, 256, 512.
As the results in Table S4 show, CORN outperforms all other methods on the two text datasets, TripAdvisor and Coursera, in terms of the test set MAE. All experiments were repeated for different random seeds to ensure that the results were not coincidental.
It is worth noting that while the CORN method showed superior performance in terms of test MAE, the CORAL method performed better in test RMSE compared with all other methods. One possible explanation is that since RMSE penalizes large gaps more harshly than MAE, CORAL may behave slightly better on outliers while CORN makes fewer mistakes in total. However, both methods show reliable performance on the text datasets.
8.5 Detailed Performance Table
Table S5 is a more detailed version of the results table shown in the main paper, listing the performance for each individual random seed.
Method | Seed | MORPH-2 MAE | MORPH-2 RMSE | AFAD MAE | AFAD RMSE | Fireman MAE | Fireman RMSE
---|---|---|---|---|---|---|---
CE-NN | 0 | 3.81 | 5.19 | 3.31 | 4.27 | 0.80 | 1.14 |
1 | 3.60 | 4.8 | 3.28 | 4.19 | 0.80 | 1.14 | |
2 | 3.61 | 4.84 | 3.32 | 4.22 | 0.79 | 1.13 | |
3 | 3.85 | 5.21 | 3.24 | 4.15 | 0.80 | 1.16 | |
4 | 3.80 | 5.14 | 3.24 | 4.13 | 0.80 | 1.15 | |
AVG ± SD | 3.73 ± 0.12 | 5.04 ± 0.20 | 3.28 ± 0.04 | 4.19 ± 0.06 | 0.80 ± 0.01 | 1.14 ± 0.01 |
OR-NN [14] | 0 | 3.21 | 4.25 | 2.81 | 3.45 | 0.75 | 1.07 |
1 | 3.16 | 4.25 | 2.87 | 3.54 | 0.76 | 1.08 | |
2 | 3.16 | 4.31 | 2.82 | 3.46 | 0.77 | 1.10 | |
3 | 2.98 | 4.05 | 2.89 | 3.49 | 0.76 | 1.08 | |
4 | 3.13 | 4.27 | 2.86 | 3.45 | 0.74 | 1.07 | |
AVG ± SD | 3.13 ± 0.09 | 4.23 ± 0.10 | 2.85 ± 0.03 | 3.48 ± 0.04 | 0.76 ± 0.01 | 1.08 ± 0.01 |
CORAL [1] | 0 | 2.94 | 3.98 | 2.95 | 3.60 | 0.82 | 1.14 |
1 | 2.97 | 4.03 | 2.99 | 3.69 | 0.83 | 1.16 | |
2 | 3.01 | 3.98 | 2.98 | 3.70 | 0.81 | 1.13 | |
3 | 2.98 | 4.01 | 3.00 | 3.78 | 0.82 | 1.16 | |
4 | 3.03 | 4.06 | 3.04 | 3.75 | 0.82 | 1.15 | |
AVG ± SD | 2.99 ± 0.04 | 4.01 ± 0.03 | 2.99 ± 0.03 | 3.70 ± 0.07 | 0.82 ± 0.01 | 1.15 ± 0.01 |
CORN (ours) | 0 | 2.98 | 4 | 2.80 | 3.45 | 0.75 | 1.07 |
1 | 2.99 | 4.01 | 2.81 | 3.44 | 0.76 | 1.08 | |
2 | 2.97 | 3.97 | 2.84 | 3.48 | 0.77 | 1.10 | |
3 | 3.00 | 4.06 | 2.80 | 3.48 | 0.76 | 1.08 | |
4 | 2.95 | 3.92 | 2.79 | 3.45 | 0.74 | 1.07 | |
AVG ± SD | 2.98 ± 0.02 | 3.99 ± 0.05 | 2.81 ± 0.02 | 3.46 ± 0.02 | 0.76 ± 0.01 | 1.08 ± 0.01
8.6 Numerically Stable Loss Function
We can convert the CORN loss function,

$$L(\mathbf{X}, \mathbf{y}) = -\frac{1}{\sum_{j=1}^{K-1} |S_j|} \sum_{j=1}^{K-1} \sum_{i=1}^{|S_j|} \Big[ \log\big(f_j(\mathbf{x}^{[i]})\big) \cdot \mathbb{1}\{y^{[i]} > r_j\} + \log\big(1 - f_j(\mathbf{x}^{[i]})\big) \cdot \mathbb{1}\{y^{[i]} \le r_j\} \Big], \tag{11}$$

into an alternative version,

$$L(\mathbf{Z}, \mathbf{y}) = -\frac{1}{\sum_{j=1}^{K-1} |S_j|} \sum_{j=1}^{K-1} \sum_{i=1}^{|S_j|} \Big[ \log\big(\sigma(z_j^{[i]})\big) \cdot \mathbb{1}\{y^{[i]} > r_j\} + \big(\log\big(\sigma(z_j^{[i]})\big) - z_j^{[i]}\big) \cdot \mathbb{1}\{y^{[i]} \le r_j\} \Big], \tag{12}$$

where $z_j^{[i]}$ are the net inputs of the last layer (aka logits) and $f_j(\mathbf{x}^{[i]}) = \sigma(z_j^{[i]})$ with the logistic sigmoid function $\sigma(z) = 1/(1 + e^{-z})$, since

$$\log\big(1 - \sigma(z)\big) = \log\left(\frac{e^{-z}}{1 + e^{-z}}\right) = \log\left(\frac{1}{1 + e^{-z}}\right) - z = \log\big(\sigma(z)\big) - z.$$

This allows us to use the `logsigmoid(z)` function that is implemented in deep learning libraries such as PyTorch, as opposed to using `log(1-sigmoid(z))`; the former yields numerically more stable gradients during backpropagation. A PyTorch implementation of the CORN loss function is shown in Fig. S1.
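The identity can also be checked numerically; a quick sketch (ours, not part of the paper's code) shows why the logsigmoid form is preferable for large logits:

```python
import torch
import torch.nn.functional as F

z = torch.tensor([5.0, 40.0])
print(torch.log(1 - torch.sigmoid(z)))   # naive form: the second entry underflows to -inf
print(F.logsigmoid(z) - z)               # stable form: finite values for both entries
```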