Distributionally Robust Multi-Output Regression Ranking
Abstract
Despite their empirical success, most existing listwise learning-to-rank (LTR) models are not built to be robust to errors in labeling or annotation, distributional data shift, or adversarial data perturbations. To fill this gap, we introduce a new listwise LTR model called Distributionally Robust Multi-output Regression Ranking (DRMRR). Different from existing methods, the scoring function of DRMRR was designed as a multivariate mapping from a feature vector to a vector of deviation scores, which captures local context information and cross-document interactions. DRMRR uses a Distributionally Robust Optimization (DRO) framework to minimize a multi-output loss function under the most adverse distributions in the neighborhood of the empirical data distribution defined by a Wasserstein ball. We show that this is equivalent to a regularized regression problem with a matrix norm regularizer. Our experiments were conducted on two real-world applications, medical document retrieval and drug response prediction, and show that DRMRR notably outperforms state-of-the-art LTR models. We also conducted a comprehensive analysis to assess the resilience of DRMRR against various types of noise: Gaussian noise, adversarial perturbations, and label poisoning. We show that DRMRR not only achieves significantly better performance than other baselines, but also maintains a relatively stable performance as more noise is added to the data.
1 Introduction
There exist many real-world applications such as recommendation systems, document retrieval, machine translation, and computational biology where the correct ordering of instances is of equal or greater importance than minimizing regression or classification errors Ru et al. (2021). Learning-to-rank (LTR) refers to a group of algorithms that apply machine learning techniques to tackle these ranking problems. Generally speaking, LTR methods learn a scoring function that maps an instance-query feature vector to a relevance score (i.e., multi-level rating/label) that is then used to rank instances for a given query. Ideally, the resulting ranked list should maximize a ranking metric Qin et al. (2010); Sotudian et al. (2021); Bruch (2021). We considered two medical applications of LTR, namely medical document retrieval and drug response prediction. Healthcare applications commonly face various challenges including susceptibilities in data collection due to instrument and environmental noise or data entry errors; ambiguous or improper data annotation; lack of large-scale data for training and testing of algorithms; imbalanced data sets; missing data; divergence of training and testing data distributions (e.g., data is recorded by different hospitals using different procedures); and more importantly, the threat of adversarial attacks Papangelou et al. (2018); Finlayson et al. (2019); Qayyum et al. (2020). Consequently, robustness is critical for the wider adoption and deployment of algorithms into healthcare systems Qayyum et al. (2020).
In this work, without loss of generality, we take document retrieval as an example to explain the concepts and formulations. The main goal of document retrieval is to rank a set of documents by their relevance to a query. A slightly different example in computational biology is drug response prediction. Prescribing the right therapeutic option for each cancer patient is an intricate task since the efficacy of cancer medications varies among patients. Nevertheless, the biological differences among patients’ cancers can be used to design genomic predictors of drug responses from large panels of cancer cell lines Sotudian and Paschalidis (2021). In drug response prediction, large-scale screenings of cancer cell lines against libraries of pharmacological compounds are used to predict precise and individualized medications.
Existing LTR approaches fall into three categories, namely pointwise, pairwise, and listwise Liu (2011). The pointwise approach casts ranking as a classification or regression problem – most early LTR algorithms, such as linear regression ranking Liu (2011), take this approach. In the pairwise approach, a classification method is employed to learn the preference order within document pairs. Representative pairwise ranking algorithms include RankBoost Freund et al. (2003), RankNet Burges et al. (2005), and ordinal regression Liu (2011). Both approaches are misaligned with ranking utilities such as Normalized Discounted Cumulative Gain (NDCG) and do not model the ranking problem directly. Listwise models overcome this drawback by taking the entire list of retrieved documents for a query as an instance and training a ranking function through the minimization of a listwise loss function. Experimental results show that listwise approaches generally outperform pointwise and pairwise algorithms Cao et al. (2007). The literature offers a variety of approaches, from deriving a smooth approximation to ranking utilities (e.g., ApproxNDCG Qin, Liu, and Li (2010) and SoftRank Taylor et al. (2008)) to constructing differentiable surrogate loss functions (e.g., ListMLE Xia et al. (2008), LambdaMART Wu et al. (2010), and ListNet Cao et al. (2007)).
Most existing studies on LTR achieve impressive performance but often neglect the importance of robustness Liu (2011). Systematic noise can become part of a data set in many ways and deceive LTR models into ranking an item at an incorrect position with high confidence. While Empirical Risk Minimization (ERM) is effective at minimizing the training loss, ERM often does not yield models that are robust to adversarially crafted samples Biggio et al. (2013). Distributionally Robust Optimization (DRO) is a modeling paradigm for data-driven decision-making under uncertainty. It has been successful in handling problems with corrupted training data by hedging against the most adverse distribution within a Wasserstein ball Chen and Paschalidis (2020a). Recently, DRO has been an active area of research owing to its robustness to adversarial examples, rigorous out-of-sample and asymptotic consistency guarantees, and remarkable empirical performance Sinha et al. (2017).
In the present work, we seek to induce robustness into LTR problems by using the DRO framework. Equipped with this perspective, we make the following contributions. Unlike other LTR frameworks, our algorithm approaches listwise ranking in a novel way and employs ranking metrics (i.e., NDCG) in its output. In particular, we use the notion of position deviation to define a vector of relevance scores instead of a scalar. We then adopt the DRO framework to minimize a worst-case expected multi-output loss function over a probabilistic ambiguity set defined by the Wasserstein metric. To the best of our knowledge, ours is the first study that utilizes a multi-output Wasserstein DRO framework to robustify LTR problems. We present an equivalent convex reformulation of the DRO problem, which is shown to be tighter than earlier work Chen and Paschalidis (2020b). In experiments, our approach yields state-of-the-art results in two challenging applications of LTR, namely medical document retrieval and drug response prediction. More importantly, we evaluated our model against various types of attacks, including adversarial attacks and label attacks, showing that it maintains consistently good performance under all attack scenarios.
Notational conventions: We use boldfaced lowercase letters to denote vectors, ordinary lowercase letters to denote scalars, boldfaced uppercase letters to denote matrices, and calligraphic capital letters to denote sets. All vectors are column vectors. We use prime to denote the transpose, $[N]$ for the set $\{1,\dots,N\}$ for any integer $N$, $\|\cdot\|_p$ for the $\ell_p$ norm with $p \ge 1$, and $\mathbf{I}$ for the identity matrix. For a matrix $\boldsymbol{A}$, we use $\|\boldsymbol{A}\|_p$ to denote its induced norm, defined as $\|\boldsymbol{A}\|_p = \max_{\boldsymbol{x}\neq \boldsymbol{0}} \|\boldsymbol{A}\boldsymbol{x}\|_p / \|\boldsymbol{x}\|_p$.
2 Preliminaries
2.1 Learning-to-Rank
In a ranking problem, the data consist of a set of triples (query, document, relevance score). A feature vector $\boldsymbol{x}_{i,j} \in \mathbb{R}^d$ is used to represent a query-document pair, and the relevance score indicates the degree of relevance of this document to its corresponding query. Given a ranking data set $\{(\boldsymbol{X}_i, \boldsymbol{y}_i)\}_{i=1}^{N}$, $i$ indexes a query, and $\boldsymbol{X}_i$ and $\boldsymbol{y}_i$ represent the list of retrieved documents and corresponding relevance scores, respectively. The $i$-th query contains $M_i$ documents; $\boldsymbol{X}_i$ has rows $\boldsymbol{x}_{i,j}$, $j \in [M_i]$, each of which is a $d$-dimensional document feature vector. The vector $\boldsymbol{y}_i$ contains the corresponding ground-truth relevance scores, where a higher $y_{i,j}$ indicates that the document with features $\boldsymbol{x}_{i,j}$ is more relevant. In the learning-to-rank framework, denoting by $\boldsymbol{x}$ and $y$ the random variables that represent the document feature vector and relevance score, respectively, the goal is to learn a scoring function $f$ that best predicts the relevance score:
$$\min_{f} \ \mathbb{E}^{\mathbb{P}^*}\big[\ell\big(f(\boldsymbol{x}), y\big)\big], \qquad (1)$$
where $\ell(\cdot,\cdot)$ is a loss function, $f(\boldsymbol{x})$ predicts the relevance score of each document, and $\mathbb{P}^*$ is the underlying true probability distribution of $(\boldsymbol{x}, y)$. Given that $\mathbb{P}^*$ is unknown, most existing LTR algorithms solve (1) by estimating the expected loss with its empirical substitute (2):
$$\min_{f} \ \frac{1}{N} \sum_{i=1}^{N} \frac{1}{M_i} \sum_{j=1}^{M_i} \ell\big(f(\boldsymbol{x}_{i,j}), y_{i,j}\big). \qquad (2)$$
For a test query $\boldsymbol{X}$ consisting of $M$ documents, the final predicted ranking list is obtained by sorting the rows of $\boldsymbol{X}$ according to their inferred ranking scores $f(\boldsymbol{x}_j)$, $j \in [M]$. Eq. (2) is restrictive in the sense that it does not take into account the inter-dependency of scores between documents, and the empirical estimate is very sensitive to data perturbations.
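As a concrete illustration of this standard setup, the minimal sketch below uses synthetic data and a least-squares linear scorer as one plausible instance of the generic $f$ and loss in (2): it fits the scorer on one query's documents and then ranks them by their inferred scores. The data and the choice of squared loss are illustrative, not the paper's method.

```python
import numpy as np

# Toy query with M = 4 documents, d = 3 features, and graded relevance labels.
X = np.array([[0.2, 1.0, 0.1],
              [0.9, 0.3, 0.5],
              [0.4, 0.8, 0.7],
              [0.1, 0.2, 0.0]])
y = np.array([2, 0, 1, 0])

# A pointwise linear scorer f(x) = w'x fit by least squares (one instance of Eq. (2)
# with a squared loss); any other loss or model could be plugged in the same way.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

scores = X @ w                    # inferred relevance scores
ranking = np.argsort(-scores)     # documents ordered from most to least relevant
print("predicted order of documents:", ranking)
```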
2.2 Distributionally Robust Optimization
Distributionally Robust Optimization (DRO) hedges against a set of probability distributions instead of just the empirical distribution. DRO minimizes a worst-case loss over a probabilistic ambiguity set $\Omega$:
$$\min_{f} \ \sup_{\mathbb{Q} \in \Omega} \ \mathbb{E}^{\mathbb{Q}}\big[\ell\big(f(\boldsymbol{x}), y\big)\big],$$
where the ambiguity set can be defined through moment constraints Wiesemann, Kuhn, and Sim (2014), or as a ball of distributions using some probabilistic distance function such as the Wasserstein distance Blanchet and Murthy (2019); Shafieezadeh-Abadeh, Kuhn, and Esfahani (2019). The Wasserstein DRO model has been extensively studied in the machine learning community; see, for example, Blanchet et al. (2019); Chen and Paschalidis (2018) for robustified regression models, Sinha et al. (2017) for adversarial training in neural networks, and Shafieezadeh-Abadeh, Esfahani, and Kuhn (2015) for distributionally robust logistic regression. These works and Gao, Chen, and Kleywegt (2017); Chen and Paschalidis (2020a) provided a comprehensive analysis of the Wasserstein-based distributionally robust statistical learning framework.
3 Problem Formulation
Next, we introduce our DRO formulation of the LTR problem. Different from existing works, where a univariate relevance score $y_{i,j}$ is used for each document, we define a Ground Truth Deviation (GTD) vector $\boldsymbol{n}_{i,j} \in \mathbb{R}^k$ to characterize different levels of importance for document $j$ in the $i$-th query. Here, $k$ is a constant to be defined later (end of Sec. 3.1). We derive an equivalent reformulation of the DRO problem in Sec. 3.2.
3.1 Ground Truth Deviation
As a popular evaluation criterion in information retrieval, Normalized Discounted Cumulative Gain (NDCG) can handle cases with more than two degrees of relevancy for documents Ravikumar, Tewari, and Yang (2011). Let $D(\cdot)$ be a discount function, $G(\cdot)$ a monotone increasing gain function, and $\{(\boldsymbol{x}_j, y_j)\}_{j=1}^{M}$ a set of documents ordered according to their ground-truth rank, with $\boldsymbol{x}_j$ a document feature vector and $y_j$ a relevance score. Assume $\pi$ is a (predicted) ranked list for these documents; then the Discounted Cumulative Gain (DCG) of $\pi$ is defined as $DCG(\pi) = \sum_{r=1}^{M} G\big(y_{\pi^{-1}(r)}\big) D(r)$, where $\pi^{-1}(r)$ is the index of the document ranked at position $r$ of $\pi$. The reason for introducing the discount function is that the user cares less about documents ranked lower Wang et al. (2013). NDCG normalizes DCG by the Ideal DCG (IDCG), which is the DCG score of the ideal ranking result Valizadegan et al. (2009) and can be computed as $IDCG = \sum_{r=1}^{M} G(y_r) D(r)$ since the documents are already ordered according to their ground-truth rank.
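For concreteness, here is a small sketch of DCG/NDCG computation, assuming the common LETOR-style choices $G(y) = 2^{y} - 1$ and $D(r) = 1/\log_2(1+r)$; the paper's exact gain and discount functions may differ.

```python
import numpy as np

def dcg_at_k(labels_in_ranked_order, k):
    """DCG@k with the common gain 2^y - 1 and discount 1/log2(1+r) (one standard choice)."""
    labels = np.asarray(labels_in_ranked_order, dtype=float)[:k]
    gains = 2.0 ** labels - 1.0
    discounts = 1.0 / np.log2(np.arange(2, labels.size + 2))  # positions 1..k -> log2(2..k+1)
    return float(np.sum(gains * discounts))

def ndcg_at_k(labels_in_ranked_order, k):
    """NDCG@k = DCG@k of the predicted order divided by the ideal DCG@k."""
    ideal = dcg_at_k(sorted(labels_in_ranked_order, reverse=True), k)
    return dcg_at_k(labels_in_ranked_order, k) / ideal if ideal > 0 else 0.0

# Relevance labels of documents listed in the *predicted* order.
print(ndcg_at_k([2, 0, 1, 0, 2], k=5))
```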
Considering the $i$-th query, which contains $M_i$ documents, we define the Ground Truth Deviation (GTD) vector for document $j$ as follows:
$$\boldsymbol{n}_{i,j} = \boldsymbol{\delta}_{i,j} \odot \boldsymbol{\theta}_{i,j} \odot \boldsymbol{\gamma}_{i,j}, \qquad (3)$$
where $\odot$ is the Hadamard product (a.k.a. the element-wise product). The vector $\boldsymbol{n}_{i,j}$ is comprised of the following three components:
NDCG deviation score ($\boldsymbol{\delta}_{i,j}$).
To compute this vector, the elements of $\boldsymbol{y}_i$ are first sorted in descending order of their individual relevance scores, and the document feature vectors are sorted correspondingly; we denote the sorted lists by $\bar{\boldsymbol{y}}_i$ and $\bar{\boldsymbol{X}}_i$, respectively. The NDCG score of this ideal ordering equals 1. If we switch two documents in $\bar{\boldsymbol{X}}_i$, the NDCG will decrease or, when their relevance scores are equal, stay the same. For document $j$ in query $i$, we define the NDCG deviation score vector as $\boldsymbol{\delta}_{i,j} = \big(\delta_{i,j}^{(1)}, \dots, \delta_{i,j}^{(M_i)}\big)$, where $\delta_{i,j}^{(r)}$ is the NDCG score of $\bar{\boldsymbol{X}}_i$ when we switch the position of document $j$ with the document in the $r$-th position of $\bar{\boldsymbol{X}}_i$. A closed-form expression for this swap in terms of the gain, discount, and IDCG is derived in the supplementary materials. We can interpret the $r$-th element of the GTD vector as a score indicating the degree of congruence between a document and the $r$-th rank.
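The following sketch computes the NDCG deviation vector by literally performing the swaps described above, reusing the same illustrative gain and discount choices as in the earlier NDCG sketch; it is a direct, unoptimized implementation of the definition rather than the closed-form expression from the supplement.

```python
import numpy as np

def dcg(labels_in_ranked_order):
    labels = np.asarray(labels_in_ranked_order, dtype=float)
    gains = 2.0 ** labels - 1.0
    discounts = 1.0 / np.log2(np.arange(2, labels.size + 2))
    return float(np.sum(gains * discounts))

def ndcg_deviation_vector(sorted_labels, j):
    """NDCG of the ideally sorted list after swapping document j with the document
    at position r, for every position r; the unswapped ideal list has NDCG = 1."""
    idcg = dcg(sorted_labels)
    deltas = []
    for r in range(len(sorted_labels)):
        perturbed = list(sorted_labels)
        perturbed[j], perturbed[r] = perturbed[r], perturbed[j]
        deltas.append(dcg(perturbed) / idcg)
    return np.array(deltas)

# Labels already sorted in descending order of relevance (the ideal ranking).
print(ndcg_deviation_vector([2, 2, 1, 0, 0], j=0))
```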
Position deviation score ($\boldsymbol{\theta}_{i,j}$).
This vector is defined to further push the relevant documents to the top of the ranking list and to penalize documents based on their position in the list. The position deviation score works in conjunction with the NDCG deviation score $\boldsymbol{\delta}_{i,j}$. We define it as $\boldsymbol{\theta}_{i,j} = \big(\theta_{i,j}^{(1)}, \dots, \theta_{i,j}^{(M_i)}\big)$, where $\theta_{i,j}^{(r)}$ penalizes the deviation between position $r$ and the optimal position of document $j$; its functional form is governed by two parameters.
As can be seen in Figure 1, one parameter specifies the GTD's maximum score and the other regulates the magnitude of the penalty for a position deviation. We use the red dashed curve for positive deviations (i.e., when a document is ranked higher than its optimal position) and the black curve for negative deviations. This induces our model to tolerate a positive deviation more than a negative one; consequently, the model pushes the relevant documents to the top of the ranking list.
[Figure 1: position deviation penalty for positive (red dashed) and negative (black) deviations, and the document importance score for different relevance labels.]
Document importance score ($\boldsymbol{\gamma}_{i,j}$).
It is defined to place greater emphasis on highly relevant documents; its value depends on the relevance score $y_{i,j}$ and on the maximum possible value of the relevance scores. Figure 1 presents the document importance score for different values of the relevance label.
Ultimately, instead of a single relevance score for each document, we have a GTD vector. The GTD vector characterizes different levels of importance for a document in a query: the first element corresponds to the first level of importance, the second element to the second level, and so on. Since each query may have a different number of documents, we consider only the first $k$ elements of $\boldsymbol{\delta}_{i,j}$ and $\boldsymbol{\theta}_{i,j}$ in our model, corresponding to $k$ levels of importance. In this way, all GTD vectors have the same length. We prefer a low value of $k$ since it forces the model to focus on the most relevant documents. In case a large $k$ needs to be used and a query has fewer than $k$ documents, we can simply repeat the last elements of $\boldsymbol{\delta}_{i,j}$ and $\boldsymbol{\theta}_{i,j}$ to pad the vector.
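To make the construction concrete, the sketch below composes the three components into a GTD vector. The asymmetric exponential position penalty and the $2^y$-based importance weight are illustrative stand-ins chosen to match the qualitative description above (tolerate positive deviations more than negative ones, emphasize highly relevant documents); the paper's exact functional forms and parameter values may differ.

```python
import numpy as np

def gtd_vector(ndcg_dev, opt_pos, label, k, a=1.0, b_pos=0.2, b_neg=0.6, y_max=2):
    """Illustrative GTD construction: Hadamard product of the NDCG deviation vector,
    an asymmetric position-deviation penalty, and a document-importance weight.
    a, b_pos, b_neg and the functional forms are illustrative knobs, not the paper's."""
    ranks = np.arange(len(ndcg_dev))
    dev = opt_pos - ranks                       # > 0: candidate rank is above the optimal position
    rate = np.where(dev >= 0, b_pos, b_neg)     # smaller decay -> positive deviations tolerated more
    position_pen = a * np.exp(-rate * np.abs(dev))
    importance = (2.0 ** label - 1.0) / (2.0 ** y_max - 1.0)  # emphasize highly relevant documents
    gtd = ndcg_dev * position_pen * importance  # element-wise (Hadamard) product of the components
    return gtd[:k]                              # keep the first k levels of importance

ndcg_dev = np.array([1.0, 0.97, 0.91, 0.85, 0.85])
print(gtd_vector(ndcg_dev, opt_pos=0, label=2, k=3))
```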
3.2 Distributionally Robust Multi-Output Regression
Consider a setting with $k$ levels of importance, where features $\boldsymbol{x} \in \mathbb{R}^d$ and importance scores $\boldsymbol{n} \in \mathbb{R}^k$ are distributed according to unknown distributions. We restrict our attention to linear function classes by assuming $f(\boldsymbol{x}) = \boldsymbol{B}'\boldsymbol{x}$, where $\boldsymbol{B} \in \mathbb{R}^{d \times k}$. The matrix $\boldsymbol{B}$ characterizes the dependency structure of the different levels of importance. Nonlinearity can be introduced by applying a transformation (e.g., a kernel function) to the feature vector $\boldsymbol{x}$. The Distributionally Robust Multi-output Regression Ranking (DRMRR) formulation minimizes the worst-case expected loss as follows:
$$\min_{\boldsymbol{B}} \ \sup_{\mathbb{Q} \in \Omega} \ \mathbb{E}^{\mathbb{Q}}\big[h_{\boldsymbol{B}}(\boldsymbol{x}, \boldsymbol{n})\big], \qquad (4)$$
where $h_{\boldsymbol{B}}(\boldsymbol{x}, \boldsymbol{n})$ is a Lipschitz continuous loss function on the metric space of feature-GTD pairs, $\mathbb{Q}$ is a probability distribution of $(\boldsymbol{x}, \boldsymbol{n})$, $\Omega \triangleq \{\mathbb{Q} : W_1(\mathbb{Q}, \hat{\mathbb{P}}_N) \le \epsilon\}$ is the uncertainty set, $\epsilon$ is a positive constant (the Wasserstein ball radius), $\hat{\mathbb{P}}_N$ is the empirical distribution that assigns equal probability to all training samples, with $N$ the number of queries, and $W_1(\mathbb{Q}_1, \mathbb{Q}_2)$ is the order-1 Wasserstein distance between $\mathbb{Q}_1$ and $\mathbb{Q}_2$, defined as follows:
$$W_1(\mathbb{Q}_1, \mathbb{Q}_2) \triangleq \inf_{\pi \in \Pi(\mathbb{Q}_1, \mathbb{Q}_2)} \ \mathbb{E}^{\pi}\big[\|\boldsymbol{z}_1 - \boldsymbol{z}_2\|\big].$$
Here, $\boldsymbol{z}_1 = (\boldsymbol{x}_1, \boldsymbol{n}_1)$ and $\boldsymbol{z}_2 = (\boldsymbol{x}_2, \boldsymbol{n}_2)$ are drawn from $\mathbb{Q}_1$ and $\mathbb{Q}_2$, respectively, and $\Pi(\mathbb{Q}_1, \mathbb{Q}_2)$ is the set of joint distributions of $\boldsymbol{z}_1$ and $\boldsymbol{z}_2$ with marginals $\mathbb{Q}_1$ and $\mathbb{Q}_2$. Note that the same norm is used to define the Wasserstein metric and the domain space of the loss function. In the following theorem, we propose an equivalent reformulation of (4) by applying duality to the inner maximization problem.
Theorem 3.1
Suppose our dataset consists of $N$ queries and each query contains $M$ documents: $\{(\boldsymbol{X}_i, \boldsymbol{N}_i)\}_{i=1}^{N}$, where $\boldsymbol{X}_i \in \mathbb{R}^{M \times d}$ is the document feature matrix with rows $\boldsymbol{x}_{i,j}$, $j \in [M]$, and $\boldsymbol{N}_i \in \mathbb{R}^{M \times k}$ is the GTD matrix with rows $\boldsymbol{n}_{i,j}$. Define a per-query loss function $h(\boldsymbol{X}_i, \boldsymbol{N}_i; \boldsymbol{B})$ that measures the discrepancy between $\boldsymbol{N}_i$ and the predicted GTD matrix $\boldsymbol{X}_i \boldsymbol{B}$. If the Wasserstein metric is induced by $\|\cdot\|_p$, the DRMRR problem (4) can be equivalently reformulated as:
$$\min_{\boldsymbol{B}} \ \frac{1}{N} \sum_{i=1}^{N} h\big(\boldsymbol{X}_i, \boldsymbol{N}_i; \boldsymbol{B}\big) \;+\; \epsilon \, \|\boldsymbol{B}\|, \qquad (5)$$
where $\|\boldsymbol{B}\|$ is the matrix norm induced by the norms that define the Wasserstein metric and $\epsilon$ is the radius of the Wasserstein ball.
The proof can be found in the supplementary materials. Thm. 3.1 establishes a connection between distributional robustness and regularization, which has also been studied by, e.g., Shafieezadeh-Abadeh, Esfahani, and Kuhn (2015); Shafieezadeh-Abadeh, Kuhn, and Esfahani (2019); Gao, Chen, and Kleywegt (2017). However, most existing studies focus on a univariate output. By contrast, our work adapts the DRO framework to a multi-output setting, which is better suited to the ranking problem. Recently, Chen and Paschalidis (2020b) studied a multi-output regression problem under the Wasserstein DRO framework; our result in Theorem 3.1 presents a tighter reformulation than theirs. In the case where the Wasserstein metric is induced by the $\ell_2$ norm, Eq. (5) yields a regularizer that is the spectral norm (largest singular value) of $\boldsymbol{B}$, while Chen and Paschalidis (2020b) derived a Frobenius-norm regularizer, which is looser.
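The regularized form in (5) can be prototyped directly with an off-the-shelf convex solver. The sketch below uses synthetic data, an $\ell_2$ per-document loss as one plausible instance of $h$, and the spectral-norm penalty corresponding to the $\ell_2$ case discussed above; the loss, the radius value, and any Lipschitz scaling are illustrative rather than the paper's exact specification.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
N_q, M, d, k = 8, 10, 5, 3           # queries, documents per query, features, GTD length
X = rng.normal(size=(N_q, M, d))      # document feature matrices
N = rng.uniform(size=(N_q, M, k))     # GTD matrices
eps = 0.1                             # Wasserstein-ball radius (illustrative value)

B = cp.Variable((d, k))
# Empirical multi-output loss: an l2 loss per document, averaged over the data
# (one plausible choice of h); the spectral-norm penalty follows the l2 case of Theorem 3.1.
loss = sum(cp.norm(N[i][j] - X[i][j] @ B, 2)
           for i in range(N_q) for j in range(M)) / (N_q * M)
problem = cp.Problem(cp.Minimize(loss + eps * cp.sigma_max(B)))
problem.solve()
print("fitted coefficient matrix B:\n", np.round(B.value, 3))
```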
3.3 Score calculation
Suppose we are given a test query $\boldsymbol{X}$; we can estimate its GTD matrix as $\hat{\boldsymbol{N}} = \boldsymbol{X}\boldsymbol{B}$. In the matrix $\hat{\boldsymbol{N}}$, columns correspond to different ranks and rows to different documents. Algorithm 1 describes the procedure for producing a ranking from the output of the DRMRR algorithm, where the mod operation returns the remainder of dividing one integer by another.
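Algorithm 1 itself is given in the paper; the following is one plausible reading of the described procedure (fill rank positions greedily, cycling through the $k$ columns of $\hat{\boldsymbol{N}}$ via the mod operation). It is an illustrative sketch under that assumption, not a verbatim reproduction of Algorithm 1.

```python
import numpy as np

def rank_documents(N_hat):
    """Greedily build a ranking from a predicted GTD matrix N_hat (rows: documents,
    columns: importance levels/ranks). For rank position r, look at column r mod k
    and pick the unassigned document with the largest predicted score there."""
    M, k = N_hat.shape
    remaining = set(range(M))
    ranking = []
    for r in range(M):
        col = r % k                                  # remainder of dividing r by k
        best = max(remaining, key=lambda doc: N_hat[doc, col])
        ranking.append(best)
        remaining.remove(best)
    return ranking

X_test = np.random.default_rng(1).normal(size=(6, 5))   # 6 documents, 5 features
B_hat = np.random.default_rng(2).normal(size=(5, 3))    # learned coefficient matrix
N_hat = X_test @ B_hat                                   # predicted GTD matrix
print(rank_documents(N_hat))
```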
4 Experimental Results
4.1 Experiment setup
Data sets:
We conducted experiments on two publicly available benchmark datasets: OHSUMED and Drug Response Prediction (DRP). As a subset of the MEDLINE database (a database of medical publications), the OHSUMED corpus Hersh et al. (1994) consists of about 0.3 million records from 270 medical journals from 1987 to 1991. A query set with 106 queries on the OHSUMED corpus has been extensively used in previous works, in which each query-document pair is represented by 45 features Qin et al. (2010). There are in total 16,140 query-document pairs with relevance judgments. LETOR Qin et al. (2010) defined three ratings 0, 1, 2, corresponding to "irrelevant," "partially relevant," and "definitely relevant," respectively. In addition to OHSUMED, we trained and evaluated our method using the cell line data and drug sensitivity data from the Cancer Cell Line Encyclopedia (CCLE) CCLE (2021) and the Cancer Therapeutics Response Portal (CTRP v2) CTRP (2021). A total of 332 cell lines and 50 drug responses were used. The "Act Area" (the area above the fitted dose-response curve) was used to quantify drug sensitivity, where a lower response value indicates higher drug sensitivity. After several pre-processing steps, cell lines are represented by 251 numeric features (i.e., genes) and drug sensitivities are labeled with graded relevance from 0 to 2 (i.e., "insensitive," "sensitive," and "highly sensitive," respectively), with larger labels indicating higher sensitivity. Further details of the pre-processing steps can be found in the supplement.
Evaluation metrics:
We evaluated model performance using two metrics: NDCG@k and AP@k. NDCG@k is the top-k version of NDCG, in which only the top k positions of the ranking contribute to the DCG. Precision at position k (P@k) is the fraction of relevant documents in the top k. Suppose we have binary relevance labels for the documents retrieved for a query; we define $P@k = \frac{1}{k}\sum_{i=1}^{k} \mathbb{1}\{\text{document at position } i \text{ is relevant}\}$, where $\mathbb{1}\{\cdot\}$ is the indicator function. We define Average Precision at position k as $AP@k = \frac{1}{N_{rel}} \sum_{i=1}^{k} P@i \cdot \mathbb{1}\{\text{document at position } i \text{ is relevant}\}$, where $N_{rel}$ is the total number of relevant documents in the top k of the ranking list. AP is a highly localized performance measure and captures the quality of rankings for applications where only the first few results matter. The main difference between AP and NDCG is that NDCG differentiates between "partially relevant" and "definitely relevant" documents, while AP treats them equally. Given a set of testing queries and a performance metric, we report the mean of that metric over all queries. From now on, we use NDCG@k and AP@k to denote mean NDCG@k and mean AP@k, respectively.
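A minimal sketch of AP@k as defined above (binary relevance, normalization by the number of relevant documents in the top k) is given below; the example data are illustrative.

```python
import numpy as np

def average_precision_at_k(binary_rels_in_ranked_order, k):
    """AP@k per the definition above: P@i is the fraction of relevant documents in the
    top i, and AP@k averages P@i over the relevant positions within the top k."""
    rels = np.asarray(binary_rels_in_ranked_order[:k], dtype=float)
    n_rel_topk = rels.sum()
    if n_rel_topk == 0:
        return 0.0
    precisions = np.cumsum(rels) / np.arange(1, rels.size + 1)   # P@1, ..., P@k
    return float(np.sum(precisions * rels) / n_rel_topk)

# 1 = relevant, 0 = irrelevant, listed in predicted order.
print(average_precision_at_k([1, 0, 1, 1, 0], k=5))
```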
Competing Methods:
Although the list of published LTR algorithms is long, LambdaMART-MAP Wu et al. (2010), LambdaMART-NDCG Wu et al. (2010), and XE-MART-NDCG Bruch (2021) have repeatedly been demonstrated to outperform other algorithms, including RankNet Burges et al. (2005), Coordinate Ascent Metzler and Croft (2007), ListNet Cao et al. (2007), Random Forests Breiman (2001), BoltzRank Volkovs and Zemel (2009), ListMLE Xia et al. (2008), Position-Aware ListMLE Lan et al. (2014), RankBoost Freund et al. (2003), AdaRank Xu and Li (2007), SoftRank Taylor et al. (2008), ApproxNDCG Qin, Liu, and Li (2010), ApproxAP Qin, Liu, and Li (2010), and several direct optimization methods Xu et al. (2008); Metzler, Croft, and McCallum (2005). Here, we rely on prior research Bruch (2021); Wang et al. (2018) and do not include the weaker methods in our experiments. It is worth noting that the author of XE-MART-NDCG proposed it as a robust alternative to LambdaMART-based models. We used open-source Python packages (i.e., LightGBM Ke et al. (2017); Meng et al. (2016); Zhang, Si, and Hsieh (2017) and XGBoost Chen and Guestrin (2016)) to implement the baseline models.
Experimental settings and hyper-parameter optimization:
In our experiments, we followed the standard supervised LTR framework Liu (2011). Authors of LETOR Qin et al. (2010) partitioned the OHSUMED data set into five parts for five-fold cross-validation where three parts are used for training, one part for validation (i.e., tuning the hyperparameters of the learning algorithms), and the remaining part for evaluating the performance of the learned model. Similarly, we partitioned the drug response data set into five folds and conducted five-fold cross-validation to train, validate and evaluate the ranking algorithms. In all experiments, the average on the test set over the 5 folds was reported. Algorithm parameters were tuned on the validation sets. We optimized the algorithm parameters to maximize NDCG@5 and NDCG@10. The details of the parameter-tuning procedure and the optimal parameters for each algorithm can be found in the supplement.
4.2 Overall Comparison
We compared the performance of the proposed method on the OHSUMED and Drug Response Prediction (DRP) data sets with the baseline methods introduced in Sec. 4.1. The results are in Table 1; the values inside the parentheses denote the Standard Deviation (SD) of the corresponding metrics. DRMRR consistently outperforms all baseline methods across all metrics. In our experiment on the OHSUMED data, LambdaMART-NDCG demonstrated reasonably good overall performance and was the second-best method, whereas XE-MART-NDCG was the second-best method on the DRP data. The difference between the best and the second-best methods on the DRP data set is greater than what we obtained for OHSUMED. Due to the limited number of samples available and the specific structure of the DRP data, the performance of the baseline methods diminished significantly; DRMRR, on the other hand, was able to maintain its high performance. To sum up, the proposed method is not only able to push the most relevant documents (or sensitive drugs) to the top of the ranking list, but it can also put them in the right order.
Table 1: Performance comparison on the OHSUMED and DRP data sets (standard deviations in parentheses).

Dataset | Algorithm | NDCG@5 | NDCG@10 | AP@5 | AP@10
OHSUMED | LambdaMART-MAP | 45.18% (5.07%) | 43.65% (3.55%) | 67.94% (7.26%) | 64.12% (5.83%)
OHSUMED | LambdaMART-NDCG | 46.17% (5.91%) | 44.40% (5.00%) | 68.53% (8.40%) | 65.25% (6.47%)
OHSUMED | XE-MART-NDCG | 44.31% (6.58%) | 44.79% (5.65%) | 65.25% (8.08%) | 62.41% (6.76%)
OHSUMED | DRMRR | 47.79% (6.58%) | 45.36% (4.84%) | 70.84% (7.35%) | 65.31% (7.35%)
DRP | LambdaMART-MAP | 58.11% (1.90%) | 63.39% (2.17%) | 83.27% (1.57%) | 77.25% (1.07%)
DRP | LambdaMART-NDCG | 58.73% (2.54%) | 62.87% (2.83%) | 83.07% (1.99%) | 76.95% (1.19%)
DRP | XE-MART-NDCG | 59.37% (1.92%) | 63.51% (2.18%) | 83.70% (1.28%) | 77.43% (1.34%)
DRP | DRMRR | 68.40% (1.74%) | 71.27% (1.78%) | 85.03% (1.10%) | 81.03% (1.00%)
4.3 Robustness Comparison
In this section, we empirically study the behavior of DRMRR in the presence of noise. While our overall performance analysis suggests that DRMRR is the best-behaved of the four methods, that analysis was performed on clean data. The robustness of a ranking model to noise is crucial in practice, especially in the healthcare domain. We put this hypothesis to the test through four types of experiments, all conducted on the OHSUMED data set since it is a popular and standard LTR benchmark. We report only AP@5 and NDCG@5; other metrics are presented in the supplementary material. In all experiments, the reported values are averages over the 5 folds.
Gaussian Noise Attack:
We added Gaussian noise to the test documents to deliberately corrupt them and thereby degrade their predictability. Gaussian noise was added to a randomly chosen 75% of the test queries. Experiments were conducted using various means and a fixed standard deviation of 0.001. We used the perturbed test data to evaluate the models trained in Sec. 4.2 (i.e., all algorithms were tested on the same perturbed test data). Fig. 2 shows the performance of the algorithms on the perturbed test data. Two observations are in order: DRMRR outperformed the baseline models at all levels of noise, and DRMRR maintained a relatively stable performance.
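The perturbation step can be sketched as follows; the helper name, array shapes, and the particular mean shown are illustrative (the experiment sweeps several means with the fixed standard deviation of 0.001).

```python
import numpy as np

def perturb_queries_gaussian(X_test, query_ids, mean, std=0.001, frac=0.75, seed=0):
    """Add Gaussian noise to the feature vectors of a random 75% of test queries;
    the remaining queries are left clean."""
    rng = np.random.default_rng(seed)
    X_noisy = X_test.copy()
    unique_q = np.unique(query_ids)
    attacked = rng.choice(unique_q, size=int(frac * len(unique_q)), replace=False)
    mask = np.isin(query_ids, attacked)
    X_noisy[mask] += rng.normal(loc=mean, scale=std, size=X_noisy[mask].shape)
    return X_noisy

X_test = np.random.default_rng(1).normal(size=(100, 45))   # 100 documents, 45 features
query_ids = np.repeat(np.arange(10), 10)                    # 10 queries x 10 documents
X_noisy = perturb_queries_gaussian(X_test, query_ids, mean=0.05)
```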
[Figure 2: AP@5 and NDCG@5 on the perturbed OHSUMED test data under Gaussian noise of increasing mean.]
Universal Adversarial Perturbation Attack:
We built an adversarial model to introduce perturbations that break neighborhood relationships by altering the input slightly. To that end, a pointwise linear regression ranking model was trained as the adversarial model on the clean training set. Then, 75% of the test queries were perturbed using the coefficients of the adversarial model and the Fast Gradient Sign Method (FGSM): $\boldsymbol{x}^{adv} = \boldsymbol{x} + \epsilon \,\mathrm{sign}\big(\nabla_{\boldsymbol{x}} J(\boldsymbol{x}, y)\big)$, where $\boldsymbol{x}^{adv}$ is the perturbed feature vector, $\epsilon$ controls the magnitude of the perturbations, and $J$ is the cost function of the adversarial model Goodfellow, Shlens, and Szegedy (2014).
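A compact sketch of this attack is shown below, assuming a squared-error cost for the linear adversary (so that the input gradient has the closed form $2(\boldsymbol{w}'\boldsymbol{x} + b - y)\boldsymbol{w}$); the data, $\epsilon$, and the choice of cost are illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(200, 45)), rng.integers(0, 3, size=200)
X_test, y_test = rng.normal(size=(80, 45)), rng.integers(0, 3, size=80)

# Pointwise linear-regression adversary trained on the clean training set.
adv = LinearRegression().fit(X_train, y_train)

# FGSM: x' = x + eps * sign(grad_x J); for a squared-error cost J,
# grad_x J = 2 * (w'x + b - y) * w, with (w, b) the adversary's parameters.
eps = 0.05                                      # perturbation magnitude (illustrative)
residual = adv.predict(X_test) - y_test
grad = 2.0 * residual[:, None] * adv.coef_[None, :]
X_test_adv = X_test + eps * np.sign(grad)
```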
All algorithms trained in Sec. 4.2 were evaluated on the same perturbed test data. In this case, the adversary had no knowledge of the ranking models but was trained on the same training data. Fig. 3 shows the performance of the algorithms on the perturbed test data. As the level of perturbation (i.e., $\epsilon$) increases, DRMRR is less sensitive to adversarial perturbations than the competing methods and demonstrates a stable performance across all metrics. Among the baselines, XE-MART-NDCG, which performed well in terms of NDCG@5, demonstrated poor performance in terms of AP@5.
[Figure 3: AP@5 and NDCG@5 on the perturbed OHSUMED test data under universal adversarial perturbations of increasing magnitude.]
Black-box Adversarial Attack:
The black-box adversarial attack restricts the attacker's knowledge to the deployed model only Bhambri et al. (2019). The setting of black-box attacks is closer to real-world scenarios; therefore, this is the most practical experiment for measuring the robustness of our algorithm. Please refer to Bhambri et al. (2019) for more information on black-box adversarial attacks. Since the adversary has no access to the model's weights and parameters, it can train a parallel model, called a substitute model, to imitate the original model. Usually, the attacker can use a far more expressive architecture than the original model's for this purpose. Here, we use Neural Networks (NN) as our substitute models; the architecture of this model is presented in the supplement. To construct the substitute models, the training data were fed to each ranking model independently and the output was observed. Then, for each algorithm, an NN ranking model was trained as an adversarial substitute using the training feature vectors and the observed output of that specific algorithm. Subsequently, 75% of the test queries were perturbed using the FGSM method and the parameters of the substitute model corresponding to each algorithm.
We trained four substitute models, one per ranking algorithm, and used the corresponding perturbed test data to evaluate the best trained models (i.e., each model has a different perturbed test set). Fig. 4 shows the performance of the algorithms on the perturbed test data; the values are the average of 5 folds. Both metrics show the same trend – increasing the level of perturbation (i.e., $\epsilon$) leads to significant differences between DRMRR and the baseline methods. The competing methods were greatly affected by this type of noise, whereas they were only moderately affected in the simpler experiments, namely the universal adversarial attack and the Gaussian attack. We conclude that DRMRR is robust to adversarial perturbations, an important property that leads to good generalization ability.
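The substitute-model attack can be sketched as follows, using synthetic data and a small two-layer network as a stand-in for the substitute architecture described in the supplement; the training budget, $\epsilon$, and the use of an MSE cost for the FGSM step are illustrative assumptions.

```python
import numpy as np
import torch
import torch.nn as nn

rng = np.random.default_rng(0)
X_train = torch.tensor(rng.normal(size=(500, 45)).astype(np.float32))
victim_out = torch.tensor(rng.normal(size=(500, 1)).astype(np.float32))   # observed victim scores
X_test = torch.tensor(rng.normal(size=(100, 45)).astype(np.float32))
y_test = torch.tensor(rng.integers(0, 3, size=(100, 1)).astype(np.float32))

# 1) Train a small NN substitute to imitate the victim ranking model from its observed outputs.
substitute = nn.Sequential(nn.Linear(45, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(substitute.parameters(), lr=1e-3)
for _ in range(200):
    opt.zero_grad()
    nn.functional.mse_loss(substitute(X_train), victim_out).backward()
    opt.step()

# 2) FGSM on the substitute: perturb test features along the sign of the input gradient.
eps = 0.05
X_adv = X_test.clone().requires_grad_(True)
loss = nn.functional.mse_loss(substitute(X_adv), y_test)   # cost driving the perturbation
loss.backward()
X_test_adv = (X_test + eps * X_adv.grad.sign()).numpy()
```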
[Figure 4: AP@5 and NDCG@5 on the perturbed OHSUMED test data under black-box adversarial perturbations of increasing magnitude.]
Label Attack:
In practice, the vagueness of query intent, insufficient domain knowledge, and the ambiguous definition of relevance levels make it hard for human judges to assign proper relevance labels to some documents. Practically speaking, the probability of a judgment error is not equal across relevance degrees: even when human annotators misjudge a document, they are more likely to assign a label close to the ground truth. Inspired by Niu et al. (2015), we define non-uniform error probabilities in Table 2, whose entries give the probability that a document with a given ground-truth label is assigned each possible label. We randomly changed the labels of the training data using the probabilities in Table 2, trained each model on the noisy training data, and evaluated it on clean test data. We conducted two sets of experiments, with low and high label noise. Table 3 reports the performance of the algorithms on the clean test data; the values are the average of 5 folds. For the low-noise scenario, the differences between the average AP@5 and NDCG@5 of the baseline models and those of DRMRR were 2.53% and 1.59%, respectively. Notably, the gaps were even larger for the high-noise scenario (3.08% for AP@5 and 1.68% for NDCG@5). Since noise in human-labeled data is inevitable, we can argue that the baseline models are susceptible to it and degrade more severely as more noise is added to the training set.
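The label-perturbation procedure can be sketched as follows. The noise matrix in the sketch is hypothetical and only mirrors the qualitative structure of Table 2 (errors closer to the ground truth are more likely); the actual probabilities are those of Table 2.

```python
import numpy as np

def perturb_labels(labels, noise_matrix, seed=0):
    """Randomly re-label training documents: row y of `noise_matrix` gives the
    probability of assigning each label to a document whose ground truth is y."""
    rng = np.random.default_rng(seed)
    return np.array([rng.choice(len(noise_matrix), p=noise_matrix[y]) for y in labels])

# Hypothetical non-uniform error probabilities (illustrative, not the paper's Table 2).
noise_matrix = np.array([[0.90, 0.08, 0.02],
                         [0.05, 0.90, 0.05],
                         [0.02, 0.08, 0.90]])
y_train = np.random.default_rng(1).integers(0, 3, size=20)
y_noisy = perturb_labels(y_train, noise_matrix)
```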
Table 2: Non-uniform label-error probabilities. The entry in row y and column ŷ is the probability that a document with ground-truth label y is assigned label ŷ.

Ground truth \ Assigned | 0 | 1 | 2
0 | | |
1 | | |
2 | | |
Table 3: Performance on clean test data after training with high and low label noise.

Algorithm | AP@5 (High) | AP@5 (Low) | NDCG@5 (High) | NDCG@5 (Low)
DRMRR | 70.25% | 69.30% | 48.43% | 47.76%
LambdaMART-MAP | 67.97% | 67.87% | 47.16% | 45.87%
LambdaMART-NDCG | 67.82% | 66.79% | 46.89% | 46.46%
XE-MART-NDCG | 65.70% | 65.65% | 46.19% | 46.18%
5 Discussion and Conclusion
This paper went beyond conventional listwise learning-to-rank approaches and introduced a distributionally robust learning-to-rank framework with multiple outputs, referred to as DRMRR. Unlike existing methods, the scoring function in DRMRR was designed as a multivariate mapping from a feature vector to a vector of deviation scores (a.k.a. the GTD vector). The GTD vector captures local context information and cross-document interactions. Moreover, we formulated DRMRR as a min-max problem in which one minimizes a worst-case expected loss over a probabilistic ambiguity set, defined as a ball of distributions using the Wasserstein metric. Notably, we presented a compact and computationally tractable reformulation of the min-max formulation of DRMRR. We compared DRMRR with the baseline models in terms of overall performance on two real-world applications and robustness to various types and degrees of noise. In medical document retrieval, DRMRR outperformed state-of-the-art LTR models and established its capability in differentiating relevant documents from irrelevant ones. In drug response prediction, our results indicated that DRMRR leads to substantially improved performance when compared to the competing methods across all performance metrics. Thus, DRMRR can infer robust predictors of drug responses from patient genomic or proteomic profiles, which can guide the selection of highly effective personalized treatments.
In our robustness evaluations, we conducted a comprehensive analysis to assess the resilience of DRMRR against various types of noise and perturbations. Experimental results demonstrated that DRMRR is effective against: Gaussian noise; universal adversarial perturbations by an adversarial model with no knowledge of the victim model; black-box adversarial perturbations by a substitute model with access only to the deployed victim model; and probabilistic perturbation of relevance labels. Interestingly, the performance of DRMRR was consistently better than the baseline methods at all levels of noise. More importantly, DRMRR showed no significant change in its performance as the noise intensity increased. Two attributes of DRMRR helped enhance its performance and robustness: it efficiently captures the contextual information and interrelationships between documents/drugs via the GTD vector, and it attains distributional robustness by hedging against a family of plausible distributions that includes the true distribution with high confidence.
References
- Bhambri et al. (2019) Bhambri, S.; Muku, S.; Tulasi, A.; and Buduru, A. B. 2019. A survey of black-box adversarial attacks on computer vision models. arXiv preprint arXiv:1912.01667.
- Biggio et al. (2013) Biggio, B.; Corona, I.; Maiorca, D.; Nelson, B.; Šrndić, N.; Laskov, P.; Giacinto, G.; and Roli, F. 2013. Evasion attacks against machine learning at test time. In Joint European conference on machine learning and knowledge discovery in databases, 387–402. Springer.
- Blanchet et al. (2019) Blanchet, J.; Glynn, P. W.; Yan, J.; and Zhou, Z. 2019. Multivariate distributionally robust convex regression under absolute error loss. Advances in Neural Information Processing Systems, 32: 11817–11826.
- Blanchet and Murthy (2019) Blanchet, J.; and Murthy, K. 2019. Quantifying distributional model risk via optimal transport. Mathematics of Operations Research, 44(2): 565–600.
- Breiman (2001) Breiman, L. 2001. Random forests. Machine learning, 45(1): 5–32.
- Bruch (2021) Bruch, S. 2021. An alternative cross entropy loss for learning-to-rank. In Proceedings of the Web Conference 2021, 118–126.
- Burges et al. (2005) Burges, C.; Shaked, T.; Renshaw, E.; Lazier, A.; Deeds, M.; Hamilton, N.; and Hullender, G. 2005. Learning to rank using gradient descent. In Proceedings of the 22nd international conference on Machine learning, 89–96.
- Cao et al. (2007) Cao, Z.; Qin, T.; Liu, T.-Y.; Tsai, M.-F.; and Li, H. 2007. Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th international conference on Machine learning, 129–136.
- CCLE (2021) CCLE. 2021. Cancer Cell Line Encyclopedia (CCLE). https://portals.broadinstitute.org/ccle [Accessed: 21.07.2021].
- Chen and Paschalidis (2018) Chen, R.; and Paschalidis, I. C. 2018. A robust learning approach for regression models based on distributionally robust optimization. Journal of Machine Learning Research, 19(13).
- Chen and Paschalidis (2020a) Chen, R.; and Paschalidis, I. C. 2020a. Distributionally robust learning. Foundations and Trends® in Optimization, 4(1-2).
- Chen and Paschalidis (2020b) Chen, R.; and Paschalidis, I. C. 2020b. Robustified Multivariate Regression and Classification Using Distributionally Robust Optimization under the Wasserstein Metric. arXiv preprint arXiv:2006.06090.
- Chen and Guestrin (2016) Chen, T.; and Guestrin, C. 2016. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, 785–794.
- CTRP (2021) CTRP. 2021. Cancer Therapeutics Response Portal. https://portals.broadinstitute.org/ctrp.v2.1/ [Accessed: 21.07.2021].
- Finlayson et al. (2019) Finlayson, S. G.; Bowers, J. D.; Ito, J.; Zittrain, J. L.; Beam, A. L.; and Kohane, I. S. 2019. Adversarial attacks on medical machine learning. Science, 363(6433): 1287–1289.
- Freund et al. (2003) Freund, Y.; Iyer, R.; Schapire, R. E.; and Singer, Y. 2003. An efficient boosting algorithm for combining preferences. Journal of machine learning research, 4(Nov): 933–969.
- Gao, Chen, and Kleywegt (2017) Gao, R.; Chen, X.; and Kleywegt, A. J. 2017. Wasserstein distributional robustness and regularization in statistical learning. arXiv e-prints, arXiv–1712.
- Goodfellow, Shlens, and Szegedy (2014) Goodfellow, I. J.; Shlens, J.; and Szegedy, C. 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.
- Hersh et al. (1994) Hersh, W.; Buckley, C.; Leone, T.; and Hickam, D. 1994. OHSUMED: an interactive retrieval evaluation and new large test collection for research. In SIGIR’94, 192–201. Springer.
- Ke et al. (2017) Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; and Liu, T.-Y. 2017. LightGBM: A highly efficient gradient boosting decision tree. Advances in neural information processing systems, 30: 3146–3154.
- Lan et al. (2014) Lan, Y.; Zhu, Y.; Guo, J.; Niu, S.; and Cheng, X. 2014. Position-Aware ListMLE: A Sequential Learning Process for Ranking. In UAI, 449–458.
- Liu (2011) Liu, T.-Y. 2011. Learning to rank for information retrieval. Springer.
- Meng et al. (2016) Meng, Q.; Ke, G.; Wang, T.; Chen, W.; Ye, Q.; Ma, Z.-M.; and Liu, T.-Y. 2016. A communication-efficient parallel algorithm for decision tree. arXiv preprint arXiv:1611.01276.
- Metzler and Croft (2007) Metzler, D.; and Croft, W. B. 2007. Linear feature-based models for information retrieval. Information Retrieval, 10(3): 257–274.
- Metzler, Croft, and McCallum (2005) Metzler, D. A.; Croft, W. B.; and McCallum, A. 2005. Direct maximization of rank-based metrics for information retrieval. CIIR report, 429.
- Niu et al. (2015) Niu, S.; Lan, Y.; Guo, J.; Wan, S.; and Cheng, X. 2015. Which noise affects algorithm robustness for learning to rank. Information Retrieval Journal, 18(3): 215–245.
- Papangelou et al. (2018) Papangelou, K.; Sechidis, K.; Weatherall, J.; and Brown, G. 2018. Toward an understanding of adversarial examples in clinical trials. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 35–51. Springer.
- Qayyum et al. (2020) Qayyum, A.; Qadir, J.; Bilal, M.; and Al-Fuqaha, A. 2020. Secure and robust machine learning for healthcare: A survey. IEEE Reviews in Biomedical Engineering, 14: 156–180.
- Qin, Liu, and Li (2010) Qin, T.; Liu, T.-Y.; and Li, H. 2010. A general approximation framework for direct optimization of information retrieval measures. Information retrieval, 13(4): 375–397.
- Qin et al. (2010) Qin, T.; Liu, T.-Y.; Xu, J.; and Li, H. 2010. LETOR: A benchmark collection for research on learning to rank for information retrieval. Information Retrieval, 13(4): 346–374.
- Ravikumar, Tewari, and Yang (2011) Ravikumar, P.; Tewari, A.; and Yang, E. 2011. On NDCG consistency of listwise ranking methods. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 618–626. JMLR Workshop and Conference Proceedings.
- Ru et al. (2021) Ru, X.; Ye, X.; Sakurai, T.; and Zou, Q. 2021. Application of learning to rank in bioinformatics tasks. Briefings in Bioinformatics.
- Shafieezadeh-Abadeh, Esfahani, and Kuhn (2015) Shafieezadeh-Abadeh, S.; Esfahani, P. M.; and Kuhn, D. 2015. Distributionally robust logistic regression. arXiv preprint arXiv:1509.09259.
- Shafieezadeh-Abadeh, Kuhn, and Esfahani (2019) Shafieezadeh-Abadeh, S.; Kuhn, D.; and Esfahani, P. M. 2019. Regularization via mass transportation. Journal of Machine Learning Research, 20(103): 1–68.
- Sinha et al. (2017) Sinha, A.; Namkoong, H.; Volpi, R.; and Duchi, J. 2017. Certifying some distributional robustness with principled adversarial training. arXiv preprint arXiv:1710.10571.
- Sotudian et al. (2021) Sotudian, S.; Desta, I. T.; Hashemi, N.; Zarbafian, S.; Kozakov, D.; Vakili, P.; Vajda, S.; and Paschalidis, I. C. 2021. Improved cluster ranking in protein–protein docking using a regression approach. Computational and structural biotechnology journal, 19: 2269–2278.
- Sotudian and Paschalidis (2021) Sotudian, S.; and Paschalidis, I. C. 2021. Machine Learning for Pharmacogenomics and Personalized Medicine: A Ranking Model for Drug Sensitivity Prediction. IEEE/ACM Transactions on Computational Biology and Bioinformatics.
- Taylor et al. (2008) Taylor, M.; Guiver, J.; Robertson, S.; and Minka, T. 2008. Softrank: optimizing non-smooth rank metrics. In Proceedings of the 2008 International Conference on Web Search and Data Mining, 77–86.
- Valizadegan et al. (2009) Valizadegan, H.; Jin, R.; Zhang, R.; and Mao, J. 2009. Learning to Rank by Optimizing NDCG Measure. In NIPS, volume 22, 1883–1891.
- Volkovs and Zemel (2009) Volkovs, M. N.; and Zemel, R. S. 2009. Boltzrank: learning to maximize expected ranking gain. In Proceedings of the 26th Annual International Conference on Machine Learning, 1089–1096.
- Wang et al. (2018) Wang, X.; Li, C.; Golbandi, N.; Bendersky, M.; and Najork, M. 2018. The lambdaloss framework for ranking metric optimization. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, 1313–1322.
- Wang et al. (2013) Wang, Y.; Wang, L.; Li, Y.; He, D.; Chen, W.; and Liu, T.-Y. 2013. A theoretical analysis of NDCG ranking measures. In Proceedings of the 26th annual conference on learning theory (COLT 2013), volume 8, 6.
- Wiesemann, Kuhn, and Sim (2014) Wiesemann, W.; Kuhn, D.; and Sim, M. 2014. Distributionally robust convex optimization. Operations Research, 62(6): 1358–1376.
- Wu et al. (2010) Wu, Q.; Burges, C. J.; Svore, K. M.; and Gao, J. 2010. Adapting boosting for information retrieval measures. Information Retrieval, 13(3): 254–270.
- Xia et al. (2008) Xia, F.; Liu, T.-Y.; Wang, J.; Zhang, W.; and Li, H. 2008. Listwise approach to learning to rank: theory and algorithm. In Proceedings of the 25th international conference on Machine learning, 1192–1199.
- Xu and Li (2007) Xu, J.; and Li, H. 2007. Adarank: a boosting algorithm for information retrieval. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, 391–398.
- Xu et al. (2008) Xu, J.; Liu, T.-Y.; Lu, M.; Li, H.; and Ma, W.-Y. 2008. Directly optimizing evaluation measures in learning to rank. In Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval, 107–114.
- Zhang, Si, and Hsieh (2017) Zhang, H.; Si, S.; and Hsieh, C.-J. 2017. GPU-acceleration for Large-scale Tree Boosting. arXiv preprint arXiv:1706.08359.