Counterfactual Evaluation of Peer-Review Assignment Policies
Abstract
Peer review assignment algorithms aim to match research papers to suitable expert reviewers, working to maximize the quality of the resulting reviews. A key challenge in designing effective assignment policies is evaluating how changes to the assignment algorithm map to changes in review quality. In this work, we leverage recently proposed policies that introduce randomness in peer-review assignment—in order to mitigate fraud—as a valuable opportunity to evaluate counterfactual assignment policies. Specifically, we exploit how such randomized assignments provide a positive probability of observing the reviews of many assignment policies of interest. To address challenges in applying standard off-policy evaluation methods, such as violations of positivity, we introduce novel methods for partial identification based on monotonicity and Lipschitz smoothness assumptions for the mapping between reviewer-paper covariates and outcomes. We apply our methods to peer-review data from two computer science venues: the TPDP’21 workshop (95 papers and 35 reviewers) and the AAAI’22 conference (8,450 papers and 3,145 reviewers). We consider estimates of (i) the effect on review quality when changing weights in the assignment algorithm, e.g., weighting reviewers’ bids vs. textual similarity (between the reviewer’s past papers and the submission), and (ii) the “cost of randomization”, capturing the difference in expected quality between the perturbed and unperturbed optimal match. We find that placing higher weight on text similarity results in higher review quality and that introducing randomization in the reviewer-paper assignment only marginally reduces the review quality. Our methods for partial identification may be of independent interest, while our off-policy approach can likely find use in evaluating a broad class of algorithmic matching systems.
1 Introduction
The assignment of reviewers to submissions is one of the most important parts of the peer-review process [1, 2, 3]. In many peer-review applications—ranging from peer review of academic conference submissions [4, 5], to grant proposals [6, 7], to proposals for the allocation of other scientific resources [8, 9]—a set of submissions are simultaneously received and must all be assigned reviewers from an impaneled pool. However, when the number of submissions or reviewers is too large, it may not be feasible to manually assign suitable reviewers for each submission. As a result, automated systems must be used to determine the reviewer assignment.
In computer science, conferences are the primary terminal venue for scientific publications, with recent iterations of large conferences such as NeurIPS and AAAI receiving several thousand submissions [5]. The automated reviewer-assignment systems deployed by such conferences typically use three sources of information: (i) bids, i.e., reviewers’ self-reported preferences to review the papers; (ii) text similarity between the paper and the reviewer’s publications; and (iii) reviewer- and author-selected subject areas. Given a prescribed way to combine these signals into a single score, an optimization procedure then proposes a reviewer-paper assignment that maximizes the sum of the scores of the assigned pairs [10].
The design of effective peer-review systems has received considerable research attention [5, 4, 11, 12]. Popular peer-review platforms such as OpenReview and Microsoft CMT offer many features that conference organizers can use to assign reviewers, such as integration with the Toronto Paper Matching System (TPMS) [13] for computing text-similarity scores. However, it has been persistently challenging to evaluate how changes to peer-review assignment algorithms affect review quality. An implicit assumption underlying these algorithms is that review quality is an increasing function of bid enthusiasm, text similarity, and subject area match, but the choice of how to combine these signals into a single score is made heuristically. Researchers typically observe only the reviews actually assigned by the deployed algorithm and have no way of measuring the quality of reviews under an assignment generated by an alternative algorithm.
One approach to comparing different peer-review assignment policies is running randomized controlled trials or A/B tests. Several conferences (NeurIPS’14 [14, 15], WSDM’17 [16], ICML’20 [17], and NeurIPS’21 [18]) have run A/B tests to evaluate various aspects of their review process, such as differences between single- vs. double-blind review. However, such experiments are extremely costly in the peer review context, with the NeurIPS experiments requiring a significant number of additional reviews and thereby overloading already strained peer review systems. Moreover, A/B tests typically compare only a handful of design decisions, while assignment algorithms require making many such decisions (see Section 2).
Present Work.
In this work, we propose off-policy evaluation as a less costly alternative that exploits existing randomness to enable the comparison of many alternative policies. Our proposed technique “harvests” [19] the randomness introduced in peer-review assignments generated by recently-adopted techniques that counteract fraud in peer review. In recent years, in-depth investigations have uncovered evidence of rings of colluding reviewers in a few computer science conferences [20, 21]. These reviewers conspire to manipulate the paper assignment in order to give positive reviews to the papers of co-conspirators. To mitigate this kind of collusion, conference organizers have adopted various techniques, including a recently introduced randomized assignment algorithm [22]. This algorithm caps the probability of any reviewer being assigned any particular paper at a maximum value set by the organizers. This randomization thus limits the expected rewards of reviewer collusion at the cost of some reduction in the expected sum-of-similarities objective; it has been implemented in OpenReview since 2021 and used by several conferences, including the AAAI 2022 and 2023 conferences.
The key insight of the present work is that under this randomized assignment policy, a range of reviewer-paper pairs beyond the exactly optimal assignment have a positive probability of being observed. We can then adapt the tools of off-policy evaluation and importance sampling to evaluate the quality of many alternative policies. A major challenge, however, is that off-policy evaluation assumes overlap between the on-policy and the off-policy, i.e., that each reviewer-paper assignment that has a positive probability under the off-policy also has a positive probability under the on-policy. In practice, positivity violations are inevitable even when the maximum probability of assigning any reviewer-paper pair is set low enough to induce significant randomization, especially as we are interested in evaluating a wide range of design choices of the assignment policy. To address this challenge, we build on the existing literature on partial identification and propose methods that bound the off-policy estimates while making weak assumptions on how positivity violations arise.
More specifically, we propose two approaches for analysis that rely on different assumptions about the mapping between the covariates (e.g., bid, text similarity, subject area match) and the outcome (e.g., review quality) of the reviewer-paper pairs. First, we assume monotonicity of the covariate-outcome mapping. Understood intuitively, this assumption states that if a reviewer-paper pair $(r, p)$ has bid, text similarity, and subject area match that are all at least as high as those of another pair $(r', p')$, then the quality of the review for $(r, p)$ is at least as high as that for $(r', p')$. Alternatively, we assume Lipschitz smoothness of the covariate-outcome mapping. Intuitively, this assumption captures the idea that two reviewer-paper pairs with similar bids, text similarity, and subject area match should result in similar review quality. We find that this Lipschitz assumption naturally generalizes so-called Manski bounds [23], the partial identification strategy that assumes only bounded outcomes.
We apply our methods to data collected by two computer science venues that used the recently-introduced randomized assignment strategy: the 2021 Workshop on Theory and Practice of Differential Privacy (TPDP) with 95 papers and 35 reviewers, and the 2022 AAAI Conference on Artificial Intelligence (AAAI) with 8,450 papers and 3,145 reviewers. TPDP is an annual workshop co-located with the machine learning conference ICML, and AAAI is one of the largest annual artificial intelligence conferences. We evaluate two design choices: (i) how varying the weights of the bids vs. text similarity vs. subject area match (the last available only in AAAI) affects the overall quality of the reviews, and (ii) the “cost of randomization”, i.e., how much the review quality decreased as a result of introducing randomness in the assignment. As our measure of assignment quality, we consider the expertise and confidence reported by the reviewers for their assigned papers. We find that our proposed methods for partial identification assuming monotonicity and Lipschitz smoothness significantly tighten the bounds on the estimated off-policy review quality, leading to more informative results. Substantively, we find that placing a larger weight on text similarity results in higher review quality, and that introducing randomization in the assignment leads to a very small reduction in review quality.
Beyond our contributions to the design and study of peer review systems, the methods proposed in this work should also apply to other matching systems such as recommendation systems [24, 25, 26], advertising [27], and ride-sharing assignment systems [28]. Further, our contributions to off-policy evaluation under partial identification should be of independent interest.
We release our replication code at: https://github.com/msaveski/counterfactual-peer-review.
2 Preliminaries
We start by reviewing the fundamentals of peer-review assignment algorithms.
Reviewer-Paper Similarity.
Consider a peer-review scenario with a set of reviewers $\mathcal{R}$ and a set of papers $\mathcal{P}$. Standard assignment algorithms for large-scale peer review rely on “similarity scores” $s_{r,p}$ for every reviewer-paper pair $(r, p)$, representing the assumed quality of a review by reviewer $r$ for paper $p$. These scores, typically non-negative real values, are commonly computed from a combination of up to three sources of information:
- text similarity between each paper and each reviewer’s past publications;
- overlap between the subject areas selected by each reviewer and each paper’s authors; and
- reviewer-provided “bids” on each paper.
Without any principled methodology for evaluating the choice of similarity score, conference organizers manually select a parametric functional form and choose parameter values by spot-checking a few reviewer-paper assignments. For example, a simple similarity function is a convex combination of the component scores. Conferences have also used more complex non-linear functional forms, as in NeurIPS’16 [34] and AAAI’21 [4]. Beyond the choice of how to combine the component scores, numerous other aspects of the similarity computation also involve choices: the language-processing techniques used to compute text-similarity scores, the input given to them, the range and interpretation of bid options shown to reviewers, etc. The range of possible functional forms results in a wide design space, which we explore in this work.
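As a purely illustrative sketch (our own code, not any venue’s production formula; the weight values are hypothetical), a convex combination of component scores can be computed as follows:

```python
def combine_scores(text, subj, bid, w_text=0.5, w_subj=0.3, w_bid=0.2):
    """Convex combination of component scores for one reviewer-paper pair.

    All component scores are assumed to be pre-scaled to [0, 1],
    and the (hypothetical) weights must sum to one.
    """
    assert abs(w_text + w_subj + w_bid - 1.0) < 1e-9
    return w_text * text + w_subj * subj + w_bid * bid

# Example: high text similarity, moderate subject overlap, neutral bid.
print(combine_scores(text=0.9, subj=0.6, bid=0.5))
```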
Deterministic Assignment.
Let $Z$ be an assignment matrix whose entry $Z_{r,p} \in \{0, 1\}$ denotes whether reviewer-paper pair $(r, p)$ is assigned. Given a matrix of reviewer-paper similarity scores $S = [s_{r,p}]$, a standard objective is to find an assignment of reviewers to papers that maximizes the sum of similarities of the assigned pairs, subject to constraints that each paper is assigned an appropriate number of reviewers, each reviewer is assigned no more than a maximum number of papers, and conflicts of interest are respected [13, 35, 36, 37, 38, 39, 10]. This optimization problem can be formulated as a linear program; we provide a detailed formulation in Appendix A. While other objective functions have been proposed [40, 41, 42], here we focus on the sum-of-similarities objective.
Randomized Assignment.
As one approach to strategyproofness, Jecmen et al. [22] introduce the idea of using randomization to prevent colluding reviewers and authors from being able to guarantee their assignments. Specifically, the program chairs first choose a parameter $q \in (0, 1]$. Then, the algorithm computes a randomized paper assignment in which the marginal probability of assigning any reviewer-paper pair is at most $q$. These marginal probabilities are determined by a linear program, which maximizes the expected similarity of the assignment subject to the probability limit (detailed formulation in Appendix A). A reviewer-paper assignment is then sampled using a randomized procedure that iteratively redistributes the probability mass placed on each reviewer-paper pair until all probabilities are either zero or one.
Review Quality.
The above assignments are chosen based on maximizing the (expected) similarities of assigned reviewer-paper pairs, but those similarities may not be accurate proxies for the quality of review that the reviewer can provide for that paper. In practice, automated similarity-based assignments result in numerous complaints of low-expertise paper assignments from both authors and reviewers [3], and recent work [43] finds that current text-similarity algorithms make significant errors in predicting reviewer expertise. Meanwhile, self-reported assessments of reviewer-paper assignment quality can be collected from the reviewers themselves after the review. Conferences often ask reviewers to score their expertise in the paper’s topic and/or confidence in their review [34, 4, 44]. Other indicators of review quality can also be considered; e.g., some conferences ask “meta-reviewers” or other reviewers to evaluate the quality of written reviews directly [45, 44]. In this work, we consider self-reported expertise and confidence as our measures of review quality.
3 Off-Policy Evaluation
One attractive property of the randomized assignment described above is that while only one reviewer-paper assignment is sampled and deployed, many other assignments could have been sampled, and those assignments could equally well have been generated by some alternative assignment policy. The positive probability of other assignments allows us to investigate whether alternative assignment policies might have resulted in higher-quality reviews.
Let $A$ be a randomized assignment policy with probability density $p_A(Z)$, where $Z$ is an assignment matrix with entries $Z_{r,p} \in \{0, 1\}$ indicating whether reviewer $r$ is assigned paper $p$, and $p_A(Z) > 0$ only for feasible assignments $Z$. Let $B$ be another policy with density $p_B(Z)$, defined similarly. We denote by $\pi^A_{r,p}$ and $\pi^B_{r,p}$ the marginal probabilities of assigning reviewer-paper pair $(r, p)$ under $A$ and $B$, respectively. Finally, let $y_{r,p}$, where $y_{r,p} \in [y_{\min}, y_{\max}]$, be the measure of the quality of reviewer $r$’s review of paper $p$, e.g., the reviewer’s self-reported expertise or confidence as introduced in Section 2.
We follow the potential outcomes framework of causal inference [46]. Throughout this work, we let $A$ be the on-policy or logging policy, i.e., the policy under which the review data was collected, while $B$ denotes one of several alternative policies of interest. In Section 6, we describe the specific alternative policies we consider in this work. Define $n$ as the total number of reviews, fixed across policies and set ahead of time. We are interested in the following estimands:
$$\mu_A = \frac{1}{n} \sum_{r \in \mathcal{R}} \sum_{p \in \mathcal{P}} \pi^A_{r,p}\, y_{r,p}, \qquad \mu_B = \frac{1}{n} \sum_{r \in \mathcal{R}} \sum_{p \in \mathcal{P}} \pi^B_{r,p}\, y_{r,p},$$
where $\mu_A$ and $\mu_B$ are the expected review quality under policies $A$ and $B$, respectively.
In practice, we do not have access to all $y_{r,p}$, but only to those of the assigned pairs. Let $Z^A$ be the assignment sampled under the on-policy $A$, drawn from $p_A$. We define the following Horvitz-Thompson estimators of the means:
$$\hat{\mu}_A = \frac{1}{n} \sum_{r \in \mathcal{R}} \sum_{p \in \mathcal{P}} Z^A_{r,p}\, y_{r,p}, \qquad \hat{\mu}_B = \frac{1}{n} \sum_{r \in \mathcal{R}} \sum_{p \in \mathcal{P}} Z^A_{r,p}\, \frac{\pi^B_{r,p}}{\pi^A_{r,p}}\, y_{r,p}. \tag{1}$$
For now, suppose that $B$ has positive probability only where $A$ does (also known as satisfying “positivity”): $\pi^A_{r,p} > 0$ for all $(r, p)$ where $\pi^B_{r,p} > 0$. Then, all weights $\pi^B_{r,p} / \pi^A_{r,p}$ where $\pi^B_{r,p} > 0$ are bounded. As we will see, many policies of interest go beyond the support of $A$.
Under the positivity assumption, $\hat{\mu}_A$ and $\hat{\mu}_B$ are unbiased estimators of $\mu_A$ and $\mu_B$, respectively [47]. Moreover, the Horvitz-Thompson estimator is admissible in the class of all unbiased estimators [48]. Note that $\hat{\mu}_A$ is simply the empirical mean of the outcomes observed under the assignment sampled on-policy, while $\hat{\mu}_B$ is a weighted mean of those outcomes based on inverse probability weighting: it places weights greater than one on reviewer-paper pairs that are more likely off-policy than on-policy, and weights less than or equal to one otherwise. These estimators also rely on a standard causal inference assumption of no interference. In Appendix B, we discuss the implications of this assumption in the peer review context.
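As a concrete illustration of estimator (1), the following minimal NumPy sketch (our own illustration, not the released replication code) computes both estimates, assuming positivity, fully observed outcomes for assigned pairs, and access to the marginal assignment probabilities of both policies:

```python
import numpy as np

def ht_estimates(Z_A, pi_A, pi_B, y, n):
    """Horvitz-Thompson on-policy and off-policy estimates, as in estimator (1).

    Z_A  : 0/1 assignment matrix sampled under the on-policy A
    pi_A : marginal assignment probabilities under A
    pi_B : marginal assignment probabilities under B
    y    : review-quality outcomes (only entries with Z_A == 1 are used)
    n    : total number of reviews (fixed across policies)
    """
    # Positivity: every pair B can assign must also be assignable under A.
    assert np.all(pi_A[pi_B > 0] > 0)
    assigned = Z_A == 1
    mu_A_hat = y[assigned].sum() / n
    # Inverse-probability weights re-weight observed pairs toward policy B.
    w = pi_B[assigned] / pi_A[assigned]
    mu_B_hat = (w * y[assigned]).sum() / n
    return mu_A_hat, mu_B_hat
```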
Challenges.
In off-policy evaluation, we are interested in evaluating a policy $B$ based on data collected under policy $A$. However, our ability to do so is typically limited to policies for which the assignments that would be made under $B$ are possible under $A$. In practice, many interesting policies step outside of the support of $A$. Outcomes for reviewer-paper pairs outside the support of $A$ but with positive probability under $B$ (“positivity violations”) cannot be estimated, and must either be imputed by some model or have their contribution to the average outcome $\mu_B$ bounded.
In addition to positivity violations, we identify three other mechanisms through which missing data with potential confounding may arise in the peer review context: absent reviewers, selective attrition, and manual reassignments. For absent reviewers, i.e., reviewers who have not submitted any reviews, we do not have a reason to believe that the reviews are missing due to the quality of the reviewer-paper assignment. Hence, we assume that their reviews are missing at random, and impute them with the weighted mean outcome of the observed reviews. For selective attrition, i.e., when some but not all reviews are completed, we instead employ conservative bounding techniques as for policy-based positivity violations. Finally, reviews might be missing due to manual reassignments by the program chairs, after the assignment has been sampled. As a result, the originally assigned reviews will be missing and new reviews will be added. In such cases, we treat removed assignments as attrition (i.e., bounding their contribution) and ignore the newly introduced assignments as they did not arise from any determinable process.
Concretely, we partition the reviewer-paper pairs into the following (mutually exclusive and exhaustive) sets:
- $\mathcal{V}$: positivity violations, i.e., pairs with $\pi^B_{r,p} > 0$ but $\pi^A_{r,p} = 0$;
- $\mathcal{M}_{\mathrm{abs}}$: missing reviews where the reviewer was absent (submitted no reviews);
- $\mathcal{M}_{\mathrm{attr}}$: remaining missing reviews; and
- $\mathcal{O}$: remaining pairs without positivity violations or missing reviews.
In the next section, we present methods for imputing or bounding the contribution of $\mathcal{V}$ to the estimate of $\mu_B$, and of $\mathcal{M}_{\mathrm{abs}}$ and $\mathcal{M}_{\mathrm{attr}}$ to the estimates of $\mu_A$ and $\mu_B$.
4 Imputation and Partial Identification
In the previous section, we defined three sets of reviewer-paper pairs for which outcomes must be imputed rather than estimated: $\mathcal{V}$, $\mathcal{M}_{\mathrm{abs}}$, and $\mathcal{M}_{\mathrm{attr}}$. In this section, we describe varied methods for imputing these outcomes that rely on assumptions of different strengths, including methods that output point estimates (Sections 4.1 and 4.2) and methods that output lower and upper bounds on $\mu_B$ (Sections 4.3 and 4.4). In Section 6, we apply these methods to peer-review data from two computer science venues.
For missing reviews where the reviewer is absent ($\mathcal{M}_{\mathrm{abs}}$), we assume that the reviewer did not participate in the review process for reasons unrelated to the assignment quality (e.g., being too busy). Specifically, we assume that these reviews are missing at random and thus impute the weighted mean outcome among $\mathcal{O}$, the pairs with no positivity violations or missing reviews, which we denote by $\bar{y}_{\mathcal{O}}$. Correspondingly, we set $y_{r,p} = \bar{y}_{\mathcal{O}}$ for all $(r, p) \in \mathcal{M}_{\mathrm{abs}}$ in estimator (1).
In contrast, for positivity violations ($\mathcal{V}$) and the remaining missing reviews ($\mathcal{M}_{\mathrm{attr}}$), we allow for the possibility that these reviewer-paper pairs being unobserved is correlated with their unobserved outcomes. Thus, we consider imputing arbitrary values for the outcomes in these subsets, which we denote by $\hat{y}_{r,p}$ and place into a matrix $\hat{Y}$, leaving the remaining entries undefined. This strategy corresponds to setting $y_{r,p} = \hat{y}_{r,p}$ for $(r, p) \in \mathcal{V} \cup \mathcal{M}_{\mathrm{attr}}$ in estimator (1). To obtain bounds, we impute both the assumed minimal and maximal values of the outcomes.
These modifications result in a Horvitz-Thompson off-policy estimator with imputation. To denote this, we redefine $\hat{\mu}_B$ to be a function of the imputed values, where $\hat{\mu}_B(\hat{Y})$ denotes the estimator resulting from imputing the entries of a particular choice of $\hat{Y}$:
$$\hat{\mu}_B(\hat{Y}) = \frac{1}{n} \Bigg( \sum_{(r,p) \in \mathcal{O}} Z^A_{r,p}\, \frac{\pi^B_{r,p}}{\pi^A_{r,p}}\, y_{r,p} \;+\; \sum_{(r,p) \in \mathcal{M}_{\mathrm{abs}}} Z^A_{r,p}\, \frac{\pi^B_{r,p}}{\pi^A_{r,p}}\, \bar{y}_{\mathcal{O}} \;+\; \sum_{(r,p) \in \mathcal{M}_{\mathrm{attr}}} Z^A_{r,p}\, \frac{\pi^B_{r,p}}{\pi^A_{r,p}}\, \hat{y}_{r,p} \;+\; \sum_{(r,p) \in \mathcal{V}} \pi^B_{r,p}\, \hat{y}_{r,p} \Bigg).$$
The estimator computes the weighted mean of the observed outcomes ($\mathcal{O}$) and the imputed outcomes ($\mathcal{M}_{\mathrm{abs}}$, $\mathcal{M}_{\mathrm{attr}}$, and $\mathcal{V}$). We impute $\hat{y}_{r,p}$ for the attrition ($\mathcal{M}_{\mathrm{attr}}$) and positivity violation ($\mathcal{V}$) pairs, and $\bar{y}_{\mathcal{O}}$ for the absent reviewers ($\mathcal{M}_{\mathrm{abs}}$). Note that we weight the imputed positivity violations ($\mathcal{V}$) by $\pi^B_{r,p}$ rather than $Z^A_{r,p}\, \pi^B_{r,p} / \pi^A_{r,p}$, since the latter is undefined. Under the assumption that the imputed outcomes are accurate, $\hat{\mu}_B(\hat{Y})$ is an unbiased estimator of $\mu_B$.
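For concreteness, a minimal NumPy sketch of this imputation-extended estimator (our own illustration; the boolean masks encoding the sets $\mathcal{O}$, $\mathcal{M}_{\mathrm{abs}}$, $\mathcal{M}_{\mathrm{attr}}$, and $\mathcal{V}$, and the variable names, are assumptions of this sketch):

```python
import numpy as np

def ht_estimate_imputed(Z_A, pi_A, pi_B, y, y_hat, y_bar_obs,
                        obs, absent, attrition, violations, n):
    """Horvitz-Thompson off-policy estimate with imputation.

    obs, absent, attrition, violations : boolean masks partitioning the pairs
    y_hat     : imputed outcomes for attrition and positivity-violation pairs
    y_bar_obs : weighted mean outcome among the observed pairs
    """
    with np.errstate(divide="ignore", invalid="ignore"):
        w = np.where(pi_A > 0, pi_B / pi_A, 0.0)      # importance weights
    total = 0.0
    total += (Z_A * w * y)[obs].sum()                 # observed outcomes
    total += (Z_A * w)[absent].sum() * y_bar_obs      # absent reviewers: mean imputation
    total += (Z_A * w * y_hat)[attrition].sum()       # attrition: imputed values
    total += (pi_B * y_hat)[violations].sum()         # positivity violations: weighted by pi_B
    return total / n
```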
To construct confidence intervals, we estimate the variance of $\hat{\mu}_B(\hat{Y})$, which depends on the covariances $\mathrm{Cov}(Z^A_{r,p}, Z^A_{r',p'})$ between the assignment indicators of pairs of reviewer-paper pairs. These covariance terms (taken over $p_A$) are not known exactly, owing to the fact that the procedure of Jecmen et al. [22] constrains only the marginal probabilities of individual reviewer-paper pairs, while pairs of pairs can be non-trivially correlated. In the absence of a closed-form expression, we use Monte Carlo methods to tightly estimate these covariances (further details are provided in Appendix C).
In the following subsections, we detail several methods by which we choose the imputed values $\hat{Y}$. These methods rely on assumptions of varying strength about the unobserved outcomes.
4.1 Mean Imputation
As a first approach, we assume that the mean outcome within $\mathcal{O}$ is representative of the mean outcome among the other pairs. This is a strong assumption, since the presence of a pair in $\mathcal{V}$ or $\mathcal{M}_{\mathrm{attr}}$ may not be independent of its outcome. For example, if reviewers choose not to submit reviews when the assignment quality is poor, $\bar{y}_{\mathcal{O}}$ is not representative of the outcomes in $\mathcal{M}_{\mathrm{attr}}$. Nonetheless, under this strong assumption, we can simply impute the mean outcome for all pairs necessitating imputation: setting $\hat{y}_{r,p} = \bar{y}_{\mathcal{O}}$ for all $(r, p) \in \mathcal{V} \cup \mathcal{M}_{\mathrm{attr}}$, we obtain a point estimate of $\mu_B$. While this follows from an overly strong assumption, we find it useful to compare our findings under this assumption to findings under the subsequent weaker assumptions.
4.2 Model Imputation
Instead of simply imputing the mean outcome, we can assume that the unobserved outcomes are some simple function of known covariates $x_{r,p} \in \mathbb{R}^d$ (where $d$ is the number of covariates) for each reviewer-paper pair $(r, p)$. If so, we can directly estimate this function using a variety of statistical models, resulting in a point estimate of $\mu_B$. In doing so, we implicitly take on the assumptions made by each model, which determine how to generalize the covariate-outcome mapping from the observed pairs to the unobserved pairs. These assumptions are typically quite strong, since this mapping may be very different between the observed pairs (typically good matches) and the unobserved pairs (typically less good matches).
More specifically, given the set of all observed reviewer-paper pairs $\mathcal{O}$, we train a model on the observed data $\{(x_{r,p},\, y_{r,p}) : (r, p) \in \mathcal{O}\}$. Let $\hat{y}_{r,p}$ denote the outcome predicted by that model for each pair requiring imputation. We then consider $\hat{\mu}_B(\hat{Y})$ as a point estimate of $\mu_B$. In our experiments, we employ standard methods for classification, ordinal regression, and collaborative filtering:
- Logistic regression (clf-logistic);
- Ridge classification (clf-ridge);
- Ordered logit (ord-logit);
- Ordered probit (ord-probit);
- SVD++, collaborative filtering (cf-svd++);
- K-nearest-neighbors, collaborative filtering (cf-knn).
Note that, unlike the other methods, the collaborative filtering methods model the missing data using only the observed reviewer-paper outcomes, without the covariates. We discuss our choice of methods, hyperparameters, and implementation details in Appendix E.
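As an illustration of the general model-imputation approach (our own sketch using scikit-learn, which is one possible implementation and not necessarily the one used in our experiments; the expected-outcome step is an illustrative choice):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def impute_with_model(X_obs, y_obs, X_missing):
    """Fit a classifier on observed (covariate, outcome) pairs and impute
    expected outcomes for the unobserved pairs.

    X_obs     : (n_obs, d) covariates of observed reviewer-paper pairs
    y_obs     : (n_obs,) observed ordinal outcomes (e.g., expertise scores)
    X_missing : (n_miss, d) covariates of pairs requiring imputation
    """
    model = LogisticRegression(max_iter=1000)
    model.fit(X_obs, y_obs)
    proba = model.predict_proba(X_missing)
    # Impute the expected outcome under the predicted class distribution.
    return proba @ model.classes_
```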
4.3 Manski Bounds
As a more conservative approach, we can exploit the fact that the outcomes are bounded, letting us bound the mean under the counterfactual policy without making any assumptions on how the positivity violations arise. Such bounds are often called Manski bounds [23] in the econometrics literature on partial identification. To employ Manski bounds, we assume that all outcomes take values in a known range $[y_{\min}, y_{\max}]$; e.g., self-reported expertise and confidence scores are limited to a pre-specified range on the review questionnaire. Then, setting $\hat{y}_{r,p} = y_{\max}$ or $\hat{y}_{r,p} = y_{\min}$ for all $(r, p) \in \mathcal{V} \cup \mathcal{M}_{\mathrm{attr}}$, we can estimate the upper and lower bounds of $\mu_B$ as $\hat{\mu}_B(\hat{Y}^{\max})$ and $\hat{\mu}_B(\hat{Y}^{\min})$, respectively.
We adopt a well-established inference procedure for constructing 95% confidence intervals that asymptotically contain the true value of $\mu_B$ with probability at least 95%. Following Imbens and Manski [49], we construct the interval:
$$\Big[\; \hat{\mu}_B(\hat{Y}^{\min}) - C\, \widehat{\mathrm{se}}\big(\hat{\mu}_B(\hat{Y}^{\min})\big),\;\; \hat{\mu}_B(\hat{Y}^{\max}) + C\, \widehat{\mathrm{se}}\big(\hat{\mu}_B(\hat{Y}^{\max})\big) \;\Big],$$
where the $z$-score analog $C$ is set by their procedure such that the interval asymptotically has at least 95% coverage under plausible regularity conditions; for further details, see the discussion in Appendix D.
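The constant $C$ can be computed numerically; the sketch below (our own, following the construction in Imbens and Manski [49] under our notation) solves for $C$ given the bound estimates and their standard errors:

```python
from scipy import optimize, stats

def imbens_manski_interval(lb, ub, se_lb, se_ub, alpha=0.05):
    """Confidence interval for a partially identified parameter
    (Imbens and Manski, 2004).

    lb, ub       : estimates of the lower and upper bounds of the identified set
    se_lb, se_ub : standard errors of those bound estimates
    """
    se_max = max(se_lb, se_ub)

    def coverage_gap(c):
        # C solves Phi(C + (ub - lb) / se_max) - Phi(-C) = 1 - alpha.
        return stats.norm.cdf(c + (ub - lb) / se_max) - stats.norm.cdf(-c) - (1 - alpha)

    c = optimize.brentq(coverage_gap, 0.0, 10.0)
    return lb - c * se_lb, ub + c * se_ub
```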
4.4 Monotonicity and Lipschitz Smoothness
We now propose two styles of weak assumptions on the covariate-outcome mapping that can be leveraged to achieve tighter bounds on $\mu_B$ than the Manski bounds. In contrast to the strong modeling assumptions used for mean and model imputation, these assumptions can be more intuitively understood and justified as conservative assumptions given particular choices of covariates.
Monotonicity.
The first weak assumption we consider is a monotonicity condition. Intuitively, monotonicity captures the idea that we expect higher expertise for a reviewer-paper pair when some covariates are higher, all else equal. For example, in our experiments we use the similarity component scores (bids, text similarity, subject area match) as covariates. Specifically, for covariate vectors $x_{r,p}$ and $x_{r',p'}$, define the dominance relationship $x_{r,p} \succ x_{r',p'}$ to mean that $x_{r,p}$ is greater than or equal to $x_{r',p'}$ in all components and strictly greater in at least one component. The monotonicity assumption then states that if $x_{r,p} \succ x_{r',p'}$, then $y_{r,p} \ge y_{r',p'}$.
Using this assumption to restrict the range of possible values for the unobserved outcomes, we seek upper and lower bounds on $\mu_B$. Recall that $\mathcal{O}$ is the set of all observed reviewer-paper pairs. One challenge is that the observed outcomes themselves ($y_{r,p}$ for $(r, p) \in \mathcal{O}$) may violate the monotonicity condition. Thus, to find an upper or lower bound, we compute surrogate values $\tilde{y}_{r,p}$ that satisfy the monotonicity constraint across all relevant pairs while ensuring that the surrogate values for $\mathcal{O}$ remain as close as possible to the observed outcomes $y_{r,p}$. The surrogate values for $\mathcal{V} \cup \mathcal{M}_{\mathrm{attr}}$ can then be imputed as outcomes.
Inspired by isotonic regression [50], we formulate a two-level optimization problem. The primary objective minimizes the distance between $\tilde{y}_{r,p}$ and $y_{r,p}$ for the pairs with observed outcomes, $(r, p) \in \mathcal{O}$. The secondary objective either minimizes (for a lower bound) or maximizes (for an upper bound) the sum of $\tilde{y}_{r,p}$ over the unobserved pairs $(r, p) \in \mathcal{V} \cup \mathcal{M}_{\mathrm{attr}}$, weighted as in $\hat{\mu}_B$. Define the universe of relevant pairs $\mathcal{U} = \mathcal{O} \cup \mathcal{V} \cup \mathcal{M}_{\mathrm{attr}}$ and let $M$ be a very large constant. This results in the following pair of optimization problems, which compute matrices $\tilde{Y}$ (leaving the remaining entries undefined):
$$\begin{aligned}
\max_{\tilde{Y}} \quad & -M \sum_{(r,p) \in \mathcal{O}} \big|\tilde{y}_{r,p} - y_{r,p}\big| \;\pm \sum_{(r,p) \in \mathcal{V} \cup \mathcal{M}_{\mathrm{attr}}} w_{r,p}\, \tilde{y}_{r,p} \\
\text{s.t.} \quad & \tilde{y}_{r,p} \ge \tilde{y}_{r',p'} \quad \text{for all } (r,p), (r',p') \in \mathcal{U} \text{ with } x_{r,p} \succ x_{r',p'}, \\
& y_{\min} \le \tilde{y}_{r,p} \le y_{\max} \quad \text{for all } (r,p) \in \mathcal{U},
\end{aligned}$$
where $w_{r,p}$ denotes the weight that pair $(r, p)$ receives in $\hat{\mu}_B(\cdot)$. The sign of the second objective term depends on whether a lower (negative) or upper (positive) bound is being computed. The last set of constraints corresponds to the same constraints used to construct the Manski bounds described earlier, which are combined here with monotonicity to jointly constrain the possible outcomes. The above problem can be reformulated and solved as a linear program using standard techniques.
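A compact sketch of this computation using CVXPY (our own formulation of the optimization above; the encoding of the sets, the weights, and the value of $M$ are illustrative, and the quadratic enumeration of dominance pairs is not tuned for conference-scale instances):

```python
import numpy as np
import cvxpy as cp

def monotonicity_bound(x, y, obs, impute, weights, y_min, y_max,
                       upper=True, M=1e4):
    """Surrogate outcomes under the monotonicity constraint, for one bound.

    x       : (n_pairs, d) covariates of all relevant pairs (observed + imputed)
    y       : (n_pairs,) outcomes; only entries with obs == True are used
    obs     : boolean mask of pairs with observed outcomes
    impute  : boolean mask of pairs whose outcomes must be imputed
    weights : (n_pairs,) weight each pair receives in the off-policy estimator
    """
    n = len(y)
    y_tilde = cp.Variable(n)

    # Monotonicity: covariate dominance implies no lower surrogate outcome.
    constraints = [y_tilde >= y_min, y_tilde <= y_max]
    for i in range(n):
        for j in range(n):
            if i != j and np.all(x[i] >= x[j]) and np.any(x[i] > x[j]):
                constraints.append(y_tilde[i] >= y_tilde[j])

    obs_idx = np.flatnonzero(obs)
    imp_idx = np.flatnonzero(impute)
    fit = cp.sum(cp.abs(y_tilde[obs_idx] - y[obs_idx]))   # stay close to observed outcomes
    bound = cp.sum(cp.multiply(weights[imp_idx], y_tilde[imp_idx]))
    sign = 1.0 if upper else -1.0
    cp.Problem(cp.Maximize(-M * fit + sign * bound), constraints).solve()
    return y_tilde.value
```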
Lipschitz Smoothness.
The second weak assumption we consider is Lipschitz smoothness of the correspondence between covariates and outcomes. Intuitively, this captures the idea that we expect two reviewer-paper pairs that are very close in covariate space to have similar expertise. For covariate vectors $x_{r,p}$ and $x_{r',p'}$, let $d(x_{r,p}, x_{r',p'})$ denote some notion of distance between the covariates. The Lipschitz assumption then states that there exists a constant $L$ such that $|y_{r,p} - y_{r',p'}| \le L\, d(x_{r,p}, x_{r',p'})$ for all pairs. In practice, we can choose an appropriate value of $L$ by studying the many pairs of observed outcomes in the data (Section 5.2 and Appendix G), though this approach assumes that the Lipschitz smoothness of the covariate-outcome function is the same for observed and unobserved pairs.
As in the previous section, we introduce surrogate values $\tilde{y}_{r,p}$ and solve a two-level optimization problem to address Lipschitz violations within the observed outcomes (i.e., cases where two observed pairs are very close in covariate space but have different outcomes). Defining $\mathcal{U}$, $M$, and $w_{r,p}$ as above, this results in the following pair of optimization problems, which compute matrices $\tilde{Y}$ (leaving the remaining entries undefined):
$$\begin{aligned}
\max_{\tilde{Y}} \quad & -M \sum_{(r,p) \in \mathcal{O}} \big|\tilde{y}_{r,p} - y_{r,p}\big| \;\pm \sum_{(r,p) \in \mathcal{V} \cup \mathcal{M}_{\mathrm{attr}}} w_{r,p}\, \tilde{y}_{r,p} \\
\text{s.t.} \quad & \big|\tilde{y}_{r,p} - \tilde{y}_{r',p'}\big| \le L\, d(x_{r,p}, x_{r',p'}) \quad \text{for all } (r,p), (r',p') \in \mathcal{U}, \\
& y_{\min} \le \tilde{y}_{r,p} \le y_{\max} \quad \text{for all } (r,p) \in \mathcal{U}.
\end{aligned}$$
As before, the sign of the second objective term depends on whether a lower (negative) or upper (positive) bound is being computed. The last set of constraints is again the same used to construct the Manski bounds, here combined with the Lipschitz constraints to jointly restrict the possible outcomes. In the limit as $L \to \infty$, the Lipschitz constraints become vacuous and we recover the Manski bounds. This problem can again be reformulated and solved as a linear program using standard techniques.
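The Lipschitz bounds fit the same template: only the constraint set changes. A sketch of the constraint construction (our own code; `dist` stands for whatever covariate distance is chosen):

```python
import cvxpy as cp

def lipschitz_constraints(y_tilde, x, L, dist):
    """Pairwise constraints |y_i - y_j| <= L * dist(x_i, x_j) over all relevant pairs."""
    n = x.shape[0]
    cons = []
    for i in range(n):
        for j in range(i + 1, n):
            cons.append(cp.abs(y_tilde[i] - y_tilde[j]) <= L * dist(x[i], x[j]))
    return cons

# These constraints replace the dominance constraints in monotonicity_bound above;
# the objective and the bound constraints on y_tilde stay the same.
```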
5 Experimental Setup
We apply our framework to data from two venues that used randomized paper assignments as described in Section 2: the 2021 Workshop on Theory and Practice of Differential Privacy (TPDP) and the 2022 AAAI Conference on Artificial Intelligence (AAAI). In both settings, we aim to understand the effect that changing parameters of the assignment policies would have on review quality. The analyses were approved by our institutions’ IRBs.
5.1 Datasets
TPDP.
The TPDP workshop received 95 submissions and had a pool of 35 reviewers. Each paper received exactly 3 reviews, and each reviewer was assigned 8 or 9 papers, for a total of 285 assigned reviewer-paper pairs. The reviewers were asked to bid on the papers and could place one of the following bids: “very low”, “low”, “neutral”, “high”, or “very high”, with “neutral” as the default; each bid was mapped to a numeric bid score. The similarity for each reviewer-paper pair was defined as a weighted combination of the bid score and the text-similarity score, with a fixed interpolation weight. The randomized assignment was run with an upper bound $q$ on the marginal probability of each reviewer-paper pair. In their review, the reviewers were asked to assess the alignment between the paper and their expertise (from 1: irrelevant to 4: very relevant) and to report their review confidence (from 1: educated guess to 5: absolutely certain). We consider these two responses as our measures of quality. Once the assignment was generated, the organizers manually changed three reviewer-paper assignments, which we handle using the techniques discussed in Section 4.
AAAI.
In the AAAI conference, submissions were assigned to reviewers in multiple sequential stages across two rounds of submissions. We examine the stage of the first round in which the randomized assignment algorithm was used to assign all submissions to a pool of “senior reviewers.” The assignment involved 8,450 papers and 3,145 reviewers; each paper was assigned to one reviewer, and each reviewer was assigned at most 3 or 4 papers based on their primary subject area. The similarity for every reviewer-paper pair was based on three scores: the text-similarity score, the subject-area score, and the bid. Bids were chosen from the following list: “not willing”, “not entered” (the default), “in a pinch”, “willing”, or “eager”; each option was mapped to a numeric bid score, with the impact of positive bids relative to neutral/negative bids scaled by a parameter. Similarities were computed by combining the three scores, with the text-similarity and subject-area scores interpolated by a weight parameter. The actual similarities differed from this base similarity formula in a few special cases (e.g., missing data); we provide the full description of the similarity computation in Appendix F. The randomized assignment was run with an upper bound $q$ on the marginal assignment probabilities. Reviewers reported an expertise score (from 0: not knowledgeable to 5: expert) and a confidence score (from 0: not confident to 4: very confident), which we consider as our quality measures. After reviewers were assigned, several assignments were manually changed by the conference organizers, while several assigned reviews were also simply not submitted; we handle these cases as described in Section 4.
5.2 Assumption Suitability
For both the monotonicity and Lipschitz assumptions (as well as for the model imputations), we work with covariates $x_{r,p}$ given by the vector of the two (TPDP) or three (AAAI) component scores used in the computation of the similarities. We now consider whether these assumptions are reasonable with respect to our choices of outcome variables and covariates.
Monotonicity.
Monotonicity assumes that when any component of the covariates increases, the review quality does not decrease. We can test this assumption on the observed outcomes: among all pairs of reviewer-paper pairs with both outcomes observed, 65.7% (TPDP) / 28.0% (AAAI) exhibit a dominance relationship ($x_{r,p} \succ x_{r',p'}$), and of those pairs, 79.8% (TPDP) / 76.4% (AAAI) satisfy the monotonicity condition when using expertise as the outcome and 76% (TPDP) / 78.9% (AAAI) when using confidence. The fraction of dominance-ordered pairs is higher for TPDP since we consider only two covariates.
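These fractions can be computed directly from the observed pairs; a small sketch of the check (our own helper, not the released replication code):

```python
import numpy as np

def monotonicity_check(x, y):
    """Fraction of pairs with a dominance relationship and, among those,
    the fraction satisfying the monotonicity condition."""
    n = len(y)
    total = n * (n - 1) // 2
    dominance, consistent = 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            if np.all(x[i] >= x[j]) and np.any(x[i] > x[j]):
                hi, lo = i, j
            elif np.all(x[j] >= x[i]) and np.any(x[j] > x[i]):
                hi, lo = j, i
            else:
                continue
            dominance += 1
            consistent += int(y[hi] >= y[lo])
    return dominance / total, consistent / max(dominance, 1)
```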
[Figure 1: Fraction of pairs of observed reviewer-paper pairs violating the Lipschitz condition as a function of the Lipschitz constant $L$, using expertise as the outcome.]
Lipschitz Smoothness.
For the Lipschitz assumption, a choice of distance in covariate space is required. We use a distance that is normalized in each dimension, so that all component distances lie in $[0, 1]$, and divided by the number of dimensions. For AAAI, some reviewer-paper pairs are missing a covariate; in that case, we impute a fixed distance in that component. We then choose several potential Lipschitz constants by analyzing the reviewer-paper pairs with observed outcomes. In Figure 1, we plot the fraction of pairs of observations that violate the Lipschitz condition for a given value of $L$ with respect to expertise; we show the corresponding plots for confidence in Appendix K. In our later experiments, we use values of $L$ corresponding to small fractions of violations in these plots.
With these choices, the Lipschitz assumption corresponds to the belief that the outcome does not change too much as the similarity components change: when one similarity component differs by a given amount, the outcomes of the two pairs can differ by at most $L$ times that (normalized) difference. Effectively, the imputed outcome of each unobserved pair is restricted to be relatively close to the outcome of the closest observed pair. In Appendix G, we examine the distribution of distances between unobserved reviewer-paper pairs and their nearest observed pair, observing median distances of 0.0014 (TPDP) and 0.0011 (AAAI) across the pairs violating positivity under any of the modified similarity functions that we analyze in what follows. We conclude that most imputed pairs are very close to some observed pair, so that even large values of $L$ can significantly shrink the bounds relative to the Manski bounds.
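Candidate Lipschitz constants can be screened against the observed data as follows (our own sketch; the distance mirrors the per-dimension-normalized, dimension-averaged distance described above, and the candidate values in the comment are purely illustrative):

```python
import numpy as np

def violation_fraction(x, y, L):
    """Fraction of observed pairs violating |y_i - y_j| <= L * d(x_i, x_j)."""
    x = np.asarray(x, dtype=float)
    n, d = x.shape
    ranges = x.max(axis=0) - x.min(axis=0)
    ranges[ranges == 0] = 1.0                      # avoid division by zero
    violations, total = 0, 0
    for i in range(n):
        for j in range(i + 1, n):
            dist = np.mean(np.abs(x[i] - x[j]) / ranges)   # normalized, averaged distance
            total += 1
            violations += int(abs(y[i] - y[j]) > L * dist)
    return violations / total

# Screen a few candidate constants against the observed reviewer-paper pairs, e.g.:
# for L in (1, 2, 5):
#     print(L, violation_fraction(X_obs, y_obs, L))
```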
In solving the optimization problems for both the monotonicity and Lipschitz methods, we choose the constant $M$ to be large enough that the first term of the objective dominates the second, while not causing numerical instability issues.
6 Results
We now present the analyses of the two datasets using the methods introduced in Section 4. For brevity, we report our analysis using self-reported expertise as the quality measure, and include the results using self-reported confidence in Appendix K. When solving the LPs that output alternative randomized assignments (Appendix A), we often encounter multiple optimal solutions; we employ a persistent arbitrary tie-breaking procedure to choose among them (Appendix I).
[Figure 2: Estimated average review quality (self-reported expertise) under alternative assignment policies for TPDP (left) and AAAI (right), under each imputation and partial-identification method.]
TPDP.
We perform two analyses on the TPDP data, shown in Figure 2 (left). First, we analyze the choice of how to interpolate between the bids and the text similarity when computing the composite similarity score for each reviewer-paper pair. We examine a range of assignments, from an assignment based only on the bids to an assignment based only on the text similarity, focusing our off-policy evaluation on deterministic assignments (i.e., policies with $q = 1$). Interpolation weights close to the on-policy weight result in very similar assignments, each of which has Manski bounds overlapping with the on-policy. Within this region, the models, monotonicity bounds, and Lipschitz bounds all agree that the expertise is similar to the on-policy. However, placing more weight on the text similarity results in a significant improvement in average expertise, even without any additional assumptions. Finally, the most extreme weight settings lead to assignments that are significantly different from the assignments supported by the on-policy, which results in many positivity violations and wider confidence intervals, even under the monotonicity and Lipschitz smoothness assumptions. Note that within this region, the models significantly disagree on the expertise, indicating that the strong assumptions made by such models may not be accurate. Altogether, these results suggest that putting more weight on the text similarity (versus the bids) leads to higher-expertise reviews.
Second, we investigate the “cost of randomization” for preventing fraud, measuring the effect of increasing $q$ and thereby reducing randomness in the optimized random assignment. We consider a range of values of $q$ up to $q = 1$ (the optimal deterministic assignment). When varying $q$, we find that, except for a small increase in one region of $q$ values, the average expertise of the alternative policies is very similar to that of the on-policy. This result suggests that using a randomized instead of a deterministic policy does not lead to a significant reduction in self-reported expertise, an observation that should be contrasted with the previously documented reduction in the expected sum-similarity objective under randomized assignment [22]; see further analysis in Appendix H.
AAAI.
We perform three analyses on the AAAI data, shown in Figure 2 (right). First, we examine the effect of interpolating between the text-similarity scores and the subject area scores by varying , again considering only deterministic policies (i.e., ). The on-policy sets . Due to large numbers of positivity violations, the Manski bounds are uninformative and so we turn to the other estimators. The model imputation analysis indicates that policies with may have slightly higher expertise than the on-policy and indicates lower expertise in the region where . However, the models differ somewhat in their predictions for low , indicating that the assumptions made by these models may not be reliable. The monotonicity bounds more clearly indicate low expertise compared to the on-policy when , but are also slightly more pessimistic about the region than the models. The Lipschitz bounds indicate slightly higher than on-policy expertise for and potentially suggest slightly lower than on-policy expertise for . Overall, all methods of analysis indicate that low values of result in worse assignments, but the effect of considerably increasing is unclear.
Second, we examine the effect of increasing the weight on positive bids by varying the positive-bid scaling parameter described in Section 5.1. A higher (respectively lower) value of this parameter gives greater (respectively lesser) priority to positive bids relative to neutral/negative bids. We investigate a range of values around the on-policy setting, again considering only deterministic policies (i.e., $q = 1$). The Manski bounds are again too wide to be informative. The models all indicate similar levels of expertise across the range and are all slightly more optimistic about expertise than the Manski bounds around the on-policy. The monotonicity and Lipschitz bounds both agree that part of the range has slightly higher expertise than the on-policy. Overall, our analyses provide some indication that increasing the weight on positive bids may result in slightly higher levels of expertise.
Finally, we also examine the effect of varying $q$ (the “cost of randomization”). We see that the models, the monotonicity bounds, and the Lipschitz bounds all strongly agree that the largest values of $q$ we consider yield slightly higher expertise than the smaller values. However, the magnitude of this change is small, indicating that the “cost of randomization” is not very significant.
Power Investigation: Purposefully Bad Policies.
As many of the off-policy assignments we consider have relatively similar estimated quality, we also ran additional analyses to show that our methods can discern differences between good policies (optimized toward high reviewer-paper similarity assignments) and policies intentionally chosen to have poor quality (“optimized” toward low reviewer-paper similarity assignments). We refer the interested reader to Appendix J for further discussion.
7 Discussion and Conclusion
In this work, we evaluate the quality of off-policy reviewer-paper assignments in peer review using data from two venues that deployed randomized reviewer assignments. We propose new techniques for partial identification that allow us to draw useful conclusions about the off-policy review quality, even in the presence of large numbers of positivity violations and missing reviews.
One limitation of off-policy evaluation is that our ability to make inferences inherently depends on the amount of randomness introduced on-policy. For instance, if there is only a small amount of randomness, we will be able to evaluate only policies that are relatively close to the on-policy, unless we are willing to make additional assumptions. The approaches presented in this work allow us to examine the strength of the evidence under a wide range of types and strengths of assumptions (model imputation, boundedness of the outcomes, monotonicity, and Lipschitz smoothness) and to test whether these assumptions lead to converging conclusions. For a more theoretical treatment of the methods proposed in this work, we refer the interested reader to Khan et al. [51].
Our work opens many avenues for future work. In the context of peer review, the present work considers only a few parameterized slices of the vast space of reviewer-paper assignment policies, while there are many other substantive questions that our methodology can be used to answer. For instance, one could evaluate assignment quality under a different method of computing similarity scores (e.g., different NLP algorithms [52]), additional constraints on the assignment (e.g., based on seniority or geographic diversity [4]), or objective functions other than the sum-of-similarities (e.g., various fairness-based objectives [40, 41, 53, 54]). Additional thought should also be given to the trade-offs between maximizing review quality vs. broader considerations of reviewer welfare: while assignments based on high text similarity may yield slightly higher-quality reviews, reviewers may be more willing to review again if the assignment policy more closely follows their bids. Beyond peer review, our work is applicable to off-policy evaluation in other matching problems, including education [55, 56], advertising [27], and ride-sharing [28]. Furthermore, our methods for partial identification under monotonicity and Lipschitz smoothness assumptions should be of independent interest for off-policy evaluation work more broadly.
8 Acknowledgements
We thank Gautam Kamath and Rachel Cummings for allowing us to conduct this study in TPDP and Melisa Bok and Celeste Martinez Gomez from OpenReview for helping us with the OpenReview APIs. We are also grateful to Samir Khan and Tal Wagner for helpful discussions. This work was supported in part by NSF CAREER Award 2143176, NSF CAREER Award 1942124, NSF CIF 1763734, and ONR N000142212181.
References
- [1] Jim McCullough. First comprehensive survey of NSF applicants focuses on their concerns about proposal review. Science, Technology, & Human Values, 1989.
- [2] Marko A. Rodriguez, Johan Bollen, and Herbert Van de Sompel. Mapping the bid behavior of conference referees. Journal of Informetrics, 1(1):68–82, 2007.
- [3] Terne Thorn Jakobsen and Anna Rogers. What factors should paper-reviewer assignments rely on? community perspectives on issues and ideals in conference peer-review. In Conference of the North American Chapter of the Association for Computational Linguistics, pages 4810–4823, 2022.
- [4] Kevin Leyton-Brown, Mausam, Yatin Nandwani, Hedayat Zarkoob, Chris Cameron, Neil Newman, and Dinesh Raghu. Matching papers and reviewers at large conferences. arXiv preprint arXiv:2202.12273, 2022.
- [5] Nihar B. Shah. Challenges, experiments, and computational solutions in peer review. Communications of the ACM, 65(6):76–87, 2022.
- [6] Vittorio Demicheli, Carlo Di Pietrantonj, and Cochrane Methodology Review Group. Peer review for improving the quality of grant applications. Cochrane Database of Systematic Reviews, 2010(1), 1996.
- [7] Mikael Fogelholm, Saara Leppinen, Anssi Auvinen, Jani Raitanen, Anu Nuutinen, and Kalervo Väänänen. Panel discussion does not improve reliability of peer review for medical research grant proposals. Journal of Clinical Epidemiology, 65:47–52, 08 2011.
- [8] Michael R Merrifield and Donald G Saari. Telescope time without tears: a distributed approach to peer review. Astronomy & Geophysics, 50(4):4–16, 2009.
- [9] Wolfgang E Kerzendorf, Ferdinando Patat, Dominic Bordelon, Glenn van de Ven, and Tyler A Pritchard. Distributed peer review enhanced with natural language processing and machine learning. Nature Astronomy, pages 1–7, 2020.
- [10] Laurent Charlin, Richard S. Zemel, and Craig Boutilier. A framework for optimizing paper matching. In Uncertainty in Artificial Intelligence, volume 11, pages 86–95, 2011.
- [11] Simon Price and Peter A. Flach. Computational support for academic peer review: A perspective from artificial intelligence. Communications of the ACM, 60(3):70–79, 2017.
- [12] Baochun Li and Y Thomas Hou. The new automated IEEE INFOCOM review assignment system. IEEE Network, 30(5):18–24, 2016.
- [13] Laurent Charlin and Richard S. Zemel. The Toronto Paper Matching System: An automated paper-reviewer assignment system. In ICML Workshop on Peer Reviewing and Publishing Models, 2013.
- [14] Neil D. Lawrence. The NIPS experiment. https://inverseprobability.com/2014/12/16/the-nips-experiment, 2014. Accessed May 17, 2023.
- [15] Eric Price. The NIPS experiment. http://blog.mrtz.org/2014/12/15/the-nips-experiment.html, 2014. Accessed May 17, 2023.
- [16] Andrew Tomkins, Min Zhang, and William D. Heavlin. Reviewer bias in single- versus double-blind peer review. Proceedings of the National Academy of Sciences, 114(48):12708–12713, 2017.
- [17] Ivan Stelmakh, Charvi Rastogi, Nihar B Shah, Aarti Singh, and Hal Daumé III. A large scale randomized controlled trial on herding in peer-review discussions. arXiv preprint arXiv:2011.15083, 2020.
- [18] Alina Beygelzimer, Yann Dauphin, Percy Liang, and Jennifer Wortman Vaughan. The NeurIPS 2021 consistency experiment. https://blog.neurips.cc/2021/12/08/the-neurips-2021-consistency-experiment/, 2021. Accessed May 17, 2023.
- [19] Mathias Lecuyer, Joshua Lockerman, Lamont Nelson, Siddhartha Sen, Amit Sharma, and Aleksandrs Slivkins. Harvesting randomness to optimize distributed systems. In ACM Workshop on Hot Topics in Networks, pages 178–184, 2017.
- [20] Michael Littman. Collusion rings threaten the integrity of computer science research. Communications of the ACM, 2021.
- [21] T. N. Vijaykumar. Potential organized fraud in ACM/IEEE computer architecture conferences. https://medium.com/@tnvijayk/potential-organized-fraud-in-acm-ieee-computer-architecture-conferences-ccd61169370d, 2020. Accessed May 17, 2023.
- [22] Steven Jecmen, Hanrui Zhang, Ryan Liu, Nihar B. Shah, Vincent Conitzer, and Fei Fang. Mitigating manipulation in peer review via randomized reviewer assignments. Advances in Neural Information Processing Systems, 2020.
- [23] Charles F Manski. Nonparametric bounds on treatment effects. The American Economic Review, 80(2):319–323, 1990.
- [24] Noveen Sachdeva, Yi Su, and Thorsten Joachims. Off-policy bandits with deficient support. In ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 965–975, 2020.
- [25] Tobias Schnabel, Adith Swaminathan, Ashudeep Singh, Navin Chandak, and Thorsten Joachims. Recommendations as treatments: Debiasing learning and evaluation. In International Conference on Machine Learning, pages 1670–1679. PMLR, 2016.
- [26] Alexandre Gilotte, Clément Calauzènes, Thomas Nedelec, Alexandre Abraham, and Simon Dollé. Offline a/b testing for recommender systems. In ACM International Conference on Web Search and Data Mining, pages 198–206, 2018.
- [27] Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X Charles, D Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. Journal of Machine Learning Research, 14(11), 2013.
- [28] Alex Wood-Doughty and Cameron Bruggeman. The incentives platform at lyft. In ACM International Conference on Web Search and Data Mining, pages 1654–1654, 2022.
- [29] David Mimno and Andrew McCallum. Expertise modeling for matching papers with reviewers. In ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 500–509. ACM, 2007.
- [30] Xiang Liu, Torsten Suel, and Nasir Memon. A robust model for paper reviewer assignment. In ACM Conference on Recommender Systems, pages 25–32, 2014.
- [31] Marko A. Rodriguez and Johan Bollen. An algorithm to determine peer-reviewers. In ACM Conference on Information and Knowledge Management, pages 319–328. ACM, 2008.
- [32] Hong Diep Tran, Guillaume Cabanac, and Gilles Hubert. Expert suggestion for conference program committees. In International Conference on Research Challenges in Information Science, pages 221–232, May 2017.
- [33] Graham Neubig, John Wieting, Arya McCarthy, Amanda Stent, Natalie Schluter, and Trevor Cohn. Acl reviewer matching code. https://github.com/acl-org/reviewer-paper-matching, 2020. Accessed May 17, 2023.
- [34] Nihar B. Shah, Behzad Tabibian, Krikamol Muandet, Isabelle Guyon, and Ulrike Von Luxburg. Design and analysis of the nips 2016 review process. Journal of Machine Learning Research, 2018.
- [35] Cheng Long, Raymond Wong, Yu Peng, and Liangliang Ye. On good and fair paper-reviewer assignment. In IEEE International Conference on Data Mining, pages 1145–1150, 12 2013.
- [36] Judy Goldsmith and Robert H. Sloan. The AI conference paper assignment problem. AAAI Workshop, WS-07-10:53–57, 12 2007.
- [37] Wenbin Tang, Jie Tang, and Chenhao Tan. Expertise matching via constraint-based optimization. In International Conference on Web Intelligence and Intelligent Agent Technology, pages 34–41. IEEE Computer Society, 2010.
- [38] Peter A. Flach, Sebastian Spiegler, Bruno Golénia, Simon Price, John Guiver, Ralf Herbrich, Thore Graepel, and Mohammed J. Zaki. Novel tools to streamline the conference review process: Experiences from SIGKDD’09. SIGKDD Explorations Newsletter, 11(2):63–67, May 2010.
- [39] Camillo J. Taylor. On the optimal assignment of conference papers to reviewers. Technical report, Department of Computer and Information Science, University of Pennsylvania, 2008.
- [40] Ivan Stelmakh, Nihar B. Shah, and Aarti Singh. PeerReview4All: Fair and accurate reviewer assignment in peer review. In Algorithmic Learning Theory, 2019.
- [41] Ari Kobren, Barna Saha, and Andrew McCallum. Paper matching with local fairness constraints. In ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1247–1257, 2019.
- [42] Komal Dhull, Steven Jecmen, Pravesh Kothari, and Nihar B. Shah. The price of strategyproofing peer assessment. In AAAI Conference on Human Computation and Crowdsourcing, 2022.
- [43] Ivan Stelmakh, John Wieting, Graham Neubig, and Nihar B. Shah. A gold standard dataset for the reviewer assignment problem. arXiv preprint arXiv:2303.16750, 2023.
- [44] Ivan Stelmakh, Nihar B. Shah, Aarti Singh, and Hal Daumé III. A novice-reviewer experiment to address scarcity of qualified reviewers in large conferences. In AAAI Conference on Artificial Intelligence, volume 35, pages 4785–4793, 2021.
- [45] Ines Arous, Jie Yang, Mourad Khayati, and Philippe Cudré-Mauroux. Peer grading the peer reviews: A dual-role approach for lightening the scholarly paper review process. In Web Conference 2021, pages 1916–1927, 2021.
- [46] Donald B Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688, 1974.
- [47] Daniel G Horvitz and Donovan J Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663–685, 1952.
- [48] VP Godambe and VM Joshi. Admissibility and bayes estimation in sampling finite populations. i. The Annals of Mathematical Statistics, 36(6):1707–1722, 1965.
- [49] Guido W. Imbens and Charles F. Manski. Confidence intervals for partially identified parameters. Econometrica, 2004.
- [50] Richard E Barlow and Hugh D Brunk. The isotonic regression problem and its dual. Journal of the American Statistical Association, 67(337):140–147, 1972.
- [51] Samir Khan, Martin Saveski, and Johan Ugander. Off-policy evaluation beyond overlap: partial identification through smoothness. arXiv preprint arXiv:2305.11812, 2023.
- [52] Arman Cohan, Sergey Feldman, Iz Beltagy, Doug Downey, and Daniel S Weld. Specter: Document-level representation learning using citation-informed transformers. In Annual Meeting of the Association for Computational Linguistics, pages 2270–2282, 2020.
- [53] Justin Payan and Yair Zick. I will have order! optimizing orders for fair reviewer assignment. In International Joint Conference on Artificial Intelligence, 2022.
- [54] Jing Wu Lian, Nicholas Mattei, Renee Noble, and Toby Walsh. The conference paper assignment problem: Using order weighted averages to assign indivisible goods. In AAAI Conference on Artificial Intelligence, volume 32, 2018.
- [55] David J Deming, Justine S Hastings, Thomas J Kane, and Douglas O Staiger. School choice, school quality, and postsecondary attainment. American Economic Review, 104(3):991–1013, 2014.
- [56] Joshua D Angrist, Parag A Pathak, and Christopher R Walters. Explaining charter school effectiveness. American Economic Journal: Applied Economics, 5(4):1–27, 2013.
- [57] Yichong Xu, Han Zhao, Xiaofei Shi, and Nihar B Shah. On strategyproof conference peer review. In International Joint Conference on Artificial Intelligence, pages 616–622, 2019.
- [58] Kevin Leyton-Brown and Mausam. AAAI 2021 - introduction. https://slideslive.com/38952457/aaai-2021-introduction?ref=account-folder-79533-folders; minute 8 onwards in the video, 2021.
- [59] David Roxbee Cox. Planning of experiments. Wiley, 1958.
- [60] Johan Ugander, Brian Karrer, Lars Backstrom, and Jon Kleinberg. Graph cluster randomization: Network exposure to multiple universes. In ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 329–337, 2013.
- [61] Martin Saveski, Jean Pouget-Abadie, Guillaume Saint-Jacques, Weitao Duan, Souvik Ghosh, Ya Xu, and Edoardo M. Airoldi. Detecting network effects: Randomizing over randomized experiments. In ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1027–1035, 2017.
- [62] Susan Athey, Dean Eckles, and Guido W. Imbens. Exact p-values for network interference. Journal of the American Statistical Association, 113(521):230–240, 2018.
- [63] Jean Pouget-Abadie, Guillaume Saint-Jacques, Martin Saveski, Weitao Duan, Souvik Ghosh, Ya Xu, and Edoardo M Airoldi. Testing for arbitrary interference on experimentation platforms. Biometrika, 106(4):929–940, 2019.
Appendix
Appendix A Linear Programs for Peer Review Assignment
Deterministic Assignment.
Let $Z$ be an assignment matrix where $Z_{r,p} \in \{0, 1\}$ denotes whether reviewer $r$ is assigned to paper $p$. Given a matrix of similarity scores $S = [s_{r,p}]$, a standard objective is to find an assignment of papers to reviewers that maximizes the sum of similarities of the assigned pairs, subject to constraints that each paper is assigned an appropriate number of reviewers $k$, each reviewer is assigned no more than a maximum number of papers $\ell$, and conflicts of interest are respected [13, 35, 36, 37, 38, 39, 10]. Denoting the set of conflict-of-interest pairs by $\mathcal{C}$, this optimization problem can be formulated as the following linear program:
$$\begin{aligned}
\max_{Z} \quad & \sum_{r \in \mathcal{R}} \sum_{p \in \mathcal{P}} s_{r,p}\, Z_{r,p} \\
\text{s.t.} \quad & \sum_{r \in \mathcal{R}} Z_{r,p} = k \quad \forall p \in \mathcal{P}, \\
& \sum_{p \in \mathcal{P}} Z_{r,p} \le \ell \quad \forall r \in \mathcal{R}, \\
& Z_{r,p} = 0 \quad \forall (r, p) \in \mathcal{C}, \\
& 0 \le Z_{r,p} \le 1 \quad \forall (r, p) \in \mathcal{R} \times \mathcal{P}.
\end{aligned}$$
By total unimodularity of the constraint matrix, this problem has an optimal solution in which every $Z_{r,p} \in \{0, 1\}$.
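A minimal sketch of this linear program in CVXPY (our own illustration; conference-scale instances are solved with dedicated LP solvers, and the helper below is not part of the released replication code):

```python
import numpy as np
import cvxpy as cp

def optimal_assignment(S, paper_load, reviewer_cap, conflicts):
    """Sum-of-similarities assignment LP.

    S            : (n_reviewers, n_papers) similarity matrix
    paper_load   : number of reviewers required per paper
    reviewer_cap : maximum number of papers per reviewer
    conflicts    : boolean matrix of conflict-of-interest pairs
    """
    Z = cp.Variable(S.shape)
    constraints = [
        Z >= 0, Z <= 1,
        cp.sum(Z, axis=0) == paper_load,    # each paper gets the required number of reviewers
        cp.sum(Z, axis=1) <= reviewer_cap,  # each reviewer gets at most the maximum load
        cp.sum(cp.multiply(Z, conflicts.astype(float))) == 0,  # conflict pairs forced to zero
    ]
    cp.Problem(cp.Maximize(cp.sum(cp.multiply(S, Z))), constraints).solve()
    return Z.value  # total unimodularity guarantees an integral optimum exists
```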
Although the above strategy is the primary method used for paper assignments in large-scale peer review, other variants of this method have been proposed and used in the literature. These algorithms consider various properties in addition to the total similarity, such as fairness [40, 41], strategyproofness [57, 42], envy-freeness [53] and diversity [58]. We focus on the sum-of-similarities objective here, but our off-policy evaluation framework is agnostic to the specific objective function.
Randomized Assignment.
As one approach to strategyproofness, Jecmen et al. [22] introduce the idea of using randomization to prevent colluding reviewers and authors from being able to guarantee their assignments. Specifically, the algorithm computes a randomized paper assignment in which the marginal probability of assigning any reviewer to any paper is at most a parameter $q \in [0,1]$, chosen a priori by the program chairs. These marginal probabilities are determined by the following linear program, which maximizes the expected similarity of the assignment:
$$
\begin{aligned}
\max_{M} \quad & \textstyle\sum_{r,p} S_{r,p} M_{r,p} \\
\text{s.t.} \quad & \textstyle\sum_{r} M_{r,p} = k_{\text{paper}} && \forall p, \\
& \textstyle\sum_{p} M_{r,p} \le k_{\text{reviewer}} && \forall r, \\
& M_{r,p} = 0 && \forall (r,p) \in \mathcal{C}, \\
& M_{r,p} \in [0,q] && \forall (r,p).
\end{aligned}
\tag{2}
$$
A reviewer-paper assignment is then sampled using a randomized procedure that iteratively redistributes the probability mass placed on each reviewer-paper pair until all probabilities are either zero or one. This procedure ensures only that the desired marginal assignment probabilities are satisfied, providing no guarantees on the joint distributions of assigned pairs.
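A compact sketch of LP (2), again assuming cvxpy: the only change from the deterministic LP is the upper bound $q$ on each entry of $M$, so setting $q = 1$ recovers the deterministic formulation. The sampling step that converts these marginals into a 0/1 assignment is not shown.

```python
# Sketch of LP (2): cap each marginal assignment probability at q (q = 1 recovers the deterministic LP).
import numpy as np
import cvxpy as cp

def solve_assignment_lp(S, k_paper, k_reviewer, conflicts, q=1.0):
    """Return the matrix of marginal assignment probabilities maximizing expected similarity."""
    n_rev, n_pap = S.shape
    M = cp.Variable((n_rev, n_pap))
    cons = [M >= 0, M <= q,
            cp.sum(M, axis=0) == k_paper,
            cp.sum(M, axis=1) <= k_reviewer]
    cons += [M[r, p] == 0 for (r, p) in conflicts]
    cp.Problem(cp.Maximize(cp.sum(cp.multiply(S, M))), cons).solve()
    return M.value

S = np.random.default_rng(0).uniform(size=(6, 4))
marginals = solve_assignment_lp(S, k_paper=2, k_reviewer=2, conflicts=[(0, 1)], q=0.5)
# A separate sampling procedure (not shown) then draws a 0/1 assignment whose
# marginal probabilities match these entries.
```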
Appendix B Stable Unit Treatment Value Assumption
The Stable Unit Treatment Value Assumption (SUTVA) in causal inference [59] states that the treatment of one unit does not affect the outcomes of the other units, i.e., there is no interference between units. In the context of peer review, SUTVA implies that: (i) the quality of the review written by reviewer $r$ for paper $p$ does not depend on which other reviewers are assigned to paper $p$; and (ii) it also does not depend on which other papers reviewer $r$ is assigned to review. The first assumption is quite realistic, since in most peer-review systems reviewers cannot see the other reviews of a paper until they submit their own. The second assumption deserves more scrutiny, as there could be “batch effects”: a reviewer may feel more or less confident about their assessment (if quality is measured by confidence) depending on which other papers they were assigned to review. We do not test for batch effects or other violations of SUTVA in this work; such tests typically require either strong modeling assumptions or experimental designs specifically tailored to detecting interference [60, 61, 62, 63]. We consider this an important direction for future work.
Appendix C Covariance Estimation
As described in Section 4, our variance estimate for the off-policy value estimator depends on the covariances of the assignment indicators across reviewer-paper pairs. However, these covariance terms (taken over the randomness of the sampled assignment) are not known exactly: the procedure of Jecmen et al. [22] only constrains the marginal probabilities of individual reviewer-paper pairs, and pairs of pairs can be non-trivially correlated. In the absence of a closed-form expression, we use Monte Carlo methods to estimate these covariances tightly. In both the TPDP and AAAI analyses, we sampled 1 million assignments and computed the empirical covariance. We ran an additional analysis to investigate the variability of the resulting variance estimates: we took a bootstrap sample of 100,000 assignments (from the set of all 1 million sampled assignments) and computed the variance estimate based only on this smaller bootstrap sample. Repeating this procedure 1,000 times, we found that the variance of our variance estimates is very small even when using 10 times fewer sampled assignments, suggesting that we have sampled enough assignments to estimate the variance accurately.
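The sketch below illustrates the Monte Carlo covariance estimate and the bootstrap check on synthetic data with small sample counts for speed. Here `sample_assignment` is a stand-in (independent Bernoulli draws) for the actual correlated sampling procedure of [22], and `variance_estimate` stands in for the full plug-in variance formula of Section 4.

```python
# Monte Carlo estimate of assignment-indicator covariances, plus a bootstrap stability check.
import numpy as np

rng = np.random.default_rng(0)
n_rev, n_pap, n_samples = 6, 4, 10_000
marginals = rng.uniform(0, 0.5, size=(n_rev, n_pap))   # placeholder marginal probabilities

def sample_assignment():
    # Stub: the real procedure produces correlated entries; only the marginals are guaranteed.
    return (rng.uniform(size=(n_rev, n_pap)) < marginals).astype(float)

# Empirical covariance of the flattened assignment indicators across sampled assignments.
draws = np.stack([sample_assignment().ravel() for _ in range(n_samples)])
cov_hat = np.cov(draws, rowvar=False)                   # (n_rev*n_pap, n_rev*n_pap)

def variance_estimate(sample):
    # Stand-in for the plug-in variance formula that consumes the estimated covariances.
    return np.cov(sample, rowvar=False).sum()

# Bootstrap check: recompute the variance estimate on subsamples and inspect its spread.
boot = [variance_estimate(draws[rng.choice(n_samples, size=1_000, replace=True)])
        for _ in range(100)]
print(np.var(boot))
```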
Appendix D Coverage of Imbens-Manski Confidence Intervals
Under the Manski, monotonicity, and Lipschitz assumptions, we employ a standard technique due to Imbens and Manski [49] for constructing confidence intervals for partially identified parameters. These intervals converge uniformly to the specified confidence level under a set of regularity conditions on the estimators of the upper and lower endpoints of the interval estimate: Assumption 1 of [49], which underlies the coverage result in their Lemma 4. It is difficult to verify whether Assumption 1 is satisfied for the designs (sampling reviewer-paper matchings) and interval endpoint estimators (Manski, monotonicity, Lipschitz) used in this work.
A different set of assumptions, most significantly that the fraction of missing data is known before assignment, supports an alternative method for computing confidence intervals with the coverage result in Lemma 3 of [49], obviating the need for Assumption 1. In our setting, small amounts of attrition (relative to the number of policy-induced positivity violations) mean that the fraction of missing data is not known exactly before assignment, although it is nearly so. In practice, we find that the Imbens-Manski interval estimates from their Lemma 3 (assuming a known fraction of missing data) and Lemma 4 (assuming Assumption 1) are nearly identical for all three of the Manski-, monotonicity-, and Lipschitz-based estimates, suggesting that coverage is well-behaved. A detailed theoretical analysis of whether our estimators obey the regularity conditions of Assumption 1 is beyond the scope of this work; see [51] for related theoretical developments on the rates of convergence of Lipschitz-based estimates.
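For reference, a small sketch of the Imbens-Manski construction given estimated bound endpoints and their standard errors; the numbers in the final line are illustrative only.

```python
# Sketch of the Imbens-Manski interval for a partially identified parameter with bounds [lo, hi].
from scipy.stats import norm
from scipy.optimize import brentq

def imbens_manski_interval(lo, hi, se_lo, se_hi, alpha=0.05):
    """Widen [lo, hi] by a critical value that accounts for the width of the identified set."""
    delta = (hi - lo) / max(se_lo, se_hi)      # normalized width of the identified set
    # Critical value C solves Phi(C + delta) - Phi(-C) = 1 - alpha.
    crit = brentq(lambda c: norm.cdf(c + delta) - norm.cdf(-c) - (1 - alpha), 0.0, 10.0)
    return lo - crit * se_lo, hi + crit * se_hi

print(imbens_manski_interval(lo=2.65, hi=2.68, se_lo=0.01, se_hi=0.01))
```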
Appendix E Model Implementation
To impute the outcomes of the unobserved reviewer-paper pairs, we train classification, ordinal regression, and collaborative filtering models. Classification models are suitable since the reviewers select their expertise and confidence scores from a set of pre-specified choices. Ordinal regression models additionally model the fact that the scores have a natural ordering. Collaborative filtering models, in contrast to the classification and ordinal regression models, do not rely on covariates and instead model the structure of the observed entries in the reviewer-paper outcome matrix, which is akin to user-item rating matrices found in recommender systems.
In the classification and regression models, we use the covariates of each reviewer-paper pair as input features. In our analysis, we consider the component scores used to compute the similarities: for TPDP, the text-similarity and bid scores; for AAAI, the text-similarity, subject-area, and bid scores. These are the primary components used by conference organizers to compute similarities, so we expect them to be usefully correlated with match quality. Although we perform our analysis with this choice of covariates, one could also include various other features of each reviewer-paper pair, e.g., an encoding of reviewer and paper subject areas, reviewer seniority, etc.
To evaluate the performance of the models, we randomly split the observed reviewer-paper pairs into train (75%) and test (25%) sets, fit the models on the train set, and measure the mean absolute error (MAE) of the predictions on the test set. To obtain more robust estimates of performance, we repeat this process 10 times. In the training phase, we use 10-fold cross-validation to tune the hyperparameters, using MAE as the selection criterion, and retrain the model on the full training set with the best hyperparameters. We also consider two preprocessing decisions: (a) whether to encode the bids as one-hot categorical variables or as continuous variables with the values described in Section 5.1, and (b) whether to standardize the features. In both cases, we use the settings that, overall, predict best for each model. We tested several models from each category; to simplify the exposition, we report only the results of the two best-performing models in each category. The code repository referenced in Section 1 contains the implementation of all models, including the sets of hyperparameters considered for each model.
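This protocol can be summarized by the following sketch, which uses synthetic data and a single scikit-learn classifier as a stand-in for the model families we actually evaluate; the hyperparameter grid is illustrative.

```python
# Repeated train/test splits with MAE, hyperparameters tuned by cross-validated grid search.
import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))            # covariates, e.g., text/area/bid scores (synthetic)
y = rng.integers(1, 6, size=500)         # ordinal outcomes, e.g., expertise in {1, ..., 5}

maes = []
for seed in range(10):                   # 10 random splits for more robust estimates
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=seed)
    search = GridSearchCV(LogisticRegression(max_iter=1000),
                          param_grid={"C": [0.1, 1.0, 10.0]},
                          scoring="neg_mean_absolute_error", cv=10)
    search.fit(X_tr, y_tr)               # refits the best model on the full training split
    maes.append(mean_absolute_error(y_te, search.predict(X_te)))
print(np.mean(maes), np.std(maes))
```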

Figure 3 shows the test MAE across the 10 random train/test splits (means and 95% CIs) using expertise and confidence outcomes for both TPDP and AAAI. We note that all models perform significantly better than a baseline that predicts the mean outcome in the train set. For TPDP, we find that all models perform similarly, except for cf-svd++, which performs slightly better than the other models for both expertise and confidence. For AAAI, all classification and regression models perform similarly, but the collaborative filtering models perform slightly worse. This difference is perhaps due to the fact that we consider a larger set of covariates for AAAI than for TPDP, which likely makes the classification and ordinal regression models relatively more predictive.
Finally, to compute the imputation-based estimates, we train each model on the set of all observed reviewer-paper pairs, predict the outcomes for all unobserved pairs, and impute the predicted outcomes as described in Section 4.2. In this final training phase, we again use 10-fold cross-validation to select the hyperparameters and refit the model on the full set of observed reviewer-paper pairs.
Appendix F Details of AAAI Assignment
In Section 5.1, we described a simplified version of the stage of the AAAI assignment procedure that we analyze, i.e., the assignment of senior reviewers to the first round of submissions. In this section, we describe this stage of the AAAI paper assignment more precisely.
A randomized assignment was computed between senior reviewers and first-round paper submissions, independently of all other stages of the reviewer assignment. The set of senior reviewers was determined based on reviewing experience and publication record; these criteria were external to the assignment. Each paper was assigned one senior reviewer. Reviewers were assigned at most a fixed number of papers, with a higher limit for reviewers with a “Machine Learning” primary area or in the “AI For Social Impact” track. The marginal assignment probabilities were capped at a limit $q$ chosen by the program chairs.
The similarities were computed from text-similarity scores, subject-area scores, and bids. Either the text-similarity score or the subject-area score could be missing for a given reviewer-paper pair, due either to a reviewer failing to provide the needed information or to other errors in the computation of the scores. The text-similarity scores were created from text-based scores from two sources: (i) the Toronto Paper Matching System (TPMS) [13], and (ii) the ACL Reviewer Matching code [33]. The text-similarity score was set equal to the TPMS score for all pairs where that score was not missing, set equal to the ACL score for all other pairs where the ACL score was not missing, and marked as missing if both scores were missing. The subject-area scores were computed from reviewer and paper subject areas using the procedure described in Appendix A of [4].
Next, a base score was computed for each pair as a fixed combination of the text-similarity and subject-area scores when both were available. If either score was missing, the base score was set equal to the non-missing score of the two; if both were missing, the base score was set to a fixed default value. For pairs where the bid was “willing” or “eager” and an additional condition on the scores held, the base score was overridden with a fixed value.
Next, final scores were computed by combining the base score with a numeric value associated with the reviewer’s bid level (“not willing”, “not entered”, “in a pinch”, “willing”, or “eager”). For certain pairs, depending on the bid and on whether the text-similarity score was missing, the final score was then recomputed. Finally, for reviewers who did not provide their profile for use in conflict-of-interest detection, the final score was reduced by a fixed penalty.
In all of our analyses, we follow this same procedure to determine the assignment under alternative policies, varying only the weight and randomization parameters described in Section 6.
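As a purely schematic illustration of the missing-score handling and bid adjustment described above, the sketch below uses placeholder weights, defaults, bid values, and penalty; it is not the AAAI'22 formula.

```python
# Schematic score-combination logic; every constant below is a placeholder, not a venue value.
BID_VALUE = {"not willing": 0.0, "not entered": 0.5, "in a pinch": 0.75,
             "willing": 1.0, "eager": 1.0}           # placeholder numeric bid levels

def final_score(text, area, bid, has_coi_profile=True,
                w_text=0.5, default_base=0.3, missing_penalty=0.1):
    """Combine (possibly missing) text and area scores with the bid into a final score."""
    if text is not None and area is not None:
        base = w_text * text + (1 - w_text) * area   # placeholder combination rule
    elif text is not None or area is not None:
        base = text if text is not None else area    # fall back to the non-missing score
    else:
        base = default_base                          # both scores missing: fixed default
    score = base * (0.5 + 0.5 * BID_VALUE[bid])      # placeholder bid adjustment
    if not has_coi_profile:
        score -= missing_penalty                     # penalty for a missing conflict profile
    return score

print(final_score(text=0.8, area=None, bid="eager", has_coi_profile=False))
```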


Appendix G Details Regarding Assumption Suitability
In this section, we provide additional details on the discussion in Section 5.2 on the suitability of the monotonicity and Lipschitz smoothness assumptions.
First, we examine the fraction of pairs of observed reviewer-paper pairs that violate the Lipschitz condition for each value of the Lipschitz constant $L$. Figure 4 shows the CCDF, over pairs of observations, of the ratio of outcome difference to covariate distance (in other words, the fraction of violating observation-pairs for each value of $L$) with respect to confidence. The corresponding plot for expertise is shown in Figure 1.
Next, we examine the distances from unobserved reviewer-paper pairs to their closest observed reviewer-paper pair. In Figure 5, we show the CCDF of these distances for unobserved reviewer-paper pairs within a set of “relevant” pairs. We define the set of “relevant” unobserved pairs to be all pairs not supported on-policy that have positive probability under at least one of the off-policies considered in our analyses (varying the weight and randomization parameters) for TPDP and AAAI, respectively.
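The Lipschitz-violation fractions in Figure 4 can be computed as in the following sketch; the synthetic covariates, outcomes, and the Euclidean distance are placeholders for the actual covariates and distance used in our analysis.

```python
# For each candidate constant L, compute the fraction of observed pairs (i, j) with
# |y_i - y_j| > L * d(x_i, x_j), i.e., the CCDF of the empirical pairwise ratios.
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(size=(200, 2))                     # covariates of observed reviewer-paper pairs
y = rng.integers(1, 6, size=200).astype(float)     # observed outcomes (e.g., expertise)

dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # pairwise covariate distances
diffs = np.abs(y[:, None] - y[None, :])                          # pairwise outcome differences
iu = np.triu_indices(len(y), k=1)                                # each unordered pair once
ratios = diffs[iu] / np.maximum(dists[iu], 1e-12)                # smallest L consistent with each pair

for L in [1, 2, 5, 10]:
    print(L, np.mean(ratios > L))                  # fraction of pairs violating Lipschitz(L)
```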
Appendix H Similarity Cost of Randomization
In [22], Jecmen et al. empirically analyze the “cost of randomization” in terms of the expected total assignment similarity, i.e., the objective value of LP (2), as $q$ changes. This approach is also used by conference program chairs to choose an acceptable level of $q$ in practice. In Figure 6, we show this trade-off between $q$ and sum-similarity (as a ratio to the optimal deterministic sum-similarity) for both TPDP and AAAI. Note that, in contrast, our approach in this work is to measure assignment quality via self-reported expertise or confidence rather than via similarity. In particular, the cost of randomization for TPDP is high in terms of sum-similarity but is revealed by our analysis to be mild in terms of expertise (Section 6).
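A sketch of how such a curve can be traced on toy data, reusing the `solve_assignment_lp` helper sketched in Appendix A; the loads and the grid of $q$ values are illustrative.

```python
# Sweep q and report expected sum-similarity as a ratio to the deterministic optimum.
import numpy as np

S = np.random.default_rng(0).uniform(size=(6, 4))
opt = solve_assignment_lp(S, k_paper=2, k_reviewer=2, conflicts=[], q=1.0)
best = float((S * opt).sum())                        # optimal deterministic sum-similarity
for q in [1.0, 0.75, 0.5]:
    m = solve_assignment_lp(S, k_paper=2, k_reviewer=2, conflicts=[], q=q)
    print(q, float((S * m).sum()) / best)            # cost of randomization at this q
```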

Appendix I Tie-Breaking Behavior
In Section 6, we specify each policy in terms of the parameters of LP (2) (specifically, by altering the weight and randomization parameters from their on-policy values). However, LP (2) may not have a unique solution, so a policy may not correspond to a unique set of assignment probabilities. Of particular concern, the on-policy specification of LP (2) does not uniquely identify the actual on-policy assignment probabilities.
Ideally, we would use the same tie-breaking methodology as was used on-policy to pick a solution for each off-policy, so as to avoid introducing additional effects from variations in tie-breaking behavior. However, this behavior was not specified in the venues we analyze. To resolve this, we fix arbitrary tie-breaking behaviors such that the on-policy solution of LP (2) matches the actual on-policy assignment probabilities; we then use these same behaviors for all off-policies.
In the TPDP analyses, we perturb all similarities by small constants so that the resulting similarity values are unique. Specifically, we add a small perturbation matrix to the similarities in the objective of LP (2); this perturbation matrix is the same for all policies. To choose it, we sampled each entry uniformly at random from a small interval around zero and checked whether the solution of the perturbed on-policy LP matches the on-policy assignment probabilities, resampling until it did. The resulting perturbation was then fixed for all policies.
In the AAAI analyses, the larger size of the similarity matrix meant that randomly choosing a perturbation that recovers the on-policy solution was not feasible. Instead, we choose more directly how to perturb the similarities in order to achieve consistency with the on-policy. We again add a small perturbation to the similarities in the objective of LP (2), with the perturbation magnitude chosen for each policy by the following procedure: it is set to the largest value from a fixed grid of candidates such that the difference in total similarity between the solutions of the original and perturbed LPs is no greater than a small tolerance. We confirmed that using this procedure to perturb the on-policy LP recovers the on-policy assignment probabilities, as desired.
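The TPDP-style resampling check can be sketched as follows, again reusing `solve_assignment_lp` from Appendix A; the perturbation scale and matching tolerance are placeholders.

```python
# Resample a small perturbation of the similarities until the perturbed on-policy LP
# reproduces the actual on-policy assignment probabilities; then hold it fixed.
import numpy as np

rng = np.random.default_rng(0)
S = rng.uniform(size=(6, 4))
onpolicy = solve_assignment_lp(S, k_paper=2, k_reviewer=2, conflicts=[], q=0.5)

while True:
    eps = rng.uniform(0.0, 1e-6, size=S.shape)        # small perturbation, fixed across policies
    perturbed = solve_assignment_lp(S + eps, k_paper=2, k_reviewer=2, conflicts=[], q=0.5)
    if np.allclose(perturbed, onpolicy, atol=1e-6):   # perturbed on-policy matches the original
        break
# eps is now added to the similarities in every off-policy LP as well.
```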
Appendix J Power Investigation: Purposefully Bad Policies
Many of the off-policy assignments we consider in Section 5 turn out to have relatively similar estimated quality. A possible explanation for this tendency is that most “reasonable” optimized policies are roughly equivalent in terms of quality, since our analyses only consider adjusting the parameters of the (presumably reasonable) optimized on-policy. To investigate this possibility, we analyze a policy intentionally chosen to have poor quality.
Designing a “bad” policy that can feasibly be analyzed presents a challenge, since the on-policies are optimized and thus rarely place probability on obviously bad reviewer-paper pairs. To work within this constraint, we restrict attention to policies that treat all reviewer-paper pairs with zero on-policy probability as conflicts. We then contrast the deterministic ($q = 1$) policy that maximizes the total similarity with the “bad” policy that minimizes it. Since the on-policy similarities are presumably somewhat indicative of expertise, we expect the minimization policy to be worse.
The results of this comparison are presented in Table 1. On both TPDP and AAAI, we see that our methods clearly identify the minimization policies as worse. The differences in quality between the policies become clearer with the addition of the Lipschitz and monotonicity assumptions to address attrition. This illustrates that our methods are able to distinguish a good policy (the best of the best matches) from a clearly worse one (the worst of the best matches). Thus, it is likely that our primary analyses are simply exploring high-quality regions of the assignment-policy space, and that peer-review assignment quality is often robust to the exact values of the various parameters.
| Policy | Manski | Monotonicity | Lipschitz |
|---|---|---|---|
| TPDP Max | [2.6115, 2.7045] | [2.6551, 2.6782] | [2.6498, 2.6744] |
| TPDP Min | [2.5521, 2.6126] | [2.5521, 2.5986] | [2.5521, 2.5937] |
| AAAI Max | [3.3919, 3.5213] | [3.4756, 3.4783] | [3.4764, 3.4809] |
| AAAI Min | [3.2591, 3.3846] | [3.3394, 3.3419] | [3.3396, 3.3443] |
Appendix K Results for Confidence Outcomes
Figure 7 shows the results of our analyses using reviewer-reported confidence as the quality measure. The results are substantively very similar to those reported in Section 6 using expertise.
