
Counterfactual Evaluation of Peer-Review Assignment Policies

Martin Saveski
Stanford University
[email protected]
   Steven Jecmen
Carnegie Mellon University
[email protected]
   Nihar B. Shah
Carnegie Mellon University
[email protected]
   Johan Ugander
Stanford University
[email protected]
Abstract

Peer review assignment algorithms aim to match research papers to suitable expert reviewers, working to maximize the quality of the resulting reviews. A key challenge in designing effective assignment policies is evaluating how changes to the assignment algorithm map to changes in review quality. In this work, we leverage recently proposed policies that introduce randomness in peer-review assignment—in order to mitigate fraud—as a valuable opportunity to evaluate counterfactual assignment policies. Specifically, we exploit how such randomized assignments provide a positive probability of observing the reviews of many assignment policies of interest. To address challenges in applying standard off-policy evaluation methods, such as violations of positivity, we introduce novel methods for partial identification based on monotonicity and Lipschitz smoothness assumptions for the mapping between reviewer-paper covariates and outcomes. We apply our methods to peer-review data from two computer science venues: the TPDP’21 workshop (95 papers and 35 reviewers) and the AAAI’22 conference (8,450 papers and 3,145 reviewers). We consider estimates of (i) the effect on review quality when changing weights in the assignment algorithm, e.g., weighting reviewers’ bids vs. textual similarity (between the reviewer’s past papers and the submission), and (ii) the “cost of randomization”, capturing the difference in expected quality between the perturbed and unperturbed optimal match. We find that placing higher weight on text similarity results in higher review quality and that introducing randomization in the reviewer-paper assignment only marginally reduces the review quality. Our methods for partial identification may be of independent interest, while our off-policy approach can likely find use in evaluating a broad class of algorithmic matching systems.

1 Introduction

The assignment of reviewers to submissions is one of the most important parts of the peer-review process [1, 2, 3]. In many peer-review applications—ranging from peer review of academic conference submissions [4, 5], to grant proposals [6, 7], to proposals for the allocation of other scientific resources [8, 9]—a set of submissions are simultaneously received and must all be assigned reviewers from an impaneled pool. However, when the number of submissions or reviewers is too large, it may not be feasible to manually assign suitable reviewers for each submission. As a result, automated systems must be used to determine the reviewer assignment.

In computer science, conferences are the primary terminal venue for scientific publications, with recent iterations of large conferences such as NeurIPS and AAAI receiving several thousand submissions [5]. The automated reviewer-assignment systems deployed by such conferences typically use three sources of information: (i) bids, i.e., reviewers’ self-reported preferences to review the papers; (ii) text similarity between the paper and the reviewer’s publications; and (iii) reviewer- and author-selected subject areas. Given a prescribed way to combine these signals into a single score, an optimization procedure then proposes a reviewer-paper assignment that maximizes the sum of the scores of the assigned pairs [10].

The design of effective peer-review systems has received considerable research attention [5, 4, 11, 12]. Popular peer-review platforms such as OpenReview and Microsoft CMT offer many features that conference organizers can use to assign reviewers, such as integration with the Toronto Paper Matching System (TPMS) [13] for computing text-similarity scores. An implicit assumption underlying such systems is that review quality is an increasing function of bid enthusiasm, text similarity, and subject area match, but the choice of how to combine these signals into a single score is made heuristically. It has been persistently challenging to evaluate how changes to peer-review assignment algorithms affect review quality: researchers typically observe only the reviews actually assigned by the algorithm and have no way of measuring the quality of reviews under an assignment generated by an alternative algorithm.

One approach to comparing different peer-review assignment policies is running randomized controlled trials or A/B tests. Several conferences (NeurIPS’14 [14, 15], WSDM’17 [16], ICML’20 [17], and NeurIPS’21 [18]) have run A/B tests to evaluate various aspects of their review process, such as differences between single- vs. double-blind review. However, such experiments are extremely costly in the peer-review context: the NeurIPS experiments required a significant number of additional reviews, overloading already strained peer-review systems. Moreover, A/B tests typically compare only a handful of design decisions, while assignment algorithms require making many such decisions (see Section 2).

Present Work.

In this work, we propose off-policy evaluation as a less costly alternative that exploits existing randomness to enable the comparison of many alternative policies. Our proposed technique “harvests” [19] the randomness introduced in peer-review assignments generated by recently-adopted techniques that counteract fraud in peer review. In recent years, in-depth investigations have uncovered evidence of rings of colluding reviewers in a few computer science conferences [20, 21]. These reviewers conspire to manipulate the paper assignment in order to give positive reviews to the papers of co-conspirators. To mitigate this kind of collusion, conference organizers have adopted various techniques, including a recently introduced randomized assignment algorithm [22]. This algorithm caps the probability of any reviewer being assigned any particular paper at a limit set by the organizers. The resulting randomization limits the expected rewards of reviewer collusion at the cost of some reduction in the expected sum-of-similarities objective; it has been implemented in OpenReview since 2021 and used by several conferences, including the AAAI 2022 and 2023 conferences.

The key insight of the present work is that under this randomized assignment policy, a range of reviewer-paper pairs beyond the exactly optimal assignment have a positive probability of being observed. We can then adapt the tools of off-policy evaluation and importance sampling to evaluate the quality of many alternative policies. A major challenge, however, is that off-policy evaluation assumes overlap between the on-policy and the off-policy, i.e., that each reviewer-paper assignment with positive probability under the off-policy also had positive probability under the on-policy. In practice, positivity violations are inevitable even when the maximum probability of assigning any reviewer-paper pair is low enough to induce significant randomization, especially as we are interested in evaluating a wide range of design choices of the assignment policy. To address this challenge, we build on the existing literature on partial identification and propose methods that bound the off-policy estimates while making weak assumptions on how positivity violations arise.

More specifically, we propose two approaches for analysis that rely on different assumptions on the mapping between the covariates (e.g., bid, text similarity, subject area match) and the outcome (e.g., review quality) of the reviewer-paper pairs. First, we assume monotonicity in the covariate-outcome mapping. Understood intuitively, this assumption states that if reviewer-paper pair $i$ has a higher or equal bid, text similarity, and subject area match than reviewer-paper pair $j$, then the quality of the review for pair $i$ is higher than or equal to that for pair $j$. Alternatively, we assume Lipschitz smoothness in the covariate-outcome mapping. Intuitively, this assumption captures the idea that two reviewer-paper pairs with similar bids, text similarity, and subject area match should result in similar review quality. We find that this Lipschitz assumption naturally generalizes so-called Manski bounds [23], the partial identification strategy that assumes only bounded outcomes.

We apply our methods to data collected by two computer science venues that used the recently-introduced randomized assignment strategy: the 2021 Workshop on Theory and Practice of Differential Privacy (TPDP) with 95 papers and 35 reviewers, and the 2022 AAAI Conference on Artificial Intelligence (AAAI) with 8,450 papers and 3,145 reviewers. TPDP is an annual workshop co-located with the machine learning conference ICML, and AAAI is one of the largest annual artificial intelligence conferences. We evaluate two design choices: (i) how varying the weights of the bids vs. text similarity vs. subject area match (the latter available only for AAAI) affects the overall quality of the reviews, and (ii) the “cost of randomization”, i.e., how much the review quality decreased as a result of introducing randomness in the assignment. As our measure of assignment quality, we consider the expertise and confidence reported by the reviewers for their assigned papers. We find that our proposed methods for partial identification assuming monotonicity and Lipschitz smoothness significantly tighten the bounds on the estimated off-policy review quality, leading to more informative results. Substantively, we find that placing a larger weight on text similarity results in higher review quality, and that introducing randomization in the assignment leads to a very small reduction in review quality.

Beyond our contributions to the design and study of peer review systems, the methods proposed in this work should also apply to other matching systems such as recommendation systems [24, 25, 26], advertising [27], and ride-sharing assignment systems [28]. Further, our contributions to off-policy evaluation under partial identification should be of independent interest.

We release our replication code at: https://github.com/msaveski/counterfactual-peer-review.

2 Preliminaries

We start by reviewing the fundamentals of peer-review assignment algorithms.

Reviewer-Paper Similarity.

Consider a peer review scenario with a set of reviewers $\mathcal{R}$ and a set of papers $\mathcal{P}$. Standard assignment algorithms for large-scale peer review rely on “similarity scores” for every reviewer-paper pair $i=(r,p)\in\mathcal{R}\times\mathcal{P}$, representing the assumed quality of review by that reviewer for that paper. These scores $S_{i}$, typically non-negative real values, are commonly computed from a combination of up to three sources of information:

  • $T_{i}$: text similarity between each paper and the reviewer’s past work, computed using various techniques [29, 30, 31, 32, 13, 33];

  • $K_{i}$: overlap between the subject areas selected by each reviewer and each paper’s authors; and

  • $B_{i}$: reviewer-provided “bids” on each paper.

Without any principled methodology for evaluating the choice of similarity score, conference organizers manually select a parametric functional form and choose parameter values by spot-checking a few reviewer-paper assignments. For example, a simple similarity function is a convex combination of the component scores: $S_{i}=w_{\mathrm{text}}T_{i}+(1-w_{\mathrm{text}})B_{i}$. Conferences have also used more complex non-linear functions: NeurIPS’16 [34] used the functional form $S_{i}=(0.5T_{i}+0.5K_{i})2^{B_{i}}$, while AAAI’21 [4] used $S_{i}=(0.5T_{i}+0.5K_{i})^{1/B_{i}}$. Beyond the choice of how to combine the component scores, numerous other aspects of the similarity computation also involve choices: the language-processing techniques used to compute text-similarity scores, the input given to them, the range and interpretation of bid options shown to reviewers, etc. The range of possible functional forms results in a wide design space, which we explore in this work.
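To make the design space concrete, the sketch below computes composite similarity scores under the three functional forms mentioned above. It is purely illustrative: the function names and the toy component values are ours, not part of any venue's implementation.

```python
import numpy as np

def similarity_linear(T, B, w_text=0.5):
    # Convex combination of text similarity and bid score.
    return w_text * T + (1 - w_text) * B

def similarity_neurips16(T, K, B):
    # NeurIPS'16-style form: (0.5*T + 0.5*K) * 2^B.
    return (0.5 * T + 0.5 * K) * np.power(2.0, B)

def similarity_aaai21(T, K, B):
    # AAAI'21-style form: (0.5*T + 0.5*K)^(1/B).
    return np.power(0.5 * T + 0.5 * K, 1.0 / B)

# Toy component scores for three reviewer-paper pairs (illustrative only).
T = np.array([0.8, 0.4, 0.6])   # text similarity
K = np.array([1.0, 0.5, 0.0])   # subject-area overlap
B = np.array([1.0, 0.5, 2.0])   # bid values (venue-specific encoding)

print(similarity_linear(T, B))  # e.g., [0.9  0.45 1.3 ]
```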

Deterministic Assignment.

Let $Z\in\{0,1\}^{|\mathcal{R}|\times|\mathcal{P}|}$ be an assignment matrix where $Z_{i}$ denotes whether the reviewer-paper pair $i$ was assigned or not. Given a matrix of reviewer-paper similarity scores $S\in\mathbb{R}^{|\mathcal{R}|\times|\mathcal{P}|}_{\geq 0}$, a standard objective is to find an assignment of reviewers to papers that maximizes the sum of similarities of the assigned pairs, subject to constraints that each paper is assigned to an appropriate number of reviewers, each reviewer is assigned no more than a maximum number of papers, and conflicts of interest are respected [13, 35, 36, 37, 38, 39, 10]. This optimization problem can be formulated as a linear program. We provide a detailed formulation in Appendix A. While other objective functions have been proposed [40, 41, 42], here we focus on the sum-of-similarities.
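As a sketch of this program (the full formulation with all venue-specific constraints is in Appendix A), write $\ell_{p}$ for the number of reviews required per paper, $\ell_{r}$ for the maximum reviewer load, and $\mathcal{C}$ for the set of conflicted pairs; these three symbols are our notation. The relaxed problem is

\[
\begin{aligned}
\max_{Z\in[0,1]^{|\mathcal{R}|\times|\mathcal{P}|}}\quad & \sum_{i\in\mathcal{R}\times\mathcal{P}}S_{i}Z_{i}\\
\text{s.t.}\quad & \sum_{r\in\mathcal{R}}Z_{(r,p)}=\ell_{p}\quad\forall p\in\mathcal{P},\qquad
\sum_{p\in\mathcal{P}}Z_{(r,p)}\leq\ell_{r}\quad\forall r\in\mathcal{R},\qquad
Z_{i}=0\quad\forall i\in\mathcal{C},
\end{aligned}
\]

and a standard total-unimodularity argument for such bipartite transportation constraints shows that the relaxation has an integral optimal vertex, so solving the LP yields a deterministic assignment.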

Randomized Assignment.

As one approach to strategyproofness, Jecmen et al. [22] introduce the idea of using randomization to prevent colluding reviewers and authors from being able to guarantee their assignments. Specifically, the program chairs first choose a parameter $q\in[0,1]$. Then, the algorithm computes a randomized paper assignment, where the marginal probability $P(Z_{i}=1)$ of assigning any reviewer-paper pair $i$ is at most $q$. These marginal probabilities are determined by a linear program, which maximizes the expected similarity of the assignment subject to the probability limit $q$ (detailed formulation in Appendix A). A reviewer-paper assignment is then sampled using a randomized procedure that iteratively redistributes the probability mass placed on each reviewer-paper pair until all probabilities are either zero or one.
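A sketch of the corresponding randomized program of Jecmen et al. [22] (again deferring the venue-specific details to Appendix A): the decision variables become marginal assignment probabilities, the expected sum of similarities is maximized, and each marginal is capped at $q$,

\[
\begin{aligned}
\max_{P\in[0,1]^{|\mathcal{R}|\times|\mathcal{P}|}}\quad & \sum_{i\in\mathcal{R}\times\mathcal{P}}S_{i}\,P(Z_{i}=1)\\
\text{s.t.}\quad & \text{load and conflict constraints as above, applied to the marginals},\\
& P(Z_{i}=1)\leq q\quad\forall i\in\mathcal{R}\times\mathcal{P}.
\end{aligned}
\]

The sampling procedure then realizes a single 0/1 assignment whose pair-level marginal probabilities match the LP solution.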

Review Quality.

The above assignments are chosen based on maximizing the (expected) similarities of assigned reviewer-paper pairs, but those similarities may not be accurate proxies for the quality of review that the reviewer can provide for that paper. In practice, automated similarity-based assignments result in numerous complaints of low-expertise paper assignments from both authors and reviewers [3], and recent work [43] finds that current text-similarity algorithms make significant errors in predicting reviewer expertise. Meanwhile, self-reported assessments of reviewer-paper assignment quality can be collected from the reviewers themselves after the review. Conferences often ask reviewers to score their expertise in the paper’s topic and/or confidence in their review [34, 4, 44]. Other indicators of review quality can also be considered; e.g., some conferences ask “meta-reviewers” or other reviewers to evaluate the quality of written reviews directly [45, 44]. In this work, we consider self-reported expertise and confidence as our measures of review quality.

3 Off-Policy Evaluation

One attractive property of the randomized assignment described above is that while only one reviewer-paper assignment is sampled and deployed, many other assignments could have been sampled, and those assignments could equally well have been generated by some alternative assignment policy. The positive probability of other assignments allows us to investigate whether alternative assignment policies might have resulted in higher-quality reviews.

Let $A$ be a randomized assignment policy with a probability density $P_{A}$, where $\sum_{Z\in\{0,1\}^{|\mathcal{R}|\times|\mathcal{P}|}}P_{A}(Z)=1$; $P_{A}(Z)\geq 0$ for all $Z$; and $P_{A}(Z)>0$ only for feasible assignments $Z$. Let $B$ be another policy with density $P_{B}$, defined similarly. We denote by $P_{A}(Z_{i})$ and $P_{B}(Z_{i})$ the marginal probabilities of assigning reviewer-paper pair $i$ under $A$ and $B$, respectively. Finally, let $Y_{i}\in\mathbb{R}$, where $i=(r,p)\in\mathcal{R}\times\mathcal{P}$, be the measure of the quality of reviewer $r$’s review of paper $p$, e.g., reviewer self-reported expertise or confidence as introduced in Section 2.

We follow the potential outcomes framework of causal inference [46]. Throughout this work, we will let $A$ be the on-policy or logging policy, i.e., the policy that the review data was collected under, while $B$ will denote one of several alternative policies of interest. In Section 6, we will describe the specific alternative policies we consider in this work. Define $N=\sum_{i\in\mathcal{R}\times\mathcal{P}}Z_{i}$ as the total number of reviews, fixed across policies and set ahead of time. We are interested in the following estimands:

\[
\mu_{A}=\mathbb{E}_{Z\sim P_{A}}\!\left[\frac{1}{N}\sum_{i\in\mathcal{R}\times\mathcal{P}}Y_{i}Z_{i}\right],\qquad
\mu_{B}=\mathbb{E}_{Z\sim P_{B}}\!\left[\frac{1}{N}\sum_{i\in\mathcal{R}\times\mathcal{P}}Y_{i}Z_{i}\right],
\]

where $\mu_{A}$ and $\mu_{B}$ are the expected review quality under policies $A$ and $B$, respectively.

In practice, we do not have access to all $Y_{i}$, but only to those that were assigned. Let $Z^{A}\in\{0,1\}^{|\mathcal{R}|\times|\mathcal{P}|}$ be the assignment sampled under the on-policy $A$, drawn from $P_{A}$. We define the following Horvitz-Thompson estimators of the means:

\[
\widehat{\mu}_{A}=\frac{1}{N}\sum_{i\in\mathcal{R}\times\mathcal{P}}Y_{i}Z^{A}_{i},\qquad
\widehat{\mu}_{B}=\frac{1}{N}\sum_{i\in\mathcal{R}\times\mathcal{P}}Y_{i}Z^{A}_{i}W_{i},
\quad\text{where }W_{i}=\frac{P_{B}(Z_{i})}{P_{A}(Z_{i})}\ \ \forall i\in\mathcal{R}\times\mathcal{P}.\tag{1}
\]

For now, suppose that $B$ has positive probability only where $A$ does (also known as satisfying “positivity”): $P_{A}(Z_{i})>0$ for all $i\in\mathcal{R}\times\mathcal{P}$ where $P_{B}(Z_{i})>0$. Then, all weights $W_{i}$ where $P_{B}(Z_{i})>0$ are bounded. As we will see, many policies of interest $B$ go beyond the support of $A$.

Under the positivity assumption, $\widehat{\mu}_{A}$ and $\widehat{\mu}_{B}$ are unbiased estimators of $\mu_{A}$ and $\mu_{B}$, respectively [47]. Moreover, the Horvitz-Thompson estimator is admissible in the class of all unbiased estimators [48]. Note that $\widehat{\mu}_{A}$ is simply the empirical mean of the observed assignment sampled on-policy, and $\widehat{\mu}_{B}$ is a weighted mean of the observed assignment based on inverse probability weighting: placing weights greater than one on reviewer-paper pairs that are more likely off- than on-policy and weights less than or equal to one otherwise. These estimators also rely on a standard causal inference assumption of no interference. In Appendix B, we discuss the implications of this assumption in the peer review context.
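As a minimal sketch of estimator (1) in code (our own helper, assuming the marginal probabilities $P_A(Z_i)$ and $P_B(Z_i)$ have already been computed by the assignment LPs):

```python
import numpy as np

def ht_estimates(Y, Z_A, P_A, P_B, N):
    """Horvitz-Thompson on- and off-policy estimates of mean review quality.

    Y   : outcomes (e.g., self-reported expertise); only entries with
          Z_A == 1 are used, so unobserved entries may hold any placeholder
    Z_A : 0/1 indicators of assignment under the logging policy A
    P_A : marginal assignment probabilities under A
    P_B : marginal assignment probabilities under the alternative policy B
    N   : total number of reviews (fixed across policies)
    """
    mu_A = np.sum(Y * Z_A) / N
    # Importance weights W_i = P_B / P_A; assumes positivity holds.
    W = np.divide(P_B, P_A, out=np.zeros(len(P_B)), where=P_A > 0)
    mu_B = np.sum(Y * Z_A * W) / N
    return mu_A, mu_B
```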

Challenges.

In off-policy evaluation, we are interested in evaluating a policy $B$ based on data collected under policy $A$. However, our ability to do so is typically limited to policies where the assignments that would be made under $B$ are possible under $A$. In practice, many interesting policies step outside of the support of $A$. Outcomes for reviewer-paper pairs outside the support of $A$ but with positive probability under $B$ (“positivity violations”) cannot be estimated and must either be imputed by some model or have their contribution to the average outcome ($\mu_{B}$) bounded.

In addition to positivity violations, we identify three other mechanisms through which missing data with potential confounding may arise in the peer review context: absent reviewers, selective attrition, and manual reassignments. For absent reviewers, i.e., reviewers who have not submitted any reviews, we do not have a reason to believe that the reviews are missing due to the quality of the reviewer-paper assignment. Hence, we assume that their reviews are missing at random, and impute them with the weighted mean outcome of the observed reviews. For selective attrition, i.e., when some but not all reviews are completed, we instead employ conservative bounding techniques as for policy-based positivity violations. Finally, reviews might be missing due to manual reassignments by the program chairs, after the assignment has been sampled. As a result, the originally assigned reviews will be missing and new reviews will be added. In such cases, we treat removed assignments as attrition (i.e., bounding their contribution) and ignore the newly introduced assignments as they did not arise from any determinable process.

Concretely, we partition the reviewer-paper pairs into the following (mutually exclusive and exhaustive) sets:

  • $\mathcal{I}^{-}$: positivity violations, $\{i=(r,p)\in\mathcal{R}\times\mathcal{P}:P_{A}(Z_{i})=0\land P_{B}(Z_{i})>0\}$,

  • $\mathcal{I}^{Abs}$: missing reviews where the reviewer was absent (submitted no reviews),

  • $\mathcal{I}^{Att}$: remaining missing reviews, and

  • $\mathcal{I}^{+}$: remaining pairs without positivity violations or missing reviews, $(\mathcal{R}\times\mathcal{P})\setminus(\mathcal{I}^{Att}\cup\mathcal{I}^{Abs}\cup\mathcal{I}^{-})$.

In the next section, we present methods for imputing or bounding the contribution of $\mathcal{I}^{-}$ to the estimate of $\widehat{\mu}_{B}$, and of $\mathcal{I}^{Abs}$ and $\mathcal{I}^{Att}$ to the estimates of $\widehat{\mu}_{A}$ and $\widehat{\mu}_{B}$.

4 Imputation and Partial Identification

In the previous section, we defined three sets of reviewer-paper pairs $i$ for which outcomes $Y_{i}$ must be imputed rather than estimated: $\mathcal{I}^{-}$, $\mathcal{I}^{Abs}$, and $\mathcal{I}^{Att}$. In this section, we describe varied methods for imputing these outcomes that rely on assumptions of different strengths, including methods that output point estimates (Sections 4.1 and 4.2) and methods that output lower and upper bounds on $\widehat{\mu}_{B}$ (Sections 4.3 and 4.4). In Section 6, we apply these methods to peer-review data from two computer science venues.

For missing reviews where the reviewer is absent ($\mathcal{I}^{Abs}$), we assume that the reviewer did not participate in the review process for reasons unrelated to the assignment quality (e.g., too busy). Specifically, we assume that the reviewers are missing at random and thus impute the mean outcome among $\mathcal{I}^{+}$, the pairs with no positivity violations or missing reviews:

\[
\overline{Y}=\frac{\sum_{i\in\mathcal{I}^{+}}Y_{i}Z^{A}_{i}W_{i}}{\sum_{i\in\mathcal{I}^{+}}Z^{A}_{i}W_{i}}.
\]

Correspondingly, we set $Y_{i}=\overline{Y}$ for all $i\in\mathcal{I}^{Abs}$ in estimator (1).

In contrast, for positivity violations ($\mathcal{I}^{-}$) and the remaining missing reviews ($\mathcal{I}^{Att}$), we allow for the possibility that these reviewer-paper pairs being unobserved is correlated with their unobserved outcome. Thus, we consider imputing arbitrary values for $i$ in these subsets, which we denote by $Y^{\text{Impute}}_{i}$ and place into a matrix $Y^{\text{Impute}}\in\mathbb{R}^{|\mathcal{R}|\times|\mathcal{P}|}$, leaving entries for $i\not\in\mathcal{I}^{-}\cup\mathcal{I}^{Att}$ undefined. This strategy corresponds to setting $Y_{i}=Y^{\text{Impute}}_{i}$ for $i\in\mathcal{I}^{-}\cup\mathcal{I}^{Att}$ in estimator (1). To obtain bounds, we impute both the assumed minimal and maximal values of $Y^{\text{Impute}}_{i}$.

These modifications result in a Horvitz-Thompson off-policy estimator with imputation. To denote this, we redefine $\widehat{\mu}_{B}$ to be a function $\widehat{\mu}_{B}:\mathbb{R}^{|\mathcal{R}|\times|\mathcal{P}|}\to\mathbb{R}$, where $\widehat{\mu}_{B}(Y^{\text{Impute}})$ denotes the estimator resulting from imputing entries from a particular choice of $Y^{\text{Impute}}$:

\[
\widehat{\mu}_{B}(Y^{\text{Impute}})=\frac{1}{N}\left(\sum_{i\in\mathcal{I}^{+}}Y_{i}Z^{A}_{i}W_{i}
+\sum_{i\in\mathcal{I}^{Att}}Y^{\text{Impute}}_{i}Z^{A}_{i}W_{i}
+\sum_{i\in\mathcal{I}^{Abs}}\overline{Y}Z^{A}_{i}W_{i}
+\sum_{i\in\mathcal{I}^{-}}Y^{\text{Impute}}_{i}P_{B}(Z_{i})\right).
\]

The estimator computes the weighted mean of the observed ($Y_{i}$) and imputed outcomes ($Y^{\text{Impute}}_{i}$ and $\overline{Y}$). We impute $Y^{\text{Impute}}_{i}$ for the attrition ($\mathcal{I}^{Att}$) and positivity violation ($\mathcal{I}^{-}$) pairs, and $\overline{Y}$ for the absent reviewers ($\mathcal{I}^{Abs}$). Note that we weight the imputed positivity violations ($\mathcal{I}^{-}$) by $P_{B}(Z_{i})$ rather than $Z_{i}W_{i}$, since the latter is undefined. Under the assumption that the imputed outcomes are accurate, $\widehat{\mu}_{B}(Y^{\text{Impute}})$ is an unbiased estimator of $\mu_{B}$.
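A sketch of this imputed estimator in code (index sets represented as boolean masks; all names are ours):

```python
import numpy as np

def ht_estimate_imputed(Y, Y_impute, Z_A, W, P_B, N,
                        plus_mask, att_mask, abs_mask, viol_mask):
    """Horvitz-Thompson off-policy estimate with imputation.

    plus_mask : pairs in I^+ (no positivity violation, review observed)
    att_mask  : pairs in I^Att (assigned but review missing: attrition)
    abs_mask  : pairs in I^Abs (absent reviewers)
    viol_mask : pairs in I^- (P_A = 0 but P_B > 0)
    Y_impute  : imputed outcomes used for I^Att and I^- (e.g., y_min, y_max,
                the mean, or model predictions)
    """
    # Weighted mean outcome among I^+, imputed for absent reviewers.
    y_bar = (np.sum(Y[plus_mask] * Z_A[plus_mask] * W[plus_mask])
             / np.sum(Z_A[plus_mask] * W[plus_mask]))
    total = (np.sum(Y[plus_mask] * Z_A[plus_mask] * W[plus_mask])
             + np.sum(Y_impute[att_mask] * Z_A[att_mask] * W[att_mask])
             + np.sum(y_bar * Z_A[abs_mask] * W[abs_mask])
             + np.sum(Y_impute[viol_mask] * P_B[viol_mask]))
    return total / N
```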

To construct confidence intervals, we estimate the variance of $\widehat{\mu}_{B}(Y^{\text{Impute}})$ as follows:

\[
\widehat{\mathrm{Var}}[\widehat{\mu}_{B}(Y^{\text{Impute}})]
=\frac{1}{N^{2}}\sum_{(i,j)\in(\mathcal{R}\times\mathcal{P})^{2}}\mathrm{Cov}[Z_{i},Z_{j}]\,Z^{A}_{i}Z^{A}_{j}W_{i}W_{j}Y^{\prime}_{i}Y^{\prime}_{j},
\qquad\text{where }Y^{\prime}_{i}=
\begin{cases}
Y_{i} & \text{if }i\in\mathcal{I}^{+}\\
Y^{\text{Impute}}_{i} & \text{if }i\in\mathcal{I}^{Att}\cup\mathcal{I}^{-}\\
\overline{Y} & \text{if }i\in\mathcal{I}^{Abs}.
\end{cases}
\]

The covariance terms (taken over $Z\sim P_{A}$) are not known exactly, owing to the fact that the procedure by Jecmen et al. [22] only constrains the marginal probabilities of individual reviewer-paper pairs, but pairs of pairs can be non-trivially correlated. In the absence of a closed-form expression, we use Monte Carlo methods to tightly estimate these covariances (further details provided in Appendix C).
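A sketch of such a Monte Carlo estimate (our own helper; `sample_assignment` stands in for a sampler implementing the iterative redistribution procedure of Jecmen et al. [22], which we do not specify here). Since the variance formula includes the factor $Z^{A}_{i}Z^{A}_{j}$, only covariances among the pairs actually assigned under $A$ are needed, which keeps the matrix manageable.

```python
import numpy as np

def mc_covariances(sample_assignment, pair_indices, n_samples=1000):
    """Monte Carlo estimate of Cov[Z_i, Z_j] under the randomized assignment.

    sample_assignment : callable returning one 0/1 assignment, flattened to a
                        vector over all reviewer-paper pairs
    pair_indices      : indices of the pairs whose covariances are needed
                        (e.g., the pairs assigned under the on-policy)
    """
    draws = np.stack([sample_assignment()[pair_indices]
                      for _ in range(n_samples)])
    return np.cov(draws, rowvar=False)   # rows = samples, columns = pairs
```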

In the following subsections, we detail several methods for choosing $Y^{\text{Impute}}$. These methods rely on assumptions of varying strength about the unobserved outcomes.

4.1 Mean Imputation

As a first approach, we assume that the mean outcome within $\mathcal{I}^{+}$ is representative of the mean outcome among the other pairs. This is a strong assumption, since the presence of a pair in $\mathcal{I}^{-}$ or $\mathcal{I}^{Att}$ may not be independent of its outcome. For example, if reviewers choose not to submit reviews when the assignment quality is poor, $\overline{Y}$ is not representative of the outcomes in $\mathcal{I}^{Att}$. Nonetheless, under this strong assumption, we can simply impute the mean outcome $\overline{Y}$ for all pairs necessitating imputation. Setting $Y^{\text{Impute}}_{i}=\overline{Y}$ for all $i\in\mathcal{I}^{-}\cup\mathcal{I}^{Att}$, we consider the following point estimate of $\mu_{B}$: $\widehat{\mu}_{B}(\overline{Y})$. While following from an overly strong assumption, we find it useful to compare our findings under this assumption to findings under the subsequent weaker assumptions.

4.2 Model Imputation

Instead of simply imputing the mean outcome, we can assume that the unobserved outcomes $Y_{i}$ are some simple function of known covariates $X_{i}\in\mathbb{R}^{c}$ (where $c$ is the number of covariates) for each reviewer-paper pair $i$. If so, we can directly estimate this function using a variety of statistical models, resulting in a point estimate of $\mu_{B}$. In doing so, we implicitly take on the assumptions made by each model, which determine how to generalize the covariate-outcome mapping from the observed pairs to the unobserved pairs. These assumptions are typically quite strong, since this mapping may be very different between the observed pairs (typically good matches) and unobserved pairs (typically less good matches).

More specifically, given the set of all observed reviewer-paper pairs $\mathcal{O}=\{i\in\mathcal{I}^{+}:Z^{A}_{i}=1\}$, we train a model $m$ using the observed data $\{(X_{i},Y_{i}):i\in\mathcal{O}\}$. Let $\widehat{Y}^{(m)}\in\mathbb{R}^{|\mathcal{R}|\times|\mathcal{P}|}$ denote the outcomes predicted by that model for each pair. We then consider $\widehat{\mu}_{B}(\widehat{Y}^{(m)})$ as a point estimate of $\mu_{B}$. In our experiments, we employ standard methods for classification, ordinal regression, and collaborative filtering:

  • Logistic regression (clf-logistic);

  • Ridge classification (clf-ridge);

  • Ordered logit (ord-logit);

  • Ordered probit (ord-probit);

  • SVD++, collaborative filtering (cf-svd++);

  • K-nearest-neighbors, collaborative filtering (cf-knn).

Note that, unlike the other methods, the methods based on collaborative filtering model the missing data by using only the observed reviewer-paper outcomes ($Y$). We discuss our choice of methods, hyperparameters, and implementation details in Appendix E.
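As a sketch of the imputation step for one of these models (scikit-learn's logistic regression standing in for clf-logistic; the actual hyperparameters and the remaining model families are described in Appendix E):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def impute_with_model(X, Y, obs_idx, target_idx):
    """Fit a model on observed pairs and predict outcomes for unobserved pairs.

    X          : (n_pairs, c) covariates (bid, text similarity, subject match)
    Y          : outcomes; only entries at obs_idx are used for fitting
    obs_idx    : indices of the observed pairs O
    target_idx : indices of the pairs needing imputation (I^- and I^Att)
    """
    model = LogisticRegression(max_iter=1000)
    model.fit(X[obs_idx], Y[obs_idx].astype(int))   # scores treated as classes
    Y_hat = np.array(Y, dtype=float)
    Y_hat[target_idx] = model.predict(X[target_idx])
    return Y_hat
```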

4.3 Manski Bounds

As a more conservative approach, we can exploit the fact that the outcomes $Y_{i}$ are bounded, letting us bound the mean of the counterfactual policy without making any assumptions on how the positivity violations arise. Such bounds are often called Manski bounds [23] in the econometrics literature on partial identification. To employ Manski bounds, we assume that all outcomes $Y$ can take only values between $y_{\min}$ and $y_{\max}$; e.g., self-reported expertise and confidence scores are limited to a pre-specified range on the review questionnaire. Then, setting $Y^{\text{Impute}}_{i}=y_{\min}$ or $Y^{\text{Impute}}_{i}=y_{\max}$ for all $i\in\mathcal{I}^{-}\cup\mathcal{I}^{Att}$, we can estimate the lower and upper bounds of $\mu_{B}$ as $\widehat{\mu}_{B}(y_{\min})$ and $\widehat{\mu}_{B}(y_{\max})$, respectively.

We adopt a well-established inference procedure for constructing 95% confidence intervals that asymptotically contain the true value of $\mu_{B}$ with probability at least 95%. Following Imbens and Manski [49], we construct the interval:

\[
\widehat{\mu}_{B}^{CI}\in\biggl[\widehat{\mu}_{B}(y_{\min})-z^{\prime}_{\alpha,n}\sqrt{\widehat{\mathrm{Var}}[\widehat{\mu}_{B}(y_{\min})]/N},\ \ \widehat{\mu}_{B}(y_{\max})+z^{\prime}_{\alpha,n}\sqrt{\widehat{\mathrm{Var}}[\widehat{\mu}_{B}(y_{\max})]/N}\biggr],
\]

where the $z$-score analog $z^{\prime}_{\alpha,n}$ ($\alpha=0.95$) is set by their procedure such that the interval asymptotically has at least 95% coverage under plausible regularity conditions; for further details, see the discussion in Appendix D.
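Putting the pieces together, a sketch of the bound construction (our own helper; the adjusted critical value `z_prime` comes from the Imbens-Manski procedure discussed in Appendix D and is taken as given here):

```python
import numpy as np

def partial_id_interval(estimate_fn, var_fn, y_lo, y_hi, N, z_prime):
    """Confidence interval for mu_B under partial identification.

    estimate_fn(y) : imputed HT estimate with all unidentified outcomes set
                     to y (or, more generally, to a matrix of imputed values)
    var_fn(y)      : the corresponding variance estimate
    """
    lower = estimate_fn(y_lo) - z_prime * np.sqrt(var_fn(y_lo) / N)
    upper = estimate_fn(y_hi) + z_prime * np.sqrt(var_fn(y_hi) / N)
    return lower, upper
```

The same construction is reused below with the monotonicity and Lipschitz surrogate values in place of $y_{\min}$ and $y_{\max}$.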

4.4 Monotonicity and Lipschitz Smoothness

We now propose two styles of weak assumptions on the covariate-outcome mapping that can be leveraged to achieve tighter bounds on $\widehat{\mu}_{B}$ than the Manski bounds. In contrast to the strong modeling assumptions used in the sections on mean and model imputation, these assumptions can be more intuitively understood and justified as conservative assumptions given particular choices of covariates.

Monotonicity.

The first weak assumption we consider is a monotonicity condition. Intuitively, monotonicity captures the idea that we expect higher expertise for a reviewer-paper pair when some covariates are higher, all else equal. For example, in our experiments we use the similarity component scores (bids, text similarity, subject area match) as covariates. Specifically, for covariate vectors $X_{i}$ and $X_{j}$, define the dominance relationship $X_{i}\succ X_{j}$ to mean that $X_{i}$ is greater than or equal to $X_{j}$ in all components and $X_{i}$ is strictly greater than $X_{j}$ in at least one component. Then, the monotonicity assumption states: if $X_{i}\succ X_{j}$, then $Y_{i}\geq Y_{j}$, $\forall(i,j)\in(\mathcal{R}\times\mathcal{P})^{2}$.

Using this assumption to restrict the range of possible values for the unobserved outcomes, we seek upper and lower bounds on $\mu_{B}$. Recall that $\mathcal{O}$ is the set of all observed reviewer-paper pairs. One challenge is that the observed outcomes themselves ($Y_{i}$ for $i\in\mathcal{O}$) may violate the monotonicity condition. Thus, to find an upper or lower bound, we compute surrogate values $T_{i}\in\mathbb{R}$ that satisfy the monotonicity constraint for all $i\in\mathcal{O}\cup\mathcal{I}^{Att}\cup\mathcal{I}^{Abs}\cup\mathcal{I}^{-}$ while ensuring that the surrogate values $T_{i}$ for $i\in\mathcal{O}$ are as close as possible to the outcomes $Y_{i}$. The surrogate values $T_{i}$ for $i\in\mathcal{I}^{Att}\cup\mathcal{I}^{-}$ can then be imputed as outcomes.

Inspired by isotonic regression [50], we implement a two-level optimization problem. The primary objective minimizes the $\ell_{1}$ distance between $T_{i}$ and $Y_{i}$ for the pairs with observed outcomes $i\in\mathcal{O}$. The secondary objective either minimizes (for a lower bound) or maximizes (for an upper bound) the sum of the surrogate values $T_{i}$ for the unobserved pairs $i\in\mathcal{I}^{Att}\cup\mathcal{I}^{Abs}\cup\mathcal{I}^{-}$, weighted as in $\widehat{\mu}_{B}$. Define the universe of relevant pairs $\mathcal{U}=\mathcal{O}\cup\mathcal{I}^{Att}\cup\mathcal{I}^{Abs}\cup\mathcal{I}^{-}$ and define $\Psi$ as a very large constant. This results in the following pair of optimization problems, which compute matrices $T^{M}_{min},T^{M}_{max}\in\mathbb{R}^{|\mathcal{R}|\times|\mathcal{P}|}$ (leaving entries $i\not\in\mathcal{U}$ undefined):

\[
\begin{aligned}
(T^{M}_{min},T^{M}_{max})=\operatorname*{argmin}_{T_{i}:\,i\in\mathcal{U}}\quad & \Psi\sum_{i\in\mathcal{O}}|T_{i}-Y_{i}|\ \pm\ \left(\sum_{i\in\mathcal{I}^{Att}\cup\mathcal{I}^{Abs}}T_{i}W_{i}+\sum_{i\in\mathcal{I}^{-}}T_{i}P_{B}(Z_{i})\right)\\
\text{s.t.}\quad & T_{i}\geq T_{j},\qquad\forall(i,j)\in\mathcal{U}^{2}\ \text{with}\ X_{i}\succ X_{j},\\
& y_{\min}\leq T_{i}\leq y_{\max},\qquad\forall i\in\mathcal{U}.
\end{aligned}
\]

The sign of the second objective term determines which bound is computed: the positive sign pushes the surrogate values for the unidentified pairs down, yielding the lower bound $T^{M}_{min}$, while the negative sign yields the upper bound $T^{M}_{max}$. The last set of constraints corresponds to the same constraints used to construct the Manski bounds described earlier, combined here with monotonicity to jointly constrain the possible outcomes. The above problem can be reformulated and solved as a linear program using standard techniques.
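As a sketch of those standard techniques (our own presentation), introduce auxiliary variables $e_{i}\geq|T_{i}-Y_{i}|$ for the observed pairs; the lower-bound problem, for example, then becomes the linear program

\[
\begin{aligned}
\min_{T,\,e}\quad & \Psi\sum_{i\in\mathcal{O}}e_{i}+\left(\sum_{i\in\mathcal{I}^{Att}\cup\mathcal{I}^{Abs}}T_{i}W_{i}+\sum_{i\in\mathcal{I}^{-}}T_{i}P_{B}(Z_{i})\right)\\
\text{s.t.}\quad & e_{i}\geq T_{i}-Y_{i},\quad e_{i}\geq Y_{i}-T_{i},\qquad\forall i\in\mathcal{O},\\
& T_{i}\geq T_{j},\qquad\forall(i,j)\in\mathcal{U}^{2}\ \text{with}\ X_{i}\succ X_{j},\\
& y_{\min}\leq T_{i}\leq y_{\max},\qquad\forall i\in\mathcal{U},
\end{aligned}
\]

with the upper-bound problem obtained by flipping the sign of the second objective term.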

This procedure gives the following confidence intervals for $\mu_{B}$,

\[
\widehat{\mu}_{B|M}^{CI}\in\biggl[\widehat{\mu}_{B}(T^{M}_{min})-z^{\prime}_{\alpha,n}\sqrt{\widehat{\mathrm{Var}}[\widehat{\mu}_{B}(T^{M}_{min})]/N},\ \ \widehat{\mu}_{B}(T^{M}_{max})+z^{\prime}_{\alpha,n}\sqrt{\widehat{\mathrm{Var}}[\widehat{\mu}_{B}(T^{M}_{max})]/N}\biggr],
\]

where the value $z^{\prime}_{\alpha,n}$ is again set by the procedure of Imbens and Manski [49] (see discussion in Appendix D).

Lipschitz Smoothness.

The second weak assumption we consider is a Lipschitz smoothness assumption on the correspondence between covariates and outcomes. Intuitively, this captures the idea that we expect two reviewer-paper pairs that are very similar in covariate space to have similar expertise. For covariate vectors $X_{i}$ and $X_{j}$, define $d(X_{i},X_{j})$ as some notion of distance between the covariates. Then, the Lipschitz assumption states that there exists a constant $L$ such that $|Y_{i}-Y_{j}|\leq L\,d(X_{i},X_{j})$ for all $(i,j)\in(\mathcal{R}\times\mathcal{P})^{2}$. In practice, we can choose an appropriate value of $L$ by studying the many pairs of observed outcomes in the data (Section 5.2 and Appendix G), though this approach assumes that the Lipschitz smoothness of the covariate-outcome function is the same for observed and unobserved pairs.

As in the previous section, we introduce surrogate values $T_{i}\in\mathbb{R}$ and implement a two-level optimization problem to address Lipschitz violations within the observed outcomes (i.e., if two observed pairs are very close in covariate space but have different outcomes). Defining $\mathcal{U}$ and $\Psi$ as above, this results in the following pair of optimization problems, which compute matrices $T^{L}_{min},T^{L}_{max}\in\mathbb{R}^{|\mathcal{R}|\times|\mathcal{P}|}$ (leaving entries $i\not\in\mathcal{U}$ undefined):

\[
\begin{aligned}
(T^{L}_{min},T^{L}_{max})=\operatorname*{argmin}_{T_{i}:\,i\in\mathcal{U}}\quad & \Psi\sum_{i\in\mathcal{O}}|T_{i}-Y_{i}|\ \pm\ \left(\sum_{i\in\mathcal{I}^{Att}\cup\mathcal{I}^{Abs}}T_{i}W_{i}+\sum_{i\in\mathcal{I}^{-}}T_{i}P_{B}(Z_{i})\right)\\
\text{s.t.}\quad & |T_{i}-T_{j}|\leq L\,d(X_{i},X_{j}),\qquad\forall(i,j)\in\mathcal{U}^{2},\\
& y_{\min}\leq T_{i}\leq y_{\max},\qquad\forall i\in\mathcal{U}.
\end{aligned}
\]

As before, the sign of the second objective term determines whether the lower bound $T^{L}_{min}$ (positive sign) or the upper bound $T^{L}_{max}$ (negative sign) is computed. The last set of constraints is again the same as in the Manski bounds described earlier, combined here with the Lipschitz assumption to jointly constrain the possible outcomes. In the limit as $L\to\infty$, the Lipschitz constraints become vacuous and we recover the Manski bounds. This problem can again be reformulated and solved as a linear program using standard techniques.
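A sketch of this computation using a generic convex-optimization modeler (cvxpy here; the solver, helper names, and dense pairwise constraint loop are ours for illustration, and at conference scale the pairwise constraints would need to be restricted to nearby pairs or otherwise sparsified):

```python
import cvxpy as cp
import numpy as np

def lipschitz_surrogates(Y_obs, obs_idx, weights, D, L, y_min, y_max,
                         upper=True, psi=1e9):
    """Solve for surrogate outcomes T over the universe U under the Lipschitz
    constraint; `weights` holds W_i (or P_B(Z_i) for positivity violations)
    for the unidentified pairs and zero for observed pairs.
    """
    n = D.shape[0]                       # |U|, with D the pairwise distances
    T = cp.Variable(n)
    fit = psi * cp.sum(cp.abs(T[obs_idx] - Y_obs))
    push = weights @ T                   # weighted sum over unidentified pairs
    objective = cp.Minimize(fit - push if upper else fit + push)
    constraints = [T >= y_min, T <= y_max]
    for i in range(n):
        for j in range(i + 1, n):
            constraints.append(cp.abs(T[i] - T[j]) <= L * D[i, j])
    cp.Problem(objective, constraints).solve()
    return T.value                       # T^L_max if upper else T^L_min
```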

This procedure gives the following confidence intervals for $\mu_{B}$,

\[
\widehat{\mu}_{B|L}^{CI}\in\biggl[\widehat{\mu}_{B}(T^{L}_{min})-z^{\prime}_{\alpha,n}\sqrt{\widehat{\mathrm{Var}}[\widehat{\mu}_{B}(T^{L}_{min})]/N},\ \ \widehat{\mu}_{B}(T^{L}_{max})+z^{\prime}_{\alpha,n}\sqrt{\widehat{\mathrm{Var}}[\widehat{\mu}_{B}(T^{L}_{max})]/N}\biggr],
\]

where the value $z^{\prime}_{\alpha,n}$ is again set by the procedure of Imbens and Manski [49] (see Appendix D).

5 Experimental Setup

We apply our framework to data from two venues that used randomized paper assignments as described in Section 2: the 2021 Workshop on Theory and Practice of Differential Privacy (TPDP) and the 2022 AAAI Conference on Artificial Intelligence (AAAI). In both settings, we aim to understand the effect that changing parameters of the assignment policies would have on review quality. The analyses were approved by our institutions’ IRBs.

5.1 Datasets

TPDP.

The TPDP workshop received 95 submissions and had a pool of 35 reviewers. Each paper received exactly 3 reviews, and each reviewer was assigned 8 or 9 reviews, for a total of 285 assigned reviewer-paper pairs. The reviewers were asked to bid on the papers and could place one of the following bids (the corresponding value of $B_{i}$ is shown in parentheses): “very low” ($-1$), “low” ($-0.5$), “neutral” ($0$), “high” ($0.5$), or “very high” ($1$), with “neutral” as the default. The similarity for each reviewer-paper pair was defined as a weighted sum of the bid score, $B_{i}$, and the text-similarity score, $T_{i}$: $S_{i}=w_{\mathrm{text}}T_{i}+(1-w_{\mathrm{text}})B_{i}$, with $w_{\mathrm{text}}=0.5$. The randomized assignment was run with an upper bound of $q=0.5$. In their reviews, the reviewers were asked to assess the alignment between the paper and their expertise (from 1: irrelevant to 4: very relevant) and to report their review confidence (from 1: educated guess to 5: absolutely certain). We consider these two responses as our measures of quality. Once the assignment was generated, the organizers manually changed three reviewer-paper assignments, which we handle using the techniques discussed in Section 4.

AAAI.

In the AAAI conference, submissions were assigned to reviewers in multiple sequential stages across two rounds of submissions. We examine the stage of the first round in which the randomized assignment algorithm was used to assign all submissions to a pool of “senior reviewers.” The assignment involved 8,450 papers and 3,145 reviewers; each paper was assigned to one reviewer, and each reviewer was assigned at most 3 or 4 papers based on their primary subject area. The similarity $S_{i}$ for every reviewer-paper pair $i$ was based on three scores: text similarity $T_{i}\in[0,1]$, subject-area score $K_{i}\in[0,1]$, and bid $B_{i}$. Bids were chosen from the following list (with the corresponding value of $B_{i}$ shown in parentheses, where $\lambda_{\mathrm{bid}}=1$ is a parameter scaling the impact of positive bids relative to neutral/negative bids): “not willing” ($0.05$), “not entered” ($1$), “in a pinch” ($1+0.5\lambda_{\mathrm{bid}}$), “willing” ($1+1.5\lambda_{\mathrm{bid}}$), “eager” ($1+3\lambda_{\mathrm{bid}}$). The default option was “not entered”. Similarities were computed as $S_{i}=(w_{\mathrm{text}}T_{i}+(1-w_{\mathrm{text}})K_{i})^{1/B_{i}}$, with $w_{\mathrm{text}}=0.75$. The actual similarities differed from this base formula in a few special cases (e.g., missing data); we provide the full description of the similarity computation in Appendix F. The randomized assignment was run with $q=0.52$. Reviewers reported an expertise score (from 0: not knowledgeable to 5: expert) and a confidence score (from 0: not confident to 4: very confident), which we consider as our quality measures. After reviewers were assigned, several assignments were manually changed by the conference organizers, and several assigned reviews were simply not submitted; we handle these cases as described in Section 4.
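For reference, a sketch of this bid encoding and base similarity in code (special cases such as missing components, detailed in Appendix F, are omitted; the function names are ours):

```python
def aaai_bid_value(bid, lambda_bid=1.0):
    """Map an AAAI'22 bid option to its numeric value B_i."""
    return {
        "not willing": 0.05,
        "not entered": 1.0,
        "in a pinch":  1.0 + 0.5 * lambda_bid,
        "willing":     1.0 + 1.5 * lambda_bid,
        "eager":       1.0 + 3.0 * lambda_bid,
    }[bid]

def aaai_similarity(T, K, B, w_text=0.75):
    """Base AAAI'22 similarity: (w_text*T + (1-w_text)*K)^(1/B)."""
    return (w_text * T + (1 - w_text) * K) ** (1.0 / B)

# Example: an "eager" bid with the on-policy lambda_bid = 1 gives B_i = 4,
# which pulls the base score strongly toward 1.
print(aaai_similarity(T=0.6, K=0.8, B=aaai_bid_value("eager")))
```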

5.2 Assumption Suitability

For both the monotonicity and Lipschitz assumptions (as well as the model imputations), we work with the covariates $X_{i}$, a vector of the two (TPDP) or three (AAAI) component scores used in the computation of similarities. We now consider whether these assumptions are reasonable with respect to our choices of outcome variables and of covariates.

Monotonicity.

Monotonicity assumes that when any component of the covariates increases, the review quality should not be lower. We can test this assumption on the observed outcomes: among all pairs of reviewer-paper pairs with both outcomes observed, 65.7% (TPDP) / 28.0% (AAAI) have a dominance relationship ($X_{i}\succ X_{j}$), and of those pairs, 79.8% (TPDP) / 76.4% (AAAI) satisfy the monotonicity condition when using expertise as an outcome and 76% (TPDP) / 78.9% (AAAI) when using confidence as an outcome. The fraction of dominant pairs for TPDP is higher since we consider only two covariates.

Figure 1: CCDF of the $L=|Y_{i}-Y_{j}|/d(X_{i},X_{j})$ values for all pairs of observed points, where the $Y$s are expertise scores. The dashed lines denote the $L$ values corresponding to less than 10%, 5%, and 1% violations. For TPDP, these values are $L=30,50,300$, respectively; for AAAI, $L=30,40,100$.

Lipschitz Smoothness.

For the Lipschitz assumption, a choice of distance in covariate space is required. We choose the $\ell_{1}$ distance, normalized in each dimension so that all component distances lie in $[0,1]$, and divided by the number of dimensions. For AAAI, some reviewer-paper pairs are missing a covariate; if so, we impute a distance of $1$ in that component. We then choose several potential Lipschitz constants $L$ by analyzing the reviewer-paper pairs with observed outcomes. In Figure 1, we plot the fraction of pairs of observations that violate the Lipschitz condition for a given value of $L$ with respect to expertise; we show the corresponding plots for confidence in Appendix K. In our later experiments, we use values of $L$ corresponding to less than 10%, 5%, and 1% violations from these plots.

With these choices, the Lipschitz assumptions correspond to beliefs that the outcome does not change too much as the similarity components change. As one example, for $L=30$ on AAAI, when one similarity component differs by $0.1$, the outcomes can differ by at most $1$. Effectively, the imputed outcome of each unobserved pair is restricted to be relatively close to the outcome for the closest observed pair. In Appendix G, we examine the distribution of distances between unobserved reviewer-paper pairs and their nearest observed pair, observing median distances of 0.0014 (TPDP) and 0.0011 (AAAI) across the pairs violating positivity under any of the modified similarity functions that we analyze in what follows. We conclude that most imputed pairs are very close to some observed pair, and even large values of $L$ can significantly decrease the size of the bound when compared to the Manski bounds.
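A sketch of how such constants can be read off from the observed pairs (our own helper; `X_obs` holds the per-dimension-normalized covariates of the observed pairs and `Y_obs` the corresponding outcomes):

```python
import numpy as np

def lipschitz_constants(X_obs, Y_obs, viol_fracs=(0.10, 0.05, 0.01)):
    """Empirical L values such that at most a given fraction of observed
    pairs violate |Y_i - Y_j| <= L * d(X_i, X_j)."""
    n, c = X_obs.shape
    iu, ju = np.triu_indices(n, k=1)              # all pairs of observations
    # Normalized l1 distance, averaged over the c covariate dimensions.
    d = np.abs(X_obs[iu] - X_obs[ju]).sum(axis=1) / c
    dy = np.abs(Y_obs[iu] - Y_obs[ju])
    ratios = dy[d > 0] / d[d > 0]                 # implied L for each pair
    return {f: np.quantile(ratios, 1 - f) for f in viol_fracs}
```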

In solving the optimization problems for both the monotonicity and Lipschitz methods, we choose the constant $\Psi=10^{9}$ to be large enough such that the first term of the objective dominates the second, while not causing numerical instability issues.

6 Results

We now present the analyses of the two datasets using the methods introduced in Section 4. For brevity, we report our analysis using self-reported expertise as the quality measure $Y$, and include the results using self-reported confidence in Appendix K. When solving the LPs that output alternative randomized assignments (Appendix A), we often encounter multiple distinct optimal solutions and employ a persistent, arbitrary tie-breaking procedure to choose among them (Appendix I).

Figure 2: Expertise of off-policies varying $w_{\mathrm{text}}$ and $q$ for TPDP, and $w_{\mathrm{text}}$, $\lambda_{\mathrm{bid}}$, and $q$ for AAAI, computed using the different estimation methods described in Section 4. The dashed blue lines indicate Manski bounds around the on-policy expertise and the grey lines indicate Manski bounds around the off-policy expertise. The error bands, denoted $\widehat{\mu}_{B}^{CI}$ for the Manski bounds, $\widehat{\mu}_{B|M}^{CI}$ for the monotonicity bounds, and $\widehat{\mu}_{B|L}^{CI}$ for the Lipschitz bounds, represent confidence intervals that asymptotically contain the true value of $\mu_{B}$ with probability at least 95%, as described in Appendix D. We observe similar patterns using confidence as an outcome, as shown in Figure 7 in the Appendix. Note that the vertical axis does not start at zero in order to focus on the most relevant regions of the plots.

TPDP.

We perform two analyses on the TPDP data, shown in Figure 2 (left). First, we analyze the choice of how to interpolate between the bids and the text similarity when computing the composite similarity score for each reviewer-paper pair. We examine a range of assignments, from an assignment based only on the bids ($w_{\mathrm{text}}=0$) to an assignment based only on the text similarity ($w_{\mathrm{text}}=1$), focusing our off-policy evaluation on deterministic assignments (i.e., policies with $q=1$). Setting $w_{\mathrm{text}}\in[0,0.75]$ results in very similar assignments, each of which has Manski bounds overlapping with the on-policy. Within this region, the models, monotonicity bounds, and Lipschitz bounds all agree that the expertise is similar to the on-policy. However, setting $w_{\mathrm{text}}\in(0.75,0.9)$ results in a significant improvement in average expertise, even without any additional assumptions. Finally, setting $w_{\mathrm{text}}\in(0.9,1]$ leads to assignments that are significantly different from the assignments supported by the on-policy, which results in many positivity violations and wider confidence intervals, even under the monotonicity and Lipschitz smoothness assumptions. Note that within this region, the models significantly disagree on the expertise, indicating that the strong assumptions made by such models may not be accurate. Altogether, these results suggest that putting more weight on the text similarity (versus bids) leads to higher-expertise reviews.

Second, we investigate the “cost of randomization” to prevent fraud, measuring the effect of increasing $q$ and thereby reducing randomness in the optimized random assignment. We consider values between $q=0.4$ and $q=1$ (the optimal deterministic assignment); recall that the on-policy has $q=0.5$. When varying $q$, we find that, except for a small increase in the region around $q=0.75$, the average expertise for policies with $q>0.5$ is very similar to that of the on-policy. This result suggests that using a randomized instead of a deterministic policy does not lead to a significant reduction in self-reported expertise, an observation that should be contrasted with the previously documented reduction in the expected sum-of-similarities objective under randomized assignment [22]; see further analysis in Appendix H.

AAAI.

We perform three analyses on the AAAI data, shown in Figure 2 (right). First, we examine the effect of interpolating between the text-similarity scores and the subject area scores by varying $w_{\mathrm{text}}\in[0,1]$, again considering only deterministic policies (i.e., $q=1$). The on-policy sets $w_{\mathrm{text}}=0.75$. Due to large numbers of positivity violations, the Manski bounds are uninformative and so we turn to the other estimators. The model imputation analysis indicates that policies with $w_{\mathrm{text}}\geq 0.75$ may have slightly higher expertise than the on-policy and indicates lower expertise in the region where $w_{\mathrm{text}}\leq 0.5$. However, the models differ somewhat in their predictions for low $w_{\mathrm{text}}$, indicating that the assumptions made by these models may not be reliable. The monotonicity bounds more clearly indicate low expertise compared to the on-policy when $w_{\mathrm{text}}\leq 0.25$, but are also slightly more pessimistic about the $w_{\mathrm{text}}\geq 0.75$ region than the models. The Lipschitz bounds indicate slightly higher than on-policy expertise for $w_{\mathrm{text}}\geq 0.75$ and potentially suggest slightly lower than on-policy expertise for $w_{\mathrm{text}}\leq 0.25$. Overall, all methods of analysis indicate that low values of $w_{\mathrm{text}}$ result in worse assignments, but the effect of considerably increasing $w_{\mathrm{text}}$ is unclear.

Second, we examine the effect of increasing the weight on positive bids by varying the values of $\lambda_{\mathrm{bid}}$. Recall that $\lambda_{\mathrm{bid}}=1$ corresponds to the on-policy and a higher (respectively lower) value of $\lambda_{\mathrm{bid}}$ indicates greater (respectively lesser) priority given to positive bids relative to neutral/negative bids. We investigate policies that vary $\lambda_{\mathrm{bid}}$ within the range $[0,3]$, and again consider only deterministic policies (i.e., $q=1$). The Manski bounds are again too wide to be informative. The models all indicate similar values of expertise for all values of $\lambda_{\mathrm{bid}}$ and are all slightly more optimistic about expertise than the Manski bounds around the on-policy. The monotonicity and Lipschitz bounds both agree that the $\lambda_{\mathrm{bid}}\geq 1$ region has slightly higher expertise as compared to the on-policy. Overall, our analyses provide some indication that increasing $\lambda_{\mathrm{bid}}$ may result in slightly higher levels of expertise.

Finally, we also examine the effect of varying $q$ within the range $[0.4,1]$ (the “cost of randomization”). Recall that the on-policy sets $q=0.52$. We see that the models, the monotonicity bounds, and the Lipschitz bounds all strongly agree that the region $q\geq 0.6$ has slightly higher expertise than the region $q\in[0.4,0.6]$. However, the magnitude of this change is small, indicating that the “cost of randomization” is not very significant.

Power Investigation: Purposefully Bad Policies.

As many of the off-policy assignments we consider have relatively similar estimated quality, we also ran additional analyses to show that our methods can discern differences between good policies (optimized toward high reviewer-paper similarity assignments) and policies intentionally chosen to have poor quality (“optimized” toward low reviewer-paper similarity assignments). We refer the interested reader to Appendix J for further discussion.

7 Discussion and Conclusion

In this work, we evaluate the quality of off-policy reviewer-paper assignments in peer review using data from two venues that deployed randomized reviewer assignments. We propose new techniques for partial identification that allow us to draw useful conclusions about the off-policy review quality, even in the presence of large numbers of positivity violations and missing reviews.

One limitation of off-policy evaluation is that our ability to make inferences inherently depends on the amount of randomness introduced on-policy. For instance, if there is only a small amount of randomness, we will be able to estimate only policies that are relatively close to the on-policy, unless we are willing to make some assumptions. The approaches presented in this work allow us to examine the strength of the evidence under a wide range of types and strengths of assumptions—model imputation, boundedness of the outcome, monotonicity, and Lipschitz smoothness—and to test whether these assumptions lead to converging conclusions. For a more theoretical treatment of the methods proposed in this work, we refer the interested reader to Khan et al. [51].

Our work opens many avenues for future work. In the context of peer review, the present work considers only a few parameterized slices of the vast space of reviewer-paper assignment policies, while there are many other substantive questions that our methodology can be used to answer. For instance, one could evaluate assignment quality under a different method of computing similarity scores (e.g., different NLP algorithms [52]), additional constraints on the assignment (e.g., based on seniority or geographic diversity [4]), or objective functions other than the sum-of-similarities (e.g., various fairness-based objectives [40, 41, 53, 54]). Additional thought should also be given to the trade-offs between maximizing review quality vs. broader considerations of reviewer welfare: while assignments based on high text similarity may yield slightly higher-quality reviews, reviewers may be more willing to review again if the assignment policy more closely follows their bids. Beyond peer review, our work is applicable to off-policy evaluation in other matching problems, including education [55, 56], advertising [27], and ride-sharing [28]. Furthermore, our methods for partial identification under monotonicity and Lipschitz smoothness assumptions should be of independent interest for off-policy evaluation work more broadly.

8 Acknowledgements

We thank Gautam Kamath and Rachel Cummings for allowing us to conduct this study in TPDP and Melisa Bok and Celeste Martinez Gomez from OpenReview for helping us with the OpenReview APIs. We are also grateful to Samir Khan and Tal Wagner for helpful discussions. This work was supported in part by NSF CAREER Award 2143176, NSF CAREER Award 1942124, NSF CIF 1763734, and ONR N000142212181.

References

  • [1] Jim McCullough. First comprehensive survey of NSF applicants focuses on their concerns about proposal review. Science, Technology, & Human Values, 1989.
  • [2] Marko A. Rodriguez, Johan Bollen, and Herbert Van de Sompel. Mapping the bid behavior of conference referees. Journal of Informetrics, 1(1):68–82, 2007.
  • [3] Terne Thorn Jakobsen and Anna Rogers. What factors should paper-reviewer assignments rely on? community perspectives on issues and ideals in conference peer-review. In Conference of the North American Chapter of the Association for Computational Linguistics, pages 4810–4823, 2022.
  • [4] Kevin Leyton-Brown, Mausam, Yatin Nandwani, Hedayat Zarkoob, Chris Cameron, Neil Newman, and Dinesh Raghu. Matching papers and reviewers at large conferences. arXiv preprint arXiv:2202.12273, 2022.
  • [5] Nihar B. Shah. Challenges, experiments, and computational solutions in peer review. Communications of the ACM, 65(6):76–87, 2022.
  • [6] Vittorio Demicheli, Carlo Di Pietrantonj, and Cochrane Methodology Review Group. Peer review for improving the quality of grant applications. Cochrane Database of Systematic Reviews, 2010(1), 1996.
  • [7] Mikael Fogelholm, Saara Leppinen, Anssi Auvinen, Jani Raitanen, Anu Nuutinen, and Kalervo Väänänen. Panel discussion does not improve reliability of peer review for medical research grant proposals. Journal of Clinical Epidemiology, 65:47–52, 08 2011.
  • [8] Michael R Merrifield and Donald G Saari. Telescope time without tears: a distributed approach to peer review. Astronomy & Geophysics, 50(4):4–16, 2009.
  • [9] Wolfgang E Kerzendorf, Ferdinando Patat, Dominic Bordelon, Glenn van de Ven, and Tyler A Pritchard. Distributed peer review enhanced with natural language processing and machine learning. Nature Astronomy, pages 1–7, 2020.
  • [10] Laurent Charlin, Richard S. Zemel, and Craig Boutilier. A framework for optimizing paper matching. In Uncertainty in Artificial Intelligence, volume 11, pages 86–95, 2011.
  • [11] Simon Price and Peter A. Flach. Computational support for academic peer review: A perspective from artificial intelligence. Communications of the ACM, 60(3):70–79, 2017.
  • [12] Baochun Li and Y Thomas Hou. The new automated IEEE INFOCOM review assignment system. IEEE Network, 30(5):18–24, 2016.
  • [13] Laurent Charlin and Richard S. Zemel. The Toronto Paper Matching System: An automated paper-reviewer assignment system. In ICML Workshop on Peer Reviewing and Publishing Models, 2013.
  • [14] Neil D. Lawrence. The NIPS experiment. https://inverseprobability.com/2014/12/16/the-nips-experiment, 2014. Accessed May 17, 2023.
  • [15] Eric Price. The NIPS experiment. http://blog.mrtz.org/2014/12/15/the-nips-experiment.html, 2014. Accessed May 17, 2023.
  • [16] Andrew Tomkins, Min Zhang, and William D. Heavlin. Reviewer bias in single- versus double-blind peer review. Proceedings of the National Academy of Sciences, 114(48):12708–12713, 2017.
  • [17] Ivan Stelmakh, Charvi Rastogi, Nihar B Shah, Aarti Singh, and Hal Daumé III. A large scale randomized controlled trial on herding in peer-review discussions. arXiv preprint arXiv:2011.15083, 2020.
  • [18] Alina Beygelzimer, Yann Dauphin, Percy Liang, and Jennifer Wortman Vaughan. The NeurIPS 2021 consistency experiment. https://blog.neurips.cc/2021/12/08/the-neurips-2021-consistency-experiment/, 2021. Accessed May 17, 2023.
  • [19] Mathias Lecuyer, Joshua Lockerman, Lamont Nelson, Siddhartha Sen, Amit Sharma, and Aleksandrs Slivkins. Harvesting randomness to optimize distributed systems. In ACM Workshop on Hot Topics in Networks, pages 178–184, 2017.
  • [20] Michael Littman. Collusion rings threaten the integrity of computer science research. Communications of the ACM, 2021.
  • [21] T. N. Vijaykumar. Potential organized fraud in ACM/IEEE computer architecture conferences. https://medium.com/@tnvijayk/potential-organized-fraud-in-acm-ieee-computer-architecture-conferences-ccd61169370d, 2020. Accessed May 17, 2023.
  • [22] Steven Jecmen, Hanrui Zhang, Ryan Liu, Nihar B. Shah, Vincent Conitzer, and Fei Fang. Mitigating manipulation in peer review via randomized reviewer assignments. Advances in Neural Information Processing Systems, 2020.
  • [23] Charles F Manski. Nonparametric bounds on treatment effects. The American Economic Review, 80(2):319–323, 1990.
  • [24] Noveen Sachdeva, Yi Su, and Thorsten Joachims. Off-policy bandits with deficient support. In ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 965–975, 2020.
  • [25] Tobias Schnabel, Adith Swaminathan, Ashudeep Singh, Navin Chandak, and Thorsten Joachims. Recommendations as treatments: Debiasing learning and evaluation. In International Conference on Machine Learning, pages 1670–1679. PMLR, 2016.
  • [26] Alexandre Gilotte, Clément Calauzènes, Thomas Nedelec, Alexandre Abraham, and Simon Dollé. Offline A/B testing for recommender systems. In ACM International Conference on Web Search and Data Mining, pages 198–206, 2018.
  • [27] Léon Bottou, Jonas Peters, Joaquin Quiñonero-Candela, Denis X Charles, D Max Chickering, Elon Portugaly, Dipankar Ray, Patrice Simard, and Ed Snelson. Counterfactual reasoning and learning systems: The example of computational advertising. Journal of Machine Learning Research, 14(11), 2013.
  • [28] Alex Wood-Doughty and Cameron Bruggeman. The incentives platform at Lyft. In ACM International Conference on Web Search and Data Mining, pages 1654–1654, 2022.
  • [29] David Mimno and Andrew McCallum. Expertise modeling for matching papers with reviewers. In ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 500–509. ACM, 2007.
  • [30] Xiang Liu, Torsten Suel, and Nasir Memon. A robust model for paper reviewer assignment. In ACM Conference on Recommender Systems, pages 25–32, 2014.
  • [31] Marko A. Rodriguez and Johan Bollen. An algorithm to determine peer-reviewers. In ACM Conference on Information and Knowledge Management, pages 319–328. ACM, 2008.
  • [32] Hong Diep Tran, Guillaume Cabanac, and Gilles Hubert. Expert suggestion for conference program committees. In International Conference on Research Challenges in Information Science, pages 221–232, May 2017.
  • [33] Graham Neubig, John Wieting, Arya McCarthy, Amanda Stent, Natalie Schluter, and Trevor Cohn. ACL reviewer matching code. https://github.com/acl-org/reviewer-paper-matching, 2020. Accessed May 17, 2023.
  • [34] Nihar B. Shah, Behzad Tabibian, Krikamol Muandet, Isabelle Guyon, and Ulrike Von Luxburg. Design and analysis of the NIPS 2016 review process. Journal of Machine Learning Research, 2018.
  • [35] Cheng Long, Raymond Wong, Yu Peng, and Liangliang Ye. On good and fair paper-reviewer assignment. In IEEE International Conference on Data Mining, pages 1145–1150, 12 2013.
  • [36] Judy Goldsmith and Robert H. Sloan. The AI conference paper assignment problem. AAAI Workshop, WS-07-10:53–57, 12 2007.
  • [37] Wenbin Tang, Jie Tang, and Chenhao Tan. Expertise matching via constraint-based optimization. In International Conference on Web Intelligence and Intelligent Agent Technology, pages 34–41. IEEE Computer Society, 2010.
  • [38] Peter A. Flach, Sebastian Spiegler, Bruno Golénia, Simon Price, John Guiver, Ralf Herbrich, Thore Graepel, and Mohammed J. Zaki. Novel tools to streamline the conference review process: Experiences from SIGKDD’09. SIGKDD Explorations Newsletter, 11(2):63–67, May 2010.
  • [39] Camillo J. Taylor. On the optimal assignment of conference papers to reviewers. Technical report, Department of Computer and Information Science, University of Pennsylvania, 2008.
  • [40] Ivan Stelmakh, Nihar B. Shah, and Aarti Singh. PeerReview4All: Fair and accurate reviewer assignment in peer review. In Algorithmic Learning Theory, 2019.
  • [41] Ari Kobren, Barna Saha, and Andrew McCallum. Paper matching with local fairness constraints. In ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1247–1257, 2019.
  • [42] Komal Dhull, Steven Jecmen, Pravesh Kothari, and Nihar B. Shah. The price of strategyproofing peer assessment. In AAAI Conference on Human Computation and Crowdsourcing, 2022.
  • [43] Ivan Stelmakh, John Wieting, Graham Neubig, and Nihar B. Shah. A gold standard dataset for the reviewer assignment problem. arXiv preprint arXiv:2303.16750, 2023.
  • [44] Ivan Stelmakh, Nihar B. Shah, Aarti Singh, and Hal Daumé III. A novice-reviewer experiment to address scarcity of qualified reviewers in large conferences. In AAAI Conference on Artificial Intelligence, volume 35, pages 4785–4793, 2021.
  • [45] Ines Arous, Jie Yang, Mourad Khayati, and Philippe Cudré-Mauroux. Peer grading the peer reviews: A dual-role approach for lightening the scholarly paper review process. In Web Conference 2021, pages 1916–1927, 2021.
  • [46] Donald B Rubin. Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66(5):688, 1974.
  • [47] Daniel G Horvitz and Donovan J Thompson. A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association, 47(260):663–685, 1952.
  • [48] VP Godambe and VM Joshi. Admissibility and bayes estimation in sampling finite populations. i. The Annals of Mathematical Statistics, 36(6):1707–1722, 1965.
  • [49] Guido W. Imbens and Charles F. Manski. Confidence intervals for partially identified parameters. Econometrica, 2004.
  • [50] Richard E Barlow and Hugh D Brunk. The isotonic regression problem and its dual. Journal of the American Statistical Association, 67(337):140–147, 1972.
  • [51] Samir Khan, Martin Saveski, and Johan Ugander. Off-policy evaluation beyond overlap: partial identification through smoothness. arXiv preprint arXiv:2305.11812, 2023.
  • [52] Arman Cohan, Sergey Feldman, Iz Beltagy, Doug Downey, and Daniel S Weld. SPECTER: Document-level representation learning using citation-informed transformers. In Annual Meeting of the Association for Computational Linguistics, pages 2270–2282, 2020.
  • [53] Justin Payan and Yair Zick. I will have order! optimizing orders for fair reviewer assignment. In International Joint Conference on Artificial Intelligence, 2022.
  • [54] Jing Wu Lian, Nicholas Mattei, Renee Noble, and Toby Walsh. The conference paper assignment problem: Using order weighted averages to assign indivisible goods. In AAAI Conference on Artificial Intelligence, volume 32, 2018.
  • [55] David J Deming, Justine S Hastings, Thomas J Kane, and Douglas O Staiger. School choice, school quality, and postsecondary attainment. American Economic Review, 104(3):991–1013, 2014.
  • [56] Joshua D Angrist, Parag A Pathak, and Christopher R Walters. Explaining charter school effectiveness. American Economic Journal: Applied Economics, 5(4):1–27, 2013.
  • [57] Yichong Xu, Han Zhao, Xiaofei Shi, and Nihar B Shah. On strategyproof conference peer review. In International Joint Conference on Artificial Intelligence, pages 616–622, 2019.
  • [58] Kevin Leyton-Brown and Mausam. AAAI 2021 - introduction. https://slideslive.com/38952457/aaai-2021-introduction?ref=account-folder-79533-folders; minute 8 onwards in the video, 2021.
  • [59] David Roxbee Cox. Planning of experiments. Wiley, 1958.
  • [60] Johan Ugander, Brian Karrer, Lars Backstrom, and Jon Kleinberg. Graph cluster randomization: Network exposure to multiple universes. In ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 329–337, 2013.
  • [61] Martin Saveski, Jean Pouget-Abadie, Guillaume Saint-Jacques, Weitao Duan, Souvik Ghosh, Ya Xu, and Edoardo M. Airoldi. Detecting network effects: Randomizing over randomized experiments. In ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 1027–1035, 2017.
  • [62] Susan Athey, Dean Eckles, and Guido W. Imbens. Exact p-values for network interference. Journal of the American Statistical Association, 113(521):230–240, 2018.
  • [63] Jean Pouget-Abadie, Guillaume Saint-Jacques, Martin Saveski, Weitao Duan, Souvik Ghosh, Ya Xu, and Edoardo M Airoldi. Testing for arbitrary interference on experimentation platforms. Biometrika, 106(4):929–940, 2019.

Appendix

Appendix A Linear Programs for Peer Review Assignment

Deterministic Assignment.

Let $Z\in\{0,1\}^{|\mathcal{R}|\times|\mathcal{P}|}$ be an assignment matrix where $Z_{r,p}$ denotes whether reviewer $r\in\mathcal{R}$ is assigned to paper $p\in\mathcal{P}$. Given a matrix of similarity scores $S\in\mathbb{R}_{\geq 0}^{|\mathcal{R}|\times|\mathcal{P}|}$, a standard objective is to find an assignment of papers to reviewers that maximizes the sum of similarities of the assigned pairs, subject to constraints that each paper is assigned to an appropriate number of reviewers $\ell$, each reviewer is assigned no more than a maximum number of papers $k$, and conflicts of interest are respected [13, 35, 36, 37, 38, 39, 10]. Denoting the set of conflict-of-interest pairs by $\mathcal{C}\subset\mathcal{R}\times\mathcal{P}$, this optimization problem can be formulated as the following linear program:

$$
\begin{aligned}
\max_{Z_{r,p}:\, r\in\mathcal{R},\, p\in\mathcal{P}} \quad & \sum_{r\in\mathcal{R},\, p\in\mathcal{P}} Z_{r,p}\, S_{r,p}\\
\text{s.t.} \quad & \sum_{r\in\mathcal{R}} Z_{r,p} = \ell \qquad \forall p\in\mathcal{P}\\
& \sum_{p\in\mathcal{P}} Z_{r,p} \leq k \qquad \forall r\in\mathcal{R}\\
& Z_{r,p} = 0 \qquad \forall (r,p)\in\mathcal{C}\\
& 0 \leq Z_{r,p} \leq 1 \qquad \forall r\in\mathcal{R},\, p\in\mathcal{P}.
\end{aligned}
$$

By total unimodularity, this problem has an optimal solution where $Z_{r,p}\in\{0,1\}$ for all $r\in\mathcal{R}$, $p\in\mathcal{P}$.
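To make the formulation concrete, the following sketch solves this LP with off-the-shelf tools. It is an illustration only, not the solver used by the venues we study; the function name and the use of scipy.optimize.linprog are our own choices, and the dense constraint matrices are only practical for small instances. The optional upper bound `ub` is included so the same helper can be reused for the randomized variant described next.

```python
import numpy as np
from scipy.optimize import linprog

def solve_assignment_lp(S, conflicts, ell, k, ub=1.0):
    """Sketch of the sum-of-similarities assignment LP.

    S: (num_reviewers, num_papers) similarity matrix.
    conflicts: boolean matrix of the same shape; True marks a conflict of interest.
    ell: reviewers required per paper; k: maximum papers per reviewer.
    ub: upper bound on each variable (1 for the deterministic LP).
    Returns the (num_reviewers, num_papers) matrix of optimal variable values.
    """
    R, P = S.shape
    c = -S.ravel()  # linprog minimizes, so negate the similarities

    # Equality constraints: each paper receives exactly `ell` reviewers.
    A_eq = np.zeros((P, R * P))
    for p in range(P):
        A_eq[p, p::P] = 1.0  # variables Z[r, p] across all reviewers r
    b_eq = np.full(P, float(ell))

    # Inequality constraints: each reviewer receives at most `k` papers.
    A_ub = np.zeros((R, R * P))
    for r in range(R):
        A_ub[r, r * P:(r + 1) * P] = 1.0  # variables Z[r, p] across all papers p
    b_ub = np.full(R, float(k))

    # Bounds: conflicted pairs are fixed to zero, all others lie in [0, ub].
    bounds = [(0.0, 0.0) if c_flag else (0.0, ub) for c_flag in conflicts.ravel()]

    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x.reshape(R, P)
```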

Although the above strategy is the primary method used for paper assignments in large-scale peer review, other variants of this method have been proposed and used in the literature. These algorithms consider various properties in addition to the total similarity, such as fairness [40, 41], strategyproofness [57, 42], envy-freeness [53] and diversity [58]. We focus on the sum-of-similarities objective here, but our off-policy evaluation framework is agnostic to the specific objective function.

Randomized Assignment.

As one approach to strategyproofness, Jecmen et al. [22] introduce the idea of using randomization to prevent colluding reviewers and authors from being able to guarantee their assignments. Specifically, the algorithm computes a randomized paper assignment, where the marginal probability $P(Z_{r,p})$ of assigning any reviewer $r$ to any paper $p$ is at most a parameter $q\in[0,1]$, chosen a priori by the program chairs. These marginal probabilities are determined by the following linear program, which maximizes the expected similarity of the assignment:

$$
\begin{aligned}
\max_{P(Z_{r,p}):\, r\in\mathcal{R},\, p\in\mathcal{P}} \quad & \sum_{r\in\mathcal{R},\, p\in\mathcal{P}} P(Z_{r,p})\, S_{r,p} \qquad\qquad\qquad \text{(2)}\\
\text{s.t.} \quad & \sum_{r\in\mathcal{R}} P(Z_{r,p}) = \ell \qquad \forall p\in\mathcal{P}\\
& \sum_{p\in\mathcal{P}} P(Z_{r,p}) \leq k \qquad \forall r\in\mathcal{R}\\
& P(Z_{r,p}) = 0 \qquad \forall (r,p)\in\mathcal{C}\\
& 0 \leq P(Z_{r,p}) \leq q \qquad \forall r\in\mathcal{R},\, p\in\mathcal{P}.
\end{aligned}
$$

A reviewer-paper assignment is then sampled using a randomized procedure that iteratively redistributes the probability mass placed on each reviewer-paper pair until all probabilities are either zero or one. This procedure ensures only that the desired marginal assignment probabilities are satisfied, providing no guarantees on the joint distributions of assigned pairs.
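Under the assumptions of the sketch above, LP (2) differs from the deterministic LP only in the per-variable upper bound, so the same helper can produce the marginal probabilities; the subsequent sampling step that rounds these marginals to a 0/1 assignment (the iterative redistribution of Jecmen et al. [22]) is not reproduced here. The instance below is purely illustrative.

```python
import numpy as np

# Tiny illustrative instance; solve_assignment_lp is the sketch from the previous subsection.
rng = np.random.default_rng(0)
S = rng.uniform(size=(6, 4))                 # 6 reviewers, 4 papers
conflicts = np.zeros_like(S, dtype=bool)     # no conflicts in this toy example

marginals = solve_assignment_lp(S, conflicts, ell=2, k=2, ub=0.52)  # LP (2) with q = 0.52
optimal   = solve_assignment_lp(S, conflicts, ell=2, k=2, ub=1.0)   # deterministic optimum
print(marginals.max())  # no marginal probability exceeds q
```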

Appendix B Stable Unit Treatment Value Assumption

The Stable Unit Treatment Value Assumption (SUTVA) in causal inference [59] states that the treatment of one unit does not affect the outcomes for the other units, i.e., there is no interference between the units. In the context of peer review, SUTVA implies that: (i) the quality $Y_{r,p}$ of the review by reviewer $r$ of paper $p$ does not depend on which other reviewers are assigned to paper $p$; and (ii) the quality also does not depend on the other papers that reviewer $r$ was assigned to review. The first assumption is quite realistic, as in most peer review systems reviewers cannot see other reviews until they submit their own. The second assumption is important to understand, as there could be "batch effects": a reviewer may feel more or less confident about their assessment (if measuring quality by confidence) depending on which other papers they were assigned to review. We do not test for batch effects or other violations of SUTVA in this work, since doing so typically requires either strong modeling assumptions or experimental designs specifically tailored to detecting interference [60, 61, 62, 63]; we consider it important future work.

Appendix C Covariance Estimation

As described in Section 4, we estimate the variance of $\widehat{\mu}_{B}(Y^{\text{Impute}})$ as:

$$
\widehat{\mathrm{Var}}\big[\widehat{\mu}_{B}(Y^{\text{Impute}})\big]
= \frac{1}{N^{2}} \sum_{(i,j)\in(\mathcal{R}\times\mathcal{P})^{2}} \mathrm{Cov}[Z_{i},Z_{j}]\, Z^{A}_{i} Z^{A}_{j} W_{i} W_{j} Y^{\prime}_{i} Y^{\prime}_{j},
\qquad \text{where } Y^{\prime}_{i} =
\begin{cases}
Y_{i} & \text{if } i\in\mathcal{I}^{+}\\
Y^{\text{Impute}}_{i} & \text{if } i\in\mathcal{I}^{Att}\cup\mathcal{I}^{-}\\
\overline{Y} & \text{if } i\in\mathcal{I}^{Abs}.
\end{cases}
$$

However, the covariance terms (taken over $Z\sim P_{A}$) are not known exactly, since the procedure of Jecmen et al. [22] only constrains the marginal probabilities of individual reviewer-paper pairs, and pairs of pairs can be non-trivially correlated. In the absence of a closed-form expression, we use Monte Carlo methods to estimate these covariances. In both the TPDP and AAAI analyses, we sampled 1 million assignments and computed the empirical covariance. To assess the variability of our variance estimates, we ran an additional analysis: we took a bootstrap sample of 100,000 assignments (from the 1 million sampled assignments), computed the variance based only on this smaller sample, repeated the procedure 1,000 times, and computed the variance of the resulting variance estimates. We found that the variance of our variance estimates is very small (less than $10^{-9}$) even when using 10 times fewer sampled assignments, suggesting that we have sampled enough assignments to estimate the variance accurately.
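The following sketch illustrates one way to carry out this Monte Carlo computation. Since the estimator is linear in the assignment indicators, the covariance quadratic form above equals the variance of that linear statistic across sampled assignments, so the sketch evaluates the statistic on each sample rather than tabulating covariances pair by pair; this is an equivalent shortcut rather than necessarily the exact implementation used in our analysis, and `sample_assignment` is a placeholder for the sampling procedure of Jecmen et al. [22].

```python
import numpy as np

def mc_variance(sample_assignment, Z_A, W, Y_prime, N, n_samples=1_000_000, seed=0):
    """Monte Carlo estimate of Var-hat[mu_hat_B(Y^Impute)].

    sample_assignment: callable(rng) -> sampled 0/1 assignment matrix Z ~ P_A
                       (placeholder for the randomized assignment sampler).
    Z_A, W, Y_prime: on-policy support indicators, importance weights, and outcomes
                     (with imputed / mean-filled values), all of the same shape.
    N: the normalizing constant in the definition of mu_hat_B.
    """
    rng = np.random.default_rng(seed)
    coef = (Z_A * W * Y_prime).ravel() / N  # the estimator is the linear statistic coef . Z
    stats = np.empty(n_samples)
    for s in range(n_samples):
        Z = sample_assignment(rng)
        stats[s] = coef @ Z.ravel()
    # The empirical variance of the linear statistic matches the covariance
    # quadratic form above, up to Monte Carlo error.
    return float(stats.var(ddof=1))
```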

Appendix D Coverage of Imbens-Manski Confidence Intervals

Under the Manski, monotonicity, and Lipschitz assumptions, we employ a standard technique due to Imbens and Manski [49] for constructing confidence intervals for partially identified parameters. These intervals converge uniformly to the specified $\alpha$-level coverage under a set of regularity conditions on the behavior of the estimators of the upper and lower endpoints of the interval estimate (Assumption 1 of [49], which establishes the coverage result in their Lemma 4). It is difficult to verify whether Assumption 1 is satisfied for the designs (sampling reviewer-paper matchings) and interval endpoint estimators (Manski, monotonicity, Lipschitz) used in this work.

A different set of assumptions, most notably that the fraction of missing data is known before assignment, supports an alternative method for computing confidence intervals with the coverage result in Lemma 3 of [49], obviating the need for Assumption 1. In our setting, small amounts of attrition (relative to the number of policy-induced positivity violations) mean that the fraction of missing data is not known exactly before assignment, but nearly so. In practice, we find that the Imbens-Manski interval estimates from their Lemma 3 (assuming a known fraction of missing data) and Lemma 4 (assuming Assumption 1) are nearly identical for all three of the Manski-, monotonicity-, and Lipschitz-based estimates, suggesting that the coverage is well-behaved. A detailed theoretical analysis of whether our estimators obey the regularity conditions of Assumption 1 is beyond the scope of this work; see [51] for theoretical developments related to the rates of convergence of Lipschitz-based estimates.
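For reference, a minimal sketch of the Imbens-Manski interval construction: given estimates and standard errors for the lower and upper bound, the critical value $c$ solves $\Phi\big(c + \widehat{\Delta}/\max(\widehat{\sigma}_{\ell},\widehat{\sigma}_{u})\big) - \Phi(-c) = 1-\alpha$, where $\widehat{\Delta}$ is the estimated width of the identified set, and the bounds are then widened by $c$ standard errors on each side. The function below is our own illustrative implementation, not code from [49].

```python
from scipy.optimize import brentq
from scipy.stats import norm

def imbens_manski_interval(lo, hi, se_lo, se_hi, alpha=0.05):
    """Confidence interval for a partially identified parameter (Imbens & Manski, 2004).

    lo, hi: estimates of the lower/upper bound of the identified set.
    se_lo, se_hi: their standard errors.
    """
    s = max(se_lo, se_hi)
    if s == 0.0:  # degenerate case: the bound estimates have no sampling error
        return lo, hi
    delta = max(hi - lo, 0.0)
    # Find c such that Phi(c + delta/s) - Phi(-c) = 1 - alpha.
    f = lambda c: norm.cdf(c + delta / s) - norm.cdf(-c) - (1.0 - alpha)
    c = brentq(f, 0.0, 10.0)
    return lo - c * se_lo, hi + c * se_hi
```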

Appendix E Model Implementation

To impute the outcomes of the unobserved reviewer-paper pairs, we train classification, ordinal regression, and collaborative filtering models. Classification models are suitable since the reviewers select their expertise and confidence scores from a set of pre-specified choices. Ordinal regression models additionally model the fact that the scores have a natural ordering. Collaborative filtering models, in contrast to the classification and ordinal regression models, do not rely on covariates and instead model the structure of the observed entries in the reviewer-paper outcome matrix, which is akin to user-item rating matrices found in recommender systems.

In the classification and regression models, we use the covariates $X_{i}$ for each reviewer-paper pair as input features. In our analysis, we consider the two or three component scores used to compute the similarities: for TPDP, $X_{i}=(T_{i},B_{i})$; for AAAI, $X_{i}=(T_{i},K_{i},B_{i})$. These are the primary components used by conference organizers to compute similarities, so we expect them to be usefully correlated with match quality. Although we perform our analysis with this choice of covariates, one could also include various other features of each reviewer-paper pair, e.g., encodings of reviewer and paper subject areas, reviewer seniority, etc.

To evaluate the performance of the models, we randomly split the observed reviewer-paper pairs into train (75%) and test (25%) sets, fit the models on the train set, and measure the mean absolute error (MAE) of the predictions on the test set. To get more robust estimates of the performance, we repeat this process 10 times. In the training phase, we use 10-fold cross-validation to tune the hyperparameters, using MAE as a selection criterion, and retrain the model on the full training set with the best hyperparameters. We also consider two preprocessing decisions: (a) whether to encode the bids as one-hot categorical variables or continuous variables with the values described in Section 5.1, and (b) whether to standardize the features. In both cases, we used the settings that, overall, worked best (at prediction) for each model. We tested several models from each model category. To simplify the exposition, we only report the results of the two best-performing models in each category. The code repository referenced in Section 1 contains the implementation of all models, including the sets of hyperparameters considered for each model.
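A condensed sketch of this evaluation protocol, using scikit-learn with a logistic-regression classifier as a stand-in for the models we compare (the model choice, hyperparameter grid, and preprocessing here are illustrative assumptions, not the exact configurations from our code repository):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def evaluate_imputation_model(X, y, n_repeats=10, seed=0):
    """Repeated 75/25 splits with inner 10-fold CV for hyperparameter selection by MAE.

    X: covariates for observed reviewer-paper pairs; y: integer-coded outcomes.
    Returns the mean and standard deviation of the test MAE across repeats.
    """
    maes = []
    for rep in range(n_repeats):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.25, random_state=seed + rep)
        model = GridSearchCV(
            make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
            param_grid={"logisticregression__C": [0.01, 0.1, 1.0, 10.0]},
            scoring="neg_mean_absolute_error",
            cv=10,
        )
        model.fit(X_tr, y_tr)  # refits on the full training split with the best C
        maes.append(mean_absolute_error(y_te, model.predict(X_te)))
    return float(np.mean(maes)), float(np.std(maes))
```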

Figure 3: Test performance of the imputation models described in Section 4.2, averaged across 10 random train/test splits of all observed reviewer-paper pairs. The error bars show 95% confidence intervals.

Figure 3 shows the test MAE across the 10 random train/test splits (means and 95% CIs) using expertise and confidence outcomes for both TPDP and AAAI. We note that all models perform significantly better than a baseline that predicts the mean outcome in the train set. For TPDP, we find that all models perform similarly, except for cf-svd++, which performs slightly better than the other models, both for expertise and confidence. For AAAI, all classification and regression models perform similarly, but the collaborative filtering models perform slightly worse. This difference in performance is perhaps due to the fact that we consider a larger set of covariates for AAAI than TPDP, likely making the classification and ordinal regression models more predictive.

Finally, to estimate μ^B\widehat{\mu}_{B}, we train each model on the set of all observed reviewer-paper pairs, predict the outcomes for all unobserved pairs, and impute the predicted outcomes as described in Section 4.2. In the training phase, we use 10-fold cross-validation to select the hyperparameters and refit the model on the full set of observed reviewer-paper pairs.

Appendix F Details of AAAI Assignment

In Section 5.1, we described a simplified version of the stage of the AAAI assignment procedure that we analyze, i.e., the assignment of senior reviewers to the first round of submissions. In this section, we describe this stage of the AAAI paper assignment more precisely.

A randomized assignment was computed between 3,145 senior reviewers and 8,450 first-round paper submissions, independent of all other stages of the reviewer assignment. The set of senior reviewers was determined based on reviewing experience and publication record; these criteria were external to the assignment. Each paper was assigned $\ell=1$ senior reviewer. Reviewers were assigned at most $k=4$ papers, with the exception of reviewers with a "Machine Learning" primary area or in the "AI For Social Impact" track, who were assigned at most $k=3$ papers. The probability limit was $q=0.52$.

The similarities were computed from text-similarity scores $T_{i}$, subject-area scores $K_{i}$, and bids $B_{i}$. Either the text-similarity score or the subject-area score could be missing for a given reviewer-paper pair, due to a reviewer failing to provide the needed information or to errors in the computation of the scores. The text-similarity scores $T_{i}$ were created from text-based scores from two sources: (i) the Toronto Paper Matching System (TPMS) [13], and (ii) the ACL Reviewer Matching code [33]. The text-similarity score $T_{i}$ was set equal to the TPMS score for all pairs where that score was not missing, set equal to the ACL score for all other pairs where the ACL score was not missing, and marked as missing if both scores were missing. The subject-area scores were computed from reviewer and paper subject areas using the procedure described in Appendix A of [4].

Base scores were then computed as $S_{i}^{\prime}=w_{\mathrm{text}}T_{i}+(1-w_{\mathrm{text}})K_{i}$ with $w_{\mathrm{text}}=0.75$, if both $T_{i}$ and $K_{i}$ were not missing. If either $T_{i}$ or $K_{i}$ was missing, the base score was set equal to the non-missing score of the two. If both were missing, the base score was set to $S_{i}^{\prime}=0$. For pairs where the bid was "willing" or "eager" and $K_{i}=0$, the base score was set to $S_{i}^{\prime}=T_{i}$.

Next, final scores were computed as $S_{i}={S_{i}^{\prime}}^{1/B_{i}}$, using the bid values "not willing" ($0.05$), "not entered" ($1$), "in a pinch" ($1+0.5\lambda_{\mathrm{bid}}$), "willing" ($1+1.5\lambda_{\mathrm{bid}}$), and "eager" ($1+3\lambda_{\mathrm{bid}}$), with $\lambda_{\mathrm{bid}}=1$. If $S_{i}<0.15$ and $K_{i}$ was not missing, the final score was recomputed as $S_{i}=\min(K_{i}^{1/B_{i}},0.15)$. Finally, for reviewers who did not provide their profile for use in conflict-of-interest detection, the final score was reduced by $10\%$.

In all of our analyses, we follow this same procedure to determine the assignment under alternative policies, varying only the parameters $w_{\mathrm{text}}$, $\lambda_{\mathrm{bid}}$, and $q$.
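The scoring rule above can be summarized in a short function. This is our own reconstruction from the description; the handling of edge cases (e.g., a "willing" or "eager" bid with $K_{i}=0$ but missing $T_{i}$) is an assumption, and NaN is used to mark missing scores. Varying `w_text` and `lambda_bid` in this computation corresponds to the alternative policies we analyze.

```python
import numpy as np

def aaai_final_score(T, K, bid, w_text=0.75, lambda_bid=1.0, missing_profile=False):
    """Reconstruction of the AAAI scoring rule described above (missing scores are np.nan).

    T: text-similarity score; K: subject-area score; bid: one of the five bid strings.
    """
    bid_exponent = {
        "not willing": 0.05,
        "not entered": 1.0,
        "in a pinch": 1.0 + 0.5 * lambda_bid,
        "willing": 1.0 + 1.5 * lambda_bid,
        "eager": 1.0 + 3.0 * lambda_bid,
    }[bid]

    # Base score: weighted combination, falling back to whichever component is present.
    if not np.isnan(T) and not np.isnan(K):
        base = w_text * T + (1.0 - w_text) * K
    elif not np.isnan(T):
        base = T
    elif not np.isnan(K):
        base = K
    else:
        base = 0.0
    if bid in ("willing", "eager") and K == 0:
        base = T if not np.isnan(T) else 0.0  # missing-T handling here is an assumption

    # Bid adjustment, low-score fallback, and conflict-profile penalty.
    score = base ** (1.0 / bid_exponent)
    if score < 0.15 and not np.isnan(K):
        score = min(K ** (1.0 / bid_exponent), 0.15)
    if missing_profile:
        score *= 0.9
    return score
```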

Figure 4: CCDF of the $L=|Y_{i}-Y_{j}|/d(X_{i},X_{j})$ values for all pairs of observed points, where the $Y$s are confidence scores. The dashed lines denote the $L$ values corresponding to less than $10\%$, $5\%$, and $1\%$ violations. For TPDP, these values are $L=30$, $60$, and $400$, respectively; for AAAI, $L=20$, $30$, and $70$.
Figure 5: CCDF of the distances between each relevant unobserved reviewer-paper pair and its closest observed reviewer-paper pair. The dashed lines show the medians: 0.0014 (TPDP) and 0.0011 (AAAI).

Appendix G Details Regarding Assumption Suitability

In this section, we provide additional details on the discussion in Section 5.2 on the suitability of the monotonicity and Lipschitz smoothness assumptions.

First, we examine the fraction of pairs of observed reviewer-paper pairs that violate the Lipschitz condition for each value of $L$. Figure 4 shows the CCDF of $L$ over pairs of observations (in other words, the fraction of violating observation-pairs for each value of $L$) with respect to confidence. The corresponding plot for expertise is shown in Figure 1.

Next, we examine the distances from unobserved reviewer-paper pairs to their closest observed reviewer-paper pair. In Figure 5, we show the CCDF of these distances for unobserved reviewer-paper pairs within a set of "relevant" pairs. We define the set of "relevant" unobserved pairs to be all pairs not supported on-policy that have positive probability in at least one off-policy: for TPDP, the off-policies varying $w_{\mathrm{text}}$ with $q=1$; for AAAI, the off-policies varying $w_{\mathrm{text}}$ and $\lambda_{\mathrm{bid}}$ with $q=1$.
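As an illustration, the $L$ values underlying Figure 4 can be computed directly from the observed pairs. The sketch below assumes a Euclidean distance $d$ on the covariates, which is a simplification made for the purposes of this sketch.

```python
import numpy as np

def lipschitz_ratios(X, Y):
    """L = |Y_i - Y_j| / d(X_i, X_j) over all pairs of observed reviewer-paper pairs.

    X: (n, d) covariates; Y: (n,) outcomes (expertise or confidence).
    Assumes a Euclidean distance; pairs with identical covariates are dropped.
    """
    dY = np.abs(Y[:, None] - Y[None, :])
    dX = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    iu = np.triu_indices(len(Y), k=1)
    with np.errstate(divide="ignore", invalid="ignore"):
        L = dY[iu] / dX[iu]
    return np.sort(L[np.isfinite(L)])

def ccdf_at(L_values, thresholds):
    """Fraction of observation-pairs whose ratio exceeds each threshold (the CCDF in Figure 4)."""
    L_values = np.asarray(L_values)
    return {t: float(np.mean(L_values > t)) for t in thresholds}
```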

Appendix H Similarity Cost of Randomization

In [22], Jecmen et al. empirically analyze the "cost of randomization" in terms of the expected total assignment similarity, i.e., the objective value of LP (2), as $q$ varies. This approach is also used by conference program chairs to choose an acceptable level of $q$ in practice. In Figure 6, we show this trade-off between $q$ and sum-similarity (as a ratio to the optimal deterministic sum-similarity) for both TPDP and AAAI. Note that, in contrast, our approach in this work is to measure assignment quality via self-reported expertise or confidence rather than by similarity. In particular, the cost of randomization for TPDP is high in terms of sum-similarity but is revealed by our analysis to be mild in terms of expertise (Section 6).

Figure 6: The "cost of randomization" as measured by the expected total assignment similarity. The plot shows the ratio between the sum of similarities under a randomized assignment (LP (2) with $q\leq 1$) and the sum of similarities under a deterministic assignment ($q=1$). The dashed lines show the values of $q$ set on-policy.

Appendix I Tie-Breaking Behavior

In Section 6, we specify a policy in terms of the parameters of LP (2) (specifically, by altering the parameters $q$, $w_{\mathrm{text}}$, and $\lambda_{\mathrm{bid}}$ from their on-policy values). However, LP (2) may not have a unique solution, and thus each policy may not correspond to a unique set of assignment probabilities. Of particular concern, the on-policy specification of LP (2) does not uniquely identify the actual on-policy assignment probabilities.

Ideally, we would use the same tie-breaking methodology as was used on-policy to pick a solution for each off-policy, to avoid introducing additional effects from variations in tie-breaking behavior. However, this behavior was not specified in the venues we analyze. To resolve this, we fix arbitrary tie-breaking behaviors such that the on-policy solution to LP (2) matches the actual on-policy assignment probabilities; we then use these same behaviors for all off-policies.

In the TPDP analyses, we perturb all similarities by small constants so that all similarity values are unique. Specifically, we change the objective of LP (2) to $\sum_{i\in\mathcal{R}\times\mathcal{P}} P(Z_{i})\big[(1-\lambda)S_{i}+\lambda\mathcal{E}_{i}\big]$, where $\lambda=10^{-8}$ and $\mathcal{E}\in\mathbb{R}^{|\mathcal{R}|\times|\mathcal{P}|}$ is the same for all policies. To choose $\mathcal{E}$, we sampled each entry uniformly at random from $[0,1]$ and checked whether the solution of the perturbed on-policy LP matches the on-policy assignment probabilities, resampling until it did. This value of $\mathcal{E}$ was then fixed for all policies.

In the AAAI analyses, the larger size of the similarity matrix meant that randomly choosing an $\mathcal{E}$ that recovers the on-policy solution was not feasible. Instead, we more directly choose how to perturb the similarities to achieve consistency with the on-policy. We change the objective of LP (2) to $\sum_{i\in\mathcal{R}\times\mathcal{P}} P(Z_{i})\big(S_{i}-\epsilon\,\mathbb{I}[P_{A}(Z_{i})=0]\big)$, where $\epsilon\in\mathbb{R}$ is chosen for each policy as follows: $\epsilon$ is the largest value in $\{10^{-9},10^{-6},10^{-3}\}$ such that the difference in total similarity between the solutions of the original and perturbed LPs is no greater than a tolerance of $10^{-5}$. We confirmed that perturbing the on-policy LP with this procedure recovers the on-policy assignment probabilities, as desired.
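A schematic version of this selection rule is given below, where `solve_lp` stands in for a solver of LP (2) returning the matrix of assignment probabilities; the solver itself and the surrounding bookkeeping are abstracted away, so this is a sketch under those assumptions rather than our exact implementation.

```python
import numpy as np

def choose_epsilon(solve_lp, S, zero_on_policy, tol=1e-5,
                   candidates=(1e-3, 1e-6, 1e-9)):
    """Pick the largest perturbation epsilon whose solution preserves total similarity.

    solve_lp: callable(similarities) -> matrix of assignment probabilities (placeholder).
    zero_on_policy: 0/1 matrix indicating pairs with zero on-policy probability.
    """
    base_similarity = float((solve_lp(S) * S).sum())
    for eps in candidates:  # try the largest perturbation first
        probs = solve_lp(S - eps * zero_on_policy)
        if abs(base_similarity - float((probs * S).sum())) <= tol:
            return eps
    raise ValueError("no candidate epsilon met the tolerance")
```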

Appendix J Power Investigation: Purposefully Bad Policies

Many of the off-policy assignments we consider in Section 5 turn out to have relatively similar estimated quality. A possible explanation for this tendency is that most "reasonable" optimized policies are roughly equivalent in terms of quality, since our analyses only adjust parameters of the (presumably reasonable) optimized on-policy. To investigate this possibility, we analyze a policy intentionally chosen to have poor quality.

Designing a "bad" policy that can be feasibly analyzed presents a challenge, as the on-policies are both optimized and thus rarely place probability on obviously bad reviewer-paper pairs. To work within this constraint, we construct bad policies by treating all reviewer-paper pairs with zero on-policy probability as conflicts. We then contrast the deterministic ($q=1$) policy that maximizes the total similarity score with the "bad" policy that minimizes it. Since the on-policy similarities are presumably somewhat indicative of expertise, we expect the minimization policy to be worse.

The results of this comparison are presented in Table 1. On both TPDP and AAAI, our methods clearly identify the minimization policies as worse. The differences in quality between the policies become clearer with the addition of the Lipschitz and monotonicity assumptions to address attrition. This illustrates that our methods can distinguish a good policy (the best of the best matches) from a clearly worse one (the worst of the best matches). Thus, it is likely that our primary analyses are simply exploring high-quality regions of the assignment-policy space, and that peer-review assignment quality is often robust to the exact values of the various parameters.

Table 1: Expertise of bad policies (95% confidence intervals). $L=50$ for TPDP and $L=40$ for AAAI.
Policy Manski Monotonicity Lipschitz
TPDP Max [2.6115, 2.7045] [2.6551, 2.6782] [2.6498, 2.6744]
TPDP Min [2.5521, 2.6126] [2.5521, 2.5986] [2.5521, 2.5937]
AAAI Max [3.3919, 3.5213] [3.4756, 3.4783] [3.4764, 3.4809]
AAAI Min [3.2591, 3.3846] [3.3394, 3.3419] [3.3396, 3.3443]

Appendix K Results for Confidence Outcomes

Figure 7 shows the results of our analyses using confidence as a quality measure (YY). We find that the results are substantively very similar to those reported in Section 6 using expertise.


Figure 7: Confidence of off-policies varying $w_{\mathrm{text}}$ and $q$ for TPDP, and $w_{\mathrm{text}}$, $\lambda_{\mathrm{bid}}$, and $q$ for AAAI, computed using the different estimation methods described in Section 4. The dashed blue lines indicate Manski bounds around the on-policy confidence and the grey lines indicate Manski bounds around the off-policy confidence. The error bands, denoted $\widehat{\mu}_{B}^{CI}$ for the Manski bounds, $\widehat{\mu}_{B|M}^{CI}$ for the monotonicity bounds, and $\widehat{\mu}_{B|L}^{CI}$ for the Lipschitz bounds, represent confidence intervals that asymptotically contain the true value of $\mu_{B}$ with probability at least $95\%$, as described in Section 4. Note that the vertical axis does not start at zero in order to focus on the most relevant regions of the plots.