Measuring Risk of Bias in Biomedical Reports: The RoBBR Benchmark
Abstract
Systems that answer questions by reviewing the scientific literature are becoming increasingly feasible. To draw reliable conclusions, these systems should take into account the quality of available evidence, placing more weight on studies that use a valid methodology. We present a benchmark for measuring the methodological strength of biomedical papers, drawing on the risk-of-bias framework used for systematic reviews. The four benchmark tasks, drawn from more than 500 papers, cover the analysis of research study methodology, followed by evaluation of risk of bias in these studies. The benchmark contains 2000 expert-generated bias annotations, and a human-validated pipeline for fine-grained alignment with research paper content. We evaluate a range of large language models on the benchmark, and find that these models fall significantly short of expert-level performance. By providing a standardized tool for measuring judgments of study quality, the benchmark can help to guide systems that perform large-scale aggregation of scientific data. The dataset is available at https://github.com/RoBBR-Benchmark/RoBBR.
1 Introduction
Systems that automatically answer questions by reviewing the scientific literature are becoming increasingly feasible. These systems offer the potential to provide scientists with on-demand access to knowledge that is synthesized from across the literature. This can help scientists understand what is already known about topics outside of their research focus, and help clinicians and practitioners keep up to date with best practices.
When assessing what is known about a field, not all studies should be weighed equally I et al. [2023]. Studies with stronger methodologies contribute more to a body of evidence than those with weaker methodologies. By weighing studies appropriately, systems can increase the reliability of their summaries and recommendations Turner et al. [2009].
In biomedical research, there is a large body of work which investigates the factors that decrease the validity of a study’s methods, and best practices for addressing these issues I et al. [2023], Welton et al. [2009]. Reporting bias occurs when there are systematic differences in how outcomes are reported or disclosed between the groups that are compared (e.g., the results of the treatment group are more frequently or favorably published than those in the placebo group). This bias is often best addressed by adhering to strict reporting standards and protocols, such as registering the study in advance and committing to publish all results Kotz and West [2022]. Other biases, such as detection bias, attrition bias, and selection biases, can frequently be addressed using careful study design and implementation Berger [2005], JPT et al. [2023], I et al. [2023].
This paper presents the RoBBR benchmark for assessing the methodological strength of biomedical studies. The benchmark follows the risk of bias framework developed by Cochrane Jpt et al. [2023], JPT et al. [2023], Higgins et al. [2011], Practice and of Care [EPOC], Sterne et al. [2014], a global independent network that conducts systematic reviews of healthcare interventions. Cochrane’s risk of bias framework is widely recognized as a standard for evaluating the quality and reliability of research studies. It serves as an important resource for national health agencies in various countries, including the United States Viswanathan et al. [2012], the United Kingdom Alderson and Tan [2011], and Australia NHMRC [2019]. Notably, Cochrane’s systematic reviews are used to formulate national and international guidelines on healthcare practices Bunn et al. [2015]. Furthermore, empirical comparisons between Cochrane and non-Cochrane systematic reviews consistently demonstrate higher methodological quality in those conducted by Cochrane, highlighting its critical role in advancing evidence-based healthcare Goldkuhle et al. [2018], Page et al. [2018], Matthias et al. [2020], Tsoi et al. [2020].

2 Background
The RoBBR benchmark aims to provide an expert-validated tool for assessing whether models can make risk of bias judgments for biomedical papers. The benchmark uses a risk-of-bias framework developed in the context of systematic reviews, which is the gold standard for assessing biomedical studies. We provide more background on systematic reviews and the risk of bias framework below.
Systematic reviews and meta-analyses are tools used in evidence-based medicine, which synthesize evidence from multiple studies to answer important clinical questions. They help researchers and healthcare professionals make informed decisions based on the best available evidence Ahn and Kang [2018], TJ et al. [2023], JJ et al. [2023].
A systematic review is a comprehensive search of the literature to identify all relevant studies on a particular topic TJ et al. [2023], C et al. [2023]. The review process follows a pre-specified protocol to minimize bias and ensure transparency. The protocol outlines the research question, search strategy, study selection criteria, data extraction methods, and plans for quality assessment and synthesis of results Jpt et al. [2023], Page et al. [2021].
A meta-analysis is a type of systematic review that uses statistical methods to combine the results from multiple studies Ahn and Kang [2018]. By pooling the data across studies, a meta-analysis can provide a more precise estimate of the treatment effect JJ et al. [2023], Guevara et al. [2004] and explore potential sources of heterogeneity Higgins and Thompson [2002].
The research question for a systematic review is often structured using the PICO framework, which stands for Participant, Intervention, Comparison, and Outcome Aslam and Emmanuel [2010]. The PICO elements help define the key components of the research question and guide the study selection process J et al. [2023], JE et al. [2023a]. For example, a systematic review on the effectiveness of a new hypertension drug might have the following PICO: Participant: Adults with primary hypertension; Intervention: New antihypertensive drug; Comparison: Placebo or standard care; Outcome: Change in systolic and diastolic blood pressure.
By clearly defining the PICO elements, the review authors can develop a focused search strategy and establish explicit criteria for selecting studies to include in the review J et al. [2023], JE et al. [2023b].
Inclusion criteria typically specify the characteristics of the study population, the interventions and comparisons of interest, the outcomes to be measured, and the study designs to be included JE et al. [2023b], Page et al. [2021], Tawfik et al. [2019] (e.g., randomized controlled trials). Exclusion criteria describe reasons why a study must be excluded from an analysis, such as using the wrong study design C et al. [2023], Page et al. [2021], Tawfik et al. [2019].
Risk of bias assessment involves evaluating each included study to determine the likelihood that its results may be biased JPT et al. [2023], Practice and of Care [EPOC], Sterne et al. [2014]. The Cochrane tool for assessing risk of bias Higgins et al. [2011] is widely used and assesses several types of bias, including random sequence generation, allocation concealment, blinding of participants and outcome assessors, incomplete outcome data, selective reporting, and other sources of bias. During the assessment, two review authors independently evaluate the risk of bias using the Cochrane tool and resolve disagreements through discussion. They record their rationale for each bias assessment. These rationales are referred to as "support judgments."
For example, a support judgment of Reporting Bias could be "The study protocol was registered (ClinicalTrials.gov NCT01863706) but not all of the outcomes projected by methodological descriptions were reported as results in the study report (cases of diarrhoea, nausea and vomiting were not completely reported). Moreover, the study publication reports outcomes (hypotension, nausea, transfusion) not listed in the registered protocol." A decision regarding the risk of reporting bias is then made on the basis of the support judgment. See Figure 1 for an illustration.
Studies with a high risk of bias may overestimate or underestimate the true treatment effect, leading to inaccurate conclusions in the systematic review. In this paper, we consider eight categories of biases: selection bias, attrition bias, performance bias, detection bias, reporting bias, analysis bias, special bias (specific to individual meta-analyses), and other bias (i.e. biases not covered by the other categories). We benchmark model performance on assessing papers for each of these bias categories.
3 Benchmark Development
The RoBBR benchmark is derived from 63 Cochrane meta-analyses and 532 papers referenced in those meta-analyses. All of the meta-analyses and papers have a Creative Commons license or are public domain. The benchmark is designed to evaluate a model’s ability to judge the risk of bias in biomedical reports.
The data were collected in April 2024. We included all meta-analyses from Cochrane that are in the BioC database, are classified as intervention reviews, have a CC-BY-NC license, and contribute at least one data point for each task. We obtained the XML format of the papers included in these meta-analyses either through the BioC API or by manually downloading PDFs and converting them to XML using GROBID. We then extracted task data from the XML of these papers.
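As a rough illustration of this collection step, the sketch below downloads a PMC open-access paper in BioC XML and converts a manually downloaded PDF with a local GROBID server. The endpoint patterns, the helper names, and the local GROBID URL are assumptions for illustration; this is not the RoBBR release code.

```python
# Minimal sketch of the paper-collection step (assumed endpoints, not RoBBR code).
import requests

def fetch_bioc_xml(pmcid: str) -> str:
    """Download a PMC open-access paper in BioC XML via the public BioC API."""
    url = (
        "https://www.ncbi.nlm.nih.gov/research/bionlp/RESTful/"
        f"pmcoa.cgi/BioC_xml/{pmcid}/unicode"
    )
    resp = requests.get(url, timeout=30)
    resp.raise_for_status()
    return resp.text

def grobid_pdf_to_xml(pdf_path: str, grobid_url: str = "http://localhost:8070") -> str:
    """Convert a manually downloaded PDF to TEI XML using a local GROBID server."""
    with open(pdf_path, "rb") as f:
        resp = requests.post(
            f"{grobid_url}/api/processFulltextDocument",
            files={"input": f},
            timeout=120,
        )
    resp.raise_for_status()
    return resp.text
```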
The benchmark is divided into four main tasks (see Appendix A.3 for detailed statistics):
• Study Inclusion/Exclusion (Task 1): Determine whether a study’s objectives and methodology fit a set of requirements.
• Bias Retrieval (Task 2): Retrieve sentences from the paper that support a judgment about the study’s risk of bias.
• Support Judgment Selection (Task 3): From a list of options, select the judgment that best supports a determination of the study’s risk of bias.
• Risk Level Determination (Task 4): Assess the study’s risk of specific biases.
These four tasks mirror the procedure followed by meta-reviewers. Initially, reviewers decide if a study should be included or excluded based on predefined criteria, which reflect the goals of the review (Task 1). Included studies are then analyzed for indications of risk of bias (Task 2). Reviewers summarize the evidence collected from the study and analyze its implications (Task 3). Finally, they assess the risk level for each identified bias (Task 4).
By structuring the dataset around these tasks, we provide a comprehensive framework for simulating the nuanced process of risk of bias assessment in meta-analysis.
3.1 Train/Test Split
We randomly assign 23 meta-analyses and their 228 corresponding papers to the dev set. The remaining 40 meta-analyses and their 304 corresponding papers form the test set.
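The split is made at the meta-analysis level, so all papers from a given meta-analysis fall on the same side of the dev/test boundary. The sketch below shows one way such a grouped split can be produced; the function name and seed are illustrative, not the authors' code.

```python
# Sketch of a meta-analysis-level split: papers from the same meta-analysis
# stay on the same side of the dev/test boundary. Illustrative only.
import random

def split_by_meta_analysis(meta_to_papers: dict[str, list[str]], n_dev: int, seed: int = 0):
    meta_ids = sorted(meta_to_papers)
    random.Random(seed).shuffle(meta_ids)
    dev_metas, test_metas = meta_ids[:n_dev], meta_ids[n_dev:]
    dev_papers = [p for m in dev_metas for p in meta_to_papers[m]]
    test_papers = [p for m in test_metas for p in meta_to_papers[m]]
    return dev_papers, test_papers
```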

3.2 Task 1: Study Inclusion/Exclusion
For each potentially relevant study, the meta-analysis’s search protocol information and objective are provided as input, and the task is to decide whether the study should be included in the meta-analysis. The objective is included in order to provide context similar to what is normally available to a meta-reviewer and to make it easier to interpret the criteria in the search protocol. An example is provided in Figure 2.
3.3 Task 2: Risk of Bias Sentence Retrieval
When judging the risk of bias for a specific study, the review authors provide support judgments which justify their risk of bias rating. These support judgments are grounded in specific aspects of the study, which they describe. Task 2 tests a model’s ability to retrieve the correct source for the support judgment from a paper’s text. The support judgments are intentionally omitted in the Bias Retrieval task. This exclusion simulates the systematic review process by challenging the model to independently generate these justifications; including them would come close to providing the answer, undermining the purpose of the task. See Figure 3 for an illustration.
Some review authors directly quote sentences from the paper in their support judgment, allowing for straightforward extraction and matching of these sentences with the paper’s text. However, most reviewers prefer to paraphrase the sentences or engage in deeper analysis, synthesizing information from multiple parts of the paper.
To match these judgments with sentences from the paper, we use a three-part annotation pipeline.
First, a support judgment often contains multiple pieces of information. We decompose the support judgment into distinct, non-overlapping pieces, each focusing on a specific aspect of the original support judgment.
Second, we exclude any part of the support judgment containing a negation or non-specific commentary, because these parts of the judgment cannot be matched to any sentence. The remaining parts of the support judgments generally contain specific claims about the study.
Third, these aspects are mapped to specific sentences within the paper, allowing us to pinpoint the exact sources of information that underlie each aspect of the judgment. A single aspect could potentially be supported by multiple sentences from the paper, and similarly a single sentence can support multiple aspects.
Each of these steps, including the aspect decomposition, uses GPT-4 (gpt-4-0125) OpenAI [2023]. The prompts can be found in Appendix C.
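A minimal sketch of what one such pipeline stage might look like is given below for the aspect-decomposition step. The prompt wording, the helper name, and the exact model string are placeholders; the actual prompts are those in Appendix C.

```python
# Illustrative sketch of the aspect-decomposition stage (placeholder prompt, not the Appendix C prompt).
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def decompose_support_judgment(judgment: str, model: str = "gpt-4-0125-preview") -> list[str]:
    """Ask the model to split a support judgment into distinct, non-overlapping aspects."""
    prompt = (
        "Decompose the following risk-of-bias support judgment into distinct, "
        "non-overlapping aspects, one per line. Omit negations and non-specific "
        "commentary.\n\nSupport judgment:\n" + judgment
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    lines = resp.choices[0].message.content.splitlines()
    return [line.strip("- ").strip() for line in lines if line.strip()]
```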
We manually inspect the 2221 support judgment aspects generated in the second stage to ensure that all aspects relate to specific information from the paper.
In the next section, we evaluate annotation quality by comparing it with human judgments.

3.3.1 Evaluating LLM Annotations
To match each aspect from a support judgment with specific sentences in the paper, we need to evaluate each sentence (on average, 200 sentences) against each aspect. For a support judgment with 5 aspects, this results in approximately 1,000 decisions. Given the scale of our entire dataset, which includes over 500 (bias, study) pairs, this translates to more than 1 million decisions – an impractical task for human annotators to perform manually.
We hypothesize that determining whether a sentence matches an aspect is straightforward enough for GPT-4 to achieve human-level performance with a specialized prompt. Following the protocol introduced by Wang et al. [2023], we created a development set of 30 papers. A single support judgment aspect was chosen for each paper, and every sentence in the paper was annotated for that aspect, totaling 8789 decisions. We optimized the instruction prompt for GPT-4 on this development set. Details about the prompt and experimental procedure are provided in Appendix C.
Once the prompt was finalized, four annotators were tasked with annotating 50 randomly sampled (aspect, full paper) pairs (all annotators saw the same pairs). Every sentence in each paper was annotated, requiring 13575 annotations in total. The annotators were divided into two teams, with each team consisting of two annotators. Each annotator performed the annotation tasks individually given a pre-defined annotation guideline. Annotators from the same team then collaborated to resolve differences and eliminate mistakes. Refer to the Appendix B for the full annotation guidelines.
Each team produced a set of annotation results. GPT-4 annotated the same (aspect, paper) pairs. We calculated Binary Accuracy, F1-binary, Spearman Correlation, and Kappa Coefficient between the two human teams and between each human team and GPT-4. Table 1 shows that the human teams matched each other and GPT-4 more than 99% of the time. The other agreement metrics have values in the low-to-mid 70s, indicating substantial agreement among annotators. (These metrics are lower than exact agreement because of class imbalance: positive matches between sentences and aspects are rare.) Bootstrapped hypothesis tests do not demonstrate any statistically significant difference between GPT-4 and the human annotators.
Table 1: Agreement between the two human annotation teams and between each human team and GPT-4.

Metrics | Human & Human Average | Human & GPT Average | p-value
---|---|---|---
Exact Accuracy | 99.4 ± 0.2 | 99.3 ± 0.2 | 0.31
F1 Binary | 73.4 ± 6.3 | 71.7 ± 5.6 | 0.55
Cohen’s κ | 73.1 ± 6.3 | 71.4 ± 5.6 | 0.54
Spearman’s ρ | 73.8 ± 6.0 | 71.7 ± 5.6 | 0.43
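Agreement numbers of this kind can be computed with standard libraries. The sketch below assumes two equal-length lists of binary (sentence, aspect) match decisions and is illustrative rather than the authors' evaluation code.

```python
# Agreement metrics between two sets of binary (sentence, aspect) match decisions.
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score
from scipy.stats import spearmanr

def agreement_metrics(labels_a, labels_b):
    """labels_a, labels_b: equal-length lists of 0/1 decisions (1 = sentence matches aspect)."""
    return {
        "exact_accuracy": accuracy_score(labels_a, labels_b),
        "f1_binary": f1_score(labels_a, labels_b),             # positive class = "match"
        "cohens_kappa": cohen_kappa_score(labels_a, labels_b),
        "spearman_rho": spearmanr(labels_a, labels_b).correlation,
    }

# Toy example: two annotation sources disagreeing on one of six decisions.
print(agreement_metrics([0, 0, 1, 0, 1, 0], [0, 0, 1, 0, 0, 0]))
```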
3.4 Task 3: Risk of Bias Support Judgment Selection
The previous task evaluates a model’s ability to extract information from biomedical papers in order to justify assessments of bias. However, this approach alone is insufficient for a risk-of-bias assessment. The evaluation must also include the model’s ability to reason about the retrieved information. For example, a paper might report the loss to follow-up yet fail to provide reasons for missing data, which raises the possibility of attrition bias. Similarly, while a paper may state that it implemented double-blinding, expert reviewers could consider such blinding infeasible due to the nature of the trial, potentially introducing performance bias.
To address this, we propose a multiple-choice task where the model selects the correct support judgment. This task includes three synthetically generated hard options, three options derived from other papers’ support judgments for the same bias category, and the Cochrane reviewer’s actual judgment. We prompt GPT-4 to generate the three synthetic options by imitating support judgments from other papers concerning the same type of bias. These options are tailored to be paper-specific while maintaining the underlying reasoning. See Appendix D for details on construction of these synthetic judgments.
In order to ensure that all distractor options were actually incorrect, we performed filtering in two stages. GPT-4 was used to perform preliminary filtering (the prompt is shown in Figure 13 of Appendix D). In the second stage, we manually reviewed all of the remaining options and removed any that were not genuinely incorrect.
Figure 4 shows an example of a multiple-choice question with one correct answer and three synthetically generated answers. Both options C and D refer to the same information from the paper, in this case the operational challenges affecting the trial’s methodology, but describe different reasoning given this information. This demonstrates that retrieving the correct information from the paper is not sufficient for solving this task.

3.5 Task 4: Risk Level Determination
The final task directly evaluates a model’s ability to assess the risk of bias in biomedical research papers. It requires the model to categorize the risk level of a specific bias as high, low, or unclear. The input to the model includes the whole paper, the PICO of the study, the objective of the meta-analysis, and the definition of the bias, as illustrated in Figure 5.
The review authors use bias definitions from the Cochrane Handbook 5.1 Higgins and Green [2011], the Cochrane Effective Practice and Organisation of Care criteria Practice and of Care [EPOC], A Cochrane Risk of Bias Assessment Tool Sterne et al. [2014], or specific biases that are defined within the meta-analysis. For the last category, we extract the bias definition from the specific meta-analysis. (Bias definitions are also extracted for Tasks 2 and 3.)
This assessment requires synthesizing various elements identified in the previous tasks, including the accurate retrieval and interpretation of relevant sentences and the ability to critically evaluate the sufficiency and reliability of the evidence presented regarding each bias.

4 Experiments
4.1 Experimental Models and Procedures
We evaluate GPT-4o OpenAI [2024], Claude3-Opus Anthropic [2024a], Claude3.5-Sonnet Anthropic [2024b], and Gemini-1.5-Pro Google [2024]. Due to the long-context nature of RoBBR’s tasks, with many datapoints exceeding 16k tokens, we considered several open-source LLMs AI [2024], Poli et al. [2023], Li et al. [2023], but none of them is equipped to handle RoBBR’s context lengths.
We use chain-of-thought prompting to stabilize model performance. We optimize our prompts on a separate development set. See Appendix E for details on prompt optimization.
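As an illustration only, a chain-of-thought task prompt can be wrapped and parsed along the following lines; the instruction wording and the "Answer:" convention are placeholders, and the actual prompts are documented in Appendix E.

```python
# Illustrative chain-of-thought wrapper and answer parser (not the RoBBR prompts).
COT_TEMPLATE = (
    "{task_input}\n\n"
    "Think step by step about the study's methodology before answering. "
    "End your response with a final line of the form 'Answer: <label>'."
)

def parse_final_answer(model_output: str) -> str:
    """Return the label from the last 'Answer:' line of a chain-of-thought response."""
    for line in reversed(model_output.strip().splitlines()):
        if line.lower().startswith("answer:"):
            return line.split(":", 1)[1].strip()
    return "UNPARSED"

print(parse_final_answer("The dropout rate is high...\nAnswer: high risk"))  # -> "high risk"
```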
4.2 Evaluation metrics
For Task 1, study inclusion/exclusion, we report macro-F1, accuracy, and weighted accuracy.
For Task 2, Risk of Bias Sentence Retrieval, we report aspect-level recall metrics, which measure the proportion of bias aspects that are covered by the retrieved sentences. Specifically, Bias Recall (BR) @3 measures the proportion of aspects covered when three sentences are retrieved. BR @Optimal measures recall when the “optimal” number of sentences is retrieved, where the optimal number is defined as the fewest sentences sufficient to cover all bias aspects.
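A minimal sketch of these recall metrics is shown below, assuming an aspect-to-sentence mapping of the kind released as aspect2sentence_indices (Appendix A.5); the bias_recall helper and the toy data are illustrative.

```python
# Aspect-level recall: fraction of aspects covered by the top-k retrieved sentences.
def bias_recall(aspect2sentences: dict[str, set[int]], retrieved: list[int], k: int) -> float:
    top_k = set(retrieved[:k])
    covered = sum(1 for sents in aspect2sentences.values() if sents & top_k)
    return covered / len(aspect2sentences)

# Toy example: three aspects, each supported by one or more sentence indices.
aspect2sentences = {"a1": {4, 17}, "a2": {42}, "a3": {42, 88}}
retrieved = [42, 3, 17, 9]                            # model's ranked sentence indices
print(bias_recall(aspect2sentences, retrieved, k=3))  # BR@3 = 3/3
# For BR@Optimal, k is the smallest number of sentences that can cover every
# aspect (here 2, e.g. sentences 42 and 17), independent of the model's ranking.
print(bias_recall(aspect2sentences, retrieved, k=2))  # BR@2 = 2/3 for this ranking
```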
For Task 3, Risk of Bias Support Judgment Selection, we report accuracy.
For Task 4, Risk Level Determination, we report accuracy and weighted accuracy.
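Weighted accuracy is read here as class-balanced accuracy, i.e. the mean of per-class recalls; this interpretation is an assumption, illustrated below with toy risk-level labels.

```python
# Plain vs. class-balanced ("weighted") accuracy on toy risk-level labels.
# The class-balanced interpretation is an assumption, not a definition from the paper.
from sklearn.metrics import accuracy_score, balanced_accuracy_score

y_true = ["low", "low", "high", "unclear", "low", "high"]
y_pred = ["low", "low", "high", "low",     "low", "unclear"]

print(accuracy_score(y_true, y_pred))           # 0.67: fraction of exact matches
print(balanced_accuracy_score(y_true, y_pred))  # 0.50: average recall over the three risk levels
```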
4.3 Categorization
For Tasks 2, 3, and 4, we consider 8 categories of biases: selection bias, attrition bias, performance bias, detection bias, reporting bias, analysis bias, special bias (the biases that are meta-analysis specific), and other bias (i.e. biases that do not fall into any of the other categories). In order to reduce evaluation costs, we randomly sampled 50 task instances for some categories. It is important to note that systematic reviews vary in which biases they report. When the authors judge that a bias is not important for the objective of a review, they may decide not to evaluate a specific bias. As a result, not all categories have the same number of examples, and some categories have fewer than 50 examples. See Table 10 for detailed statistics.
5 Results
Table 2: Task 1 (Study Inclusion/Exclusion) results.

Model | F1 | Accuracy | Weighted Accuracy
---|---|---|---
Claude3-Opus | 75.5 | 77.1 | 74.9
Claude3-Sonnet | 80.6 | 81.1 | 80.8
Gemini-1.5-Pro | 72.2 | 72.7 | 73.0
GPT-4o | 75.5 | 75.6 | 77.6
Table 3: Task 2 (Risk of Bias Sentence Retrieval) results by bias type.

Model | Full | Selection | Attrition | Performance | Detection | Reporting | Analysis | Special | Other
 | n = 337 | n = 117 | n = 57 | n = 50 | n = 50 | n = 35 | n = 13 | n = 7 | n = 23
---|---|---|---|---|---|---|---|---|---
Bias Recall@Optimal
Claude3-Opus | 39.0 | 55.4 | 22.0 | 57.3 | 46.4 | 11.0 | 47.9 | 13.9 | 4.4
Gemini-1.5-Pro | 41.1 | 57.2 | 28.7 | 49.1 | 48.8 | 23.3 | 54.2 | 13.9 | 3.3
GPT-4o | 44.8 | 60.3 | 28.6 | 64.6 | 48.7 | 32.9 | 28.7 | 28.0 | 2.3
Bias Recall@3
Claude3-Opus | 52.3 | 67.9 | 40.2 | 63.1 | 55.9 | 25.5 | 55.9 | 64.6 | 19.7
Gemini-1.5-Pro | 56.1 | 69.7 | 44.1 | 73.7 | 65.3 | 35.6 | 56.3 | 35.5 | 6.7
GPT-4o | 61.8 | 72.2 | 47.9 | 82.0 | 78.5 | 43.2 | 67.9 | 50.4 | 10.3
5.1 Performance on the Four Tasks
In Task 1 (Study Inclusion/Exclusion), shown in Table 2, Claude3-Opus and GPT-4o perform best in terms of F1 score. Claude3-Opus has higher accuracy and GPT-4o higher weighted accuracy.
In Task 2 (Risk of Bias Sentence Retrieval; Table 3), GPT-4o has stronger results than both Gemini-1.5-Pro and Claude3-Opus for almost all bias types. However, in Task 3 (Support Judgment Selection; Table 4), Claude3-Opus has the strongest results.
Tasks 2 and 3 are critical for evaluating the different capabilities of the LLMs. Task 2 is a long-context task, requiring retrieval of a non-redundant set of sentences from a 200-sentence paper. Task 3 requires complex reasoning to evaluate the retrieved sentences and identify the implications for risk of bias. Our results demonstrate that GPT-4o performs better in long-context retrieval tasks, while Claude3-Opus exhibits stronger reasoning abilities.
Figure 4 illustrates Claude3-Opus’s stronger reasoning skills. Options C and D present different reasoning based on the same information from the paper. Notably, GPT-4o is misled by a deceptive option, whereas Claude3-Opus correctly handles the multiple-choice question.
Task 4 (Risk Level Determination; Table 5) provides an overall assessment of each model’s ability to determine the risk level, integrating all information from the previous tasks. The three models show comparable performance in this task.
Table 4: Task 3 (Risk of Bias Support Judgment Selection) results by bias type.

Model | Full | Selection | Attrition | Performance | Detection | Reporting | Analysis | Special
 | n = 310 | n = 86 | n = 60 | n = 52 | n = 56 | n = 50 | n = 8 | n = 6
---|---|---|---|---|---|---|---|---
Accuracy
Claude3-Opus | 51.2 | 62.6 | 56.7 | 48.0 | 46.5 | 44.2 | 50.0 | 0.0
Claude3.5-Sonnet | 60.2 | 69.8 | 79.9 | 55.8 | 44.8 | 51.8 | 62.5 | 0.0
Gemini-1.5-Pro | 49.3 | 57.1 | 59.9 | 46.4 | 42.8 | 40.2 | 49.3 | 16.6
GPT-4o | 45.8 | 52.3 | 63.1 | 42.5 | 37.7 | 28.0 | 74.5 | 0.0
Table 5: Task 4 (Risk Level Determination) results by bias type.

Model | Full | Selection | Attrition | Performance | Detection | Reporting | Analysis | Special | Other
 | n = 382 | n = 93 | n = 50 | n = 50 | n = 50 | n = 50 | n = 18 | n = 26 | n = 50
---|---|---|---|---|---|---|---|---|---
F1
Claude3-Opus | 47.7 | 54.3 | 34.1 | 46.7 | 37.4 | 23.0 | 69.9 | 27.9 | 27.9
Claude3.5-Sonnet | 49.7 | 62.7 | 23.9 | 49.9 | 46.7 | 30.1 | 48.3 | 28.7 | 33.8
Gemini-1.5-Pro | 43.1 | 53.9 | 26.0 | 43.7 | 46.8 | 30.4 | 44.6 | 20.1 | 22.8
GPT-4o | 51.9 | 59.0 | 36.5 | 48.9 | 36.0 | 27.7 | 53.8 | 34.7 | 29.8
Accuracy
Claude3-Opus | 56.8 | 64.6 | 58.1 | 48.2 | 41.7 | 40.0 | 88.9 | 53.8 | 72.2
Claude3.5-Sonnet | 62.3 | 64.4 | 52.0 | 52.1 | 55.9 | 64.3 | 83.0 | 69.5 | 72.1
Gemini-1.5-Pro | 49.3 | 62.5 | 42.4 | 46.5 | 48.1 | 56.0 | 77.5 | 42.4 | 23.7
GPT-4o | 62.6 | 71.0 | 68.3 | 54.4 | 40.1 | 57.6 | 88.7 | 65.4 | 62.1
Weighted Accuracy
Claude3-Opus | 49.5 | 55.3 | 36.2 | 48.0 | 46.5 | 30.4 | 69.3 | 28.2 | 32.4
Claude3.5-Sonnet | 50.3 | 63.1 | 24.1 | 51.6 | 48.2 | 32.1 | 46.8 | 36.1 | 36.1
Gemini-1.5-Pro | 47.9 | 56.3 | 34.0 | 44.4 | 54.2 | 36.4 | 43.6 | 22.2 | 32.5
GPT-4o | 53.6 | 60.4 | 36.3 | 51.3 | 40.7 | 30.0 | 50.0 | 64.3 | 32.6
5.2 Performance across Bias Types
The three models exhibit significant variations in performance across different biases due to the varying challenges associated with each bias type. All models have stronger performance at assessing selection bias, as biomedical papers typically detail their randomization processes explicitly.
In contrast, assessing reporting bias requires a comprehensive understanding of the entire paper to determine whether all outcomes specified in the study protocol are reported. Evaluating attrition bias involves extracting data on participant dropout during the trial and assessing how this dropout impacts the study’s statistical estimates. The most challenging bias type to assess is "other bias," where the model must grasp all potential biases, such as those stemming from funding sources or bias due to lack of power. Successfully identifying them requires a thorough understanding of all possible biases a study might have, thereby necessitating the treatment of each bias type as a distinct task.
6 Related Work
Scientific and Biomedical Information Retrieval has been the subject of a significant body of work in recent years Jin et al. [2023], MacAvaney et al. [2020], Wang et al. [2023]. A subset of this work has been focused on benchmarks Ben Abacha and Demner-Fushman [2019], Krithara et al. [2023], Voorhees et al. [2021], Cohen et al. [2017].
Study Screening Many works have focused on the task of screening studies for inclusion in systematic reviews Cohen et al. [2006], Khabsa et al. [2016], Kontonatsios et al. [2017], El-Gayar et al. [2015], with more recent work often focusing on LLMs Wilkins [2023], Tran et al. , Guo et al. [2024], Na et al. [2024], Cai et al. [2023], Ye et al. [2024], Wang et al. [2024]. Huotala et al. [2024] conduct an evaluation of LLM paper screening, finding that few-shot chain-of-thought prediction performs comparably to human annotators. Robinson et al. [2023] leverage an instruction-tuned Guanaco-7B model on a large dataset of systematic reviews to exceed the screening performance of ChatGPT. Cohen et al. [2006] produce a dataset of screening judgments for 15 systematic drug reviews.
Study Risk of Bias Risk of bias identification requires a thorough reading of an entire paper. Marshall et al. [2015, 2016] train an SVM model on documents annotated for risk of bias, creating the RobotReviewer system to predict risk of bias levels. Wang et al. [2022b] utilize a BERT-based pipeline for risk of bias prediction. However, several other works find lower agreement rates between LLMs and human annotators Lai et al. [2024], Hasan et al. [2024], Barsby et al. [2024].
Other LLM Extraction Efforts Recent years have seen LLMs benchmarked in extracting "ground truth" from systematic reviews, particularly in the context of data extraction Sun et al. [2024], Gartlehner et al. [2023], multi-document summaries Wang et al. [2022a], Wallace et al. [2020] and relevance prediction Al-Hussaini et al. [2022]. Another use case has been the extraction of PICOs (Participants, Interventions, Comparisons, Outcomes) from studies. While earlier methods achieved strong results Zafar et al. [2023], Hu et al. [2023], Wallace et al. [2016], Dhrangadhariya and Müller [2023], Jin and Szolovits [2020], Kang et al. [2019], a body of recent work has focused on extracting PICOs with LLMs. Fine-tuning Wadhwa et al. [2023] and prompt tuning Tang et al. [2024] have been used to enhance extraction.
Evaluation of Zero-shot LLM Extraction Kartchner et al. [2023] show that ChatGPT performed inconsistently on various meta-analysis extraction tasks. For instance, while it correctly extracted cancer type roughly 90% of the time, treatment type accuracy fell below 25%. This inconsistency is a strong motivation to understand LLM reliability on meta-analysis extraction tasks more comprehensively.
Limitation Tasks in Scientific Documents Early efforts in assessing papers’ limitations focused on performing summary extraction and evaluating the result using human evaluation Cohan et al. [2018, 2022], Liu and Shah [2023].
For understanding scientific documents, identifying and retrieving a paper’s limitations is essential for assessing the utility and viability of the claims and evidence presented in the paper. Faizullah et al. [2024] evaluate the ability of fine-tuned embedding models and LLMs to generate limitations, using the limitations extracted from the original paper.
Marshall et al. [2016] use direct quotes in risk of bias judgments to obtain annotations for review authors’ judgments, where exact matching texts are labeled as relevant to a risk of bias. Suster et al. [2023] present a quality assessment task that assesses a biomedical paper using the Grading of Recommendations, Assessment, Development, and Evaluation (GRADE) framework with justifications.
7 Conclusion
We have presented RoBBR, a benchmark for assessing the ability to judge risk of bias in biomedical papers. The benchmark is based on gold-standard risk of bias assessments from Cochrane, and uses a human-validated annotation pipeline for scalably constructing task examples. The four benchmark tasks evaluate different skills required for these assessments, including retrieval and reasoning.
Performance of state-of-the-art language models on the benchmark tasks reveals that current models fall significantly short of expert-level performance. By providing a standardized tool for measuring risk of bias judgments, the RoBBR benchmark can help guide the development of AI systems that aim to automatically assess study quality and synthesize scientific evidence. As systems that review the scientific literature are developed, it is important that they can reliably judge the strength of evidence in order to draw sound conclusions. ML systems which perform well on this benchmark could potentially reduce the time required for risk of bias assessments in systematic reviews, which currently averages nearly six hours per study Crocker et al. [2023]. We hope the RoBBR benchmark will spur further research into developing trustworthy systems for this task.
8 Limitations
The RoBBR benchmark has several limitations. The source of all bias judgments in the benchmark is Cochrane meta-analyses. While these are considered very high quality meta-analyses among biomedical researchers, the bias judgments that they contain are likely to have occasional errors. In addition, there can be disagreements in judgment among different meta-analysis authors.
The Cochrane risk of bias framework does not necessarily identify all types of bias in research studies. The framework uses a predefined set of bias categories, which may not be appropriate for all studies.
The benchmark only includes interventional studies, and therefore excludes observational studies. Assessment of risk of bias in observational studies requires a different procedure than for interventional studies. Our benchmark does not measure performance for observational studies.
Only a limited number of prompting strategies were evaluated for the LLMs. It is likely that improved prompting and task presentation would improve performance on the tasks.
References
- Ahn and Kang [2018] E. Ahn and H. Kang. Introduction to systematic review and meta-analysis. Korean J. Anesthesiol., 71(2):103–112, Apr. 2018.
- AI [2024] M. AI. Meta llama 3, 2024. URL https://ai.meta.com/blog/meta-llama-3/.
- Al-Hussaini et al. [2022] I. Al-Hussaini, D. Nakajima An, A. J. Lee, S. Bi, and C. S. Mitchell. Ccs explorer: Relevance prediction, extractive summarization, and named entity recognition from clinical cohort studies. In 2022 IEEE International Conference on Big Data (Big Data). IEEE, Dec. 2022. doi: 10.1109/bigdata55660.2022.10020807. URL http://dx.doi.org/10.1109/BigData55660.2022.10020807.
- Alderson and Tan [2011] P. Alderson and T. Tan. The use of cochrane reviews in nice clinical guidelines. Cochrane Database of Systematic Reviews, (8), 2011. ISSN 1465-1858. doi: 10.1002/14651858.ED000032. URL https://doi.org//10.1002/14651858.ED000032.
- Anthropic [2024a] Anthropic. Introducing the next generation of claude, 2024a. URL https://www.anthropic.com/news/claude-3-family. Accessed: 2024-05-22.
- Anthropic [2024b] Anthropic. Claude 3.5 sonnet, 2024b. URL https://www.anthropic.com/news/claude-3-5-sonnet. Accessed: 2024-07-14.
- Aslam and Emmanuel [2010] S. Aslam and P. Emmanuel. Formulating a researchable question: A critical step for facilitating good clinical research. Indian J. Sex. Transm. Dis. AIDS, 31(1):47–50, Jan. 2010.
- Barsby et al. [2024] J. Barsby, S. Hume, H. A. Lemmey, J. Cutteridge, R. Lee, and K. D. Bera. Pilot study on large language models for risk-of-bias assessments in systematic reviews: A(i) new type of bias? BMJ Evidence-Based Medicine, 2024. ISSN 2515-446X. doi: 10.1136/bmjebm-2024-112990. URL https://ebm.bmj.com/content/early/2024/05/23/bmjebm-2024-112990.
- Ben Abacha and Demner-Fushman [2019] A. Ben Abacha and D. Demner-Fushman. A question-entailment approach to question answering. BMC Bioinformatics, 20(1):511, Oct. 2019.
- Berger [2005] V. W. Berger. Quantifying the magnitude of baseline covariate imbalances resulting from selection bias in randomized clinical trials. Biom. J., 47(2):119–27; discussion 128–39, Apr. 2005.
- Bunn et al. [2015] F. Bunn, D. Trivedi, P. Alderson, L. Hamilton, A. Martin, E. Pinkney, and S. Iliffe. The impact of cochrane reviews: a mixed-methods evaluation of outputs from cochrane review groups supported by the national institute for health research. Health technology assessment (Winchester, England), 19(28):1—99, v—vi, April 2015. ISSN 1366-5278. doi: 10.3310/hta19280. URL http://europepmc.org/books/NBK285310.
- C et al. [2023] L. C, G. J, B. S, F. R, L. A, M. M-I, N.-S. A, P. R, R. T, T. J, and W. LS. Chapter 4: Searching for and selecting studies. In: Higgins JPT, 6.(4), Oct. 2023.
- Cai et al. [2023] X. Cai, Y. Geng, Y. Du, B. Westerman, D. Wang, C. Ma, and J. J. G. Vallejo. Utilizing chatgpt to select literature for meta-analysis shows workload reduction while maintaining a similar recall level as manual curation. medRxiv, 2023. doi: 10.1101/2023.09.06.23295072. URL https://www.medrxiv.org/content/early/2023/09/07/2023.09.06.23295072.
- Cohan et al. [2018] A. Cohan, F. Dernoncourt, D. S. Kim, T. Bui, S. Kim, W. Chang, and N. Goharian. A discourse-aware attention model for abstractive summarization of long documents. In M. Walker, H. Ji, and A. Stent, editors, Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 615–621, New Orleans, Louisiana, June 2018. Association for Computational Linguistics. doi: 10.18653/v1/N18-2097. URL https://aclanthology.org/N18-2097.
- Cohan et al. [2022] A. Cohan, G. Feigenblat, T. Ghosal, and M. Shmueli-Scheuer. Overview of the first shared task on multi perspective scientific document summarization (MuP). In A. Cohan, G. Feigenblat, D. Freitag, T. Ghosal, D. Herrmannova, P. Knoth, K. Lo, P. Mayr, M. Shmueli-Scheuer, A. de Waard, and L. L. Wang, editors, Proceedings of the Third Workshop on Scholarly Document Processing, pages 263–267, Gyeongju, Republic of Korea, Oct. 2022. Association for Computational Linguistics. URL https://aclanthology.org/2022.sdp-1.32.
- Cohen et al. [2006] A. M. Cohen, W. R. Hersh, K. Peterson, and P.-Y. Yen. Reducing workload in systematic review preparation using automated citation classification. J. Am. Med. Inform. Assoc., 13(2):206–219, Mar. 2006.
- Cohen et al. [2017] T. Cohen, K. Roberts, A. E. Gururaj, X. Chen, S. Pournejati, G. Alter, W. R. Hersh, D. Demner-Fushman, L. Ohno-Machado, and H. Xu. A publicly available benchmark for biomedical dataset retrieval: the reference standard for the 2016 bioCADDIE dataset retrieval challenge. Database, 2017:bax061, 08 2017. ISSN 1758-0463. doi: 10.1093/database/bax061. URL https://doi.org/10.1093/database/bax061.
- Crocker et al. [2023] T. F. Crocker, N. Lam, M. Jordão, C. Brundle, M. Prescott, A. Forster, J. Ensor, J. Gladman, and A. Clegg. Risk-of-bias assessment using cochrane’s revised tool for randomized trials (rob 2) was useful but challenging and resource-intensive: observations from a systematic review. Journal of Clinical Epidemiology, 161:39–45, 2023. ISSN 0895-4356. doi: https://doi.org/10.1016/j.jclinepi.2023.06.015. URL https://www.sciencedirect.com/science/article/pii/S0895435623001634.
- Dhrangadhariya and Müller [2023] A. Dhrangadhariya and H. Müller. Not so weak PICO: leveraging weak supervision for participants, interventions, and outcomes recognition for systematic review automation. JAMIA Open, 6(1):ooac107, Apr. 2023.
- El-Gayar et al. [2015] O. F. El-Gayar, J. Liu, and P. Timsina. Active learning for the automation of medical systematic review creation. In AMCIS. Association for Information Systems, 2015.
- Faizullah et al. [2024] A. R. B. M. Faizullah, A. Urlana, and R. Mishra. Limgen: Probing the llms for generating suggestive limitations of research papers, 2024.
- Gartlehner et al. [2023] G. Gartlehner, L. Kahwati, R. Hilscher, I. Thomas, S. Kugley, K. Crotty, M. Viswanathan, B. Nussbaumer-Streit, G. Booth, N. Erskine, A. Konet, and R. Chew. Data extraction for evidence synthesis using a large language model: A proof-of-concept study. Research Synthesis Methods, n/a(n/a), 2023. doi: https://doi.org/10.1002/jrsm.1710. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/jrsm.1710.
- Goldkuhle et al. [2018] M. Goldkuhle, V. M. Narayan, A. Weigl, P. Dahm, and N. Skoetz. A systematic assessment of cochrane reviews and systematic reviews published in high-impact medical journals related to cancer. BMJ Open, 8(3), 2018. ISSN 2044-6055. doi: 10.1136/bmjopen-2017-020869. URL https://bmjopen.bmj.com/content/8/3/e020869.
- Gonzalez et al. [2015] U. Gonzalez, M. Pinart, D. Sinclair, A. Firooz, C. Enk, I. D. Velez, T. M. Esterhuizen, M. Tristan, and J. Alvar. Vector and reservoir control for preventing leishmaniasis. Cochrane Database of Systematic Reviews, 2015(8), Aug. 2015. ISSN 1465-1858. doi: 10.1002/14651858.cd008736.pub2. URL http://dx.doi.org/10.1002/14651858.CD008736.pub2.
- Google [2024] Google. Introducing gemini 1.5, google’s next-generation ai model, 2024. URL https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/#sundar-note. Accessed: 2024-05-22.
- Guevara et al. [2004] J. P. Guevara, J. A. Berlin, and F. M. Wolf. Meta-analytic methods for pooling rates when follow-up duration varies: a case study. BMC Med. Res. Methodol., 4:17, July 2004.
- Guo et al. [2024] E. Guo, M. Gupta, J. Deng, Y.-J. Park, M. Paget, and C. Naugler. Automated paper screening for clinical reviews using large language models: Data analysis study. J. Med. Internet Res., 26:e48996, Jan. 2024.
- Hasan et al. [2024] B. Hasan, S. Saadi, N. S. Rajjoub, M. Hegazi, M. Al-Kordi, F. Fleti, M. Farah, I. B. Riaz, I. Banerjee, Z. Wang, and M. H. Murad. Integrating large language models in systematic reviews: a framework and case study using robins-i for risk of bias assessment. BMJ Evidence-Based Medicine, 2024. ISSN 2515-446X. doi: 10.1136/bmjebm-2023-112597. URL https://ebm.bmj.com/content/early/2024/02/21/bmjebm-2023-112597.
- Higgins and Green [2011] J. Higgins and S. Green, editors. Cochrane Handbook for Systematic Reviews of Interventions Version 5.1.0 (updated March 2011). The Cochrane Collaboration, 2011. URL training.cochrane.org/handbook/archive/v5.1/.
- Higgins and Thompson [2002] J. P. T. Higgins and S. G. Thompson. Quantifying heterogeneity in a meta-analysis. Stat. Med., 21(11):1539–1558, June 2002.
- Higgins et al. [2011] J. P. T. Higgins, D. G. Altman, P. C. Gøtzsche, P. Jüni, D. Moher, A. D. Oxman, J. Savovic, K. F. Schulz, L. Weeks, J. A. C. Sterne, Cochrane Bias Methods Group, and Cochrane Statistical Methods Group. The cochrane collaboration’s tool for assessing risk of bias in randomised trials. BMJ, 343(oct18 2):d5928, Oct. 2011.
- Hu et al. [2023] Y. Hu, V. K. Keloth, K. Raja, Y. Chen, and H. Xu. Towards precise PICO extraction from abstracts of randomized controlled trials using a section-specific learning approach. Bioinformatics, 39(9):btad542, 09 2023. ISSN 1367-4811. doi: 10.1093/bioinformatics/btad542. URL https://doi.org/10.1093/bioinformatics/btad542.
- Huotala et al. [2024] A. Huotala, M. Kuutila, P. Ralph, and M. Mäntylä. The promise and challenges of using llms to accelerate the screening process of systematic reviews, 2024.
- I et al. [2023] B. I, P. MJ, H. JPT, A. DG, L. A, and H. A. Chapter 7: Considering bias and conflicts of interest among the included studies. In H. JPT, T. J, C. J, C. M, L. T, P. MJ, and W. VA, editors, Cochrane Handbook for Systematic Reviews of Interventions. Cochrane, 2023.
- J et al. [2023] T. J, K. D, M. JE, B. SE, and B. S. Chapter 2: Determining the scope of the review and the questions it will address. In: Higgins JPT, 6.(4), Aug. 2023.
- JE et al. [2023a] M. JE, B. SE, R. RE, T. HJ, and J. RV. Chapter 9: Summarizing study characteristics and preparing for synthesis. In: Higgins JPT, 6.(4), Aug. 2023a.
- JE et al. [2023b] M. JE, B. SE, R. RE, T. HJ, J. RV, and T. J. Chapter 3: Defining the criteria for including studies and how they will be grouped for the synthesis. In: Higgins JPT, 6.(4), Aug. 2023b.
- Jin and Szolovits [2020] D. Jin and P. Szolovits. Advancing PICO element detection in biomedical text via deep neural networks. Bioinformatics, 36(12):3856–3862, 04 2020. ISSN 1367-4803. doi: 10.1093/bioinformatics/btaa256. URL https://doi.org/10.1093/bioinformatics/btaa256.
- Jin et al. [2023] Q. Jin, W. Kim, Q. Chen, D. C. Comeau, L. Yeganova, W. J. Wilbur, and Z. Lu. MedCPT: Contrastive Pre-trained Transformers with large-scale PubMed search logs for zero-shot biomedical information retrieval. Bioinformatics, 39(11):btad651, 11 2023. ISSN 1367-4811. doi: 10.1093/bioinformatics/btad651. URL https://doi.org/10.1093/bioinformatics/btad651.
- JJ et al. [2023] D. JJ, H. JPT, and A. D. (editors). Chapter 10: Analysing data and undertaking meta-analyses. In: Higgins JPT, 6.(4), Aug. 2023.
- JPT et al. [2023] H. JPT, S. J, P. MJ, E. RG, and S. JAC. Chapter 8: Assessing risk of bias in a randomized trial. In H. JPT, T. J, C. J, C. M, L. T, P. MJ, and W. VA, editors, Cochrane Handbook for Systematic Reviews of Interventions. Cochrane, Aug. 2023.
- Jpt et al. [2023] H. Jpt, J. Thomas, J. Chandler, M. Cumpston, T. Li, M. J. Page, and V. A. Welch, editors. Cochrane Handbook for Systematic Reviews of Interventions version 6.4 (updated August 2023). Cochrane Available from www.training.cochrane.org/handbook, 2023.
- Kang et al. [2019] T. Kang, S. Zou, and C. Weng. Pretraining to recognize PICO elements from randomized controlled trial literature. Stud. Health Technol. Inform., 264:188–192, Aug. 2019.
- Kartchner et al. [2023] D. Kartchner, S. Ramalingam, I. Al-Hussaini, O. Kronick, and C. Mitchell. Zero-shot information extraction for clinical meta-analysis using large language models. In D. Demner-fushman, S. Ananiadou, and K. Cohen, editors, The 22nd Workshop on Biomedical Natural Language Processing and BioNLP Shared Tasks, pages 396–405, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.bionlp-1.37. URL https://aclanthology.org/2023.bionlp-1.37.
- Khabsa et al. [2016] M. Khabsa, A. Elmagarmid, I. Ilyas, H. Hammady, and M. Ouzzani. Learning to identify relevant studies for systematic reviews using random forest and external information. Mach. Learn., 102(3):465–482, Mar. 2016.
- Kontonatsios et al. [2017] G. Kontonatsios, A. J. Brockmeier, P. Przybyła, J. McNaught, T. Mu, J. Y. Goulermas, and S. Ananiadou. A semi-supervised approach using label propagation to support citation screening. J. Biomed. Inform., 72:67–76, Aug. 2017.
- Kotz and West [2022] D. Kotz and R. West. Key concepts in clinical epidemiology: addressing and reporting sources of bias in randomized controlled trials. J. Clin. Epidemiol., 143:197–201, Mar. 2022.
- Krithara et al. [2023] A. Krithara, A. Nentidis, K. Bougiatiotis, and G. Paliouras. BioASQ-QA: A manually curated corpus for biomedical question answering. Sci. Data, 10(1):170, Mar. 2023.
- Lai et al. [2024] H. Lai, L. Ge, M. Sun, B. Pan, J. Huang, L. Hou, Q. Yang, J. Liu, J. Liu, Z. Ye, D. Xia, W. Zhao, X. Wang, M. Liu, J. R. Talukdar, J. Tian, K. Yang, and J. Estill. Assessing the risk of bias in randomized clinical trials with large language models. JAMA Netw. Open, 7(5):e2412687, May 2024.
- Li et al. [2023] D. Li, R. Shao, A. Xie, Y. Sheng, L. Zheng, J. E. Gonzalez, I. Stoica, X. Ma, and H. Zhang. How long can open-source llms truly promise on context length?, June 2023. URL https://lmsys.org/blog/2023-06-29-longchat.
- Liu and Shah [2023] R. Liu and N. B. Shah. Reviewergpt? an exploratory study on using large language models for paper reviewing. arXiv preprint arXiv:2306.00622, 2023. URL https://doi.org/10.48550/arXiv.2306.00622.
- MacAvaney et al. [2020] S. MacAvaney, A. Cohan, and N. Goharian. Sledge: A simple yet effective baseline for covid-19 scientific knowledge search, 2020.
- Marshall et al. [2015] I. J. Marshall, J. Kuiper, and B. C. Wallace. Automating risk of bias assessment for clinical trials. IEEE J. Biomed. Health Inform., 19(4):1406–1412, July 2015.
- Marshall et al. [2016] I. J. Marshall, J. Kuiper, and B. C. Wallace. RobotReviewer: evaluation of a system for automatically assessing bias in clinical trials. J Am Med Inform Assoc, 23(1):193–201, Jan. 2016.
- Matthias et al. [2020] K. Matthias, O. Rissling, D. Pieper, J. Morche, M. Nocon, A. Jacobs, U. Wegewitz, J. Schirm, and R. C. Lorenz. The methodological quality of systematic reviews on the treatment of adult major depression needs improvement according to amstar 2: A cross-sectional study. Heliyon, 6(9):e04776, 2020. doi: 10.1016/j.heliyon.2020.e04776. URL https://doi.org/10.1016/j.heliyon.2020.e04776.
- Na et al. [2024] C. B. Na, G. Sinanian, N. Gimpaya, A. Mokhtar, D. Chopra, M. Scaffidi, E. Yeung, and S. Grover. A105 pilot study on the accuracy of chatgpt in article screening for systematic reviews in gastroenterology. Journal of the Canadian Association of Gastroenterology, 7:76–78, 02 2024. ISSN 2515-2084. doi: 10.1093/jcag/gwad061.105. URL https://doi.org/10.1093/jcag/gwad061.105.
- NHMRC [2019] NHMRC. Guidelines for guidelines: Assessing risk of bias. https://nhmrc.gov.au/guidelinesforguidelines/develop/assessing-risk-bias, 2019. Last published 29 August 2019.
- OpenAI [2023] OpenAI. Gpt-4 technical report, 2023.
- OpenAI [2024] OpenAI. Hello gpt-4o, 2024. URL https://openai.com/index/hello-gpt-4o/. Accessed: 2024-05-22.
- Page et al. [2018] M. J. Page, D. G. Altman, L. Shamseer, J. E. McKenzie, N. Ahmadzai, D. Wolfe, F. Yazdi, F. Catalá-López, A. C. Tricco, and D. Moher. Reproducible research practices are underused in systematic reviews of biomedical interventions. Journal of Clinical Epidemiology, 94:8–18, 2018. ISSN 0895-4356. doi: https://doi.org/10.1016/j.jclinepi.2017.10.017. URL https://www.sciencedirect.com/science/article/pii/S0895435617305358.
- Page et al. [2021] M. J. Page, J. E. McKenzie, P. M. Bossuyt, I. Boutron, T. C. Hoffmann, C. D. Mulrow, L. Shamseer, J. M. Tetzlaff, E. A. Akl, S. E. Brennan, R. Chou, J. Glanville, J. M. Grimshaw, A. Hróbjartsson, M. M. Lalu, T. Li, E. W. Loder, E. Mayo-Wilson, S. McDonald, L. A. McGuinness, L. A. Stewart, J. Thomas, A. C. Tricco, V. A. Welch, P. Whiting, and D. Moher. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. Syst. Rev., 10(1):89, Mar. 2021.
- Poli et al. [2023] M. Poli, J. Wang, S. Massaroli, J. Quesnelle, R. Carlow, E. Nguyen, and A. Thomas. StripedHyena: Moving Beyond Transformers with Hybrid Signal Processing Models, 12 2023. URL https://github.com/togethercomputer/stripedhyena.
- Practice and of Care [EPOC] Cochrane Effective Practice and Organisation of Care (EPOC). Suggested risk of bias criteria for EPOC reviews, 2017.
- Robinson et al. [2023] A. Robinson, W. Thorne, B. P. Wu, A. Pandor, M. Essat, M. Stevenson, and X. Song. Bio-sieve: Exploring instruction tuning large language models for systematic review automation, 2023.
- Sterne et al. [2014] J. Sterne, J. Higgins, and B. Reeves, on behalf of the development group for ACROBAT-NRSI. A cochrane risk of bias assessment tool: for non-randomized studies of interventions (acrobat-nrsi), version 1.0.0, 2014.
- Sun et al. [2024] Z. Sun, R. Zhang, S. A. Doi, L. Furuya-Kanamori, T. Yu, L. Lin, and C. Xu. How good are large language models for automated data extraction from randomized trials? medRxiv, 2024. doi: 10.1101/2024.02.20.24303083.
- Suster et al. [2023] S. Suster, T. Baldwin, J. Lau, A. Jimeno Yepes, D. Martinez Iraola, Y. Otmakhova, and K. Verspoor. Automating quality assessment of medical evidence in systematic reviews: Model development and validation study. J Med Internet Res, 25:e35568, 2023. doi: 10.2196/35568. URL https://www.jmir.org/2023/1/e35568.
- Tang et al. [2024] Y. Tang, Z. Xiao, X. Li, Q. Zhang, E. W. Chan, and I. C. Wong. Large language model in medical information extraction from titles and abstracts with prompt engineering strategies: A comparative study of gpt-3.5 and gpt-4. medRxiv, 2024. doi: 10.1101/2024.03.20.24304572. URL https://www.medrxiv.org/content/early/2024/03/21/2024.03.20.24304572.
- Tawfik et al. [2019] G. M. Tawfik, K. A. S. Dila, M. Y. F. Mohamed, D. N. H. Tam, N. D. Kien, A. M. Ahmed, and N. T. Huy. A step by step guide for conducting a systematic review and meta-analysis with simulation data. Trop. Med. Health, 47(1):46, Aug. 2019.
- TJ et al. [2023] L. TJ, T. J, and H. JPT. Chapter 1: Starting a review. In: Higgins JPT, 6.(4), Aug. 2023.
- Tran et al. [0] V.-T. Tran, G. Gartlehner, S. Yaacoub, I. Boutron, L. Schwingshackl, J. Stadelmaier, I. Sommer, F. Alebouyeh, S. Afach, J. Meerpohl, and P. Ravaud. Sensitivity and specificity of using gpt-3.5 turbo models for title and abstract screening in systematic reviews and meta-analyses. Annals of Internal Medicine, 0(0):null, 0. doi: 10.7326/M23-3389. URL https://doi.org/10.7326/M23-3389. PMID: 38768452.
- Tsoi et al. [2020] A. K. Tsoi, L. T. Ho, I. X. Wu, C. H. Wong, R. S. Ho, J. Y. Lim, C. Mao, E. K. Lee, and V. C. Chung. Methodological quality of systematic reviews on treatments for osteoporosis: A cross-sectional study. Bone, 139:115541, 2020. ISSN 8756-3282. doi: https://doi.org/10.1016/j.bone.2020.115541. URL https://www.sciencedirect.com/science/article/pii/S8756328220303215.
- Turner et al. [2009] R. M. Turner, D. J. Spiegelhalter, G. C. S. Smith, and S. G. Thompson. Bias modelling in evidence synthesis. J. R. Stat. Soc. Ser. A Stat. Soc., 172(1):21–47, Jan. 2009.
- Viswanathan et al. [2012] M. Viswanathan, M. T. Ansari, N. D. Berkman, S. Chang, L. Hartling, M. L. McPheeters, P. L. Santaguida, T. Shamliyan, K. Singh, A. Tsertsvadze, and J. R. Treadwell. Assessing the Risk of Bias of Individual Studies in Systematic Reviews of Health Care Interventions, March 2012. URL https://effectivehealthcare.ahrq.gov/. AHRQ Publication.
- Voorhees et al. [2021] E. Voorhees, T. Alam, S. Bedrick, D. Demner-Fushman, W. R. Hersh, K. Lo, K. Roberts, I. Soboroff, and L. L. Wang. Trec-covid: constructing a pandemic information retrieval test collection. SIGIR Forum, 54(1), feb 2021. ISSN 0163-5840. doi: 10.1145/3451964.3451965. URL https://doi.org/10.1145/3451964.3451965.
- Wadhwa et al. [2023] S. Wadhwa, J. DeYoung, B. Nye, S. Amir, and B. C. Wallace. Jointly extracting interventions, outcomes, and findings from rct reports with llms, 2023.
- Wallace et al. [2016] B. C. Wallace, J. Kuiper, A. Sharma, M. Zhu, and I. J. Marshall. Extracting pico sentences from clinical trial reports using supervised distant supervision. J. Mach. Learn. Res., 17(1):4572–4596, jan 2016. ISSN 1532-4435.
- Wallace et al. [2020] B. C. Wallace, S. Saha, F. Soboczenski, and I. J. Marshall. Generating (factual?) narrative summaries of rcts: Experiments with neural multi-document summarization, 2020.
- Wang et al. [2023] J. Wang, K. Wang, X. Wang, P. Naidu, L. Bergen, and R. Paturi. Doris-mae: Scientific document retrieval using multi-level aspect-based queries. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2023.
- Wang et al. [2022a] L. L. Wang, J. DeYoung, and B. Wallace. Overview of MSLR2022: A shared task on multi-document summarization for literature reviews. In A. Cohan, G. Feigenblat, D. Freitag, T. Ghosal, D. Herrmannova, P. Knoth, K. Lo, P. Mayr, M. Shmueli-Scheuer, A. de Waard, and L. L. Wang, editors, Proceedings of the Third Workshop on Scholarly Document Processing, pages 175–180, Gyeongju, Republic of Korea, Oct. 2022a. Association for Computational Linguistics. URL https://aclanthology.org/2022.sdp-1.20.
- Wang et al. [2022b] Q. Wang, J. Liao, M. Lapata, and M. Macleod. Risk of bias assessment in preclinical literature using natural language processing. Res. Synth. Methods, 13(3):368–380, May 2022b.
- Wang et al. [2024] S. Wang, H. Scells, S. Zhuang, M. Potthast, B. Koopman, and G. Zuccon. Zero-shot generative large language models for systematic review screening automation, 2024.
- Welton et al. [2009] N. J. Welton, A. E. Ades, J. B. Carlin, D. G. Altman, and J. A. C. Sterne. Models for potentially biased evidence in meta-analysis using empirically based priors. Journal of the Royal Statistical Society: Series A (Statistics in Society), 172(1):119–136, 2009. doi: https://doi.org/10.1111/j.1467-985X.2008.00548.x. URL https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-985X.2008.00548.x.
- Werneck et al. [2014] G. L. Werneck, C. H. N. Costa, F. A. A. de Carvalho, M. d. S. Pires e Cruz, J. H. Maguire, and M. C. Castro. Effectiveness of insecticide spraying and culling of dogs on the incidence of leishmania infantum infection in humans: A cluster randomized trial in teresina, brazil. PLoS Neglected Tropical Diseases, 8(10):e3172, Oct. 2014. ISSN 1935-2735. doi: 10.1371/journal.pntd.0003172. URL http://dx.doi.org/10.1371/journal.pntd.0003172.
- Wilkins [2023] D. Wilkins. Automated title and abstract screening for scoping reviews using the gpt-4 large language model, 2023.
- Ye et al. [2024] A. Ye, A. Maiti, M. Schmidt, and S. J. Pedersen. A hybrid semi-automated workflow for systematic and literature review processes with large language model analysis. Future Internet, 16(5), 2024. ISSN 1999-5903. URL https://www.mdpi.com/1999-5903/16/5/167.
- Zafar et al. [2023] I. Zafar, A. Wali, M. A. Kunwar, N. Afzal, and M. Raza. A pipeline for medical literature search and its evaluation. J. Inf. Sci., page 016555152311615, Apr. 2023.
Appendix A Dataset
A.1 Dataset License and Code License
The RoBBR dataset is made available under CC-BY-NC. A copy of the full license can be found at https://github.com/RoBBR-Benchmark/RoBBR/blob/main/LICENSE.md.
The code used in this paper is released under the MIT License. The MIT License is a permissive open-source license that allows for the free use, modification, and distribution of the code, as long as the original license is included with any derivative work. A copy of the full license can be found at https://github.com/RoBBR-Benchmark/RoBBR/blob/main/LICENSE.md.
A.2 Dataset Hosting, Accessibility and Maintenance
The RoBBR dataset and its metadata are released and can be accessed freely at https://github.com/RoBBR-Benchmark/RoBBR. We commit to regularly maintaining the dataset and codebase and to incorporating user feedback. We may introduce additional features in future versions of RoBBR. We confirm that the current version of RoBBR will always remain accessible at the same link.
A.3 Dataset Statistics
Task 1 (Study Inclusion/Exclusion):

| Dataset | Points | # Tokens in Query (min / avg / max) | Correct Option Count (Included / Excluded) |
|---|---|---|---|
| Test Set | 309 | 2685 / 9380 / 28464 | 128 / 181 |

Task 2 (Risk of Bias Sentence Retrieval):

| Dataset | Points | # Tokens in Query (min / avg / max) | Sentences (min / avg / max) | Covered Aspects (min / avg / max) | Optimal # of Sentences (min / avg / max) |
|---|---|---|---|---|---|
| Test Set | 513 | 4104 / 11220 / 21494 | 87 / 220.1 / 400 | 1 / 1.9 / 10 | 1 / 1.4 / 9 |

Task 3 (Support Judgment Selection):

| Dataset | Points | # Tokens in Query (min / avg / max) |
|---|---|---|
| Test Set | 479 | 3515 / 10583 / 21187 |

Task 4 (Risk Level Determination):

| Dataset | Points | # Tokens in Query (min / avg / max) | Correct Option Count (Low / Unclear / High) |
|---|---|---|---|
| Test Set | 1052 | 3256 / 9763 / 19834 | 676 / 213 / 163 |

Distribution of bias types across tasks:

| Task | Points | Selection | Performance | Detection | Reporting | Attrition | Other | Analysis | Paper-specific |
|---|---|---|---|---|---|---|---|---|---|
| Task 2 | 513 | 202 | 81 | 82 | 35 | 95 | 23 | 13 | 7 |
| Task 3 | 479 | 130 | 87 | 82 | 80 | 98 | 0 | 8 | 6 |
| Task 4 | 1052 | 332 | 170 | 168 | 119 | 148 | 102 | 18 | 26 |
A.4 Dataset Collection and Processing
We use the BioC API \citesupbioc to download meta-reviews and papers from the PMC database, and manually download the remaining papers. We use GROBID \citesupGROBID to parse papers from PDF to XML format, and Stanza \citesupstanza to split paragraphs into sentences.
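As a minimal illustration of the sentence-splitting step (not the exact pipeline code; the paragraph text below is a placeholder), Stanza can be used as follows:

```python
import stanza

# Download the English models once (cached afterwards).
stanza.download("en")

# A tokenization-only pipeline is sufficient for sentence splitting.
nlp = stanza.Pipeline(lang="en", processors="tokenize")

paragraph = (
    "The original plan was to repeat the IFAT test at 6 and 12 months. "
    "Due to operational problems, serology was not used as a marker of infection."
)

doc = nlp(paragraph)
sentences = [sentence.text for sentence in doc.sentences]
print(sentences)
```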
A.5 RoBBR Structure
- task1_SIE_test.json and task1_SIE_dev.json: Test and dev sets of Task 1, Study Inclusion/Exclusion.
  - paper_doi: The DOI of the paper.
  - objective: The meta-analysis objective.
  - search_protocol: The search protocol information of the meta-analysis.
  - full_paper: The full paper content.
  - label: One of [included, excluded], indicating whether the paper is included in or excluded from the meta-analysis.
- task2_ROBSR_test.json and task2_ROBSR_dev.json: Test and dev sets of Task 2, Risk of Bias Sentence Retrieval.
  - paper_doi: The DOI of the paper.
  - bias: The bias to be considered.
  - PICO: The PICO of a study in the paper, including Methods, Participants, Intervention, Outcome, and Notes.
  - objective: The meta-analysis objective.
  - paper_as_candidate_pool: A tuple of text elements from the paper. Each text element is a sentence, a section title, a table, or a figure caption.
  - aspects: A dictionary that maps aspect id to bias aspect.
  - aspect2sentence_indices: A dictionary that maps each aspect id to all sentence indices that are independently a source of information for that aspect, as annotated by our pipeline.
  - sentence_index2aspects: A dictionary that maps each sentence index to all aspect ids for which that sentence is a source of information.
  - bias_retrieval_at_optimal_evaluation: A dictionary containing the information needed to evaluate model performance on Bias Retrieval @Optimal.
    - optimal: A positive integer, the smallest number of sentences needed to cover the largest number of aspects.
    - one_selection_of_sentences: A list of sentence indices of size optimal that covers the largest number of aspects. Note that other lists of the same size may also cover the largest number of aspects.
    - covered_aspects: The list of aspects that are covered; in this case, it is the list of all aspects.
  - bias_retrieval_at_3_evaluation: A dictionary containing the information needed to evaluate model performance on Bias Retrieval @3.
    - one_selection_of_sentences: A list of 3 sentence indices that covers the largest number of aspects coverable under the restriction of 3 sentences. Note that other lists of size 3 may cover the same number of aspects.
    - covered_aspects: The list of aspects that are covered. This list may not include all aspects. Since aspect recall is computed by dividing the number of aspects covered by the model's retrieved sentences by the total number of aspects, the maximum possible performance for Bias Retrieval @3 is not 100%.
- task3_SJS_test.json and task3_SJS_dev.json: Test and dev sets of Task 3, Support Judgment Selection.
  - paper_doi: The DOI of the paper.
  - bias: The bias to be considered.
  - PICO: The PICO of a study in the paper, including Methods, Participants, Intervention, Outcome, and Notes.
  - objective: The meta-analysis objective.
  - full_paper: The full paper content.
  - options: The seven options for the multiple-choice question.
  - label: The index of the correct option.
- task4_RLD_test.json and task4_RLD_dev.json: Test and dev sets of Task 4, Risk Level Determination.
  - paper_doi: The DOI of the paper.
  - bias: The bias to be considered.
  - PICO: The PICO of a study in the paper, including Methods, Participants, Intervention, Outcome, and Notes.
  - objective: The meta-analysis objective.
  - full_paper: The full paper content.
  - label: One of [low, high, unclear], representing the risk level of the bias.
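For illustration, here is a minimal sketch of loading a Task 2 file and reading the fields listed above. The field names follow the list; the assumption that the file is a JSON list of data points and the local file path are ours, not guaranteed by the dataset documentation.

```python
import json

# Assumed local path to a downloaded copy of the Task 2 test file.
with open("task2_ROBSR_test.json", "r") as f:
    task2 = json.load(f)  # assumed to be a list of data points

point = task2[0]  # one (paper, bias) data point

print(point["paper_doi"], "|", point["bias"])
print("objective:", point["objective"][:100], "...")

# Candidate pool of text elements (sentences, section titles, tables, captions).
pool = point["paper_as_candidate_pool"]
print("number of text elements:", len(pool))

# Gold annotations: which sentences cover which aspects.
aspect2sents = point["aspect2sentence_indices"]
optimal_info = point["bias_retrieval_at_optimal_evaluation"]
print("optimal number of sentences:", optimal_info["optimal"])
print("aspects covered at optimal:", optimal_info["covered_aspects"])
```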
Appendix B Annotation Guideline
Below, we show the annotation guideline for aspect mapping of 50 randomly sampled (aspect, full paper) pairs. The four annotators formed two teams of two people, and all saw the same annotation guideline.
You have a total of 50 annotation task packets. Each task packet is a .docx file that contains the following information.
- An Aspect (one piece of important information/detail).
- The support judgment and the bias.
- The DOI of the paper.
- An indexed list of text elements of the paper (a text element could be a sentence, a table, a figure caption, etc.).
You have to pledge that the following conditions are met during annotation for each task packet.
- For words you are not familiar with and believe are important for comprehension, conduct a search to understand their meaning.
- You have to look at and read every text element in the list at least once.
- You cannot talk to the other annotator team about anything related to your task, including progress and insights.
- You should do the task independently first, and then consult with the other person in your team after completing the task.
- You have to take a mandatory 5-minute break after every hour of annotation.
- You cannot exceed 8 hours of annotation per day.
Below is the recommended procedure for annotating each packet.
- Read and understand the aspect and the support judgment first. You need to understand the context of the aspect, i.e., the role of the aspect in the support judgment given the bias. You also need to understand why this aspect is important for judging the bias. You can consult an LLM to understand the aspect.
- Decide which details in the aspect are important. Geographic, temporal, and numerical data are all important details.
- For each text element, decide whether a significant amount of the important details can be implied from the text element.
  - When deciding the level of implication, consider how the text element can help judge the bias. Do not make complicated inferences; i.e., can you see the aspect from the text element within 30 seconds? If not, it is not a match.
  - Pay attention to acronyms, abbreviations, and different presentation formats of the same information.
  - Even if you find significant information that can be implied from the text element, you have to make sure the text element is in the same context as the aspect.
    - Same context typically refers to the same study or experiment; for numerical results in a table, it means the row and column must indicate the same setting.
    - Check whether the text element refers to the same experiment as the aspect, since one paper may contain several experiments; the text element may not refer to any experiment at all.
  - If you suspect a relationship between the text element and the aspect but do not understand the meaning of the text element, you can use an LLM for help. However, you must verify the LLM's response and should not rely on it uncritically.
- If a text element is a table, refer to the actual table in the PDF for better understanding. However, only consider the information in the text element.
- After you have independently finished all 50 tasks, talk to the other person in your team following these procedures:
  - Go through tasks 0-49.
  - Resolve your differences and check whether you made a mistake or missed something. If you made a conceptual error (e.g., failing to understand some terminology), you may have to quickly go through the paper again.
  - For sentences where you cannot resolve your differences after discussion (i.e., one person says yes and the other says no, or both people are unsure), include them in your final list of decisions.
- Ultimately, you and your teammate should collaboratively arrive at a consensus for each of the 50 tasks. Write the collective answers in the answer file provided.
Appendix C Task 2 Bias Retrieval Prompt and Optimization
C.1 Aspect Decomposition Prompt
See Figure 6 for the aspect decomposition prompt.

C.2 Aspect Filtering Prompt
See Figure 7 for the prompt used to filter negations.
See Figure 8 for the prompt used to filter non-specific commentaries.


C.3 Aspect-to-Sentence Mapping Prompt and Optimization
To enhance the agreement between human annotators and GPT-4-0125-preview, we developed and tested various prompt-engineering strategies on a development set of 30 unique (aspect, paper) pairs. No paper or aspect in the development set overlapped with the 50 tasks in the hypothesis testing set.
GPT-4-0125-preview was chosen due to its robust instruction-following capabilities and our familiarity with its performance across different settings. Our experiments showed that in-context learning examples did not improve agreement rates and tended to cause overfitting. Instead, we found that embedding the same instruction at both the beginning and the end of the prompt effectively helped the model stay focused on the task of identifying text elements that significantly cover the details specified in an aspect.
We also implemented chain-of-thought reasoning, prompting GPT-4 to articulate its thought process following a specific keyword. This approach not only enhanced the quality of the model's reasoning but also stabilized its performance and reduced common-sense errors.
To address the issue of entity forgetting in long-context tasks (a typical paper might contain around 5000 tokens), we employed a sliding window technique. Each window, containing 10 text elements with a 5-element overlap, allowed GPT-4 to process and evaluate each text element within a manageable context size. The hyperparameters, window length and overlap size, were optimized using the development set. This overlapping approach ensures that each text element is evaluated twice, significantly reducing the likelihood of false negatives.
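A minimal sketch of the sliding-window construction described above (window length 10, overlap 5), independent of any model call; the exact handling of paper boundaries is our assumption:

```python
def sliding_windows(text_elements, window_size=10, overlap=5):
    """Yield overlapping windows of text elements.

    With window_size=10 and overlap=5, the window start advances by 5,
    so every element (except near the paper boundaries) appears in two windows.
    """
    step = window_size - overlap
    for start in range(0, len(text_elements), step):
        window = text_elements[start:start + window_size]
        if window:
            yield start, window
        if start + window_size >= len(text_elements):
            break

# Example: show which element indices fall in each window for a 23-element paper.
elements = [f"element {i}" for i in range(23)]
for start, window in sliding_windows(elements):
    print(start, [start + j for j in range(len(window))])
```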
Figure 9 illustrates the final prompt template used to match text elements (such as sentences, tables, figure captions, etc.) with the aspects. Each template inputs one aspect, the 10 text elements within a sliding window, and the contextual background (specifically, the summary of the IARC paper from which the aspect was derived).
This methodology not only ensures high fidelity in aspect-text alignment but also leverages the model’s capabilities to provide consistent and accurate annotations across extensive text bodies.

C.4 A Motivating Example
Here we provide an example of how we build the bias retrieval task.
Bias
Selective reporting (reporting bias)
Support Judgment For the Bias
The trial authors’ original plan was to use Indirect Fluorescent Antibody Test (IFAT) at 6 and 12 months, but due to operational problems, data on IFAT results were not considered valid for the analysis, and serology was not used as a marker of infection in the trial. Problems with serology were poor sensitivity and reproducibility. The authors decided not to use IFAT results in the trial and relied on conversion of the Montenegro skin test (MST) at 18 months of follow-up as the only outcome measure, since no clinical cases of Visceral Leishmaniasis (VL) were detected among the studied population.
Decomposition of Support Judgment into Aspects
- Aspect 1: The trial authors' original plan was to use IFAT test at 6 and 12 months
- Aspect 2: Due to operational problems, data on IFAT results were not considered valid for the analysis
- Aspect 3: Serology was not used as a marker of infection in the trial
- Aspect 4: Problems with serology were poor sensitivity and reproducibility.
- Aspect 5: The authors decided not to use Indirect Fluorescent Antibody Test (IFAT) results in the trial
- Aspect 6: The authors relied on conversion of the Montenegro skin test (MST) at 18 months of follow-up as the only outcome measure
- Aspect 7: No clinical cases of VL were detected among the studied population
Aspect Filtering
Using the prompt in Figure 8, Aspect 4 is filtered out since it is a commentary by the reviewer.
Mapping Aspects to Sentences in the paper
Following the procedure described in Section C.3, we map the remaining six aspects to all sentences in the paper Werneck et al. [2014].
We only show sentences from the paper that are matched with at least one aspect.
- Sentence 4: The main outcome is the incidence of infection assessed by the conversion of the Montenegro skin test (MST) after 18 months of follow-up in residents aged 1 year with no previous history of visceral leishmaniasis (VL). (Mapped with aspect 6)
- Sentence 66: The original plan was to repeat the IFAT test at 6 and 12 months, but due to operational problems, data on IFAT results were not considered valid for the analysis, and serology was not used as a marker of infection in the study. (Mapped with aspects 2, 3, 5, 7)
- Sentence 73: In any case, we decided not to use IFAT results in this study and relied on conversion of the MST at 18 months of follow-up as the only outcome measure, since no clinical cases of VL were detected among the studied population. (Mapped with aspects 1, 3, 5, 6)
Bias Recall @ Optimal
One of our evaluation metrics, Bias Recall @Optimal, restricts retrieval to the optimal number of sentences, i.e., the smallest number of sentences needed to cover all aspects. In this example, only two sentences, sentence 66 and sentence 73, are needed to cover all aspects.
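To make the metric concrete, here is a small sketch of computing aspect recall for a retrieved set of sentences using the aspect2sentence_indices mapping from the dataset. The mapping values mirror this example; the function itself is illustrative, not the official evaluation script.

```python
def aspect_recall(retrieved_sentences, aspect2sentence_indices):
    """Fraction of aspects covered by at least one retrieved sentence."""
    retrieved = set(retrieved_sentences)
    covered = sum(
        1
        for sentence_indices in aspect2sentence_indices.values()
        if retrieved & set(sentence_indices)
    )
    return covered / len(aspect2sentence_indices)

# Mapping from this example (aspect 4 was filtered out earlier).
aspect2sentence_indices = {
    "aspect 1": [73],
    "aspect 2": [66],
    "aspect 3": [66, 73],
    "aspect 5": [66, 73],
    "aspect 6": [4, 73],
    "aspect 7": [66],
}

# Retrieving sentences 66 and 73 covers all six aspects.
print(aspect_recall([66, 73], aspect2sentence_indices))  # 1.0
```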
Appendix D Task 3 Synthetic Option Generation Procedure and Prompt
We first generate three detailed synthetic options crafted to imitate support judgments from other papers concerning the same type of bias, ensuring that they are relevant to the specific paper in question while preserving the underlying reasoning. See Figure 10 for the prompt template. We then condense these detailed options into shorter versions that preserve the original meaning, using the prompt in Figure 11. To prevent heuristic algorithms from solving the task easily, we randomly select either the long or the short version of each synthetic option to include in the multiple-choice question.
For selecting the three options derived from other papers' support judgments within the same bias category, we use the prompt in Figure 12.
Finally, once the six incorrect options are constructed, we conduct a manual review, assisted by the prompt in Figure 13, on all data points. This review ensures that the options are indeed incorrect and not false negatives.




Appendix E Experiment Details
Due to budget limits, we only evaluate the LLMs on a subset of our test set. We provide this subset in our GitHub repository.
E.1 Prompt Optimization
All evaluation prompts are optimized using a disjoint development set for each task. Each prompt includes specialized instructions designed to elicit chain-of-thought reasoning, thereby enhancing the models' reasoning. These instructions are repeated twice, once at the beginning and once at the end of the prompt, so that the models retain them even after processing the entire paper. We found empirically that few-shot in-context learning does not improve model performance, so the evaluation prompts do not use in-context examples. Note that prompt optimization is not tailored to any specific large language model (LLM); instead, aggregated performance across all models guides the optimization of the evaluation prompts.
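As an illustration of the instruction-repetition strategy described above, a prompt can be assembled as follows; the wording and structure below are placeholders, not the actual evaluation prompts.

```python
def build_prompt(instruction, paper_text, question):
    """Place the same instruction before and after the long paper content,
    so the instruction remains in recent context after the model reads the
    full paper. Chain-of-thought reasoning is requested explicitly."""
    return "\n\n".join([
        instruction,
        "PAPER:\n" + paper_text,
        "QUESTION:\n" + question,
        instruction,
        "First write your step-by-step reasoning, then give your final answer.",
    ])

# Hypothetical usage with placeholder text.
prompt = build_prompt(
    instruction="You are assessing the risk of bias of the study below.",
    paper_text="<full paper content>",
    question="What is the risk level for selective reporting (reporting bias)?",
)
print(prompt[:200])
```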
E.2 Prompts for Evaluation
E.2.1 Prompt to Evaluate Study Inclusion/Exclusion (Task 1)
See Figure 14 for the prompt used to evaluate Study Inclusion/Exclusion (Task 1).

E.2.2 Prompt to Evaluate Bias Retrieval (Task 2)
See Figure 15 for the prompt used to evaluate Bias Retrieval (Task 2). We use an additional multi-turn prompt (Figure 16) when the model outputs more sentences than required.


E.2.3 Prompt to Evaluate Support Judgment Selection (Task 3)
See Figure 17 for the prompt used to evaluate Support Judgment Selection (Task 3).

E.2.4 Prompt to Evaluate Risk Level Determination (Task 4)
See Figure 18 for the prompt used to evaluate Risk Level Determination (Task 4).

E.3 Results with Standard Error
Task 1 (Study Inclusion/Exclusion), mean ± standard error:

| Model | F1 | Accuracy | Weighted Accuracy |
|---|---|---|---|
| Claude3-Opus | 75.5 ± 2.6 | 77.1 ± 2.5 | 74.9 ± 2.5 |
| Gemini-1.5-Pro | 72.2 ± 2.6 | 72.7 ± 2.6 | 73.0 ± 2.7 |
| GPT-4o | 75.5 ± 2.6 | 75.6 ± 2.5 | 77.6 ± 2.3 |

Task 2 (Bias Retrieval), mean ± standard error by bias type:

Bias Recall@Optimal

| Model | Full (n=337) | Selection (n=117) | Attrition (n=57) | Performance (n=50) | Detection (n=50) | Reporting (n=35) | Analysis (n=13) | Special (n=7) | Other (n=23) |
|---|---|---|---|---|---|---|---|---|---|
| Claude3-Opus | 39.0 ± 2.4 | 55.4 ± 4.0 | 22.0 ± 4.7 | 57.3 ± 6.39 | 46.4 ± 6.5 | 11.0 ± 4.9 | 47.9 ± 12.3 | 13.9 ± 8.3 | 4.4 ± 2.9 |
| Gemini-1.5-Pro | 41.1 ± 2.5 | 57.2 ± 4.0 | 28.7 ± 5.2 | 49.1 ± 6.4 | 48.8 ± 6.3 | 23.3 ± 6.4 | 54.2 ± 11.4 | 13.9 ± 8.3 | 3.3 ± 2.3 |
| GPT-4o | 44.8 ± 2.5 | 60.3 ± 4.0 | 28.6 ± 5.0 | 64.6 ± 5.5 | 48.7 ± 6.4 | 32.9 ± 7.3 | 28.7 ± 9.7 | 28.0 ± 14.1 | 2.3 ± 2.2 |

Bias Recall@3

| Model | Full (n=337) | Selection (n=117) | Attrition (n=57) | Performance (n=50) | Detection (n=50) | Reporting (n=35) | Analysis (n=13) | Special (n=7) | Other (n=23) |
|---|---|---|---|---|---|---|---|---|---|
| Claude3-Opus | 52.3 ± 2.5 | 67.9 ± 3.8 | 40.2 ± 5.8 | 63.1 ± 5.9 | 55.9 ± 6.6 | 25.5 ± 7.1 | 55.9 ± 12.3 | 64.6 ± 16.9 | 19.7 ± 7.8 |
| Gemini-1.5-Pro | 56.1 ± 2.4 | 69.7 ± 3.8 | 44.1 ± 5.7 | 73.7 ± 5.3 | 65.3 ± 6.1 | 35.6 ± 7.6 | 56.3 ± 12.8 | 35.5 ± 16.1 | 6.7 ± 4.8 |
| GPT-4o | 61.8 ± 2.4 | 72.2 ± 3.8 | 47.9 ± 5.8 | 82.0 ± 4.5 | 78.5 ± 5.4 | 43.2 ± 8.2 | 67.9 ± 10.3 | 50.4 ± 17.1 | 10.3 ± 5.8 |

Task 3 (Support Judgment Selection), accuracy as mean ± standard error by bias type:

| Model | Full (n=310) | Selection (n=86) | Attrition (n=60) | Performance (n=52) | Detection (n=56) | Reporting (n=50) | Analysis (n=8) | Special (n=6) |
|---|---|---|---|---|---|---|---|---|
| Claude3-Opus | 51.2 ± 2.8 | 62.6 ± 5.2 | 56.7 ± 6.4 | 48.0 ± 6.9 | 46.5 ± 6.9 | 44.2 ± 7.0 | 50.0 ± 17.9 | 0.0 ± 0.0 |
| Gemini-1.5-Pro | 49.3 ± 3.0 | 57.1 ± 5.3 | 59.9 ± 6.5 | 46.4 ± 7.0 | 42.8 ± 6.8 | 40.2 ± 7.0 | 49.3 ± 17.7 | 16.6 ± 15.5 |
| GPT-4o | 45.8 ± 2.9 | 52.3 ± 5.4 | 63.1 ± 6.2 | 42.5 ± 6.7 | 37.7 ± 6.5 | 28.0 ± 6.2 | 74.5 ± 15.7 | 0.0 ± 0.0 |

Task 4 (Risk Level Determination), mean ± standard error by bias type:

F1

| Model | Full (n=337) | Selection (n=93) | Attrition (n=50) | Performance (n=50) | Detection (n=50) | Reporting (n=50) | Analysis (n=18) | Special (n=26) | Other (n=50) |
|---|---|---|---|---|---|---|---|---|---|
| Claude3-Opus | 47.7 ± 2.9 | 54.3 ± 6.6 | 34.1 ± 7.0 | 46.7 ± 6.8 | 37.4 ± 7.6 | 23.0 ± 4.7 | 69.9 ± 18.8 | 27.9 ± 6.7 | 27.9 ± 1.8 |
| Gemini-1.5-Pro | 43.1 ± 2.6 | 53.9 ± 5.8 | 26.0 ± 3.7 | 43.7 ± 7.2 | 46.8 ± 7.4 | 30.4 ± 5.2 | 44.6 ± 7.7 | 20.1 ± 3.4 | 22.8 ± 6.1 |
| GPT-4o | 47.4 ± 2.9 | 54.3 ± 6.6 | 30.4 ± 5.8 | 46.7 ± 6.8 | 37.4 ± 7.6 | 22.6 ± 3.9 | 69.9 ± 18.8 | 32.2 ± 8.5 | 27.0 ± 1.6 |

Accuracy

| Model | Full (n=337) | Selection (n=93) | Attrition (n=50) | Performance (n=50) | Detection (n=50) | Reporting (n=50) | Analysis (n=18) | Special (n=26) | Other (n=50) |
|---|---|---|---|---|---|---|---|---|---|
| Claude3-Opus | 56.8 ± 2.6 | 64.6 ± 5.0 | 58.1 ± 7.1 | 48.2 ± 7.1 | 41.7 ± 7.1 | 40.0 ± 7.3 | 88.9 ± 7.3 | 53.8 ± 9.7 | 72.2 ± 6.4 |
| Gemini-1.5-Pro | 49.3 ± 2.7 | 62.5 ± 5.1 | 42.4 ± 7.2 | 46.5 ± 6.8 | 48.1 ± 7.1 | 56.0 ± 7.2 | 77.5 ± 9.5 | 42.4 ± 9.9 | 23.7 ± 6.2 |
| GPT-4o | 57.0 ± 2.6 | 64.6 ± 5.0 | 58.1 ± 7.2 | 48.2 ± 7.1 | 41.7 ± 7.2 | 43.6 ± 7.2 | 88.9 ± 7.3 | 57.7 ± 9.7 | 68.3 ± 6.5 |

Weighted Accuracy

| Model | Full (n=337) | Selection (n=93) | Attrition (n=50) | Performance (n=50) | Detection (n=50) | Reporting (n=50) | Analysis (n=18) | Special (n=26) | Other (n=50) |
|---|---|---|---|---|---|---|---|---|---|
| Claude3-Opus | 49.5 ± 3.0 | 55.3 ± 5.1 | 36.2 ± 9.0 | 48.0 ± 7.0 | 46.5 ± 5.9 | 30.4 ± 10.1 | 69.3 ± 19.6 | 28.2 ± 4.9 | 32.4 ± 0.9 |
| Gemini-1.5-Pro | 47.9 ± 3.2 | 56.3 ± 5.6 | 34.0 ± 10.1 | 44.4 ± 7.0 | 54.2 ± 8.2 | 36.4 ± 10.1 | 43.6 ± 4.2 | 22.2 ± 4.9 | 32.5 ± 10.1 |
| GPT-4o | 49.3 ± 3.1 | 55.3 ± 5.1 | 29.3 ± 3.9 | 48.0 ± 7.0 | 46.5 ± 5.9 | 24.5 ± 8.4 | 69.3 ± 19.6 | 60.4 ± 24.4 | 30.6 ± 1.5 |
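The tables above report means with standard errors over data points. The paper does not spell out the exact estimation procedure here, so purely as an illustration, a nonparametric bootstrap over per-point scores is one common way such standard errors can be obtained; the function and the example scores below are hypothetical.

```python
import random
import statistics

def bootstrap_standard_error(per_point_scores, n_resamples=1000, seed=0):
    """Estimate the standard error of the mean score by resampling data points."""
    rng = random.Random(seed)
    n = len(per_point_scores)
    means = []
    for _ in range(n_resamples):
        resample = [per_point_scores[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    return statistics.stdev(means)

# Hypothetical per-point accuracies (1 = correct, 0 = incorrect) for 100 points.
scores = [1] * 77 + [0] * 23
print(round(100 * bootstrap_standard_error(scores), 1))  # roughly 4.2 for n = 100
```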
Appendix F Full Example
We provide the full version of the example presented in the main paper, where it was shortened to save space.
F.1 Search Protocol Gonzalez et al. [2015]
Types of Studies: Randomized controlled trials (RCTs).

Types of Participants: People living in leishmaniasis endemic regions.

Types of Interventions: Any intervention that aims to reduce leishmaniasis incidence through vector or reservoir control.

Types of Outcome Measures:

- People developing CL or VL infections.
- Estimates of the vector density measured by an appropriate technique (adult sandfly density estimated by counts of vectors either landing on exposed body parts of humans acting as baits or collected resting inside buildings, for example, on walls).
- Number of participants with positive immunological or biochemical tests that detect contact with the parasite (for example, leishmanin skin test conversion rates or lymphocyte proliferation rates, or both).
- Adverse effects on people.
- Adherence to control measures; for example, the extent to which specified intervention components were delivered as prescribed.
- Measures of environmental impact (assessment of the possible impact, positive or negative, that the interventions may have on the natural environment) or sustainability (assessment of the ability to change biological and human processes, functions, biodiversity and productivity), or both.
F.2 Objective of the Meta-Analysis Gonzalez et al. [2015]
To assess the effects of vector and reservoir control interventions for cutaneous and for visceral leishmaniasis.
F.3 Full Paper
See Werneck et al. [2014] for the full paper content.
F.4 Bias Name and Definition Higgins and Green [2011]
Bias Name
selective reporting (reporting bias)
Bias Definition
Reporting bias due to selective outcome reporting.
F.5 PICO
Methods: Trial design: Cluster-RCT. Unit of randomization: Geographic area. Number of clusters: 40 geographic areas. Entomological data collection: Not done. Clinical data collection: Conversion of the Montenegro skin test (MST) at 18 months of follow-up. Length of follow-up: 18 months. Analysis: Analysed at cluster level.

Participants: Ten localities in 7 neighbourhoods of the city of Teresina (Brazil) were divided into blocks, each containing an average of 60 residences. For each locality, 4 blocks were selected to minimize the risk of cross-contamination of interventions. Eligible participants were residents of selected blocks aged 1 year or above with no history of VL. The 40 geographic areas (blocks) were randomly allocated to the 4 types of interventions (697 subjects MST-). Endemic disease: VL caused by L. chagasi (L. infantum).

Interventions: 1. Spraying households and residential annexes with insecticide. 2. Elimination of infected dogs. 3. Combination of spraying and eliminating infected dogs. 4. No intervention. Description of spraying: performed according to the routine of the VL Control Program of the Zoonosis Control Center of the Teresina City Health Department. Interventions were delivered in the selected blocks every 6 months, three times in total, beginning just after each household visit. Infected dogs were eliminated if the indirect immunofluorescence test titre was 1:40 or higher.

Outcomes: Cases of infection by L. infantum at 18 months determined by conversion of the MST (MST- at the beginning) or diagnosis of active VL.

Notes: Country: Brazil (Teresina, Itararé quarter). Trial dates: January 2004 to December 2006. Trial sponsor: Funded by the Health Surveillance Unit of the Brazilian Ministry of Health. One author was partially funded by the Brazilian Research Council (CNPq 306267/2010-1 and 202088/2012-0). The funders had no role in trial design, data collection and analysis, decision to publish, or preparation of the manuscript. The authors have declared that no competing interests exist. Sample size: Calculated. Compliance assessment: Not reported.