Participation in TREC 2020 COVID Track Using Continuous Active Learning
Abstract
We describe our participation in all five rounds of the TREC 2020 COVID Track (TREC-COVID). The goal of TREC-COVID is to contribute to the response to the COVID-19 pandemic by identifying answers to many pressing questions and building infrastructure to improve search systems [8]. All five rounds of this Track challenged participants to perform a classic ad-hoc search task on the new CORD-19 data collection. Our solution addressed this challenge by applying the Continuous Active Learning model (CAL) and its variations. Our results placed us amongst the top-scoring manual runs, and we remained competitive within all categories of submissions.
1 Introduction
As the spread of COVID-19 continues around the globe, researchers, clinicians, and policy makers involved in its response are constantly searching for reliable information on the virus. This presents those of us in the information retrieval (IR) and text processing communities with a unique opportunity to contribute to the response to this pandemic by building infrastructure to improve search systems and by helping to identify answers to some of today’s most pressing questions [8]. The task of TREC-COVID is for participants to retrieve the most relevant documents from the CORD-19 dataset for a given set of topics. To address this challenge, we implemented a system based on CAL, following the work of Grossman and Cormack [3, 4], using the tool kit provided as part of the Baseline Model Implementation (BMI) created by Roegiest and Cormack [6], with ourselves as the human assessors.
2 Related Work
In this section, we discuss prior research on CAL. We then discuss prior research on BMI, which provides the tool kits we heavily relied upon for this challenge.
Continuous Active Learning (CAL).
CAL is a method for finding virtually all relevant information on a particular subject within a vast sea of electronically stored information (ESI): it repeatedly refines its understanding about which of the remaining documents are most likely to be of interest, based on the users’ feedback regarding the documents already judged [4]. This protocol is most famously used in technology-assisted review (TAR) for electronic discovery in legal matters, achieving the best results reported in scientific literature to date [2]. Building on the CAL protocol, many implementations, such as BMI, have been highly successful at performing ad-hoc retrieval tasks, such as in the TREC 2015/2016 Total Recall Track [7, 5] and the TREC 2019 Decision Track [1].
Baseline Model Implementation (BMI).
BMI is an autonomous, augmented implementation of CAL. It was initially made available to participants of the TREC 2015/2016 Total Recall Tracks [7, 5], as well as the TREC 2019 Decision Track [1], to provide a baseline for comparison. However, BMI turned out to be highly competitive, with none of the manual participants achieving consistently superior results to this fully automated method [6].
While BMI has been shown to generally outperform human-in-the-loop CAL implementations [6], it requires labelled data, which was very limited, if available at all, for TREC-COVID; thus, we chose to insert a human back into the loop to make judgements. All other components, such as feature-vector creation and the learner, were taken directly from the BMI tool kit.
3 System Overview & General Approach
Document Set Processing.
The document set used in the TREC-COVID Challenge is the COVID-19 Open Research Dataset (CORD-19). Our team opted to judge a document’s relevancy using strictly the information available in the metadata file (year, authors, publisher, title, abstract), based on the work of Zhang et al. [9], which showed that participants achieve higher recall with CAL when presented with only a single short excerpt rather than an entire document.
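As an illustration of this metadata-only representation, the following is a minimal sketch of assembling a short judging text per document from the CORD-19 metadata file. The file name `metadata.csv` and the column names used below are assumptions about the collection layout rather than a description of our exact pipeline.

```python
import csv

# Assumed CORD-19 metadata columns (publish_time, authors, journal, title,
# abstract, cord_uid); match these to the CORD-19 release actually used.
FIELDS = ["publish_time", "authors", "journal", "title", "abstract"]

def load_metadata_docs(path="metadata.csv"):
    """Build one short, metadata-only text per document for assessment."""
    docs = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            parts = [row.get(field, "") for field in FIELDS]
            docs[row["cord_uid"]] = "\n".join(p for p in parts if p)
    return docs
```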
CAL.
The following outlines our specific implementation of CAL; a code sketch of the full loop follows the list.

• STEP 1: Create a hypothetical relevant document, known as a synthetic document. To create the synthetic documents, we concatenated the query, question, and narrative components of the topics file provided by TREC-COVID, as shown in Figures 1 and 2.

Figure 1: The synthetic document for topic 1.

Figure 2: Snippet of topic 1 in the XML topics file provided by TREC-COVID.

• STEP 2: Use a machine-learning algorithm to suggest the next most-likely relevant document. The machine-learning algorithm we chose is Sofia-ML, which Roegiest and Cormack used in the BMI for the TREC 2015 Total Recall Track [6].

• STEP 3: Review the suggested documents and provide relevance feedback to the learning algorithm, indicating whether each suggested document is actually relevant or not. To do this, we sorted the results given by Sofia-ML in decreasing order of confidence and presented the top-ranked result to the human assessor through a text-based user interface. The judgement made by our human assessors is one of {0: not relevant, 1: partially relevant, 2: relevant}, matching the annotations made by biomedical experts as part of TREC-COVID following each round. As Sofia-ML does not distinguish between relevant judgements and partially relevant judgements, both were designated to be relevant in training.

• STEP 4: Repeat Steps 2 and 3 until very few, if any, of the suggested documents are relevant. Using the same stopping condition as in [5], we aimed to stop when the following criterion was met:

n ≥ a·m + b,

where m is the number of relevant documents reviewed, n is the number of irrelevant documents reviewed, a is a constant which determines how many non-relevant documents are to be reviewed in the course of finding each relevant document, and b is a constant which represents a fixed overhead for the number of irrelevant documents that must be reviewed.
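To make the loop concrete, here is a minimal sketch of Steps 1–4. It stands in a generic TF-IDF featurizer and logistic-regression scorer for the BMI feature vectors and Sofia-ML, uses the reconstructed stopping rule n ≥ a·m + b, and pads training with random pseudo-negative documents so that both classes are always present (cf. the S-CAL augmentation below). The function names, parameter defaults, and the `ask_assessor` callback are illustrative assumptions, not our exact pipeline.

```python
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def cal_loop(synthetic_doc, corpus_texts, ask_assessor, a=1, b=100):
    """Sketch of Steps 1-4: seed with the synthetic document, then repeatedly
    train, present the top-ranked unjudged document, and fold the feedback in."""
    vec = TfidfVectorizer()
    X = vec.fit_transform([synthetic_doc] + list(corpus_texts))
    seed, docs = X[0], X[1:]
    rng = np.random.default_rng(0)

    judged = {}           # doc index -> assessor label in {0, 1, 2}
    m = n = 0             # counts of relevant / non-relevant judgements
    while n < a * m + b:  # stop once n >= a*m + b (reconstructed rule)
        if len(judged) == docs.shape[0]:
            break
        # Training data: the synthetic doc as relevant, all human judgements,
        # plus random documents provisionally labelled not relevant so that
        # both classes are always present.
        randoms = rng.choice(docs.shape[0], size=min(100, docs.shape[0]),
                             replace=False)
        randoms = [int(i) for i in randoms if i not in judged]
        rows = [seed] + [docs[i] for i in judged] + [docs[i] for i in randoms]
        # Partially relevant (1) and relevant (2) both train as relevant.
        labels = [1] + [int(judged[i] >= 1) for i in judged] + [0] * len(randoms)
        model = LogisticRegression(max_iter=1000).fit(vstack(rows), labels)

        # Present the highest-scoring unjudged document to the assessor.
        scores = model.decision_function(docs)
        best = max((i for i in range(docs.shape[0]) if i not in judged),
                   key=lambda i: scores[i])
        label = ask_assessor(best)   # the assessor returns 0, 1, or 2
        judged[best] = label
        m += int(label >= 1)
        n += int(label == 0)
    return judged
```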
S-CAL.
One of the major drawbacks of the CAL method outlined above is the impractical number of documents that must be reviewed when the number of relevant documents is large. Scalable Continuous Active Learning (S-CAL) [3] addresses this issue by:

1. segmenting the corpus into batches and allowing assessors to label only a small, finite sample of documents from each successive batch; and

2. temporarily augmenting each training set with a set of 100 random documents from the corpus, labelled not relevant, which are, with high probability, genuinely not relevant for a large corpus.

However, the stopping condition for S-CAL outlined in [3] was still infeasible to achieve with CORD-19 and our team size; thus, we exchanged the original dynamic stopping condition for a static goal of assessing 300 documents per topic. A sketch of these modifications follows.
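The sketch below illustrates both modifications as we applied them, assuming a `rank_unjudged` callback that retrains the learner and returns the remaining documents in ranked order. The batch-growth schedule, the per-batch sample cap, and the helper names are illustrative assumptions rather than the exact parameters of [3].

```python
import math
import random

def augment_with_random_negatives(train_ids, train_labels, corpus_ids, k=100):
    """Temporarily add k random corpus documents labelled not relevant (0); for
    a large corpus these are, with high probability, genuinely not relevant."""
    seen = set(train_ids)
    extras = random.sample([d for d in corpus_ids if d not in seen], k)
    return train_ids + extras, train_labels + [0] * k

def scal_judging(rank_unjudged, ask_assessor, budget=300, cap=30):
    """Judge only a capped random sample from each successive batch, stopping
    at a fixed budget of 300 assessments per topic (our static stopping goal)."""
    judged = {}
    batch_size = 1
    while len(judged) < budget:
        batch = rank_unjudged(judged)[:batch_size]  # retrain and rerank each pass
        if not batch:
            break
        sample = random.sample(batch, min(cap, len(batch), budget - len(judged)))
        for doc_id in sample:
            judged[doc_id] = ask_assessor(doc_id)   # assessor returns 0, 1, or 2
        batch_size += math.ceil(batch_size / 10)    # batches grow over time
    return judged
```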
Hyper-parameter Tuning.
Given the availability of labelled data after the first round, we performed hyper-parameter tuning on both the loop_type and the lambda value to better fit CORD-19. Finding no significant differences in our tests, we decided to continue with our initial values taken from [6], which had been chosen based on discussions with the author of Sofia-ML as well as their internal experiments.
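For reference, a sweep over these two options can be scripted against the Sofia-ML binary shipped with the BMI tool kit roughly as follows. Only the option names --loop_type and --lambda come from the description above; the candidate values, learner type, and file names are illustrative assumptions.

```python
import itertools
import subprocess

# Candidate values are illustrative; file names below are placeholders.
LOOP_TYPES = ["roc", "balanced-stochastic"]
LAMBDAS = [0.0001, 0.001, 0.01]

def train_model(loop_type, lam, train_file="topic.train", model_file="topic.model"):
    """Call the sofia-ml binary (assumed to be on PATH) with one configuration."""
    subprocess.run(
        ["sofia-ml",
         "--learner_type", "logreg-pegasos",
         "--loop_type", loop_type,
         "--lambda", str(lam),
         "--training_file", train_file,
         "--model_out", model_file],
        check=True)

for loop_type, lam in itertools.product(LOOP_TYPES, LAMBDAS):
    train_model(loop_type, lam, model_file=f"topic.{loop_type}.{lam}.model")
    # Evaluate each model against the Round 1 relevance judgements and keep
    # the best-performing configuration.
```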
Creating Runs.
To generate the results for our runs, we created lists of 1000 documents ordered as shown in Figure 3; a code sketch of these orderings follows the list below.

(i) Lead with documents labelled relevant, followed by those labelled partially relevant, and finally fill with Sofia-ML’s ranking of unseen documents in descending order of confidence.
(ii) Arrange all documents using Sofia-ML’s ranking; no special consideration is given to documents already assessed by a human assessor.
(iii) Keep documents that annotators have labelled not relevant in the final run.
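A minimal sketch of orderings (i) and (iii), assuming dictionaries of human labels and Sofia-ML scores keyed by document identifier. Placing the not-relevant judgements ahead of the unseen documents in ordering (iii) is our assumption, and ordering (ii) is simply the score-sorted list with the judgement handling removed.

```python
def build_run(labels, scores, include_not_relevant=False, depth=1000):
    """Ordering (i): relevant, then partially relevant, then unseen documents
    by decreasing Sofia-ML score. With include_not_relevant=True this sketches
    ordering (iii), keeping documents the assessors judged not relevant (here
    placed ahead of the unseen documents, which is an assumed placement)."""
    relevant = [d for d, lab in labels.items() if lab == 2]
    partial = [d for d, lab in labels.items() if lab == 1]
    not_rel = [d for d, lab in labels.items() if lab == 0] if include_not_relevant else []
    unseen = sorted((d for d in scores if d not in labels),
                    key=scores.get, reverse=True)
    return (relevant + partial + not_rel + unseen)[:depth]
```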
Key-Term Highlighting.
Key-term highlighting is a feature commonly provided by IR systems, such as Google, to assist human readers in processing information. Following the online sample of CAL, as shown in Figure 5, given as a supplement to [4], we chose to highlight the top five highest-scoring words from a document, according to Sofia-ML, in our UI for assessors, as shown in Figure 5.
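A minimal sketch of how such highlighting can be done in a text-based UI, assuming access to a dictionary of per-term model weights; the tokenization, the ANSI escape codes, and the exact tie to Sofia-ML’s model format are assumptions.

```python
import re

def top_terms(text, term_weights, k=5):
    """Pick the document's k distinct terms with the highest learned weight."""
    terms = set(re.findall(r"[a-z0-9']+", text.lower()))
    ranked = sorted(terms, key=lambda t: term_weights.get(t, 0.0), reverse=True)
    return [t for t in ranked[:k] if term_weights.get(t, 0.0) > 0]

def highlight(text, terms):
    """Wrap the chosen terms in ANSI reverse-video codes for the text-based UI."""
    if not terms:
        return text
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, terms)) + r")\b",
                         re.IGNORECASE)
    return pattern.sub(lambda m: "\033[7m" + m.group(0) + "\033[0m", text)
```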


4 Results and Discussion
Table 1 shows the specifications of our system for each round of TREC-COVID and Table 2 shows our results. From these, we are able to make some interesting observations:
1. Despite our human assessors having provided more labelled documents in Round 2 than in Round 1, our performance decreased. One possible explanation is that, through the use of the key-term highlighting feature, our human assessor(s) exchanged label quality for quantity, resulting in an overall poorer model.

2. Despite being able to provide more labelled documents in Round 5 than in Round 4, our performance once again decreased. One possible explanation is that we did not perform the quality control required when adding human assessors, once again exchanging label quality for quantity and resulting in overall poorer performance.

3. The runs ordered by method (iii) of Figure 3 consistently outperformed our other runs. This could imply that the documents judged not relevant by our assessors are still more relevant than the unseen documents ranked highest by Sofia-ML.
Table 1: System overview, run submissions, and system issues for each round of TREC-COVID; the systems used in Rounds 4 and 5 were the same as in Round 3.
Table 2: Results of our submitted runs for each round of TREC-COVID.

| | Round 1 | Round 2 | Round 3, run 1 | Round 3, run 2 | Round 3, run 3 | Round 4, run 1 | Round 4, run 2 | Round 4, run 3 | Round 5, run 1 | Round 5, run 2 | Round 5, run 3 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Number of topics | 30 | 35 | 40 | 40 | 40 | 45 | 45 | 45 | 50 | 50 | 50 |
| Total number retrieved | 30000 | 35000 | 39938 | 39941 | 39942 | 41931 | 42160 | 42241 | 49923 | 49927 | 49928 |
| Total relevant | 2352 | 3002 | 4698 | 4698 | 4698 | 5824 | 5824 | 5824 | 10910 | 10910 | 10910 |
| Total relevant retrieved | 1216 | 1768 | 2857 | 2794 | 2742 | 3241 | 3024 | 2950 | 6188 | 5889 | 5743 |
| MAP | 0.2367 | 0.2210 | 0.2836 | 0.2534 | 0.2751 | 0.2963 | 0.2774 | 0.2775 | 0.2647 | 0.2509 | 0.2448 |
| Mean Bpref | 0.4599 | 0.4823 | 0.5681 | 0.5537 | 0.5464 | 0.5507 | 0.5216 | 0.5084 | 0.5254 | 0.5062 | 0.4912 |
| Mean NDCG@10 | 0.6513 | 0.5907 | 0.7431 | 0.6252 | 0.7413 | | | | | | |
| Mean NDCG@20 | | | | | | 0.7019 | 0.6855 | 0.7019 | 0.6663 | 0.6685 | 0.6663 |
| Mean RBP(p=0.5) | | | | | | | | | | | |
| P@5 | 0.8333 | 0.7314 | 0.8350 | 0.7000 | 0.8350 | 0.8933 | 0.8844 | 0.8933 | 0.8640 | 0.8440 | 0.8640 |
| P@10 | 0.7167 | 0.6400 | 0.8325 | 0.7050 | 0.8275 | 0.8422 | 0.8400 | 0.8422 | 0.8080 | 0.8080 | 0.8080 |
| P@15 | 0.6067 | 0.5505 | 0.7200 | 0.6733 | 0.7183 | 0.7644 | 0.7719 | 0.7644 | 0.7307 | 0.7427 | 0.7307 |
| P@20 | 0.5417 | 0.4957 | 0.6625 | 0.6400 | 0.6588 | 0.7244 | 0.7278 | 0.7244 | 0.6780 | 0.7030 | 0.6780 |
| P@30 | 0.4522 | 0.4162 | 0.5633 | 0.5517 | 0.5608 | 0.6378 | 0.6326 | 0.6385 | 0.6247 | 0.6400 | 0.6247 |
| R-Precision | 0.2843 | 0.2644 | 0.3325 | 0.3111 | 0.3280 | 0.3503 | 0.3266 | 0.3316 | 0.3207 | 0.3076 | 0.2984 |
Table 3: Comparison of our runs with other TREC-COVID submissions in each round.

| | Round 1 | | | Round 2 | | | Round 3 | | |
|---|---|---|---|---|---|---|---|---|---|
| MAP | 0.2367 | 0.3008 | 0.3128 | 0.2210 | 0.338 | 0.338 | 0.2836 | 0.3244 | 0.3333 |
| Mean Bpref | 0.4599 | 0.5294 | 0.5294 | 0.4823 | 0.5679 | 0.5679 | 0.5681 | 0.5828 | 0.6084 |
| Mean NDCG@10 | 0.6513 | 0.6844 | 0.6844 | 0.5907 | 0.6893 | 0.6893 | 0.7431 | 0.7431 | 0.7740 |
| Mean RBP(p=0.5) | 0.724 | 0.7699 | 0.7699 | 0.6546 | 0.7547 | 0.7547 | 0.7300 | 0.7770 | 0.8068 |
| P@5 | 0.8333 | 0.8333 | 0.8333 | 0.7314 | 0.8514 | 0.8514 | 0.8350 | 0.8350 | 0.8950 |

| | Round 4 | | | Round 5 | | | |
|---|---|---|---|---|---|---|---|
| MAP | 0.2963 | 0.3923 | 0.4681 | 0.2647 | 0.2509 | 0.3254 | 0.4731 |
| Mean Bpref | 0.5507 | 0.6317 | 0.6801 | 0.5254 | 0.5062 | 0.5255 | 0.6378 |
| Mean NDCG@20 | 0.7019 | 0.7019 | 0.7843 | 0.6663 | 0.6685 | 0.6877 | 0.8496 |
| Mean RBP(p=0.5) | 0.7946 | 0.8056 | 0.8838 | 0.7767 | 0.7539 | 0.7789 | 0.9399 |
| P@20 | 0.7244 | 0.7278 | 0.8211 | 0.6780 | 0.7030 | 0.74 | 0.8760 |
5 Conclusion
In this paper, we report on our participation in Rounds 1 through 5 of the TREC 2020 COVID Track, describing our approach, results, and lessons learned. We initially used CAL [4], implemented with tools from the BMI tool kit [6], with ourselves as the annotators. The large human labelling effort required by our system motivated us to implement a key-term highlighting feature, adopt S-CAL [3], and recruit more human assessors. The results in Table 3 show us to be among the top-scoring manual runs and competitive within all categories of submissions throughout all rounds. Our results in Table 2 also raise the age-old question of quantity versus quality when it comes to labelled data in IR.
Acknowledgement
A special thanks goes to Gordon Cormack for his valuable guidance and Anmol Singh for his insights.
We would also like to thank Charlotte Stinson, Eric Sheen, and Solaiappan Alagappan for the time and effort they spent assessing these documents.
This research was funded in part by a Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant awarded to Maura R. Grossman, No. RGPIN-2017-04239, titled “Evaluation of High-Recall Human-in-the-Loop Information Retrieval Technology.”
References
- [1] Mustafa Abualsaud, Fuat C. Beylunioglu, Mark D. Smucker, and P. Robert Duimering. UWaterlooMDS at the TREC 2019 Decision Track. In NIST Special Publication 1250: The Twenty-Eighth Text REtrieval Conference Proceedings (TREC 2019), 2019.
- [2] Gordon V. Cormack and Maura R. Grossman. Evaluation of machine-learning protocols for technology-assisted review in electronic discovery. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR ’14, pages 153–162, New York, NY, USA, 2014. Association for Computing Machinery.
- [3] Gordon V. Cormack and Maura R. Grossman. Scalability of continuous active learning for reliable high-recall text classification. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management, CIKM ’16, pages 1039–1048, 2016.
- [4] Maura R. Grossman and Gordon V. Cormack. Continuous active learning for TAR. Practical Law Journal, 2016.
- [5] Maura R. Grossman, Gordon V. Cormack, and Adam Roegiest. TREC 2016 Total Recall Track overview. In The Twenty-Fifth Text REtrieval Conference Proceedings (TREC 2016), 2016.
- [6] Adam Roegiest and Gordon V. Cormack. Total Recall Track tools architecture overview. In Proceedings of TREC 2015, 2015.
- [7] Adam Roegiest, Gordon V. Cormack, Charles L. A. Clarke, and Maura R. Grossman. TREC 2015 Total Recall Track overview. In NIST Special Publication 500-319: The Twenty-Fourth Text REtrieval Conference Proceedings (TREC 2015), 2015.
- [8] Ellen Voorhees, Tasmeer Alam, Steven Bedrick, Dina Demner-Fushman, William R. Hersh, Kyle Lo, Kirk Roberts, Ian Soboroff, and Lucy Lu Wang. TREC-COVID: Constructing a pandemic information retrieval test collection. arXiv preprint arXiv:2005.04474, 2020.
- [9] Haotian Zhang, Mustafa Abualsaud, Nimesh Ghelani, Mark D. Smucker, Gordon V. Cormack, and Maura R. Grossman. Effective user interaction for high-recall retrieval: Less is more. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, CIKM ’18, pages 187–196, New York, NY, USA, 2018. Association for Computing Machinery.