
Participation in TREC 2020 COVID Track Using Continuous Active Learning

(Jean) Xue Jun Wang
University of Waterloo
Waterloo, ON, Canada
[email protected]
   Maura R Grossman
University of Waterloo
Waterloo, ON, Canada
[email protected]
   (Kevin) Seung Gyu Hyun
University of Waterloo
Waterloo, ON, Canada
[email protected]
Abstract

We describe our participation in all five rounds of the TREC 2020 COVID Track (TREC-COVID). The goal of TREC-COVID is to contribute to the response to the COVID-19 pandemic by identifying answers to many pressing questions and building infrastructure to improve search systems [8]. All five rounds of this Track challenged participants to perform a classic ad-hoc search task on the new CORD-19 data collection. Our solution addressed this challenge by applying the Continuous Active Learning model (CAL) and its variations. Our runs were among the top-scoring manual runs and remained competitive across all categories of submissions.

1 Introduction

As the spread of COVID-19 continues around the globe, researchers, clinicians, and policy makers involved in its response are constantly searching for reliable information on the virus. This presents the information retrieval (IR) and text-processing communities with a unique opportunity to contribute to the response to this pandemic by building infrastructure to improve search systems and by helping to identify answers to some of today’s most pressing questions [8]. The task of TREC-COVID is for participants to retrieve the most relevant documents from the CORD-19 dataset for a given set of topics. To address this challenge, we implemented a system based on CAL, following the work of Grossman and Cormack [3, 4], using the tool kit provided as part of the Baseline Model Implementation (BMI) created by Roegiest and Cormack [6], with ourselves as the human assessors.

2 Related Work

In this section, we discuss prior research on CAL. We then discuss prior research on BMI, which provides the tool kit we relied upon heavily for this challenge.

Continuous Active Learning (CAL).

CAL is a method for finding virtually all relevant information on a particular subject within a vast sea of electronically stored information (ESI): it repeatedly refines its understanding of which of the remaining documents are most likely to be of interest, based on the user’s feedback on the documents already judged [4]. This protocol is most famously used in technology-assisted review (TAR) for electronic discovery in legal matters, where it has achieved the best results reported in the scientific literature to date [2]. Building on the CAL protocol, many implementations, such as BMI, have been highly successful at ad-hoc retrieval tasks, such as in the TREC 2015/2016 Total Recall Tracks [7, 5] and the TREC 2019 Decision Track [1].

Baseline Model Implementation (BMI).

BMI is an augmented version of CAL. It is autonomous and was initially made available to participants of the TREC 2015/2016 Total Recall Tracks [7, 5], as well as the TREC 2019 Decision Track [1] to provide a baseline for comparison. However, BMI turned out to be highly competitive, with none of the manual participants achieving consistently superior results to this fully automated method [6].

While BMI has been shown to generally outperform human-in-the-loop CAL implementations [6], it requires labelled data, which was very limited, if available at all, for TREC-COVID; thus, we chose to insert a human back into the loop to make judgements. All other components, such as creating feature vectors, the learner, etc., were taken directly from the BMI tool kits.

3 System Overview & General Approach

Document Set Processing.

The document set used in the TREC-COVID Challenge is the COVID-19 Open Research Dataset (CORD-19). Our team opted to judge a document’s relevance using only the information available in the metadata file (year, authors, publisher, title, abstract), based on the work of Zhang et al. [9], which shows that assessors achieve higher recall with CAL when presented with only a single short excerpt rather than an entire document.
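
As an illustration, here is a minimal sketch of how such metadata-only excerpts can be assembled. The file name metadata.csv and the column names (cord_uid, publish_time, authors, journal, title, abstract) follow the publicly released CORD-19 metadata file and are assumptions here; the function name is ours and is not part of the BMI tool kit.

    import csv

    def load_metadata_excerpts(path="metadata.csv"):
        """Build one short, metadata-only excerpt per document.

        Column names follow the public CORD-19 metadata file; adjust them if
        the release at hand differs.
        """
        excerpts = {}
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                fields = [row.get(k, "") for k in
                          ("publish_time", "authors", "journal", "title", "abstract")]
                excerpts[row["cord_uid"]] = " | ".join(x for x in fields if x)
        return excerpts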

CAL.

The following outlines our specific implementation of CAL; a minimal code sketch follows the list.

  • STEP 1: Create a hypothetical relevant document, known as a synthetic document.
    To create the synthetic documents, we concatenated the query, question, and narrative components of the topics file provided by TREC-COVID, as shown in Figures 1 and 2.

    Figure 1: The synthetic document for topic 1.
    Figure 2: Snippet of topic 1 in the XML topics file provided by TREC-COVID.
  • STEP 2: Use a machine-learning algorithm to suggest the next most-likely relevant document.
    The machine-learning algorithm we chose is Sofia-ML, which Roegiest and Cormack used in BMI for the TREC 2015 Total Recall Track [6].

  • STEP 3: Review the suggested documents and provide relevance feedback to the learning algorithm, indicating whether each suggested document is actually relevant or not.
    To do this, we sorted the results given by Sofia-ML in decreasing order of confidence, presenting the topmost result to the human assessor through a text-based user interface. The judgement made by our human assessors is one of {0 - not relevant, 1 - partially relevant, 2 - relevant}, corresponding to the annotations made by biomedical experts as part of TREC-COVID following each round. As Sofia-ML does not distinguish between relevant and partially relevant judgements, both were treated as relevant in training.

  • STEP 4: Repeat Step 2 and 3 until very few, if any, of the suggested documents are relevant.
    Using the same stopping condition as in [5], we aimed to stop when the following criterion was met:

    n \geq a*m + b

    where m is the number of relevant documents reviewed, n is the number of non-relevant documents reviewed, a is a constant that determines how many non-relevant documents are to be reviewed in the course of finding each relevant document, and b is a constant that represents a fixed overhead of non-relevant documents that must be reviewed.
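
To make the four steps concrete, the following is a minimal sketch of the loop in Python. The functions classify and ask_assessor stand in for Sofia-ML scoring (which we invoked through the BMI tool kit) and for our human assessor, respectively; the default values of a and b are placeholders rather than the constants we actually used.

    def cal_loop(synthetic_doc, corpus, classify, ask_assessor, a=1.0, b=100):
        """Sketch of the CAL loop; classify/ask_assessor are stand-ins.

        classify(train) -> {doc: score} over the collection (higher = more
        likely relevant); ask_assessor(doc) -> 0, 1, or 2.
        """
        # Step 1: seed the training set with the synthetic document, labelled relevant.
        train = [(synthetic_doc, 1)]
        judged = {}
        m = n = 0  # relevant / non-relevant documents reviewed so far
        while True:
            # Step 2: suggest the next most-likely relevant unjudged document.
            scores = classify(train)
            remaining = [d for d in corpus if d not in judged]
            if not remaining:
                break
            doc = max(remaining, key=lambda d: scores.get(d, 0.0))
            # Step 3: human feedback; partially relevant (1) counts as relevant.
            grade = ask_assessor(doc)
            judged[doc] = grade
            label = 1 if grade >= 1 else 0
            train.append((doc, label))
            m += label
            n += 1 - label
            # Step 4: stop once n >= a*m + b.
            if n >= a * m + b:
                break
        return judged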

S-CAL.

One of the major drawbacks of the CAL method outlined above is the impractical number of documents that must be reviewed when the number of relevant documents is large. Scalable Continuous Active Learning (S-CAL) [3] addresses this issue by

  1. Segmenting the corpus into batches and allowing assessors to label only a small, finite sample of documents from each successive batch.

  2. Temporarily augmenting each training set with a set of 100 random documents from the corpus, labelled not relevant, which are, with high probability, truly not relevant for a large corpus.

However, the stopping condition for S-CAL outlined in [3] was still infeasible to reach given the size of CORD-19 and our team size; thus, we exchanged the initial dynamic stopping condition for a static target of assessing 300 documents per topic. A sketch of this modified procedure follows.
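
The following is a rough sketch, in the same spirit, of the S-CAL variant we used. The function and parameter names are ours, score_with and ask_assessor again stand in for Sofia-ML and the human assessor, and the batch-growth schedule and per-batch sample cap shown here are illustrative choices rather than the exact values from [3] or from our system.

    import random

    def scal_sketch(corpus, score_with, ask_assessor, budget=300, cap=10):
        """Sketch of our S-CAL variant with a flat budget of 300 assessments.

        score_with(train) -> {doc: score}; ask_assessor(doc) -> 0, 1, or 2.
        """
        train, judged = [], {}
        batch_size = 1
        while len(judged) < budget:
            # Temporarily augment the training set with 100 random documents
            # labelled not relevant; for a large corpus they almost surely are.
            augmented = train + [(d, 0) for d in random.sample(corpus, 100)]
            scores = score_with(augmented)
            unjudged = [d for d in corpus if d not in judged]
            if not unjudged:
                break
            # Take the next batch of top-scoring documents, but have the
            # assessor label only a small sample from it.
            batch = sorted(unjudged, key=lambda d: scores.get(d, 0.0),
                           reverse=True)[:batch_size]
            for doc in random.sample(batch, min(cap, len(batch))):
                grade = ask_assessor(doc)
                judged[doc] = grade
                train.append((doc, 1 if grade >= 1 else 0))
                if len(judged) >= budget:
                    break
            batch_size *= 2  # successive batches grow
        return judged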

Hyper-parameter Tuning.

Given the availability of labelled data after the first round, we performed hyper-parameter tuning on both the loop_type and the lambda value to better fit CORD-19. Finding no significant differences in our tests, we decided to continue with our initial values taken from [6], which had been chosen after discussion with the author of Sofia-ML as well as through their internal experiments.
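
For reference, loop_type and lambda are command-line options of Sofia-ML, so a sweep can be scripted directly around the binary. The sketch below is illustrative only: the candidate values, the learner type, and the file paths are placeholders, not our final settings.

    import itertools
    import subprocess

    # Illustrative sweep over Sofia-ML's --loop_type and --lambda options.
    # Paths and candidate values are placeholders.
    for loop_type, lam in itertools.product(["roc", "balanced-stochastic"],
                                            [1e-4, 1e-3, 1e-2]):
        subprocess.run(
            ["sofia-ml",
             "--learner_type", "logreg-pegasos",
             "--loop_type", loop_type,
             "--lambda", str(lam),
             "--training_file", "train.svmlight",         # placeholder path
             "--model_out", f"model.{loop_type}.{lam}"],   # one model per setting
            check=True)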

Creating Runs.

To generate the results for our runs, we created lists of 1000 documents ordered as shown in Figure 3; a sketch of these orderings follows the list below.

Figure 3: Document orderings for runs.
(i) Lead with documents labelled relevant, followed by those labelled partially relevant, and finally fill with Sofia-ML’s ranking of unseen documents in descending order of confidence.
(ii) Arrange all documents using Sofia-ML’s ranking; no special consideration is given to documents already assessed by a human assessor.
(iii) Keep documents that assessors have labelled not relevant in the final run.
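
As a rough sketch, the three orderings can be assembled as follows. The function and variable names are ours; in particular, the text above does not fully specify where the not-relevant documents are placed in method (iii), so the placement below (after the human-judged relevant and partially relevant documents) is only one reading.

    def order_run(judgements, ml_ranking, method="i", depth=1000):
        """Assemble a run of `depth` documents under orderings (i)-(iii).

        judgements: {doc_id: 0/1/2} human labels; ml_ranking: doc_ids from the
        learner in descending order of confidence over the whole collection.
        """
        relevant = [d for d, g in judgements.items() if g == 2]
        partial = [d for d, g in judgements.items() if g == 1]
        unseen = [d for d in ml_ranking if d not in judgements]
        if method == "i":      # human relevant, then partial, then unseen by score
            ordered = relevant + partial + unseen
        elif method == "ii":   # pure learner ordering, human labels ignored
            ordered = list(ml_ranking)
        else:                  # (iii): also keep the not-relevant judged documents;
                               # their placement here is an assumption
            not_relevant = [d for d, g in judgements.items() if g == 0]
            ordered = relevant + partial + not_relevant + unseen
        return ordered[:depth]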

Key-Term Highlighting.

Key-term highlighting is a feature commonly provided by IR systems, such as Google, to assist human readers in processing information. Following the online example of CAL shown in Figure 4, provided as a supplement to [4], we chose to highlight the five highest-scoring words in a document, according to Sofia-ML, in our UI for assessors, as shown in Figure 5.

Figure 4: A sample document retrieved from Grossman and Cormack’s online CAL platform, https://cormack.uwaterloo.ca/cal/, showing key-term highlighting.
Figure 5: Text-based user interface showing key-term highlighting.
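
As an illustration of the term-selection step described above, here is a minimal sketch. The term-weight lookup is a stand-in for the feature weights of the trained Sofia-ML model, and the tokenizer and ANSI markers are our own choices, not part of the BMI tool kit.

    import re

    HIGHLIGHT_ON, HIGHLIGHT_OFF = "\033[93m", "\033[0m"  # ANSI colour for a text UI

    def highlight_top_terms(text, term_weights, k=5):
        """Highlight the k highest-weighted terms that appear in `text`.

        term_weights: {term: weight}, e.g. taken from a trained linear model.
        """
        tokens = set(re.findall(r"[a-z][a-z0-9'-]*", text.lower()))
        top = sorted((t for t in tokens if t in term_weights),
                     key=lambda t: term_weights[t], reverse=True)[:k]
        for term in top:
            text = re.sub(rf"(?i)\b({re.escape(term)})\b",
                          rf"{HIGHLIGHT_ON}\1{HIGHLIGHT_OFF}", text)
        return text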

4 Results and Discussion

Table 1 shows the specifications of our system for each round of TREC-COVID and Table 2 shows our results. From these, we are able to make some interesting observations:

  1. Despite our human assessors having provided more labelled documents in Round 2 than in Round 1, our performance decreased. One possible explanation is that, through the use of the key-term highlighting feature, our human assessor(s) traded label quality for quantity, resulting in an overall poorer model.

  2. Despite being able to provide more labelled documents in Round 5 than in Round 4, our performance once again decreased. One possible explanation is that we did not perform the quality control required when adding human assessors, once again trading label quality for quantity and resulting in overall poorer performance.

  3. The runs ordered by method (iii) of Figure 3 consistently outperformed our other runs. This could imply that the documents judged not relevant by our assessors are still, on average, more relevant than the unseen documents ranked by Sofia-ML.

Round 1
  System overview: Document set processing; CAL; 1 assessor.
  Run submissions: xj4wang_run1 (ordered by method (i)).
  System issues: Being pressed for time, we were unable to reach our stopping condition, prematurely stopping after 40 document assessments for each topic. Using sort -rn instead of sort -rg resulted in documents with exponentially small confidence scores being sorted to the top during both the assessment process and run creation.

Round 2
  System overview: Same as Round 1, plus key-term highlighting.
  Run submissions: xj4wang_run3 (ordered by method (i)).
  System issues: Being pressed for time, we were unable to reach our stopping condition, prematurely stopping after 60 document assessments for each topic.

Round 3
  System overview: Same as Round 2, with CAL switched out for S-CAL, plus 1 additional assessor (2 in total).
  Run submissions: xj4wang_run1 (ordered by method (iii)); xj4wang_run2 (ordered by method (ii)); xj4wang_run3 (ordered by method (i)).
  System issues: Being pressed for time, we were unable to reach our stopping condition for every topic.

Round 4
  System overview: Same as Round 3, plus 1 additional assessor (3 in total).
  Run submissions: Same as Round 3.

Round 5
  System overview: Same as Round 4, plus 2 additional assessors (5 in total).
  Run submissions: Same as Round 3.

Table 1: Specifications of system design for each round of TREC-COVID.
Round 1 (30 topics; total relevant: 2352)
                          xj4wang_run1
Total number retrieved    30000
Total relevant retrieved  1216
MAP                       0.2367
Mean Bpref                0.4599
Mean NDCG@10              0.6513
Mean RBP(p=0.5)           –
P@5                       0.8333
P@10                      0.7167
P@15                      0.6067
P@20                      0.5417
P@30                      0.4522
R-Precision               0.2843

Round 2 (35 topics; total relevant: 3002)
                          xj4wang_run3
Total number retrieved    35000
Total relevant retrieved  1768
MAP                       0.2210
Mean Bpref                0.4823
Mean NDCG@10              0.5907
Mean RBP(p=0.5)           0.6546 +0.0041
P@5                       0.7314
P@10                      0.6400
P@15                      0.5505
P@20                      0.4957
P@30                      0.4162
R-Precision               0.2644

Round 3 (40 topics; total relevant: 4698)
                          xj4wang_run1      xj4wang_run2      xj4wang_run3
Total number retrieved    39938             39941             39942
Total relevant retrieved  2857              2794              2742
MAP                       0.2836            0.2534            0.2751
Mean Bpref                0.5681            0.5537            0.5464
Mean NDCG@10              0.7431            0.6252            0.7413
Mean RBP(p=0.5)           0.7300 +0.0407    0.6303 +0.2002    0.7299 +0.0412
P@5                       0.8350            0.7000            0.8350
P@10                      0.8325            0.7050            0.8275
P@15                      0.7200            0.6733            0.7183
P@20                      0.6625            0.6400            0.6588
P@30                      0.5633            0.5517            0.5608
R-Precision               0.3325            0.3111            0.3280

Round 4 (45 topics; total relevant: 5824)
                          xj4wang_run1      xj4wang_run2      xj4wang_run3
Total number retrieved    41931             42160             42241
Total relevant retrieved  3241              3024              2950
MAP                       0.2963            0.2774            0.2775
Mean Bpref                0.5507            0.5216            0.5084
Mean NDCG@20              0.7019            0.6855            0.7019
Mean RBP(p=0.5)           0.7946 +0.0194    0.7486 +0.0201    0.7946 +0.0194
P@5                       0.8933            0.8844            0.8933
P@10                      0.8422            0.8400            0.8422
P@15                      0.7644            0.7719            0.7644
P@20                      0.7244            0.7278            0.7244
P@30                      0.6378            0.6326            0.6385
R-Precision               0.3503            0.3266            0.3316

Round 5 (50 topics; total relevant: 10910)
                          xj4wang_run1      xj4wang_run2      xj4wang_run3
Total number retrieved    49923             49927             49928
Total relevant retrieved  6188              5889              5743
MAP                       0.2647            0.2509            0.2448
Mean Bpref                0.5254            0.5062            0.4912
Mean NDCG@20              0.6663            0.6685            0.6663
Mean RBP(p=0.5)           0.7767 +0.0018    0.7539 +0.0015    0.7767 +0.0018
P@5                       0.8640            0.8440            0.8640
P@10                      0.8080            0.8080            0.8080
P@15                      0.7307            0.7427            0.7307
P@20                      0.6780            0.7030            0.6780
P@30                      0.6247            0.6400            0.6247
R-Precision               0.3207            0.3076            0.2984

Table 2: Results of all runs of team xj4wang.
Round 1
                     xj4wang_run1    Highest manual run score    Highest overall run score
MAP                  0.2367          0.3008                      0.3128
Mean Bpref           0.4599          0.5294                      0.5294
Mean NDCG@10         0.6513          0.6844                      0.6844
Mean RBP(p=0.5)      0.724           0.7699                      0.7699
P@5                  0.8333          0.8333                      0.8333

Round 2
                     xj4wang_run3    Highest manual run score    Highest overall run score
MAP                  0.2210          0.338                       0.338
Mean Bpref           0.4823          0.5679                      0.5679
Mean NDCG@10         0.5907          0.6893                      0.6893
Mean RBP(p=0.5)      0.6546          0.7547                      0.7547
P@5                  0.7314          0.8514                      0.8514

Round 3
                     xj4wang_run1    Highest manual run score    Highest overall run score
MAP                  0.2836          0.3244                      0.3333
Mean Bpref           0.5681          0.5828                      0.6084
Mean NDCG@10         0.7431          0.7431                      0.7740
Mean RBP(p=0.5)      0.7300          0.7770                      0.8068
P@5                  0.8350          0.8350                      0.8950

Round 4
                     xj4wang_run1    Highest manual run score    Highest overall run score
MAP                  0.2963          0.3923                      0.4681
Mean Bpref           0.5507          0.6317                      0.6801
Mean NDCG@20         0.7019          0.7019                      0.7843
Mean RBP(p=0.5)      0.7946          0.8056                      0.8838
P@20                 0.7244          0.7278                      0.8211

Round 5
                     xj4wang_run1    xj4wang_run2    Highest manual run score    Highest overall run score
MAP                  0.2647          0.2509          0.3254                      0.4731
Mean Bpref           0.5254          0.5062          0.5255                      0.6378
Mean NDCG@20         0.6663          0.6685          0.6877                      0.8496
Mean RBP(p=0.5)      0.7767          0.7539          0.7789                      0.9399
P@20                 0.6780          0.7030          0.74                        0.8760

Table 3: Results of team xj4wang and the maximum scores obtained per measurement across all different teams.

5 Conclusion

In this paper, we report on our participation in Rounds 1 through 5 of the TREC 2020 COVID Track, describing our approach, results, and lessons learned. We initially used CAL [4], implemented with tools from the BMI tool kit [6], with ourselves as the annotators. The large human labelling effort required by our system motivated us to implement a key-term highlighting feature, adopt S-CAL [3], and recruit more human assessors. The results in Table 3 show us to be among the top-scoring manual runs and competitive within all categories of submissions throughout all rounds. Our results in Table 2 also raise the age-old question of quantity versus quality when it comes to data in IR.

Acknowledgement

A special thanks goes to Gordon Cormack for his valuable guidance and Anmol Singh for his insights.
We would also like to thank Charlotte Stinson, Eric Sheen, and Solaiappan Alagappan for the time and effort they spent assessing these documents.

This research was funded in part by a Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant awarded to Maura R. Grossman, No. RGPIN-2017-04239, titled “Evaluation of High-Recall Human-in-the-Loop Information Retrieval Technology.”

References

  • [1] Mustafa Abualsaud, Fuat C. Beylunioglu, Mark D. Smucker, and P. Robert Duimering. UWaterlooMDS at the TREC 2019 Decision Track. In NIST Special Publication 1250: The Twenty-Eighth Text REtrieval Conference Proceedings (TREC 2019), 2019.
  • [2] Gordon V. Cormack and Maura R. Grossman. Evaluation of machine-learning protocols for technology-assisted review in electronic discovery. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR ’14, pages 153–162, New York, NY, USA, 2014. Association for Computing Machinery.
  • [3] Gordon V. Cormack and Maura R. Grossman. Scalability of continuous active learning for reliable high-recall text classification. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management, CIKM ’16, pages 1039–1048, 2016.
  • [4] Maura R. Grossman and Gordon V. Cormack. Continuous active learning for TAR. Practical Law Journal, 2016.
  • [5] Maura R. Grossman, Gordon V. Cormack, and Adam Roegiest. TREC 2016 Total Recall Track overview. In The Twenty-Fifth Text REtrieval Conference Proceedings (TREC 2016), 2016.
  • [6] Adam Roegiest and Gordon V. Cormack. Total Recall Track tools architecture overview. In Proceedings of TREC 2015, 2015.
  • [7] Adam Roegiest, Gordon V. Cormack, Charles L. A. Clarke, and Maura R. Grossman. TREC 2015 Total Recall Track overview. In NIST Special Publication 500-319: The Twenty-Fourth Text REtrieval Conference Proceedings (TREC 2015), 2015.
  • [8] Ellen Voorhees, Tasmeer Alam, Steven Bedrick, Dina Demner-Fushman, William R. Hersh, Kyle Lo, Kirk Roberts, Ian Soboroff, and Lucy Lu Wang. TREC-COVID: Constructing a pandemic information retrieval test collection. arXiv preprint arXiv:2005.04474, 2020.
  • [9] Haotian Zhang, Mustafa Abualsaud, Nimesh Ghelani, Mark D. Smucker, Gordon V. Cormack, and Maura R. Grossman. Effective user interaction for high-recall retrieval: Less is more. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, CIKM ’18, pages 187–196, New York, NY, USA, 2018. Association for Computing Machinery.