Participation in TREC 2020 COVID Track Using Continuous Active Learning
Abstract
We describe our participation in all five rounds of the TREC 2020 COVID Track (TREC-COVID). The goal of TREC-COVID is to contribute to the response to the COVID-19 pandemic by identifying answers to many pressing questions and building infrastructure to improve search systems [8]. All five rounds of this Track challenged participants to perform a classic ad-hoc search task on the new CORD-19 data collection. Our solution addressed this challenge by applying the Continuous Active Learning model (CAL) and its variations. Our results placed us amongst the top-scoring manual runs, and we remained competitive within all categories of submissions.
1 Introduction
As the spread of COVID-19 continues around the globe, researchers, clinicians, and policy makers involved in its response are constantly searching for reliable information on the virus. This presents those of us in the information retrieval (IR) and text processing communities with a unique opportunity to contribute to the response to this pandemic by building infrastructure to improve search systems and by helping to identify answers to some of today’s most pressing questions [8]. The task of TREC-COVID is for participants to retrieve the most relevant documents from the CORD-19 dataset for a given set of topics. To address this challenge, we implemented a system based on CAL, following the work of Grossman and Cormack [3, 4], using the tool kit provided as part of the Baseline Model Implementation (BMI) created by Roegiest and Cormack [6], with ourselves as the human assessors.
2 Related Work
In this section, we discuss prior research on CAL. We then discuss prior research on BMI, which provides the tool kits we heavily relied upon for this challenge.
Continuous Active Learning (CAL).
CAL is a method for finding virtually all relevant information on a particular subject within a vast sea of electronically stored information (ESI): it repeatedly refines its understanding about which of the remaining documents are most likely to be of interest, based on the users’ feedback regarding the documents already judged [4]. This protocol is most famously used in technology-assisted review (TAR) for electronic discovery in legal matters, achieving the best results reported in scientific literature to date [2]. Building on the CAL protocol, many implementations, such as BMI, have been highly successful at performing ad-hoc retrieval tasks, such as in the TREC 2015/2016 Total Recall Track [7, 5] and the TREC 2019 Decision Track [1].
Baseline Model Implementation (BMI).
BMI is an autonomous, augmented implementation of CAL. It was initially made available to participants of the TREC 2015/2016 Total Recall Tracks [7, 5], as well as the TREC 2019 Decision Track [1], to provide a baseline for comparison. However, BMI turned out to be highly competitive, with none of the manual participants achieving consistently superior results to this fully automated method [6].
While BMI has been shown to generally outperform human-in-the-loop CAL implementations [6], it requires labelled data, which was very limited, if available at all, for TREC-COVID; thus, we chose to insert a human back into the loop to make judgements. All other components, such as feature-vector creation and the learner, were taken directly from the BMI tool kit.
3 System Overview & General Approach
Document Set Processing.
The document set used in the TREC-COVID Challenge is the COVID-19 Open Research Dataset (CORD-19). Our team opted to judge a document’s relevancy using strictly the information available in the metadata file (year, authors, publisher, title, abstract), based on the work of Zhang et al. [9], which showed that participants achieve higher recall with CAL when presented with only a single short excerpt rather than an entire document.
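As an illustration of this metadata-only representation, the following is a minimal sketch of assembling a short judging text per document from the CORD-19 metadata file. The file name `metadata.csv` and the column names used below are assumptions about the collection layout rather than a description of our exact pipeline.

```python
import csv

# Assumed CORD-19 metadata columns (publish_time, authors, journal, title,
# abstract, cord_uid); match these to the CORD-19 release actually used.
FIELDS = ["publish_time", "authors", "journal", "title", "abstract"]

def load_metadata_docs(path="metadata.csv"):
    """Build one short, metadata-only text per document for assessment."""
    docs = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            parts = [row.get(field, "") for field in FIELDS]
            docs[row["cord_uid"]] = "\n".join(p for p in parts if p)
    return docs
```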
CAL.
The following outlines our specific implementation of CAL; a code sketch of the full loop follows the list.

• STEP 1: Create a hypothetical relevant document, known as a synthetic document. To create the synthetic documents, we concatenated the query, question, and narrative components of the topics file provided by TREC-COVID, as shown in Figures 1 and 2.

Figure 1: The synthetic document for topic 1.

Figure 2: Snippet of topic 1 in the XML topics file provided by TREC-COVID.

• STEP 2: Use a machine-learning algorithm to suggest the next most-likely relevant document. The machine-learning algorithm we chose is Sofia-ML, which Roegiest and Cormack used in the BMI for the TREC 2015 Total Recall Track [6].

• STEP 3: Review the suggested documents and provide relevance feedback to the learning algorithm, indicating whether each suggested document is actually relevant or not. To do this, we sorted the results given by Sofia-ML in decreasing order of confidence and presented the top-ranked result to the human assessor through a text-based user interface. The judgement made by our human assessors is one of {0: not relevant, 1: partially relevant, 2: relevant}, matching the annotations made by biomedical experts as part of TREC-COVID following each round. As Sofia-ML does not distinguish between relevant judgements and partially relevant judgements, both were designated to be relevant in training.

• STEP 4: Repeat Steps 2 and 3 until very few, if any, of the suggested documents are relevant. Using the same stopping condition as in [5], we aimed to stop when the following criterion was met:

n ≥ a·m + b,

where m is the number of relevant documents reviewed, n is the number of irrelevant documents reviewed, a is a constant which determines how many non-relevant documents are to be reviewed in the course of finding each relevant document, and b is a constant which represents a fixed overhead for the number of irrelevant documents that must be reviewed.
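To make the loop concrete, here is a minimal sketch of Steps 1–4. It stands in a generic TF-IDF featurizer and logistic-regression scorer for the BMI feature vectors and Sofia-ML, uses the reconstructed stopping rule n ≥ a·m + b, and pads training with random pseudo-negative documents so that both classes are always present (cf. the S-CAL augmentation below). The function names, parameter defaults, and the `ask_assessor` callback are illustrative assumptions, not our exact pipeline.

```python
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def cal_loop(synthetic_doc, corpus_texts, ask_assessor, a=1, b=100):
    """Sketch of Steps 1-4: seed with the synthetic document, then repeatedly
    train, present the top-ranked unjudged document, and fold the feedback in."""
    vec = TfidfVectorizer()
    X = vec.fit_transform([synthetic_doc] + list(corpus_texts))
    seed, docs = X[0], X[1:]
    rng = np.random.default_rng(0)

    judged = {}           # doc index -> assessor label in {0, 1, 2}
    m = n = 0             # counts of relevant / non-relevant judgements
    while n < a * m + b:  # stop once n >= a*m + b (reconstructed rule)
        if len(judged) == docs.shape[0]:
            break
        # Training data: the synthetic doc as relevant, all human judgements,
        # plus random documents provisionally labelled not relevant so that
        # both classes are always present.
        randoms = rng.choice(docs.shape[0], size=min(100, docs.shape[0]),
                             replace=False)
        randoms = [int(i) for i in randoms if i not in judged]
        rows = [seed] + [docs[i] for i in judged] + [docs[i] for i in randoms]
        # Partially relevant (1) and relevant (2) both train as relevant.
        labels = [1] + [int(judged[i] >= 1) for i in judged] + [0] * len(randoms)
        model = LogisticRegression(max_iter=1000).fit(vstack(rows), labels)

        # Present the highest-scoring unjudged document to the assessor.
        scores = model.decision_function(docs)
        best = max((i for i in range(docs.shape[0]) if i not in judged),
                   key=lambda i: scores[i])
        label = ask_assessor(best)   # the assessor returns 0, 1, or 2
        judged[best] = label
        m += int(label >= 1)
        n += int(label == 0)
    return judged
```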
S-CAL.
One of the major drawbacks of the CAL method outlined above is the impractical number of documents that must be reviewed when the number of relevant documents is large. Scalable Continuous Active Learning (S-CAL) [3] addresses this issue by:

1. segmenting the corpus into batches and allowing assessors to label only a small, finite sample of documents from each successive batch; and

2. temporarily augmenting each training set with a set of 100 random documents from the corpus, labelled not relevant, which are, with high probability, genuinely not relevant for a large corpus.

However, the stopping condition for S-CAL outlined in [3] was still infeasible to achieve with CORD-19 and our team size; thus, we exchanged the original dynamic stopping condition for a static goal of assessing 300 documents per topic. A sketch of these modifications follows.
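The sketch below illustrates both modifications as we applied them, assuming a `rank_unjudged` callback that retrains the learner and returns the remaining documents in ranked order. The batch-growth schedule, the per-batch sample cap, and the helper names are illustrative assumptions rather than the exact parameters of [3].

```python
import math
import random

def augment_with_random_negatives(train_ids, train_labels, corpus_ids, k=100):
    """Temporarily add k random corpus documents labelled not relevant (0); for
    a large corpus these are, with high probability, genuinely not relevant."""
    seen = set(train_ids)
    extras = random.sample([d for d in corpus_ids if d not in seen], k)
    return train_ids + extras, train_labels + [0] * k

def scal_judging(rank_unjudged, ask_assessor, budget=300, cap=30):
    """Judge only a capped random sample from each successive batch, stopping
    at a fixed budget of 300 assessments per topic (our static stopping goal)."""
    judged = {}
    batch_size = 1
    while len(judged) < budget:
        batch = rank_unjudged(judged)[:batch_size]  # retrain and rerank each pass
        if not batch:
            break
        sample = random.sample(batch, min(cap, len(batch), budget - len(judged)))
        for doc_id in sample:
            judged[doc_id] = ask_assessor(doc_id)   # assessor returns 0, 1, or 2
        batch_size += math.ceil(batch_size / 10)    # batches grow over time
    return judged
```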
Hyper-parameter Tuning.
Given the availability of labelled data after the first round, we performed hyper-parameter tuning on both the loop_type and the lambda value to better fit CORD-19. Finding no significant differences in our tests, we decided to continue with our initial values taken from [6], which had been chosen based on discussions with the author of Sofia-ML as well as their internal experiments.
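For reference, a sweep over these two options can be scripted against the Sofia-ML binary shipped with the BMI tool kit roughly as follows. Only the option names --loop_type and --lambda come from the description above; the candidate values, learner type, and file names are illustrative assumptions.

```python
import itertools
import subprocess

# Candidate values are illustrative; file names below are placeholders.
LOOP_TYPES = ["roc", "balanced-stochastic"]
LAMBDAS = [0.0001, 0.001, 0.01]

def train_model(loop_type, lam, train_file="topic.train", model_file="topic.model"):
    """Call the sofia-ml binary (assumed to be on PATH) with one configuration."""
    subprocess.run(
        ["sofia-ml",
         "--learner_type", "logreg-pegasos",
         "--loop_type", loop_type,
         "--lambda", str(lam),
         "--training_file", train_file,
         "--model_out", model_file],
        check=True)

for loop_type, lam in itertools.product(LOOP_TYPES, LAMBDAS):
    train_model(loop_type, lam, model_file=f"topic.{loop_type}.{lam}.model")
    # Evaluate each model against the Round 1 relevance judgements and keep
    # the best-performing configuration.
```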
Creating Runs.
To generate the results for our runs, we created lists of 1000 documents ordered as shown in Figure 3; a code sketch of these orderings follows the list below.

(i) Lead with documents labelled relevant, followed by those labelled partially relevant, and finally fill with Sofia-ML’s ranking of unseen documents in descending order of confidence.
(ii) Arrange all documents using Sofia-ML’s ranking; no special consideration is given to documents already assessed by a human assessor.
(iii) Keep documents that annotators have labelled not relevant in the final run.
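A minimal sketch of orderings (i) and (iii), assuming dictionaries of human labels and Sofia-ML scores keyed by document identifier. Placing the not-relevant judgements ahead of the unseen documents in ordering (iii) is our assumption, and ordering (ii) is simply the score-sorted list with the judgement handling removed.

```python
def build_run(labels, scores, include_not_relevant=False, depth=1000):
    """Ordering (i): relevant, then partially relevant, then unseen documents
    by decreasing Sofia-ML score. With include_not_relevant=True this sketches
    ordering (iii), keeping documents the assessors judged not relevant (here
    placed ahead of the unseen documents, which is an assumed placement)."""
    relevant = [d for d, lab in labels.items() if lab == 2]
    partial = [d for d, lab in labels.items() if lab == 1]
    not_rel = [d for d, lab in labels.items() if lab == 0] if include_not_relevant else []
    unseen = sorted((d for d in scores if d not in labels),
                    key=scores.get, reverse=True)
    return (relevant + partial + not_rel + unseen)[:depth]
```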
Key-Term Highlighting.
Key-term highlighting is a feature commonly provided by IR systems, such as Google, to assist human readers in processing information. Following the online sample of CAL, as shown in Figure 5, given as a supplement to [4], we chose to highlight the top five highest-scoring words from a document, according to Sofia-ML, in our UI for assessors, as shown in Figure 5.
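A minimal sketch of how such highlighting can be done in a text-based UI, assuming access to a dictionary of per-term model weights; the tokenization, the ANSI escape codes, and the exact tie to Sofia-ML’s model format are assumptions.

```python
import re

def top_terms(text, term_weights, k=5):
    """Pick the document's k distinct terms with the highest learned weight."""
    terms = set(re.findall(r"[a-z0-9']+", text.lower()))
    ranked = sorted(terms, key=lambda t: term_weights.get(t, 0.0), reverse=True)
    return [t for t in ranked[:k] if term_weights.get(t, 0.0) > 0]

def highlight(text, terms):
    """Wrap the chosen terms in ANSI reverse-video codes for the text-based UI."""
    if not terms:
        return text
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, terms)) + r")\b",
                         re.IGNORECASE)
    return pattern.sub(lambda m: "\033[7m" + m.group(0) + "\033[0m", text)
```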


4 Results and Discussion
Table 1 shows the specifications of our system for each round of TREC-COVID and Table 2 shows our results. From these, we are able to make some interesting observations:
1. Despite our human assessors having provided more labelled documents in Round 2 than in Round 1, our performance decreased. One possible explanation is that, through the use of the key-term highlighting feature, our human assessor(s) exchanged label quality for quantity, resulting in an overall poorer model.

2. Despite being able to provide more labelled documents in Round 5 than in Round 4, our performance once again decreased. One possible explanation is that we did not perform the quality control required when adding human assessors, once again exchanging label quality for quantity and resulting in overall poorer performance.

3. The runs ordered by method (iii) of Figure 3 consistently outperformed our other runs. This could imply that the documents judged not relevant by our assessors are still more relevant than the unseen documents ranked highest by Sofia-ML.
Table 1: System overview, run submissions, and system issues for each round of TREC-COVID; the systems used in Rounds 4 and 5 were the same as in Round 3.
Table 2: Results of our submitted runs for each round of TREC-COVID.

| | Round 1 | Round 2 | Round 3, run 1 | Round 3, run 2 | Round 3, run 3 | Round 4, run 1 | Round 4, run 2 | Round 4, run 3 | Round 5, run 1 | Round 5, run 2 | Round 5, run 3 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Number of topics | 30 | 35 | 40 | 40 | 40 | 45 | 45 | 45 | 50 | 50 | 50 |
| Total number retrieved | 30000 | 35000 | 39938 | 39941 | 39942 | 41931 | 42160 | 42241 | 49923 | 49927 | 49928 |
| Total relevant | 2352 | 3002 | 4698 | 4698 | 4698 | 5824 | 5824 | 5824 | 10910 | 10910 | 10910 |
| Total relevant retrieved | 1216 | 1768 | 2857 | 2794 | 2742 | 3241 | 3024 | 2950 | 6188 | 5889 | 5743 |
| MAP | 0.2367 | 0.2210 | 0.2836 | 0.2534 | 0.2751 | 0.2963 | 0.2774 | 0.2775 | 0.2647 | 0.2509 | 0.2448 |
| Mean Bpref | 0.4599 | 0.4823 | 0.5681 | 0.5537 | 0.5464 | 0.5507 | 0.5216 | 0.5084 | 0.5254 | 0.5062 | 0.4912 |
| Mean NDCG@10 | 0.6513 | 0.5907 | 0.7431 | 0.6252 | 0.7413 | | | | | | |
| Mean NDCG@20 | | | | | | 0.7019 | 0.6855 | 0.7019 | 0.6663 | 0.6685 | 0.6663 |
| Mean RBP(p=0.5) | | | | | | | | | | | |
| P@5 | 0.8333 | 0.7314 | 0.8350 | 0.7000 | 0.8350 | 0.8933 | 0.8844 | 0.8933 | 0.8640 | 0.8440 | 0.8640 |
| P@10 | 0.7167 | 0.6400 | 0.8325 | 0.7050 | 0.8275 | 0.8422 | 0.8400 | 0.8422 | 0.8080 | 0.8080 | 0.8080 |
| P@15 | 0.6067 | 0.5505 | 0.7200 | 0.6733 | 0.7183 | 0.7644 | 0.7719 | 0.7644 | 0.7307 | 0.7427 | 0.7307 |
| P@20 | 0.5417 | 0.4957 | 0.6625 | 0.6400 | 0.6588 | 0.7244 | 0.7278 | 0.7244 | 0.6780 | 0.7030 | 0.6780 |
| P@30 | 0.4522 | 0.4162 | 0.5633 | 0.5517 | 0.5608 | 0.6378 | 0.6326 | 0.6385 | 0.6247 | 0.6400 | 0.6247 |
| R-Precision | 0.2843 | 0.2644 | 0.3325 | 0.3111 | 0.3280 | 0.3503 | 0.3266 | 0.3316 | 0.3207 | 0.3076 | 0.2984 |
Table 3: Comparison of our runs with other TREC-COVID submissions in each round.

| | Round 1 | | | Round 2 | | | Round 3 | | |
|---|---|---|---|---|---|---|---|---|---|
| MAP | 0.2367 | 0.3008 | 0.3128 | 0.2210 | 0.338 | 0.338 | 0.2836 | 0.3244 | 0.3333 |
| Mean Bpref | 0.4599 | 0.5294 | 0.5294 | 0.4823 | 0.5679 | 0.5679 | 0.5681 | 0.5828 | 0.6084 |
| Mean NDCG@10 | 0.6513 | 0.6844 | 0.6844 | 0.5907 | 0.6893 | 0.6893 | 0.7431 | 0.7431 | 0.7740 |
| Mean RBP(p=0.5) | 0.724 | 0.7699 | 0.7699 | 0.6546 | 0.7547 | 0.7547 | 0.7300 | 0.7770 | 0.8068 |
| P@5 | 0.8333 | 0.8333 | 0.8333 | 0.7314 | 0.8514 | 0.8514 | 0.8350 | 0.8350 | 0.8950 |

| | Round 4 | | | Round 5 | | | |
|---|---|---|---|---|---|---|---|
| MAP | 0.2963 | 0.3923 | 0.4681 | 0.2647 | 0.2509 | 0.3254 | 0.4731 |
| Mean Bpref | 0.5507 | 0.6317 | 0.6801 | 0.5254 | 0.5062 | 0.5255 | 0.6378 |
| Mean NDCG@20 | 0.7019 | 0.7019 | 0.7843 | 0.6663 | 0.6685 | 0.6877 | 0.8496 |
| Mean RBP(p=0.5) | 0.7946 | 0.8056 | 0.8838 | 0.7767 | 0.7539 | 0.7789 | 0.9399 |
| P@20 | 0.7244 | 0.7278 | 0.8211 | 0.6780 | 0.7030 | 0.74 | 0.8760 |
5 Conclusion
In this paper, we report on our participation in Rounds 1 through 5 of the TREC 2020 COVID Track, describing our approach, results, and lessons learned. We initially used CAL [4], implemented with tools from the BMI tool kit [6], with ourselves as the annotators. The large human labelling effort required by our system motivated us to implement a key-term highlighting feature, adopt S-CAL [3], and recruit more human assessors. The results in Table 3 show us to be among the top-scoring manual runs and competitive within all categories of submissions throughout all rounds. Our results in Table 2 also raise the age-old question of quantity versus quality when it comes to labelled data in IR.
Acknowledgement
A special thanks goes to Gordon Cormack for his valuable guidance and Anmol Singh for his insights.
We would also like to thank Charlotte Stinson, Eric Sheen, and Solaiappan Alagappan for the time and effort they spent assessing these documents.
This research was funded in part by a Natural Sciences and Engineering Research Council of Canada (NSERC) Discovery Grant awarded to Maura R. Grossman, No. RGPIN-2017-04239, titled “Evaluation of High-Recall Human-in-the-Loop Information Retrieval Technology.”
References
- [1] Mustafa Abualsaud, Fuat C. Beylunioglu, Mark D. Smucker, and P. Robert Duimering. UWaterlooMDS at the TREC 2019 Decision Track. In NIST Special Publication 1250: The Twenty-Eighth Text REtrieval Conference Proceedings (TREC 2019), 2019.
- [2] Gordon V. Cormack and Maura R. Grossman. Evaluation of machine-learning protocols for technology-assisted review in electronic discovery. In Proceedings of the 37th International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR ’14, pages 153–162, New York, NY, USA, 2014. Association for Computing Machinery.
- [3] Gordon V. Cormack and Maura R. Grossman. Scalability of continuous active learning for reliable high-recall text classification. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management, CIKM ’16, pages 1039–1048, 2016.
- [4] Maura R. Grossman and Gordon V. Cormack. Continuous active learning for TAR. Practical Law Journal, 2016.
- [5] Maura R. Grossman, Gordon V. Cormack, and Adam Roegiest. TREC 2016 Total Recall Track overview. In The Twenty-Fifth Text REtrieval Conference Proceedings (TREC 2016), 2016.
- [6] Adam Roegiest and Gordon V. Cormack. Total Recall Track tools architecture overview. In Proceedings of TREC 2015, 2015.
- [7] Adam Roegiest, Gordon V. Cormack, Charles L. A. Clarke, and Maura R. Grossman. TREC 2015 Total Recall Track overview. In NIST Special Publication 500-319: The Twenty-Fourth Text REtrieval Conference Proceedings (TREC 2015), 2015.
- [8] Ellen Voorhees, Tasmeer Alam, Steven Bedrick, Dina Demner-Fushman, William R. Hersh, Kyle Lo, Kirk Roberts, Ian Soboroff, and Lucy Lu Wang. TREC-COVID: Constructing a pandemic information retrieval test collection. arXiv preprint arXiv:2005.04474, 2020.
- [9] Haotian Zhang, Mustafa Abualsaud, Nimesh Ghelani, Mark D. Smucker, Gordon V. Cormack, and Maura R. Grossman. Effective user interaction for high-recall retrieval: Less is more. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management, CIKM ’18, pages 187–196, New York, NY, USA, 2018. Association for Computing Machinery.