
Improving Generated and Retrieved Knowledge Combination Through Zero-shot Generation

1st Xinkai Du Dig&AI Tech
Sunshine Insurance Group
Beijing, China
[email protected]
   2nd Quanjie Han Equal contribution. Dig&AI Tech
Sunshine Insurance Group
Beijing, China
[email protected]
   3rd Chao Lv Dig&AI Tech
Sunshine Insurance Group
Beijing, China
[email protected]
   4th Yan Liu Dig&AI Tech
Sunshine Insurance Group
Beijing, China
[email protected]
   5th Yalin Sun Dig&AI Tech
Sunshine Insurance Group
Beijing, China
[email protected]
   6th Hao Shu Dept. of Comp. Sci. & Tech.
Tsinghua University
Beijing, China
[email protected]
   7th Hongbo Shan Dept. of Comp. Sci. & Tech.
Tsinghua University
Beijing, China
[email protected]
   8th Maosong Sun Corresponding Author. Dept. of Comp. Sci. & Tech.
Tsinghua University
Beijing, China
[email protected]
Abstract

Open-domain Question Answering (QA) has garnered substantial interest by combining the advantages of faithfully retrieved passages and relevant passages generated by Large Language Models (LLMs). However, definitive labels for pairing these two sources of knowledge are lacking. To address this issue, we propose a simple, unsupervised framework called Bi-Reranking for Merging Generated and Retrieved Knowledge (BRMGR), which applies re-ranking to both retrieved passages and LLM-generated passages. We rank the two types of passages with two separate re-ranking methods and then combine them through greedy matching. We demonstrate that BRMGR is equivalent to employing a bipartite matching loss when assigning each retrieved passage to a corresponding LLM-generated passage. Experiments on three datasets show that our method improves performance by +1.7 and +1.6 on the NQ and WebQ datasets, respectively, and yields comparable results on the TriviaQA dataset relative to competitive baselines.

Index Terms:
unsupervised passage reranking, retrieval augmentation, generation augmentation, open question answering.

I Introduction

In the realm of knowledge-intensive tasks, such as open-domain question answering, a vast repository of world and domain-specific knowledge is paramount. A commonly employed approach, known as "retrieve-then-read" [1, 2, 3], involves leveraging external corpora to retrieve relevant passages. Subsequently, a reader model processes the query and the retrieved contexts to furnish answers. Nevertheless, this approach is prone to limitations. Specifically, the candidate documents used for retrieval are typically chunked into fixed-length segments (e.g., 100 words), leading to the retrieval of noisy or irrelevant information that does not directly pertain to the query at hand [4, 5].

Figure 1: Top-3 retrieval exact match score for each single knowledge source after reranking that knowledge source.

Recently, the strong generative capabilities of Large Language Models (LLMs) [6, 7, 8] have garnered significant attention. These models, trained on vast amounts of unsupervised data, encode a wide range of knowledge within their parameters. LLMs thus serve as an alternative information source, complementing traditional retrieval methods [6, 9, 10]. In fact, an alternative framework known as "generate-then-read" has been proposed, which generates query-related contexts rather than retrieving them from a corpus [4, 11].

Drawing from the foundations established by generation-augmented and retrieval-augmented methods, recent hybrid approaches aim to integrate these techniques to further enhance performance in Question Answering (QA) tasks [12, 13, 14]. Recognizing that retrieval-augmented methods may introduce unrelated content while generation-augmented methods may yield relevant yet potentially unfaithful contexts, a compatibility-oriented framework has been introduced to merge retrieved knowledge and LLM-generated knowledge [13]. This framework strives to preserve the factuality of retrieved knowledge while leveraging the relevance of LLM-generated knowledge. To assess compatibility between the two knowledge sources, discriminators such as an evidentiality discriminator and a consistency discriminator have been proposed. However, these methods often rely on silver label mining [15, 16, 13], which can be complex and data-intensive.

Motivated by the limitations of existing approaches and inspired by unsupervised reranking methods for retrieval-augmented QA [17, 18, 19, 20], we find that reranking both the retrieved knowledge and generated knowledge [21] can improve the performance (Figure 1). Furthermore, we introduce an unsupervised method for computing compatibility scores between retrieved and LLM-generated knowledge. This approach leverages zero-shot generation to determine compatibility, enabling a simpler and more efficient reranking process. Our method not only combines the strengths of retrieved and generated knowledge but also offers a complementary reranking approach for generated knowledge. Extensive experimental results demonstrate the effectiveness of our method, paving the way for future research in this exciting domain.

II Proposed Method

An overview of our unsupervised Bi-Reranking for Merging Generated and Retrieved knowledge (BRMGR) method is depicted in Figure 2. It illustrates the process of combining the faithfulness of retrieved knowledge with the relevant evidence from LLM-generated knowledge using unsupervised learning.


Figure 2: Overview of the BRMGR framework: an unsupervised method that reranks both LLM-generated and retrieved knowledge for open-domain QA. The relevance score of retrieved knowledge is the log-likelihood of the query conditioned on the retrieved passage, and the relevance score of generated knowledge is the log-likelihood of the generated passage conditioned on the query. Finally, the retrieved and generated knowledge are combined using greedy matching.

Let $\textbf{P}_L = \{\textbf{lp}_1, \textbf{lp}_2, \ldots, \textbf{lp}_M\}$ and $\textbf{P}_R = \{\textbf{rp}_1, \textbf{rp}_2, \ldots, \textbf{rp}_N\}$ be the collections of LLM-generated passages and retrieved passages, respectively. Given a query $\textbf{q}$, the goal is to find an optimal combination of $\textbf{lp}_i$ and $\textbf{rp}_j$ such that the correct answer is ranked as highly as possible. The combination relevance score is computed with a language model as $p(\textbf{lp}_i, \textbf{rp}_j \mid \textbf{q})$, $i = 1, 2, \ldots, M$, $j = 1, 2, \ldots, N$.

Given the query $\textbf{q}$, assume that $\textbf{lp}_i$ and $\textbf{rp}_j$ are conditionally independent; the combination relevance score can then be factorized into two probabilities:

$p(\textbf{lp}_i, \textbf{rp}_j \mid \textbf{q}) = p(\textbf{lp}_i \mid \textbf{q})\, p(\textbf{rp}_j \mid \textbf{q})$   (1)

In general, the retrieved passages may contain content unrelated to the query. Unsupervised Passage Re-ranking (UPR) suggests using the zero-shot query likelihood to improve passage retrieval [18].

$p(\textbf{rp}_j \mid \textbf{q}) = \frac{p(\textbf{q} \mid \textbf{rp}_j)\, p(\textbf{rp}_j)}{p(\textbf{q})} \propto p(\textbf{q} \mid \textbf{rp}_j)\, p(\textbf{rp}_j)$   (2)

Assuming $p(\textbf{rp}_j)$ is uniform, equation (2) reduces to

$p(\textbf{rp}_j \mid \textbf{q}) \propto p(\textbf{q} \mid \textbf{rp}_j), \quad \forall\, \textbf{rp}_j \in \textbf{P}_R$   (3)

Similar to UPR [18], the relevance score of the retrieval component is defined as:

$\log p(\textbf{q} \mid \textbf{rp}_j) = \frac{1}{|\textbf{q}|} \sum_t \log p(q_t \mid \textbf{q}_{<t}, \textbf{rp}_j; \Theta)$   (4)

where $\Theta$ denotes the parameters of a pretrained language model and $|\textbf{q}|$ denotes the number of query tokens.
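As a concrete illustration, the following is a minimal sketch of how the score in equation (4) could be computed with an off-the-shelf seq2seq checkpoint under teacher forcing; the checkpoint name, prompt wording, and helper names are illustrative assumptions rather than the exact implementation used in the paper.

```python
# Sketch of the query-likelihood score in equation (4), assuming a
# Hugging Face Flan-T5 checkpoint; names and prompt wording are illustrative.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base").eval()

@torch.no_grad()
def avg_token_logprob(condition: str, target: str) -> float:
    """Average log-probability of the target tokens given the condition.
    The seq2seq loss is the mean token-level cross-entropy under teacher
    forcing, so its negation is the average log-likelihood."""
    enc = tok(condition, return_tensors="pt", truncation=True)
    labels = tok(target, return_tensors="pt", truncation=True).input_ids
    return -model(**enc, labels=labels).loss.item()

def rerank_retrieved(query: str, passages: list[str]) -> list[int]:
    """Rank retrieved passages by log p(q | rp_j), equation (4)."""
    template = "Passage: {p}. Please write a question based on this passage."
    scores = [avg_token_logprob(template.format(p=p), query) for p in passages]
    return sorted(range(len(passages)), key=scores.__getitem__, reverse=True)
```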

Due to hallucination in large language models [22], the generated passages may contain unhelpful content. Nevertheless, given the strong generative ability of LLMs, the generated information is usually highly related to the query. We therefore propose to rerank the generated passages using the conditional probability of the generated passage given the query, $p(\textbf{lp}_i \mid \textbf{q})$.

$\log p(\textbf{lp}_i \mid \textbf{q}) = \frac{1}{|\textbf{lp}_i|} \sum_t \log p(lp_{i,t} \mid \textbf{lp}_{i,<t}, \textbf{q}; \Theta)$   (5)

where $\Theta$ denotes the parameters of a pretrained language model and $|\textbf{lp}_i|$ denotes the number of tokens in the $i$-th generated passage.
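Under the same assumptions, the score in equation (5) reverses the roles of query and passage: the generated passage is the target and the query is the condition, so the hypothetical helper above can be reused directly.

```python
# Sketch of equation (5), reusing avg_token_logprob from the previous sketch.
def rerank_generated(query: str, gen_passages: list[str]) -> list[int]:
    """Rank LLM-generated passages by log p(lp_i | q), equation (5)."""
    scores = [avg_token_logprob(query, p) for p in gen_passages]
    return sorted(range(len(gen_passages)), key=scores.__getitem__, reverse=True)
```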

Finally, the relevance of the $i$-th generated passage and the $j$-th retrieved passage to the query is computed as:

$p(\textbf{lp}_i, \textbf{rp}_j \mid \textbf{q}) \propto p(\textbf{lp}_i \mid \textbf{q})\, p(\textbf{q} \mid \textbf{rp}_j)$   (6)

where $p(\textbf{lp}_i \mid \textbf{q})$ and $p(\textbf{q} \mid \textbf{rp}_j)$ are computed by equations (5) and (4), respectively.

Since combining the relevance scores of the retrieved and LLM-generated knowledge yields an $M \times N$ score matrix, a natural question arises: why not use a bipartite matching loss [23, 24] and the Hungarian algorithm [25] to find the optimal match?

Theorem 1.

If the combination relevance score $p(\textbf{lp}_i, \textbf{rp}_j \mid \textbf{q})$ can be factorized into the relevance scores of the generated knowledge and the retrieved knowledge, and the number of passages of both types is the same, then the optimal match obtained through the bipartite matching loss is equivalent to the one obtained through greedy matching.

Proof.

We will prove this statement using mathematical induction. Since the combination relevance score $p(\textbf{lp}_i, \textbf{rp}_j \mid \textbf{q})$ can be factorized, we define $a_{ij} = p(\textbf{lp}_i, \textbf{rp}_j \mid \textbf{q}) = b_i c_j$, $i, j = 1, 2, \ldots, N$.

For $k = 1$, this is clearly true. Assuming that it holds true for $k = N - 1$, we can show that it also holds true for $k = N$ using the Hungarian algorithm.

Since the relevance scores are nonnegative, the maximum value of $a_{ij}$ is the product of the maximum value of $b_i$ and the maximum value of $c_j$; thus the top-1 combination pairs the top-1 generated passage with the top-1 retrieved passage. By induction, the remaining $N - 1$ pairs can be matched greedily.
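A small numerical check of Theorem 1 (a sketch under the stated factorization and nonnegativity assumptions): on a rank-one score matrix, greedy rank-for-rank pairing attains the same total score as the Hungarian algorithm from SciPy.

```python
# Numerical illustration of Theorem 1: a_ij = b_i * c_j with nonnegative
# scores, so greedy rank-for-rank matching equals the Hungarian optimum.
import numpy as np
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
b = rng.random(10)   # relevance scores of generated passages, cf. equation (5)
c = rng.random(10)   # relevance scores of retrieved passages, cf. equation (4)
a = np.outer(b, c)   # combined scores, cf. equation (6)

# Greedy matching: pair the k-th best generated with the k-th best retrieved.
greedy_value = sum(a[i, j] for i, j in zip(np.argsort(-b), np.argsort(-c)))

# Hungarian algorithm on the same matrix, maximizing total score.
rows, cols = linear_sum_assignment(a, maximize=True)
assert np.isclose(greedy_value, a[rows, cols].sum())
```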

III Experiments

III-A Experimental Setup

III-A1 Datasets

Following previous work [13], we use the three popular open-domain Question Answering (QA) datasets of TriviaQA [26], Natural Questions (NQ) [27] and WebQuestions (WebQ) [28].

For the generated knowledge, we use the passages provided by [4], which were produced by prompting InstructGPT [29]. There are 20 generated passages per question from human-written prompts and 10 generated passages from the clustering-based prompting method.

III-A2 Evaluation Metric

To evaluate the effectiveness of the proposed method, we measure its performance using the conventional top-K retrieval exact match metric [30]. This metric calculates the proportion of questions for which at least one passage within the top-K passages contains a span that matches the human-annotated answer for that question [18].
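For concreteness, a minimal sketch of this metric is given below; the answer normalization (lower-casing, punctuation stripping, whitespace collapsing) is an assumption, not necessarily the exact procedure of [30].

```python
# Sketch of the top-K retrieval exact match metric; the normalization
# details are an assumption.
import re
import string

def normalize(text: str) -> str:
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return re.sub(r"\s+", " ", text).strip()

def top_k_retrieval_em(ranked_passages: list[list[str]],
                       answers: list[list[str]], k: int) -> float:
    """Fraction of questions whose top-K passages contain an annotated answer span."""
    hits = 0
    for passages, golds in zip(ranked_passages, answers):
        if any(normalize(a) in normalize(p)
               for p in passages[:k] for a in golds):
            hits += 1
    return hits / len(answers)
```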

III-A3 Baselines

Following [13], we run comparison experiments with a single knowledge source and with two knowledge sources to demonstrate our method's effectiveness.

For single knowledge sources, we consider four methods:

  • Retri-Origin: Utilizes retrieved knowledge directly.

  • Retri-Rerank: Reranks retrieved knowledge using UPR.

  • Gen-Origin: Utilizes generated knowledge directly.

  • Gen-Rerank: Reranks generated knowledge using UPR.

When dealing with two knowledge sources (retrieved and generated), we consider three additional methods:

  • Origin-Combi: Simply merges the original retrieved and generated knowledge without reranking.

  • COMBO: Matches LLM-generated passages with their retrieved counterparts to form compatible pairs [13]. It is used only as a comparison baseline in the open question answering experiments.

  • BRMGR: Our proposed unsupervised Bi-Reranking for Merging Generated and Retrieved knowledge method, which aims to optimize the combination of both sources.

III-A4 Implementation Details

Following the approach in [18], we employ a range of Flan-T5 models [31], varying from the base to the xlarge version. Additionally, we consider the T0-3B model [32] due to its promising performance on retrieved passages [18]. To rerank the retrieved passages, the verbalizer is set to "Please write a question based on this passage." To provide context, the title and text of each passage are concatenated with the verbalizer head "passage: ". All experiments are conducted on a single 80G A100 GPU.
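Putting the pieces together, the snippet below sketches how the two rerankers and the greedy matching could be combined into a single context list for the reader; the interleaving order of matched pairs is our assumption, and the Fusion-in-Decoder reader itself is not shown. It reuses the hypothetical rerank_retrieved and rerank_generated helpers from the earlier sketches.

```python
# End-to-end sketch of BRMGR: rerank both sources, pair them greedily
# rank-for-rank, and interleave the pairs into one context list for the
# reader (the interleaving order is an assumption).
def brmgr_combine(query: str, retrieved: list[str], generated: list[str]) -> list[str]:
    r_order = rerank_retrieved(query, retrieved)   # equation (4)
    g_order = rerank_generated(query, generated)   # equation (5)
    contexts = []
    for i, j in zip(g_order, r_order):
        contexts.extend([generated[i], retrieved[j]])
    return contexts
```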

III-B Main Results

All methods in this study utilize 10 retrieved and/or LLM-generated passages for each question. The 10 retrieved passages are the top-10 passages returned by the Dense Passage Retrieval (DPR) method [2]. The 10 generated passages are sampled from the passages generated with human prompts in [4].

Retrieval The results of the top-K retrieval exact match metric, using the T0-3B model, are presented in Table I for the three datasets. It is evident that our method achieves the best performance, and the utilization of two knowledge sources outperforms using a single knowledge source. Notably, for the TriviaQA and WebQ test sets, the generated passages are found to be more helpful compared to the retrieved passages. Additionally, following the implementation of two separate unsupervised re-ranking methods, the performance of both the retrieved and generated passages exhibits improvement.

TABLE I: Top-{3, 5} exact match results on the test sets before and after reranking of the 10 retrieved and/or generated passages.
Methods        | TriviaQA       | NQ             | WebQ
               | Top-3   Top-5  | Top-3   Top-5  | Top-3   Top-5
Single Knowledge
Retri-Origin   | 68.57   72.40  | 63.35   68.78  | 60.09   65.01
Retri-Rerank   | 73.27   74.48  | 65.48   70.97  | 61.61   66.29
Gen-Origin     | 75.71   78.66  | 54.63   59.92  | 62.11   66.58
Gen-Rerank     | 75.78   78.82  | 56.62   61.58  | 64.76   68.31
Two Knowledge
Origin-Combi   | 82.34   84.41  | 75.68   79.78  | 74.61   78.20
BRMGR          | 83.06   84.81  | 77.01   81.44  | 74.90   78.49

Question Answering In order to assess the performance of our method in open question answering, we employ Fusion-in-Decoder [3] as the reader model, with COMBO [13] serving as the baseline method. The results of this evaluation are presented in Table II.

Table II demonstrates that our method outperforms the baseline approach on the NQ and WebQ datasets and achieves strong overall performance. In addition, re-ranking the original order of both the retrieved and generated passages leads to improved results. Moreover, incorporating two knowledge sources for open question answering proves to be more effective than relying on a single knowledge source alone.

TABLE II: Exact match scores computed by FiD on the test sets.
Methods        | TriviaQA | NQ   | WebQ
Single Knowledge
Retri-Origin   | 66.3     | 50.8 | 50.1
Gen-Origin     | 69.6     | 52.6 | 42.6
Two Knowledge
COMBO          | 74.6     | 54.2 | 53.0
Origin-Combi   | 73.2     | 53.8 | 52.4
Retri-Rerank   | 73.5     | 54.4 | 53.1
Gen-Rerank     | 73.8     | 54.6 | 53.3
BRMGR          | 74.4     | 55.9 | 54.6

III-C Ablation Studies

III-C1 Effect of the Pretrained-Language Models

To examine the impact of pretrained language models (PLMs) on the exact match score of generated passages, a range of Flan-T5 models [31] is utilized. Additionally, the T0-3B model [32] is used to evaluate performance on the NQ development set. The results in Table III show that all of the pretrained language models improve the overall performance of the generated passages. This contrasts with the size-dependent behavior observed for retrieved passages. Since the generated passages are produced by a powerful large language model, reranking with PLMs of different sizes yields similar results.

TABLE III: Comparison of different pre-trained language models (PLMs) as re-rankers for the generated passages on the NQ development set.
Reranker   | NQ (dev)
           | Top-1   Top-3   Top-5   Top-8
None       | 39.34   53.25   58.22   62.49
T5-base    | 40.11   55.24   59.63   62.96
T5-large   | 40.09   55.28   59.62   62.96
T5-xlarge  | 40.09   55.29   59.61   62.92
T0-3B      | 40.07   55.25   59.68   62.94

III-C2 Importance of Document Generation

To assess the significance of re-ranking by document generation, $p(\textbf{lp} \mid \textbf{q})$, we contrast it with an alternative unsupervised method in which re-ranking is based on question generation conditioned on the generated knowledge, $p(\textbf{q} \mid \textbf{lp})$. This value can be approximated by the average log-likelihood of generating the question tokens with a PLM under teacher forcing:

$\log p(\textbf{q} \mid \textbf{lp}) = \frac{1}{|\textbf{q}|} \sum_t \log p(q_t \mid \textbf{q}_{<t}, \textbf{lp}; \Theta)$   (7)

From the results in Figure 3, it is evident that the reranking scores computed as the log-likelihood of generating passage tokens conditioned on the query consistently improve over the original ordering. In contrast, the reranking scores computed as the log-likelihood of generating the query conditioned on the generated knowledge degrade performance, and this degradation becomes more pronounced for smaller PLMs.


Figure 3: Comparison of two passage re-ranking approaches on the NQ development set: (1) generating question tokens conditioned on the passage, $p(\textbf{q} \mid \textbf{lp})$, and (2) generating passage tokens conditioned on the question, $p(\textbf{lp} \mid \textbf{q})$. The results highlight the usefulness of document generation for reranking generated knowledge.

IV Conclusion

In this work, we propose BRMGR, an unsupervised bi-reranking method for merging retrieved passages and LLM-generated passages in open-domain Question Answering. Rather than relying on mined silver labels, BRMGR computes compatibility scores between the two types of passages with zero-shot generation. Extensive experiments on three datasets demonstrate the effectiveness of the proposed method.

Acknowledgements This work is supported by Beijing Municipal Science and Technology Plan Project (Z241100001324025).

References

  • [1] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel et al., “Retrieval-augmented generation for knowledge-intensive nlp tasks,” Advances in Neural Information Processing Systems, vol. 33, pp. 9459–9474, 2020.
  • [2] V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W.-t. Yih, “Dense passage retrieval for open-domain question answering,” in Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP).   Association for Computational Linguistics, 2020.
  • [3] G. Izacard and E. Grave, “Leveraging passage retrieval with generative models for open domain question answering,” in EACL 2021-16th Conference of the European Chapter of the Association for Computational Linguistics.   Association for Computational Linguistics, 2021, pp. 874–880.
  • [4] W. Yu, D. Iter, S. Wang, Y. Xu, M. Ju, S. Sanyal, C. Zhu, M. Zeng, and M. Jiang, “Generate rather than retrieve: Large language models are strong context generators,” in The Eleventh International Conference on Learning Representations, 2022.
  • [5] L. Gao, X. Ma, J. Lin, and J. Callan, “Precise zero-shot dense retrieval without relevance labels,” arXiv preprint arXiv:2212.10496, 2022.
  • [6] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell et al., “Language models are few-shot learners,” Advances in neural information processing systems, vol. 33, pp. 1877–1901, 2020.
  • [7] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann et al., “Palm: Scaling language modeling with pathways,” Journal of Machine Learning Research, vol. 24, no. 240, pp. 1–113, 2023.
  • [8] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar et al., “Llama: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
  • [9] D. Singh, S. Reddy, W. Hamilton, C. Dyer, and D. Yogatama, “End-to-end training of multi-document reader and retriever for open-domain question answering,” Advances in Neural Information Processing Systems, vol. 34, pp. 25 968–25 981, 2021.
  • [10] P. Zhang, S. Xiao, Z. Liu, Z. Dou, and J.-Y. Nie, “Retrieve anything to augment large language models,” arXiv preprint arXiv:2310.07554, 2023.
  • [11] M. Li, H. Zhuang, K. Hui, Z. Qin, J. Lin, R. Jagerman, X. Wang, and M. Bendersky, “Generate, filter, and fuse: Query expansion via multi-step keyword generation for zero-shot neural rankers,” arXiv preprint arXiv:2311.09175, 2023.
  • [12] A. Mallen, A. Asai, V. Zhong, R. Das, H. Hajishirzi, and D. Khashabi, “When not to trust language models: Investigating effectiveness and limitations of parametric and non-parametric memories,” arXiv preprint arXiv:2212.10511, 2022.
  • [13] Y. Zhang, M. Khalifa, L. Logeswaran, M. Lee, H. Lee, and L. Wang, “Merging generated and retrieved knowledge for open-domain qa,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 4710–4728.
  • [14] A. Abdallah and A. Jatowt, “Generator-retriever-generator: A novel approach to open-domain question answering,” arXiv preprint arXiv:2307.11278, 2023.
  • [15] A. Asai, M. Gardner, and H. Hajishirzi, “Evidentiality-guided generation for knowledge-intensive nlp tasks,” in Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022, pp. 2226–2243.
  • [16] H. Zhang, R. Zhang, J. Guo, M. de Rijke, Y. Fan, and X. Cheng, “From relevance to utility: Evidence retrieval with feedback for fact verification,” arXiv preprint arXiv:2310.11675, 2023.
  • [17] J. M. Ponte and W. B. Croft, “A language modeling approach to information retrieval,” in ACM SIGIR Forum, vol. 51, no. 2.   ACM New York, NY, USA, 2017, pp. 202–208.
  • [18] D. Sachan, M. Lewis, M. Joshi, A. Aghajanyan, W.-t. Yih, J. Pineau, and L. Zettlemoyer, “Improving passage retrieval with zero-shot question generation,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 3781–3797.
  • [19] S. Zhuang, B. Liu, B. Koopman, and G. Zuccon, “Open-source large language models are strong zero-shot query likelihood models for document ranking,” in Findings of the Association for Computational Linguistics: EMNLP 2023, 2023, pp. 8807–8817.
  • [20] C. N. d. Santos, X. Ma, R. Nallapati, Z. Huang, and B. Xiang, “Beyond [cls] through ranking by generation,” arXiv preprint arXiv:2010.03073, 2020.
  • [21] H. Tan, F. Sun, W. Yang, Y. Wang, Q. Cao, and X. Cheng, “Blinded by generated contexts: How language models merge generated and retrieved contexts when knowledge conflicts?” in Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2024, pp. 6207–6227.
  • [22] S. Tonmoy, S. Zaman, V. Jain, A. Rani, V. Rawte, A. Chadha, and A. Das, “A comprehensive survey of hallucination mitigation techniques in large language models,” arXiv preprint arXiv:2401.01313, 2024.
  • [23] D. Sui, X. Zeng, Y. Chen, K. Liu, and J. Zhao, “Joint entity and relation extraction with set prediction networks,” IEEE Transactions on Neural Networks and Learning Systems, 2023.
  • [24] X. Du, Q. Han, Y. Sun, C. Lv, and M. Sun, “Label dependencies-aware set prediction networks for multi-label text classification,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2024, pp. 11 206–11 210.
  • [25] H. W. Kuhn, “The hungarian method for the assignment problem,” Naval research logistics quarterly, vol. 2, no. 1-2, pp. 83–97, 1955.
  • [26] M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer, “Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension,” in Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2017, pp. 1601–1611.
  • [27] T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee et al., “Natural questions: a benchmark for question answering research,” Transactions of the Association for Computational Linguistics, vol. 7, pp. 453–466, 2019.
  • [28] J. Berant, A. Chou, R. Frostig, and P. Liang, “Semantic parsing on freebase from question-answer pairs,” in Proceedings of the 2013 conference on empirical methods in natural language processing, 2013, pp. 1533–1544.
  • [29] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray et al., “Training language models to follow instructions with human feedback,” Advances in Neural Information Processing Systems, vol. 35, pp. 27 730–27 744, 2022.
  • [30] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang, “Squad: 100,000+ questions for machine comprehension of text,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, 2016, pp. 2383–2392.
  • [31] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,” The Journal of Machine Learning Research, vol. 21, no. 1, pp. 5485–5551, 2020.
  • [32] V. Sanh, A. Webson, C. Raffel, S. H. Bach, L. Sutawika, Z. Alyafeai, A. Chaffin, A. Stiegler, A. Raja, M. Dey et al., “Multitask prompted training enables zero-shot task generalization,” in International Conference on Learning Representations, 2022.