
On the Evaluation of Answer-Agnostic Paragraph-level Multi-Question Generation

Jishnu Ray Chowdhury
University of Illinois at Chicago
[email protected]
Debanjan Mahata
Moody’s Analytics
[email protected]
Cornelia Caragea
University of Illinois at Chicago
[email protected]
Abstract

We study the task of predicting a set of salient questions from a given paragraph without any prior knowledge of the precise answer. We make two main contributions. First, we propose a new method to evaluate a set of predicted questions against the set of references by using the Hungarian algorithm to assign predicted questions to references before scoring the assigned pairs. We show that our proposed evaluation strategy has better theoretical and practical properties compared to prior methods because it can properly account for the coverage of references. Second, we compare different strategies to utilize a pre-trained seq2seq model to generate and select a set of questions related to a given paragraph. The code is available at https://github.com/JRC1995/QuestionGenerationPub.

1 Introduction

Question generation (QG) is the task of automatically generating questions that are pertinent to a given text. QG has multiple applications. For example, it can be used to construct educational materials by generating practice questions to test reading comprehension Heilman and Smith (2010), help dialogue systems ask better questions Wang et al. (2018), or drive conversations around news articles Laban et al. (2020). QG has been also used for creating synthetic data to train competitive question answering models Shakeri et al. (2020); Puri et al. (2020), for pre-training models to improve downstream performance Narayan et al. (2020), and for evaluating the factuality of automated abstractive summarizations Wang et al. (2020). For any large-scale application on emerging data, it would be infeasible to manually construct questions; thus, the need for automating QG arises. There are, however, multiple variants of QG. In this paper, we focus on a particular variant of the task (see Figure 1) with the following characteristics:

Input: the central government estimates that over 7,000 inadequately engineered schoolrooms collapsed in the earthquake chinese citizens have since invented a catch phrase tofu-dregs schoolhouses….. Generated Questions: 1. what policy caused many families to lose their only child? 2. what is the catch phrase for inadequately engineered schoolhouses? 3. how many inadequately engineered schoolrooms collapsed in the earthquake? 4. what is the age of the so-called illegal children? Reference Questions: 1. how many schoolrooms collapsed in the quake? 2. what catch-phrase was invented as a result of collapsed schools? 3. why did so many schools collapse during the earthquake? 4. what are the estimations of how many schoolrooms collapsed? 5. what has the citizenry started calling these type of schools? 6. what can illegal children be registered as in place of their dead siblings?

Figure 1: A sample of a passage from SQuAD1.1 with a set of generated questions and a set of references

1. Answer-Agnostic - Given a document, our models are trained to generate questions related to the document without any prior knowledge about the answer (answer-agnostic QG). The model has to learn to (implicitly or explicitly) seek out question-worthy sentences and phrases to generate their corresponding questions. Answer-agnostic QG is, in general, more challenging than answer-aware question generation (in which the answer is usually highlighted in some fashion). Practical applications of answer-aware QG can be partly limited because they would require some external mechanism (for example, a named entity tagger Wang et al. (2020) or a keyphrase generator Willis et al. (2019)) or manual annotation to choose specific answers to generate questions from. An external mechanism may only be able to pick very specific types of answers (like keyphrases or named entities) and thus restrict the range of questions that can be generated.

2. Paragraph-Level - Instead of generating questions at a sentence-level, we focus on generating the most important questions for a whole paragraph (or a document). The model has to learn to decide which areas of the paragraph it should focus on while generating the most salient questions.

3. Multiple Questions - We focus on generating and evaluating a set of multiple questions that can be asked from the given paragraph.

Overall, similar to Du and Cardie (2017), our aim with this task is to generate an appropriate number of diverse questions around the most important (“question-worthy”) areas in the paragraph. In Figure 1, we show an example that reflects the above characteristics of the task. As can be seen in the figure, we have a paragraph input, a set of ground references, and a set of generated questions, ideally corresponding to the salient question-worthy areas of the paragraph. However, although we aim to generate a set of questions comparable to a ground truth set, standard evaluations of QG Du et al. (2017); Qi et al. (2020); Lopez et al. (2021) based on BLEU Papineni et al. (2002), METEOR Banerjee and Lavie (2005), ROUGE-L Lin (2004), or Q-BLEU Nema and Khapra (2018) are only designed to measure the quality of a single generated question against a set of references per sample. These metrics are not designed to compare a set of generated questions against the set of references (see §3). For proper evaluation, we propose a novel evaluation scheme which utilizes the Hungarian algorithm Kuhn (1955) to first optimally assign each generated question to a particular ground truth question before comparing the assigned pairs.

Besides evaluation, as another contribution, we provide a systematic empirical comparison (under a common framework) of multiple task-specific strategies based on generation granularity level (§4.1) and multiplicity (or mode) (§4.2). Given that the state of the art is generally achieved by pre-trained Transformer-based models Varanasi et al. (2020); Qi et al. (2020); Lopez et al. (2021), we investigate whether directly using pre-trained models off the shelf is sufficient or whether some of the prior strategies are still relevant for improving overall performance.

2 Task Definition

For our task, as input we have a paragraph P given as a sequence of sentences P=(s_{i})_{i\in[1,n]}, where each sentence s_{i} is a sequence of word (or sub-word) tokens s_{i}=(t_{j})_{j\in[1,m]}. The desired output is a set of questions Q=\{q_{l}\,|\,l\in[1,p]\}, where each question q_{l} is a sequence of word (or sub-word) tokens q_{l}=(t_{j})_{j\in[1,r]}. We assume that each question q_{l} has its answer information in the given paragraph P. We also assume that for any question q_{l}, P contains at least one sentence s_{i} that has sufficient information to answer q_{l}. Although a sensible question (for example, a question asking for relevant missing information) can have its answer absent from the paragraph, in this work we solely focus on generating questions that are answerable from the information in the input.

Throughout this paper, we assume that the provided set of ground truth questions contains the ideal questions to generate and also, the ideal number of questions to generate. The assumption may be violated in practice because, depending on the annotation process, the given ground truth set may not reflect the ideals; moreover, in certain applications the ideal number of questions to be generated may need to be defined by the user or some specific rules. However, given the difficulty of accounting for all these variables, we stick to the aforementioned assumption.

3 Multi-Question Evaluation

Typically, standard text generation evaluation n-gram match metrics like BLEU, ROUGE, and METEOR are used for QG. Nema and Khapra (2018) proposed interpolating n-gram match metrics with an answerability score, creating Q-BLEU/Q-METEOR/Q-ROUGE as more suitable metrics for evaluating questions. All these n-gram match metrics are still designed to evaluate a single generated sequence against a set of ground truth reference sequences per sample. However, we aim to compare a set of multiple generated questions against the set of references per sample (e.g., as shown in Figure 1). A simple approach to do this is to calculate an n-gram match score (BLEU, METEOR, etc.) individually for each prediction in the set and then average the results. However, such average-metrics have at least two major limitations:

Case 1: Number mismatch Generated Questions (Set 1): 1. who is the current president of the united states? Case 2: Reference miscoverage Generated Questions (Set 2): 1. who is the current president of the united states? 2. who is the president of the united states now? 3. who is the president of the united states currently? Reference Questions: 1. who is the current president of the united states? 2. when was the great wall of china built? 3. how does the business model of wikipedia work?

Figure 2: Example cases where a high average n-gram match score can be achieved despite the set of predictions being significantly different from the set of references. In case 1, there are too few predictions. In case 2, the generated predictions are all paraphrases and none are close to reference 2 or 3.

Prediction: what policy caused many families to lose their only child? → Reference: why did so many schools collapse during the earthquake? (METEOR: 9.33)
Prediction: what is the catch phrase for inadequately engineered schoolhouses? → Reference: what catch-phrase was invented as a result of collapsed schools? (METEOR: 18.19)
Prediction: how many inadequately engineered schoolrooms collapsed in the earthquake? → Reference: how many schoolrooms collapsed in the quake? (METEOR: 48.83)
Prediction: what is the age of the so-called illegal children? → Reference: what can illegal children be registered as in place of their dead siblings? (METEOR: 16.46)
Average METEOR: 23.20; Overall Match Score (S): 92.81; Multi-METEOR: 18.56

Figure 3: Example optimal assignment based on METEOR using the Hungarian algorithm when the set of predictions (generated questions) and the set of references are the same as those given in Figure 1.

1. Number Mismatch - The average-metrics can fail to account for the mismatch between the number of predictions and the number of ground truth questions. As a concrete example, consider case 1 in Figure 2. The single prediction matches exactly with the first reference. Thus, the average n-gram match score of all the predictions (in this case, just one) will be perfect. However, there is only one prediction compared to three references. There are two references that the model failed to predict but the average-metrics cannot quantify this.

2. Reference Miscoverage - The average-metrics fail to account for references that are not covered by the generated set. As a concrete example, consider case 2 in Figure 2. In this case there is no number mismatch. However, all the predictions are paraphrases of each other. As such, all of them match highly only with reference 1. Thus, the average n-gram match score of all predictions will be high. Nevertheless, the average-metrics remain oblivious to the failure of the model to generate anything close to reference 2 or reference 3.

Du and Cardie (2017) proposed two evaluation strategies (“conservative” and “liberal”) to evaluate the full system for paragraph-level multi-question generation, but their schemes partly depend on aligning predictions and references based on their shared question-worthy sentence (if any). However, such a method is only possible for sentence-level or type-level granularity, where each generated question is explicitly associated with a classified question-worthy sentence. In paragraph-level granularity, the generated questions are not explicitly associated with any sentence. We propose a more model-agnostic and general evaluation strategy.

3.1 Multi-metrics

We propose a new evaluation scheme for comparing two sets of sequences to address the aforementioned limitations of existing strategies. First, we note that one cause of reference miscoverage is that multiple repetitive predictions can match with the same reference. To address this issue, we add a constraint.

Constraint 1 - A prediction can be assigned to at most one reference, and a reference can be assigned to at most one prediction. Only assigned prediction-reference pairs are scored.

That is, we first construct a mapping between predictions and references by assigning one to another. Next, we compute the matching scores (based on some sequence comparison metric) only for the assigned pairs. Unlike before, we do not allow multiple predictions to match with the same reference or vice versa. Nevertheless, there can be multiple assignments that meet constraint 1. Thus, we set another constraint:

Constraint 2 - Assume that there are m predictions (p_{1},p_{2},\dots,p_{m}), n references (r_{1},r_{2},\dots,r_{n}), and a sequence-level evaluation metric M where M(p_{i},r_{j}) returns a score for how well prediction p_{i} matches reference r_{j}. We then choose a set A of k (k=\min(m,n)) assignment pairs between predictions and references such that the “overall match score” S of the assigned pairs is maximized. Formally, the overall match score is computed as: S=\sum_{(p_{i},r_{j})\in A}M(p_{i},r_{j}).

The evaluation metric M can be any n-gram matching metric (e.g., BLEU or METEOR) or even a newer metric like BERTScore Zhang* et al. (2020). We only consider metrics that return non-negative values. Combining constraint 1 and constraint 2, we get exactly the problem of optimal assignment. For example, we can treat predictions as workers, references as jobs, and the n-gram match score of a particular prediction-reference pair as the job performance of the worker (prediction). For such cases, the Hungarian algorithm Kuhn (1955), a combinatorial optimization algorithm, was designed to find an optimal one-to-one assignment that maximizes the overall job performance under the constraints. In our proposed evaluation scheme, we first use the Hungarian algorithm to find the k assignments between predictions and references and also compute the overall match score S. In Figure 3, we show an example of the assignments made by the Hungarian algorithm based on METEOR (as M) for the prediction set and the reference set given in Figure 1.

Nevertheless, having this overall match score S is not enough, because the metric can ignore some references/predictions if there are more than k references/predictions. As such, the problem of accounting for number mismatch may remain. To resolve this issue, we use the overall score S in an F-score-like framework.

First, we compute a precision-like metric as Pr=\frac{S}{m}. Pr reflects the overall degree of match that the predictions have with the references. The maximum value of S is c\cdot k, where c is the maximum value of the evaluation metric M. Now, if there are too many predictions compared to ground truths, then m will be \gg k. A high m reduces Pr and mitigates the problem of over-prediction (one aspect of number mismatch).

Second, we compute a recall-like metric as Re=\frac{S}{n}. Re reflects the overall degree of match that the references have with the predictions. That is, it reflects how well the references are covered by the predictions. If there are too few predictions compared to ground truths, then m will be \ll n; thus, n will be \gg k. This reduces Re and mitigates the problem of under-prediction (another aspect of number mismatch).

Finally, similar to F1, we compute an unweighted harmonic mean (Multi-M) of Pr and Re:

\textrm{Multi-}M=\frac{2\cdot Pr\cdot Re}{Pr+Re}

Here, in the name “Multi-M”, M denotes the evaluation metric as mentioned before. M can be any metric like BLEU or METEOR. Thus, we can have different corresponding variants of our set-level Multi-M metrics, such as Multi-BLEU or Multi-METEOR, depending on the metric M.
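The scheme can be implemented with an off-the-shelf assignment solver. Below is a minimal sketch (not the released code) that uses SciPy's linear_sum_assignment as the Hungarian-style solver; metric stands for any non-negative sequence-level scorer M (e.g., a sentence-level METEOR wrapper) and is assumed to be supplied by the caller.

import numpy as np
from scipy.optimize import linear_sum_assignment


def multi_metric(predictions, references, metric):
    """Compare a set of predicted questions against a set of references (Multi-M sketch)."""
    m, n = len(predictions), len(references)
    if m == 0 or n == 0:
        return 0.0

    # Pairwise score matrix: rows = predictions ("workers"), cols = references ("jobs").
    scores = np.array([[metric(p, r) for r in references] for p in predictions])

    # Hungarian-style optimal one-to-one assignment maximizing the overall match score S.
    rows, cols = linear_sum_assignment(scores, maximize=True)
    S = scores[rows, cols].sum()

    Pr = S / m  # precision-like: penalizes over-prediction
    Re = S / n  # recall-like: penalizes under-prediction and reference miscoverage
    return 2 * Pr * Re / (Pr + Re) if (Pr + Re) > 0 else 0.0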

4 Generation Frameworks

As mentioned earlier, the current state of the art in QG is achieved by pre-trained Seq2Seq models. Thus, we choose T5 Raffel et al. (2020), a pre-trained Seq2Seq model, as the main model for question generation. For our specific task, there are several distinct strategies for utilizing T5 that we contrast. At a high level, some of these strategies involve generating one question at a time or generating a concatenated series of questions; generating at the document level or generating at the sentence level and collating the generations for every sentence. Overall, we consider two main factors of variation: (1) generation granularity and (2) generation mode. We describe both below:

4.1 Generation Granularity

While we solely focus on paragraph-level generation, that does not mean we have to generate directly at the paragraph level. Instead, we can also generate questions by focusing on a lower granularity (for example, individual sentences in the paragraph) and then collate the results to obtain a paragraph-level output. Below, we discuss question generation at different levels of granularity.

1. Paragraph-level - In the paragraph-level granularity, we feed the whole paragraph to T5 and let it generate all the suitable questions from it directly. This model implicitly learns to find potential question-worthy areas (and candidate answer phrases) and to generate questions about them.

2. Sentence-level - In the sentence-level granularity, we follow the strategy proposed by Du and Cardie (2017). During training, we train two models. One model is a question-worthiness classifier which classifies whether a sentence in a given paragraph is question-worthy or not. We describe the question-worthiness classifier in appendix 6.1. The other model is a seq2seq question generator which is trained to do sentence-level question generation. During inference, in the first step, we classify each sentence in the given paragraph as either question-worthy or not. In the second step, for every sentence classified as question-worthy, we do sentence-level question generation using the trained question generator. In the final step, we collate all the generated questions for every question-worthy sentence to represent the overall paragraph-level output. However, we integrate two notable differences from Du and Cardie (2017): (1) we use T5 instead of an RNN-based question generator; (2) instead of using just a sentence as input for the sentence-level question generator, we use the whole paragraph (after flattening it into a single sequence of tokens) as input. However, we “highlight” the sentence in the paragraph to make the generator focus on that highlighted sentence for sentence-level generation (an input-construction sketch is shown at the end of this subsection). Highlighting is done by adding a special token <hl> at the beginning of the sentence, and a special token </hl> at its end. This is done because sometimes the surrounding context of a sentence is necessary to generate the ground truth questions Zhao et al. (2018).

Refer to caption
Figure 4: Framework for QG at a type-level granularity during inference. Step 1 shows question-worthiness classification followed by question-type prediction from the classified question-worthy sentences (highlighted in blue). Step 2 shows generating type-conditioned questions for each predicted question-type and its corresponding question-worthy sentence (highlighted in blue). Generation at a sentence-level granularity works similarly but excluding the question-type classifier (step 1) and type-conditioning (step 2).

3. Type-level - In type-level generation, during training, we train three separate models. As in sentence-level, we train a question-worthiness classifier and a T5-based question generator. However, in addition, we also train a question-type classifier which predicts all types (e.g., who, where, how, etc.) of questions that are worthy to be asked from the question-worthy sentences. We frame the question-type prediction task as a multi-label sentence classification problem. We consider question-type labels from the set {who, when, where, what, why, which, how, quantity, other}. We describe the question-type classifier in appendix 6.1. Different from sentence-level, we train the T5 sentence-level question generator to be conditioned on the question type. Essentially, we prepend a special token indicating the ground-truth question type to the input of the sentence-level question generator during training. As such, the generator learns to condition its generation on the question type specified by the prepended token. A similar sentence-highlighting strategy is used for the generator input as before.

During inference, first, we classify each sentence in a given paragraph to check whether it is question-worthy or not. Second, for every classified question-worthy sentence, we predict all the types of questions that are worthy to be asked from that sentence. Third, for every question-worthy sentence and for every question type appropriate to be asked from that sentence, we perform a type-conditioned sentence-level question generation for that question type (conditioned by prepending) and that sentence. Finally, all the generated questions for each sentence and its question types can be collated together to produce an output set of questions for the overall paragraph. Figure 4 shows the pipeline for type-level generation during inference.

Similar to us, Wu et al. (2020) used type-driven generation in an answer-agnostic setup, but they focus on sentence-level single-question generation. Moreover, they frame question-type prediction as a multi-class classification problem (question types for a sentence are considered mutually exclusive). While Wu et al. (2020) proposed selecting the top K most probable question types to generate multiple questions of multiple types for the same sentence, their model is still dependent on the hyperparameter K. Instead, in our approach, we simply let the classification model decide which and how many types to predict by framing question-type prediction as a multi-label classification problem.
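To make the input formats concrete, the sketch below illustrates how generator inputs could be constructed for the sentence-level and type-level granularities: the target sentence is wrapped in <hl> ... </hl> markers inside the flattened paragraph, and for type-level generation a question-type token is prepended. The function names and the exact form of the type token are illustrative assumptions, not the released implementation.

def highlight_sentence(sentences, idx):
    """Flatten a paragraph and highlight the idx-th sentence with <hl> markers."""
    marked = list(sentences)
    marked[idx] = "<hl> " + marked[idx] + " </hl>"
    return " ".join(marked)


def build_generator_input(sentences, idx, question_type=None):
    """Sentence-level input; optionally prepend a type token for type-level generation."""
    text = highlight_sentence(sentences, idx)
    if question_type is not None:  # e.g., "who", "quantity", ...
        text = "<" + question_type + "> " + text
    return text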

4.2 Generation Mode

Besides granularity, the generation multiplicity (or mode) is another factor to consider. We consider two different modes of generation, which we discuss below.

1. One2One Generation - In one2one generation, we train the question-generator model to maximize the likelihood of a single ground truth question for the sample input (of whatever granularity). To generate multiple questions in a one2one setting, we can use beam search or multiple runs of sampling-based decoding. We can also generate multiple questions by generating at a lower granularity and then collating the results.

2. One2Many Generation - In the one2many generation setting, we train the question-generation model to maximize the likelihood of the concatenation of all the questions that are asked (in the ground truth set) for the given sample input of a specific granularity. A one2many model can generate multiple questions even when using greedy decoding. A key benefit of the one2many mode is that we can let the model itself determine how many questions to generate. In contrast, in the one2one setting, we have to manually decide how many questions we want to generate (using beam search or sampling techniques). Another benefit of the one2many mode is that the generation of one question can be informed by previously generated questions for the given input.

A similar one2many mode of generation was proposed for keyphrase generation Yuan et al. (2020). For question generation in particular, Lopez et al. (2021) used the one2many mode of generation with pre-trained Transformer models. Different from prior work, we explore combinations of one2many generation with sentence-level and type-level granularity, considering that multiple questions can be asked even for a single sentence or a single question type.

For training one2many models, we simply prepare the training ground truths by concatenating all the questions (for the given granularity of input) while using a special token <sep> as a delimiter. The concatenation of questions is done in the order (first to last) of the occurrence of their corresponding question-worthy sentences.
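As an illustration, the sketch below builds a one2many training target by concatenating the ground-truth questions (ordered by the position of their question-worthy sentences) with the <sep> delimiter, and splits a decoded output back into individual questions. The function names are illustrative, not the released code.

def build_one2many_target(questions_by_sentence):
    """questions_by_sentence: list (ordered by sentence position) of lists of questions."""
    ordered = [q for sent_questions in questions_by_sentence for q in sent_questions]
    return " <sep> ".join(ordered)


def split_one2many_output(decoded):
    """Recover the individual questions from a decoded one2many sequence."""
    return [q.strip() for q in decoded.split("<sep>") if q.strip()]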

Overall, there are six ways of combining the above factors (granularity and mode), and thus a total of six possible strategies. We compare all six.

5 Ranking and Selection

When using the one2one generation mode, we also have to decide on a method to generate multiple questions, particularly when generating directly at the paragraph-level granularity. Thus, for the one2one mode, we employ an overgenerate-and-rank strategy. First, we overgenerate multiple questions (the overgeneration method is discussed in section 5.1). Second, we rank the generations using some ranking method. Third, we select the k highest-ranked generations. For ranking and selection, we consider the following methods:

1. Rand@5: In this method, we randomly select 5 generations. The average number of questions per paragraph is also 5 in the dataset that we use.

2. Top@1: In this method, we only generate and select a single greedy-decoded question. Thus, in this case, there will be only one generated question per input at a specific granularity level. For example, in the case of paragraph-level granularity, there will be only one generation overall, and in the case of sentence-level granularity, there can be as many generations as there are question-worthy sentences.

3. Rank@5: In this method, we rank the questions based on their answerability probability and then select the top 5 questions. We use a neural question-answering model to predict the answerability probability of each question. In particular, we use an ELECTRA-large model Clark et al. (2020) trained on SQuAD2.0 Rajpurkar et al. (2018) not only to extract an answer span but also to decide whether the question is answerable from the given context (https://huggingface.co/ahotrod/electra_large_discriminator_squad2_512). We also filter out any question with an answerability probability below 0.5.
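A possible realization of Rank@5 is sketched below using the Hugging Face question-answering pipeline with the checkpoint mentioned above; here the pipeline's span score is treated as a proxy for the answerability probability, which is a simplification of the setup described in this paper.

from transformers import pipeline

# Hedged sketch: a SQuAD2.0-style QA model served through the question-answering pipeline.
qa = pipeline("question-answering",
              model="ahotrod/electra_large_discriminator_squad2_512")


def rank_at_k(questions, paragraph, k=5, threshold=0.5):
    """Rank candidate questions by a QA-based score and keep the top k above a threshold."""
    scored = []
    for q in questions:
        out = qa(question=q, context=paragraph, handle_impossible_answer=True)
        scored.append((out["score"], q))  # span score used as an answerability proxy
    scored.sort(reverse=True)
    return [q for score, q in scored[:k] if score >= threshold]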

5.1 Overgeneration

For overgenerating multiple (K) questions in one2one methods, we generate one question using greedy decoding for the given granularity level. The rest of the K generations at the given granularity level are decoded using multiple runs of Nucleus Sampling Holtzman et al. (2020) with a top-p value of 0.9. We set K=20 for paragraph-level granularity, K=10 for sentence-level granularity, and K=5 for type-level granularity. The K-values were chosen arbitrarily for the most part. We decrease the K-value for lower granularities because the overall number of generations can be many times higher than K, depending on how many question-worthy sentences or question types there are.
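The overgeneration step can be reproduced approximately with standard decoding utilities, as in the sketch below. It assumes a fine-tuned Hugging Face T5 model and tokenizer and K > 1; it is a minimal illustration, not the exact released code.

import torch


def overgenerate(model, tokenizer, input_text, k=20, max_length=64):
    """One greedy question plus (k - 1) nucleus-sampled questions (top-p = 0.9)."""
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        greedy = model.generate(**inputs, max_length=max_length)
        sampled = model.generate(**inputs, do_sample=True, top_p=0.9, top_k=0,
                                 num_return_sequences=k - 1, max_length=max_length)
    candidates = tokenizer.batch_decode(greedy, skip_special_tokens=True)
    candidates += tokenizer.batch_decode(sampled, skip_special_tokens=True)
    return candidates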

6 Experiments

Model Prec. Rec. F1 Acc.
Du and Cardie (2017) 73.15 89.29 80.42 72.52
ELECTRA CL 76.19 74.02 75.09 70.11
ELECTRA HC 75.88 88.24 81.59 75.77
Table 1: Question-worthiness classification results
Type Prec. Rec. F1
Who 39.64 88.75 54.80
When 31.85 78.63 45.34
Where 21.62 79.05 33.96
What 83.59 49.11 61.87
Quantity 56.72 88.65 69.18
How 10.97 65.53 18.79
Why 09.12 64.84 15.99
Which 20.86 64.78 31.55
Others 02.35 32.26 04.37
Table 2: Question-type classification results

In this section, we discuss the details of our experimental models, datasets, evaluation, and results. We use an ELECTRA-large-based classifier Clark et al. (2020) for question-worthiness classification and question-type classification. More details on the classifier architectures are in section 6.1. Hyperparameter details are in sections 6.3 and 6.4.

6.1 Classifier Architecture

Question-type Classification: For Question-Type classification, we use ELECTRA-large as a multi-label sentence classifier. We transform the final representation of the CLS token using two layers with a GELU activation Hendrycks and Gimpel (2016) in between for classification.

Question-worthiness Classification: For question-worthiness classification, we try two distinct approaches: (1) ELECTRA CL and (2) ELECTRA HC. We find that ELECTRA HC outperforms ELECTRA CL, so only ELECTRA HC is used in the main experiments (Tables 3 and 4). Table 1 also shows the results of ELECTRA HC. We describe both methods below:

1. ELECTRA CL: In this approach we simply use ELECTRA-large as a sentence-level binary classifier similar to how we use it as a multi-label classifier for question-type classification.

2. ELECTRA HC: ELECTRA HC takes a hierarchical approach to classification. It uses a framework similar to that of Du and Cardie (2017). First, we encode each sentence in a paragraph into a single vector (sentence vector) using a sentence encoder (as a result, we have a sequence of sentence vectors representing the paragraph). Second, the sequence of vectors is contextualized by a BiLSTM Hochreiter and Schmidhuber (1997); Graves and Schmidhuber (2005). Third, we apply a binary classifier (two layers with a GELU activation in between) to each of the sentence vectors in the sequence. In this strategy, the classification of the question-worthiness of each sentence is informed by the context of the surrounding sentences in the paragraph. Different from Du and Cardie (2017), we use ELECTRA-large, a more modern model, for sentence encoding. We treat the final representation of the CLS token of ELECTRA as the encoded sentence vector for each sentence.
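A minimal PyTorch sketch of the ELECTRA HC architecture follows; the module and variable names are illustrative, while the structure (ELECTRA CLS sentence vectors, a BiLSTM over the sentence sequence, and a two-layer GELU head) mirrors the description above.

import torch
import torch.nn as nn
from transformers import AutoModel


class ElectraHC(nn.Module):
    """Hierarchical question-worthiness classifier: ELECTRA sentence vectors -> BiLSTM -> head."""

    def __init__(self, encoder_name="google/electra-large-discriminator", lstm_hidden=300):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # Total BiLSTM hidden size (forward + backward combined) is lstm_hidden.
        self.bilstm = nn.LSTM(hidden, lstm_hidden // 2, batch_first=True, bidirectional=True)
        self.head = nn.Sequential(nn.Linear(lstm_hidden, lstm_hidden), nn.GELU(),
                                  nn.Linear(lstm_hidden, 1))

    def forward(self, input_ids, attention_mask):
        # input_ids: (num_sentences, seq_len), all sentences of one paragraph.
        cls = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state[:, 0]
        contextual, _ = self.bilstm(cls.unsqueeze(0))  # (1, num_sentences, lstm_hidden)
        logits = self.head(contextual)                 # (1, num_sentences, 1)
        return torch.sigmoid(logits).squeeze(0).squeeze(-1)  # per-sentence worthiness probability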

6.2 Question-type Determination

For the most part, we determine the question type of a question based on whether the question-type word is present in the question. For example, if “who” is present in a question, its question type is determined to be “who”. Sometimes the question-type word occurs in the middle of the question, so we do not restrict ourselves to checking only the first question word (several examples of such cases can be observed in the reference questions in Table 5). This rule is not foolproof, but it generally works for the dataset that we use. There are, however, multiple exceptions to the general rule mentioned above. If “whose” or “whom” is present in the question, the question type is still determined as “who”. If “how much” or “how many” is present in the question, then the question type is determined to be “quantity” instead of “how”. We do this because general “how” questions are of a different breed (asking for a process) than questions asking “how much”/“how many” (generally asking for some quantity or intensity). The presence of the words “quantity” or “other” in a question does not determine the question type to be “quantity” or “other”. Any question whose type remains undetermined by the above rules is assigned the type “other”. Questions with answers “yes” or “no” are also assigned the type “other”. We do not keep a separate type for boolean questions (yes/no questions) because there are very few instances of this type in the dataset that we use.
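The rules above can be summarized in a small heuristic function, sketched below; the precedence order when several type words co-occur (first matching token wins) is an assumption of this sketch, as it is not fully specified in the text.

def determine_question_type(question, answer=None):
    """Heuristic question-type determination (a sketch of the rules described above)."""
    q = question.lower()
    tokens = q.replace("?", " ").split()
    if answer is not None and answer.strip().lower() in {"yes", "no"}:
        return "other"  # boolean questions fall back to "other"
    if "how much" in q or "how many" in q:
        return "quantity"
    keyword_to_type = {"who": "who", "whose": "who", "whom": "who",
                       "when": "when", "where": "where", "what": "what",
                       "why": "why", "which": "which", "how": "how"}
    for token in tokens:  # the type word may occur in the middle of the question
        if token in keyword_to_type:
            return keyword_to_type[token]
    return "other"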

6.3 Classifier Hyperparameters

All ELECTRA-based classifiers use two layers on top of the final representation of the CLS token, with a GELU activation function Hendrycks and Gimpel (2016) in between. The first layer has as many neurons as the ELECTRA hidden state (except for ELECTRA HC, where the number of neurons in this layer is the same as the BiLSTM hidden size). The last layer has a sigmoid activation function. In the case of question-worthiness classification (binary classification), the last layer has a single neuron. In the case of question-type classification, the last layer has as many neurons as there are question-type labels. The total hidden size of the BiLSTM hidden state (forward and backward combined), as used in ELECTRA HC, is 300. Some of the general hyperparameters used for all classifiers are a weight decay of 0.01, a maximum gradient norm clipping of 5, a maximum sequence length of 512, a batch size of 64, a maximum number of epochs of 30, and an early-stop patience of 2 (training is terminated when the loss does not decrease for two consecutive epochs). We use RAdam Liu et al. (2020) as the optimizer. For all classifiers, the learning rate is tuned using grid search among the options {0.001, 0.0002, 0.0001, 0.00002}; 0.00002 was chosen for each. Label weights were used for question-type classification. The label weight for a particular label was determined as the total number of negative instances divided by the total number of positive instances of that label. Model selection and hyperparameter selection are done based on the best validation loss.

6.4 Generator Hyperparameters

Some of the shared hyperparameters of the T5-based question-generator models are a batch size of 128, a weight decay of 0, a maximum number of epochs of 20, a maximum sequence length of 512, a maximum gradient norm clipping of 5, and an early-stop patience of 2 (training is terminated when the loss does not decrease for two consecutive epochs). We use SM3 Anil et al. (2019) as the optimizer. Greedy decoding is used for generation in one2many mode. One2one modes use the overgenerate-and-rank methods discussed earlier. The learning rate is tuned using grid search among the options {1.0, 0.1, 0.01, 0.001} for each generator separately. We limit the maximum epochs to 10 for each hyperparameter-tuning trial. A learning rate of 0.1 was chosen for paragraph-level one2many generation and a learning rate of 0.01 for the rest. Learning rate (lr) warmup is applied as follows (based on the recommended procedure for SM3, https://github.com/google-research/google-research/tree/master/sm3):

lr_{s}=lr_{0}\cdot\min(1,(s/w)^{2})   (1)

lr_{s} indicates the learning rate at step s. The initial learning rate (lr_{0}) is whatever is chosen after hyperparameter tuning. s indicates the current update step number. w indicates the total number of warmup steps (set to 2000). Model selection and hyperparameter selection are done based on the best validation loss.
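For reference, the warmup schedule in Equation (1) corresponds to the following simple function (a sketch; the function name is illustrative):

def warmup_lr(step, base_lr, warmup_steps=2000):
    """Quadratic warmup: lr_s = lr_0 * min(1, (s / w)^2)."""
    return base_lr * min(1.0, (step / warmup_steps) ** 2)


# e.g., for paragraph-level one2many generation (base_lr = 0.1):
# warmup_lr(500, 0.1) == 0.1 * (500 / 2000) ** 2 == 0.00625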

6.5 Dataset

We perform all our experiments on the processed SQuAD1.1 Rajpurkar et al. (2016) split as provided by Du et al. (2017) (https://github.com/xinyadu/nqg/tree/master/data/processed) for question generation. For question-worthiness classification, all the sentences from all the paragraphs in the dataset are input samples. A sample input sentence is considered question-worthy if that sentence has a corresponding ground-truth question in the dataset. For question-type classification, only question-worthy sentences are sample inputs. For each sample, its question-type labels are the question types of all the questions associated with that sample sentence. The question type of a question is determined based on some heuristic rules that are detailed in appendix 6.2. We maintain the original train-development-test split (as provided by Du et al. (2017)) for all experiments.
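The labeling rule described above can be expressed as a small preprocessing step. The sketch below assumes a simplified, hypothetical data layout (paragraph sentences plus a mapping from sentence index to its ground-truth questions) rather than the actual format of the Du et al. (2017) release; type_fn stands for the heuristic type rules of appendix 6.2.

def build_classification_samples(paragraph_sentences, questions_by_sentence_idx, type_fn):
    """Hypothetical layout: dict {sentence_index: [questions]}; type_fn maps a question to its type."""
    worthiness, type_samples = [], []
    for i, sentence in enumerate(paragraph_sentences):
        questions = questions_by_sentence_idx.get(i, [])
        worthiness.append((sentence, len(questions) > 0))  # question-worthy iff it has a question
        if questions:  # only question-worthy sentences get question-type labels
            type_samples.append((sentence, sorted({type_fn(q) for q in questions})))
    return worthiness, type_samples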

Granularity Generation Mode Multi-BLEU4 Multi-MTR Multi-R-L Multi-qBLEU1 BLEU4 MTR R-L qBLEU1
Du and Cardie (2017) 12.28 16.62 39.75
Lopez et al. (2021) 8.26 21.2 44.38
Paragraph one2one (Rand@5) 5.14 15.78 28.79 28.78 7.78 20.52 37.67 43.50
Paragraph one2one (Top@1) 3.73 9.10 16.94 16.58 12.68 23.58 44.12 49.27
Paragraph one2one (Rank@5) 6.78 17.07 30.17 30.38 11.21 23.71 41.69 47.98
Paragraph one2many 6.25 15.55 28.97 28.48 11.15 22.12 42.41 47.58
Sentence one2one (Rand@5) 5.37 15.85 28.97 29.24 7.85 20.48 37.86 43.88
Sentence one2one (Top@1) 6.97 16.56 30.02 29.96 11.91 23.20 42.81 48.61
Sentence one2one (Rank@5) 6.77 16.70 29.50 29.76 11.22 23.72 41.51 47.86
Sentence one2many 7.80 17.93 32.08 32.08 11.64 22.65 41.96 47.60
Type one2one (Rand@5) 4.86 15.34 28.07 28.15 7.15 19.82 36.37 42.36
Type one2one (Top@1) 7.01 16.03 29.06 28.90 12.30 23.27 42.78 48.55
Type one2one (Rank@5) 6.43 16.42 29.00 29.13 10.45 23.08 40.35 46.57
Type one2many 7.57 16.79 30.05 30.11 12.67 23.4 42.65 48.54
Table 3: Multi-metrics and average-metrics performance on paragraph-level multi-question generation on SQuAD1.1 (split by Du et al. (2017)) using different granularity levels and generation modes.
Granularity Generation Mode Self-BLEU2 car. diff
Paragraph one2one (Rand@5) 66.18 -0.11
Paragraph one2one (Top@1) 0 3.89
Paragraph one2one (Rank@5) 49.25 -0.11
Paragraph one2many 38.82 1.45
Sentence one2one (Rand@5) 72.39 -0.10
Sentence one2one (Top@1) 17.84 1.78
Sentence one2one (Rank@5) 52.72 -0.10
Sentence one2many 34.91 0.39
Type one2one (Rand@5) 71.32 -0.10
Type one2one (Top@1) 16.43 1.99
Type one2one (Rank@5) 50.63 -0.09
Type one2many 17.04 1.76
Table 4: Self-BLEU and cardinality difference (car. diff) for QG on Du et al. (2017) split using different granularity-levels and generation modes.
Example # 1
Generated Questions:
1. what were the early courses in in the college of science?
2. when was the college of engineering established?
Reference Questions:
1. how many bs level degrees are offered in the college of engineering at notre dame?
2. in what year was the college of engineering at notre dame formed?
3. before the creation of the college of engineering similar studies were carried out at which notre dame college?
4. how many departments are within the stinson-remick hall of engineering?
5. the college of science began to offer civil engineering courses beginning at what time at notre dame?
BLEU4: 40.34; Multi-BLEU4: 13.26; METEOR: 22.06; Multi-METEOR: 11.81;
ROUGE-L: 42.38; Multi-ROUGE-L: 22.91; Q-BLEU1: 40.26; Multi-Q-BLEU1: 17.01
Example # 2
Generated Questions:
1. when was theodore m. hesburgh library completed?
2. what is the library system of the university divided between?
3. what is the name of the library that houses the main collection of books?
4. what is the word of life mural known as?
5. what does the word of life mural appear to make?
6. who designed the word of life mural?
Reference Questions:
1. how many stories tall is the main library at notre dame?
2. what is the name of the main library at notre dame?
3. in what year was the theodore m. hesburgh library at notre dame finished?
4. which artist created the mural on the theodore m. hesburgh library?
5. what is a common name to reference the mural created by millard sheets at notre dame?
BLEU4: 10.65; Multi-BLEU4: 11.38; METEOR: 17.25; Multi-METEOR: 15.04;
ROUGE-L: 40.15; Multi-ROUGE-L: 33.6; Q-BLEU1: 36.44; Multi-Q-BLEU1: 27.72
Example # 3
Generated Questions:
1. what do most people with dogs describe their pet as?
2. what does a study of conversations in dog-human families show?
3. what is the popular reconceptualization of the dog-human family as a pack?
4. what is a dominance model of dog-human relationships promoted by some dog trainers?
Reference Questions:
1. how do most people describe the relationship with their dogs?
2. what television show uses a dominance model of dog and human relationships?
3. most people today describe their dogs as what?
4. what tv show promotes a dominance model for the relationships people have with their dogs?
BLEU4: 05.56; Multi-BLEU4: 05.56; METEOR: 24.28; Multi-METEOR: 21.21;
ROUGE-L: 37.13; Multi-ROUGE-L: 32.43; Q-BLEU1: 41.97; Multi-Q-BLEU1: 37.10
Table 5: SQuAD1.1 examples with generations from T5 sentence-level one2many. A comparison between different evaluation metrics is presented.
Example # 1
Generated Questions:
1. what is the name of the oldest building on campus?
Reference Questions:
1. where is the headquarters of the congregation of the holy cross?
2. what is the primary seminary of the congregation of the holy cross?
3. what is the oldest structure at notre dame?
4. what individuals live at fatima house at notre dame?
5. which prize did frederick buechner create?
BLEU4: 0; Multi-BLEU4: 0; METEOR: 17.58; Multi-METEOR: 05.86;
ROUGE-L: 50; Multi-ROUGE-L: 15.12; Q-BLEU1: 43.00; Multi-Q-BLEU1: 12.07
Example # 2
Generated Questions:
1. what is the name of the college of engineering?
Reference Questions:
1. how many bs level degrees are offered in the college of engineering at notre dame?
2. in what year was the college of engineering at notre dame formed?
3. before the creation of the college of engineering similar studies were carried out at which notre dame college?
4. how many departments are within the stinson-remick hall of engineering?
5. the college of science began to offer civil engineering courses beginning at what time at notre dame?
BLEU4: 43.44; Multi-BLEU4: 07.54; METEOR: 24.33; Multi-METEOR: 08.11;
ROUGE-L: 49.23; Multi-ROUGE-L: 15.47; Q-BLEU1: 47.69; Multi-Q-BLEU1: 12.52
Table 6: SQuAD1.1 examples with generations from T5 paragraph-level one2one (Top@1). A comparison between different evaluation metrics is presented.

6.6 Evaluation

We use standard precision, recall, and F1 measures for the classification tasks. For question generation, we use different instances of our multi-metrics (discussed in §3.1): Multi-BLEU4, Multi-METEOR (Multi-MTR), Multi-ROUGE-L (Multi-R-L), and Multi-qBLEU1. We also report the average BLEU4, METEOR (MTR), ROUGE-L (R-L), and qBLEU1 metrics. In Table 4, we use Self-BLEU2 Zhu et al. (2018) on the prediction set to show prediction diversity (lower Self-BLEU means higher diversity). We also show the difference between the number of predictions and the number of ground truth questions. We refer to this metric as cardinality difference (car. diff); it indicates the degree of number mismatch. We use nlg-eval Sharma et al. (2017) for the n-gram match scores. For Q-BLEU1, we use the same parameters as recommended for Q-BLEU1 on SQuAD by the original paper (see Table 5 in Nema and Khapra (2018)).

6.7 Results

Table 1 compares the performance of ELECTRA HC and ELECTRA CL. As can be seen, ELECTRA HC outperforms ELECTRA CL by a significant margin and obtains the overall best performance on recall, F1, and accuracy. In Table 2, we show the performance for each question-type label. While the type-classification performance is low, it is on par with the results obtained by prior methods Wu et al. (2020) (although the results are not strictly comparable due to differences in question types).

In Table 3, we show the main results of question generation. Among the one2one ranking methods, Top@1 seems to work best at sentence-level or type-level granularities, but it causes a higher-magnitude cardinality difference (as shown in Table 4). The multi-metrics performance of paragraph-level Top@1 is poor because it can predict only one question for the whole paragraph, causing a high number mismatch. Rank@5 works better than Rand@5 at any granularity level, which makes sense given that Rank@5 takes answerability into account. Among generation modes, one2many generally works better than one2one ranking when using sentence-level or type-level granularity. Among granularity levels, sentence-level generally performs best on the multi-metrics. Type-level generation does not offer much further benefit. Type-level generation also causes a higher-magnitude cardinality difference (Table 4), although it achieves lower Self-BLEU (higher diversity) (Table 4). However, in some cases, lower Self-BLEU can result from under-prediction (having fewer predictions reduces the chances of n-gram overlaps among them); thus, lower Self-BLEU is not always good.

7 On the Efficacy of Multi-metrics

We motivate the efficacy of multi-metrics on two different grounds.

First, we emphasize the theoretically established virtues of multi-metrics in terms of their ability to account for reference miscoverage and number mismatch (§3).

Second, we show a concrete instance where the average metrics are high despite a critical failure of the model, whereas the multi-metrics are low, as they should be. This can be observed in Table 3 in the case of paragraph-level one2one generation with Top@1 ranking. Here, the model can predict at most one question, whereas there are five references on average. Thus, we see a high cardinality difference in Table 4 for this case. Despite this, the average n-gram metrics assign very high scores to this model; they are completely insensitive to this model failure. The multi-metrics, on the other hand, take this failure into account. Thus, we observe that the multi-metrics assign it the lowest scores.

We also present some concrete examples (of generated sets of questions and ground truth sets of questions, or references) along with their corresponding metrics (both average-based n-gram match metrics and multi-metrics) in Tables 5 and 6. In Table 5, example #1, we find a degree of number mismatch: there are only two predictions but five references. As expected, we find quite a bit of difference between the multi-metrics and the average metrics here, because the multi-metrics are penalized heavily for the number mismatch while the average metrics are not penalized as much. In examples #2 and #3 from Table 5, the multi-metrics are much closer to the average metrics because the number of predictions is close to the number of references and, furthermore, there are no issues with paraphrases among the predictions. On the other hand, in the examples in Table 6, we again observe an amplified difference between the multi-metrics and the average metrics, given the high degree of number mismatch that the average metrics fail to take into account.

8 Related Work

Assignment-based Evaluations - Rus and Lintean (2012) proposed an optimal-matching-based method for embedding-based text similarity, but not in the context of comparing sets of sequences in an F1-like framework. Similarly, several text evaluation approaches Kusner et al. (2015); Chow et al. (2019); Clark et al. (2019); Zhao et al. (2019) used earth mover's distance Rubner et al. (1998), whereas Zhang* et al. (2020) used greedy matching. Similar to us, Schlichtkrull and Cheng (2020) proposed an evaluation for QG, but using greedy matching (instead of optimal assignment), which allows multiple predictions to match with a single reference and vice versa.

Question Generation - Du et al. (2017) presented one of the earliest works on answer-agnostic neural QG. There were also several early answer-aware QG approaches Yuan et al. (2017); Zhou et al. (2017). Several works took joint-training or multi-task approaches to train both question answering and QG Duan et al. (2017); Song et al. (2017); Tang et al. (2017); Wang et al. (2017). Du and Cardie (2017) proposed QG in the sentence-level granularity. Subramanian et al. (2018) generated questions based on detected keyphrases. Similarly, Wang et al. (2019) used a multi-agent communication framework to first identify question-worthy phrases and then generate questions with their assistance. Zhao et al. (2018) used maxout pointer and gated self-attention to exploit paragraph-level information for QG. Scialom et al. (2019) used Transformer-based approaches for answer-agnostic QG. Multiple works Fan et al. (2018); Sun et al. (2018); Hu et al. (2018); Zhou et al. (2019); Wu et al. (2020) utilized question-word information or a question-type-driven framework for different variants of QG. Newer approaches Chan and Fan (2019a, b); Dong et al. (2019); Varanasi et al. (2020); Qi et al. (2020); Lopez et al. (2021) used pre-trained Transformer models.

9 Conclusion

We proposed a new evaluation method (multi-metrics) to evaluate multi-question generation. We motivated the evaluation theoretically and also empirically in terms of the contrast discussed in §7. Using both new and old evaluations, we also empirically compared combinations of various strategies for paragraph-level multi-question generation under a common framework. Our results show that using factorized sentence-level generation in one2many mode is better than generating directly from the paragraph level, even when using powerful pre-trained models.

References