
On the Evaluation of Answer-Agnostic Paragraph-level Multi-Question Generation

Jishnu Ray Chowdhury
University of Illinois at Chicago
[email protected]
Debanjan Mahata
Moody’s Analytics
[email protected]
Cornelia Caragea
University of Illinois at Chicago
[email protected]
Abstract

We study the task of predicting a set of salient questions from a given paragraph without any prior knowledge of the precise answer. We make two main contributions. First, we propose a new method to evaluate a set of predicted questions against the set of references by using the Hungarian algorithm to assign predicted questions to references before scoring the assigned pairs. We show that our proposed evaluation strategy has better theoretical and practical properties compared to prior methods because it can properly account for the coverage of references. Second, we compare different strategies to utilize a pre-trained seq2seq model to generate and select a set of questions related to a given paragraph. The code is available at https://github.com/JRC1995/QuestionGenerationPub.

1 Introduction

Question generation (QG) is the task of automatically generating questions that are pertinent to a given text. QG has multiple applications. For example, it can be used to construct educational materials by generating practice questions to test reading comprehension Heilman and Smith (2010), help dialogue systems ask better questions Wang et al. (2018), or drive conversations around news articles Laban et al. (2020). QG has been also used for creating synthetic data to train competitive question answering models Shakeri et al. (2020); Puri et al. (2020), for pre-training models to improve downstream performance Narayan et al. (2020), and for evaluating the factuality of automated abstractive summarizations Wang et al. (2020). For any large-scale application on emerging data, it would be infeasible to manually construct questions; thus, the need for automating QG arises. There are, however, multiple variants of QG. In this paper, we focus on a particular variant of the task (see Figure 1) with the following characteristics:

Input: the central government estimates that over 7,000 inadequately engineered schoolrooms collapsed in the earthquake chinese citizens have since invented a catch phrase tofu-dregs schoolhouses….. Generated Questions: 1. what policy caused many families to lose their only child? 2. what is the catch phrase for inadequately engineered schoolhouses? 3. how many inadequately engineered schoolrooms collapsed in the earthquake? 4. what is the age of the so-called illegal children? Reference Questions: 1. how many schoolrooms collapsed in the quake? 2. what catch-phrase was invented as a result of collapsed schools? 3. why did so many schools collapse during the earthquake? 4. what are the estimations of how many schoolrooms collapsed? 5. what has the citizenry started calling these type of schools? 6. what can illegal children be registered as in place of their dead siblings?

Figure 1: A sample of a passage from SQuAD1.1 with a set of generated questions and a set of references

1. Answer-Agnostic - Given a document, our models are trained to generate questions related to the document without any prior knowledge about the answer (answer-agnostic QG). The model has to learn to (implicitly or explicitly) seek out question-worthy sentences and phrases to generate their corresponding questions. Answer-agnostic QG is, in general, more challenging than answer-aware question generation (in which the answer is usually highlighted in some fashion). Practical applications of answer-aware QG can be partly limited because they would require some external mechanism (for example, a named entity tagger Wang et al. (2020) or a keyphrase generator Willis et al. (2019)) or manual annotation to choose specific answers to generate questions from. An external mechanism may only be able to pick very specific types of answers (like keyphrases or named entities) and thus restrict the range of questions that can be generated.

2. Paragraph-Level - Instead of generating questions at a sentence-level, we focus on generating the most important questions for a whole paragraph (or a document). The model has to learn to decide which areas of the paragraph it should focus on while generating the most salient questions.

3. Multiple Questions - We focus on generating and evaluating a set of multiple questions that can be asked from the given paragraph.

Overall, similar to Du and Cardie (2017), our aim with this task is to generate an appropriate number of diverse questions around the most important (“question-worthy”) areas in the paragraph. In Figure 1, we show an example that reflects the above characteristics of the task. As can be seen in the figure, we have a paragraph input, a set of ground references, and a set of generated questions, ideally corresponding to the salient question-worthy areas of the paragraph. However, although we aim to generate a set of questions comparable to a ground truth set, standard evaluations of QG Du et al. (2017); Qi et al. (2020); Lopez et al. (2021) based on BLEU Papineni et al. (2002), METEOR Banerjee and Lavie (2005), ROUGE-L Lin (2004), or Q-BLEU Nema and Khapra (2018) are only designed to measure the quality of a single generated question against a set of references per sample. These metrics are not designed to compare a set of generated questions against the set of references (see §3). For proper evaluation, we propose a novel evaluation scheme which utilizes the Hungarian algorithm Kuhn (1955) to first optimally assign each generated question to a particular ground truth question before comparing the assigned pairs.

Besides evaluation, as another contribution, we provide a systematic empirical comparison (under a common framework) of multiple task-specific strategies based on generation granularity level (§4.1) and multiplicity (or mode) (§4.2). Given that the state of the art is generally achieved by pre-trained Transformer-based models Varanasi et al. (2020); Qi et al. (2020); Lopez et al. (2021), we investigate whether directly using pre-trained models off the shelf is sufficient or whether some of the prior strategies are still relevant for improving overall performance.

2 Task Definition

For our task, as input we have a paragraph P given as a sequence of sentences P=(s_{i})_{i\in[1,n]}, where each sentence s_{i} is a sequence of word (or sub-word) tokens s_{i}=(t_{j})_{j\in[1,m]}. The desired output is a set of questions Q=\{q_{l}\,|\,l\in[1,p]\}, where each question q_{l} is a sequence of word (or sub-word) tokens q_{l}=(t_{j})_{j\in[1,r]}. We assume that each question q_{l} has its answer information in the given paragraph P. We also assume that for any question q_{l}, P contains at least one sentence s_{i} that has sufficient information to answer q_{l}. Although a sensible question (for example, a question asking for relevant missing information) can have its answer absent from the paragraph, in this work we solely focus on generating questions that are answerable from the information in the input.

Throughout this paper, we assume that the provided set of ground truth questions contains the ideal questions to generate and also, the ideal number of questions to generate. The assumption may be violated in practice because, depending on the annotation process, the given ground truth set may not reflect the ideals; moreover, in certain applications the ideal number of questions to be generated may need to be defined by the user or some specific rules. However, given the difficulty of accounting for all these variables, we stick to the aforementioned assumption.

3 Multi-Question Evaluation

Typically, standard text generation evaluation n-gram match metrics like BLEU, ROUGE, and METEOR are used for QG. Nema and Khapra (2018) proposed interpolating n-gram match metrics with an answerability score, creating Q-BLEU/Q-METEOR/Q-ROUGE as more suitable metrics for evaluating questions. All these n-gram match metrics are still designed to evaluate a single generated sequence against a set of ground truth reference sequences per sample. However, we aim to compare a set of multiple generated questions against the set of references per sample (e.g., as shown in Figure 1). A simple approach to do this is to calculate an n-gram match score (BLEU, METEOR, etc.) individually for each prediction in the set and then average the results. However, such average-metrics have at least two major limitations:

Case 1: Number mismatch Generated Questions (Set 1): 1. who is the current president of the united states? Case 2: Reference miscoverage Generated Questions (Set 2): 1. who is the current president of the united states? 2. who is the president of the united states now? 3. who is the president of the united states currently? Reference Questions: 1. who is the current president of the united states? 2. when was the great wall of china built? 3. how does the business model of wikipedia work?

Figure 2: Example cases where a high average n-gram match score can be achieved despite the set of predictions being significantly different from the set of references. In case 1, there are too few predictions. In case 2, the generated predictions are all paraphrases and none are close to reference 2 or 3.

Prediction: what policy caused many families to lose their only child? → Reference: why did so many schools collapse during the earthquake? (METEOR: 9.33)
Prediction: what is the catch phrase for inadequately engineered schoolhouses? → Reference: what catch-phrase was invented as a result of collapsed schools? (METEOR: 18.19)
Prediction: how many inadequately engineered schoolrooms collapsed in the earthquake? → Reference: how many schoolrooms collapsed in the quake? (METEOR: 48.83)
Prediction: what is the age of the so-called illegal children? → Reference: what can illegal children be registered as in place of their dead siblings? (METEOR: 16.46)
Average METEOR: 23.20; Overall Match Score (S): 92.81; Multi-METEOR: 18.56

Figure 3: Example optimal assignment based on METEOR using the Hungarian algorithm when the set of predictions (generated questions) and the set of references are the same as those given in Figure 1.

1. Number Mismatch - The average-metrics can fail to account for the mismatch between the number of predictions and the number of ground truth questions. As a concrete example, consider case 1 in Figure 2. The single prediction matches exactly with the first reference. Thus, the average n-gram match score of all the predictions (in this case, just one) will be perfect. However, there is only one prediction compared to three references. There are two references that the model failed to predict but the average-metrics cannot quantify this.

2. Reference Miscoverage - The average-metrics fail to account for references that are not covered by the generated set. As a concrete example, consider case 2 in Figure 2. In this case there is no number mismatch. However, all the predictions are paraphrases of each other. As such, all of them match highly only with reference 1. Thus, the average n-gram match score of all predictions will be high. Nevertheless, the average-metrics remain oblivious to the failure of the model to generate anything close to reference 2 or reference 3.

Du and Cardie (2017) proposed two evaluation strategies (“conservative” and “liberal”) to evaluate the full system for paragraph-level multi-question generation, but their schemes partly depend on aligning predictions and references based on their shared question-worthy sentence (if any). However, such a method is only possible for sentence-level or type-level granularity, where each generated question is explicitly associated with a classified question-worthy sentence. In paragraph-level granularity, the generated questions are not explicitly associated with any sentence. We propose a more model-agnostic and general evaluation strategy.

3.1 Multi-metrics

We propose a new evaluation scheme for comparing two sets of sequences to address the aforementioned limitations of existing strategies. First, we note that one cause of reference miscoverage is that multiple repetitive predictions can match with the same reference. To address this issue, we add a constraint.

Constraint 1 - A prediction can be assigned to at most one reference, and a reference can be assigned to at most one prediction. Only assigned prediction-reference pairs are scored.

That is, we first construct a mapping between predictions and references by assigning one to another. Next, we compute the matching scores (based on some sequence comparison metric) only for the assigned pairs. Unlike before, we do not allow multiple predictions to match with the same reference or vice versa. Nevertheless, there can be multiple assignments that meet constraint 1. Thus, we set another constraint:

Constraint 2 - Assume that there are m predictions (p_{1},p_{2},\dots,p_{m}), n references (r_{1},r_{2},\dots,r_{n}), and a sequence-level evaluation metric M where M(p_{i},r_{j}) returns a score for how well prediction p_{i} matches reference r_{j}. We then choose a set A of k (k=\min(m,n)) assignment pairs between predictions and references such that the “overall match score” S of the assigned pairs is maximized. Formally, the overall match score is computed as: S=\sum_{(p_{i},r_{j})\in A}M(p_{i},r_{j}).

The evaluation metric M can be any n-gram matching metric (e.g., BLEU or METEOR) or even a newer metric like BERTScore Zhang* et al. (2020). We only consider metrics that return non-negative values. Combining constraint 1 and constraint 2, we get exactly the problem of optimal assignment. For example, we can treat predictions as workers, references as jobs, and the n-gram match score of a particular prediction-reference pair as the job performance of the worker (prediction). For such cases, the Hungarian algorithm Kuhn (1955), a combinatorial optimization algorithm, was designed to find an optimal one-to-one assignment that maximizes the overall job performance under the constraints. In our proposed evaluation scheme, we first use the Hungarian algorithm to find the k assignments between predictions and references and also compute the overall match score S. In Figure 3, we show an example of the assignments made by the Hungarian algorithm based on METEOR (as M) for the prediction set and the reference set given in Figure 1.

Nevertheless, having this overall match score S is not enough, because the metric can ignore some references/predictions if there are more than k references/predictions. As such, the problem of accounting for number mismatch may remain. To resolve this issue, we use the overall score S in an F-score-like framework.

First, we compute a precision-like metric as Pr=\frac{S}{m}. Pr reflects the overall degree of match that the predictions have with the references. The maximum value of S is c\cdot k, where c is the maximum value of the evaluation metric M. Now, if there are too many predictions compared to ground truths, then m will be \gg k. A high m reduces Pr and mitigates the problem of over-prediction (one aspect of number mismatch).

Second, we compute a recall-like metric as Re=\frac{S}{n}. Re reflects the overall degree of match that the references have with the predictions. That is, it reflects how well the references are covered by the predictions. If there are too few predictions compared to ground truths, then m will be \ll n; thus, n will be \gg k. This reduces Re and mitigates the problem of under-prediction (another aspect of number mismatch).

Finally, similar to F1, we compute an unweighted harmonic mean (Multi-M) of Pr and Re:

\textrm{Multi-}M=\frac{2\cdot Pr\cdot Re}{Pr+Re}

Here, in the name “Multi-M”, M denotes the evaluation metric as mentioned before. M can be any metric like BLEU or METEOR. Thus, we can have different corresponding variants of our set-level Multi-M metrics, such as Multi-BLEU or Multi-METEOR, depending on the metric M.
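The scheme can be implemented with an off-the-shelf assignment solver. Below is a minimal sketch (not the released code) that uses SciPy's linear_sum_assignment as the Hungarian-style solver; metric stands for any non-negative sequence-level scorer M (e.g., a sentence-level METEOR wrapper) and is assumed to be supplied by the caller.

import numpy as np
from scipy.optimize import linear_sum_assignment


def multi_metric(predictions, references, metric):
    """Compare a set of predicted questions against a set of references (Multi-M sketch)."""
    m, n = len(predictions), len(references)
    if m == 0 or n == 0:
        return 0.0

    # Pairwise score matrix: rows = predictions ("workers"), cols = references ("jobs").
    scores = np.array([[metric(p, r) for r in references] for p in predictions])

    # Hungarian-style optimal one-to-one assignment maximizing the overall match score S.
    rows, cols = linear_sum_assignment(scores, maximize=True)
    S = scores[rows, cols].sum()

    Pr = S / m  # precision-like: penalizes over-prediction
    Re = S / n  # recall-like: penalizes under-prediction and reference miscoverage
    return 2 * Pr * Re / (Pr + Re) if (Pr + Re) > 0 else 0.0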

4 Generation Frameworks

As mentioned earlier, the current state of the art in QG is achieved by pre-trained Seq2Seq models. Thus, we choose T5 Raffel et al. (2020), a pre-trained Seq2Seq model, as the main model for question generation. For our specific task, there are several distinct strategies for utilizing T5 that we contrast. At a high level, some of these strategies involve generating one question at a time or generating a concatenated series of questions; generating at the document level or generating at the sentence level and collating the generations for every sentence. Overall, we consider two main factors of variation: (1) generation granularity and (2) generation mode. We describe both below:

4.1 Generation Granularity

While we solely focus on paragraph-level generation, that does not mean we have to generate directly at the paragraph level. Instead, we can also generate questions by focusing on a lower granularity (for example, individual sentences in the paragraph) and then collate the results to obtain a paragraph-level output. Below, we discuss question generation at different levels of granularity.

1. Paragraph-level - In the paragraph-level granularity, we feed the whole paragraph to T5 and let it generate all the suitable questions from it directly. This model implicitly learns to find potential question-worthy areas (and candidate answer phrases) and to generate questions about them.

2. Sentence-level - In the sentence-level granularity, we follow the strategy proposed by Du and Cardie (2017). During training, we train two models. One model is a question-worthiness classifier which classifies whether a sentence in a given paragraph is question-worthy or not. We describe the question-worthiness classifier in appendix 6.1. The other model is a seq2seq question generator which is trained to do sentence-level question generation. During inference, in the first step, we classify each sentence in the given paragraph as either question-worthy or not. In the second step, for every sentence classified as question-worthy, we do sentence-level question generation using the trained question generator. In the final step, we collate all the generated questions for every question-worthy sentence to represent the overall paragraph-level output. However, we integrate two notable differences from Du and Cardie (2017): (1) we use T5 instead of an RNN-based question generator; (2) instead of using just a sentence as input for the sentence-level question generator, we use the whole paragraph (after flattening it into a single sequence of tokens) as input. However, we “highlight” the sentence in the paragraph to make the generator focus on that highlighted sentence for sentence-level generation (an input-construction sketch is shown at the end of this subsection). Highlighting is done by adding a special token <hl> at the beginning of the sentence, and a special token </hl> at its end. This is done because sometimes the surrounding context of a sentence is necessary to generate the ground truth questions Zhao et al. (2018).

Refer to caption
Figure 4: Framework for QG at a type-level granularity during inference. Step 1 shows question-worthiness classification followed by question-type prediction from the classified question-worthy sentences (highlighted in blue). Step 2 shows generating type-conditioned questions for each predicted question-type and its corresponding question-worthy sentence (highlighted in blue). Generation at a sentence-level granularity works similarly but excluding the question-type classifier (step 1) and type-conditioning (step 2).

3. Type-level - In type-level generation, during training, we train three separate models. As in sentence-level, we train a question-worthiness classifier and a T5-based question generator. However, in addition, we also train a question-type classifier which predicts all types (e.g., who, where, how, etc.) of questions that are worthy to be asked from the question-worthy sentences. We frame the question-type prediction task as a multi-label sentence classification problem. We consider question-type labels from the set {who, when, where, what, why, which, how, quantity, other}. We describe the question-type classifier in appendix 6.1. Different from sentence-level, we train the T5 sentence-level question generator to be conditioned on the question type. Essentially, we prepend a special token indicating the ground-truth question type to the input of the sentence-level question generator during training. As such, the generator learns to condition its generation on the question type specified by the prepended token. A similar sentence-highlighting strategy is used for the generator input as before.

During inference, first, we classify each sentence in a given paragraph to check whether it is question-worthy or not. Second, for every classified question-worthy sentence, we predict all the types of questions that are worthy to be asked from that sentence. Third, for every question-worthy sentence and for every question type appropriate to be asked from that sentence, we perform a type-conditioned sentence-level question generation for that question type (conditioned by prepending) and that sentence. Finally, all the generated questions for each sentence and its question types can be collated together to produce an output set of questions for the overall paragraph. Figure 4 shows the pipeline for type-level generation during inference.

Similar to us, Wu et al. (2020) used type-driven generation in an answer-agnostic setup, but they focus on sentence-level single-question generation. Moreover, they frame question-type prediction as a multi-class classification problem (question types for a sentence are considered mutually exclusive). While Wu et al. (2020) proposed selecting the top K most probable question types to generate multiple questions of multiple types for the same sentence, their model is still dependent on the hyperparameter K. Instead, in our approach, we simply let the classification model decide which and how many types to predict by framing question-type prediction as a multi-label classification problem.
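To make the input formats concrete, the sketch below illustrates how generator inputs could be constructed for the sentence-level and type-level granularities: the target sentence is wrapped in <hl> ... </hl> markers inside the flattened paragraph, and for type-level generation a question-type token is prepended. The function names and the exact form of the type token are illustrative assumptions, not the released implementation.

def highlight_sentence(sentences, idx):
    """Flatten a paragraph and highlight the idx-th sentence with <hl> markers."""
    marked = list(sentences)
    marked[idx] = "<hl> " + marked[idx] + " </hl>"
    return " ".join(marked)


def build_generator_input(sentences, idx, question_type=None):
    """Sentence-level input; optionally prepend a type token for type-level generation."""
    text = highlight_sentence(sentences, idx)
    if question_type is not None:  # e.g., "who", "quantity", ...
        text = "<" + question_type + "> " + text
    return text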

4.2 Generation Mode

Besides granularity, the generation multiplicity (or mode) is another factor to consider. We consider two different modes of generation, which we discuss below.

1. One2One Generation - In one2one generation, we train the question-generator model to maximize the likelihood of a single ground truth question for the sample input (of whatever granularity). To generate multiple questions in a one2one setting, we can use beam search or multiple runs of sampling-based decoding. We can also generate multiple questions by generating at a lower granularity and then collating the results.

2. One2Many Generation - In the one2many generation setting, we train the question-generation model to maximize the likelihood of the concatenation of all the questions that are asked (in the ground truth set) for the given sample input of a specific granularity. A one2many model can generate multiple questions even when using greedy decoding. A key benefit of the one2many mode is that we can let the model itself determine how many questions to generate. In contrast, in the one2one setting, we have to manually decide how many questions we want to generate (using beam search or sampling techniques). Another benefit of the one2many mode is that the generation of one question can be informed by previously generated questions for the given input.

A similar one2many mode of generation was proposed for keyphrase generation Yuan et al. (2020). For question generation in particular, Lopez et al. (2021) used the one2many mode of generation with pre-trained Transformer models. Different from prior work, we explore combinations of one2many generation with sentence-level and type-level granularity, considering that multiple questions can be asked even for a single sentence or a single question type.

For training one2many models, we simply prepare the training ground truths by concatenating all the questions (for the given granularity of input) while using a special token <sep> as a delimiter. The concatenation of questions is done in the order (first to last) of the occurrence of their corresponding question-worthy sentences.
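As an illustration, the sketch below builds a one2many training target by concatenating the ground-truth questions (ordered by the position of their question-worthy sentences) with the <sep> delimiter, and splits a decoded output back into individual questions. The function names are illustrative, not the released code.

def build_one2many_target(questions_by_sentence):
    """questions_by_sentence: list (ordered by sentence position) of lists of questions."""
    ordered = [q for sent_questions in questions_by_sentence for q in sent_questions]
    return " <sep> ".join(ordered)


def split_one2many_output(decoded):
    """Recover the individual questions from a decoded one2many sequence."""
    return [q.strip() for q in decoded.split("<sep>") if q.strip()]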

Overall, there are six ways of combining the above factors (granularity and mode), and thus a total of six possible strategies. We compare all six.

5 Ranking and Selection

When using the one2one generation mode, we also have to decide on a method to generate multiple questions, particularly when generating directly at the paragraph-level granularity. Thus, for the one2one mode, we employ an overgenerate-and-rank strategy. First, we overgenerate multiple questions (the overgeneration method is discussed in section 5.1). Second, we rank the generations using some ranking method. Third, we select the k highest-ranked generations. For ranking and selection, we consider the following methods:

1. Rand@5: In this method, we randomly select 5 generations. The average number of questions per paragraph is also 5 in the dataset that we use.

2. Top@1: In this method, we only generate and select a single greedy-decoded question. Thus, in this case, there will be only one generated question per input at a specific granularity level. For example, in the case of paragraph-level granularity, there will be only one generation overall, and in the case of sentence-level granularity, there can be as many generations as there are question-worthy sentences.

3. Rank@5: In this method, we rank the questions based on their answerability probability and then select the top 5 questions. We use a neural question-answering model to predict the answerability probability of each question. In particular, we use an ELECTRA-large model Clark et al. (2020) trained on SQuAD2.0 Rajpurkar et al. (2018) not only to extract an answer span but also to decide whether the question is answerable from the given context (https://huggingface.co/ahotrod/electra_large_discriminator_squad2_512). We also filter out any question with an answerability probability below 0.5.
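A possible realization of Rank@5 is sketched below using the Hugging Face question-answering pipeline with the checkpoint mentioned above; here the pipeline's span score is treated as a proxy for the answerability probability, which is a simplification of the setup described in this paper.

from transformers import pipeline

# Hedged sketch: a SQuAD2.0-style QA model served through the question-answering pipeline.
qa = pipeline("question-answering",
              model="ahotrod/electra_large_discriminator_squad2_512")


def rank_at_k(questions, paragraph, k=5, threshold=0.5):
    """Rank candidate questions by a QA-based score and keep the top k above a threshold."""
    scored = []
    for q in questions:
        out = qa(question=q, context=paragraph, handle_impossible_answer=True)
        scored.append((out["score"], q))  # span score used as an answerability proxy
    scored.sort(reverse=True)
    return [q for score, q in scored[:k] if score >= threshold]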

5.1 Overgeneration

For overgenerating multiple (K) questions in one2one methods, we generate one question using greedy decoding for the given granularity level. The rest of the K generations at the given granularity level are decoded using multiple runs of Nucleus Sampling Holtzman et al. (2020) with a top-p value of 0.9. We set K=20 for paragraph-level granularity, K=10 for sentence-level granularity, and K=5 for type-level granularity. The K-values were chosen arbitrarily for the most part. We decrease the K-value for lower granularities because the overall number of generations can be many times higher than K, depending on how many question-worthy sentences or question types there are.
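The overgeneration step can be reproduced approximately with standard decoding utilities, as in the sketch below. It assumes a fine-tuned Hugging Face T5 model and tokenizer and K > 1; it is a minimal illustration, not the exact released code.

import torch


def overgenerate(model, tokenizer, input_text, k=20, max_length=64):
    """One greedy question plus (k - 1) nucleus-sampled questions (top-p = 0.9)."""
    inputs = tokenizer(input_text, return_tensors="pt", truncation=True, max_length=512)
    with torch.no_grad():
        greedy = model.generate(**inputs, max_length=max_length)
        sampled = model.generate(**inputs, do_sample=True, top_p=0.9, top_k=0,
                                 num_return_sequences=k - 1, max_length=max_length)
    candidates = tokenizer.batch_decode(greedy, skip_special_tokens=True)
    candidates += tokenizer.batch_decode(sampled, skip_special_tokens=True)
    return candidates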

6 Experiments

Model Prec. Rec. F1 Acc.
Du and Cardie (2017) 73.15 89.29 80.42 72.52
ELECTRA CL 76.19 74.02 75.09 70.11
ELECTRA HC 75.88 88.24 81.59 75.77
Table 1: Question-worthiness classification results
Type Prec. Rec. F1
Who 39.64 88.75 54.80
When 31.85 78.63 45.34
Where 21.62 79.05 33.96
What 83.59 49.11 61.87
Quantity 56.72 88.65 69.18
How 10.97 65.53 18.79
Why 09.12 64.84 15.99
Which 20.86 64.78 31.55
Others 02.35 32.26 04.37
Table 2: Question-type classification results

In this section, we discuss the details of our experimental models, datasets, evaluation, and results. We use an ELECTRA-large-based classifier Clark et al. (2020) for question-worthiness classification and question-type classification. More details on the classifier architectures are in section 6.1. Hyperparameter details are in sections 6.3 and 6.4.

6.1 Classifier Architecture

Question-type Classification: For Question-Type classification, we use ELECTRA-large as a multi-label sentence classifier. We transform the final representation of the CLS token using two layers with a GELU activation Hendrycks and Gimpel (2016) in between for classification.

Question-worthiness Classification: For question-worthiness classification, we try two distinct approaches: (1) ELECTRA CL and (2) ELECTRA HC. We find that ELECTRA HC outperforms ELECTRA CL, so only ELECTRA HC is used in the main experiments (Tables 3 and 4). Table 1 also shows the results of ELECTRA HC. We describe both methods below:

1. ELECTRA CL: In this approach we simply use ELECTRA-large as a sentence-level binary classifier similar to how we use it as a multi-label classifier for question-type classification.

2. ELECTRA HC: ELECTRA HC takes a hierarchical approach to classification. It uses a framework similar to that of Du and Cardie (2017). First, we encode each sentence in a paragraph into a single vector (sentence vector) using a sentence encoder (as a result, we have a sequence of sentence vectors representing the paragraph). Second, the sequence of vectors is contextualized by a BiLSTM Hochreiter and Schmidhuber (1997); Graves and Schmidhuber (2005). Third, we apply a binary classifier (two layers with a GELU activation in between) to each of the sentence vectors in the sequence. In this strategy, the classification of the question-worthiness of each sentence is informed by the context of the surrounding sentences in the paragraph. Different from Du and Cardie (2017), we use ELECTRA-large, a more modern model, for sentence encoding. We treat the final representation of the CLS token of ELECTRA as the encoded sentence vector for each sentence.
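A minimal PyTorch sketch of the ELECTRA HC architecture follows; the module and variable names are illustrative, while the structure (ELECTRA CLS sentence vectors, a BiLSTM over the sentence sequence, and a two-layer GELU head) mirrors the description above.

import torch
import torch.nn as nn
from transformers import AutoModel


class ElectraHC(nn.Module):
    """Hierarchical question-worthiness classifier: ELECTRA sentence vectors -> BiLSTM -> head."""

    def __init__(self, encoder_name="google/electra-large-discriminator", lstm_hidden=300):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        # Total BiLSTM hidden size (forward + backward combined) is lstm_hidden.
        self.bilstm = nn.LSTM(hidden, lstm_hidden // 2, batch_first=True, bidirectional=True)
        self.head = nn.Sequential(nn.Linear(lstm_hidden, lstm_hidden), nn.GELU(),
                                  nn.Linear(lstm_hidden, 1))

    def forward(self, input_ids, attention_mask):
        # input_ids: (num_sentences, seq_len), all sentences of one paragraph.
        cls = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state[:, 0]
        contextual, _ = self.bilstm(cls.unsqueeze(0))  # (1, num_sentences, lstm_hidden)
        logits = self.head(contextual)                 # (1, num_sentences, 1)
        return torch.sigmoid(logits).squeeze(0).squeeze(-1)  # per-sentence worthiness probability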

6.2 Question-type Determination

For the most part, we determine the question type of a question based on whether the question-type word is present in the question. For example, if “who” is present in a question, its question type is determined to be “who”. Sometimes the question-type word occurs in the middle of the question, so we do not restrict ourselves to checking only the first question word (several examples of such cases can be observed in the reference questions in Table 5). This rule is not foolproof, but it generally works for the dataset that we use. There are, however, multiple exceptions to the general rule mentioned above. If “whose” or “whom” is present in the question, the question type is still determined as “who”. If “how much” or “how many” is present in the question, then the question type is determined to be “quantity” instead of “how”. We do this because general “how” questions are of a different breed (asking for a process) than questions asking “how much”/“how many” (generally asking for some quantity or intensity). The presence of the words “quantity” or “other” in a question does not determine the question type to be “quantity” or “other”. Any question whose type remains undetermined by the above rules is assigned the type “other”. Questions with answers “yes” or “no” are also assigned the type “other”. We do not keep a separate type for boolean questions (yes/no questions) because there are very few instances of this type in the dataset that we use.
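The rules above can be summarized in a small heuristic function, sketched below; the precedence order when several type words co-occur (first matching token wins) is an assumption of this sketch, as it is not fully specified in the text.

def determine_question_type(question, answer=None):
    """Heuristic question-type determination (a sketch of the rules described above)."""
    q = question.lower()
    tokens = q.replace("?", " ").split()
    if answer is not None and answer.strip().lower() in {"yes", "no"}:
        return "other"  # boolean questions fall back to "other"
    if "how much" in q or "how many" in q:
        return "quantity"
    keyword_to_type = {"who": "who", "whose": "who", "whom": "who",
                       "when": "when", "where": "where", "what": "what",
                       "why": "why", "which": "which", "how": "how"}
    for token in tokens:  # the type word may occur in the middle of the question
        if token in keyword_to_type:
            return keyword_to_type[token]
    return "other"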

6.3 Classifier Hyperparameters

All ELECTRA-based classifiers use two layers on top of the final representation of the CLS token, with a GELU activation function Hendrycks and Gimpel (2016) in between. The first layer has as many neurons as the ELECTRA hidden state (except for ELECTRA HC, where the number of neurons in this layer is the same as the BiLSTM hidden size). The last layer has a sigmoid activation function. In the case of question-worthiness classification (binary classification), the last layer has a single neuron. In the case of question-type classification, the last layer has as many neurons as there are question-type labels. The total hidden size of the BiLSTM hidden state (forward and backward combined), as used in ELECTRA HC, is 300. Some of the general hyperparameters used for all classifiers are a weight decay of 0.01, a maximum gradient norm clipping of 5, a maximum sequence length of 512, a batch size of 64, a maximum number of epochs of 30, and an early-stop patience of 2 (training is terminated when the loss does not decrease for two consecutive epochs). We use RAdam Liu et al. (2020) as the optimizer. For all classifiers, the learning rate is tuned using grid search among the options {0.001, 0.0002, 0.0001, 0.00002}; 0.00002 was chosen for each. Label weights were used for question-type classification. The label weight for a particular label was determined as the total number of negative instances divided by the total number of positive instances of that label. Model selection and hyperparameter selection are done based on the best validation loss.

6.4 Generator Hyperparameters

Some of the shared hyperparameters of the T5-based question-generator models are a batch size of 128, a weight decay of 0, a maximum number of epochs of 20, a maximum sequence length of 512, a maximum gradient norm clipping of 5, and an early-stop patience of 2 (training is terminated when the loss does not decrease for two consecutive epochs). We use SM3 Anil et al. (2019) as the optimizer. Greedy decoding is used for generation in one2many mode. One2one modes use the overgenerate-and-rank methods discussed earlier. The learning rate is tuned using grid search among the options {1.0, 0.1, 0.01, 0.001} for each generator separately. We limit the maximum epochs to 10 for each hyperparameter-tuning trial. A learning rate of 0.1 was chosen for paragraph-level one2many generation and a learning rate of 0.01 for the rest. Learning rate (lr) warmup is applied as follows (based on the recommended procedure for SM3, https://github.com/google-research/google-research/tree/master/sm3):

lr_{s}=lr_{0}\cdot\min(1,(s/w)^{2})   (1)

lr_{s} indicates the learning rate at step s. The initial learning rate (lr_{0}) is whatever is chosen after hyperparameter tuning. s indicates the current update step number. w indicates the total number of warmup steps (set to 2000). Model selection and hyperparameter selection are done based on the best validation loss.
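For reference, the warmup schedule in Equation (1) corresponds to the following simple function (a sketch; the function name is illustrative):

def warmup_lr(step, base_lr, warmup_steps=2000):
    """Quadratic warmup: lr_s = lr_0 * min(1, (s / w)^2)."""
    return base_lr * min(1.0, (step / warmup_steps) ** 2)


# e.g., for paragraph-level one2many generation (base_lr = 0.1):
# warmup_lr(500, 0.1) == 0.1 * (500 / 2000) ** 2 == 0.00625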

6.5 Dataset

We perform all our experiments on the processed SQuAD1.1 Rajpurkar et al. (2016) split as provided by Du et al. (2017) (https://github.com/xinyadu/nqg/tree/master/data/processed) for question generation. For question-worthiness classification, all the sentences from all the paragraphs in the dataset are input samples. A sample input sentence is considered question-worthy if that sentence has a corresponding ground-truth question in the dataset. For question-type classification, only question-worthy sentences are sample inputs. For each sample, its question-type labels are the question types of all the questions associated with that sample sentence. The question type of a question is determined based on some heuristic rules that are detailed in appendix 6.2. We maintain the original train-development-test split (as provided by Du et al. (2017)) for all experiments.
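The labeling rule described above can be expressed as a small preprocessing step. The sketch below assumes a simplified, hypothetical data layout (paragraph sentences plus a mapping from sentence index to its ground-truth questions) rather than the actual format of the Du et al. (2017) release; type_fn stands for the heuristic type rules of appendix 6.2.

def build_classification_samples(paragraph_sentences, questions_by_sentence_idx, type_fn):
    """Hypothetical layout: dict {sentence_index: [questions]}; type_fn maps a question to its type."""
    worthiness, type_samples = [], []
    for i, sentence in enumerate(paragraph_sentences):
        questions = questions_by_sentence_idx.get(i, [])
        worthiness.append((sentence, len(questions) > 0))  # question-worthy iff it has a question
        if questions:  # only question-worthy sentences get question-type labels
            type_samples.append((sentence, sorted({type_fn(q) for q in questions})))
    return worthiness, type_samples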

Granularity Generation Mode Multi-BLEU4 Multi-MTR Multi-R-L Multi-qBLEU1 BLEU4 MTR R-L qBLEU1
Du and Cardie (2017) 12.28 16.62 39.75
Lopez et al. (2021) 8.26 21.2 44.38
Paragraph one2one (Rand@5) 5.14 15.78 28.79 28.78 7.78 20.52 37.67 43.50
Paragraph one2one (Top@1) 3.73 9.10 16.94 16.58 12.68 23.58 44.12 49.27
Paragraph one2one (Rank@5) 6.78 17.07 30.17 30.38 11.21 23.71 41.69 47.98
Paragraph one2many 6.25 15.55 28.97 28.48 11.15 22.12 42.41 47.58
Sentence one2one (Rand@5) 5.37 15.85 28.97 29.24 7.85 20.48 37.86 43.88
Sentence one2one (Top@1) 6.97 16.56 30.02 29.96 11.91 23.20 42.81 48.61
Sentence one2one (Rank@5) 6.77 16.70 29.50 29.76 11.22 23.72 41.51 47.86
Sentence one2many 7.80 17.93 32.08 32.08 11.64 22.65 41.96 47.60
Type one2one (Rand@5) 4.86 15.34 28.07 28.15 7.15 19.82 36.37 42.36
Type one2one (Top@1) 7.01 16.03 29.06 28.90 12.30 23.27 42.78 48.55
Type one2one (Rank@5) 6.43 16.42 29.00 29.13 10.45 23.08 40.35 46.57
Type one2many 7.57 16.79 30.05 30.11 12.67 23.4 42.65 48.54
Table 3: Multi-metrics and average-metrics performance on paragraph-level multi-question generation on SQuAD1.1 (split by Du et al. (2017)) using different granularity levels and generation modes.
Granularity Generation Mode Self-BLEU2 car. diff
Paragraph one2one (Rand@5) 66.18 -0.11
Paragraph one2one (Top@1) 0 3.89
Paragraph one2one (Rank@5) 49.25 -0.11
Paragraph one2many 38.82 1.45
Sentence one2one (Rand@5) 72.39 -0.10
Sentence one2one (Top@1) 17.84 1.78
Sentence one2one (Rank@5) 52.72 -0.10
Sentence one2many 34.91 0.39
Type one2one (Rand@5) 71.32 -0.10
Type one2one (Top@1) 16.43 1.99
Type one2one (Rank@5) 50.63 -0.09
Type one2many 17.04 1.76
Table 4: Self-BLEU and cardinality difference (car. diff) for QG on Du et al. (2017) split using different granularity-levels and generation modes.
Example # 1
Generated Questions:
1. what were the early courses in in the college of science?
2. when was the college of engineering established?
Reference Questions:
1. how many bs level degrees are offered in the college of engineering at notre dame?
2. in what year was the college of engineering at notre dame formed?
3. before the creation of the college of engineering similar studies were carried out at which notre dame college?
4. how many departments are within the stinson-remick hall of engineering?
5. the college of science began to offer civil engineering courses beginning at what time at notre dame?
BLEU4: 40.34; Multi-BLEU4: 13.26; METEOR: 22.06; Multi-METEOR: 11.81;
ROUGE-L: 42.38; Multi-ROUGE-L: 22.91; Q-BLEU1: 40.26; Multi-Q-BLEU1: 17.01
Example # 2
Generated Questions:
1. when was theodore m. hesburgh library completed?
2. what is the library system of the university divided between?
3. what is the name of the library that houses the main collection of books?
4. what is the word of life mural known as?
5. what does the word of life mural appear to make?
6. who designed the word of life mural?
Reference Questions:
1. how many stories tall is the main library at notre dame?
2. what is the name of the main library at notre dame?
3. in what year was the theodore m. hesburgh library at notre dame finished?
4. which artist created the mural on the theodore m. hesburgh library?
5. what is a common name to reference the mural created by millard sheets at notre dame?
BLEU4: 10.65; Multi-BLEU4: 11.38; METEOR: 17.25; Multi-METEOR: 15.04;
ROUGE-L: 40.15; Multi-ROUGE-L: 33.6; Q-BLEU1: 36.44; Multi-Q-BLEU1: 27.72
Example # 3
Generated Questions:
1. what do most people with dogs describe their pet as?
2. what does a study of conversations in dog-human families show?
3. what is the popular reconceptualization of the dog-human family as a pack?
4. what is a dominance model of dog-human relationships promoted by some dog trainers?
Reference Questions:
1. how do most people describe the relationship with their dogs?
2. what television show uses a dominance model of dog and human relationships?
3. most people today describe their dogs as what?
4. what tv show promotes a dominance model for the relationships people have with their dogs?
BLEU4: 05.56; Multi-BLEU4: 05.56; METEOR: 24.28; Multi-METEOR: 21.21;
ROUGE-L: 37.13; Multi-ROUGE-L: 32.43; Q-BLEU1: 41.97; Multi-Q-BLEU1: 37.10
Table 5: SQuAD1.1 examples with generations from T5 sentence-level one2many. A comparison between different evaluation metrics is presented.
Example # 1
Generated Questions:
1. what is the name of the oldest building on campus?
Reference Questions:
1. where is the headquarters of the congregation of the holy cross?
2. what is the primary seminary of the congregation of the holy cross?
3. what is the oldest structure at notre dame?
4. what individuals live at fatima house at notre dame?
5. which prize did frederick buechner create?
BLEU4: 0; Multi-BLEU4: 0; METEOR: 17.58; Multi-METEOR: 05.86;
ROUGE-L: 50; Multi-ROUGE-L: 15.12; Q-BLEU1: 43.00; Multi-Q-BLEU1: 12.07
Example # 2
Generated Questions:
1. what is the name of the college of engineering?
Reference Questions:
1. how many bs level degrees are offered in the college of engineering at notre dame?
2. in what year was the college of engineering at notre dame formed?
3. before the creation of the college of engineering similar studies were carried out at which notre dame college?
4. how many departments are within the stinson-remick hall of engineering?
5. the college of science began to offer civil engineering courses beginning at what time at notre dame?
BLEU4: 43.44; Multi-BLEU4: 07.54; METEOR: 24.33; Multi-METEOR: 08.11;
ROUGE-L: 49.23; Multi-ROUGE-L: 15.47; Q-BLEU1: 47.69; Multi-Q-BLEU1: 12.52
Table 6: SQuAD1.1 examples with generations from T5 paragraph-level one2one (Top@1). A comparison between different evaluation metrics is presented.

6.6 Evaluation

We use standard precision, recall, and F1 measures for the classification tasks. For question generation, we use different instances of our multi-metrics (discussed in §3.1): Multi-BLEU4, Multi-METEOR (Multi-MTR), Multi-ROUGE-L (Multi-R-L), and Multi-qBLEU1. We also report the average BLEU4, METEOR (MTR), ROUGE-L (R-L), and qBLEU1 metrics. In Table 4, we use Self-BLEU2 Zhu et al. (2018) on the prediction set to show prediction diversity (lower Self-BLEU means higher diversity). We also show the difference between the number of predictions and the number of ground truth questions. We refer to this metric as cardinality difference (car. diff); it indicates the degree of number mismatch. We use nlg-eval Sharma et al. (2017) for the n-gram match scores. For Q-BLEU1, we use the same parameters as recommended for Q-BLEU1 on SQuAD by the original paper (see Table 5 in Nema and Khapra (2018)).

6.7 Results

Table 1 compares the performance of ELECTRA HC and ELECTRA CL. As can be seen, ELECTRA HC outperforms ELECTRA CL by a significant margin and obtains the overall best performance on recall, F1, and accuracy. In Table 2, we show the performance for each question-type label. While the type-classification performance is low, it is on par with the results obtained by prior methods Wu et al. (2020) (although the results are not strictly comparable due to differences in question types).

In Table 3, we show the main results of question generation. Among the one2one ranking methods, Top@1 seems to work best at sentence-level or type-level granularities, but it causes a higher-magnitude cardinality difference (as shown in Table 4). The multi-metrics performance of paragraph-level Top@1 is poor because it can predict only one question for the whole paragraph, causing a high number mismatch. Rank@5 works better than Rand@5 at any granularity level, which makes sense given that Rank@5 takes answerability into account. Among generation modes, one2many generally works better than one2one ranking when using sentence-level or type-level granularity. Among granularity levels, sentence-level generally performs best on the multi-metrics. Type-level generation does not offer much further benefit. Type-level generation also causes a higher-magnitude cardinality difference (Table 4), although it achieves lower Self-BLEU (higher diversity) (Table 4). However, in some cases, lower Self-BLEU can result from under-prediction (having fewer predictions reduces the chances of n-gram overlaps among them); thus, lower Self-BLEU is not always good.

7 On the Efficacy of Multi-metrics

We motivate the efficacy of multi-metrics on two different grounds.

First, we emphasize the theoretically established virtues of multi-metrics in terms of their ability to account for reference miscoverage and number mismatch (§3).

Second, we show a concrete instance where the average metrics are high despite a critical failure of the model, whereas the multi-metrics are low, as they should be. This can be observed in Table 3 in the case of paragraph-level one2one generation with Top@1 ranking. Here, the model can predict at most one question, whereas there are five references on average. Thus, we see a high cardinality difference in Table 4 for this case. Despite this, the average n-gram metrics assign very high scores to this model; they are completely insensitive to this model failure. The multi-metrics, on the other hand, take this failure into account. Thus, we observe that the multi-metrics assign it the lowest scores.

We also present some concrete examples (of generated sets of questions and ground truth sets of questions, or references) along with their corresponding metrics (both average-based n-gram match metrics and multi-metrics) in Tables 5 and 6. In Table 5, example #1, we find a degree of number mismatch: there are only two predictions but five references. As expected, we find quite a bit of difference between the multi-metrics and the average metrics here, because the multi-metrics are penalized heavily for the number mismatch while the average metrics are not penalized as much. In examples #2 and #3 from Table 5, the multi-metrics are much closer to the average metrics because the number of predictions is close to the number of references and, furthermore, there are no issues with paraphrases among the predictions. On the other hand, in the examples in Table 6, we again observe an amplified difference between the multi-metrics and the average metrics, given the high degree of number mismatch that the average metrics fail to take into account.

8 Related Work

Assignment-based Evaluations - Rus and Lintean (2012) proposed an optimal-matching-based method for embedding-based text similarity, but not in the context of comparing sets of sequences in an F1-like framework. Similarly, several text evaluation approaches Kusner et al. (2015); Chow et al. (2019); Clark et al. (2019); Zhao et al. (2019) used earth mover's distance Rubner et al. (1998), whereas Zhang* et al. (2020) used greedy matching. Similar to us, Schlichtkrull and Cheng (2020) proposed an evaluation for QG, but using greedy matching (instead of optimal assignment), which allows multiple predictions to match with a single reference and vice versa.

Question Generation - Du et al. (2017) presented one of the earliest works on answer-agnostic neural QG. There were also several early answer-aware QG approaches Yuan et al. (2017); Zhou et al. (2017). Several works took joint-training or multi-task approaches to train both question answering and QG Duan et al. (2017); Song et al. (2017); Tang et al. (2017); Wang et al. (2017). Du and Cardie (2017) proposed QG in the sentence-level granularity. Subramanian et al. (2018) generated questions based on detected keyphrases. Similarly, Wang et al. (2019) used a multi-agent communication framework to first identify question-worthy phrases and then generate questions with their assistance. Zhao et al. (2018) used maxout pointer and gated self-attention to exploit paragraph-level information for QG. Scialom et al. (2019) used Transformer-based approaches for answer-agnostic QG. Multiple works Fan et al. (2018); Sun et al. (2018); Hu et al. (2018); Zhou et al. (2019); Wu et al. (2020) utilized question-word information or a question-type-driven framework for different variants of QG. Newer approaches Chan and Fan (2019a, b); Dong et al. (2019); Varanasi et al. (2020); Qi et al. (2020); Lopez et al. (2021) used pre-trained Transformer models.

9 Conclusion

We proposed a new evaluation method (multi-metrics) to evaluate multi-question generation. We motivated the evaluation theoretically and also empirically in terms of the contrast discussed in §7. Using both new and old evaluations, we also empirically compared combinations of various strategies for paragraph-level multi-question generation under a common framework. Our results show that using factorized sentence-level generation in one2many mode is better than generating directly from the paragraph level, even when using powerful pre-trained models.

References