Pragmatically Appropriate Diversity for Dialogue Evaluation

Katherine Stasaski
Salesforce AI Research
[email protected]
\AndMarti A. Hearst
UC Berkeley
[email protected] Research performed while at UC Berkeley.

Abstract

Linguistic pragmatics state that a conversation’s underlying speech acts can constrain the type of response which is appropriate at each turn in the conversation. When generating dialogue responses, neural dialogue agents struggle to produce diverse responses. Currently, dialogue diversity is assessed using automatic metrics, but the underlying speech acts do not inform these metrics.

To remedy this, we propose the notion of Pragmatically Appropriate Diversity, defined as the extent to which a conversation creates and constrains the creation of multiple diverse responses. Using a human-created multi-response dataset, we find significant support for the hypothesis that speech acts provide a signal for the diversity of the set of next responses. Building on this result, we propose a new human evaluation task where creative writers predict the extent to which conversations inspire the creation of multiple diverse responses. Our studies find that writers’ judgments align with the Pragmatically Appropriate Diversity of conversations. Our work suggests that expectations for diversity metric scores should vary depending on the speech act.

1 Introduction

While many different utterances can continue a conversation, some types of responses are more appropriate than others. This work begins with the assumption, based on the linguistic pragmatics literature, that the speech acts associated with a conversation can constrain the responses which are appropriate. In particular, some speech acts are part of adjacency pairs, where a speaker produces a speech act utterance which is expected to be immediately followed in the conversation by the other speaker’s paired speech act utterance Levinson (1983). For example, a Closing utterance by one speaker (such as “Goodbye”) is typically followed by another Closing utterance from the other speaker Schegloff and Sacks (1973).

Speech acts can constrain a conversation even if not part of a strict adjacency pair. In relation to how speech acts influence conversational structure, Searle and Vanderveken (1985) state:

The key to understanding the structure of conversations is to see that each illocutionary act creates the possibility of a finite and usually quite limited set of appropriate illocutionary acts as replies. Sometimes the appropriate illocutionary act reply is very tightly constrained by the act that precedes it, as in question and answer sequences; and sometimes it is more open, as in causal conversations that move from one topic to another … each illocutionary act in a conversation creates and constrains the range of appropriate illocutionary responses.

Conversation A

Speaker 1: Are you free tonight?

Writer-Generated Responses

\bullet

No, not this evening. Can we try for tomorrow night?

\bullet

Let’s see, I’m free around 8pm. Will that work?

\bullet

Yes! What are we getting into?

\bullet

Well that depends on what you have planned.

\bullet

Didn’t we talk about this already? I have a work event tonight.

Conversation B

Speaker 1: Please come in and sit down. I’m happy to finally meet you.

Speaker 2: Same here, Ms. Drake. I’ve been looking forward to this.

Speaker 1: I’m sorry I kept you waiting.

Writer-Generated Responses

\bullet

no problem at all!

\bullet

no worries. how are you?

\bullet

I was just reviewing some files so it’s not a problem.

\bullet

you are just in time!

\bullet

Let’s get started.

Figure 1: Two conversations from DailyDialog++ with sets of responses from writers (Section 4). When considering different ways the speaker could continue the conversation, the Pragmatically Appropriate Diversity is higher for Conversation A (Yes-No Question) than for Conversation B (Apology).

Consider the two conversations presented in Figure 1 from DailyDialog++, a conversational dataset with multiple human-provided responses Sai et al. (2020); Li et al. (2017). We would expect the linguistic diversity of a set of responses for Conversation A to be higher than Conversation B because Conversation B contains a constraining utterance, an apology, as the most recent turn in the conversation. Speech act constraints dictate there are a limited number of ways to respond to an apology utterance and remain pragmatically appropriate. In fact, the first four writer-generated responses to this conversation include some form of Acceptance or Minimization. By contrast, Conversation A ends with a question. Although the question constrains the next response to contain an answer, the answer’s content can be expressed in many different ways.

Relevancy metrics and perplexity are common ways to measure quality and fluency of model-generated dialogue responses, respectively. Additionally, because models over-generate typical responses, another important evaluation criterion is the ability to produce diverse responses, which capture multiple distinct, interesting ways a conversation can be continued. Diversity can be measured using a variety of automatic metrics Li et al. (2016); Stasaski and Hearst (2022); Tevet and Berant (2021); Reimers and Gurevych (2020); Cao and Clark (2017); Zhu et al. (2018); Larson et al. (2019). However, these metrics do not vary given the conversation’s properties; a model’s output would be evaluated the same way whether in response to either conversation from Figure 1. Speech act theory suggests that next turns in conversations are constrained, but currently, evaluations of chitchat dialogue models do not incorporate these constraints explicitly. Our work explores whether automatic linguistic diversity metrics capture pragmatic constraints on appropriate responses.

To address this, we introduce the notion of Pragmatically Appropriate Diversity (PA Diversity), defined as the extent to which a conversation’s speech acts create and constrain the creation of multiple diverse responses. To explore this concept systematically, we conduct two analyses. First, we examine both human-labeled and automatically-labeled speech acts of a human-generated multi-response conversational dataset. We hypothesize that the most recent speech act utterance constrains the diversity of responses. We discover that automated diversity metrics, particularly NLI Diversity Stasaski and Hearst (2022), detect differences among the most-recent speech act utterances.

We next propose a novel human evaluation task: rating the extent to which a conversational prompt inspires the creation of multiple diverse responses. We choose to use creative writers as participants as past work finds that experts are reliably able to judge the quality of creative products while non-experts are not able to do this Amabile (1982); Baer et al. (2004). Our study finds significant differences among writers’ PA Diversity ratings for different speech acts. We also find that these differences align with both the measured differences from the original dataset and our hypotheses about which speech acts cause more diverse responses. To our knowledge, our findings are the first to combine speech acts with diversity-based assessment of neural conversational dialogue systems.

Our contributions in this paper are:

•

The notion of PA Diversity, the extent a conversation’s speech acts creates and constrains creation of multiple diverse responses,
•

An automatic analysis with manually-assigned and automatically-labeled speech acts, finding that the automatic NLI Diversity metric varies based on the most recent speech act and align with our hypotheses, and
•

A novel human-centered study, finding that expert creative writer judgments of PA Diversity align with NLI Diversity values.

DailyDialog++ Speech Act Utterance Definitions and Examples
Inform	Definition: “the speaker provides the addressee certain information which he believes the addressee not to know or not to be aware of, and which he assumes to be correct”
	Example: I could not relocate in the next year, but might be open to it in the future.
Question	Definition: “the speaker wants to know something, which he assumes the addressee to know, and puts pressure on the addressee to provide this information”
	Example: Is there any extra pay for that?
Directive	Definition: “concerned with the speaker’s wish that the addressee performs an action”
	Example: Oh, thank you . I want one.
Commissive	Definition: ‘capture the speaker’s commitments to perform certain actions”
	Example: You are probably right . I’ll go right now and apologize . I try hard not to be late but it is difficult with Beijing traffic.
SWBD Speech Act Utterance Examples (Selected Subset)
Statement Non Opinion	Actually, Costa Rica isn’t in South America . It’s in Central America.
Yes-No Question	Do you mind if I put my jacket there?
Wh Question	What materials do they need in order to apply the passport?
Action Directive	Let me see . $200 makes RIB 1,616 . Here is the cash and exchange memo . Please check it.
Statement Opinion	Well, a good teacher makes good students.
Conventional Closing	Ok . Goodbye.
Open Question	How do you feel about it?
Offers, Options, Commits	Well, I’ll call you immediately for instructions on the matter.
Thanking	Thank you for your encouragement.
Apology	Oh, I am so sorry.

Table 1: Speech acts from DailyDialog and SWBD along with a sample conversational utterance from DailyDialog++. DailyDialog++ speech act definitions are quoted from Amanova et al. (2016). For brevity, we only include one turn for each speech act instead of the entire conversation (full conversations can be seen in Appendix A).

2 Related Work

We place this work in the context of dialogue diversity, pragmatics, and creativity evaluation.

2.1 Diversity Metrics for Dialogue

Several automatic metrics have been proposed to measure the diversity of dialogue model responses. One category of metrics assumes the dialogue model generates one response for each conversation in the test set (the Test Set Diversity setting). A frequently used metric is distinct-n, which measures the number of distinct n-grams across all generated sequences Li et al. (2016). Other metrics measure the number of distinct responses over the test set Cao and Clark (2017) or compute BLEU score among the set of model-generated responses Zhu et al. (2018). Other work measures the diversity of a set of model responses (the Multi-Response Diversity setting) Zhang et al. (2019); Tevet and Berant (2021). This setting measures the model’s ability to generate multiple diverse responses to a single conversation.

We utilize two automatic diversity metrics in the Multi-Response setting. The first metric calculates a diversity score based on a Natural Language Inference (NLI) model’s predictions among pairs of utterances Stasaski and Hearst (2022). A contradiction prediction is given weight of 1, neutral is 0, and entailment is -1. The set of scores for all pairwise combinations are summed to produce the final NLI Diversity score. The second metric uses Sent-BERT Tevet and Berant (2021), based on the similarity of sentence-level BERT embeddings Reimers and Gurevych (2019). We calculate the average pairwise cosine similarity between responses representations, subtracted from 1 to convert the metric from similarity to diversity.

2.2 Crowdworker Diversity Evaluation

Past work has used crowdworkers to indirectly measure diversity of model responses, by having workers rate specificity Li et al. (2016) or interestingness See et al. (2019) of responses. Tevet and Berant (2021) explore whether crowdworkers can directly judge diversity of a set of model responses, finding that crowdworkers are able to do this reliably for semantic diversity. We draw on this work to explore whether crowdworkers are able to determine the PA Diversity of input conversations, rather than the diversity of a set of model responses. In contrast to this work, we rely on an expert crowdworking population of creative writers.

2.3 Conversation Analysis and Speech Acts

We draw from the fields of conversational analysis Schegloff and Sacks (1973) and pragmatics Levinson (1983), with a focus on speech acts. In particular, we consider the notion of adjacency pairs (e.g. question-answer, greeting-greeting, and apology-minimization): the combination of two sequential speech acts produced by different speakers, in which the first speech act requires the production of the second Levinson (1983).

Throughout this work, we make use of the DailyDialog dataset, which includes manually-labeled high-level speech acts, consisting of one of Inform, Question, Directive, or Commissive Li et al. (2017) (see Table 1 for examples). Because the DailyDialog speech acts are general, we also make use of classifier-assigned fine-grained speech act categories from the Switchboard Dialogue Act Corpus Jurafsky et al. (1997), which are an extension of the DAMSL coding scheme Core and Allen (1997) (see Table 1 for examples).

2.4 Evaluating Creativity

For our human study, we draw from related work in the measurement of creativity. In particular, we use the Creative Assessment Technique (CAT), which establishes a framework to assess creative output Amabile (1982). This process engages experts from creative fields to independently judge the creativity of different output (e.g., stories, artwork). Usually this framework is deployed to evaluate creative output produced for the same prompt. However, Baer et al. (2004) showed that CAT produces consistent judgments for products generated in response to different prompts.

This line of research inspires our work in two ways. First, we use expert creative writers instead of typical crowdworkers because Kaufman et al. (2008) found that nonexperts produce inconsistent creative judgments. We also treat each dialogue as a creative prompt and ask writers to rate the extent the prompt inspires the creation of diverse responses. This task is suggested as future work in Baer et al. (2004), but to the best of our knowledge, we are the first to apply it in this manner.

3 Speech Act Analysis

In order to explore whether diversity of sets of responses changes based on the input conversation, we choose the DailyDialog++ dataset, which has two important properties: (i) multiple human responses for each conversation, and (ii) high-level speech acts manually assigned to each utterance. In this section, we analyze this data using diversity metrics to test hypotheses about PA Diversity.

3.1 Dataset

The DailyDialog dataset was scraped from online English learner conversations and has manually-assigned speech act labels Li et al. (2017). DailyDialog++¹¹1https://iitmnlp.github.io/DailyDialog-plusplus/ augments the original conversations with multiple responses from human annotators. To utilize speech labels, we work with DailyDialog++ data which exactly aligns with the original DailyDialog conversations (9,192 conversations out of the 9,258 in the DailyDialog++ training partition), each with 5 human-generated next responses.

3.2 Analysis: Human-Labeled Speech Acts

Speech Act	NLI Diversity $\uparrow$	Sent-BERT Diversity $\uparrow$	Num. Convos.
Inform	4.43	0.72	3875
Question	5.72	0.71	2724
Directive	5.20	0.72	1845
Commissive	4.29	0.71	748

Table 2: Diversity for multi-response sets responding to different speech acts along with the number of conversations. For both metrics, higher values indicate higher level of diversity.

We hypothesize that the most recent speech act utterance in a conversation will influence the diversity of the set of 5 responses. We begin by exploring the original DailyDialog labeled speech acts, which fall into one of four categories based on the structure proposed in Amanova et al. (2016): Inform, Question, Directive, and Commissive Li et al. (2017). Table 1 contains examples of each speech act with corresponding definitions.

DailyDialog speech acts are manually assigned to each turn in the conversation; however, for this work we are most interested in the most recent turn. An assumption that we make throughout this work is that there is one speech act for each utterance, although utterances can in fact have multiple associated speech acts Levinson (1983).

3.2.1 Hypotheses

We explore whether diversity metrics detect how the 5 human-generated responses change based on the most recent speech act utterance. We hypothesize that Question and Inform will produce more diverse sets of responses than Directive and Commissive, which center around an action which likely constrains the set of responses. Although Question is part of an adjacency pair, we hypothesize the content of the question can allow for more diverse responses than Directive and Commissive.

3.2.2 Results

To analyze our hypotheses, we examine the diversity distributions for the five multi-response sets for each of the four speech acts. For each response set, we calculate a diversity score using either NLI Diversity or Sent-BERT Diversity. We report the average diversity scores using both metrics for each speech act in Table 2.

While the Sent-BERT diversity scores are similar for all categories, we find larger differences in NLI Diversity between the speech act categories. In particular, Question is the most diverse, as expected. Commissive is the least diverse for both metrics, supporting our hypothesis that it would yield less diversity than Question and Inform.

To test statistical significance, we run a Kruskal-Wallis test Kruskal and Wallis (1952) with a Dunn posthoc Dunn (1964) and a Bonferroni adjustment to account for multiple pairwise comparisons Haynes (2013). For NLI Diversity, we note significant (p $<$ 0.05) differences among all pairwise categories except for Inform + Commissive. For Sent-BERT Diversity, we note significant (p $<$ 0.05) differences only among Inform + Commissive, Question + Commissive, and Directive + Commissive.

3.3 Analysis: Model-Labeled Speech Acts

The speech act categories for DailyDialog++ are human-labeled but quite broad. Since we are interested in investigating finer-grained categories, e.g. different types of questions, we also consider a speech act set containing 42 categories: Switchboard SWBD-DAMSL Jurafsky et al. (1997); Core and Allen (1997). While there are domain differences between DailyDailog++ and Switchboard, both deal with social conversations and we believe have similar speech acts represented. Additionally, a manual examination of 100 classifications for our speech act subset yields 85 correct predictions.

Refer to caption — Figure 2: NLI Diversity (top) and Sent-BERT (bottom) for responses categorized by most-recent speech act utterance (higher values indicate more diverse, ordered by diversity). Mean values are indicated by the white circle and corresponding text label. Box-and-whisker plots show the interquartile ranges.

3.3.1 Classification Model

In order to apply the SWBD classes to the DailyDialog++ dataset, we rely on a speech act classification model. We use the baseline RoBERTa-based classification model (125M parameters) from He et al. (2021)²²2https://github.com/zihaohe123/speak-turn-emb-dialog-act-clf, training for 9 hours on a Titan RTX GPU. Our trained model achieves accuracy of 0.82 and 0.83 when evaluated with gold labels of the SWBD test and validation sets, respectively.

For each conversation in DailyDialog++, we pass the most recent conversational turn to this classifier, obtaining a speech act classification of one of the 42 categories in SWBD. We restrict our analysis to 10 acts which are most prevalent in DailyDialog++ (predicted at least 100 times by the classifier across the training set). We exclude the “Continued” speech act, because it is specific to speech acts continued from a past turn in SWBD and is not applicable to the DailyDialog++ dataset.

3.3.2 Hypotheses

We hypothesize that more constraining speech acts, such as apology, thanking, and conversational closing, will have lower diversity. Closing-closing, thanking-acceptance, and apology-minimization are all adjacency pairs with constraining responses Levinson (1983). Although question-answer is also an adjacency pair, we expect the answer content to contain more diverse content. Statements of Opinion and Non-Opinion do not generally constrain responses, so we expect diversity to vary.

3.3.3 Results

The NLI Diversity of DailyDialog++ responses for SWBD categories can be seen in Figure 2 (top). The highest diversity is seen in response to Yes-No and Open Questions, as expected. Additionally, as expected, the Thanking and Closing categories result in the lowest NLI Diversity. Unexpectedly, Apology has a higher diversity than both Statement of Opinion and Non Opinion ; we explore this finding in Section 4. It is also surprising that Yes-No Question falls much higher than Wh Question, likely due to the emphasis on contradictions of the NLI Diversity metric. In Appendix B we report significance results, finding more significant differences among speech acts using NLI Diversity compared to Sent-BERT.

4 Study with Creative Writers

To explore whether expert judgments of diversity corresponded to the significant NLI Diversity differences observed in Section 3.3, we designed a study to measure PA Diversity from the prior conversation using a novel human evaluation task.

4.1 Study Design

4.1.1 Participants

Because the task involves judging the creativity of input conversational prompts, we followed recommendations of Kaufman et al. (2008) and work with expert writers instead of a typical crowdworking population. To this end, we employed participants who have experience with creative writing, screenwriting, or playwriting from the Upwork platform, prioritizing those who had past Upwork or professional writing experience. Across all experiments, we worked with a total of 28 writers (two participants repeated one task each on different data). Participants were compensated at a rate of $20 per survey, estimated to take one hour or less.

4.1.2 Task Design

Participants completed five sections in the order: Writing, Drag-and-Drop, Likert, Drag-and-Drop, Likert. The conditions for each task are:

Writing Task: Participants are instructed to generate “unique, interesting, and appropriate responses to each dialogue conversation.” Participants generate 5 responses each for four conversations (one from each of the four speech acts in the set), presented in a random order.

Drag-and-Drop Task: Participants are presented with four randomly-ordered conversations (one from each speech act) and asked to drag and drop them such that the top conversation “most inspires the creation of multiple distinct responses” and the bottom conversation “least inspires this.”

Likert Rating Task: While the drag-and-drop requires participants to rank conversations against one another, we are also interested in their assessments of conversations in isolation. Thus, in this task we ask participants to rate 20 conversations on a 5-point Likert scale, where 1 represents “Does not Inspire Creative Responses” and 5 represents “Does Inspire Creative Responses.” We randomize the presentation of the conversations.

4.1.3 Stimuli Creation

To examine whether the human writings and judgments can uncover differences between different speech acts, we chose two sets of SWBD speech acts with varying levels of NLI Diversity from Section 3.3.3. Set 1 consists of: Yes-No Question, Wh Question, Thanking, and Apology. We hypothesize that the Question acts will result in higher PA Diversity ratings while Thanking and Apology will result in lower ratings, based on trends observed in Section 3.3.3. This set is chosen to display the largest predicted differences between speech acts.

Set 2 consists of: Open Question, Opinion Statement, Non-Opinion Statement, and Closing. We hypothesize that Open Question and Opinion will have higher diversity ratings and Non-Opinion and Closing will have lower ratings. However, these speech acts were not the most or least diverse. Therefore, we expect these rating differences to be less pronounced than Set 1’s.

We intentionally selected conversations from DailyDialog++ whose NLI Diversity scores fell into the median range for the speech act, to ensure the data collected was from prototypical speech act conversations, not outliers. In cases for which sampling from only the median would not result in enough conversations for all surveys, we increased this window to include +/- 3 around the median NLI Diversity value. We manually verified the most-recent turn was classified in the correct speech act category, removing misclassifications.

4.1.4 Study Conditions

Each participant completed 52 tasks on different conversations, evenly distributed among 4 speech acts. We constructed 6 Qualtrics surveys which collectively covered 156 conversations (39 per speech act). We recruited five participants for each survey, resulting in 30 completed surveys.

4.2 Study Results

This section presents results for the creative writer study. For the writing task, we find differences that align with our hypotheses; however, significance results are limited. For both the Drag-and-Drop comparison tasks and the Likert ratings, we find significant differences for Set 1, which align with our hypotheses. Results for Set 2 generally support the hypotheses, with some caveats.

4.2.1 Writing Task

	Speech Act	NLI Diversity $\uparrow$	Sent-BERT Diversity $\uparrow$
Set 1	Yes-No Question	7.3	0.74
	Wh Question	4.8	0.73
	Apology	2.3	0.65
	Thanking	2.6	0.66
Set 2	Open Question	6.2	0.77
	Opinion	3.9	0.74
	Non-Opinion	6.0	0.74
	Closing	4.7	0.74

Table 3: Average NLI and Sent-BERT Diversity among sets of participant-produced responses. Speech acts are ordered within each set based on our hypothesized diversity level, from most to least diverse.

We begin by measuring the diversity of the sets of 5 responses produced in the Writing component of the survey. Table 3 reports the average NLI and Sent-BERT diversity of these produced responses; Appendix C contains responses. Our analysis with responses produced by creative writers replicates the results we saw with the DailyDialog++ responses: For Set 1, Yes-No and Wh are more diverse than Apology and Thanking. For Set 2, Open Question is more diverse than Closing, also as expected. Unexpectedly, Non-Opinion and Closing are more diverse than Opinion using NLI Diversity, but the difference is relatively small.

For the NLI Diversity of the Set 1 responses, we find significant results (p $<$ 0.05) using Kruskal-Wallis with a Dunn posthoc with Bonferroni adjustment between Yes-No Question + Apology. Although a Kruskal-Wallis test yields p $<$ 0.05 for Set 1 + Sent-BERT, a Dunn posthoc with Bonferroni adjustment does not yield significant pairwise results. For Set 2, for both metrics, the initial Kruskal-Wallis test did not show significance so we did not compute pairwise results. A possible explanation for the lack of significance for Sent-BERT is that we selected data based on the median NLI Diversity values of response sets; however, future work could replicate this experiment with median Sent-BERT conversations to see if significance emerges.

4.2.2 Drag-and-Drop Task

Results from the drag-and-drop section can be seen in Figure 3. For Set 1, the highest-ranked speech act is Yes-No Question, followed by Wh-Question, Apology, and Thanking, which confirms our observed NLI Diversity for each speech act from Section 3.3.3. A Friedman’s significance test Friedman (1940) with Nemenyi posthoc Nemenyi (1963) finds that Yes-No Question, Wh-Question, and Apology are rated significantly higher than Thaking, confirming our hypothesis.

For Set 2, we observe the Closing speech act results in the lowest ratings. Surprisingly, the Opinion conversations were rated more diverse than the Open Question conversations. Using a Friedman’s significance test with Nemenyi posthoc, however, we only find significant differences between Open Question + Closing and Opinion + Closing. This aligns with our hypothesis that Closing conversations constrain the diversity of responses.

4.2.3 Likert Rating Task

Likert ratings for each set of speech acts are shown in Figure 4; Appendix D contains scored conversations. For Set 1, we note highest ratings among Yes-No and Wh-Questions, lowest ratings with Thanking, and middle ratings with Apology. This is similar to the drag-and-drop results (Section 4.2.2). A Friedman’s significance test with Nemenyi posthoc yields significance (p $<$ 0.05) among all pairwise speech acts except for Yes-No + Wh Question.

For Set 2, ratings also aligned with our hypotheses, with Open Question, Opinion, Non-Opinion, and Closing ordered from most- to least-diverse. Contrary to the drag-and-drop results in Section 4.2.2, we find Open Questions are rated higher than Opinion conversations in the Likert section. We find significant differences (p < $0.05$ ) between all pairwise speech act combinations except for Open Question + Opinion and Opinion + Non-Opinion. In Appendix E we explore the correlation between Likert ratings and diversity metrics, finding a moderately positive correlation with NLI Diversity and a very weak positive correlation with Sent-BERT.

5 Discussion and Future Work

We found that controlling for speech act type resulted in significant differences in PA Diversity using automatic diversity metrics (Section 3.3.3) as well as creative writing participant judgments (Section 4.2). In particular, certain types of speech act utterances (Thanking, Apology, and Closing) were consistently judged less likely to produce diverse responses than other speech act utterances.

Implications. One important implication of this work is the finding that diversity is sensitive to the pragmatics of the utterance being responded to. This paves the way for pragmatics to be incorporated into model evaluation, e.g. by defining the PA Diversity for conversations in the evaluation dataset and then assessing whether a model’s diversity correlates with PA Diversity scores.

These results have implications for dialogue model generation in addition to evaluation. Perhaps a simpler, rule-based system should be used for pragmatically-constraining conversations, relying on a neural network generation system only for conversations with high PA Diversity. It is important to note that conversational creativity varies; one participant in the study rated every conversation with a score of 5, and wrote: “I tend to think creative options are without limit, especially with limited context.” When creating and assessing dialogue systems, we should consciously choose if we are creating a witty Oscar Wilde or a conversational partner who follows pragmatic norms.

Another important finding of this work is that overall, the NLI Diversity metric, which explicitly incorporates contradiction and entailment predictions, was more distinguishing than the Sent-BERT measure, which supports results found in Stasaski and Hearst (2022).

Future Work. Our findings examined eight different speech acts in depth, but additional common speech acts can be explored, such as Action Directives. While we found significant differences among diversity relating to the most recent speech act utterance, it is possible that the other speech acts in the conversation influence diversity. Future work should also examine the effects of other variables, such as length or the topic of conversation.

To circumvent the need for expert labeling of speech acts, future work should investigate automatically predicting the 5-point PA Diversity rating from expert data. A baseline approach to this task could use the median for the speech act utterance.

6 Conclusion

We introduce Pragmatically Appropriate Diversity, the hypothesis that dialogue diversity will vary based on the speech act type of the input conversation. An analysis finds significant differences for automatic diversity metrics among both human-labeled speech acts from DailyDialog++ and fine-grained speech acts automatically assigned by a SWBD model. We use these findings to create a new human evaluation task, to explore whether the assessments by creative writers align with a conversation’s PA Diversity. We find that writer rankings correspond with our hypotheses, paving the way for PA Diversity to be incorporated into future dialogue evaluation and generation.

Limitations

Our experiments with speech acts are all conducted with human-generated dialogue, instead of model-generated dialogue. While this design decision allows us to demonstrate that Pragmatically Appropriate Diversity differs even among human responses, it means that our results do not include model-specific generation issues, such as hallucinating information from training data Roller et al. (2021) or their ability to generate harmful data in response to adversarial attacks Xu et al. (2021). However, it is important to verify that PA Diversity can detect differences with human-generated dialogue data before exploring neural responses; that can be done in future work. Our approach can also be combined with other work in dialogue which aims to address hallucination and harmful data production.

While our work provides evidence that Pragmatically Appropriate Diversity differs based on the most recent speech act utterance and that this should potentially be incorporated into diversity evaluation of dialogue models, we do not assess it in a downstream application, leaving this to future work.

Additionally, our work only covers a subset of speech acts found in in SWBD, which can be expanded in future work. We are also working with one conversational dataset because it had multiple human-generated responses along with speech act labels; additional experimentation should expand this to other datasets.

We conduct experimentation with a conversational English dataset and motivate our work using pragmatic literature grounded in English interactions. It is possible that the Pragmatically Appropriate Diversity of other languages would differ from English Pragmatically Appropriate Diversity, especially in terms of the particular speech acts and adjacency pairs given documented differences in other languages Huth and Taleghani-Nikazm (2006).

Acknowledgements

This work was supported by an AWS Machine Learning Research Award, an NVIDIA Corporation GPU grant, an AI2 Key Scientific Challenge Proposal grant, and a National Science Foundation (NSF) Graduate Research Fellowship (DGE 1752814). We thank the 28 writers for their participation as well as Philippe Laban, Nate Weinman, and the Hearst Lab Research Group for their helpful comments.

References

Amabile (1982) Teresa M. Amabile. 1982. Social psychology of creativity: A consensual assessment technique. Journal of personality and social psychology, 43(5):997–1013.
Amanova et al. (2016) Dilafruz Amanova, Volha Petukhova, and Dietrich Klakow. 2016. Creating annotated dialogue resources: Cross-domain dialogue act classification. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 111–117, Portorož, Slovenia. European Language Resources Association (ELRA).
Baer et al. (2004) John Baer, James C. Kaufman, and Claudia A. Gentile. 2004. Extension of the consensual assessment technique to nonparallel creative products. Creativity Research Journal, 16(1):113–117.
Cao and Clark (2017) Kris Cao and Stephen Clark. 2017. Latent variable dialogue models and their diversity. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 182–187, Valencia, Spain. Association for Computational Linguistics.
Core and Allen (1997) Mark G. Core and James F. Allen. 1997. Coding dialogs with the damsl annotation scheme. In Working Notes of the AAAI Fall Symposium on Communicative Action in Humans and Machines, pages 28–35, Cambridge, MA.
Dunn (1964) Olive Jean Dunn. 1964. Multiple comparisons using rank sums. Technometrics, 6(3):241–252.
Friedman (1940) Milton Friedman. 1940. A comparison of alternative tests of significance for the problem of m rankings. The Annals of Mathematical Statistics, 11(1):86–92.
Haynes (2013) Winston Haynes. 2013. Bonferroni Correction, pages 154–154. Springer New York, New York, NY.
He et al. (2021) Zihao He, Leili Tavabi, Kristina Lerman, and Mohammad Soleymani. 2021. Speaker turn modeling for dialogue act classification. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 2150–2157, Punta Cana, Dominican Republic. Association for Computational Linguistics.
Huth and Taleghani-Nikazm (2006) Thorsten Huth and Carmen Taleghani-Nikazm. 2006. How can insights from conversation analysis be directly applied to teaching l2 pragmatics? Language Teaching Research, 10(1):53–79.
Jurafsky et al. (1997) Daniel Jurafsky, Elizabeth Shriberg, and Debra Biasca. 1997. Switchboard SWBD-DAMSL shallow-discourse-function annotation coders manual, draft 13. Technical Report 97-02, University of Colorado, Boulder Institute of Cognitive Science, Boulder, CO.
Kaufman et al. (2008) James C. Kaufman, John Baer, Jason C. Cole, and Janel D. Sexton. 2008. A comparison of expert and nonexpert raters using the consensual assessment technique. Creativity Research Journal, 20(2):171–178.
Kruskal and Wallis (1952) William H. Kruskal and W. Allen Wallis. 1952. Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, 47(260):583–621.
Larson et al. (2019) Stefan Larson, Anish Mahendran, Andrew Lee, Jonathan K. Kummerfeld, Parker Hill, Michael A. Laurenzano, Johann Hauswald, Lingjia Tang, and Jason Mars. 2019. Outlier detection for improved data quality and diversity in dialog systems. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 517–527, Minneapolis, Minnesota. Association for Computational Linguistics.
Levinson (1983) Stephen C. Levinson. 1983. Pragmatics. Cambridge Textbooks in Linguistics. Cambridge University Press.
Li et al. (2016) Jiwei Li, Michel Galley, Chris Brockett, Jianfeng Gao, and Bill Dolan. 2016. A diversity-promoting objective function for neural conversation models. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 110–119, San Diego, California. Association for Computational Linguistics.
Li et al. (2017) Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A manually labelled multi-turn dialogue dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 986–995, Taipei, Taiwan. Asian Federation of Natural Language Processing.
Nemenyi (1963) Peter Bjorn Nemenyi. 1963. Distribution-free multiple comparisons. Ph.D. thesis.
Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3982–3992, Hong Kong, China. Association for Computational Linguistics.
Reimers and Gurevych (2020) Nils Reimers and Iryna Gurevych. 2020. Making monolingual sentence embeddings multilingual using knowledge distillation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 4512–4525, Online. Association for Computational Linguistics.
Roller et al. (2021) Stephen Roller, Emily Dinan, Naman Goyal, Da Ju, Mary Williamson, Yinhan Liu, Jing Xu, Myle Ott, Eric Michael Smith, Y-Lan Boureau, and Jason Weston. 2021. Recipes for building an open-domain chatbot. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 300–325, Online. Association for Computational Linguistics.
Sai et al. (2020) Ananya B. Sai, Akash Kumar Mohankumar, Siddhartha Arora, and Mitesh M. Khapra. 2020. Improving dialog evaluation with a multi-reference adversarial dataset and large scale pretraining. Transactions of the Association for Computational Linguistics, 8:810–827.
Schegloff and Sacks (1973) Emanuel A Schegloff and Harvey Sacks. 1973. Opening up closings. Semiotica, 8(4).
Searle and Vanderveken (1985) John R. Searle and Daniel. Vanderveken. 1985. Foundations of Illocutionary Logic. Cambridge University Press Cambridge ; New York ; Melbourne.
See et al. (2019) Abigail See, Stephen Roller, Douwe Kiela, and Jason Weston. 2019. What makes a good conversation? how controllable attributes affect human judgments. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1702–1723, Minneapolis, Minnesota. Association for Computational Linguistics.
Stasaski and Hearst (2022) Katherine Stasaski and Marti Hearst. 2022. Semantic diversity in dialogue with natural language inference. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 85–98, Seattle, United States. Association for Computational Linguistics.
Tevet and Berant (2021) Guy Tevet and Jonathan Berant. 2021. Evaluating the evaluation of diversity in natural language generation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 326–346, Online. Association for Computational Linguistics.
Xu et al. (2021) Jing Xu, Da Ju, Margaret Li, Y-Lan Boureau, Jason Weston, and Emily Dinan. 2021. Bot-adversarial dialogue for safe conversational agents. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2950–2968, Online. Association for Computational Linguistics.
Zhang et al. (2019) Xinyuan Zhang, Yi Yang, Siyang Yuan, Dinghan Shen, and Lawrence Carin. 2019. Syntax-infused variational autoencoder for text generation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2069–2078, Florence, Italy. Association for Computational Linguistics.
Zhu et al. (2018) Yaoming Zhu, Sidi Lu, Lei Zheng, Jiaxian Guo, Weinan Zhang, Jun Wang, and Yong Yu. 2018. Texygen: A Benchmarking Platform for Text Generation Models, page 1097–1100. Association for Computing Machinery, New York, NY, USA.

Appendix A Most Recent Speech Act Utterances with Full Conversations

Table 4 contains full conversations where the most recent utterance is classified as different speech acts.

DailyDialog++ Speech Act Utterance Example Conversations
Inform	Speaker 1: I have children so a steady job is important but I would like a chance to advance. Speaker 2: How would you feel about relocating to another state? Speaker 1: I could not relocate in the next year, but might be open to it in the future.
Question	Speaker 1: Do I often have to work overtime? Speaker 2: Yes, you have to work overtime a lot due to the editing job. Speaker 1: Is there any extra pay for that?
Directive	Speaker 1: What’s the most popular paper circulating in the community? Speaker 2: Atlanta Daily. Speaker 1: Oh, thank you . I want one.
Commissive	Speaker 1: Don’t take any chances . It’d be best if you told him and promised not to be late again . He’s already mad at you for last 2 times . Any more and he might fire you. Speaker 2: You said it . It won’t happen again . Do you really think he’d fire me? Speaker 1: I think he might . You’d better go to his office. Speaker 2: You are probably right . I’ll go right now and apologize . I try hard not to be late but it is difficult with Beijing traffic.
SWBD Speech Act Utterance Example Conversations (Selected Subset)
Statement Non Opinion	Speaker 1: Where are you from, Laura? Speaker 2: Well, my whole family is in the United States now, but we’re from Costa Rica originally. Speaker 1: Oh, so you’re from South America. Speaker 2: Actually, Costa Rica isn’t in South America . It’s in Central America.
Yes-No Question	Speaker 1: Excuse me . Is anyone in that seat next to you? Speaker 2: No, I don’t think so. Speaker 1: Do you mind if I put my jacket there?
Wh Question	Speaker 1: Let me ask some questions . Do you know what people need to get entry into China? Speaker 2: Yes . An passport and a visa are necessary for entry into China. Speaker 1: What materials do they need in order to apply the passport?
Action Directive	Speaker 1: Would you mind showing me your passport? Speaker 2: Here it is . And how much RIB shall I get? Speaker 1: Let me see . $200 makes RIB 1,616 . Here is the cash and exchange memo . Please check it.
Statement Opinion	Speaker 1: Alice, I never knew you had such a lovely voice . You really can sing, can’t you? Speaker 2: Thanks, Mark . I used to be a member of the school choir. Speaker 1: No wonder you can control your voice so well . You are a professional singer. Speaker 2: Well, you are flattering me . I wouldn’t say I am a professional, but I did receive some training at school . My music teacher used to be a professional singer. Speaker 1: Well, a good teacher makes good students.
Conventional Closing	Speaker 1: Well, I must be off now . I have an appointment at six. Speaker 2: In that case, I won’t keep you any longer . Drop in any time. Speaker 1: Ok . Goodbye.
Open Question	Speaker 1: Are you have a hand in locking into the case? Speaker 2: Yes. Speaker 1: How do you feel about it?
Offers, Options, Commits	Speaker 1: I can understand your position . Perhaps I’m asking too much. Speaker 2: We’ll use wooden cases if you insist, but the charge for packing will be considerably higher, and it also slows delivery. Speaker 1: Well, I’ll call you immediately for instructions on the matter.
Thanking	Speaker 1: Oh! That’s much hetter than I did. Speaker 2: Can you tell me how to improve myself? Speaker 1: There’s no secret at all . The only thing for you to do is to practice more . You will succeed. Speaker 2: Thank you for your encouragement.
Apology	Speaker 1: So, Is this your first time to TAIWAN? Speaker 2: No, I first came here 1995. Speaker 1: Oh, really? And you are from the State, right? Speaker 2: Well, I am from Canada . Actually. Speaker 1: Oh, I am so sorry.

Table 4: Full conversations for most recent speech act utterances.

Appendix B SWBD Significance Analysis

We report pairwise difference of means with corresponding significance levels for NLI Diversiy in Figure 5 (top). Looking at a row of this figure, the presence of blue (positive) values indicates the speech act is more diverse; in contrast, the presence of red (negative) values indicates the speech act is less diverse. For example, the “Yes No Question” row has high diversity and the “Thanking” row has low diversity.

Significance for Figure 5 is computed with a Kruskal-Wallis test, Dunn posthoc, and Bonferroni adjustment, which accounts for the amount of pairwise comparisons being made among speech act categories. Significant results have p $<$ 0.05. We note Yes-No Question, Wh-Question, Action Directive, and Open-Question diversity is significantly higher than most other categories while Thanking and Closing are significantly lower. These significance results support our hypotheses about different diversity scores in response to different speech acts.

We additionally compare Sent-BERT diversity scores to NLI scores to see if the differences in diversity for different speech acts are consistent across multiple diversity metrics. Sent-BERT results for SWBD categories can be seen in Figure 2 (bottom). Sent-BERT scores are more similar across categories, with the exception of Closing and Thanking which are lower.

To explore significance among categories for the Sent-BERT comparison, we perform a Kruskal-Wallis test with Dunn posthoc and Bonferroni adjustment. We present mean differences with significance results in Figure 5 (bottom). We note the low (red) rows are Conventional Closing and Thanking, similar to NLI Diversity findings. Overall, we find that Sent-BERT diversity scores give less insight into potential differences between speech act classes than NLI Diversity. Thus, for the next section, we chose to use NLI Diversity to explore the creative writing task.

Appendix C Writing Task Data

Table LABEL:tab:writing_task_data contains conversations with corresponding writer responses for each Set of speech acts. The writer responses are reported in the same order within each Set so the writer responses to different speech acts can be compared.

Table 5: Selected conversations with three crowdworker responses for each Task. Within each Task, the order of response sets is consistent among crowdworkers so responses to different speech acts can be compared.

Speech Act	Past Conversation	Writer-Generated Responses
Yes-No Question	Speaker 1: Honey, where did you buy this skirt? Speaker 2: In the market . Any problem?	$\bullet$ No, no problem. I was just curious. $\bullet$ No, I really like it! $\bullet$ When did you go to the market? $\bullet$ No, but I thought you were going to the grocery store! $\bullet$ Not really, but do you need another skirt. $\bullet$ Which market? $\bullet$ It’s stained right here. You should take it back. $\bullet$ You already have one just like it. $\bullet$ It’s the same as the one she has been looking for. $\bullet$ Why not another color? I don’t like it. $\bullet$ No, it’s just very unique and I had to say something about it! $\bullet$ Do you think there’s a problem with it? $\bullet$ Oh, that makes sense. It has some loose threads. I was wondering if it was a thrift item. $\bullet$ Nope, I was just making small talk. I’m a bit frazzled today and I can’t think of much to say. $\bullet$ Yes, it doesn’t go well with your complexion. Perhaps you should find something in a different color.
Wh Question	Speaker 1: Do you want a taxi? Speaker 2: Yes. Speaker 1: Where to?	$\bullet$ I need to go to the train station. $\bullet$ I would like to go to the mall on 3rd Street. $\bullet$ Can you take me to the airport? $\bullet$ 19822 63rd Street, please. $\bullet$ To the museum near the lake. $\bullet$ The bowling alley down on 8th street, please. $\bullet$ How far east can you go? I’ve got cash to cover it. $\bullet$ I don’t believe that is any of your concern. $\bullet$ I’ll see when he texts me back. $\bullet$ The place we went yesterday. What was it called? $\bullet$ I’m visiting the city with a friend but their flight doesn’t arrive until tomorrow so I’m sightseeing solo today. What’s your favorite area of the city? $\bullet$ I’m headed to work but first I need to stop at the nearest coffee shop. $\bullet$ The nearest nature park, please. I’m in desperate need of fresh air! $\bullet$ I’m not sure. I want to but a few books today but I’m new to the area and wanted to ask a local’s opinion on the best used bookstore. I don’t want to fund any large corporations. Do you know of a place? $\bullet$ The courthouse. I have a fine to pay. Apparently, it’s illegal to leave a couch out on the curb for collection with the garbage. Who knew?!
Apology	Speaker 1: Please come in and sit down . I ’ m happy to finally meet you. Speaker 2: Same here, Ms . Drake . I’ve been looking forward to this. Speaker 1: I ’ m sorry I kept you waiting.	$\bullet$ No worries, I was just finishing up some work. $\bullet$ I was running a little late myself! $\bullet$ It’s okay! I am glad that we are finally able to meet! $\bullet$ Unfortunately, I have another meeting at 1:00 so I don’t have too much time. $\bullet$ I was beginning to wonder if you forgot about our meeting! $\bullet$ It’s not a problem, I didn’t wait very long. $\bullet$ Just don’t keep me waiting again. $\bullet$ He said I should have expected you to be late. $\bullet$ Don’t worry, I know how the trains can be this time of day. $\bullet$ Never mind that. Let’s not waste any more time. $\bullet$ No worries! It’s hectic times these days! $\bullet$ I understand completely. Please, don’t feel guilty about rescheduling. $\bullet$ It’s no bother. I’m busy too but I always have time for you! $\bullet$ Not at all. Now, what would you like to discuss? $\bullet$ Admittedly, it wasn’t a great first impression, but you can make up for that now. What about ordering some lunch for the two of us?
Thanking	Speaker 1: Would you care to make another appointment? Speaker 2: Unfortunately, I’m leaving town rather unexpectedly, and I may be gone for several days. Speaker 1: I see . I can tell Mr.Thomas that you phoned. Speaker 2: Thank you.	$\bullet$ Your welcome. $\bullet$ Is there anything else that I can help with? $\bullet$ What is the number that Mr. Thomas can rea $\bullet$ Please try to call at least a week in advance if you decide to make another appointment. $\bullet$ Would you like to leave a message for Mr. Thomas? $\bullet$ No problem. Have a good rest of your day. $\bullet$ Would you like to make an appointment if he’s available when you return? $\bullet$ He’ll call you to discuss your next appointment. $\bullet$ Is that all for today? $\bullet$ Would you be interested in a phone appointment if he has one? $\bullet$ Is there any other message that you would like me to relay to Mr. Thomas? $\bullet$ You are welcome. Will that be all? $\bullet$ Have a nice day, goodbye. $\bullet$ Of course, have a safe trip! $\bullet$ Would you like to get a call back from Mr. Thomas?
Open Question	Speaker 1: How often? Speaker 2: Three times a year at most . How do you feel about this?	$\bullet$ That’s fine with me. $\bullet$ That sounds a little excessive. How about two? $\bullet$ Only three times a year!? Sign me up! $\bullet$ I don’t think your parents need to come into town that frequently. An annual visit around the holidays should suffice. $\bullet$ It really doesn’t matter to me. Whatever you think is best. $\bullet$ Three times a year seems sporadic; what’s the catch? $\bullet$ How much is the pay again? $\bullet$ I don’t know; would I always be on call? $\bullet$ I’d need to know the exact dates. I will have to check my schedule. $\bullet$ I fell pretty great about it! $\bullet$ That seems like a lot. I don’t even want to go once. $\bullet$ THREE??? $\bullet$ Oh, only three? That’s not so bad. $\bullet$ I wish it would be more often. $\bullet$ How do YOU feel about it?
Opinion	Speaker 1: I’m looking forward to our son’s graduation this weekend. Speaker 2: Yes . So am I . But what will he do after graduation? He really needs to go to college. Speaker 1: Well, dear, we can’t force him to go to college . It’s up to him.	$\bullet$ I understand, but as a parent, I can’t help but worry. $\bullet$ True, but I think he’ll have better career opportunities if he goes. $\bullet$ What do you mean? I thought we were on the same page about this. $\bullet$ I know. I just want him to do whatever makes him happy. $\bullet$ You’re right. Honestly, I’m just ready to have the house back to ourselves. $\bullet$ He’s your son! You can threaten to cut him off if he refuses to do what you want. $\bullet$ Of couse, you can’t force him. But you want the best for him, don’t you? $\bullet$ Children should respect their elders and listen to their sage advice. $\bullet$ Your son is a smart kid; the world deserves to see what he has to offer. $\bullet$ What’s the alternative? A trade? He at least needs to know what direction his life will be taking. $\bullet$ I know, but I’m just worried and want him to have a good future. $\bullet$ I know, but I don’t want to give him all that money we saved up unless it’s for college. $\bullet$ I think that girlfriend of his is influencing his decisions. $\bullet$ That’s true. And he does have a great job here in town if they take him on full time. $\bullet$ I don’t understand why he doesn’t want to go – we both did!
Non-Opinion	Speaker 1: We scored six goals. Speaker 2: I don’t believe it. Speaker 1: And I score three of them. Speaker 2: You are not serious? Speaker 1: We had extra time of course.	$\bullet$ And how many points did the other team have? $\bullet$ That’s a big improvement! I see all the extra practice is paying off. $\bullet$ How much extra time did you have? $\bullet$ It sounds like it was a great game. I’m sorry I missed it. $\bullet$ Finally! You guys have been on a real losing streak. $\bullet$ Extra time should be against the game’s rules! You’re the best goalie this team has seen; I’m proud of you! $\bullet$ Only with extra time would you manage to win. $\bullet$ I can’t believe you scored that many goals with that injury. $\bullet$ That’s why the coach should have made you first string last season $\bullet$ Yeah right! $\bullet$ How much extra time? $\bullet$ How many of your goals were in extra time $\bullet$ How close was the game? $\bullet$ No way.
Closing	Speaker 1: I just dropped in to say good-bye. Speaker 2: What time are you leaving? Speaker 1: I’m going to try to leave by ten. Speaker 2: Take care and give my best to your parents. Speaker 1: Good-bye . Hope to see you soon again next year.	$\bullet$ Ciao! $\bullet$ I hope so too. Feel free to give me a call over the break. $\bullet$ Goodbye! I’ll see you in a few months! $\bullet$ Take care of yourself. $\bullet$ I’ll hit you up on Instagram! $\bullet$ Next year? $\bullet$ I was hoping that we could see each other sooner than that… $\bullet$ You can’t be too busy to spend time with your best friend, can you? $\bullet$ Good-bye; I’ll see you in another year, I guess. $\bullet$ Try to take more time off, will you? We miss you around here. $\bullet$ Yeah! I hope so too! $\bullet$ I hope so too, drive safe! $\bullet$ Good-bye. Will you let me know when you get back? $\bullet$ Okay, we’ll see you next year! $\bullet$ Hopefully sooner than that!

Appendix D Likert Rating Task Data

Table 6 contains conversations from our tasks with the corresponding Likert ratings assigned by creative writers.

Speech Act	Conversation	Ratings
Yes-No Question	Speaker 1: Is that how you feel on the bus? Speaker 2: Uh, well, no, not really, because I haven’t worked in England for a long time, so I haven’t taken a bus for a long time. Speaker 1: So, so you’re British, do you missing them?	3, 4, 4, 5, 5
Wh-Question	Speaker 1: Ant Shirley, it’s being years since we last met . How were you doing in the passing years? Speaker 2: Pretty well . What about you? Speaker 1: Fine . Where are the other guys?	3, 4, 5, 5, 5
Apology	Speaker 1: Can I make preservation now? Speaker 2: Sure . At what time? Speaker 1: 9:45, please . We want a table next to the window. Speaker 2: I am sorry, sir . We are booked until 9:10.	2, 3, 3, 3, 4
Thanking	Speaker 1: Sounds good. Speaker 2: Thanks for your time.	1, 1, 1, 1, 5
Open Question	Speaker 1: Don’t worry, sir . I’ve never cut a customer . Shall I trim your moustache? Speaker 2: Yes, please. Speaker 1: Now, I’ve finished . How do you like it?	2, 4, 5, 5, 5
Opinion	Speaker 1: Tony looks very handsome in the suit. Speaker 2: He prefers suits to jackets. Speaker 1: Judge from his look, he’s a very serious person.	3, 3, 3, 4, 5
Non-Opinion	Speaker 1: Hello! Let me introduce myself . My name is Nancy. Speaker 2: Nice to meet you, I’am Simon . I don’t think I’ Ve seen you around before. Speaker 1: No, I just started working here at IBM, I am in the sale department.	3, 4, 4, 4, 5
Closing	Speaker 1: It’s hard to say . Weather is so changeable in the summer . Please listen to the latest announcement about your flight. Speaker 2: Yes, I will . Thanks a lot . Bye!	1, 1, 1, 2, 5

Table 6: Example conversations from tasks with corresponding Likert ratings from writers.

Appendix E Likert Results Compared to DailyDialog++:

We examine whether participant PA Diversity Likert ratings are correlated to the NLI and Sent-BERT Diversity scores from DailyDialog++ responses. We represent a conversation’s Likert rating by averaging the ratings of the 5 participants who encountered the conversation.

Figure 6 (top) plots averaged participant Likert ratings against NLI Diversity scores for corresponding response sets in DailyDialog++. Note that conversations were selected using median values from each speech act category; therefore, this is not a uniform sample of conversations from DailyDialog++ speech acts. The Spearman’s correlation is 0.43 (p $<$ 0.01), indicating participant ratings are moderately positively correlated with NLI Diversity scores on the DailyDialog++ data.

A corresponding plot for Sent-BERT diversity can be seen in Figure 6 (bottom). The Spearman’s correlation is 0.19 (very weak, p $<$ 0.01), indicating a very weak positive correlation between average Likert ratings and Sent-BERT diversity scores. Future work could investigate if a different conversation sampling strategy would yield a stronger correlation with Sent-BERT diversity scores.