
Persona-centric Metamorphic Relation guided Robustness Evaluation for Multi-turn Dialogue Modelling

Abstract

Recently, dialogue systems have made significant progress thanks to the introduction of training paradigms such as fine-tuning and prompt learning. Personas can serve as prior knowledge for maintaining the personality consistency of a dialogue system, which yields strong accuracy. Nonetheless, the conventional reference-based evaluation method falls short in capturing a model's genuine text comprehension ability and relies heavily on the quality of data annotation. In contrast, metamorphic testing offers deeper insight into a model's distinct capabilities without requiring supplementary annotation labels, furnishing a more comprehensive portrait of the model and exposing flaws concealed by reference-based validation. Consequently, we introduce a persona-centric metamorphic relation construction for metamorphic testing, aimed at evaluating both the persona consistency and the robustness of personalized dialogue models. To this end, this work evaluates several widely used training paradigms, including learning from scratch, pretraining plus fine-tuning, and prompt learning, on personalized dialogue retrieval, to determine whether they are more robust or share the flaws of their predecessors. Under three designed metamorphic relations with consistent outputs, our experimental results reveal that prompt learning shows stronger robustness than training from scratch and fine-tuning. Although the tested retrieval models attain competitively high retrieval accuracy under traditional reference-based validation, they remain fragile and exhibit various unexpected behaviors; thus there is still room for improvement in personalized dialogue retrieval.

Keywords: Multi-turn Response Selection, Retrieval-based Chatbot, Metamorphic Testing



Yanbing Chen1, Lin Li1, Xiaohui Tao2 , Dong Zhou3
1Wuhan University of Technology,
2University of Southern Queensland,
3Guangdong University of Foreign Studies
[email protected], [email protected], [email protected],
[email protected]


1.   Introduction

The fine-tuning and prompt learning paradigms Liu et al. (2023) have made remarkable progress in tackling challenges related to dialogue systems, including dialogue retrieval Wolf et al. (2019); Humeau et al. (2020), generation Gu et al. (2021b); Liu et al. (2022b), and summarization Feng et al. (2022, 2021). Within a dialogue system, retrieval-based response selection plays a pivotal role: it not only synergizes effectively with other modules, such as the generative module, but also exhibits impressive standalone performance. Among these advancements, the integration of prior knowledge, such as persona Zhang et al. (2018); Mazaré et al. (2018); Gu et al. (2019); Zhao et al. (2019a); Liao et al. (2023) and background information Gu et al. (2020b); Rashkin et al. (2021); Wu et al. (2022a, b), into the dialogue retrieval system stands out as a significant breakthrough. This incorporation enhances retrieval accuracy and introduces the task of personalized dialogue retrieval Li et al. (2016); Gu et al. (2021a); Das et al. (2022), in which the system selects an appropriate response from a set of candidates based on the context and the speaker's persona. In multi-turn dialogue models, maintaining personality consistency is often a challenge because the training data comprises diverse speakers and the model lacks explicit long-term memory for each speaker. By incorporating detailed portrait information about the speaker's persona, however, a dialogue retrieval system can effectively ensure consistency in personality, leading to substantial improvements in accuracy and overall system performance.

Although significant advances have been made in multi-turn dialogue retrieval by incorporating persona and proposing various optimization algorithms to enhance dialogue models Gu et al. (2021a); Das et al. (2022); Liu et al. (2022a), the validation methods for these models have not received adequate attention. Currently, most evaluation of dialogue models still relies on reference-based methods, where the model's outputs are compared with ground-truth samples to assess consistency. However, this widely adopted approach has practical limitations. First, its dependence on extensively annotated labels poses substantial demands: for generative dialogues, human evaluation of generated content requires individuals possessing both knowledge and discernment, and for retrieval dialogues, meticulously curated datasets are requisite. These demands translate into labor-intensive efforts, and the resulting limited assessments may fail to unveil real-world application issues effectively. Second, the consistency achieved through reference-based evaluation may result not from a genuine understanding of the underlying text and logical reasoning, but from the model memorizing specific expression patterns or keywords Aspillaga et al. (2020); Ribeiro et al. (2021). Moreover, this approach provides only an overall performance assessment of the model, without revealing specific strengths and weaknesses Chen et al. (2021). Lastly, in the case of personalized dialogue systems, failure to address a system's shortcomings in maintaining a consistent personality can lead users to perceive the system as untrustworthy, eroding confidence in its real-world capabilities.

To tackle this pervasive challenge in the field of natural language processing, researchers have recently turned to metamorphic testing (MT), a technique commonly employed in software engineering. Metamorphic testing breaks the reliance on annotation labels and shifts the focus to comparing the relationships between multiple inputs and outputs using a predefined set of properties known as metamorphic relations (MRs). This approach has been successfully applied to tasks such as sentiment analysis, machine translation, natural language inference, and question answering Manino et al. (2022); Ribeiro et al. (2021); Zhou and Sun (2018); Chan et al. (2022). However, it is worth noting that recent models designed for personalized dialogue retrieval have not undergone systematic testing to demonstrate their robustness under perturbations from persona-based metamorphic relations that expect consistent outputs (details in Sec. 4). Therefore, a validation method is needed to assess the persona-centric language comprehension of these personalized dialogue models and uncover their specific strengths and weaknesses.

In this study, we assess the persona-based robustness of various personalized dialogue retrieval models. Our objective is to examine the capacity of dialogue systems, employing various architectures, to maintain consistent personality in their responses while being exposed to different persona perturbations. Specifically, we categorize personalized dialogue retrieval models into three training paradigms: non-pretraining techniques such as Bi-LSTM Hochreiter and Schmidhuber (1997), fine-tuning approaches based on BERT Devlin et al. (2019), and the increasingly successful paradigm of prompt learning Liu et al. (2023). Additionally, we define the perturbation sensitivity of responses to persona in personalized dialogues as a metamorphic relation. We then conduct metamorphic tests to complement the reference-based validation approach by more specifically assessing each model's robustness against persona-centric metamorphic relations, akin to the equivalence relations investigated in prior studies Manino et al. (2022).

Figure 1: Reference-based Validation

These tests involved introducing noise and replacing synonyms Li et al. (2017) to evaluate the models' performance. We examined various scenarios, including character-level noise perturbations, sentence transformations using synonymous phrases, and variations in partners' personas. For instance, we perturbed each word of the persona by randomly swapping two adjacent characters, yielding variations such as [love the music → lvoe teh muisc].

Our main contributions are summarized as follows: 1) three persona-based metamorphic relations are designed, aiming to ensure personality consistency and probe a conversational model's robustness; 2) to support our claims, a comprehensive quantitative analysis shows that prompt learning exhibits stronger robustness than training from scratch and fine-tuning. Although the tested retrieval models attain competitively high retrieval accuracy under traditional reference-based validation, our persona-centric metamorphic tests affect all retrieval models, albeit to varying degrees.

The dataset, replication package, and detailed results for this paper will be made available online.

2.   Limitation on Reference-Based Validation

In the personalized dialogue retrieval task, the effectiveness of popular methods can be easily assessed by comparing a model's output with the true labels of the samples, and high accuracy indicates strong task performance. Despite its widespread adoption, however, this reference-based method for evaluating personalized dialogue systems has practical limitations.

Self-persona:
- I am afraid of heights.
- I love animals and have two dogs.
- I am Native American and live in Oklahoma.
- I work as an attorney.
- I am not religious.

Partner-persona:
- I love watching movies and TV.
- I have a husband who I love very much.
- I do not like exercise or physical activity.
- My favorite food is ice cream.
- I am a homebody.
Partner  : Hey there. How are you?
Self  : Good, do you like animals?
Partner  : Yeah, I like cats. I have one.
Self  : I have 2 dogs, they are great, where do you work?
Partner  : I stay at home with the kids.
Self  : Are you afraid of heights? I certainly am.
Partner  : No. Do you like TV?
Self  : Sure, I like TV, what do you watch?
Partner  : Really anything, what about you?
Self  : I do not have time to watch TV, I am an attorney, so I work a lot.
Partner  : I am not a very active person.
Self  : I certainly am, I am part Native American, I live here in Oklahoma.
Partner  : Oh. Tell me something about yourself.
Self  : Well, I do not like heights very much and I love animals. What about you?
Partner  : I am a boring person.
Self  : I am not much fun either. So what else is new with you?
Table 1: Persona-chat dataset example. The green denotes sentences that demonstrate self-persona, the blue represents sentences that exemplify partner-persona, and the red represents the context and response of the last turn of the complete conversation.

2.1.   Compulsory Labels

In the case of retrieval dialogues, meticulously curated datasets are crucial: they serve as references for evaluating the system's ability to retrieve relevant responses. Creating such a dataset requires extensive human effort and domain expertise to choose and annotate suitable responses that align with a given context or query. This curation process makes the evaluation of personalized retrieval-based dialogue systems labor-intensive. Additionally, the constraints of human resources and time can lead to limited assessments, which might not fully capture the complexities and challenges that dialogue systems encounter in real applications.

2.2.   Limitation on Capability with Reference-Based Validation

The example illustrated in Figure 1 reveals an important limitation. When we modify the persona to a synonymous sentence with a similar meaning and input it into the model, the model becomes confused, producing an output that deviates from the expected relationship with the original output. This indicates a lack of robustness in the model's performance. Moreover, some studies find that certain models may arrive at correct answers by memorizing specific patterns or keywords rather than through genuine understanding and inference Gardner et al. (2020).

Traditional reference-based validation methods solely focus on reporting the "consistency" of actual outputs with the true labels, and unfortunately, they do not effectively reveal the true language comprehension capabilities of the models Ribeiro et al. (2021). Additionally, these methods fail to capture the associated shortcomings of model robustness. While traditional reference-based metrics provide a basic understanding of model performance, there is a need for more comprehensive and rigorous evaluation methods, such as metamorphic testing, to assess the consistency and robustness of personalized dialogue retrieval systems under challenging conditions. This can help identify potential weaknesses and limitations of existing methods and guide the development of more reliable and robust personalized dialogue systems.

3.   TASK Description

3.1.   Task

In the domain of personalized multi-turn dialogue systems, various types of datasets exist, such as personalized datasets Qian et al. (2021), personalized empathy datasets Zhong et al. (2020), and datasets incorporating personalized information with background knowledge Jang et al. (2022). For this paper, our primary focus is on the Persona-Chat dataset Zhang et al. (2018), while acknowledging the presence of other relevant datasets that will be explored and discussed in future studies.

The Persona-Chat dataset, denoted D, consists of n conversation tuples of the form (c, p, r, y). Specifically, c = {u_1, u_2, ..., u_{n_c}} represents the n_c context utterances, p = {p_1, p_2, ..., p_{n_p}} is the set of n_p persona sentences of the speaker, and r is a response candidate for c. The label y ∈ {0, 1} indicates whether r is an appropriate response for (c, p): y = 1 means it is, while y = 0 means it is not. Our objective is to learn a matching function g from D such that, for any tuple (c, p, r), g(c, p, r) calculates the degree of matching between (c, p) and r.
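As a concrete sketch of this data format, the Python below mirrors the tuple (c, p, r, y). The class name, field names, and the toy lexical-overlap scorer g are our own illustrative assumptions; a real system would replace g with a learned matching model.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DialogueExample:
    """One Persona-Chat tuple (c, p, r, y)."""
    context: List[str]   # c: the n_c context utterances
    persona: List[str]   # p: the n_p persona sentences of the speaker
    response: str        # r: one response candidate
    label: int           # y: 1 if r is appropriate for (c, p), else 0

example = DialogueExample(
    context=["Hey there. How are you?", "Good, do you like animals?"],
    persona=["I love animals and have two dogs."],
    response="Yeah, I like cats. I have one.",
    label=1,
)

# A matching function g scores how well (c, p) matches r. This stub uses
# simple word overlap purely for illustration; a trained model would
# produce a learned score instead.
def g(c: List[str], p: List[str], r: str) -> float:
    overlap = set(" ".join(c + p).lower().split()) & set(r.lower().split())
    return len(overlap) / max(len(r.split()), 1)

score = g(example.context, example.persona, example.response)
assert 0.0 <= score <= 1.0
```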

The Persona-Chat dataset Zhang et al. (2018) is currently the most extensive publicly available dataset of multi-turn dialogues conditioned on persona. It consists of 65,719 context-response pairs for the training set, 7,801 for the validation set, and 7,512 for the test set. The dataset includes correct responses from real humans, while incorrect responses are randomly sampled. To increase the challenge of the task, measures are taken to ensure that there is no overlap of contexts and roles between the training, validation, and test sets. This approach guarantees a robust evaluation of models’ generalization capabilities across diverse scenarios. Table 1 presents a selection of illustrative examples from this diverse dataset. Several studies Gu et al. (2019, 2020b, 2021a); Das et al. (2022); Wolf et al. (2019) have explored both non-pre-trained and pre-trained approaches on this dataset to enhance personalized dialogue retrieval and maintain consistent personality in the responses.

3.2.   Tested Models

In our study of multi-turn personalized dialogue retrieval, we investigate three paradigms and conduct evaluations of the task’s robustness performance. As shown in Figure 2(a), the non-pretraining approach involves training the personalized dialogue retrieval model using dataset D(c,p,r,y)D(c,p,r,y), such as DIM Gu et al. (2019) and FIRE Gu et al. (2020b).

Pretraining methods utilize a pretrained model trained on a large-scale corpus. Fine-tuning initializes the personalized dialogue retrieval model with pretrained model parameters and updates its weights using dataset DD and matching function gg to adapt it to the task, as shown in Figure 2(b). We will test CoBERT Zhong et al. (2020) and BERT-CRA Gu et al. (2021a).

Illustrated in Figure 2(c), prompt learning includes an appropriate prompt within dataset D, following BERT's Masked Language Modeling (MLM) and Next Sentence Prediction (NSP) formats. These prompts, such as cloze prompts Gao et al. (2021) and the prompts introduced by Sun et al. (2022), aid the pre-trained model in assimilating relevant information, improving personalized dialogue retrieval performance by leveraging acquired knowledge. We therefore predominantly utilize three models: prompt-MLM (BERT), which is based on BERT-CRA converted to the MLM format; prompt-MLM (DialogLM), which is based on the post-trained dialogue model DialogLM Zhong et al. (2022) converted to the MLM format; and prompt-NSP (BERT), which is converted to the NSP format.
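As a rough illustration of the cloze-style (MLM) format, the sketch below assembles persona, context, and a candidate response around a [MASK] verbalizer slot, which the model would fill with a word such as "yes" or "no". The template wording is entirely our own assumption and only loosely follows the cited prompt designs.

```python
from typing import List

def build_mlm_prompt(persona: List[str], context: List[str],
                     response: str, mask_token: str = "[MASK]") -> str:
    """Wrap a (persona, context, response) triple in a cloze template.
    The pre-trained MLM head is asked to fill the mask with a verbalizer
    word indicating whether the response matches the persona and context."""
    persona_part = " ".join(persona)
    context_part = " ".join(context)
    return (f"persona: {persona_part} context: {context_part} "
            f"response: {response} Does the response match? {mask_token}")

prompt = build_mlm_prompt(
    ["I love animals and have two dogs."],
    ["Good, do you like animals?"],
    "Yeah, I like cats. I have one.",
)
assert "[MASK]" in prompt
```

The NSP variant would instead pair (persona + context) and response as two segments and reuse BERT's next-sentence head as the matching score.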

4.   Metamorphic Relations with Consistent Outputs

The Persona-Chat (PC) dataset encompasses diverse persona sources, including self-persona, partner-persona, and scenarios involving both personas; it also covers cases with no persona, both personas, or only one of the two. Furthermore, the dataset incorporates revised persona descriptions, which introduce complexity and test the model's ability to produce accurate and contextually appropriate responses by rephrasing, generalizing, or specializing the original persona sentences. To comprehensively investigate the robustness of persona-centric multi-turn dialogue models, we define the following metamorphic relations (MRs), which explore consistent outputs from three distinct viewpoints.

Figure 2: Three Training Paradigms

MR1: Consistency with synonyms of persona Previous research Ribeiro et al. (2021); Chen et al. (2021); Malfa et al. (2020) has employed the technique of replacing keywords or adjectives in a sentence with their respective synonyms to assess the model’s comprehension of crucial aspects within the sentence. This approach aims to evaluate the model’s understanding rather than solely focusing on improving accuracy by selecting words that are strictly similar in literal terms. Considering the impact of persona sentences, it becomes crucial to capture overall meaning rather than individual words. To address this, we propose substituting synonymous sentences for the input persona, expecting that this modification will not affect the prediction results of the original model.

MT1: Synonymous Sentences Test In this test, we used MR1, where synonymous sentences replaced the original persona in the PC dataset. The aim was to assess the model’s ability to recognize synonymous sentences and adhere to persona-centric metamorphic relations, specifically invariance. We found that the revised version of the PC dataset is well-suited for our testing purposes. Therefore, we directly used this test set to evaluate perturbations on the original self-persona model.

Example 1 serves as an illustration of this metamorphic relation. Since the two personas have identical meanings, any deviation in the model's output for these inputs would be considered an error.

[Uncaptioned image]

MR2: Consistency under interference from the partner's persona In a dialogue scenario, the model receives persona information from both participants simultaneously, including self and partner. As the conversation progresses in a multi-turn dialogue, the topics discussed may be relevant to both participants Gu et al. (2021a). Hence, it is crucial to consider the influence of the partner's persona on subsequent conversation topics. Inferring the personas of both parties from the history of multi-turn dialogues, in order to select an appropriate response unaffected by external influences, is a critical aspect of assessing a model's robustness. To address this, we propose substituting the input persona with the unrelated partner-persona. The outcome following this interference should align with the model's prediction prior to it. Through such evaluations, we gain valuable insight into the model's ability to handle diverse persona inputs and assess its resilience in producing contextually appropriate responses.
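MR2 amounts to a simple transformation over dataset tuples, with the model's prediction expected to be unchanged. A minimal sketch, assuming dictionary-style examples with hypothetical field names:

```python
def apply_mr2(example: dict) -> dict:
    """Replace the speaker's own persona with the partner's persona.
    Under MR2, a robust model's selected response should not change."""
    transformed = dict(example)  # shallow copy; original is left intact
    transformed["persona"] = example["partner_persona"]
    return transformed

sample = {
    "persona": ["I am afraid of heights."],
    "partner_persona": ["I love watching movies and TV."],
    "context": ["Do you like TV?"],
}
perturbed = apply_mr2(sample)
assert perturbed["persona"] == ["I love watching movies and TV."]
assert sample["persona"] == ["I am afraid of heights."]  # untouched
```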

MT2: Persona Sensitive Test In the persona-sensitive test, we replaced the original persona (e.g., self-persona) with the conversation partner's persona using MR2. Similar to the synonymous sentences test, we evaluated the self-persona model using the partner-persona version of the original Persona-Chat dataset as our test set. Example 2 illustrates this transformation.

[Uncaptioned image]

MR3: Consistency under character-level noise Similar to previous works Ribeiro et al. (2021); Aspillaga et al. (2020), our objective is to assess the models' robustness in realistic scenarios by introducing data perturbations. We inject noise, such as spelling errors, into the input data (e.g., persona), expecting response consistency to be upheld. To examine model performance under challenging conditions, we introduce character-level noise, simulating extreme cases. A robust model should maintain consistent predictions even with added noise. This evaluation provides valuable insight into the model's ability to handle variations and disturbances in input data.

MT3: Noise Test In contrast to the previous two MTs, MT3 assesses the original versions of both the self-persona model and the partner-persona model, breaking away from the transformed versions provided in the PC dataset. To align with MR3, we redesigned the test set and implemented the Swap Noise setting described in Aspillaga et al. (2020), which swaps one randomly selected pair of consecutive characters in each word of the text (e.g., change → chnage). These noise experiments focus on the most recent turn of a multi-turn dialogue to gain a deeper understanding of the importance of persona and context in personalized dialogue retrieval. Examples 3-1, 3-2, and 3-3 illustrate the transformations for this metamorphic testing.
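The Swap Noise transformation can be sketched as follows, assuming one randomly chosen pair of adjacent characters is swapped per word and words shorter than two characters are left untouched:

```python
import random

def swap_noise(text: str, rng: random.Random) -> str:
    """Swap one randomly chosen pair of adjacent characters in each word,
    e.g. 'change' -> 'chnage'. Words of length < 2 are left as-is."""
    noisy_words = []
    for word in text.split():
        if len(word) < 2:
            noisy_words.append(word)
            continue
        i = rng.randrange(len(word) - 1)      # position of the swapped pair
        chars = list(word)
        chars[i], chars[i + 1] = chars[i + 1], chars[i]
        noisy_words.append("".join(chars))
    return " ".join(noisy_words)

rng = random.Random(0)  # fixed seed so the perturbation is reproducible
noisy = swap_noise("love the music", rng)
assert noisy != "love the music"
# Each word keeps exactly the same characters, just reordered.
assert sorted(noisy.replace(" ", "")) == sorted("lovethemusic")
```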

MT3-1: Persona-Only Test. We limit the character exchange to randomly selected consecutive characters in each persona word.

[Uncaptioned image]

MT3-2: Context-Only Test. We only substitute randomly chosen consecutive characters in each context word.

[Uncaptioned image]

MT3-3: Persona-and-Context Test. We perform a simultaneous substitution of random consecutive characters for each word of both the persona and context.

[Uncaptioned image]

This paper primarily focuses on metamorphic relations that ensure consistent outputs. Future work will explore additional relations, including those involving inversed outputs, such as conflicting persona information; Chen et al. (2021) discuss question-answering tasks under metamorphic relations with inversed outputs in detail.

5.   Experiments Setup

| Model | MT1 (Self) | MT2 (Self) | MT3-1 (Self) | MT3-2 (Self) | MT3-2 (Partner) | MT3-3 (Self) | MT3-3 (Partner) |
|---|---|---|---|---|---|---|---|
| DIM (EMNLP 2019) | 25.00% | 40.43% | 38.19% | 41.68% | 65.50% | 79.02% | 68.00% |
| FIRE (EMNLP 2020) | 22.26% | 38.75% | 34.66% | 41.33% | 61.67% | 72.51% | 64.20% |
| CoBERT (EMNLP 2020) | 24.56% | 40.87% | 37.33% | 43.37% | 67.15% | 74.45% | 71.92% |
| BERT_CRA (SIGIR 2021) | 20.42% | 38.30% | 34.53% | 34.80% | 56.58% | 70.71% | 58.36% |
| prompt_MLM (IJCNLP 2021) | 20.78% | 38.90% | 34.16% | 33.45% | 56.16% | 69.24% | 57.92% |
| prompt_NSP (COLING 2022) | 20.19% | 38.94% | 34.94% | 35.30% | 56.59% | 71.61% | 60.12% |
| prompt-MLM-DialogLM (AAAI 2022) | 33.57% | 48.42% | 41.01% | 33.15% | 54.14% | 68.57% | 58.27% |
Table 2: Violation rate (V_r) for all metamorphic tests. The models with the lowest violation rates are indicated by bold numbers, and models with the next-lowest rates by underlined numbers. "Self" refers to the original self-persona, while "Partner" refers to the original partner-persona.

5.1.   Evaluation Metrics

The assessment of software outcomes in Chen et al. (2021) is quantified as the "violation rate" (V_r). This metric gauges the extent to which the models under investigation produce outputs that do not satisfy a metamorphic relation. Specifically, we establish the metamorphic relation MR_i(t_i, r) based on the persona in the personalized dialogue retrieval task, where t_i denotes a transformation and r the expected output relation; as previously mentioned, r here signifies consistent output. Each sample s_j of the test set S = {s_1, s_2, ..., s_n} is transformed under this relation into an invariant example s_j^i, yielding S^i = {s_1^i, s_2^i, ..., s_n^i}. We then feed S and S^i into the personalized dialogue retrieval model P to obtain outputs O_j and O_j^i, respectively. Based on r, a sample counts as a violation if O_j ≠ O_j^i and not otherwise; the reported V_r is the fraction of samples that violate the relation.
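Given the paired outputs of the original and transformed test sets, the violation rate reduces to a disagreement count; a minimal sketch (model inference and data handling are abstracted away):

```python
from typing import Sequence

def violation_rate(original_outputs: Sequence, transformed_outputs: Sequence) -> float:
    """Fraction of test pairs whose outputs disagree under an
    output-consistency metamorphic relation."""
    assert len(original_outputs) == len(transformed_outputs)
    violations = sum(1 for o, o_t in zip(original_outputs, transformed_outputs)
                     if o != o_t)
    return violations / len(original_outputs)

# Toy example: indices of the candidate each model version selected
# before and after the perturbation.
before = [3, 0, 7, 2, 5]
after = [3, 1, 7, 2, 4]
assert violation_rate(before, after) == 0.4  # 2 of 5 pairs disagree
```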

For a standardized comparison of personalized dialogue retrieval models with previous studies, the recall of true positive responses (hits@1) and the mean reciprocal rank (MRR) are also used.

5.2.   Research Questions

Our study revolves around answering three research questions:

RQ1: To assess the overall validity of metamorphic testing constructed using persona-based metamorphic relations. As previously discussed, persona plays a vital role in maintaining the consistency of the personality of responses in a multi-turn dialogue model. We have developed persona-centric metamorphic relations to assess the model’s robustness based on persona. This experimental question seeks to determine the effectiveness of persona-centric metamorphic testing in identifying flaws in the model under test. It will also evaluate whether our proposed metamorphic relation can achieve the expected advantage.

RQ2: To compare the performance of multi-turn dialogue models trained with different paradigms. By conducting metamorphic testing on models that adhere to the three paradigms of non-pretraining, fine-tuning, and prompt learning, and by presenting the test results using the proposed metamorphic relations, we can effectively demonstrate the performance of these models across various facets of their capabilities.

RQ3: To investigate the relationship between metrics based on reference answers and the violation rate metrics of metamorphic testing. In subsection 5.1, we will present two metrics utilized in our study: the traditional metric hits@1 and the violation rate of the metamorphic testing. These metrics evaluate our test results from distinct perspectives. Specifically, we will compare the extent of the decrease in hits@1 and the percentage of violations reported by the metamorphic testing. We will also assess their error detection effects.

6.   Results and Analysis

6.1.   RQ1: Persona-based Metamorphic Relations

Metamorphic testing was conducted on seven multi-turn dialogue models using three persona-centered metamorphic relations, with test cases drawn from self-persona and partner-persona of the Persona-Chat dataset. Table 2 presents the violation rates observed during testing. The results indicate that all models produced erroneous outputs, as evidenced by violation rates exceeding 0. These findings demonstrate the effectiveness of metamorphic testing in uncovering flaws in the models’ abilities to handle persona-related dialogues, and underscore the importance of designing appropriate metamorphic relations for testing such models.

Figure 3: Results for the original self-persona scenario.

It is worth noting that we conducted experiments on the last turn of context to compare the impact of noise introduced by persona on personalized dialogue retrieval. Comparing the effects, we observe that the introduction of noise to either persona or context alone has a similar impact on the model, without causing a significant change. However, when noise is simultaneously applied to both persona and context, we observe a sharp increase in the violation rate, nearly twice as much as the effect of applying noise to one aspect alone. Additionally, referencing the significant drop in hits@1 and MRR demonstrated in Figure 3, this phenomenon suggests that even when the persona and context undergo change separately, the model still possesses the ability to identify a response that aligns with the semantics and personality of the other participant. Hence, it indicates that the model has the capacity to learn and incorporate persona information from the dialogue history. This observation underscores the fact that metamorphic testing can reveal gaps in the models’ semantic comprehension abilities.

To summarize, our results indicate that all personalized dialogue retrieval models experienced a decrease in performance when exposed to data perturbation, with the self-persona models showing particular vulnerability. Our proposed metamorphic testing approach, along with the application of specific metamorphic relations, therefore proves effective in identifying issues within these models.

6.2.   RQ2: Multi-turn Dialogue Modelling

The research question we investigate is independent of the dataset used. As Table 2 illustrates, the performance differences between models within the same scenario are comparable. Therefore, our evaluation primarily centers on the original self-persona scenario.

Figure 4 displays the violation rates of multi-turn dialogue models for the three training paradigms, where the seven models are classified into three types. As stated in RQ1, the violation rate for each category exceeds 0, indicating a robustness issue for each model type. Our results reveal that prompt-based models exhibit lower violation rates than those subjected to fine-tuning. This aligns with our expectation that prompt learning leverages more of the latent knowledge acquired during pre-training than fine-tuning does.

Notably, prompt-MLM-DialogLM exhibits the highest violation rate in MT1, surpassing DIM, which has the second-highest violation rate, by 8.57%. This phenomenon can be attributed to DialogLM's pre-training on multi-speaker conversations and long-dialogue understanding, which makes our simple prompt setup less suited to its pre-training task than to BERT's. Merely employing prompts without careful consideration not only fails to fully leverage the model's latent knowledge but can also hinder its performance, as evidenced by the self-original hits@1 and MRR scores of the prompt-MLM-DialogLM model in Figure 3.

The violation rates observed in the metamorphic testings based on persona constructions indicate that prompt-based models still possess an advantage in discerning between conversational styles of two speakers based on conversation history, even after introducing perturbations to the personas.

The preceding discussion demonstrates that all model categories experience performance degradation when exposed to data perturbation. The robustness of models that incorporate prompts is generally superior to those based on fine-tune, indicating the presence of untapped knowledge in pre-trained models. These findings highlight the need for further research to fully leverage the potential of pre-trained models.

Figure 4: The V_r results for the original self-persona scenario (for each paradigm, the average of the violation rates of its models).
Figure 5: Comparison of hits@1 and violation rate (V_r) for the prompt learning paradigm. The colored region is the part where the violation rate exceeds hits@1; specifically, the red area indicates the original self-persona scenario. The values in the figure are averages over the three prompt learning models.

6.3.   RQ3: Reference-Based VS MT

As Figure 5 depicts, the decline of each model under metamorphic testing is relatively limited when evaluated with the hits@1 metric on labeled answers. This suggests that the models exhibit a certain level of robustness in synonym recognition, resistance to irrelevant sentences, and performance under noise. However, this approach, which relies on standardized answers, is essentially a form of adversarial testing: it provides little insight into the internal deficiencies of the models or their actual understanding of the required knowledge, subject to the limitations described in Sec. 2.

On the other hand, by utilizing V_r to assess the test results, we were able to identify more errors in the models. Moreover, the use of V_r highlights the possibility of answerless scenarios in metamorphic testing. By establishing a metamorphic relation based on persona consistency and constructing the corresponding tests, we gain a comparative advantage over adversarial testing. In analyzing the preceding two research questions (RQs), we leveraged the traditional hits@1 metric to complement the evaluation of violation rates in personalized dialogue retrieval with labeled answers, which enabled a more thorough examination of anomalies in the MT results.
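The contrast between the two evaluation modes can be sketched as follows. hits@1 compares the top-ranked candidate against a gold label, while V_r only compares the model's outputs on a source/follow-up test pair, so it needs no labels. The function names and data layout are assumptions for illustration:

```python
# Illustrative contrast of reference-based hits@1 vs. the label-free
# violation rate V_r; names and data layout are assumptions.

def hits_at_1(ranked_ids, gold_ids):
    """Fraction of cases where the top-ranked candidate is the labeled answer."""
    return sum(r[0] == g for r, g in zip(ranked_ids, gold_ids)) / len(gold_ids)

def violation_rate(source_top1, followup_top1):
    """Fraction of metamorphic pairs whose outputs disagree (no labels needed)."""
    return sum(s != f for s, f in zip(source_top1, followup_top1)) / len(source_top1)

# hits@1 needs gold labels for every case:
acc = hits_at_1([[1, 2], [3, 4]], gold_ids=[1, 4])
# V_r only needs the model's own top-1 outputs before and after perturbation:
vr = violation_rate([1, 3], [1, 5])
```

A model can score well on hits@1 for the labeled cases while still changing its answer under a meaning-preserving perturbation, which is exactly the discrepancy V_r surfaces.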

6.4.   More Discussions

Our study introduces three specific metamorphic tests to examine persona-based output consistency in personalized dialogue retrieval. The results effectively demonstrate the utility of our work in identifying potential comprehension issues within models. We observe significant performance degradation across model types when exposed to intentional noise, emphasizing the critical role of our tests in revealing limitations. The intentionally introduced noise represents an extreme scenario, providing a lower-bound estimate of a model's performance under the natural noise of real-world settings. Our analysis also reveals the models' proficiency in grasping context and persona characteristics, successfully matching responses to the specific persona even with partial information loss.

Our evaluation of three training paradigms reveals that fine-tuning and prompt learning on top of pre-training exhibit greater stability than non-pretrained models, showing less susceptibility to data perturbations and performance degradation. This enhanced stability can be attributed to pre-trained models being trained on large-scale corpora, allowing them to better handle noise and generalize. Prompt learning effectively leverages the latent knowledge of the pre-trained model, but templates that do not match the pre-training format fail to fully unleash this capability and can even degrade performance. In summary, existing personalized dialogue retrieval models still have room for improvement.

Hybrid retrieval and generative modules are commonly used in dialogue systems, leading us to consider the robustness of generative models as well. We find that the persona-centric metamorphic relation for personalization-based dialogue systems is also applicable to generative models: as in Sec. 2, we expect a robust generative model to keep its generated responses essentially unchanged when persona sentences are replaced with ones of similar meaning.
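One way such a relation could be checked for a generative model is to compare the responses generated before and after a meaning-preserving persona perturbation. The token-overlap similarity and threshold below are illustrative stand-ins for a proper semantic-similarity metric, not part of this paper's method:

```python
# Hedged sketch: flag a metamorphic violation for a generative model when
# responses diverge too much after a meaning-preserving persona change.
# token_overlap (Jaccard over word sets) and the 0.5 threshold are
# illustrative assumptions, not the paper's metric.

def token_overlap(a, b):
    """Jaccard similarity over lowercase word sets."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def violates(resp_source, resp_followup, threshold=0.5):
    """A violation: the two generations disagree beyond the threshold."""
    return token_overlap(resp_source, resp_followup) < threshold
```

In practice a learned similarity model would likely replace the word-overlap heuristic, but the shape of the check is the same: the relation constrains output similarity rather than any gold response.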

7.   Related Work

7.1.   Reference-based Validation on Multi-turn Dialogue System

The concept of persona was introduced by researchers to ensure consistent characterization and a clear memory for generating rational responses Zhang et al. (2018); Zhong et al. (2020). Evaluation of personalized dialogue systems commonly follows a reference-based approach to ensure persona consistency: standardized labels, produced through human annotation, are used alongside traditional metrics like hits@1 to assess system performance Zhang et al. (2018); Gu et al. (2021a); Das et al. (2022); Wolf et al. (2019). Some researchers Li et al. (2019) have explored robustness in retrieval-based dialogue systems by generating adversarial examples in black-box settings, though these approaches still require labeled data.

Metamorphic testing complements traditional reference-based metrics by focusing on targeted assessment. An important benefit is that it requires only the existing annotated data, from which additional test cases can be generated automatically without further manual annotation.

7.2.   Metamorphic Testing In NLP

Metamorphic testing Chen et al. (2020) is an approach used to evaluate the internal consistency of NLP models by examining whether expected relationships between inputs and outputs are preserved Ribeiro et al. (2021). A significant focus of existing metamorphic relations in NLP is assessing model robustness: such relations evaluate the ability of a model to maintain consistent output Aspillaga et al. (2020); Belinkov and Bisk (2018); Li et al. (2017); Malfa et al. (2020). Robustness relations have been successfully applied in testing various NLP tasks, including sentiment analysis Ribeiro et al. (2021); Jia et al. (2019) and NLI Aspillaga et al. (2020); Malfa et al. (2020), among others. Metamorphic testing has also been extended to other aspects of model performance, including fairness Ma et al. (2020) and more Ribeiro et al. (2019); Manino et al. (2022). This highlights the versatility of metamorphic testing in assessing different dimensions of NLP models.

Metamorphic testing in NLP is being explored across various tasks with mixed success. Our work further investigates persona-based consistent outputs to assess model robustness with the help of MT.

8.   Conclusions

In this paper, we conduct an evaluation of three training paradigms using tests constructed from three persona-centric metamorphic relations, aimed at comprehensively uncovering the strengths and weaknesses of personalized dialogue models. Our findings indicate that all of the models experience performance degradation, highlighting that accuracy alone does not necessarily reflect their actual effectiveness, as they remain susceptible to data perturbations. Notably, the prompt learning paradigm exhibits relatively higher stability, but we also find that template design significantly influences a model's stability.

9.   Limitations

Although our proposed persona-centric metamorphic relations partially uncover potential issues with personalized conversation retrieval models, considerations remain for generation-based dialogue systems. Our designed metamorphic relations can be applied to dialogue modelling approaches beyond retrieval-based ones, but further extension and validation on relevant conversation datasets are needed. Moving forward, we aim to expand our evaluation by incorporating additional metamorphic relations, such as ones targeting fairness and bias. Furthermore, we plan to conduct assessments on specific dialogue datasets, including those designed to explore empathic dialogues.

10.   Bibliographical References


  • Aspillaga et al. (2020) Carlos Aspillaga, Andrés Carvallo, and Vladimir Araujo. 2020. Stress test evaluation of transformer-based models in natural language understanding tasks. In Proceedings of The 12th Language Resources and Evaluation Conference, LREC 2020, Marseille, France, May 11-16, 2020, pages 1882–1894. European Language Resources Association.
  • Belinkov and Bisk (2018) Yonatan Belinkov and Yonatan Bisk. 2018. Synthetic and natural noise both break neural machine translation. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net.
  • Chan et al. (2022) Alvin Chan, Lei Ma, Felix Juefei-Xu, Yew-Soon Ong, Xiaofei Xie, Minhui Xue, and Yang Liu. 2022. Breaking neural reasoning architectures with metamorphic relation-based adversarial examples. IEEE Trans. Neural Networks Learn. Syst., 33(11):6976–6982.
  • Chen et al. (2021) Songqiang Chen, Shuo Jin, and Xiaoyuan Xie. 2021. Validation on machine reading comprehension software without annotated labels: a property-based method. In ESEC/FSE ’21: 29th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Athens, Greece, August 23-28, 2021, pages 590–602. ACM.
  • Chen et al. (2020) Tsong Yueh Chen, S. C. Cheung, and Siu-Ming Yiu. 2020. Metamorphic testing: A new approach for generating next test cases. CoRR, abs/2002.12543.
  • Das et al. (2022) Souvik Das, Sougata Saha, and Rohini K. Srihari. 2022. Using multi-encoder fusion strategies to improve personalized response selection. In Proceedings of the 29th International Conference on Computational Linguistics, COLING 2022, Gyeongju, Republic of Korea, October 12-17, 2022, pages 532–541. International Committee on Computational Linguistics.
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics.
  • Feng et al. (2022) Xiachong Feng, Xiaocheng Feng, and Bing Qin. 2022. A survey on dialogue summarization: Recent advances and new frontiers. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022, Vienna, Austria, 23-29 July 2022, pages 5453–5460. ijcai.org.
  • Feng et al. (2021) Xiachong Feng, Xiaocheng Feng, Libo Qin, Bing Qin, and Ting Liu. 2021. Language model as an annotator: Exploring dialogpt for dialogue summarization. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 1479–1491. Association for Computational Linguistics.
  • Gao et al. (2021) Tianyu Gao, Adam Fisch, and Danqi Chen. 2021. Making pre-trained language models better few-shot learners. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 3816–3830. Association for Computational Linguistics.
  • Gardner et al. (2020) Matt Gardner, Yoav Artzi, Victoria Basmova, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hannaneh Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin, Jiangming Liu, Nelson F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A. Smith, Sanjay Subramanian, Reut Tsarfaty, Eric Wallace, Ally Zhang, and Ben Zhou. 2020. Evaluating models’ local decision boundaries via contrast sets. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020, volume EMNLP 2020 of Findings of ACL, pages 1307–1323. Association for Computational Linguistics.
  • Gu et al. (2020a) Jia-Chen Gu, Tianda Li, Quan Liu, Zhen-Hua Ling, Zhiming Su, Si Wei, and Xiaodan Zhu. 2020a. Speaker-aware BERT for multi-turn response selection in retrieval-based chatbots. In CIKM ’20: The 29th ACM International Conference on Information and Knowledge Management, Virtual Event, Ireland, October 19-23, 2020, pages 2041–2044. ACM.
  • Gu et al. (2020b) Jia-Chen Gu, Zhen-Hua Ling, Quan Liu, Zhigang Chen, and Xiaodan Zhu. 2020b. Filtering before iteratively referring for knowledge-grounded response selection in retrieval-based chatbots. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020, volume EMNLP 2020 of Findings of ACL, pages 1412–1422. Association for Computational Linguistics.
  • Gu et al. (2019) Jia-Chen Gu, Zhen-Hua Ling, Xiaodan Zhu, and Quan Liu. 2019. Dually interactive matching network for personalized response selection in retrieval-based chatbots. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 1845–1854. Association for Computational Linguistics.
  • Gu et al. (2021a) Jia-Chen Gu, Hui Liu, Zhen-Hua Ling, Quan Liu, Zhigang Chen, and Xiaodan Zhu. 2021a. Partner matters! an empirical study on fusing personas for personalized response selection in retrieval-based chatbots. In SIGIR ’21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, July 11-15, 2021, pages 565–574. ACM.
  • Gu et al. (2021b) Xiaodong Gu, Kang Min Yoo, and Sang-Woo Lee. 2021b. Response generation with context-aware prompt learning. CoRR, abs/2111.02643.
  • Heigold et al. (2018) Georg Heigold, Stalin Varanasi, Günter Neumann, and Josef van Genabith. 2018. How robust are character-based word embeddings in tagging and MT against wrod scramlbing or randdm nouse? In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas, AMTA 2018, Boston, MA, USA, March 17-21, 2018 - Volume 1: Research Papers, pages 68–80. Association for Machine Translation in the Americas.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Comput., 9(8):1735–1780.
  • Humeau et al. (2020) Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and Jason Weston. 2020. Poly-encoders: Architectures and pre-training strategies for fast and accurate multi-sentence scoring. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
  • Jang et al. (2022) Yoonna Jang, Jungwoo Lim, Yuna Hur, Dongsuk Oh, Suhyune Son, Yeonsoo Lee, Donghoon Shin, Seungryong Kim, and Heuiseok Lim. 2022. Call for customized conversation: Customized conversation grounding persona and knowledge. Proceedings of the AAAI Conference on Artificial Intelligence, pages 10803–10812.
  • Jia et al. (2019) Robin Jia, Aditi Raghunathan, Kerem Göksel, and Percy Liang. 2019. Certified robustness to adversarial word substitutions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 4127–4140. Association for Computational Linguistics.
  • Li et al. (2019) Jia Li, Chongyang Tao, Nanyun Peng, Wei Wu, Dongyan Zhao, and Rui Yan. 2019. Evaluating and enhancing the robustness of retrieval-based dialogue systems with adversarial examples. In Natural Language Processing and Chinese Computing - 8th CCF International Conference, NLPCC 2019, Dunhuang, China, October 9-14, 2019, Proceedings, Part I, volume 11838 of Lecture Notes in Computer Science, pages 142–154. Springer.
  • Li et al. (2016) Jiwei Li, Michel Galley, Chris Brockett, Georgios P. Spithourakis, Jianfeng Gao, and William B. Dolan. 2016. A persona-based neural conversation model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers. The Association for Computer Linguistics.
  • Li et al. (2017) Yitong Li, Trevor Cohn, and Timothy Baldwin. 2017. Robust training under linguistic adversity. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 2: Short Papers, pages 21–27. Association for Computational Linguistics.
  • Liao et al. (2023) Lizi Liao, Grace Hui Yang, and Chirag Shah. 2023. Proactive conversational agents. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, WSDM 2023, Singapore, 27 February 2023 - 3 March 2023, pages 1244–1247. ACM.
  • Lim et al. (2022) Jungwoo Lim, Myunghoon Kang, Yuna Hur, Seung Won Jeong, Jinsung Kim, Yoonna Jang, Dongyub Lee, Hyesung Ji, DongHoon Shin, Seungryong Kim, and Heuiseok Lim. 2022. You truly understand what I need : Intellectual and friendly dialog agents grounding persona and knowledge. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 1053–1066. Association for Computational Linguistics.
  • Liu et al. (2022a) Junfeng Liu, Christopher T. Symons, and Ranga Raju Vatsavai. 2022a. Persona-based conversational AI: state of the art and challenges. In IEEE International Conference on Data Mining Workshops, ICDM 2022 - Workshops, Orlando, FL, USA, November 28 - Dec. 1, 2022, pages 993–1001. IEEE.
  • Liu et al. (2023) Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Comput. Surv., 55(9):195:1–195:35.
  • Liu et al. (2020) Qian Liu, Yihong Chen, Bei Chen, Jian-Guang Lou, Zixuan Chen, Bin Zhou, and Dongmei Zhang. 2020. You impress me: Dialogue generation via mutual persona perception. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 1417–1427. Association for Computational Linguistics.
  • Liu et al. (2022b) Zihan Liu, Mostofa Patwary, Ryan Prenger, Shrimai Prabhumoye, Wei Ping, Mohammad Shoeybi, and Bryan Catanzaro. 2022b. Multi-stage prompting for knowledgeable dialogue generation. In Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 1317–1337. Association for Computational Linguistics.
  • Ma et al. (2020) Pingchuan Ma, Shuai Wang, and Jin Liu. 2020. Metamorphic testing and certified mitigation of fairness violations in NLP models. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020, pages 458–465. ijcai.org.
  • Malfa et al. (2020) Emanuele La Malfa, Min Wu, Luca Laurenti, Benjie Wang, Anthony Hartshorn, and Marta Kwiatkowska. 2020. Assessing robustness of text classification through maximal safe radius computation. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020, volume EMNLP 2020 of Findings of ACL, pages 2949–2968. Association for Computational Linguistics.
  • Manino et al. (2022) Edoardo Manino, Julia Rozanova, Danilo S. Carvalho, André Freitas, and Lucas C. Cordeiro. 2022. Systematicity, compositionality and transitivity of deep NLP models: a metamorphic testing perspective. In Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 2355–2366. Association for Computational Linguistics.
  • Mazaré et al. (2018) Pierre-Emmanuel Mazaré, Samuel Humeau, Martin Raison, and Antoine Bordes. 2018. Training millions of personalized dialogue agents. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 2775–2779. Association for Computational Linguistics.
  • Qian et al. (2021) Hongjin Qian, Xiaohe Li, Hanxun Zhong, Yu Guo, Yueyuan Ma, Yutao Zhu, Zhanliang Liu, Zhicheng Dou, and Ji-Rong Wen. 2021. Pchatbot: A large-scale dataset for personalized chatbot. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, page 2470–2477. Association for Computing Machinery.
  • Rashkin et al. (2021) Hannah Rashkin, David Reitter, Gaurav Singh Tomar, and Dipanjan Das. 2021. Increasing faithfulness in knowledge-grounded dialogue with controllable features. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 704–718. Association for Computational Linguistics.
  • Ribeiro et al. (2019) Marco Túlio Ribeiro, Carlos Guestrin, and Sameer Singh. 2019. Are red roses red? evaluating consistency of question-answering models. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pages 6174–6184. Association for Computational Linguistics.
  • Ribeiro et al. (2021) Marco Túlio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2021. Beyond accuracy: Behavioral testing of NLP models with checklist (extended abstract). In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI 2021, Virtual Event / Montreal, Canada, 19-27 August 2021, pages 4824–4828. ijcai.org.
  • Sun et al. (2022) Yi Sun, Yu Zheng, Chao Hao, and Hangping Qiu. 2022. NSP-BERT: A prompt-based few-shot learner through an original pre-training task - - next sentence prediction. In Proceedings of the 29th International Conference on Computational Linguistics, COLING 2022, Gyeongju, Republic of Korea, October 12-17, 2022, pages 3233–3250. International Committee on Computational Linguistics.
  • Wolf et al. (2019) Thomas Wolf, Victor Sanh, Julien Chaumond, and Clement Delangue. 2019. Transfertransfo: A transfer learning approach for neural network based conversational agents. CoRR, abs/1901.08149.
  • Wu et al. (2022a) Sixing Wu, Minghui Wang, Ying Li, Dawei Zhang, and Zhonghai Wu. 2022a. Improving the applicability of knowledge-enhanced dialogue generation systems by using heterogeneous knowledge from multiple sources. In WSDM ’22: The Fifteenth ACM International Conference on Web Search and Data Mining, Virtual Event / Tempe, AZ, USA, February 21 - 25, 2022, pages 1149–1157. ACM.
  • Zhang et al. (2021) Hang Zhang, Yeyun Gong, Yelong Shen, Weisheng Li, Jiancheng Lv, Nan Duan, and Weizhu Chen. 2021. Poolingformer: Long document modeling with pooling attention. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 12437–12446. PMLR.
  • Zhang et al. (2018) Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 2204–2213. Association for Computational Linguistics.
  • Zhao et al. (2019a) Xueliang Zhao, Chongyang Tao, Wei Wu, Can Xu, Dongyan Zhao, and Rui Yan. 2019a. A document-grounded matching network for response selection in retrieval-based chatbots. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019, pages 5443–5449. ijcai.org.
  • Zhong et al. (2022) Ming Zhong, Yang Liu, Yichong Xu, Chenguang Zhu, and Michael Zeng. 2022. Dialoglm: Pre-trained model for long dialogue understanding and summarization. In Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence, IAAI 2022, The Twelveth Symposium on Educational Advances in Artificial Intelligence, EAAI 2022 Virtual Event, February 22 - March 1, 2022, pages 11765–11773. AAAI Press.
  • Zhong et al. (2020) Peixiang Zhong, Chen Zhang, Hao Wang, Yong Liu, and Chunyan Miao. 2020. Towards persona-based empathetic conversational models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 6556–6566. Association for Computational Linguistics.
  • Zhou and Sun (2018) Zhi Quan Zhou and Liqun Sun. 2018. Metamorphic testing for machine translations: MT4MT. In 25th Australasian Software Engineering Conference, ASWEC 2018, Adelaide, Australia, November 26-30, 2018, pages 96–100. IEEE Computer Society.
\c@NAT@ctr

  • Aspillaga et al. [2020] Carlos Aspillaga, Andrés Carvallo, and Vladimir Araujo. 2020. Stress test evaluation of transformer-based models in natural language understanding tasks. In Proceedings of The 12th Language Resources and Evaluation Conference, LREC 2020, Marseille, France, May 11-16, 2020, pages 1882–1894. European Language Resources Association.
  • Belinkov and Bisk [2018] Yonatan Belinkov and Yonatan Bisk. 2018. Synthetic and natural noise both break neural machine translation. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings. OpenReview.net.
  • Chan et al. [2022] Alvin Chan, Lei Ma, Felix Juefei-Xu, Yew-Soon Ong, Xiaofei Xie, Minhui Xue, and Yang Liu. 2022. Breaking neural reasoning architectures with metamorphic relation-based adversarial examples. IEEE Trans. Neural Networks Learn. Syst., 33(11):6976–6982.
  • Chen et al. [2021] Songqiang Chen, Shuo Jin, and Xiaoyuan Xie. 2021. Validation on machine reading comprehension software without annotated labels: a property-based method. In ESEC/FSE ’21: 29th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, Athens, Greece, August 23-28, 2021, pages 590–602. ACM.
  • Chen et al. [2020] Tsong Yueh Chen, S. C. Cheung, and Siu-Ming Yiu. 2020. Metamorphic testing: A new approach for generating next test cases. CoRR, abs/2002.12543.
  • Das et al. [2022] Souvik Das, Sougata Saha, and Rohini K. Srihari. 2022. Using multi-encoder fusion strategies to improve personalized response selection. In Proceedings of the 29th International Conference on Computational Linguistics, COLING 2022, Gyeongju, Republic of Korea, October 12-17, 2022, pages 532–541. International Committee on Computational Linguistics.
  • Devlin et al. [2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pages 4171–4186. Association for Computational Linguistics.
  • Feng et al. [2022] Xiachong Feng, Xiaocheng Feng, and Bing Qin. 2022. A survey on dialogue summarization: Recent advances and new frontiers. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022, Vienna, Austria, 23-29 July 2022, pages 5453–5460. ijcai.org.
  • Feng et al. [2021] Xiachong Feng, Xiaocheng Feng, Libo Qin, Bing Qin, and Ting Liu. 2021. Language model as an annotator: Exploring dialogpt for dialogue summarization. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 1479–1491. Association for Computational Linguistics.
  • Gao et al. [2021] Tianyu Gao, Adam Fisch, and Danqi Chen. 2021. Making pre-trained language models better few-shot learners. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 3816–3830. Association for Computational Linguistics.
  • Gardner et al. [2020] Matt Gardner, Yoav Artzi, Victoria Basmova, Jonathan Berant, Ben Bogin, Sihao Chen, Pradeep Dasigi, Dheeru Dua, Yanai Elazar, Ananth Gottumukkala, Nitish Gupta, Hannaneh Hajishirzi, Gabriel Ilharco, Daniel Khashabi, Kevin Lin, Jiangming Liu, Nelson F. Liu, Phoebe Mulcaire, Qiang Ning, Sameer Singh, Noah A. Smith, Sanjay Subramanian, Reut Tsarfaty, Eric Wallace, Ally Zhang, and Ben Zhou. 2020. Evaluating models’ local decision boundaries via contrast sets. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020, volume EMNLP 2020 of Findings of ACL, pages 1307–1323. Association for Computational Linguistics.
  • Gu et al. [2020a] Jia-Chen Gu, Tianda Li, Quan Liu, Zhen-Hua Ling, Zhiming Su, Si Wei, and Xiaodan Zhu. 2020a. Speaker-aware BERT for multi-turn response selection in retrieval-based chatbots. In CIKM ’20: The 29th ACM International Conference on Information and Knowledge Management, Virtual Event, Ireland, October 19-23, 2020, pages 2041–2044. ACM.
  • Gu et al. [2020b] Jia-Chen Gu, Zhen-Hua Ling, Quan Liu, Zhigang Chen, and Xiaodan Zhu. 2020b. Filtering before iteratively referring for knowledge-grounded response selection in retrieval-based chatbots. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020, volume EMNLP 2020 of Findings of ACL, pages 1412–1422. Association for Computational Linguistics.
  • Gu et al. [2019] Jia-Chen Gu, Zhen-Hua Ling, Xiaodan Zhu, and Quan Liu. 2019. Dually interactive matching network for personalized response selection in retrieval-based chatbots. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 1845–1854. Association for Computational Linguistics.
  • Gu et al. [2021a] Jia-Chen Gu, Hui Liu, Zhen-Hua Ling, Quan Liu, Zhigang Chen, and Xiaodan Zhu. 2021a. Partner matters! an empirical study on fusing personas for personalized response selection in retrieval-based chatbots. In SIGIR ’21: The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, Virtual Event, Canada, July 11-15, 2021, pages 565–574. ACM.
  • Gu et al. [2021b] Xiaodong Gu, Kang Min Yoo, and Sang-Woo Lee. 2021b. Response generation with context-aware prompt learning. CoRR, abs/2111.02643.
  • Heigold et al. [2018] Georg Heigold, Stalin Varanasi, Günter Neumann, and Josef van Genabith. 2018. How robust are character-based word embeddings in tagging and MT against wrod scramlbing or randdm nouse? In Proceedings of the 13th Conference of the Association for Machine Translation in the Americas, AMTA 2018, Boston, MA, USA, March 17-21, 2018 - Volume 1: Research Papers, pages 68–80. Association for Machine Translation in the Americas.
  • Hochreiter and Schmidhuber [1997] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Comput., 9(8):1735–1780.
  • Humeau et al. [2020] Samuel Humeau, Kurt Shuster, Marie-Anne Lachaux, and Jason Weston. 2020. Poly-encoders: Architectures and pre-training strategies for fast and accurate multi-sentence scoring. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net.
  • Jang et al. [2022] Yoonna Jang, Jungwoo Lim, Yuna Hur, Dongsuk Oh, Suhyune Son, Yeonsoo Lee, Donghoon Shin, Seungryong Kim, and Heuiseok Lim. 2022. Call for customized conversation: Customized conversation grounding persona and knowledge. Proceedings of the AAAI Conference on Artificial Intelligence, pages 10803–10812.
  • Jia et al. [2019] Robin Jia, Aditi Raghunathan, Kerem Göksel, and Percy Liang. 2019. Certified robustness to adversarial word substitutions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pages 4127–4140. Association for Computational Linguistics.
  • Li et al. [2019] Jia Li, Chongyang Tao, Nanyun Peng, Wei Wu, Dongyan Zhao, and Rui Yan. 2019. Evaluating and enhancing the robustness of retrieval-based dialogue systems with adversarial examples. In Natural Language Processing and Chinese Computing - 8th CCF International Conference, NLPCC 2019, Dunhuang, China, October 9-14, 2019, Proceedings, Part I, volume 11838 of Lecture Notes in Computer Science, pages 142–154. Springer.
  • Li et al. [2016] Jiwei Li, Michel Galley, Chris Brockett, Georgios P. Spithourakis, Jianfeng Gao, and William B. Dolan. 2016. A persona-based neural conversation model. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers. Association for Computational Linguistics.
  • Li et al. [2017] Yitong Li, Trevor Cohn, and Timothy Baldwin. 2017. Robust training under linguistic adversity. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2017, Valencia, Spain, April 3-7, 2017, Volume 2: Short Papers, pages 21–27. Association for Computational Linguistics.
  • Liao et al. [2023] Lizi Liao, Grace Hui Yang, and Chirag Shah. 2023. Proactive conversational agents. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, WSDM 2023, Singapore, 27 February 2023 - 3 March 2023, pages 1244–1247. ACM.
  • Lim et al. [2022] Jungwoo Lim, Myunghoon Kang, Yuna Hur, Seung Won Jeong, Jinsung Kim, Yoonna Jang, Dongyub Lee, Hyesung Ji, DongHoon Shin, Seungryong Kim, and Heuiseok Lim. 2022. You truly understand what I need: Intellectual and friendly dialog agents grounding persona and knowledge. In Findings of the Association for Computational Linguistics: EMNLP 2022, pages 1053–1066. Association for Computational Linguistics.
  • Liu et al. [2022a] Junfeng Liu, Christopher T. Symons, and Ranga Raju Vatsavai. 2022a. Persona-based conversational AI: state of the art and challenges. In IEEE International Conference on Data Mining Workshops, ICDM 2022 - Workshops, Orlando, FL, USA, November 28 - Dec. 1, 2022, pages 993–1001. IEEE.
  • Liu et al. [2023] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. 2023. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Comput. Surv., 55(9):195:1–195:35.
  • Liu et al. [2020] Qian Liu, Yihong Chen, Bei Chen, Jian-Guang Lou, Zixuan Chen, Bin Zhou, and Dongmei Zhang. 2020. You impress me: Dialogue generation via mutual persona perception. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020, pages 1417–1427. Association for Computational Linguistics.
  • Liu et al. [2022b] Zihan Liu, Mostofa Patwary, Ryan Prenger, Shrimai Prabhumoye, Wei Ping, Mohammad Shoeybi, and Bryan Catanzaro. 2022b. Multi-stage prompting for knowledgeable dialogue generation. In Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 1317–1337. Association for Computational Linguistics.
  • Ma et al. [2020] Pingchuan Ma, Shuai Wang, and Jin Liu. 2020. Metamorphic testing and certified mitigation of fairness violations in NLP models. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI 2020, pages 458–465. ijcai.org.
  • Malfa et al. [2020] Emanuele La Malfa, Min Wu, Luca Laurenti, Benjie Wang, Anthony Hartshorn, and Marta Kwiatkowska. 2020. Assessing robustness of text classification through maximal safe radius computation. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020, volume EMNLP 2020 of Findings of ACL, pages 2949–2968. Association for Computational Linguistics.
  • Manino et al. [2022] Edoardo Manino, Julia Rozanova, Danilo S. Carvalho, André Freitas, and Lucas C. Cordeiro. 2022. Systematicity, compositionality and transitivity of deep NLP models: a metamorphic testing perspective. In Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 2355–2366. Association for Computational Linguistics.
  • Mazaré et al. [2018] Pierre-Emmanuel Mazaré, Samuel Humeau, Martin Raison, and Antoine Bordes. 2018. Training millions of personalized dialogue agents. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 - November 4, 2018, pages 2775–2779. Association for Computational Linguistics.
  • Qian et al. [2021] Hongjin Qian, Xiaohe Li, Hanxun Zhong, Yu Guo, Yueyuan Ma, Yutao Zhu, Zhanliang Liu, Zhicheng Dou, and Ji-Rong Wen. 2021. Pchatbot: A large-scale dataset for personalized chatbot. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 2470–2477. Association for Computing Machinery.
  • Rashkin et al. [2021] Hannah Rashkin, David Reitter, Gaurav Singh Tomar, and Dipanjan Das. 2021. Increasing faithfulness in knowledge-grounded dialogue with controllable features. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, ACL/IJCNLP 2021, (Volume 1: Long Papers), Virtual Event, August 1-6, 2021, pages 704–718. Association for Computational Linguistics.
  • Ribeiro et al. [2019] Marco Túlio Ribeiro, Carlos Guestrin, and Sameer Singh. 2019. Are red roses red? evaluating consistency of question-answering models. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pages 6174–6184. Association for Computational Linguistics.
  • Ribeiro et al. [2021] Marco Túlio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2021. Beyond accuracy: Behavioral testing of NLP models with checklist (extended abstract). In Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI 2021, Virtual Event / Montreal, Canada, 19-27 August 2021, pages 4824–4828. ijcai.org.
  • Sun et al. [2022] Yi Sun, Yu Zheng, Chao Hao, and Hangping Qiu. 2022. NSP-BERT: A prompt-based few-shot learner through an original pre-training task: next sentence prediction. In Proceedings of the 29th International Conference on Computational Linguistics, COLING 2022, Gyeongju, Republic of Korea, October 12-17, 2022, pages 3233–3250. International Committee on Computational Linguistics.
  • Wolf et al. [2019] Thomas Wolf, Victor Sanh, Julien Chaumond, and Clement Delangue. 2019. Transfertransfo: A transfer learning approach for neural network based conversational agents. CoRR, abs/1901.08149.
  • Wu et al. [2022a] Sixing Wu, Minghui Wang, Ying Li, Dawei Zhang, and Zhonghai Wu. 2022a. Improving the applicability of knowledge-enhanced dialogue generation systems by using heterogeneous knowledge from multiple sources. In WSDM ’22: The Fifteenth ACM International Conference on Web Search and Data Mining, Virtual Event / Tempe, AZ, USA, February 21 - 25, 2022, pages 1149–1157. ACM.
  • Zhang et al. [2021] Hang Zhang, Yeyun Gong, Yelong Shen, Weisheng Li, Jiancheng Lv, Nan Duan, and Weizhu Chen. 2021. Poolingformer: Long document modeling with pooling attention. In Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pages 12437–12446. PMLR.
  • Zhang et al. [2018] Saizheng Zhang, Emily Dinan, Jack Urbanek, Arthur Szlam, Douwe Kiela, and Jason Weston. 2018. Personalizing dialogue agents: I have a dog, do you have pets too? In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pages 2204–2213. Association for Computational Linguistics.
  • Zhao et al. [2019a] Xueliang Zhao, Chongyang Tao, Wei Wu, Can Xu, Dongyan Zhao, and Rui Yan. 2019a. A document-grounded matching network for response selection in retrieval-based chatbots. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI 2019, Macao, China, August 10-16, 2019, pages 5443–5449. ijcai.org.
  • Zhong et al. [2022] Ming Zhong, Yang Liu, Yichong Xu, Chenguang Zhu, and Michael Zeng. 2022. Dialoglm: Pre-trained model for long dialogue understanding and summarization. In Thirty-Sixth AAAI Conference on Artificial Intelligence, AAAI 2022, Thirty-Fourth Conference on Innovative Applications of Artificial Intelligence, IAAI 2022, The Twelfth Symposium on Educational Advances in Artificial Intelligence, EAAI 2022, Virtual Event, February 22 - March 1, 2022, pages 11765–11773. AAAI Press.
  • Zhong et al. [2020] Peixiang Zhong, Chen Zhang, Hao Wang, Yong Liu, and Chunyan Miao. 2020. Towards persona-based empathetic conversational models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020, pages 6556–6566. Association for Computational Linguistics.
  • Zhou and Sun [2018] Zhi Quan Zhou and Liqun Sun. 2018. Metamorphic testing for machine translations: MT4MT. In 25th Australasian Software Engineering Conference, ASWEC 2018, Adelaide, Australia, November 26-30, 2018, pages 96–100. IEEE Computer Society.